Benchmark-hacking is the new p-hacking

December 12, 2025

Debasish Ray Chawdhuri

P-hacking has been a persistent problem in scientific studies for decades, especially as the pressure to publish in academia continues to rise. Now, we’re seeing a very similar issue with AI benchmarking. In this article, I will explore what p-hacking is, why it happens despite scientists’ good intentions, and how it relates to the more modern problem of benchmark-hacking with LLMs.

What is a p-value?

In short, it is an extension of the proof-by-contradiction argument to a probabilistic setting. In a classical proof by contradiction, we say a statement is true because assuming the opposite leads to a logical contradiction, that is, it implies that an impossible event has occurred. In the probabilistic version, we say a statement is most likely true because assuming the opposite would force us to accept that a highly improbable event has occurred: the observed result would be extremely unlikely under the negation of what we are trying to prove.

Let us say that Tom and Mary are playing a game of coin toss. They toss an unbiased coin, and the loser pays the winner five dollars. I agree that it is a very dumb and boring game, but bear with me. Mary notices that after playing three tosses, Tom has won twice. She suspects that the coin is biased in favor of Tom. Tom, of course, vehemently denies it and claims that he is just lucky.

Does Mary have a case? Well, if you think about it, that is the most balanced the results could have been — after three tosses, one of the players had to get at least two wins. Tom points this out, and they continue playing.

Now, after playing 30 times, Mary notices that Tom has won 20 times. Tom again denies any wrongdoing. The question is: does Mary have enough evidence to claim that Tom is cheating? Well, it depends. After all, it is not impossible for Tom to win every game even with a completely unbiased coin, so no outcome can strictly prove cheating. We would like to stick to the principle of “innocent until proven guilty” for any claim of cheating, but if proof meant ruling out luck entirely, cheating would never be caught. So, we use a probabilistic standard of evidence:

“Innocent until the probability of such a lopsided result from a fair game is at most p.”

Basically, we consider the result too lopsided in favor of Tom if the probability of getting such an extreme result, or an even more extreme one, from a fair game is so small that we would not expect to see it in practice. This probability, computed under the assumption that the game is fair, is called the p-value, since statisticians are the most creative people when it comes to naming concepts. The cutoff it is compared against, the p in the rule above, is the threshold (statisticians call it the significance level).

Of course, if the threshold is too large, we run the risk of convicting an innocent Tom of cheating; and if it is too small, Mary does not have much of a shot at ever convicting Tom. The correct choice of threshold depends on the context of the claim. Most people in data analysis, psychology, or sociology use a threshold of 0.05, that is, they require a p-value of at most 0.05. Such a large threshold can have its own problems, but I will discuss that later.

For now, let’s come back to Mary’s accusation against Tom. With the definition settled, all Mary has to show is that Tom’s wins are too lopsided. What is the probability of Tom getting 20 or more wins out of 30 tosses of a fair coin? The probability of Tom winning exactly r times is \binom{30}{r}/2^{30}. So, the probability of getting the same or a more extreme result is:

\sum_{r=20}^{30} \binom{30}{r}/2^{30} = 53009102/2^{30} \approx 0.0494

This probability is about 0.049, which is less than 0.05, so Mary does have a case: by the agreed standard, she has shown that Tom is cheating. Tom, of course, still denies it and claims that 0.05 is too high a threshold.
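The tail probability in Mary’s argument is easy to check numerically. The following is a minimal sketch using only the Python standard library; the function name `tail_probability` is made up for this illustration, and it assumes a fair coin and a one-sided test (only results that favor Tom count as "more extreme").

```python
from math import comb

def tail_probability(tosses: int, wins: int) -> float:
    """Probability of at least `wins` wins in `tosses` tosses of a fair coin."""
    # Count the outcomes with r wins for every r from `wins` up to `tosses`.
    favourable = sum(comb(tosses, r) for r in range(wins, tosses + 1))
    # Every sequence of tosses is equally likely, and there are 2**tosses of them.
    return favourable / 2 ** tosses

p_value = tail_probability(30, 20)
print(f"P(Tom wins >= 20 of 30) = {p_value:.4f}")   # prints 0.0494
print("Within the 0.05 threshold:", p_value <= 0.05)

# The three-toss case from earlier: two or more wins is a coin flip in itself.
print(f"P(Tom wins >= 2 of 3)   = {tail_probability(3, 2):.2f}")  # prints 0.50
```

The three-toss figure shows why Tom’s early lead meant nothing, while the 30-toss figure only barely clears the 0.05 threshold.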

Let us now look at Mary’s argument a little more closely. Mary’s claim is that Tom is cheating. This is what statisticians would call the alternative hypothesis. The default belief is that Tom is innocent; this is what statisticians call the null hypothesis.

The argument is similar to proof-by-contradiction. In non-probabilistic logic, to prove that Tom is cheating, Mary must show that if we assume Tom to be innocent, the observed results would be impossible. In probabilistic logic, Mary instead only has to show that if we assume Tom to be innocent, the observed events would be extremely improbable (as opposed to impossible).

What is p-hacking?

Probabilities are delicate by nature — a slight change in conditions can change the probabilities a lot. To give you an idea, let’s assume that Tom and Mary play this tossing game every day and the coin is tossed exactly 30 times. Now, if Tom is honest and Mary is trying to find a reason to claim that Tom is cheating, how likely do you think it is that there would be a day when Tom gets at least 20 wins? If she keeps trying, this is bound to happen eventually. Then she can write about this incident.
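To put a rough number on "bound to happen eventually", here is a small sketch of the chance that an honest Tom has at least one 20-win day within a given number of days. It assumes each day’s 30 tosses are independent of every other day’s; `chance_of_lopsided_day` is a name invented for this example.

```python
from math import comb

# Chance that an honest Tom wins at least 20 of 30 fair tosses on a single day.
P_DAY = sum(comb(30, r) for r in range(20, 31)) / 2 ** 30

def chance_of_lopsided_day(days: int, p_day: float = P_DAY) -> float:
    """Probability that at least one of `days` independent days looks like cheating."""
    # The complement is that every single day stays under 20 wins.
    return 1.0 - (1.0 - p_day) ** days

for days in (1, 7, 14, 30, 90):
    print(f"{days:3d} days -> {chance_of_lopsided_day(days):.3f}")
```

Under these assumptions the chance passes one half after about two weeks of daily play, even though Tom never cheats.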

It is not uncommon for data analysts to inadvertently use a similar technique to find patterns that look statistically significant when, in fact, there is no real pattern at all. If an analyst tries 20 arbitrary, independent models on a dataset that does not have any pattern, there is still a probability of 1-0.95^{20} \approx 0.64 that at least one of them shows a pattern with a p-value below the 0.05 threshold. To avoid this, there must be a test dataset that has been held out and is used exactly once to check the p-value of the final model. But this is almost never done, because a held-out set is spent after a single use, and few analysts can afford to set aside fresh data for every model they try.
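The effect can be simulated directly. The sketch below is an illustration under stated assumptions, not a model of any real analysis: the "dataset" is 30 random binary labels, each "model" is just a fixed list of 30 random guesses, and the dataset size, number of analysts, and seed are all arbitrary choices. It counts how often the best of 20 such models looks significant, and how often that same model still looks significant on held-out labels that are used exactly once.

```python
import random
from math import comb

N_LABELS = 30   # size of each dataset (arbitrary choice for this sketch)
N_MODELS = 20   # arbitrary models tried by each analyst
ALPHA = 0.05    # significance threshold

def p_value(correct: int, n: int = N_LABELS) -> float:
    """One-sided p-value: chance of at least `correct` hits by pure guessing."""
    return sum(comb(n, r) for r in range(correct, n + 1)) / 2 ** n

def score(predictions, labels) -> int:
    """Number of positions where the model's guess matches the label."""
    return sum(p == y for p, y in zip(predictions, labels))

def random_bits(rng: random.Random, n: int = N_LABELS):
    return [rng.randrange(2) for _ in range(n)]

def simulate(analysts: int = 2000, seed: int = 0):
    rng = random.Random(seed)
    found = confirmed = 0
    for _ in range(analysts):
        labels = random_bits(rng)                        # data with no pattern
        models = [random_bits(rng) for _ in range(N_MODELS)]
        best = max(models, key=lambda m: score(m, labels))
        if p_value(score(best, labels)) <= ALPHA:
            found += 1
            holdout = random_bits(rng)                   # fresh data, used once
            if p_value(score(best, holdout)) <= ALPHA:
                confirmed += 1
    return found / analysts, confirmed / max(found, 1)

found_rate, confirm_rate = simulate()
print(f"analysts who find a 'significant' model: {found_rate:.2f}")
print(f"of those, confirmed on held-out data:    {confirm_rate:.2f}")
```

The first rate should land close to the 0.64 computed above (slightly lower, because a 30-label test cannot hit 0.05 exactly), while the held-out check should pass only about as often as the threshold itself allows.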

It is actually very difficult to avoid p-hacking even with the best intentions, because publication culture is biased toward successful results. You get published only if you show a statistically significant finding. That, of course, means that if you keep running different experiments, you will publish only when your calculated p-value is small enough. But that calculation assumes that the experiment stands alone: it does not account for the other experiments that were tried and discarded along the way.

However, the researcher is trying multiple theories and publishing whenever one of the calculations yields a small-enough p-value. Even when there is no real effect, a p-value below the threshold has a nonzero probability, so with enough attempts such a low-probability event will almost surely occur, and then get published as a significant result. This problem is particularly severe in fields that accept a relatively high threshold for the p-value, like 5%.

The result is that scientific literature becomes riddled with findings that appear true but are actually false positives. Of course, science eventually tries to correct this through systematic studies or meta-analyses that consider effects replicated by many researchers.

On the origin of LLMs by means of benchmark selection

Charles Darwin shook the world with his book On the Origin of Species by Means of Natural Selection, where he argued that even under random variation, a consistent selection pressure shapes organisms to fit their environment, because the ones that are less fit do not survive. LLMs get published through a similar process, except that instead of natural selection, the selection pressure is the score on public benchmarks.

This might have been fine if the benchmarks were actually random samples drawn from the real-world problems to be solved. The problem is that we do not know what kind of benchmark can truly reflect real-world cases, and, moreover, benchmarks are static rather than continually sampled from real-world tasks. Let me explain in detail.

Benchmarks are surrogates

Benchmarks hardly ever test the real-world use cases of LLMs. They are mostly arbitrary tests that humans assume only intelligent humans would be able to pass, and that are therefore treated as tests of intelligence for LLMs. But we have seen in many cases that LLMs can pass these extremely difficult tests while failing at basic tasks or reasoning that seem trivial to humans. The reason this happens is that the tests are designed around human failure modes, which we understand well, both because we are constantly exposed to other humans and because we can introspect about ourselves. We have no comparable understanding of how LLMs fail, so a test that separates capable humans from less capable ones need not do the same for LLMs.

Intelligence is not a one-dimensional trait; it is multi-faceted, and we often don’t think about testing abilities that humans perform effortlessly. However, when we select which LLMs to publish or pursue, we rely on these benchmarks to guide that selection. This means we are selecting for the wrong things, and it may result in a degradation of real-world performance.

Benchmarks are static

The other, and arguably more insidious, problem is that benchmarks are by nature static. Even if a benchmark were sampled from the distribution of real-world tasks, no finite benchmark can on its own reflect all real-world scenarios. This results in LLMs being optimized, through selection pressure, for certain things while being compromised on the aspects that the benchmarks fail to capture.
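A toy simulation shows how selection on a fixed, finite benchmark can inflate scores even when nothing real has improved. Everything in it is an assumption made for illustration: 50 model variants that all share the same true success rate of 0.70, a benchmark of 200 tasks, and independent task outcomes. It captures only the luck component of selection, not training-data contamination or deliberate tuning.

```python
import random

TRUE_SKILL = 0.70      # every variant has the same real success rate (assumption)
BENCHMARK_SIZE = 200   # tasks in the fixed public benchmark (assumption)
N_VARIANTS = 50        # candidate variants compared on that benchmark (assumption)

def accuracy(rng: random.Random, n_tasks: int) -> float:
    """Fraction of `n_tasks` solved by a model whose success rate is TRUE_SKILL."""
    return sum(rng.random() < TRUE_SKILL for _ in range(n_tasks)) / n_tasks

def select_and_recheck(rounds: int = 100, seed: int = 0):
    rng = random.Random(seed)
    published = fresh = 0.0
    for _ in range(rounds):
        # Selection pressure: only the best benchmark score gets reported.
        published += max(accuracy(rng, BENCHMARK_SIZE) for _ in range(N_VARIANTS))
        # The winner is no more skilled than the rest, so on newly sampled
        # tasks it performs like any other variant.
        fresh += accuracy(rng, BENCHMARK_SIZE)
    return published / rounds, fresh / rounds

benchmark_score, fresh_score = select_and_recheck()
print(f"average reported benchmark score: {benchmark_score:.3f}")
print(f"average score on fresh tasks:     {fresh_score:.3f}")
```

In this setup the reported score sits several points above the true skill purely because the maximum of many noisy measurements was kept, which is the same mechanism as trying 20 models on one dataset.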

Conclusion

These two reasons explain why AI coding software has not improved significantly, even though LLMs show consistent performance gains on benchmarks. While it is true that the tooling around LLMs has improved substantially over the last year, the core capability of coding—and the quality of the generated code—has not improved much, and the failure modes remain essentially the same. I would even go so far as to say that even if LLMs reach superhuman benchmark performance, AI coding agents would still lag behind humans in real-world coding work, because the benchmarks are merely surrogate tests and are themselves static.