The AI-Statistics Debate: Verification, Not Trust

A methods section either survives an audit or it doesn't. Where the number came from is a separate question.

Every few months, a new argument breaks out about whether a language model can be trusted to run a t-test or choose the right one. One camp says AI-generated statistics are a fabrication risk that has no place near a manuscript. The other says a well-prompted model is already as accurate as a harried resident doing it by hand. Both camps are arguing about the wrong thing. A methods section was never supposed to run on trust in the first place. Peer review does not ask a reviewer to trust that the author is competent; it asks whether the reported statistic follows from the reported test given the reported data. That check does not care whether a language model or a postdoc at midnight produced the number. It only cares whether the number can be retraced.

The same tool, three accuracy tiers

Ruta and colleagues tested ChatGPT against a fixed dataset (the 2019 National Inpatient Sample) across eight statistical methods, running each one ten times at three levels of prompt specificity: Basic, Intermediate, and Advanced.¹ Correctness meant agreement with Python, SAS, and RStudio to within 1%. Method selection accuracy rose from 47.5% at the Basic tier to 85.0% at Intermediate to 92.5% at Advanced. Accuracy on the actual inferential output, the test statistic and P value themselves, rose from 32.5% to 81.3% to 92.5% across those same three tiers.

Same model, same data, same afternoon. The only thing that moved was how precisely the user specified what they wanted. That is exactly the information a manuscript cannot show you. A methods section states which test was used and reports a statistic and a P value; it does not, and structurally cannot, disclose how the prompt was phrased, how many attempts preceded the number that made it into the paper, or whether that number was the first output or the fourth. An accuracy figure attached to a tool name in a validation study describes a range across conditions. It says nothing about which condition produced any one number sitting in front of you.
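The correctness criterion described above, agreement with reference software to within 1%, can be sketched in a few lines. This is a minimal illustration, not the study's own code: the function names are hypothetical, the tolerance is assumed to be relative to the reference value (the summary here does not say how the authors defined it), and the numbers are made up.

```python
def agrees_within(candidate: float, reference: float, rel_tol: float = 0.01) -> bool:
    """True if candidate is within rel_tol of reference (relative to the reference).

    Assumption: the tolerance is measured against the reference value.
    """
    if reference == 0:
        # A relative tolerance is undefined at zero, so require an exact match.
        return candidate == 0
    return abs(candidate - reference) <= rel_tol * abs(reference)


def accuracy(candidates: list[float], reference: float, rel_tol: float = 0.01) -> float:
    """Share of repeated runs whose output agrees with the reference."""
    hits = sum(agrees_within(c, reference, rel_tol) for c in candidates)
    return hits / len(candidates)


# Hypothetical example: four repeated runs of the same prompt, one reference value.
model_outputs = [2.31, 2.30, 2.45, 2.29]
reference_value = 2.30
share_correct = accuracy(model_outputs, reference_value)
print(share_correct)  # 0.75: three of the four runs fall within 1%
```

The point of the sketch is that an accuracy rate is a property of a batch of runs. A reader holding a single reported number has no batch to score it against.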

Getting the test right is not the same as producing a checkable answer

A second study, run independently, makes a related point from a different angle. Shukla and colleagues had five biostatistics experts score six large language models (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) against 20 standardized hypothesis-testing scenarios.² On the headline metric, every model picked the correct test in all 20 scenarios: 100% accuracy on selection, across the board. But when the same experts rated the quality of the reasoning offered to justify each choice, ChatGPT ranked lowest of the six on statistical reasoning despite matching the others on the final answer. That ranking is descriptive, based on expert scores; the between-model difference did not itself reach statistical significance, which is exactly the kind of distinction this article is about.

That gap matters more than the headline number. A correct test choice wrapped in reasoning a reviewer cannot follow is functionally the same as a black box: you can see the output, but you cannot audit the path that produced it. That is the property that should concern a methods section, not whether the underlying model is capable enough on average.

Humans fail the identical audit

None of this makes the case that human-computed statistics are the safer default. García-Berthou and Alcaraz checked whether the test statistics reported in published papers were internally consistent with their own reported P values, across every article in four consecutive volumes of Nature and a random sample of papers from BMJ, all published before large language models existed.³ Roughly 11% to 12% of the individual statistical results they checked were incongruent: the reported P value did not match what the reported test statistic and degrees of freedom implied. At least one such error turned up in about a quarter to more than a third of the papers examined. Every one of those manuscripts had passed peer review, and most had a co-author whose job was the statistics.

The comparison that matters is not AI versus human. It's verifiable versus not. A rushed coauthor transposing digits at 1 a.m. and a language model returning its lowest-accuracy-tier answer produce the identical failure mode: a number that looks exactly as credible as a correct one, sitting in a paragraph that gives the reader no way to tell the difference.
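The congruence check described above is mechanical enough to sketch. The version below is an illustration under stated assumptions, not the procedure García-Berthou and Alcaraz used: it handles only a two-sided z statistic, because that P value can be computed with the Python standard library, and the function names are hypothetical. For t or F statistics the same idea applies with a statistics library (for example `scipy.stats.t.sf`). The check allows for rounding in both the reported statistic and the reported P value.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def is_congruent(z: float, z_decimals: int, reported_p: float, p_decimals: int) -> bool:
    """Could reported_p come from a statistic that rounds to z?

    The reported statistic is treated as anything within half a unit of its
    last printed decimal; the reported P value gets the same allowance.
    """
    half_z = 0.5 * 10 ** (-z_decimals)
    # P shrinks as |z| grows, so the largest |z| gives the smallest P.
    p_low = two_sided_p_from_z(abs(z) + half_z)
    p_high = two_sided_p_from_z(max(abs(z) - half_z, 0.0))
    half_p = 0.5 * 10 ** (-p_decimals)
    return p_low - half_p <= reported_p <= p_high + half_p


# Hypothetical reported results, each printed to two decimals.
print(is_congruent(1.96, 2, 0.05, 2))  # True: the statistic implies P of about 0.05
print(is_congruent(1.96, 2, 0.01, 2))  # False: the reported P is too small for z = 1.96
```

Nothing in the check depends on who or what produced the numbers. It needs only the statistic, the P value, and the name of the test.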

Verification, not the source, is the standard

The fix is not a policy on whether AI is allowed near your data. It's an audit trail: the same test, applied the same way, reproducible from the raw data by anyone who has it, independent of who or what ran the numbers the first time. That is the bar a methods section should clear regardless of whether the analysis came from a spreadsheet macro, a graduate student, or a chat window. StatsPlease computes results deterministically from your uploaded data rather than generating a plausible-sounding number, so the statistic that ends up in your methods section is exactly reproducible the next time anyone reruns it.
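As a generic sketch of what such an audit trail could contain, the snippet below recomputes a Welch two-sample t statistic from raw data and stores it next to a fingerprint of the input. It is not a description of how StatsPlease or any other tool is implemented. The function names, the record layout, and the data are all hypothetical, and the P value is omitted because the standard library has no t distribution.

```python
import hashlib
import json
import math
import statistics


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    se2 = va / na + vb / nb
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def audit_record(a: list[float], b: list[float]) -> dict:
    """Test name, input fingerprint, and recomputed statistic in one record."""
    payload = json.dumps({"a": a, "b": b}, sort_keys=True).encode("utf-8")
    t, df = welch_t(a, b)
    return {
        "test": "Welch two-sample t-test",
        "data_sha256": hashlib.sha256(payload).hexdigest(),
        "n": [na := len(a), len(b)][0:2] if False else [len(a), len(b)],
        "t": round(t, 4),
        "df": round(df, 2),
    }


# Made-up data for illustration only.
group_a = [5.1, 4.9, 6.2, 5.8, 5.5]
group_b = [4.2, 4.8, 4.1, 5.0, 4.4]

first = audit_record(group_a, group_b)
second = audit_record(group_a, group_b)
print(first == second)  # True: same data and same test give the same record
```

A reviewer with the raw data can rebuild the record and compare it field by field. A mismatch in the fingerprint means the data differ; a mismatch in the statistic means the analysis does.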

References

1. Ruta MR, Gaidici T, Irwin C, Lifshitz J. ChatGPT for Univariate Statistics: Validation of AI-Assisted Data Analysis in Healthcare Research. Journal of Medical Internet Research. 2025;27:e63550. https://doi.org/10.2196/63550
2. Shukla M, Pandey D, Kaur S, Agarwal M, Goyal A, Sharma H. Evaluating the Accuracy and Explanatory Quality of Large Language Models ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat in Statistical Test Selection for Hypothesis Testing Decisions. Cureus. 2025;17(10):e94949. https://doi.org/10.7759/cureus.94949
3. García-Berthou E, Alcaraz C. Incongruence between test statistics and P values in medical papers. BMC Medical Research Methodology. 2004;4:13. https://doi.org/10.1186/1471-2288-4-13

Make the number checkable.

Upload your dataset to StatsPlease and get the test, the exact P value, and the effect size computed deterministically from your data: the same numbers on every rerun, in any tool. Computed, not generated.

Try StatsPlease free
