
Interpretation · ~11 min read

What a P-Value Actually Means (and the One Sentence Almost Every Paper Gets Wrong)

Statistics for clinical researchers and surgical trainees

In short

A P-value is the probability of seeing data this extreme, or more extreme, if the null hypothesis were true. It is not the probability that the null hypothesis is true, not the probability your result happened "by chance," and not a measure of how large or clinically important an effect is. That reversal, stating a probability about the hypothesis when the number is actually a probability about the data, is the single most common statistical error in the published literature, and the American Statistical Association named it directly in its 2016 statement.

Open the discussion section of almost any clinical paper reporting P = .03 and you will find a sentence close to this one: "there is only a 3% probability that this result occurred by chance." It reads naturally. It sounds precise. It is also not what a P-value says, and it is the error the American Statistical Association's board sat down specifically to correct.1

This is not a pedantic distinction. The correct statement and the common misstatement point in different logical directions: one describes the data assuming the null hypothesis, the other claims to describe the hypothesis itself. Reviewers who catch this in your manuscript will ask you to rewrite it. More importantly, thinking about your own result the wrong way changes what you believe you found.

The sentence almost every paper gets wrong

The sentence takes a few interchangeable forms, and you have almost certainly written or read one of them:

  • "P = .03, meaning there is a 97% probability the treatment worked."
  • "There is only a 5% chance this finding is due to chance."
  • "P < .001 proves the two groups are truly different."

Each of these treats the P-value as a probability attached to the hypothesis: the probability the treatment worked, the probability the finding is real, the probability of "chance" as an explanation. A P-value is not any of those things. It is a probability attached to the data, computed on the assumption that the null hypothesis is already true.

What a P-value actually is

The ASA's statement defines it precisely: "Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value."1

Unpacked for a clinical audience: you start by assuming the null hypothesis is exactly true (no real difference between your groups, no real association between your variables). Under that assumption, you ask how surprising your observed data would be. The P-value answers that question as a probability: if there really were no effect, how often would a sample this size produce a difference this large, or larger, just from ordinary sampling variation? A small P-value means your data would be a surprising, rare outcome under "no effect." It does not mean "no effect" itself is now proven unlikely; it means your specific data are hard to reconcile with that assumption.
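The "how often, if there were no effect" question can be answered by brute force. The sketch below is a minimal permutation test, not the method used elsewhere in this article: it shuffles group labels many times (which is what "no real difference" looks like) and counts how often the shuffled gap in means is at least as large as the observed one. The function name, the eight-per-group sample, and the BMI-like values are all invented for illustration.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Two-sided permutation P-value for a difference in means.

    Hypothetical helper: estimates how often label-shuffled data (the
    null hypothesis of no group difference) produce a gap at least as
    large as the one actually observed.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            as_extreme += 1
    # Add-one correction so the estimate is never reported as exactly zero.
    return (as_extreme + 1) / (n_shuffles + 1)


# Invented values for illustration only; not taken from any dataset.
group_a = [31.2, 34.8, 29.9, 36.1, 33.4, 35.0, 32.7, 37.3]
group_b = [28.4, 30.1, 27.5, 31.9, 29.0, 26.8, 30.6, 28.9]

p = permutation_p_value(group_a, group_b)
print(f"permutation P = {p:.4f}")
```

Every step of the calculation happens inside a world where the null hypothesis is true. Nothing in it estimates how likely that world is.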

Every valid probability statement a P-value makes is conditional on the null hypothesis being true. Because the calculation starts by assuming the null, it cannot also report the probability that the null is true. Statisticians call this confusing the probability of the data given the hypothesis with the probability of the hypothesis given the data. Converting one into the other requires Bayes' theorem and a prior probability for the hypothesis, which a standard t-test or Mann-Whitney U does not supply.
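To see why the two probabilities are not interchangeable, it helps to work the conversion once. The sketch below applies Bayes' theorem to a simplified question: among studies that reach significance at a given threshold, what fraction had a true null? The function name and the inputs (alpha of .05, power of 80%, and the two prior probabilities) are assumptions chosen for illustration, and conditioning on "significant" rather than on an exact P-value is a deliberate simplification.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(null is true | result is significant), by Bayes' theorem.

    Hypothetical helper. prior_null is the assumed probability that the
    null is true before seeing data; alpha is the significance
    threshold; power is the chance of significance when the null is false.
    """
    false_positive = alpha * prior_null          # significant, null true
    true_positive = power * (1.0 - prior_null)   # significant, null false
    return false_positive / (false_positive + true_positive)


for prior_null in (0.5, 0.9):
    posterior = prob_null_given_significant(prior_null, alpha=0.05, power=0.80)
    print(f"prior P(null) = {prior_null:.1f} -> P(null | significant) = {posterior:.2f}")
```

Under these assumed inputs the same significance threshold leaves the null with roughly a 6% probability when it started at 50%, and roughly 36% when it started at 90%. The answer depends on a prior that the test itself never sees, which is why a P-value alone cannot deliver it.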

A computed example: what P < .001 does and does not tell you

Consider a real comparison: do patients later diagnosed with diabetes have a higher body mass index at baseline than patients who are not? Using the Pima Indians Diabetes dataset (see Data Sources), Shapiro-Wilk analysis rejects normality in both groups (P < .001 for each), so the comparison is routed to Mann-Whitney U rather than an independent-samples t-test.

StatsPlease output: Mann-Whitney U test

Group                    n     Median BMI   IQR
Diabetes diagnosis       266   34.3         30.9–38.9
No diabetes diagnosis    491   30.1         25.6–35.3

U = 89731 · P < .001 · r = 0.37 (medium)

Body mass index was significantly higher in patients with a diabetes diagnosis (median 34.3, IQR 30.9–38.9) than in those without (median 30.1, IQR 25.6–35.3), U = 89731, P < .001, r = 0.37.

Example output, shown in AMA-style formatting.

Computed from the Pima Indians Diabetes dataset (National Institute of Diabetes and Digestive and Kidney Diseases, accessed via the Plotly public datasets collection; see Data Sources), n = 757 after excluding 11 records with BMI recorded as 0, a missing-data artifact in the original file. Figures computed with scipy from real data.
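For readers who want to see what sits behind a U statistic, here is a minimal pure-Python sketch of the rank-based calculation. It is not StatsPlease's or scipy's code: the function names and the two five-value samples are invented for illustration, and it computes only U and the rank-biserial r, not the P-value. In practice a library routine such as scipy.stats.mannwhitneyu does this work.

```python
def midranks(values):
    """Rank values 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U for x, U for y, rank-biserial r oriented toward x)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2   # pairs where x beats y (ties count half)
    u_y = n1 * n2 - u_x                    # the complementary orientation
    r_rank_biserial = 2 * u_x / (n1 * n2) - 1
    return u_x, u_y, r_rank_biserial


# Invented values for illustration only; not the real dataset.
x = [34.3, 38.9, 30.9, 36.0, 33.1]
y = [30.1, 25.6, 35.3, 28.4, 30.9]

u_x, u_y, r_rb = mann_whitney_u(x, y)
print(f"U(x) = {u_x}, U(y) = {u_y}, rank-biserial r = {r_rb:.2f}")
```

U counts, across every possible pairing of one value from each group, how often the first group's value is the larger one. The rank-biserial r rescales that count to run from -1 to 1.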

Here is what P < .001 means for this specific result: if there were truly no difference in baseline BMI between people who do and do not go on to receive a diabetes diagnosis, a sample this size would produce a gap this large, or larger, less than once in a thousand times. That is a strong signal against the null hypothesis.

Here is what it does not mean. It does not mean there is a 99.9% probability that a real BMI difference exists. It does not mean the null hypothesis has a 0.1% chance of being true. It does not, by itself, tell you whether a roughly 4-point median BMI gap is large enough to matter clinically; that is what the effect size, r = 0.37, and the confidence interval around it are for, not the P-value.3 The P-value's whole job here is to describe how compatible your data are with "no difference." It has no opinion on how probable "no difference" was to begin with.

Common mistake

Writing "the difference was real 99.9% of the time" or "there was a 0.1% chance the null hypothesis was true" after a result of P < .001. Neither sentence is licensed by the test. The correct statement stays on the data: the observed difference would be very unlikely to occur under the assumption of no true difference.

What a P-value is not: the ASA's own wording

The ASA's statement lists this as its second numbered principle, stated as plainly as a professional body statement ever gets: "P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself."1

That sentence, "the p-value is neither," is worth pinning above your desk. It rules out both common misreadings at once: the "probability the hypothesis is true" reading, and the "probability of chance" reading. Both treat the number as if it describes the explanation. It describes the data.

Not every significant result is equally significant

A second, related trap: treating P = .04 and P < .001 as interchangeable because both clear the conventional .05 threshold. The ASA addresses this too, in its sixth principle: "a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large p-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data."1 A result of P = .04 and a result of P < .001 both sit on the "significant" side of a bright line, but they are not equally strong evidence against the null, and P > .05 is not evidence that the null is correct either. It is simply an absence of strong evidence against it, which is a different thing from proof of "no effect."
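One rough way to see the gap between these results is to translate each two-sided P-value into the standard-normal test statistic that would produce it. This is an illustration, not part of the ASA statement, and it assumes a test whose statistic is approximately normal under the null; the function name is hypothetical.

```python
from statistics import NormalDist


def z_from_two_sided_p(p):
    """|z| that corresponds to a two-sided P-value under a standard normal."""
    return NormalDist().inv_cdf(1 - p / 2)


for p in (0.05, 0.04, 0.001):
    print(f"P = {p} -> |z| = {z_from_two_sided_p(p):.2f}")
```

On this scale P = .04 sits only just past the conventional cutoff of about 1.96 standard errors, while P = .001 sits beyond 3. Both are "significant," but the data in the second case are much harder to reconcile with no effect.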

How often does this go wrong in practice

This is not a hypothetical concern about careless writing. A 2022 survey by Uppsala University researchers of 75 doctoral students and 64 professional statisticians and epidemiologists tested this distinction directly, presenting participants with a significant test result and asking what it did and did not license them to conclude about the null and alternative hypotheses. Correct answers, meaning the respondent recognized that a statistically significant finding cannot be read as proof of the alternative hypothesis or as evidence that the null hypothesis is improbable, were given by only 10.7% of the doctoral students and 12.5% of the statisticians and epidemiologists.2 Even people with substantial statistical training, working professionally with statistics, reached for the wrong sentence most of the time. The error is not a matter of insufficient education; it is a matter of the sentence itself being a natural but incorrect shortcut.

How to write the sentence correctly

You do not need to abandon P-values or hedge every result into unreadability. You need one substitution: keep the probability statement attached to the data, not the hypothesis.

Instead of: "P < .001, so there is a 99.9% probability the groups truly differ."
Write: "The observed difference (U = 89731, P < .001) would be very unlikely to occur if there were truly no difference between groups."

Instead of: "There is only a 4% chance this result is due to chance (P = .04)."
Write: "The result was statistically significant (P = .04), though a P-value this close to the conventional threshold offers only modest evidence against the null hypothesis on its own."

Both rewrites are one clause longer than the wrong version and considerably more defensible under review. Pair every P-value with an effect size and, where possible, a confidence interval; the P-value tells a reader how surprising your data are under "no effect," the effect size tells them how big the effect might actually be.

Why a language model will happily write the wrong sentence for you

Ask ChatGPT or a similar LLM to "interpret this P-value" and it will typically produce a fluent, confident paragraph, and that paragraph will often contain exactly the reversal described above. That is not a bug specific to one model; it is a structural consequence of how these tools work. A language model predicts plausible next words based on patterns in its training text, and the training text is full of published papers making this exact error. The model is not deriving a probability from your data; it is pattern-matching to how people have written about P-values before, mistakes included.

StatsPlease does not generate an interpretation of your result. It computes the test, the exact P-value, and the effect size directly from the numbers you upload, using the same procedures (scipy's implementations of Shapiro-Wilk, Mann-Whitney U, the t-test, and others) that produce identical output in SPSS or R. Re-run the same dataset anywhere and you get the same number, because it is computed, not generated.

Try it yourself

Reproduce this result in SPSS or StatsPlease

The BMI comparison above comes from a public dataset. Run it yourself in either tool to confirm the numbers match, and to see how the two tools present the same test.

In SPSS

  1. Download the Pima Indians Diabetes dataset (see Data Sources) and open it in SPSS.
  2. Check normality first: Analyze → Descriptive Statistics → Explore. Add BMI to the Dependent List and Outcome (diabetes diagnosis) to Factor List. Tick "Normality plots with tests." Both groups fail Shapiro-Wilk (P < .001).
  3. Run the test: Analyze → Nonparametric Tests → Legacy Dialogs → 2 Independent Samples. Move BMI into Test Variable List and Outcome into Grouping Variable (define groups 0 and 1). Tick Mann-Whitney U.
  4. SPSS reports U = 40875 and the exact P-value. It does not compute an effect size automatically: calculate r = |Z| / √N ≈ 0.31 using the Z value SPSS provides (SPSS may print Z with a negative sign, so take its absolute value). This is a related but different estimator from the rank-biserial r that StatsPlease reports.

In StatsPlease

  1. Download the same CSV from Data Sources.
  2. Upload it and choose BMI as the outcome and Outcome (diabetes diagnosis) as the grouping variable.
  3. Run. StatsPlease checks normality with Shapiro-Wilk, routes to Mann-Whitney U automatically, and returns U = 89731, the exact P-value, and the rank-biserial effect size r = 0.37 (medium) in a ready-to-paste AMA sentence.

Compare: both paths run the identical test on the identical data and return the identical P < .001. The two U values are the same statistic seen from opposite sides: SPSS reports the smaller orientation (U = 40875), scipy and StatsPlease the complementary one (U = 89731), and they always sum to n₁ × n₂ = 130,606. The effect sizes differ by estimator, not by disagreement: Z/√N ≈ 0.31 in SPSS's manual calculation, rank-biserial r = 0.37 in StatsPlease's output.
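The arithmetic linking these figures can be checked by hand or in a few lines. The sketch below starts from the group sizes and the U value reported above and recovers the complementary U and both effect sizes. One assumption is labeled in the code: Z is computed with the plain normal approximation, without the tie correction that statistical packages usually apply, so a package's Z may differ slightly in later decimals.

```python
import math

# Figures reported in this article.
n1, n2 = 266, 491        # diabetes diagnosis, no diabetes diagnosis
u_large = 89731          # orientation reported by scipy and StatsPlease

# The two orientations of U always sum to n1 * n2.
u_small = n1 * n2 - u_large

# Rank-biserial correlation from U.
r_rank_biserial = 2 * u_large / (n1 * n2) - 1

# Z from the normal approximation to U.
# Assumption: no tie correction and no continuity correction.
mean_u = n1 * n2 / 2
sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (u_large - mean_u) / sd_u
r_from_z = abs(z) / math.sqrt(n1 + n2)

print(f"n1 * n2 = {n1 * n2}, complementary U = {u_small}")
print(f"rank-biserial r = {r_rank_biserial:.2f}")
print(f"Z = {z:.2f}, r = |Z| / sqrt(N) = {r_from_z:.2f}")
```

Both effect sizes come from the same U. They differ only in how that U is rescaled, which is why a manuscript should name the estimator it reports.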

References

  1. Wasserstein RL, Lazar NA. The ASA's Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016;70(2):129–133. https://doi.org/10.1080/00031305.2016.1154108
  2. Lytsy P, Hartman M, Pingel R. Misinterpretations of P-values and statistical tests persists among researchers and professionals working with statistics and epidemiology. Upsala Journal of Medical Sciences. 2022;127:e8760. https://doi.org/10.48101/ujms.v127.8760
  3. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.

Two ways forward from here.

Work through it yourself using the guidance above: the exercise section shows the exact steps in SPSS. Or upload your dataset to StatsPlease and get the AMA-formatted result in 60 seconds, computed directly from your data, not generated.
