Sample Size in Clinical Research: What It Means for Your Stats

Sample size is not just a count of how many patients you recruited. It is the single thing that most directly decides which statistical tests are valid, how likely you are to detect a real difference, and what conclusions you can responsibly draw. Most surgical studies, audits, and trainee projects work with fewer than 50 patients in total — often far fewer. Understanding what that means is not optional.

Why sample size changes everything

When groups are large — roughly 100 or more each — a rule called the central limit theorem means the averages behave in a predictable, bell-shaped way even if the raw data do not. Parametric tests — tests that assume the data follow a normal, bell-shaped curve — work well at that size. When groups are small — under about 30 each — that protection disappears: averages become unreliable summaries, you can no longer assume a normal shape, and parametric tests can give misleading results. Most clinical research sits in this small-sample range.

The "fewer than 30" rule

When you have fewer than about 30 people per group, use a non-parametric test unless a normality test shows no significant departure from normality. Run a Shapiro-Wilk test on each continuous outcome, within each group.¹ If its p-value is above 0.05, a parametric test is acceptable; if it is 0.05 or below, use the non-parametric version. A p-value above 0.05 means only that no departure was detected, and in very small groups the test can miss one, so treat a pass as permission rather than proof. This is not a weakness in your study. It is simply a constraint that decides your analysis, and stating it openly in your methods makes your paper stronger, not weaker.
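The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not StatsPlease's own code: the function name `choose_by_normality` and the sample values are made up for the example, and `scipy.stats.shapiro` is assumed to be available.

```python
from scipy import stats

def choose_by_normality(values, alpha=0.05):
    """Run Shapiro-Wilk on one group's values and apply the 0.05 rule."""
    w, p = stats.shapiro(values)
    # p above alpha: no detected departure from normality.
    suggestion = "parametric" if p > alpha else "non-parametric"
    return w, p, suggestion

# Illustrative values only (not from any real dataset): one extreme
# outlier makes the sample strongly right-skewed.
skewed = [0.8, 0.9, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.3, 1.8, 2.7, 9.4]
w, p, suggestion = choose_by_normality(skewed)
print(f"W = {w:.3f}, p = {p:.4f} -> {suggestion}")
```

In a two-group comparison the check would be run once per group, and the non-parametric route taken if either group fails it.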

Test selection by sample size

People per group	Normal data	Not normal or ordered scores
Under 30	Welch's t-test (only if normality is confirmed)	Mann-Whitney U
30 to 100	Independent t-test	Mann-Whitney U
Over 100	Independent t-test	Either works
Under 30, three or more groups	One-way ANOVA (if normal)	Kruskal-Wallis
30 or more, three or more groups	One-way ANOVA	Kruskal-Wallis
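As a rough sketch, the table can be written as a lookup function. The name `suggest_test` and its parameters are hypothetical, chosen for this example. Where the table says "Either works" (over 100 per group, non-normal data), the sketch simply returns Mann-Whitney U.

```python
def suggest_test(n_per_group, n_groups=2, looks_normal=False):
    """Map the rows of the table above to a test name.

    n_per_group:  size of the smallest group.
    looks_normal: True only if the outcome is continuous and the
                  normality check raised no concern.
    """
    if n_groups < 2:
        raise ValueError("need at least two groups to compare")
    if n_groups >= 3:
        return "One-way ANOVA" if looks_normal else "Kruskal-Wallis"
    if not looks_normal:
        return "Mann-Whitney U"
    # Two groups, normal data: Welch's t-test below 30 per group.
    return "Welch's t-test" if n_per_group < 30 else "Independent t-test"

print(suggest_test(12))                      # Mann-Whitney U
print(suggest_test(45, looks_normal=True))   # Independent t-test
```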

Statistical power when groups are small

Power is the chance of detecting a real difference when one truly exists. The usual target is 80%. Consider a study of patients with severely reduced cardiac function (ejection fraction ≤ 20%) in which only 23 patients met this criterion. This is a common situation in clinical research: the subgroup of interest is small, and that small size limits what you can detect.

When groups are small, power drops sharply for moderate differences. For a medium difference (Cohen's d = 0.5, two-sided independent t-test at α = 0.05), roughly 10 people per group gives only about 18% power, 30 per group about 47%, and you need about 64 per group to reach the 80% target. The general point, that the sample size needed depends heavily on the effect size and the power you want, is shown clearly by Zhu in a detailed comparison of sample size methods for the Mann-Whitney U test,² and the small/medium/large effect categories come from Cohen.³

The practical message is not that small studies are worthless. It is that they can only reliably detect large effects, so describe them as pilot or exploratory and save firm claims for properly powered work.
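The power figures quoted above can be approximately reproduced with the noncentral t distribution. The sketch below assumes a two-sided independent t-test with equal group sizes, α = 0.05 and Cohen's d = 0.5. That is an assumption about how the figures were derived, and the function name `t_test_power` is made up for the example. Power for the Mann-Whitney U test will differ somewhat.

```python
from math import sqrt
from scipy import stats

def t_test_power(n_per_group, d=0.5, alpha=0.05):
    """Power of a two-sided, equal-n independent t-test for effect size d."""
    df = 2 * n_per_group - 2
    ncp = d * sqrt(n_per_group / 2)          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
    # Chance that |T| exceeds the critical value when the effect is real.
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

for n in (10, 30, 64):
    print(f"n = {n:>2} per group: power = {t_test_power(n):.3f}")
```

Passing a larger `d` (for example 0.8, Cohen's "large") shows how much faster power rises when the true effect is big.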

Small-sample results are summarised with the median and the interquartile range (the IQR, the middle 50% of the values), and StatsPlease flags the limited power, as in the output below:

StatsPlease output — small sample, power flag

Total n = 23. This comparison may be underpowered to detect small or medium effects reliably. Interpret with caution; findings should be replicated in a larger cohort.

Group (serum creatinine)	n	Median	IQR
Died (DEATH_EVENT = 1)	20	1.72 mg/dL	1.27–1.95
Survived (DEATH_EVENT = 0)	3	1.00 mg/dL	0.90–1.00

U = 54.5 · p = .028 · r = .47 (medium to large)

Among patients with severely reduced ejection fraction, serum creatinine was significantly higher in those who died (median 1.72 mg/dL, IQR 1.27–1.95) than in those who survived (median 1.00 mg/dL, IQR 0.90–1.00), U = 54.5, p = .028, r = .47. Note: with a total sample of n = 23, this study is powered to detect large effects only.

Example output.

Example data: Vanderbilt University Department of Biostatistics public teaching datasets (hbiostat.org/data). Figures computed with scipy from real data.
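The effect size r in the output can be checked by hand from U and the two group sizes. The sketch below assumes r = |z| / √N, with z taken from the normal approximation to U and no tie or continuity correction. The text does not say which formula was used, so treat this as one plausible reading. The function name `effect_size_r` is made up for the example.

```python
from math import sqrt

def effect_size_r(u, n1, n2):
    """r = |z| / sqrt(N), with z from the normal approximation to U."""
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mean_u) / sd_u
    return abs(z) / sqrt(n1 + n2)

# U = 54.5 with groups of 20 and 3, as in the output above.
print(round(effect_size_r(54.5, 20, 3), 2))  # 0.47
```

Applying a continuity or tie correction changes z slightly, so other software may report a value that differs in the second decimal place.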

What reviewers expect

For any study with fewer than about 50 patients in total, reviewers will ask: did you justify your sample size, either with a power calculation or by clearly calling the study exploratory? Did you test for normality and report the result? Did you choose parametric or non-parametric tests appropriately for your size? Did you report effect sizes rather than just p-values? And are your conclusions suitably cautious? Preparing for these questions in your methods section prevents most statistics-related revisions.

Template language for small studies

You can adapt this: "Given the exploratory nature of this study and a total sample size of n = [X], non-parametric tests were used throughout. Normality was assessed with the Shapiro-Wilk test. Results should be interpreted with caution pending confirmation in a larger cohort."

How StatsPlease handles small samples

StatsPlease detects how many people are in each group before choosing a test, runs the Shapiro-Wilk check automatically, and defaults to a non-parametric test when groups are small or the data are not normal. It flags comparisons that are likely underpowered in the output and includes the appropriate cautious wording in the generated methods statement.

Try it yourself

Reproduce this result — in StatsPlease or SPSS

The small-sample result above comes from a public dataset, so you can run the same comparison and see the power caution for yourself.

In StatsPlease

Download the heart failure dataset (see Data Sources) and save it as CSV.
Keep only patients with ejection fraction ≤ 20%, then choose serum creatinine as the outcome and death event as the group.
Run. StatsPlease selects Mann-Whitney U, reports U, p, and r, and flags the small sample.

In SPSS

Open the same CSV. Use Data ▸ Select Cases to keep ejection fraction ≤ 20.
Run Analyze ▸ Nonparametric Tests ▸ Independent Samples (Mann-Whitney U).
Read U and the p-value — and remember the small n yourself.

Compare: both should return U = 54.5 and p = .028. SPSS leaves the power caveat to you; StatsPlease flags the underpowered comparison and writes the cautious wording into the methods statement.
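The same filter-and-compare steps can also be sketched in Python with scipy. Only `DEATH_EVENT` appears in the output above. The column names `ejection_fraction` and `serum_creatinine`, the file name in the final comment and both function names are assumptions for illustration, so adjust them to match your CSV.

```python
import csv
from scipy import stats

def creatinine_by_outcome(rows, ef_max=20):
    """Split serum creatinine by DEATH_EVENT for rows with EF <= ef_max."""
    died, survived = [], []
    for row in rows:
        if float(row["ejection_fraction"]) <= ef_max:
            value = float(row["serum_creatinine"])
            if int(float(row["DEATH_EVENT"])) == 1:
                died.append(value)
            else:
                survived.append(value)
    return died, survived

def run_comparison(path):
    """Read the CSV, keep the low-EF subgroup, and run Mann-Whitney U."""
    with open(path, newline="") as f:
        died, survived = creatinine_by_outcome(csv.DictReader(f))
    result = stats.mannwhitneyu(died, survived, alternative="two-sided")
    return len(died), len(survived), result.statistic, result.pvalue

# Example call (hypothetical file name):
# print(run_comparison("heart_failure.csv"))
```

scipy reports U for the first sample passed in, so swapping the group order returns n1 × n2 minus that value. Packages also differ in which of the two U values they display and in how they handle ties, so compare the p-value as well as U.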

References

1. Mishra P, Pandey CM, Singh U, Gupta A, Sahu C, Keshri A. Descriptive Statistics and Normality Tests for Statistical Data. Annals of Cardiac Anaesthesia. 2019;22(1):67–72. https://doi.org/10.4103/aca.ACA_157_18
2. Zhu X. Sample size calculation for Mann-Whitney U test with five methods. International Journal of Clinical Trials. 2021;8(3):184–195. https://doi.org/10.18203/2349-3259.ijct20212840
3. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
