Question 1

What does p = 0.05 actually mean (and what doesn't it mean)?

Accepted Answer

The literal definition: if the null hypothesis (H₀) is exactly true and you ran your study many times, results at least this extreme would occur in 5% of those repetitions. That is the entire definition. p = 0.05 does NOT mean there is a 5% chance the null is true; that would be a Bayesian posterior, which requires a prior distribution this calculation never sees. It does NOT mean a 95% chance the alternative is true, NOT a 5% chance the result is "due to chance", NOT a 5% chance the result will fail to replicate, and NOT a measure of how big or important the effect is. Rejecting whenever p ≤ 0.05 gives a long-run guarantee about your decision rule (at most 5% false rejections when H₀ is true); p is not a probability about the hypotheses in your specific study. The 0.05 cutoff is also a historical convention from R.A. Fisher in the 1920s; there is nothing magical about it, and a 2018 paper by Benjamin et al. (72 co-authors) argued for shifting the default to 0.005 to fight the replication crisis.
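The "many repetitions" reading of the definition can be checked with a small simulation. This is only a sketch under stated assumptions: the function names, the sample size of 30, the known SD of 1, and the seed are illustrative choices, not part of the calculator.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_tailed_p(z: float) -> float:
    """Two-tailed p-value for a z statistic under a standard normal null."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_null_rejections(n_studies=20_000, n=30, alpha=0.05, seed=0):
    """Share of simulated studies with p <= alpha when H0 is exactly true.

    Assumption: each study draws n values from N(0, 1), so H0 (mean = 0)
    holds and the population SD is known to be 1.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
        z = total / sqrt(n)  # sample mean divided by its standard error
        if two_tailed_p(z) <= alpha:
            hits += 1
    return hits / n_studies


rate = simulate_null_rejections()
print(f"share of null studies with p <= 0.05: {rate:.3f}")
```

The printed share should land close to 0.05. It says how often a true null produces such a result, not how likely the null is to be true.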

Question 2

Should I use a one-tailed or two-tailed p-value?

Accepted Answer

Use a two-tailed p-value by default: it tests whether the parameter differs from H₀ in either direction, which is almost always the question you actually care about. Use a one-tailed p-value only when there is a strong, pre-registered theoretical reason to test in a single direction, and when finding the effect in the opposite direction would not be interesting (or would still count as "no effect"). The one-tailed test has more statistical power for the chosen direction (when the effect falls in that direction, the one-tailed p is half the two-tailed p), but if you ever decide the direction after seeing the data you have doubled your true Type I error rate, a textbook form of p-hacking. Most journals require explicit justification for one-tailed tests precisely because they are easy to abuse. When in doubt, use two-tailed; the chart above shades both tails when you do, so you can see the symmetric rejection region directly.
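The relationship between the tails can be sketched for a z statistic as follows. The helper name `z_p_values` and the example value z = 1.8 are hypothetical, chosen only to show a case where the choice of tail changes the verdict at α = 0.05.

```python
from statistics import NormalDist


def z_p_values(z: float) -> dict:
    """One-tailed and two-tailed p-values for a z statistic."""
    phi = NormalDist().cdf
    upper = 1.0 - phi(z)              # H1: parameter is greater than the H0 value
    lower = phi(z)                    # H1: parameter is less than the H0 value
    two_tailed = 2.0 * min(upper, lower)
    return {"upper": upper, "lower": lower, "two_tailed": two_tailed}


p = z_p_values(1.8)
print(f"upper-tail p = {p['upper']:.4f}")
print(f"two-tailed p = {p['two_tailed']:.4f}")
```

For z = 1.8 the upper-tail p falls below 0.05 while the two-tailed p does not, which is why the direction has to be fixed before the data are seen.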

Question 3

What is the difference between a p-value and an effect size?

Accepted Answer

A p-value tells you how strongly the data argue against H₀: specifically, how unlikely a result at least as extreme as the observed one would be if H₀ were true. It does NOT tell you how big the effect is, and it does not prove that an effect exists. Effect size measures the magnitude of the effect independent of sample size: Cohen's d for mean differences, Pearson's r for correlations, odds ratio for binary outcomes, η² for ANOVA. The two are linked through sample size and statistical power: with a huge sample, even a trivial effect (Cohen's d = 0.01) can produce p < 0.001, while a real and important effect (d = 0.8) might fail to clear p < 0.05 with a small sample. The American Statistical Association's 2016 statement on p-values explicitly recommends reporting effect sizes and confidence intervals alongside p, because p alone is uninformative about practical significance. Concrete example: a vaccine that reduces flu cases by 0.1% in 10 million subjects can have a vanishingly small p but a clinically negligible effect; a treatment that doubles survival in a 30-person pilot might miss p < 0.05 yet warrant a much larger trial. Always pair p with effect size.
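A rough sketch of how sample size drives p for a fixed effect size, assuming a two-sample z-test with equal group sizes and known SD, so that z = d · √(n/2). The function name and the group sizes (one million and ten per group) are illustrative assumptions; a real small-sample comparison would use a t-test, which gives an even larger p.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_p(d: float, n_per_group: int) -> float:
    """Two-tailed p for a standardized mean difference d (Cohen's d).

    Assumes two equal-sized groups and a known common SD, so the
    z statistic is d * sqrt(n_per_group / 2).
    """
    z = d * sqrt(n_per_group / 2.0)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


p_trivial_effect = two_sample_z_p(d=0.01, n_per_group=1_000_000)
p_large_effect = two_sample_z_p(d=0.8, n_per_group=10)

print(f"d = 0.01, n = 1,000,000 per group: p = {p_trivial_effect:.2e}")
print(f"d = 0.80, n = 10 per group:        p = {p_large_effect:.3f}")
```

Under these assumptions the trivial effect clears p < 0.001 easily, while the large effect in the small sample stays above 0.05.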

Question 4

What are the most common mistakes people make with p-values?

Accepted Answer

The first is treating p < 0.05 as proof of an effect and p > 0.05 as proof of no effect. Both interpretations are wrong: p-values are continuous measures of evidence against H₀, not categorical truths, and "absence of evidence is not evidence of absence". The second is p-hacking: running many tests, sub-group analyses, or transformations and reporting only the significant ones. If every null is true, 20 independent tests at α = 0.05 produce one false positive on average, and the chance of at least one is 1 − 0.95²⁰ ≈ 64%. The third is confusing statistical significance with practical significance; on a million-row dataset, a trivially small effect can have p < 0.001 and still be irrelevant. The fourth is forgetting to correct for multiple testing (Bonferroni for the family-wise error rate, Benjamini-Hochberg for the false discovery rate) or to pre-register which tests will be run. The fifth is misreporting: "p = 0.5" is not the same as "p = 0.05", and slip-of-keyboard errors are common in published papers. Finally, never report only "p < 0.05"; always include the exact p, the effect size, and a confidence interval so readers can assess both significance and magnitude.
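The multiple-testing arithmetic and the two named corrections can be sketched in a few lines. The function names and the five example p-values are made up for illustration; a production analysis would normally rely on a statistics library rather than this hand-rolled version.

```python
def familywise_error_rate(m: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive in m independent tests, all nulls true."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each test whose p is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_max
    return reject


fwer_20 = familywise_error_rate(20)
example_p = [0.001, 0.008, 0.012, 0.041, 0.2]  # hypothetical p-values
bonf = bonferroni_reject(example_p)
bh = benjamini_hochberg_reject(example_p)

print(f"P(at least one false positive in 20 tests) = {fwer_20:.3f}")
print("Bonferroni rejections:        ", bonf)
print("Benjamini-Hochberg rejections:", bh)
```

On these example values Bonferroni keeps two results and Benjamini-Hochberg keeps three, reflecting the stricter guarantee Bonferroni provides.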

Question 5

When should I not use this calculator?

Accepted Answer

Skip it for small-sample t-tests with unknown population SD — you need a t-distribution p-value with appropriate degrees of freedom (df = n − 1), not a z-based one. The t-distribution has heavier tails, so the same statistic produces a larger p; using this calculator on a t-statistic systematically understates p (overstates significance) and inflates false positives. At df = 5, a t-statistic of 2.0 gives p ≈ 0.102 two-tailed; the same number as a z would give p ≈ 0.046 — a qualitatively different conclusion. Do not use it for chi-square, F, or other non-normal test statistics either; each has its own reference distribution. It is the wrong tool for non-parametric tests (Mann-Whitney, Wilcoxon, Kruskal-Wallis), exact tests (Fisher's exact, binomial), or permutation/bootstrap p-values, which all require dedicated calculators or software. Avoid it as a stand-alone gate without considering effect size, sample size, and statistical power. And do not use it for Bayesian inference — the probability that H₀ is true given the data is a posterior, not a p-value, and requires a prior.
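The df = 5 comparison above can be reproduced with the standard library alone. This sketch integrates the Student t density numerically with Simpson's rule; the function names and the step count are illustrative assumptions, and a statistics package would normally be used instead.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_two_tailed_p(t: float, df: float, steps: int = 2000) -> float:
    """Two-tailed p for a t statistic: 1 minus the probability mass in [-|t|, |t|].

    The mass in [0, |t|] is computed with Simpson's rule (steps must be even)
    and doubled by symmetry.
    """
    b = abs(t)
    if b == 0.0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_mass = total * h / 3
    return 1.0 - 2.0 * half_mass


def z_two_tailed_p(z: float) -> float:
    """Two-tailed p for the same number treated as a z statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


p_t = t_two_tailed_p(2.0, df=5)
p_z = z_two_tailed_p(2.0)
print(f"t = 2.0, df = 5: p = {p_t:.3f}")
print(f"z = 2.0:         p = {p_z:.3f}")
```

The gap between the two values narrows as df grows, which is why the z-based result is only a reasonable stand-in for large samples.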
