P-Value Calculator
Convert a z-score (or any standard-normal test statistic) into a p-value for a one- or two-tailed hypothesis test, then see it visualised as the shaded tail area on the standard normal curve. The smaller the p-value, the stronger the evidence against the null hypothesis — reject H₀ when p drops below your significance threshold (commonly 0.05).
The red shaded area is the p-value: the probability mass in the tail(s) at least as extreme as your test statistic. A significant result means this red area is smaller than your significance level α.
About this calculator
The p-value answers a precise question that is easy to misunderstand: assuming the null hypothesis (H₀) is exactly true, how often would random sampling produce a test statistic at least as extreme as the one you observed? The formulas for a standard-normal test statistic z are: two-tailed p = 2·[1 − Φ(|z|)], right-tailed p = 1 − Φ(z), left-tailed p = Φ(z), where Φ is the standard normal cumulative distribution function. This calculator evaluates Φ via the Abramowitz-Stegun rational approximation of the error function: Φ(z) = ½·[1 + erf(z/√2)]. The red shaded area on the bell curve above is exactly that probability, visualised as the tail (or two tails) beyond your test statistic.
Significance level α is the long-run rate of false rejections you are willing to tolerate, decided before seeing the data. Common thresholds: α = 0.05 (1-in-20 false-positive rate) is the default in psychology, education, and most social science; α = 0.01 (1-in-100) is common in stricter settings such as some clinical and pharmacological work; α = 0.001 (1-in-1,000) is used in high-stakes inference like genome-wide association studies, usually after multiple-testing correction; particle physics famously requires 5σ (one-tailed p < 3·10⁻⁷) for a discovery claim. The 0.05 cutoff itself is a historical convention attributed to R.A. Fisher in the 1920s. There is nothing magical about it, and p = 0.049 and p = 0.051 represent essentially identical evidence. The decision rule is simply: reject H₀ if p < α, otherwise fail to reject.
Choose a two-tailed test (the default) when deviations in either direction from H₀ would interest you, which is almost always the right choice. Choose one-tailed only when a directional hypothesis was pre-specified before seeing the data; deciding the direction after looking at the data doubles your true Type I error rate and is a textbook form of p-hacking.
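The three tail formulas and the decision rule can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code: it uses the standard library's `math.erf` rather than the Abramowitz-Stegun approximation, and the function names `normal_cdf`, `p_value`, and `reject_null` are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF via Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_value(z: float, tail: str = "two") -> float:
    """p-value for a standard-normal statistic z.

    tail is "two", "right", or "left" (labels chosen for this sketch).
    """
    if tail == "two":
        return 2.0 * (1.0 - normal_cdf(abs(z)))
    if tail == "right":
        return 1.0 - normal_cdf(z)
    if tail == "left":
        return normal_cdf(z)
    raise ValueError("tail must be 'two', 'right', or 'left'")


def reject_null(z: float, alpha: float = 0.05, tail: str = "two") -> bool:
    """Decision rule: reject H0 when p < alpha."""
    return p_value(z, tail) < alpha


print(p_value(1.96), reject_null(1.96))
print(p_value(1.96, "right"), p_value(-1.96, "left"))
```

Note that for a statistic in the predicted direction the one-tailed p is half the two-tailed p, which is why the tail choice has to be fixed before looking at the data.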
This calculator assumes the test statistic follows a standard normal distribution, which holds when the population SD is genuinely known or the sample is large (n ≥ 30 as a rough rule). For small samples with estimated SD, use a t-distribution p-value calculator instead — the heavier tails of the t-distribution produce a larger p for the same statistic, and using z here would systematically overstate significance. Edge cases: very large |z| (above about 5) returns p ≈ 0 because the rational approximation cannot represent extremely small tail probabilities exactly — the true value is real but vanishingly small (|z| = 5 gives p ≈ 5.7·10⁻⁷ two-tailed; |z| = 6 gives p ≈ 2·10⁻⁹). A z of exactly 0 gives p = 1 two-tailed and p = 0.5 one-tailed. Critically, a p-value is NOT the probability that H₀ is true (that is a Bayesian posterior requiring a prior), NOT the probability the result is due to chance, and NOT the probability of replication. It is a conditional probability given H₀ — a long-run error-rate guarantee about your decision rule, nothing more.
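To make the large-|z| edge case concrete, here is a hedged sketch. It assumes the approximation in question is Abramowitz-Stegun formula 7.1.26 (a common choice with absolute error around 1.5·10⁻⁷; the text above does not say which formula the calculator uses) and compares it with an accurate tail computed from the standard library's `math.erfc`. The names `erf_as`, `two_tailed_approx`, and `two_tailed_exact` are hypothetical.

```python
import math

# Coefficients of Abramowitz-Stegun formula 7.1.26 (assumed variant).
P = 0.3275911
A1, A2, A3, A4, A5 = (0.254829592, -0.284496736, 1.421413741,
                      -1.453152027, 1.061405429)


def erf_as(x: float) -> float:
    """Approximate erf(x); absolute error is on the order of 1.5e-7."""
    sign = 1.0 if x >= 0 else -1.0
    x = abs(x)
    t = 1.0 / (1.0 + P * x)
    # Horner evaluation of A1*t + A2*t^2 + ... + A5*t^5
    poly = t * (A1 + t * (A2 + t * (A3 + t * (A4 + t * A5))))
    return sign * (1.0 - poly * math.exp(-x * x))


def two_tailed_approx(z: float) -> float:
    """Two-tailed p = 2 * (1 - Phi(|z|)) with Phi built from erf_as."""
    phi = 0.5 * (1.0 + erf_as(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)


def two_tailed_exact(z: float) -> float:
    """Same quantity via erfc, which stays accurate far into the tail."""
    return math.erfc(abs(z) / math.sqrt(2.0))


for z in (2.0, 5.0, 6.0):
    print(z, two_tailed_approx(z), two_tailed_exact(z))
```

An absolute error near 10⁻⁷ is harmless at z = 2 but is comparable to the whole tail probability at |z| = 5, so any displayed value in that range should be read as "vanishingly small" rather than as an exact figure.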
How to use
Example 1: Two-tailed test with z = 2.10. You ran a z-test and got z = 2.10 (e.g., the sample mean lies 2.10 standard errors above the hypothesised mean). Enter Test Statistic = 2.10, Test Type = two-tailed. The calculator computes Φ(2.10) ≈ 0.9821, so p = 2·(1 − 0.9821) = 2·0.0179 ≈ 0.0357. ✓ Since p < 0.05, you reject H₀ at the 5% significance level. The result says: if H₀ were true, you would see a test statistic this extreme (in either direction) about 3.6% of the time.
Example 2: One-tailed test with z = −1.50. You pre-specified a left-tailed alternative (e.g., a new manufacturing process has a lower defect rate than baseline). Enter Test Statistic = −1.50, Test Type = left-tailed. Φ(−1.50) ≈ 0.0668, so p ≈ 0.0668. ✓ Since p > 0.05, you fail to reject H₀ at the 5% level. The evidence is suggestive (under H₀, a statistic this low or lower occurs about 6.7% of the time) but does not clear the conventional threshold. Note: a two-tailed test on the same z gives p ≈ 0.1336, which is more conservative. That is the cost (or honesty) of two-tailed testing.
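Both worked examples can be checked independently with Python's standard library. This sketch uses `statistics.NormalDist` for Φ, which is a convenient cross-check and not necessarily how the calculator computes it; the variable names are illustrative.

```python
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal CDF (mean 0, SD 1)

# Example 1: two-tailed test with z = 2.10
z1 = 2.10
p1 = 2.0 * (1.0 - phi(abs(z1)))

# Example 2: left-tailed test with z = -1.50
z2 = -1.50
p2_left = phi(z2)
p2_two = 2.0 * (1.0 - phi(abs(z2)))  # same z, two-tailed, for comparison

print(f"Example 1: Phi = {phi(z1):.4f}, two-tailed p = {p1:.4f}")
print(f"Example 2: left-tailed p = {p2_left:.4f}, two-tailed p = {p2_two:.4f}")
```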
Frequently asked questions
What does p = 0.05 actually mean (and what doesn't it mean)?
The literal definition: if the null hypothesis (H₀) is exactly true and you ran your study many times, results at least this extreme would occur in 5% of those repetitions. That is the entire definition, full stop. p = 0.05 does NOT mean there is a 5% chance the null is true; that would be a Bayesian posterior, which requires a prior distribution this calculation never sees. It does NOT mean a 95% chance the alternative is true, NOT a 5% chance the result is "due to chance", NOT a 5% chance the result will fail to replicate, and NOT a measure of how big or important the effect is. p is a long-run error-rate guarantee about your decision rule, not a probability about your specific study. The 0.05 cutoff is also a historical convention from R.A. Fisher in the 1920s. There is nothing magical about it, and a 2018 paper by Benjamin et al. (72 authors) argued for shifting the default to 0.005 to fight the replication crisis.
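The "many repetitions" reading can be illustrated with a small simulation. The sketch below assumes a one-sample z-test with known SD = 1 and a true mean of 0 (so H₀ is exactly true), then counts how often the two-tailed p falls below α. The function name, sample size, and repetition count are arbitrary choices for illustration.

```python
import math
import random


def simulate_rejection_rate(reps: int = 20000, n: int = 25,
                            alpha: float = 0.05, seed: int = 1) -> float:
    """Fraction of simulated studies with two-tailed p < alpha when H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # One study: n draws from N(0, 1), where the null (mean = 0) holds.
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = sample_mean * math.sqrt(n)          # z = mean / (sigma / sqrt(n))
        p = math.erfc(abs(z) / math.sqrt(2.0))  # two-tailed p-value
        if p < alpha:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print(simulate_rejection_rate())
```

The fraction should land close to α, which is the long-run error rate the definition describes; it says nothing about whether H₀ is true in any single study.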
Should I use a one-tailed or two-tailed p-value?
Use a two-tailed p-value by default: it tests whether the parameter differs from H₀ in either direction, which is almost always the question you actually care about. Use a one-tailed p-value only when there is a strong, pre-registered theoretical reason to test in a single direction, and when finding the effect in the opposite direction would not be interesting (or would still count as "no effect"). The one-tailed test has more statistical power for the chosen direction (when the observed effect lies in that direction, the p-value is half the two-tailed one), but if you ever decide the direction after seeing the data you have doubled your true Type I error rate, a textbook form of p-hacking. Most journals require explicit justification for one-tailed tests precisely because they are easy to abuse. When in doubt, use two-tailed; the chart above then shades both tails, so you can see the symmetric tail areas directly.
What is the difference between a p-value and an effect size?
A p-value tells you how strongly the data argues against H₀: specifically, how unlikely the observed result would be if H₀ were true. It does NOT tell you how big the effect is, nor does it prove that a real effect exists. Effect size measures the magnitude of the effect independent of sample size: Cohen's d for mean differences, Pearson's r for correlations, odds ratio for proportions, η² for ANOVA. The two are linked through sample size: with a huge sample, even a trivial effect (Cohen's d = 0.01) can produce p < 0.001, while a real and important effect (d = 0.8) might fail to clear p < 0.05 with a small sample. The American Statistical Association's 2016 statement on p-values explicitly recommends reporting effect sizes and confidence intervals alongside p, because p alone is uninformative about practical significance. Concrete example: a vaccine that reduces flu cases by 0.1% in 10 million subjects will have a vanishingly small p but a clinically negligible effect; a treatment that doubles survival in a 30-person pilot might miss p < 0.05 yet warrant a much larger trial. Always pair p with effect size.
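The dependence on sample size can be made explicit with a short sketch. It assumes the simplest setting, a one-sample z-test with known SD, where the statistic is z = d·√n for a standardised effect d; other designs have different formulas. The function name and the sample sizes are illustrative choices, not values from the text.

```python
import math


def two_tailed_p_from_effect(d: float, n: int) -> float:
    """Two-tailed p for a one-sample z-test where z = d * sqrt(n).

    d is the standardised effect (mean difference divided by the known SD).
    """
    z = d * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2.0))


# A trivial effect with a huge sample versus a large effect with a tiny one.
print(two_tailed_p_from_effect(0.01, 1_000_000))
print(two_tailed_p_from_effect(0.8, 5))
```

The first case is overwhelmingly significant despite a negligible effect, while the second misses the 0.05 threshold despite a large one, which is why p should be read alongside an effect size.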
What are the most common mistakes people make with p-values?
The first is treating p < 0.05 as proof of an effect and p > 0.05 as proof of no effect. Both interpretations are wrong; p-values are continuous measures of evidence against H₀, not categorical truths, and "absence of evidence is not evidence of absence". The second is p-hacking: running many tests, sub-group analyses, or transformations and reporting only the significant ones. Running 20 independent tests at α = 0.05 when every null is true yields one false positive on average, and about a 64% chance of at least one. The third is confusing statistical significance with practical significance; on a million-row dataset, a trivially small effect can have p < 0.001 and still be irrelevant. The fourth is forgetting multiple-testing correction (Bonferroni or Benjamini-Hochberg FDR, ideally with the analysis plan pre-registered) when running many tests. The fifth is misreporting: "p = 0.5" is not the same as "p = 0.05" and slip-of-keyboard errors are common in published papers. Finally, never report only "p < 0.05"; always include the exact p, the effect size, and a confidence interval so readers can assess both significance and magnitude.
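The multiple-testing arithmetic and the two named corrections can be sketched as follows. The function names are hypothetical and the p-values in the usage lines are made-up numbers chosen only to show the two procedures disagreeing; they are not results from any study.

```python
def family_wise_error_rate(m: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across m independent tests, all nulls true."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Bonferroni: reject test i when p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff_rank = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected


print(family_wise_error_rate(20))

example_p = [0.001, 0.012, 0.039, 0.041, 0.6]  # illustrative values only
print(bonferroni_reject(example_p))
print(benjamini_hochberg_reject(example_p))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected share of false positives among the rejections, so it typically rejects at least as many hypotheses.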
When should I not use this calculator?
Skip it for small-sample t-tests with unknown population SD — you need a t-distribution p-value with appropriate degrees of freedom (df = n − 1), not a z-based one. The t-distribution has heavier tails, so the same statistic produces a larger p; using this calculator on a t-statistic systematically understates p (overstates significance) and inflates false positives. At df = 5, a t-statistic of 2.0 gives p ≈ 0.102 two-tailed; the same number as a z would give p ≈ 0.046 — a qualitatively different conclusion. Do not use it for chi-square, F, or other non-normal test statistics either; each has its own reference distribution. It is the wrong tool for non-parametric tests (Mann-Whitney, Wilcoxon, Kruskal-Wallis), exact tests (Fisher's exact, binomial), or permutation/bootstrap p-values, which all require dedicated calculators or software. Avoid it as a stand-alone gate without considering effect size, sample size, and statistical power. And do not use it for Bayesian inference — the probability that H₀ is true given the data is a posterior, not a p-value, and requires a prior.