Hypothesis Testing Calculator
Run a one-sample t-test to decide whether a sample mean differs significantly from a hypothesised population mean. Returns 1 (reject the null hypothesis) or 0 (fail to reject) at your chosen confidence level using the one-sample t-statistic and a critical-value lookup.
About this calculator
A one-sample t-test checks whether the mean of a sample (x̄) is statistically different from a hypothesised population mean (μ₀). The test statistic is t = (x̄ − μ₀) / (s / √n), where s is the sample standard deviation and n is the sample size; s/√n is the standard error of the mean. The degrees of freedom, df = n − 1, determine the critical t-value at your chosen confidence level (e.g. 95% confidence → α = 0.05). For a two-tailed test the null hypothesis H₀: μ = μ₀ is rejected when |t| > t_critical. For a one-tailed test the critical threshold is smaller because all of α sits in one tail: reject when t > t_critical for H₁: μ > μ₀, or when t < −t_critical for H₁: μ < μ₀.
Conventionally, "rejecting H₀" means the data would be unlikely enough under the null that we treat the difference as real. It does not prove the alternative; it only provides evidence against H₀. A non-significant result does not mean "there is no effect": the sample may simply have been too small to detect one (low power). Conversely, a highly significant result on a huge sample may reflect a trivially small real-world effect; statistical significance and practical significance are separate questions.
Edge cases and assumptions: the test requires n ≥ 2 (with n = 1 the sample standard deviation is undefined and df = 0) and s > 0; it assumes the sample is drawn from a roughly normal population (as a rule of thumb, the Central Limit Theorem makes the test reasonably robust for n ≥ 30 on non-normal data, though heavy skew or outliers can require more); and it assumes the observations are independent. For paired data or two-sample comparisons, use a paired or two-sample t-test instead, because the formulas differ.
This calculator outputs a binary 1/0 decision rather than a p-value, so it is best treated as a yes/no significance gate for quick checks.
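The decision rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the calculator's actual implementation: the function names (t_pdf, t_cdf, t_critical, one_sample_t_test) and the "right_tailed" / "left_tailed" labels are hypothetical, and the critical value is obtained by numerically integrating the t density and bisecting, as a stand-in for whatever lookup the calculator uses.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df):
    """CDF via Simpson's rule on [0, |x|], using symmetry about zero."""
    a = abs(x)
    n = max(200, 2 * int(a * 100))  # always an even number of sub-intervals
    h = a / n
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(df, alpha, two_tailed=True):
    """Positive t whose upper-tail area is alpha (one-tailed) or alpha/2 (two-tailed)."""
    target = 1 - (alpha / 2 if two_tailed else alpha)
    lo, hi = 0.0, 200.0  # assumption: the critical value lies below 200
    for _ in range(60):  # bisection on the monotone CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def one_sample_t_test(sample_mean, pop_mean, sample_std, n,
                      test_type="two_tailed", confidence=0.95):
    """Return (decision, t): decision is 1 to reject H0, 0 to fail to reject."""
    if n < 2:
        raise ValueError("n must be at least 2 so that df = n - 1 >= 1")
    if sample_std <= 0:
        raise ValueError("sample_std must be positive")
    alpha = 1 - confidence
    t = (sample_mean - pop_mean) / (sample_std / math.sqrt(n))
    df = n - 1
    if test_type == "two_tailed":
        reject = abs(t) > t_critical(df, alpha, two_tailed=True)
    elif test_type == "right_tailed":   # hypothetical label for H1: mu > mu0
        reject = t > t_critical(df, alpha, two_tailed=False)
    elif test_type == "left_tailed":    # hypothetical label for H1: mu < mu0
        reject = t < -t_critical(df, alpha, two_tailed=False)
    else:
        raise ValueError(f"unknown test_type: {test_type!r}")
    return int(reject), t

decision, t = one_sample_t_test(10.5, 10, 1.2, 25, "two_tailed", 0.95)
print(decision, round(t, 3))
```

The numerical integration keeps the sketch free of third-party dependencies; in practice a statistics library's t-distribution quantile function would normally replace t_critical.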
How to use
Example 1: Bolt diameter quality check. A factory claims an average bolt diameter of 10 mm. You measure 25 bolts: x̄ = 10.5 mm, s = 1.2 mm. Test at 95% confidence, two-tailed. Enter Sample Mean = 10.5, Population Mean = 10, Sample Std = 1.2, Sample Size = 25, Test Type = two_tailed, Confidence Level = 0.95. Standard error = 1.2 / √25 = 0.24. t = (10.5 − 10) / 0.24 = 2.083. df = 24, t_critical (two-tailed, α = 0.05) ≈ 2.064. |2.083| > 2.064 → reject H₀. Result: 1. ✓ The factory's claimed mean of 10 mm is not consistent with the observed sample, although t clears the threshold only narrowly.
Example 2: Small effect, larger sample. The same factory claims 10.0 mm; this time you measure 30 bolts with x̄ = 10.05 mm and s = 0.4 mm. Enter Sample Mean = 10.05, Population Mean = 10, Sample Std = 0.4, Sample Size = 30, Test Type = two_tailed, Confidence Level = 0.95. Standard error = 0.4 / √30 ≈ 0.0730. t = (10.05 − 10) / 0.0730 ≈ 0.685. df = 29, t_critical (two-tailed, α = 0.05) ≈ 2.045. |0.685| < 2.045 → fail to reject H₀. Result: 0. ✓ The 0.5% deviation is well within sampling noise: there is no statistical evidence that the true mean differs from 10 mm, even though x̄ is not exactly 10.
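The arithmetic in both examples can be re-traced with a few lines of Python. This is only a checking sketch: the helper name t_statistic is hypothetical, and the critical values are the ones quoted in the examples rather than computed.

```python
import math

def t_statistic(sample_mean, pop_mean, sample_std, n):
    """t = (x̄ − μ₀) / (s / √n)."""
    standard_error = sample_std / math.sqrt(n)
    return (sample_mean - pop_mean) / standard_error

# (label, x̄, μ₀, s, n, two-tailed critical value quoted in the text)
examples = [
    ("Example 1", 10.5, 10.0, 1.2, 25, 2.064),
    ("Example 2", 10.05, 10.0, 0.4, 30, 2.045),
]

results = {}
for label, mean, mu0, s, n, t_crit in examples:
    t = t_statistic(mean, mu0, s, n)
    decision = 1 if abs(t) > t_crit else 0  # two-tailed rule: reject when |t| > t_critical
    results[label] = (t, decision)
    print(f"{label}: t = {t:.3f}, df = {n - 1}, t_critical = {t_crit}, result = {decision}")
```

Running it should reproduce the t values of about 2.083 and 0.685 and the decisions 1 and 0.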
Frequently asked questions
What is the difference between rejecting and failing to reject the null hypothesis?
Rejecting H₀ means the data would be unlikely enough under the null that we are willing to treat the difference as real; formally, the observed test statistic falls in the rejection region defined by α. Failing to reject H₀ does not mean H₀ is true or that there is no effect. It means we did not gather enough evidence to rule out chance as the explanation, possibly because the true effect is small, the sample is small, or the variability is high. In the Fisherian framing of significance testing, which echoes Karl Popper's falsificationism, we never "accept" the null; we just continue to act as if it is the best working hypothesis until contrary evidence accumulates. This is one of the most misunderstood points in introductory statistics: "p > 0.05" is not proof of no effect, just absence of strong evidence of one.
When should I use a one-tailed vs two-tailed test?
Use a two-tailed test when you care whether the sample mean differs from μ₀ in either direction (the default, conservative choice). Use a one-tailed test only when there is a strong a priori reason to test in one direction. For example, you are testing whether a new drug raises (not lowers) a specific biomarker, and a decrease would be of no interest and was excluded from the hypothesis in advance. A one-tailed test has more statistical power in the chosen direction (smaller critical value), but if you pick or switch the direction after seeing the data, you have effectively doubled your false-positive rate from α to 2α. Pre-registering the direction in a protocol is essential to keep one-tailed tests honest. Most journals and reviewers expect two-tailed tests unless the directional hypothesis is justified.
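The doubling of the false-positive rate can be illustrated with a small simulation. This is a sketch under stated assumptions: samples of n = 25 are drawn from a standard normal population with H₀ true (μ₀ = 0), the one-tailed critical value 1.711 is the standard table value for df = 24 at α = 0.05, and the function names are hypothetical.

```python
import math
import random

def t_stat(sample, mu0):
    """One-sample t-statistic computed from raw observations."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
    return (mean - mu0) / math.sqrt(var / n)

def false_positive_rates(trials=10000, n=25, crit_one_tailed=1.711, seed=0):
    """Simulate under H0 and compare two ways of using a one-tailed threshold."""
    rng = random.Random(seed)
    fixed = switched = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        t = t_stat(sample, 0.0)
        fixed += t > crit_one_tailed          # direction chosen in advance
        switched += abs(t) > crit_one_tailed  # direction chosen after seeing the data
    return fixed / trials, switched / trials

fixed_rate, switched_rate = false_positive_rates()
print(f"pre-specified direction: {fixed_rate:.3f}")
print(f"direction picked from data: {switched_rate:.3f}")
```

The first rate should land near 0.05 and the second near 0.10, up to simulation noise.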
What does "statistical significance" actually mean in practical terms?
Statistical significance means that, under the null hypothesis, an effect at least as extreme as the one observed would occur with probability less than α (typically 5%). It is a statement about how likely the data is given a chance-only model, not about the probability that the effect is real. Crucially, statistical significance is not the same as practical significance: with a sufficiently large sample, even a microscopic real-world difference becomes statistically significant. Conversely, an important real-world effect can fail to reach significance on a small sample. Always pair the p-value or test outcome with an effect-size estimate (e.g., Cohen's d, mean difference, or a confidence interval) so you can judge both "is this likely real?" and "is this big enough to matter?"
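As a rough illustration of pairing the decision with an effect size, the sketch below computes the mean difference, the one-sample Cohen's d (mean difference divided by s), and a confidence interval for the mean. The function name is hypothetical, and the critical value is passed in from a t-table (2.064 for df = 24 at 95%, two-tailed) rather than computed.

```python
import math

def effect_size_and_ci(sample_mean, pop_mean, sample_std, n, t_crit):
    """Return (mean difference, Cohen's d, (CI lower, CI upper)) for a one-sample test."""
    diff = sample_mean - pop_mean
    cohens_d = diff / sample_std                   # standardised effect size
    margin = t_crit * sample_std / math.sqrt(n)    # t_critical × standard error
    return diff, cohens_d, (sample_mean - margin, sample_mean + margin)

# Bolt example: x̄ = 10.5, μ₀ = 10, s = 1.2, n = 25
diff, d, (ci_low, ci_high) = effect_size_and_ci(10.5, 10.0, 1.2, 25, t_crit=2.064)
print(f"difference = {diff:.2f} mm, d = {d:.2f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

For the bolt example this gives d ≈ 0.42 and an interval of roughly 10.005 to 10.995 mm, which only just excludes 10 mm and so conveys how narrow the rejection is in a way the 1/0 output cannot.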
What are the most common mistakes people make in hypothesis testing?
The first is treating p = 0.05 as a magic threshold and viewing 0.049 and 0.051 as categorically different; in fact they represent essentially the same strength of evidence. The second is interpreting "fail to reject" as proof the null is true (covered above). The third is running many tests and reporting only the ones that came out significant ("p-hacking"); doing this inflates the false-positive rate dramatically and is widely regarded as a questionable research practice. The fourth is forgetting to check the assumptions: the t-test assumes independent observations from a roughly normal population (or large n). For dependent or heavily skewed data, use a paired test, non-parametric test, or bootstrap. The fifth is confusing statistical significance with practical relevance: always report effect size alongside the p-value or decision.
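The inflation from running many tests can be quantified with a short calculation. The sketch below assumes the tests are independent and that every null hypothesis is true; the function names are hypothetical, and the Bonferroni correction is shown as one common remedy that the text above does not itself prescribe.

```python
def familywise_error_rate(alpha, num_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_alpha(alpha, num_tests):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / num_tests

for m in (1, 5, 20):
    fwer = familywise_error_rate(0.05, m)
    print(f"{m:>2} tests: P(at least one false positive) = {fwer:.3f}, "
          f"Bonferroni per-test alpha = {bonferroni_alpha(0.05, m):.4f}")
```

With 20 independent tests at α = 0.05, the chance of at least one spurious "significant" result is about 64%.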
When should I not use this calculator?
Skip it for paired data (before/after on the same subject) and use a paired t-test, which differences the pairs before computing the statistic. Do not use it for two-sample comparisons (group A vs group B); the formula and degrees of freedom are different. Be wary of very small samples (n < 5), where the normality assumption cannot be checked and the result becomes fragile. Non-parametric tests (Wilcoxon signed-rank, sign test) are often suggested instead, but with so few observations they cannot reach significance at α = 0.05 two-tailed, so collecting more data is the better option. It is the wrong tool for non-numeric outcomes (categorical comparisons need chi-square; proportions need a z-test for proportions). Do not use it without first thinking about effect size and statistical power; a binary "reject / fail to reject" output from a low-powered test is essentially uninformative regardless of which way it falls. For rigorous reporting, use a calculator or software that returns the p-value and confidence interval, not just a 1/0 decision.
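To show what "differencing the pairs" means, the sketch below reduces paired measurements to a single list of differences and then applies the one-sample formula to those differences against μ₀ = 0. The before/after numbers are invented purely for illustration, and the function name is hypothetical.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """Paired t-test statistic: a one-sample t on the per-subject differences."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation (n − 1)
    t = mean_diff / (sd_diff / math.sqrt(n))   # H0: mean difference = 0
    return t, n - 1

# Hypothetical measurements on the same six subjects (illustration only)
before = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
after = [12.4, 12.0, 12.9, 12.1, 12.3, 12.6]

t, df = paired_t_statistic(before, after)
print(f"t = {t:.3f}, df = {df}")
```

The resulting t is compared against the critical value for df = number of pairs − 1, not the total number of measurements.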