Question 1

What is the difference between rejecting and failing to reject the null hypothesis?

Accepted Answer

Rejecting H₀ means the data is unlikely enough under the null that we are willing to treat the difference as real; formally, the observed test statistic falls in the rejection region defined by α. Failing to reject H₀ does not mean H₀ is true or that there is no effect. It means the evidence was not strong enough to rule out chance as the explanation, possibly because the true effect is small, the sample is small, or the variability is high. In the spirit of Karl Popper's falsificationism, we never "accept" the null; we keep treating it as the working hypothesis until contrary evidence accumulates. This is one of the most misunderstood points in introductory statistics: "p > 0.05" is not proof of no effect, only an absence of strong evidence for one.
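To make the "small effect, small sample" point concrete, here is a minimal sketch of how often a test would reject H₀ when a real effect exists. It assumes a two-sided z-test with a known standard deviation, which is a simplification (a t-test behaves similarly but needs the t distribution). The function name `two_sided_z_power` and the numbers used are illustrative assumptions, not part of the calculator.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_power(effect, sigma, n, alpha=0.05):
    """Probability that a two-sided z-test rejects H0 when the true
    mean differs from mu0 by `effect` (sigma assumed known)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, the z statistic is centred on this shift.
    shift = effect * sqrt(n) / sigma
    # Reject when z > z_crit or z < -z_crit.
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Hypothetical scenario: the same real effect (0.3 sigma), two sample sizes.
small_n_power = two_sided_z_power(effect=0.3, sigma=1.0, n=10)
large_n_power = two_sided_z_power(effect=0.3, sigma=1.0, n=200)
print(f"n = 10:  chance of rejecting H0 = {small_n_power:.2f}")
print(f"n = 200: chance of rejecting H0 = {large_n_power:.2f}")
```

With the small sample, most runs would fail to reject H₀ even though the effect is real, which is why "fail to reject" cannot be read as "no effect".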

Question 2

When should I use a one-tailed vs two-tailed test?

Accepted Answer

Use a two-tailed test when you care whether the sample mean differs from μ₀ in either direction (the default, conservative choice). Use a one-tailed test only when there is a strong a priori theoretical reason to test in one direction. For example, you are testing whether a new drug raises (not lowers) a specific biomarker, and the protocol states in advance that a decrease is not part of the hypothesis. A one-tailed test has more statistical power in the chosen direction because all of α sits in one tail, so the critical value is smaller in absolute terms. But if you ever pick or switch the direction after seeing the data, you have effectively doubled your false-positive rate from α to 2α. Pre-registering the direction in a protocol is essential to keep one-tailed tests honest. Most journals and reviewers expect two-tailed tests unless the directional hypothesis is justified.
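The relationship between the two kinds of test can be sketched numerically. The example below assumes a z statistic (standard normal reference distribution) for simplicity; the helper names and the value z = 1.80 are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_two_tailed(z):
    """P(|Z| >= |z|) under H0."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def p_one_tailed_upper(z):
    """P(Z >= z) under H0, for H1: mean > mu0 fixed in advance."""
    return 1 - STD_NORMAL.cdf(z)


alpha = 0.05
z = 1.80  # hypothetical observed statistic, in the hypothesized direction

print(f"two-tailed p = {p_two_tailed(z):.4f}")
print(f"one-tailed p = {p_one_tailed_upper(z):.4f}")

# Critical values: the one-tailed cutoff is smaller, hence the extra power.
z_crit_one = STD_NORMAL.inv_cdf(1 - alpha)
z_crit_two = STD_NORMAL.inv_cdf(1 - alpha / 2)
print(f"critical values: one-tailed {z_crit_one:.3f}, two-tailed {z_crit_two:.3f}")

# Choosing the direction after seeing the data means rejecting whenever
# |z| exceeds the one-tailed cutoff, which happens with probability 2 * alpha.
rate_if_switching = 2 * (1 - STD_NORMAL.cdf(z_crit_one))
print(f"false-positive rate if the direction is picked post hoc: {rate_if_switching:.2f}")
```

In this hypothetical case the one-tailed test rejects at α = 0.05 while the two-tailed test does not, which is exactly why the direction has to be fixed before the data are seen.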

Question 3

What does "statistical significance" actually mean in practical terms?

Accepted Answer

Statistical significance means that under the null hypothesis, an effect at least as extreme as the one observed would happen with probability less than α (typically 5%). It is a statement about how likely the data is given a chance-only model, not about how likely it is that the effect is real given the data. Crucially, statistical significance is not the same as practical significance: with a sufficiently large sample, even a microscopic (but nonzero) real-world difference becomes statistically significant. Conversely, an important real-world effect can fail to reach significance in a small sample. Always pair the p-value or test outcome with an effect-size estimate (e.g., Cohen's d, mean difference, or a confidence interval) so you can judge both "is this likely real?" and "is this big enough to matter?"
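A rough illustration of how sample size and effect size trade off: for a one-sample test with known standard deviation (a simplifying assumption), the z statistic equals Cohen's d times √n. The function name and the two scenarios below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_p_value(cohens_d, n):
    """Two-sided p-value when the observed standardized mean
    difference is `cohens_d` and the sample has `n` observations."""
    z = cohens_d * sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A negligible effect measured on a huge sample.
p_tiny_effect = one_sample_z_p_value(cohens_d=0.01, n=100_000)
# A medium-sized effect measured on a handful of observations.
p_medium_effect = one_sample_z_p_value(cohens_d=0.5, n=8)

print(f"d = 0.01, n = 100000: p = {p_tiny_effect:.4f}")
print(f"d = 0.50, n = 8:      p = {p_medium_effect:.4f}")
```

The first case is "significant" despite an effect most people would call irrelevant, and the second is not, despite an effect fifty times larger. The p-value alone cannot tell these situations apart; the effect size can.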

Question 4

What are the most common mistakes people make in hypothesis testing?

Accepted Answer

The first is treating p = 0.05 as a magic threshold and viewing 0.049 and 0.051 as categorically different. They are not; they represent essentially the same strength of evidence. The second is interpreting "fail to reject" as proof the null is true (covered above). The third is running many tests and reporting only the ones that came out significant ("p-hacking"); doing this inflates the false-positive rate dramatically and is widely regarded as a questionable research practice. The fourth is forgetting to check the assumptions: the t-test assumes independent observations from a roughly normal population (or a large n). For paired or otherwise dependent data, use a paired test; for heavily skewed data, use a non-parametric test or the bootstrap. The fifth is confusing statistical significance with practical relevance: always report an effect size alongside the p-value or decision.
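The inflation from running many tests can be sketched with a short calculation. It assumes the tests are independent and that every null hypothesis is true; the Bonferroni adjustment shown is one common correction, added here for illustration rather than taken from the calculator.

```python
def family_wise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across `num_tests`
    independent tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_alpha(alpha, num_tests):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / num_tests


alpha = 0.05
num_tests = 20  # hypothetical number of outcomes tested

uncorrected = family_wise_error_rate(alpha, num_tests)
per_test = bonferroni_alpha(alpha, num_tests)
corrected = family_wise_error_rate(per_test, num_tests)

print(f"uncorrected: {uncorrected:.2f} chance of at least one false positive")
print(f"Bonferroni per-test threshold: {per_test:.4f}")
print(f"corrected:   {corrected:.3f} chance of at least one false positive")
```

With twenty tests at α = 0.05, a "significant" result somewhere is more likely than not even when nothing is going on, which is why selectively reporting the hits is so misleading.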

Question 5

When should I not use this calculator?

Accepted Answer

Skip it for paired data (before/after on the same subject); use a paired t-test, which takes the difference within each pair before computing the statistic. Do not use it for two-sample comparisons (group A vs group B); the formula and degrees of freedom are different. Avoid it for very small samples (n < 5), where the normality assumption behind the t-test cannot be checked and a non-parametric test (Wilcoxon, sign test) is more honest. It is the wrong tool for non-numeric outcomes (categorical comparisons need chi-square; proportions need a z-test for proportions). Do not use it without first thinking about effect size and statistical power; a binary "reject / fail to reject" output from a low-powered test is essentially uninformative whichever way it falls. For rigorous reporting, use a calculator or software that returns the p-value and confidence interval, not just a binary decision.
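As a sketch of what "differencing the pairs" means, the paired t statistic is just a one-sample t statistic computed on the within-subject differences, with n − 1 degrees of freedom. The function name and the measurements below are made up for illustration; turning the statistic into a p-value needs the t distribution, which is left to a table or statistical software.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for paired measurements,
    testing H0: the mean within-pair difference is zero."""
    if len(before) != len(after):
        raise ValueError("paired data need the same number of observations")
    if len(before) < 2:
        raise ValueError("need at least two pairs")
    # Step 1: reduce each pair to a single difference.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    # Step 2: ordinary one-sample t statistic on the differences.
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical before/after readings for six subjects.
before = [120, 132, 118, 141, 127, 135]
after = [115, 128, 119, 133, 121, 130]

t_stat, dof = paired_t_statistic(before, after)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

Treating the same twelve numbers as two independent groups would ignore the subject-to-subject variation that pairing removes, which is why the two designs need different formulas.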

Hypothesis Testing Calculator
