Correlation Coefficient Calculator
Compute the Pearson correlation coefficient (r) between two variables from paired summary statistics — n, Σxy, Σx, Σy — to measure how strongly they move together linearly. The number ranges from −1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).
About this calculator
Pearson's r quantifies the strength and direction of the linear relationship between two paired variables X and Y. The computational formula used here is: r = [ n·Σxy − Σx·Σy ] / √[ (n·Σx² − (Σx)²) · (n·Σy² − (Σy)²) ], where n is the number of (x, y) pairs, Σxy is the sum of the x·y products, Σx and Σy are the simple sums, and Σx² and Σy² are the sums of squared values. Algebraically this is equivalent to r = Cov(X, Y) / (σx · σy), the covariance scaled by the product of the two standard deviations, but the computational form lets you derive r directly from running totals without computing deviations from the means. r is dimensionless and bounded between −1 and +1 by the Cauchy-Schwarz inequality.
Critical interpretive note: r measures the strength of linear association only. Two variables can have r = 0 yet a strong non-linear relationship (e.g., y = x² over a symmetric range around 0 has r = 0). It also says nothing about causation: confounding variables, reverse causation, and coincidence can all produce non-zero correlations. r² (the coefficient of determination) is often more useful than r itself: it equals the fraction of the variance in Y explained by a simple linear regression on X, so r = 0.8 implies r² = 0.64, meaning 64% of the variation in Y is "explained" by X.
Edge cases: r is undefined if either variable has zero variance (all x or all y identical, so the denominator becomes 0). r is highly sensitive to outliers, so a single extreme point can swing it dramatically. r is not invariant to non-linear monotonic transformations such as taking logarithms, so always inspect the scatter plot before reporting.
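The computational formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: the function name pearson_from_sums and its parameter names are hypothetical, and it assumes all six running totals (including Σx² and Σy²) are available.

```python
import math

def pearson_from_sums(n, sum_xy, sum_x, sum_y, sum_x2, sum_y2):
    """Pearson r from running totals, using the computational formula."""
    numerator = n * sum_xy - sum_x * sum_y
    spread_x = n * sum_x2 - sum_x ** 2  # zero when every x is identical
    spread_y = n * sum_y2 - sum_y ** 2  # zero when every y is identical
    if spread_x <= 0 or spread_y <= 0:
        # Zero variance in either variable makes the denominator 0.
        raise ValueError("r is undefined when either variable has zero variance")
    return numerator / math.sqrt(spread_x * spread_y)

# Totals for the three pairs (1, 2), (2, 4), (3, 5).
r = pearson_from_sums(n=3, sum_xy=25, sum_x=6, sum_y=11, sum_x2=14, sum_y2=45)
print(round(r, 3))  # 0.982
```

Raising an error for the zero-variance case is one possible design choice; returning a "not defined" marker would work equally well.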
How to use
Example 1: Three data pairs. Data: (1, 2), (2, 4), (3, 5). Compute n = 3, Σx = 1 + 2 + 3 = 6, Σy = 2 + 4 + 5 = 11, Σxy = 1·2 + 2·4 + 3·5 = 25. Enter 3, 25, 6, 11 (n, Σxy, Σx, Σy). Σx and Σy both feed the numerator; the manual check also needs the sums of squares used in the denominator, Σx² = 1 + 4 + 9 = 14 and Σy² = 4 + 16 + 25 = 45. Manually: r = (3·25 − 6·11) / √[(3·14 − 36)·(3·45 − 121)] = (75 − 66) / √[6·14] = 9 / √84 ≈ 0.982. ✓ r ≈ 0.98 implies a near-perfect positive linear relationship, confirmed by plotting the three points, which fall almost on a straight line.
Example 2: Negative relationship. Hours of TV per day and self-reported exercise minutes per day for 5 people: TV = {1, 2, 3, 4, 5}, Exercise = {60, 50, 40, 30, 20}. n = 5, Σx = 15, Σy = 200, Σxy = 1·60 + 2·50 + 3·40 + 4·30 + 5·20 = 60 + 100 + 120 + 120 + 100 = 500. Σx² = 55, Σy² = 9000. r = (5·500 − 15·200) / √[(5·55 − 225)·(5·9000 − 40000)] = (2500 − 3000) / √[50·5000] = −500 / √250000 = −500 / 500 = −1. ✓ r = −1 reflects the perfectly linear inverse relationship; in real data you would almost never see exactly −1, only values that come close.
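The two worked examples can be reproduced from the raw pairs with a short Python sketch. The helper names summary_sums and pearson_r are hypothetical and chosen for illustration; the sketch assumes neither variable is constant.

```python
import math

def summary_sums(pairs):
    """Return (n, Σxy, Σx, Σy, Σx², Σy²) for a list of (x, y) pairs."""
    n = len(pairs)
    sum_xy = sum(x * y for x, y in pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_x2 = sum(x * x for x, _ in pairs)
    sum_y2 = sum(y * y for _, y in pairs)
    return n, sum_xy, sum_x, sum_y, sum_x2, sum_y2

def pearson_r(pairs):
    """Pearson r via the computational formula (assumes non-zero variance)."""
    n, sum_xy, sum_x, sum_y, sum_x2, sum_y2 = summary_sums(pairs)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

example_1 = [(1, 2), (2, 4), (3, 5)]
example_2 = list(zip([1, 2, 3, 4, 5], [60, 50, 40, 30, 20]))

print(summary_sums(example_1))         # (3, 25, 6, 11, 14, 45)
print(round(pearson_r(example_1), 3))  # 0.982
print(summary_sums(example_2))         # (5, 500, 15, 200, 55, 9000)
print(pearson_r(example_2))            # -1.0
```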
Frequently asked questions
How strong does a correlation need to be to "matter"?
There is no universal threshold — context dominates. Rough conventions: |r| > 0.7 is often called strong, 0.4–0.7 moderate, and below 0.4 weak. But these are field-dependent. In physics or engineering, where measurements are precise and the underlying relationship is deterministic, you would expect |r| above 0.95 for a "real" relationship; anything below should make you suspicious. In social sciences and behavioural research, r values of 0.3–0.5 are routinely treated as meaningful because the underlying phenomena are noisy. r² is often the more practical statistic: r = 0.5 means r² = 0.25, so X explains only a quarter of the variation in Y — most of what is happening to Y is driven by something else. Always report sample size and ideally a confidence interval for r, because small samples can produce dramatic-looking correlations purely by chance.
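To see why sample size matters, here is a sketch of an approximate 95% confidence interval for r using the Fisher z-transformation. This is one common textbook approach rather than anything this calculator is stated to do; it assumes roughly bivariate-normal data and n > 3, and the function name fisher_ci is hypothetical.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for Pearson r (Fisher z method).

    z_crit = 1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 3:
        raise ValueError("the approximation needs n > 3")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                 # Fisher transformation of r
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# The same r = 0.5 at several sample sizes.
intervals = {n: fisher_ci(0.5, n) for n in (10, 30, 100, 1000)}
for n, (low, high) in intervals.items():
    print(f"n = {n:4d}: {low:+.2f} to {high:+.2f}")
```

Under these assumptions, r = 0.5 from only 10 pairs gives an interval that still includes zero, while the same r from 100 pairs does not.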
Does correlation imply causation?
No, and this is the most repeated warning in statistics for good reason. A non-zero r tells you only that two variables move together linearly in your sample; it does not say one causes the other. There are four common alternative explanations to keep in mind. (1) Reverse causation: maybe Y causes X, not the other way around. (2) Confounding: a third variable Z drives both X and Y, producing a spurious correlation between them. (3) Selection bias: the sample over-represents pairs where X and Y happen to align. (4) Coincidence: with enough variables and small enough samples, some correlations are random noise. Establishing causation requires controlled experiments (randomised assignment), natural experiments, instrumental variables, or rigorous causal-inference techniques (DAGs, propensity scores). Correlation is a useful first clue, never a conclusion.
When should I use Pearson r vs Spearman ρ vs Kendall τ?
Use Pearson r when both variables are continuous, roughly normally distributed (this matters most for significance tests and confidence intervals built on r), and the relationship is genuinely linear. Use Spearman's rank correlation (ρ) when the relationship is monotonic but not linear (Y consistently increases with X but not in a straight line), or when your data contains influential outliers: Spearman operates on ranks rather than raw values and is therefore far less sensitive to extreme values. Use Kendall's tau when sample size is small (n < 20), when there are many tied ranks, or when you want a more conservative measure of association (Kendall typically gives values smaller in magnitude than Spearman on the same data). All three measure association; only Pearson assumes linearity. If a scatter plot shows a clear monotonic curve, Pearson will understate the true strength of the relationship, so switch to Spearman or fit a non-linear model.
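The difference between Pearson and Spearman on a curved but monotonic relationship can be illustrated with a small sketch. Spearman's ρ is computed here as Pearson's r applied to average ranks; the helper names and the exponential test data are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rho as Pearson r on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

xs = list(range(1, 11))
ys = [math.exp(x) for x in xs]  # strictly increasing but strongly curved

r_pearson = pearson_r(xs, ys)
rho = spearman_rho(xs, ys)
print(round(r_pearson, 2), round(rho, 2))  # Pearson well below 1, Spearman 1.0
```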
What are the most common mistakes people make computing or interpreting correlation?
The first is reporting r without ever looking at the scatter plot. Anscombe's quartet famously shows four data sets with nearly identical r ≈ 0.82 but completely different shapes (one roughly linear, one curved, one a tight line with a single large outlier, and one where every point but one shares the same x value). The second is conflating r with r²; r = 0.4 sounds substantial but r² = 0.16 means X explains only 16% of Y's variance, which is often unimpressive. The third is treating r as causal evidence, as covered above. The fourth is failing to spot outlier-driven correlations: a single extreme point can push r from 0.0 to 0.6 with no real relationship in the bulk of the data, or hide a strong relationship in the bulk. The fifth is computing r on truncated data (restricting range on X) and concluding "no relationship" because r drops sharply; range restriction typically attenuates r even when the underlying relationship is strong.
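The point about Anscombe's quartet can be checked directly. The sketch below uses the published quartet values (Anscombe, 1973) and a hypothetical helper pearson_r; it shows that the four data sets are indistinguishable by r alone.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Sets I to III share the same x values; set IV has its own.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x_four = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I (roughly linear)": (x_shared,
        [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II (curved)": (x_shared,
        [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III (line plus one outlier)": (x_shared,
        [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV (one point sets the slope)": (x_four,
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in results.items():
    print(f"{name}: r = {r:.2f}")  # every set prints r = 0.82
```

Only a scatter plot separates the four cases; the coefficient does not.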
When should I not use this calculator?
Skip it for non-linear relationships — Pearson r will dramatically understate the strength of curved or quadratic associations. Do not use it on ordinal or rank data; use Spearman ρ or Kendall τ instead. It is the wrong tool when one of your variables is categorical (use point-biserial, phi, or Cramér's V depending on the situation). Avoid it for time-series data without first checking for autocorrelation and trend — both can inflate r without any genuine cross-variable relationship. Do not use it for very small samples (n < 5); confidence intervals on r are extremely wide there and small-sample r values are essentially noise. Finally, never report a single correlation coefficient as evidence of a relationship without also showing the scatter plot, sample size, and ideally a confidence interval or p-value.
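The time-series caveat can be illustrated with synthetic data. In the sketch below, two series share nothing except an upward drift; correlating the raw levels gives a large r, while correlating first differences (one common way to remove a trend, used here as an assumption rather than a prescribed method) gives a value near zero. The series, the seed, and the helper names are all illustrative.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def first_differences(series):
    """Change from each observation to the next."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

random.seed(0)  # fixed seed so the sketch is repeatable
n_points = 200

# Independent noise around two different linear trends.
series_a = [1.0 * t + random.gauss(0, 5) for t in range(n_points)]
series_b = [0.5 * t + random.gauss(0, 5) for t in range(n_points)]

r_levels = pearson_r(series_a, series_b)
r_diffs = pearson_r(first_differences(series_a), first_differences(series_b))
print(round(r_levels, 2), round(r_diffs, 2))
```

The large r on the levels reflects only the shared trend; once the trend is differenced away, little association remains.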