Linear Regression Calculator
Fit a least-squares regression line y = a + b·x to paired x and y values and read off the slope, intercept, or coefficient of determination (r²). Used in forecasting, trend analysis, and any setting where you want to quantify how much y changes with x.
About this calculator
Linear regression fits the line y = a + b·x that minimises the sum of squared vertical residuals (the ordinary least-squares, or OLS, criterion) for paired data {(x₁, y₁), ..., (xₙ, yₙ)}. The slope b and intercept a have closed-form solutions: b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²) and a = (Σy − b·Σx) / n = ȳ − b·x̄. The slope tells you how much y changes per one-unit increase in x; the intercept is the predicted y when x = 0 (often outside the data range and not always physically meaningful).
r², the coefficient of determination, is the fraction of variance in y explained by the linear fit: r² = 1 − SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squares, Σ(y − ȳ)². For a one-predictor regression like this one, r² is exactly the square of Pearson's correlation r. It ranges from 0 (the line explains none of the variance) to 1 (the line fits perfectly).
Variables: enter your x values and y values as comma-separated lists of equal length, and pick which summary you want.
Edge cases: n must be ≥ 2 to fit a line. If all x values are identical (zero variance in x), the slope is undefined (vertical line, infinite slope) and the calculator returns NaN. If all y values are identical (zero variance in y), the slope is 0, the intercept equals the constant y, and r² is undefined because SST = 0 (it may be shown as 0).
Critical assumptions of OLS for inference (not just fitting): linearity of the true relationship, independence of observations, constant residual variance (homoscedasticity), and approximately normal residuals. Violating independence, homoscedasticity, or normality does not bias the slope estimate but does invalidate p-values and confidence intervals; violating linearity means the fitted line itself misdescribes the relationship. Outliers can drag the line dramatically (least squares is not robust); always plot the data and residuals before trusting the fit.
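The closed-form sums above translate almost directly into code. The following is a minimal Python sketch, not the calculator's actual implementation; the function name linear_regression and the choice to return NaN for r² when all y values are identical are assumptions made for illustration.

```python
import math

def linear_regression(xs, ys):
    """Return (slope, intercept, r_squared) using the closed-form OLS sums.

    Hypothetical helper for illustration; NaN handling is an assumption.
    """
    if len(xs) != len(ys):
        raise ValueError("x and y must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points to fit a line")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)

    denom_x = n * sum_xx - sum_x ** 2  # zero when all x values are identical
    denom_y = n * sum_yy - sum_y ** 2  # zero when all y values are identical
    if denom_x == 0:
        # Vertical line: slope, intercept and r² are all undefined.
        return math.nan, math.nan, math.nan

    numerator = n * sum_xy - sum_x * sum_y
    slope = numerator / denom_x
    intercept = (sum_y - slope * sum_x) / n
    # Assumption: report NaN (not 0) when y has zero variance.
    r_squared = math.nan if denom_y == 0 else numerator ** 2 / (denom_x * denom_y)
    return slope, intercept, r_squared

print(linear_regression([1, 2, 3, 4, 5], [3, 5, 7, 9, 11]))
```

Working from raw sums mirrors the formulas in the text; for very large values, subtracting the means first is numerically safer.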
How to use
Example 1: Perfect linear data, output = slope. Enter X = 1, 2, 3, 4, 5 and Y = 3, 5, 7, 9, 11. Choose Output = slope. Compute: n = 5, Σx = 15, Σy = 35, Σxy = 1·3 + 2·5 + 3·7 + 4·9 + 5·11 = 3 + 10 + 21 + 36 + 55 = 125, Σx² = 1 + 4 + 9 + 16 + 25 = 55. Slope b = (5·125 − 15·35) / (5·55 − 15²) = (625 − 525) / (275 − 225) = 100 / 50 = 2. ✓ Intercept a = ȳ − b·x̄ = 7 − 2·3 = 1, so the fitted line is y = 1 + 2x, exactly the line that generated the data. Switch Output to r²: the data is perfectly linear, so r² = 1.00.
Example 2: Noisy data, output = r². Hours studied vs exam scores for 5 students: X = 2, 3, 5, 6, 8 and Y = 65, 70, 78, 80, 92. Compute: n = 5, Σx = 24, Σy = 385, Σxy = 2·65 + 3·70 + 5·78 + 6·80 + 8·92 = 130 + 210 + 390 + 480 + 736 = 1946, Σx² = 4 + 9 + 25 + 36 + 64 = 138, Σy² = 4225 + 4900 + 6084 + 6400 + 8464 = 30073. Slope b = (5·1946 − 24·385) / (5·138 − 24²) = (9730 − 9240) / (690 − 576) = 490 / 114 ≈ 4.30 (4.298 to three decimals). Intercept a = (385 − 4.298·24)/5 = (385 − 103.16)/5 ≈ 56.37. So predicted score ≈ 56.37 + 4.30·hours. r² = (5·1946 − 24·385)² / [(5·138 − 24²)·(5·30073 − 385²)] = 490² / (114·2140) = 240100 / 243960 ≈ 0.984. ✓ About 98.4% of variance in scores is explained by hours studied, a strong linear relationship.
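The hand arithmetic in Example 2 can be checked with a short script. This is a sketch for verification only; the variable names are illustrative and it computes r² two ways (from residuals and from the raw sums) to confirm that both routes agree.

```python
# Example 2 data: hours studied and exam scores.
xs = [2, 3, 5, 6, 8]
ys = [65, 70, 78, 80, 92]
n = len(xs)

# Raw sums used by the closed-form formulas.
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)
sum_yy = sum(y * y for y in ys)

slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n

# Route 1: r² = 1 - SSE/SST from the residuals of the fitted line.
mean_y = sum_y / n
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
sst = sum((y - mean_y) ** 2 for y in ys)
r2_from_residuals = 1 - sse / sst

# Route 2: r² as the squared correlation, written with the raw sums.
r2_from_sums = (n * sum_xy - sum_x * sum_y) ** 2 / (
    (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
)

print(f"sums: x={sum_x} y={sum_y} xy={sum_xy} xx={sum_xx} yy={sum_yy}")
print(f"slope={slope:.3f} intercept={intercept:.2f}")
print(f"r2 (residuals)={r2_from_residuals:.3f} r2 (sums)={r2_from_sums:.3f}")
```

A valid r² can never exceed 1, so a ratio above 1 at any step signals an arithmetic slip in one of the sums.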
Frequently asked questions
What do slope, intercept, and r² actually tell me?
The slope b is the rate of change: how much y changes per one-unit increase in x. A slope of 2 means y goes up by 2 for every one-unit increase in x. The intercept a is the predicted y when x = 0; it is where the line crosses the y-axis. Often x = 0 is outside the range of your data and the intercept has no real-world meaning by itself; that is fine, it still anchors the line. r² is the fraction of variation in y that the line "explains" relative to the total variation; r² = 0.80 means 80% of the variability in y is captured by the linear model on x, and 20% is unexplained by the model. r² ranges from 0 (no linear association) to 1 (perfect fit). A high r² does not mean the model is correct; it only means the line fits these data points well. A low r² does not always mean the variables are unrelated, only that a straight line explains little of the variation: the relationship may be non-linear, or simply noisy.
How is linear regression different from correlation?
Correlation (Pearson r) measures the strength and direction of the linear relationship between X and Y on a scale from −1 to +1 — it is symmetric in X and Y and dimensionless. Linear regression fits a directional model y = a + b·x where x is the predictor and y is the response; switching their roles produces a different line (the y-on-x line is not the same as the x-on-y line). Slope and r are related but not the same: b = r · (sy / sx), so they share a sign but have different scales. r² = correlation² gives the variance-explained interpretation that regression cares about. Use correlation when you simply want to quantify association; use regression when you want to predict y from x, quantify how much y changes per unit x, or build a model for further analysis.
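The relationships described above can be verified numerically. The sketch below uses the study-hours data from the worked example; the helper names pearson_r and ols_slope are illustrative. It checks that b = r · (sy / sx), and that the x-on-y slope is not simply the reciprocal of the y-on-x slope (the two slopes multiply to r², not to 1).

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation from centred sums of squares and products."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ols_slope(xs, ys):
    """OLS slope for regressing ys on xs."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

xs = [2, 3, 5, 6, 8]
ys = [65, 70, 78, 80, 92]

r = pearson_r(xs, ys)
b_y_on_x = ols_slope(xs, ys)  # predict y from x
b_x_on_y = ols_slope(ys, xs)  # predict x from y (roles swapped)

# Slope rebuilt from the correlation and the two sample standard deviations.
slope_from_r = r * statistics.stdev(ys) / statistics.stdev(xs)

print(f"r = {r:.4f}, r squared = {r * r:.4f}")
print(f"y-on-x slope = {b_y_on_x:.4f}, r*sy/sx = {slope_from_r:.4f}")
print(f"x-on-y slope = {b_x_on_y:.4f}, 1/(y-on-x slope) = {1 / b_y_on_x:.4f}")
print(f"product of the two slopes = {b_y_on_x * b_x_on_y:.4f}")
```

The two regression lines coincide only when |r| = 1; otherwise swapping x and y gives a genuinely different line.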
What assumptions does OLS regression rely on, and what happens if they are violated?
OLS regression makes four classic assumptions for inference (the line can always be fitted, but p-values and confidence intervals depend on them): (1) Linearity: the true relationship between X and Y is linear; check with a scatter plot and a residual-vs-fitted plot. (2) Independence: observations are not correlated with each other; time-series data routinely violates this. (3) Homoscedasticity: residual variance is constant across X; "fan-shaped" residual plots indicate violation. (4) Normality of residuals: needed for inference to work in small samples; check with a Q-Q plot. Outliers and influential points are a separate concern: a single high-leverage point can drag the line dramatically. When assumptions fail, you have options: transform X or Y (log, square root), use robust regression (Huber, LAD), use weighted or generalised least squares for heteroscedasticity, or use time-series models for autocorrelation. When independence, homoscedasticity, or normality fails, the slope and intercept estimates remain unbiased and what fails is the uncertainty around them; when linearity fails, the straight line itself is the wrong summary of the data.
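A residual check does not require plotting software. The sketch below fits a line to deliberately curved data (y = x², chosen here purely for illustration) and prints the residuals next to the fitted values; the helper name fit_line is an assumption. The residuals form a U-shaped pattern (positive, negative, positive) even though r² is high, which is the signature of a linearity violation.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the OLS line, using centred sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Curved data: a straight line is the wrong model here.
xs = [1, 2, 3, 4, 5]
ys = [x * x for x in xs]

slope, intercept = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

mean_y = sum(ys) / len(ys)
sse = sum(e * e for e in residuals)
sst = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - sse / sst

print(f"line: y = {intercept:.1f} + {slope:.1f}x, r squared = {r_squared:.3f}")
print("fitted  residual")
for f, e in zip(fitted, residuals):
    print(f"{f:6.1f}  {e:+.1f}")
```

For well-behaved data the residuals should scatter around zero with no visible pattern; a systematic run of signs like the one here suggests transforming a variable or fitting a curve.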
What are the most common mistakes people make with linear regression?
The first is extrapolating beyond the range of the data: the model only describes behaviour where you have observations, and predictions far outside that range are speculation. The second is treating r² as a goodness-of-fit verdict; r² near 1 does not mean the line is the right model (it just means it fits these data well), and r² near 0 can hide a strong non-linear relationship. The third is ignoring outliers and influential points; OLS is not robust, and a single bad point can flip the slope from positive to negative. The fourth is confusing correlation with causation: "miles driven" and "engine wear" both rise with vehicle age, so a strong regression of wear on miles does not by itself show that driving more causes the wear; age or engine quality may be the real driver. The fifth is fitting a line to data that is obviously curved, which produces a misleading slope and an uninformative r²; visualise first, then fit. Finally, do not report a slope without its standard error; a slope of 2 ± 0.1 is very different from a slope of 2 ± 5.
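The standard error of the slope follows from the residuals using the textbook formula SE(b) = s / √Σ(x − x̄)², where s = √(SSE / (n − 2)) is the residual standard error. The sketch below is an illustration under the usual OLS assumptions, with a hypothetical function name; it is not a feature of this calculator.

```python
import math

def slope_with_standard_error(xs, ys):
    """Return (slope, slope_se, residual_se) for a simple OLS fit."""
    n = len(xs)
    if n < 3:
        # The residual variance divides by n - 2 degrees of freedom.
        raise ValueError("need at least three points for a standard error")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

    residual_se = math.sqrt(sse / (n - 2))   # s
    slope_se = residual_se / math.sqrt(sxx)  # SE(b)
    return slope, slope_se, residual_se

# Study-hours data from the worked example.
slope, slope_se, residual_se = slope_with_standard_error(
    [2, 3, 5, 6, 8], [65, 70, 78, 80, 92]
)
print(f"slope = {slope:.2f} ± {slope_se:.2f} (residual SE = {residual_se:.2f})")
```

Turning the standard error into a confidence interval additionally needs a t critical value with n − 2 degrees of freedom, which this sketch leaves out.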
When should I not use this calculator?
Skip it when your data is clearly non-linear (look at the scatter plot first); use polynomial regression, log-transforms, or non-linear fits instead. Do not use it for multiple regression (more than one predictor): this calculator handles simple linear regression only, so for models with several predictors use a statistics package. It is the wrong tool for time-series data unless you first check and correct for autocorrelation; ARIMA, exponential smoothing, or other time-series models are more appropriate. Avoid it for datasets with extreme outliers unless you have first investigated whether to remove or down-weight them; robust regression (Theil-Sen, RANSAC) is better in those cases. Do not use it when you need confidence intervals, prediction intervals, or hypothesis tests on the slope; those require additional formulas and software (or at minimum the residual standard error). Finally, do not interpret slope and intercept causally without the right study design; regression describes association, not causation.
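To illustrate why a robust method helps with outliers, here is a minimal sketch of the Theil-Sen estimator mentioned above: the slope is the median of the slopes between all pairs of points, and the intercept is taken as the median of y − slope·x (one common convention). The function names and the example data are assumptions for illustration; for real work a library implementation is preferable.

```python
from itertools import combinations
from statistics import median

def theil_sen(xs, ys):
    """Return (slope, intercept) using the median of pairwise slopes."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # skip pairs that would give a vertical line
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept

def ols_slope(xs, ys):
    """Ordinary least-squares slope, for comparison."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)

# Points on y = 1 + 2x, except the last y value, which is a gross outlier.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 100]

robust_slope, robust_intercept = theil_sen(xs, ys)
print(f"Theil-Sen: y = {robust_intercept:.1f} + {robust_slope:.1f}x")
print(f"OLS slope with the outlier: {ols_slope(xs, ys):.1f}")
```

The pairwise approach costs O(n²) slopes, which is fine for calculator-sized inputs but slow for large datasets.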