Ordinary Least Squares: Why This Simple Math Still Rules Modern Data Science

You've probably heard the term Ordinary Least Squares tossed around if you’ve spent any time near a spreadsheet or a data science textbook. It sounds fancy. Intimidating, even. But honestly? It’s basically just the mathematical version of trying to draw a straight line through a cloud of messy dots on a napkin. It's the "O.G." of regression.

Most people think machine learning is all about complex neural networks or black-box algorithms that nobody really understands. That's a mistake. Underneath the hood of billion-dollar predictive models, you’ll often find OLS doing the heavy lifting. It’s reliable. It’s transparent. And it’s surprisingly hard to beat when you need to know why something is happening, not just what might happen next.

What is Ordinary Least Squares actually doing?

Imagine you’re trying to predict house prices. You have a bunch of data points: square footage on one axis, price on the other. Unless you’re living in a perfectly simulated universe, those points won't form a perfect line. They’re scattered. Some houses are overpriced because they have a gold-plated toilet; others are cheap because they’re haunted.

Ordinary Least Squares is the method we use to find the "best-fit" line through those points.

But what does "best" even mean? In OLS, we look at the vertical distance between each data point and the line we're drawing. These distances are called residuals. If we just added them up, the positive ones (points above the line) and negative ones (points below) would cancel each other out. That's useless. So, we square them. Squaring makes everything positive and punishes the big outliers more than the small ones. OLS minimizes the sum of these squared residuals.

$$S = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

In the equation above, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value. We want $S$ to be as small as humanly possible.
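
To make that concrete, here is a minimal NumPy sketch, using made-up square-footage and price numbers, that finds the line minimizing $S$ by solving the normal equations. This is just an illustration of the idea; in practice a library handles this for you with better numerics.

```python
import numpy as np

# Made-up square footage vs. price data (price in thousands of dollars)
sqft = np.array([850, 1200, 1500, 1800, 2300, 3100], dtype=float)
price = np.array([190, 260, 310, 360, 450, 590], dtype=float)

# Stack an intercept column next to sqft, then solve the normal equations
# (X'X) b = X'y -- the b that comes out is the one minimizing S above.
X = np.column_stack([np.ones_like(sqft), sqft])
beta = np.linalg.solve(X.T @ X, X.T @ price)

residuals = price - X @ beta
S = np.sum(residuals ** 2)  # the sum of squared residuals OLS just minimized
print(f"intercept = {beta[0]:.2f}, slope = {beta[1]:.4f}, S = {S:.2f}")
```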

The Gauss-Markov Theorem: Why we trust it

Why do we use OLS instead of just eyeballing it? It comes down to something called the Gauss-Markov Theorem. This theorem states that, under certain conditions, the OLS estimator is the BLUE: Best Linear Unbiased Estimator.

That’s a lot of jargon. Let’s break it down. "Unbiased" means that if you ran the experiment a million times, the average of your estimates would hit the true population value right on the nose. "Best" refers to "efficiency": among all linear unbiased estimators, OLS has the lowest variance. It’s the most precise tool in the shed, provided your data isn't lying to you.

However, there’s a catch. OLS is a bit of a "diva." It requires specific conditions to work correctly. If you violate them, your results become junk.

  1. Linearity: The relationship between your variables actually has to be a straight line. You can't use standard OLS to fit a giant U-shaped curve without some serious tweaking.
  2. Independence: Your observations shouldn't influence each other. If you’re measuring the height of siblings, they aren't independent.
  3. Homoscedasticity: This is a $50 word that just means the "scatter" of your points should be roughly the same across the whole line. If the dots get way more spread out as you go further right, OLS starts to wobble.
  4. No Multicollinearity: Your predictors shouldn't be "best friends." If you’re trying to predict weight using "height in inches" and "height in centimeters," OLS will have a mathematical meltdown because those two things are the same. (A quick diagnostic sketch for these last two assumptions follows this list.)
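
As a rough illustration of how you might check homoscedasticity and multicollinearity, here is a sketch using statsmodels' Breusch-Pagan test and variance inflation factors. The data and variable names are invented for the example, deliberately recreating the "inches vs. centimeters" problem.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
height_in = rng.normal(68, 3, size=200)                      # height in inches
height_cm = height_in * 2.54 + rng.normal(0, 0.1, size=200)  # nearly the same variable
weight = 50 + 1.2 * height_in + rng.normal(0, 5, size=200)

X = sm.add_constant(np.column_stack([height_in, height_cm]))
fit = sm.OLS(weight, X).fit()

# Breusch-Pagan test: a small p-value is evidence against homoscedasticity
_, bp_pvalue, _, _ = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {bp_pvalue:.3f}")

# Variance Inflation Factor: values above roughly 10 usually flag multicollinearity
for i, name in enumerate(["height_in", "height_cm"], start=1):
    print(name, round(variance_inflation_factor(X, i), 1))
```

Run that and the VIFs for the two height columns come back enormous, which is the model's way of telling you it can't separate their effects.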

Real-world messiness: When OLS fails

I once worked on a project trying to predict retail sales. We used Ordinary Least Squares because it was quick and easy to explain to the stakeholders. The model looked great on paper. Then we noticed the outliers.

One store had a massive sales spike because a celebrity randomly tweeted about it. Another store was closed for three weeks because of a burst pipe. Because OLS squares the errors, these two weird data points dragged the entire regression line toward them, ruining the predictions for the other 98 normal stores.

This is the biggest weakness of OLS. It is incredibly sensitive to outliers.

If you have "dirty" data, OLS will try its best to accommodate the weirdness, often at the expense of the truth. In those cases, the robust-statistics literature (a field pioneered by Peter Huber) suggests methods that don't square large errors, like Least Absolute Deviations (LAD) or Huber's own M-estimators, which are much more "forgiving" of those weird celebrity tweets.
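
To see the difference in practice, here is a sketch on invented sales data with one artificial spike, comparing plain OLS against LAD (fit as median regression with statsmodels' QuantReg) and a Huber M-estimator (fit with RLM). The numbers are synthetic; only the contrast between the slopes matters.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
ad_spend = rng.uniform(1, 10, size=60)
sales = 5 + 3 * ad_spend + rng.normal(0, 2, size=60)
sales[10] += 200  # one "celebrity tweet" spike

X = sm.add_constant(ad_spend)
ols_slope = sm.OLS(sales, X).fit().params[1]
lad_slope = sm.QuantReg(sales, X).fit(q=0.5).params[1]                      # Least Absolute Deviations
huber_slope = sm.RLM(sales, X, M=sm.robust.norms.HuberT()).fit().params[1]  # Huber M-estimator

print(f"OLS slope:   {ols_slope:.2f}")   # dragged toward the spike
print(f"LAD slope:   {lad_slope:.2f}")   # barely notices it
print(f"Huber slope: {huber_slope:.2f}")
```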

Misconceptions that lead to bad decisions

A huge mistake people make is assuming that a high $R^2$ (R-squared) value means the model is "good."

$R^2$ tells you how much of the variation in your data is explained by the model. You could have an $R^2$ of 0.95 and still have a totally useless model if you've ignored the underlying assumptions. For example, if you regress the number of shark attacks against ice cream sales, you'll get a beautiful OLS line and a high $R^2$. Does ice cream cause shark attacks? No. They both just happen more often in the summer.
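For reference, $R^2$ just compares your model's squared residuals to the total variation around the mean:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

Nothing in that ratio knows whether the relationship is sensible or spurious; it only measures how tightly the dots hug the line.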

This is the classic "correlation is not causation" trap. OLS finds correlation. It does not prove that one thing causes another. To get to causation, you need experimental design or more advanced econometric tricks like Instrumental Variables.

Implementation: It's easier than you think

You don't need a PhD to run an Ordinary Least Squares regression. If you're using Python, the statsmodels library is the gold standard because it gives you a detailed "summary" sheet that looks like something out of an academic paper. If you're more into machine learning, scikit-learn has LinearRegression(), which uses OLS under the hood.
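
As a quick sketch, with a tiny invented housing table, the two approaches look like this. Both fit the same line; they just report it differently.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# A tiny invented dataset with the two columns used below
df = pd.DataFrame({
    "sqft":  [850, 1200, 1500, 1800, 2300, 3100],
    "price": [190, 260, 310, 360, 450, 590],
})

# statsmodels: you add the intercept yourself and get the full statistical report
X = sm.add_constant(df[["sqft"]])
model = sm.OLS(df["price"], X).fit()
print(model.summary())  # coefficients, standard errors, p-values, R-squared

# scikit-learn: same fitted line, machine-learning style interface
lr = LinearRegression().fit(df[["sqft"]], df["price"])
print(lr.intercept_, lr.coef_[0])
```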

Even Excel can do it. Use the LINEST function or the Data Analysis ToolPak. Just please, for the love of data, look at your scatter plot before you trust the numbers. If your data looks like a banana, don't use a straight-line method.

Practical Next Steps for Your Data

If you’re ready to actually use this, don't just run the code and call it a day.

First, visualize your residuals. After you run your OLS model, plot the errors. If you see a pattern—like a wave or a funnel shape—your model is missing something. It might be a non-linear relationship or a missing variable.
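
Assuming `model` is a fitted statsmodels OLS result like the one in the earlier sketch, a basic residual plot takes only a few lines:

```python
import matplotlib.pyplot as plt

# `model` is assumed to be a fitted statsmodels OLS result (see the earlier sketch)
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("A shapeless cloud is good; funnels and waves are bad")
plt.show()
```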

Second, check for high-leverage points. These are the outliers that are "pulling" your line too hard. Use a metric called Cook's Distance to identify them. Sometimes you should delete them; other times, they are the most important part of the story.
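
Continuing with the same assumed `model` object, statsmodels exposes Cook's Distance through its influence diagnostics. The 4/n cutoff below is just one common rule of thumb, stricter than the "greater than 1" rule.

```python
import numpy as np

# `model` is assumed to be a fitted statsmodels OLS result
influence = model.get_influence()
cooks_d, _ = influence.cooks_distance  # one value per observation

# Flag anything beyond the 4/n rule of thumb
threshold = 4 / len(cooks_d)
print("Rows worth a second look:", np.where(cooks_d > threshold)[0])
```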

Third, standardize your variables. If you're regressing "Age in years" against "Income in millions," the scales are so different that it can make your coefficients hard to interpret. Scaling them helps you see which variable actually has the most "punch."
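
One simple way to do that is z-scoring each predictor (subtract its mean, divide by its standard deviation). The sketch below uses scikit-learn's StandardScaler on invented age, income, and spending columns.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

# Invented predictors on wildly different scales
age = np.array([25, 40, 33, 51, 62, 29, 45, 38], dtype=float)        # years
income = np.array([0.04, 0.12, 0.07, 0.30, 0.55, 0.05, 0.20, 0.09])  # millions of dollars
spending = np.array([1.2, 2.5, 1.9, 4.1, 6.0, 1.4, 3.2, 2.1])        # made-up outcome

# Z-score each predictor: a one-unit change now means "one standard deviation"
X_std = StandardScaler().fit_transform(np.column_stack([age, income]))

fit = sm.OLS(spending, sm.add_constant(X_std)).fit()
print(fit.params)  # standardized coefficients are directly comparable
```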

Ordinary Least Squares isn't just a classroom exercise. It’s the foundation of how we understand the relationship between variables in economics, biology, and engineering. Respect its assumptions, watch out for outliers, and it’ll serve you better than almost any other tool in your kit.


Next Steps:

  • Gather your dataset and create a scatter plot to visually check for a linear relationship.
  • Calculate the Correlation Coefficient ($r$) to see if a linear model is even appropriate.
  • Run your OLS regression and immediately generate a residual plot to validate the homoscedasticity assumption.
  • Identify any data points with a Cook's Distance greater than 1 and investigate if they are errors or genuine anomalies.