The Intuition Behind Regression Analysis
Regression analysis is a key topic in economics, data science, and computer science, but its roots go further back than these disciplines. Sir Francis Galton is credited with coining the term after studying the heights of parents and their children and noticing that children’s heights tended to “regress” back towards the mean. That name became associated with this type of analysis and has stuck, even though a more accurate name might have been something like “variance analysis”.
We’ve written previously about what a regression analysis is. If you’re completely new to the idea of a regression, that’s a good place to start. In this article, we’ll look more closely at this tool in the data scientist’s, economist’s and statistician’s toolbox, to understand exactly what’s happening when we run one.
It’s all about the variance
The basic question behind many regression analyses is: how much does x affect y? Or, stated another way: when x changes, how much does y change as a result? This is a great way to think about what a regression is attempting to do. In essence, a regression measures how much of y’s variation results from x varying, after accounting for the variation in any other regressors (the Frisch-Waugh Theorem shows formally how a regression isolates the effect of a single variable on y, but that’s beyond the scope of this article).
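To make that idea concrete, here is a minimal sketch in Python using simulated, entirely made-up data: the coefficient on x1 from a multiple regression matches the coefficient you get by first removing x2’s influence from both y and x1 and then regressing residual on residual, which is exactly the “isolating one variable” logic described above.

```python
import numpy as np

# Simulated data (purely hypothetical): two correlated regressors and an outcome
rng = np.random.default_rng(0)
n = 1_000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Multiple regression of y on a constant, x1, and x2
X = np.column_stack([np.ones(n), x1, x2])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Partial the constant and x2 out of both y and x1,
# then regress the y-residuals on the x1-residuals
Z = np.column_stack([np.ones(n), x2])
residualize = lambda v: v - Z @ np.linalg.lstsq(Z, v, rcond=None)[0]
beta_fwl, *_ = np.linalg.lstsq(residualize(x1)[:, None], residualize(y), rcond=None)

print(beta_full[1], beta_fwl[0])  # the two estimates of x1's coefficient coincide
```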
The regression attributes whatever variation in y that x explains to a coefficient (usually denoted by a lower case beta symbol: 𝛽), and whatever variation x can’t explain to a term we call the “error”. Finally, statistics are computed that tell us how unlikely a result this strong would be if x truly had no effect on y; if it’s very unlikely, we call it a “statistically significant” result.
An example: Let’s consider high-end TV sales as our y variable, and “income of consumers” as the x variable. How should consumer income affect the sales of high-end TVs? Clearly, the wealthier people are, the more likely they are to spend money on a nice TV — or even multiple TVs. Therefore we’d expect a positive relationship between x and y.
After gathering TV sales data and running our regression, if more TVs seem to be sold when consumers in the area have higher incomes (or after their incomes increase), we have some good evidence to support our hypothesis. The 𝛽 coefficient on the x variable “consumer income” might be statistically significant, and its value indicates by exactly how much an increase in income is expected to affect TV sales.
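As a minimal sketch of what running such a regression looks like in practice (the income and sales numbers below are simulated and purely hypothetical), the statsmodels library reports both the 𝛽 coefficient and its p-value:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: consumer income (in thousands) and high-end TV sales per store
rng = np.random.default_rng(42)
income = rng.uniform(30, 120, size=200)
sales = 5 + 0.08 * income + rng.normal(scale=2, size=200)  # assumed "true" relationship

X = sm.add_constant(income)          # adds the intercept column
model = sm.OLS(sales, X).fit()       # ordinary least squares

print(model.params)    # intercept and the beta coefficient on income
print(model.pvalues)   # p-values: small values suggest statistical significance
```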
This might sound straightforward, but much is happening “under the hood”. Before diving into some more complex scenarios, let’s draw a picture so we can visualize what’s happening when a regression is run.
The geometry of regression analysis
You’ve seen plenty of graphs in math classes before, with y on the vertical axis and x on the horizontal axis. Most regressions cannot easily be depicted this way, for one simple reason: most regressions include far more than two variables!
When we collect data to run a regression, we end up having a series of y values (a vector), and several series of x values (a matrix). To compare the variation in y and x, we “project” y onto the “space” of x. Figure 1 below depicts this.

Figure 1: The Geometry of Regression Analysis
In Figure 1, we can see the vector of y data we’ve collected, which appears as the hypotenuse of the triangle. Let’s break this image down to understand all of its pieces, and how they’re represented by the tables of numbers a computer will give us.
Suppose that the y vector is a paper airplane you’re launching, and the x “space” contains the forces of wind, gravity, and your own muscle power. You throw your paper airplane and record its flight. Your muscles launch the paper airplane forward, gravity pulls it downward, and the wind blows it around wildly.
Similar to this scenario, the regression of y on x measures how much the x variables “pull” y in the direction it’s going, as the y vector “flies” over the space of x. If y is TV sales and x is income, for example, the regression measures how much the force of “income gravity” changes the flight path of the “TV sales airplane”.
Consider the space of x in Figure 1, which appears as a 2-dimensional plane. If x has absolutely zero power to affect y, then the y vector will be perfectly vertical. Why? Because as the y vector takes flight, the x space has no power to change its trajectory. This would be as if you threw a paper airplane straight up into the air and both gravity and the wind turned off at the same time. But if the x space does influence y, then as x varies, y’s trajectory changes too. So, in Figure 1, we can see that x has some power over y.
How much power? One way to measure it is by looking at the “base” of the triangle. This measures the exact horizontal distance that y has traveled as it’s moved through the x space. This quantity is called the projection of y onto x; it is the portion of y that x explains. This is also what gives us the model’s predicted values of y.
Another way to measure this explanatory power is by looking at the vertical distance between y and the space of x in Figure 1. This distance represents the portion of y’s trajectory that x does not explain at all. This is the “error” term! The best model is typically the one that minimizes this distance, or in other words, minimizes the sum of squared errors (we square them so that negative and positive errors don’t cancel each other out).
This picture hides a bit of math that’s going on under the hood. When you run a regression, the computer predicts an estimate of y for every set of x data points you have (the base of the triangle), then computes the resulting error (the vertical line) by subtracting this estimate from the actual y data point (the hypotenuse). In other words, together the error term and the projection constitute y. In order to figure out where y’s position is, we must know both the projection and the error — the explained and unexplained portions of y’s variance.
This is easy to see from the formal regression equation Y = X𝛽 + ε. The first term, X𝛽, is the data times the coefficient our regression calculated for it; it creates the prediction of y based solely on the data, which is the base of the triangle. The error term is what’s left over, denoted by ε, which is again the vertical line in the triangle.
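Here is a minimal sketch of that arithmetic with simulated data: the projection (the fitted values) and the error (the residuals) are computed explicitly, they add back up to y, and the error is orthogonal to the x space, just as Figure 1 suggests.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # constant plus two regressors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # least-squares coefficients
projection = X @ beta_hat                     # the "base of the triangle": fitted values
error = y - projection                        # the vertical side: residuals

print(np.allclose(projection + error, y))            # True: the two pieces reconstruct y
print(np.allclose(X.T @ error, 0, atol=1e-6))         # True: the error is orthogonal to the x space
```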
Finally, we can compare the lengths of the projection and error vectors to compute a very important statistic known as R². This is a number that measures how much of the variation in y is explained by your regression. It’s one of the first things that students of statistics learn about regression analysis, and while it has problems (data scientists now typically use adjusted versions of R² that account for the number of regressors and guard against over-fitting), it’s a great window into what your regression’s version of Figure 1 probably looks like. High values of R² indicate a good fit, which suggests that the model is useful for inference or prediction (one of which is usually the end goal of the analysis).
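A minimal sketch of that calculation, continuing with the simulated variables from the previous snippet (the function name r_squared is just an illustrative choice): R² compares the unexplained squared error to the total variation in y, and the adjusted version penalizes models that pile on regressors.

```python
import numpy as np

def r_squared(y, fitted, n_regressors=None):
    """Plain R^2, plus the adjusted version if the number of regressors is given."""
    ss_res = np.sum((y - fitted) ** 2)        # unexplained: squared length of the error
    ss_tot = np.sum((y - y.mean()) ** 2)      # total variation in y
    r2 = 1 - ss_res / ss_tot
    if n_regressors is None:
        return r2
    n = len(y)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_regressors - 1)
    return r2, adj_r2
```

For the earlier example, calling r_squared(y, projection, n_regressors=2) would report how much of y’s variation the model explains, alongside the adjusted figure.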
In search of unknowable truths
Figure 1 might make a regression seem very powerful, but unfortunately, a regression does not yield a certain answer. In other words, we can never learn the precise answer to our research question with complete certainty. This might be a surprising statement, but it’s true!
This is a very important point to realize about regression analysis. We can admittedly get very, very close, but the actual real-world “true” answer is usually impossible to know for certain. In most of the sciences, including economics and data science, these “true” values are called “population parameters”, and our regression analysis gives us estimates of them.
This is due to myriad real-world factors: we almost always observe a sample rather than the entire population, fully comprehensive data is difficult to gather, computing power is limited, and the measurements we take are imprecise. In science, it simply is not possible to have 100.00% certainty about things the majority of the time.
This is one of the reasons why having a solid theoretical reasoning behind a regression model is important. The statistical evidence can only point us in the right direction. After the dust settles, it’s up to the data scientist, economist or biologist, etc. to correctly interpret the results and their real-world implication(s).
This brings up another fascinating point, that an astute reader might have already wondered by now. What if we’re already confident that we know something about y before running our regression? Consider, for example, a macroeconomic model that attempts to explain inflation. This is something that’s been studied often before, and we’ve generated a lot of knowledge about it already. Can we account for that knowledge when running a new regression using new data?
The answer is yes! But, it can seem a little complicated at first…
Using our existing knowledge in scientific ways
Bayesian regression methods include techniques that allow us to incorporate existing knowledge — or even just a good guess — into a regression analysis. This is done by using probability distributions.
Before progressing further, it’s helpful to define what non-Bayesian methods are. We refer to them as “frequentist” methods. Introductory statistics courses (and this whole article so far) discuss frequentist methods and techniques. In essence, these are the “default” setting for regression analysis. Frequentists have a research question, and start to answer it by gathering data.
In contrast, the Bayesian statistician doesn’t begin with the data; they begin with a probability distribution that comes from their existing knowledge. What do we know about TV sales before collecting data on income? It’s very likely that more TVs are sold in wealthier areas, and when the incomes of consumers increase. This information isn’t very controversial, but if we ignore it, we’re essentially saying that we know nothing about TV sales at all before running the regression. This can be inefficient. Why throw away existing information when trying to answer a question?
So, the Bayesian statistician starts from a probability distribution that they believe is reasonable, which we call a “prior” distribution. Then, they collect data and set up a regression much as the frequentist did before, and run it in mostly the same way. The frequentist relies entirely on what the data say; the Bayesian combines the prior with the evidence in the data to form a “posterior” distribution, and uses that posterior to decide whether there’s evidence for statistical significance or not.
This Bayesian practice might seem arbitrary and unscientific, even blasphemous! How can we be allowed to insert our own beliefs into a scientific process? Fortunately, there are a few things keeping Bayesian methods “in check”.
First, many frequentist statistics and test methods are secretly Bayesian methods in disguise. In many cases it’s possible to show that by using specific priors, the frequentist and Bayesian methods are the same. We can think of it in this simple way: frequentist methods simply use the “default” priors, while Bayesian methods are akin to an expert going and changing the default settings for a specific purpose.
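A minimal sketch of that point, under the simplifying assumptions of a zero-mean Gaussian prior on the coefficients and a known noise variance: the Bayesian posterior mean is a “shrunk” version of the OLS estimate (the same algebra as ridge regression), and with a nearly flat prior it collapses back to the frequentist answer.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.8]) + rng.normal(size=n)

sigma2 = 1.0   # assumed (known) noise variance
tau2 = 0.25    # prior variance: beta ~ N(0, tau2 * I), an assumption about our prior knowledge

lam = sigma2 / tau2
posterior_mean = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
ols = np.linalg.solve(X.T @ X, X.T @ y)
diffuse_prior = np.linalg.solve(X.T @ X + 1e-8 * np.eye(2), X.T @ y)

print(ols)             # frequentist estimate
print(posterior_mean)  # Bayesian posterior mean, shrunk toward the prior mean of zero
print(diffuse_prior)   # with an almost-flat prior, the Bayesian answer matches OLS
```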
Second, if a data scientist, economist etc. uses a bad prior without sufficient justification, peer reviewers will point it out, and reputable journals might refuse to publish the work. Peer review is a cornerstone of science, after all, and statistical interpretation is not immune to its effects.
It’s important to note too that frequentist methods are far from perfect. Consider that the conventional threshold for a statistically significant result is a p-value of 5% (in words, “if there were truly no relationship between x and y, a result at least this strong would occur only 5% of the time by random chance”)*. This means that there’s a real chance we get the wrong result even though we did nothing wrong during the regression analysis.
In some cases, this can even mean that a frequentist regression leads us to conclude something crazy that a Bayesian regression wouldn’t — like that the sun has exploded, when common sense would tell us that it clearly has not. For a beautiful and hilarious illustration of this exact scenario, see the xkcd comic “Frequentists vs. Bayesians”.
*This is a reason why the standards in the natural sciences, or in medical science, are often more strict than in economics or data science. When human lives are at stake, a 5% error rate might be considered unacceptably high.
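One way to see this error rate in action is a small simulation (purely hypothetical data): below, x has no effect on y whatsoever, yet roughly 5% of the regressions still come back “significant” at the 5% level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
false_positives = 0
n_simulations = 2_000

for _ in range(n_simulations):
    x = rng.normal(size=50)
    y = rng.normal(size=50)            # y is pure noise: no true relationship with x
    result = stats.linregress(x, y)
    if result.pvalue < 0.05:           # "statistically significant" by the usual rule
        false_positives += 1

print(false_positives / n_simulations)  # roughly 0.05, even though nothing was done wrong
```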
Both frequentist and Bayesian methods help us uncover the true relationships at work in the world around us. At the same time, there are many factors working against us…
Setting up a regression can be tricky
Regardless of which statistical techniques you employ, a regression really only measures how much y and x vary together, and uses that to estimate the strength of the relationship between x and y. But reality is so complicated that it can disrupt or obfuscate the true link between the variables, or even mislead us into thinking that a relationship exists when it truly doesn’t.
Consider, for example, the determinants of high salaries. What factors probably contribute to having a higher salary? Talent, work experience, education, and charisma immediately come to mind. Other factors might include location, the type of industry, workplace politics, labor market factors, and even luck.
All of these factors could be x variables that we gather data on before running a regression on y, which in this case is a vector of observed salaries. But there’s no obvious way to measure many of these variables. How do we measure talent or charisma, for instance? There’s no correct answer to this, although there are surely many wrong answers!
Even for more obvious factors, how to measure them can still be a tricky thing. Work experience can be approximated by age, for example. But age might not be a good measure of experience if a worker has switched jobs often. We could use the measure “years working in the same industry” instead, but this still isn’t perfect, as the definitions of industries themselves can be arbitrary. If someone worked as a marketer for Apple, and later got a job in the marketing team at PepsiCo, are they working in the same industry?
From this quick example, it’s clear that a solid theory should undergird any regression, and clear justifications need to be made for how exactly the practitioner gathers and prepares the data and sets up the regression equation.
But there’s much more; we haven’t yet touched the mathematics of this subject.
Warning: not everything is as it may seem!
Mathematically, there are many ways that reality complicates the relationship between x and y. These include, but are not limited to: cases where x and y affect each other; cases where x affects both y and the error term; cases where x, y or both trend over time; and cases where the error term has patterns or isn’t expected to be zero.
These types of issues occur commonly in real-world data, particularly economic data (which is very seldom experimental**). For example, consider using a regression to investigate the effect of an increase in government taxes on GDP growth.
**So much so that the economists who popularized randomized controlled trials (RCTs) in economics won the 2019 Nobel Prize for their work, a stark contrast to fields like biology and psychology that use RCTs all the time.
It might seem obvious that increasing taxes lowers GDP. However, if the government uses the new tax revenue to buy goods or services (for instance, construction projects), the true effect is much less clear. Recall that the aggregate demand equation is GDP = C + I + G + NX, where G is government spending.
Higher taxes probably lower GDP, but increase G, since taxes fund government spending. But G increases GDP. So…what is the true effect? Were we to run a regression including GDP and G without accounting for this feedback loop, we would likely reach an incorrect conclusion.
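A stylized simulation of this feedback loop (the parameter values are assumptions chosen purely for illustration, not a calibrated macro model) shows the problem: because government spending responds to GDP, the spending series ends up correlated with the GDP shocks, and the naive OLS estimate drifts away from the true effect that was built in.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
true_effect_of_g_on_gdp = 0.6   # assumed value for the simulation
feedback_of_gdp_on_g = 0.4      # assumed: higher GDP -> more revenue -> more spending

u = rng.normal(size=n)          # shocks to GDP
v = rng.normal(size=n)          # shocks to government spending

# Solve the two simultaneous equations gdp = 0.6*g + u and g = 0.4*gdp + v
denom = 1 - true_effect_of_g_on_gdp * feedback_of_gdp_on_g
g = (feedback_of_gdp_on_g * u + v) / denom
gdp = (true_effect_of_g_on_gdp * v + u) / denom

X = np.column_stack([np.ones(n), g])
beta_naive, *_ = np.linalg.lstsq(X, gdp, rcond=None)
print(beta_naive[1], "vs. true effect", true_effect_of_g_on_gdp)  # estimate is biased upward
```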
The previous example was a case where x and y affect each other. Let’s briefly examine the other cases mentioned above.
We can return to our salary example to examine a case where x affects not only y but also the error term. Consider using “years of education” as an x variable to predict salaries. We’d have to account for the fact that as years of education increase, the variance of earnings increases. That’s because, all else equal, individuals with few years of education tend to have uniformly low earnings, while highly-educated workers experience a much wider range of earnings. Thus, the spread of the error term grows as x increases, a problem known as heteroskedasticity.
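A small sketch of that pattern, with simulated data in which the error spread grows with years of education: the ordinary standard errors rest on a constant-variance assumption that no longer holds, which is why practitioners often switch to heteroskedasticity-robust ones.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 1_000
education = rng.uniform(8, 22, size=n)                 # years of education
noise = rng.normal(size=n) * (0.5 * education)         # error spread grows with education
salary = 10 + 3 * education + noise                    # hypothetical salary relationship

X = sm.add_constant(education)
naive = sm.OLS(salary, X).fit()                  # assumes a constant error variance
robust = sm.OLS(salary, X).fit(cov_type="HC1")   # heteroskedasticity-robust standard errors

print(naive.bse)   # standard errors under the (wrong) constant-variance assumption
print(robust.bse)  # robust standard errors, typically different when variance changes with x
```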
When data trends over time, whether due to seasonality (think sales of ice cream) or simply the passage of time (macroeconomic variables like GDP and population tend to grow over time), it can be hard to disentangle the true effect of x on y, because that effect is hidden within the time trend that affects both variables.
Finally, consider a case where the error term has a pattern. This is very often the case with financial data, for instance predicted stock market returns. Financial assets often fluctuate partially based on their previous values. This may be because people assess their future worth based on their current price, and because people expect an asset to continue performing much as it is currently performing. Either way, the error terms in a regression of financial asset values are likely to be correlated over time (a problem known as autocorrelation, or serial correlation), which will obfuscate the true effect of x on y.
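As a final sketch, here is a simulation of AR(1)-style errors (a common, assumed way of modeling this kind of persistence): each period’s error carries part of the previous one, and the pattern shows up clearly in the lag-one correlation.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 2_000
errors = np.zeros(n)
for t in range(1, n):
    # each period's error carries over 80% of the previous one, plus a fresh shock
    errors[t] = 0.8 * errors[t - 1] + rng.normal()

lag1_correlation = np.corrcoef(errors[:-1], errors[1:])[0, 1]
print(lag1_correlation)  # close to 0.8: the errors clearly follow a pattern over time
```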
Regression is an art and a science
Running a good regression can be the cornerstone of an excellent research paper or the key reason for a new business strategy, but regressions are far from simple or easy to execute well in many real-world cases. Understanding the issues at play, and figuring out how best to untangle the relationship between y and x, is a bit of an artful science.
It requires a thorough understanding of the theory underlying the analysis, and running better regressions is a skill that comes with time. Hopefully, this article has offered a helpful glimpse into the machinery behind this ubiquitous tool in the data scientist’s arsenal, and will aid you the next time you begin to gather data for your own analysis!
Image Credit: Generated with Canva AI Image generator (Prompt: Regression)