An Explanation

Standardizing the Data of Different Scales - Which Method to Use?

10 July 2017

When you're doing data analysis, you might find yourself with a number of different variables to work with. For example, perhaps you have invited participants to your lab and run them through your experiment. You have collected data on variables such as the participant's age (in years), their height (in cm), their reaction time to a particular stimulus (in milliseconds), their reaction time to another stimulus (in milliseconds), and their rating of a preference (on a 1 – 10 scale).

When you want to analyze this data to find patterns in it, you will often run a regression and compare the effect sizes (i.e., the coefficients) of each variable. But before you compare effect sizes, you first need to make sure that your variables are on the same scale so that they can be compared.

Considering the example above, the effect of the two measures of reaction time on your dependent variable y would be comparable, as they are both measured in milliseconds. But how would you compare, say, the effect sizes of height and reaction time on y? To do this, you need to use a statistical technique called standardization.

Confusingly, standardization can have several meanings, depending on the field. To add to the confusion, some people refer to the same process as normalization, though we will introduce a separate definition for that below. The general process that both standardization and normalization fall under is transforming variables measured on different scales onto a common scale, so that their effects on the dependent variable y can be compared.

Methods for standardizing data

Standardization: calculating Z-score

The first step in standardization is quantifying how much your data varies. This is described by the standard deviation, which is the square root of the variance (the average of the squared deviations of each value from the mean). The standard deviation is a single number, in the same units as the data, that represents its spread, and you need it for the next step.
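
Written out as a formula, the definition above looks like the following. The notation is an assumption made here for illustration: N is the number of data points, x_i is the i-th value, and μ is the mean of the data.

\begin{equation*}
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2}}
\end{equation*}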

Once you know the mean and standard deviation of your data, you are ready to perform standardization. You calculate a z-score by subtracting the mean from each data point, and then dividing the difference by the standard deviation of the population. This means that each variable will have a mean of 0 and a standard deviation of 1, so you can compare your different variables meaningfully.
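
As a rough illustration of the calculation described above, here is a minimal Python sketch using only the standard library. The function name `z_scores` and the height values are hypothetical and chosen purely for the example; they do not come from any real data set.

```python
import statistics


def z_scores(values):
    """Standardize values using their mean and population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD: divides by N
    if sd == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    # Subtract the mean from each point, then divide by the standard deviation
    return [(v - mean) / sd for v in values]


# Hypothetical heights in cm (illustrative values only)
heights_cm = [160.0, 170.0, 180.0, 190.0]
heights_z = z_scores(heights_cm)
```

For these made-up heights the mean is 175 and the standard deviation is about 11.18, so the z-scores come out at roughly -1.34, -0.45, 0.45 and 1.34. The same function could be applied to reaction times or ratings, which would put all of the variables on one scale.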

Studentization

However, economists never know the true population mean or variance (unless we have artificially generated the data set)! So, when you do this process of standardization, in most cases you will use your estimate of the standard deviation from your data in place of the true population standard deviation. This is called studentization.
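
A minimal sketch of the difference, again in standard-library Python: the only change from a population z-score is that the standard deviation is estimated from the sample. Using the n - 1 divisor (as `statistics.stdev` does) is an assumption here, since it is the most common estimator; the function name `studentize` and the data are hypothetical.

```python
import statistics


def studentize(values):
    """Center values on the sample mean and scale by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD: divides by n - 1
    if sd == 0:
        raise ValueError("sample standard deviation is zero; cannot rescale")
    return [(v - mean) / sd for v in values]


# Hypothetical heights in cm (illustrative values only)
heights_cm = [160.0, 170.0, 180.0, 190.0]
heights_t = studentize(heights_cm)
```

With these four values the sample standard deviation is about 12.91, so the studentized values are roughly -1.16, -0.39, 0.39 and 1.16. They are slightly smaller in magnitude than the corresponding population z-scores because the n - 1 divisor gives a larger standard deviation; the gap shrinks as the sample grows.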

Note for advanced readers – the z-score calculation always yields a transformed variable with mean 0 and variance 1, whatever the shape of the original distribution. However, the transformed variable follows a standard normal distribution only if the original data is normally distributed. If you intend to interpret your standardized (or studentized) values using the normal distribution, that assumption must be checked.

Normalization

Another option is to take your data points and adapt them so that they fall on a scale of 0 to 1. Confusingly, this is referred to as normalization, even though it is not related to the normal distribution. Some statistics programs can do this automatically for you.

One big advantage of this method is that it lets you eyeball the effect sizes very easily, as it's intuitively obvious what the difference between a value of 0.6 and 0.8 is on a 0 to 1 scale, for example. The following formula shows how to normalize data:

\begin{equation*}
\mathit{X}_{\text{changed}}= \frac{\mathit{X}-\mathit{X}_{\text{min}}}{\mathit{X}_{\text{max}}-\mathit{X}_{\text{min}}}
\end{equation*}
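
The formula translates almost directly into code. The following is a minimal Python sketch; the function name `min_max_normalize` and the preference ratings are hypothetical and used only for illustration.

```python
def min_max_normalize(values):
    """Rescale values to the 0-1 range using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the 0-1 scale is undefined")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical preference ratings on a 1-10 scale (illustrative values only)
ratings = [2, 5, 8, 10]
ratings_scaled = min_max_normalize(ratings)
# ratings_scaled == [0.0, 0.375, 0.75, 1.0]
```

The smallest observed value always maps to 0 and the largest to 1, so the result describes each point's position within the observed range rather than within the 1 – 10 scale the ratings were collected on.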

However, scaling in this manner is sensitive to outliers: because the minimum and maximum alone define the scale, a single extreme value compresses all the other data points into a narrow band. Depending on the method used, outliers may be dropped to prevent them from overly influencing interpretations of the other variables. This is an inefficient loss of data, so unless there is a strong reason for preferring the 0-1 scale, a z-score is generally sufficient for most purposes if the data is normally distributed.
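
To see how much a single extreme value can matter, the sketch below rescales the same set of made-up reaction times with and without one outlier. All names and values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values to the 0-1 range using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the 0-1 scale is undefined")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical reaction times in milliseconds (illustrative values only)
typical_ms = [300.0, 320.0, 340.0, 360.0]
with_outlier_ms = typical_ms + [3000.0]  # one extreme observation added

scaled_typical = min_max_normalize(typical_ms)
scaled_with_outlier = min_max_normalize(with_outlier_ms)

print([round(v, 3) for v in scaled_typical])
print([round(v, 3) for v in scaled_with_outlier])
```

Without the outlier, the four reaction times spread evenly across the 0 to 1 range. With it, the same four values are all squeezed below about 0.03, while the outlier alone sits at 1.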

Note for advanced readers – removing outliers from data must be treated with caution. The reason why a data point is an outlier may be entirely valid, in which case removing the data point will cause your inference to be incorrect. This is a serious problem. However, if the data point is an outlier because of a data entry error or similar issue, it is safe to remove the data point. Check your outliers carefully before deciding how to handle them.

Find out more

To fully understand this topic and choose the right method for your data analysis, we recommend learning about the basics of statistics from an introductory textbook. INOMICS has also recommended several free online courses that you might find very helpful.

If you’re ready to dive deep and get formally educated in economic statistics and other economics topics, check out our top program recommendations. Pro tip: you can filter the program listings for programs that include a statistics focus by selecting “Econometrics, Statistics and Quantitative Methods (JEL C)” and “Statistics” at the very bottom of the list.

You can also read up on papers by statisticians which cover these issues. Here are some references to get you started:

- Everitt, B. S. 1993. Cluster Analysis. 3rd Edition. (New York and Toronto: Halsted Press, of John Wiley & Sons Inc.).

- Gower, J. C. 1985. Measures of similarity, dissimilarity, and distance. Pages 397-405 in Encyclopedia of Statistical Sciences, Vol. 5. S. Kotz, N.L. Johnson, and C.B. Read, Editors. (New York: John Wiley and Sons).

- Johnson, R.A., and D.W. Wichern. 1992. Applied Multivariate Statistical Analysis. 3rd Edition. (Englewood Cliffs, New Jersey: Prentice Hall).

- Milligan, G. W., and M. C. Cooper. 1988. A study of standardization of variables in cluster analysis. Journal of Classification, 5, 181-204.

- van Tongeren, O. F. R. 1995. Cluster Analysis. Pages 174-212 in: R. H. G. Jongman, C. J. F. ter Braak, and O. F. R. van Tongeren, Eds. Data Analysis in Landscape and Community Ecology. (Cambridge & New York: Cambridge University Press).