What are the OLS Assumptions?

Ordinary Least Squares (OLS) regression is a powerful statistical technique used to estimate the linear relationship between a dependent variable and one or more independent variables. For the OLS estimators to be considered the Best Linear Unbiased Estimators (BLUE), and for statistical inference to be valid, several key assumptions must hold. Violations of these assumptions can lead to biased, inefficient, or inconsistent estimates.
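
Before turning to the assumptions themselves, here is a minimal sketch of fitting an OLS model with Python's statsmodels library. The data are simulated and the variable names and coefficient values are purely illustrative.

```python
import numpy as np
import statsmodels.api as sm

# Simulate a simple linear relationship: y = 1.5 + 2.0*x + noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 2.0 * x + rng.normal(0, 1.0, size=200)

# Fit OLS: add an intercept column, then estimate the coefficients
X = sm.add_constant(x)        # design matrix [1, x]
model = sm.OLS(y, X).fit()

print(model.params)           # estimated intercept and slope, close to [1.5, 2.0]
print(model.summary())        # standard errors, t-tests, F-test, R-squared
```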

Key Assumptions for Unbiased and Efficient Estimators

The reliability and efficiency of OLS estimates depend heavily on certain properties of the error term. Specifically, for OLS to provide unbiased, efficient linear estimators, three critical assumptions about the error term are necessary (a short simulation illustrating the first of them follows this list):

  • Homoscedasticity: The error terms have a constant variance across all levels of the independent variables. Without this, the OLS estimators are still unbiased but lose their efficiency (i.e., they are no longer BLUE).
  • No Autocorrelation: The error terms must be independent of each other. This implies that the error for one observation does not systematically influence the error for another observation. Violation of this assumption (autocorrelation) leads to unbiased but inefficient OLS estimators, and standard errors will be incorrect, affecting hypothesis tests.
  • Normality of Errors: The error terms are assumed to be normally distributed. While OLS estimators are still unbiased and consistent even without normality (especially in large samples due to the Central Limit Theorem), this assumption is crucial for conducting hypothesis tests (t-tests, F-tests) and constructing confidence intervals, particularly in small samples.
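
To see the first point concretely, here is a minimal simulation sketch (synthetic numbers, no real data): the errors are deliberately heteroscedastic, yet the coefficient estimates stay close to the truth while the default and heteroscedasticity-robust standard errors diverge, which is exactly the "unbiased but no longer BLUE, with incorrect standard errors" situation described above.

```python
import numpy as np
import statsmodels.api as sm

# Heteroscedastic errors: the noise spread grows with x, violating constant variance
rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(1, 10, size=n)
errors = rng.normal(0, 0.5 * x)              # standard deviation proportional to x
y = 2.0 + 0.8 * x + errors
X = sm.add_constant(x)

fit = sm.OLS(y, X).fit()
print(fit.params)                            # still close to the true [2.0, 0.8]
print(fit.bse)                               # default SEs assume constant variance
print(fit.get_robustcov_results("HC1").bse)  # heteroscedasticity-consistent SEs differ
```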

Beyond these crucial error term assumptions, a comprehensive set of OLS assumptions ensures the robust application and interpretation of regression results.

Comprehensive OLS Assumptions

The following list summarizes the primary OLS assumptions, their meaning, and the implications of their violation:

1. Linearity in Parameters
   • Explanation: The relationship between the dependent variable and the independent variables is linear in the parameters (coefficients). The independent variables themselves do not need to enter linearly (e.g., you can include X² as a regressor).
   • Implication of violation: If the true relationship cannot be written as linear in the parameters, OLS provides biased and inconsistent estimates, potentially leading to incorrect conclusions about the relationships between variables.

2. Random Sampling
   • Explanation: The data used in the regression must be a simple random sample from the population of interest, so that the sample is representative of that population.
   • Implication of violation: Non-random sampling can lead to biased and inconsistent estimators, as the sample may not accurately reflect the underlying population relationships.

3. No Perfect Multicollinearity
   • Explanation: No independent variable is an exact linear combination of the other independent variables. For example, you should not include both height in inches and height in centimeters as separate predictors.
   • Implication of violation: Perfect multicollinearity makes it impossible to uniquely determine the individual coefficients. High (but not perfect) multicollinearity can inflate standard errors, making it difficult to assess the individual impact of correlated predictors.

4. Zero Conditional Mean of Errors (Exogeneity)
   • Explanation: The expected value of the error term, given any values of the independent variables, is zero: E(u | X) = 0. This implies that all relevant variables are included in the model and that the regressors are uncorrelated with the error term.
   • Implication of violation: Endogeneity makes the OLS estimators biased and inconsistent; as discussed below, it is the most serious violation.

5. Homoscedasticity
   • Explanation: The variance of the error term is constant across all observations: Var(u | X) = σ². The spread of the residuals should be roughly the same across the range of predicted values.
   • Implication of violation: The OLS estimators remain unbiased but are no longer efficient (not BLUE), and the usual standard errors are incorrect, affecting hypothesis tests and confidence intervals.

6. No Autocorrelation
   • Explanation: The error terms are uncorrelated with each other across observations: Cov(uᵢ, uⱼ | X) = 0 for i ≠ j. This is particularly relevant in time-series data, where the error in one period might be related to the error in a subsequent period.
   • Implication of violation: The OLS estimators remain unbiased but inefficient, and standard errors will be incorrect, again affecting hypothesis tests.

7. Normality of Errors
   • Explanation: The error term is normally distributed: u ~ N(0, σ²).
   • Implication of violation: In small samples, hypothesis tests (t-tests, F-tests) and confidence intervals may not be valid. In large samples this matters less: by the Central Limit Theorem the OLS estimators are approximately normally distributed regardless of the error distribution, and they remain unbiased and consistent.
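
To make assumption 3 concrete, here is a small sketch (made-up heights, following the inches/centimeters example above) showing that a design matrix containing an exact linear combination is rank-deficient, so the OLS normal equations have no unique solution.

```python
import numpy as np

rng = np.random.default_rng(7)
height_in = rng.normal(68, 3, size=50)   # height in inches (made-up sample)
height_cm = 2.54 * height_in             # an exact linear function of the first column

# Design matrix: intercept, height in inches, height in centimeters
X = np.column_stack([np.ones(50), height_in, height_cm])

# X has 3 columns but only rank 2, so X'X is singular and the normal
# equations (X'X) b = X'y have no unique solution for the coefficients b.
print(np.linalg.matrix_rank(X))          # 2, not 3
print(np.linalg.cond(X.T @ X))           # enormous condition number
```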

Practical Considerations and Diagnostics

Understanding these assumptions is crucial, as real-world data rarely perfectly satisfies all of them. Here's what to consider:

  • Checking Assumptions (a diagnostics sketch in statsmodels follows this list):
    • Residual Plots: Plotting residuals against predicted values or independent variables can help detect non-linearity, heteroscedasticity, and outliers.
    • Tests for Heteroscedasticity: Breusch-Pagan test, White test.
    • Tests for Autocorrelation: Durbin-Watson test, Breusch-Godfrey test.
    • Tests for Normality: Jarque-Bera test, Shapiro-Wilk test, Q-Q plots.
    • Variance Inflation Factor (VIF): To detect multicollinearity.
  • Addressing Violations (a sketch of robust standard errors and WLS also follows the list):
    • Non-linearity: Transform variables (e.g., log transformations), add polynomial terms, or use non-linear regression models.
    • Heteroscedasticity: Use robust standard errors (e.g., White's heteroscedasticity-consistent standard errors) or Weighted Least Squares (WLS).
    • Autocorrelation: Use robust standard errors (e.g., Newey-West standard errors for time series), transform data (e.g., differencing), or use time series models like ARIMA.
    • Multicollinearity: Remove highly correlated variables, combine them, or use techniques like Principal Component Analysis (PCA) or Ridge Regression.
    • Endogeneity: This is the most serious violation and often requires advanced econometric techniques like Instrumental Variables (IV) regression or Two-Stage Least Squares (2SLS).
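
As an illustration of the checks above, the sketch below runs several of the listed diagnostics with statsmodels on a small synthetic model; the data, sample size, and rule-of-thumb interpretations in the comments are assumptions for the example, not prescriptions.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson, jarque_bera
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Small synthetic model with two correlated predictors
rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

# Heteroscedasticity: Breusch-Pagan (a small p-value suggests non-constant variance)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)

# Autocorrelation: Durbin-Watson (values near 2 suggest no first-order autocorrelation)
print("Durbin-Watson:", durbin_watson(fit.resid))

# Normality: Jarque-Bera (a small p-value suggests non-normal residuals)
jb_stat, jb_pvalue, skew, kurt = jarque_bera(fit.resid)
print("Jarque-Bera p-value:", jb_pvalue)

# Multicollinearity: VIF for each non-intercept column (values above ~5-10 are a warning sign)
for i in range(1, X.shape[1]):
    print("VIF:", variance_inflation_factor(X, i))
```

In practice these formal tests are usually read alongside residual and Q-Q plots rather than in isolation.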
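
And as a sketch of some of the remedies, again on synthetic data: White-type robust standard errors, Newey-West (HAC) standard errors, and Weighted Least Squares. The noise structure and lag choice here are assumed purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(1, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 * x)   # heteroscedastic noise
X = sm.add_constant(x)

# 1) Keep OLS but report heteroscedasticity-consistent (White-type) standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC3")
print(robust_fit.bse)

# 2) Newey-West (HAC) standard errors, for serially correlated errors in time series
hac_fit = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(hac_fit.bse)

# 3) Weighted Least Squares, when the variance structure is (approximately) known:
#    here the error standard deviation is proportional to x, so weight by 1/x**2
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()
print(wls_fit.params)
```

Robust and HAC standard errors leave the coefficient estimates unchanged and only adjust inference, whereas WLS re-estimates the coefficients using the assumed variance structure.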

By carefully checking and addressing these assumptions, researchers can ensure the validity and reliability of their OLS regression results.