
What is the difference between R-squared and pseudo R-squared?

Published in Statistical Model Fit

The fundamental difference between R-squared and pseudo R-squared lies in the types of models they evaluate and how they interpret explained variance. R-squared is primarily used for ordinary least squares (OLS) linear regression models, while pseudo R-squared measures are designed for generalized linear models (GLMs), such as logistic regression or probit regression, where the concept of "variance explained" is not as straightforward.

Understanding R-squared (Coefficient of Determination)

R-squared, also known as the coefficient of determination, is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a linear regression model.

  • Purpose: It quantifies how well the regression model fits the observed data.
  • Scale: For OLS linear regression, R-squared always ranges from 0 to 1, where:
    • 0 indicates that the model explains none of the variability of the response data around its mean.
    • 1 indicates that the model explains all the variability of the response data around its mean.
  • Interpretation: An R-squared of 0.75, for example, means that 75% of the variation in the dependent variable can be explained by the independent variables in the model.
  • Calculation: It is computed as 1 minus the ratio of the residual sum of squares (SS_res) to the total sum of squares (SS_tot): R-squared = 1 - (SS_res / SS_tot).
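The calculation above can be sketched in a few lines. This is a minimal illustration with made-up observations and predictions (as if they came from a fitted OLS model), not a full regression workflow:

```python
import numpy as np

# Hypothetical observed values and model predictions (e.g., from an OLS fit)
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_hat = np.array([3.2, 4.8, 7.1, 8.9, 11.0])

ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares (SS_res)
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares (SS_tot)
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 4))  # -> 0.9975
```

Here almost all variation around the mean of y is captured by the predictions, so R-squared is close to 1.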

What is Pseudo R-squared?

Pseudo R-squared measures are a family of statistics developed to provide an analogous measure of model fit for generalized linear models (GLMs), which do not rely on the assumption of normally distributed errors and often deal with categorical dependent variables. Because GLMs use different estimation methods (like maximum likelihood) and have different error structures, the traditional R-squared calculation isn't appropriate or meaningful.

  • Purpose: To offer an approximate indication of how well a GLM fits the data, similar in concept to how R-squared works for OLS.
  • Scale: Unlike OLS R-squared, most pseudo R-squared measures do not span the full 0-to-1 range. This is a key distinction. For instance, Cox & Snell's pseudo R-squared has a maximum attainable value below 1 even for a model that predicts perfectly, which makes direct comparison with OLS R-squared misleading.
  • Interpretation: Pseudo R-squared values are generally not interpreted as the proportion of variance explained in the same way as OLS R-squared. Their absolute values are often much lower than traditional R-squared values for models with similar explanatory power. They are more useful for comparing competing models for the same dependent variable and dataset (e.g., comparing two different logistic regression models) rather than for providing an absolute measure of fit or comparing to R-squared from linear models.
  • Calculation: There are various pseudo R-squared measures, each calculated differently based on likelihood functions, chi-square statistics, or other metrics appropriate for GLMs.

Common Types of Pseudo R-squared

Several types of pseudo R-squared measures exist, each with its own strengths and weaknesses:

  • McFadden's R-squared:
    • Ranges from 0 to just under 1; values rarely approach 1 even for strong models.
    • Calculated as 1 - (log-likelihood of the full model / log-likelihood of the null model).
    • Values often appear lower than OLS R-squared, even for a good model.
  • Cox & Snell's R-squared:
    • Based on the likelihood ratio test statistic.
    • Its maximum attainable value is strictly less than 1 (the ceiling depends on the data), so it cannot be read as a proportion.
  • Nagelkerke's R-squared (or Cragg and Uhler's R-squared):
    • An adjustment of Cox & Snell's R-squared that scales it to have a maximum value of 1.
    • Often considered the most interpretable pseudo R-squared for comparing with OLS R-squared conceptually, though values are still not directly comparable.
  • Efron's R-squared:
    • Similar in calculation to OLS R-squared but applied to the predicted probabilities in GLMs.
    • Can sometimes be interpreted more directly but is less commonly reported.
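The likelihood-based measures above can all be computed directly from a fitted model's log-likelihoods. The sketch below uses hypothetical log-likelihood values (ll_full, ll_null, and a sample size of 200 are made-up numbers, as if taken from a fitted logistic regression and its intercept-only counterpart):

```python
import numpy as np

# Hypothetical log-likelihoods from a fitted logistic regression:
ll_full = -95.0    # fitted model
ll_null = -130.0   # intercept-only (null) model
n = 200            # number of observations

# McFadden: 1 - (ll_full / ll_null)
mcfadden = 1 - ll_full / ll_null

# Cox & Snell: 1 - (L_null / L_full)^(2/n), written in log form
cox_snell = 1 - np.exp((2 / n) * (ll_null - ll_full))

# Cox & Snell's maximum attainable value, used by Nagelkerke to rescale
cs_max = 1 - np.exp((2 / n) * ll_null)
nagelkerke = cox_snell / cs_max

print(f"McFadden:   {mcfadden:.3f}")
print(f"Cox & Snell: {cox_snell:.3f}")
print(f"Nagelkerke: {nagelkerke:.3f}")
```

Note that the three measures give different numbers for the same model, which is why reports should always name the measure used.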

Key Differences Summarized

| Feature | R-squared (OLS Linear Regression) | Pseudo R-squared (Generalized Linear Models) |
| --- | --- | --- |
| Model Type | Ordinary least squares (OLS) linear regression | Generalized linear models (e.g., logistic, probit, Poisson regression) |
| Dependent Variable | Continuous | Often categorical (binary, ordinal) or count data |
| Error Distribution | Assumes normally distributed errors | No normality assumption; uses distributions such as binomial or Poisson |
| Interpretation | Proportion of variance in the dependent variable explained by the predictors | An analogous, but often not directly comparable, measure of model fit; not typically "variance explained" |
| Range | Always ranges from 0 to 1 | Most do not span 0 to 1 (e.g., Cox & Snell's); Nagelkerke's is rescaled to reach 1 |
| Comparability | Direct comparison between linear models is generally valid | Primarily for comparing different GLMs on the same data; not directly comparable to OLS R-squared |
| Calculation Basis | Residual and total sums of squares | Likelihood functions, chi-square statistics, or other GLM-specific metrics |
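The range difference summarized above can be made concrete. The sketch below uses a deliberately contrived "perfect" binary model, in which the predicted probabilities equal the observed outcomes exactly, so the model's likelihood is 1 and its log-likelihood is 0. Even then, Cox & Snell's measure falls short of 1, while Nagelkerke's rescaling reaches it:

```python
import numpy as np

# Contrived example: a perfect binary classifier (predictions == outcomes)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
n = len(y)

p_null = y.mean()  # intercept-only model predicts the overall rate
ll_null = np.sum(y * np.log(p_null) + (1 - y) * np.log(1 - p_null))
ll_full = 0.0      # perfect fit: likelihood = 1, so log-likelihood = 0

cox_snell = 1 - np.exp((2 / n) * (ll_null - ll_full))
cs_max = 1 - np.exp((2 / n) * ll_null)   # Cox & Snell's ceiling
nagelkerke = cox_snell / cs_max

print(round(cox_snell, 3))   # -> 0.75: even a perfect model falls short of 1
print(round(nagelkerke, 3))  # -> 1.0:  Nagelkerke rescales the ceiling to 1
```

This is why Cox & Snell values should never be read as proportions, and why Nagelkerke's variant is often preferred when a 0-to-1 scale is wanted.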

Practical Insights and Solutions

  • When to Use:
    • Use R-squared when you are performing OLS linear regression with a continuous dependent variable.
    • Use pseudo R-squared measures when working with GLMs like logistic regression, probit regression, or Poisson regression, particularly when dealing with categorical or count dependent variables.
  • Interpreting Values:
    • A high R-squared (e.g., above 0.7) typically indicates a good fit in OLS, though what counts as "high" varies by field; noisy domains such as the social sciences often see much lower values for useful models.
    • Pseudo R-squared values are generally lower than OLS R-squared values, even for well-fitting models. A McFadden's R-squared of 0.2 to 0.4 can indicate an excellent model fit, which would be considered poor for OLS R-squared.
    • Focus on the relative improvement among different models rather than the absolute value of pseudo R-squared.
  • Reporting: When reporting pseudo R-squared, it's good practice to specify which type you are using (e.g., "McFadden's R-squared," "Nagelkerke's R-squared") to avoid confusion.
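Putting these points together, here is an end-to-end sketch on simulated data. To keep it self-contained it fits the logistic regression with a hand-rolled Newton-Raphson routine; in practice you would use a statistics library (for example, statsmodels' Logit results report McFadden's measure), and the simulated coefficients (0.5, 1.5) are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
# Simulate a binary outcome from a known logistic relationship
p_true = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
y = rng.binomial(1, p_true)

def fit_logit_ll(X, y, iters=25):
    """Fit logistic regression by Newton-Raphson; return the log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                     # score vector
        hess = X.T @ (X * (p * (1 - p))[:, None])  # observed information
        beta += np.linalg.solve(hess, grad)
    p = 1 / (1 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.column_stack([np.ones(n), x])      # design matrix with intercept
ll_full = fit_logit_ll(X, y)
ll_null = fit_logit_ll(np.ones((n, 1)), y)  # intercept-only model

mcfadden = 1 - ll_full / ll_null
print(f"McFadden's R-squared: {mcfadden:.3f}")
```

Following the reporting advice above, the output names the specific measure rather than a bare "R-squared."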

In essence, while both aim to describe how well a model fits the data, they do so using different underlying statistical principles and scales, tailored to the specific assumptions and characteristics of linear versus generalized linear models.