How Do I Automatically Remove Outliers in SPSS?

Published in Data Cleaning

SPSS doesn't offer a single "automatic" button to remove outliers, but the process is straightforward and has two main steps: identify the outliers, then exclude them from your analysis using the "Select Cases" function.

Outlier removal is therefore an analytical decision rather than a one-click operation: you define what constitutes an outlier, and then you instruct SPSS to filter those cases.

Understanding Outliers and Why They Matter

Outliers are data points that significantly deviate from other observations in a dataset. They can occur due to measurement errors, data entry mistakes, or genuinely unusual but valid data points. Their presence can distort statistical analyses, affecting means, standard deviations, correlations, and regression models.

Before attempting to remove outliers, it's crucial to:

  • Investigate their cause: Is it a data entry error, a measurement error, or a legitimate extreme value?
  • Understand the impact: How do they affect your analysis results?
  • Consider alternatives: Instead of removal, could you transform the data, use robust statistical methods, or analyze the data with and without outliers?

Step 1: Identifying Outliers in SPSS

Identifying outliers typically involves visual inspection or statistical criteria. Here are common methods in SPSS:

1. Using Box Plots

Box plots are excellent visual tools for identifying outliers, which are usually represented as individual points beyond the "whiskers" of the box plot.

  • Path: Analyze > Descriptive Statistics > Explore...
  • Steps:
    1. Move the variable(s) of interest into the Dependent List.
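    2. In the Statistics dialog, check Outliers (this produces the "Extreme Values" table), then click Continue.
    3. In the Plots dialog, under Boxplots, select Factor levels together or Dependents together, then click Continue.
    4. Click OK.
  • Interpretation: SPSS will generate box plots in which outliers appear as individual points beyond the whiskers: circles for values between 1.5 and 3 box lengths from the edge of the box, and asterisks for values more than 3 box lengths away. The "Extreme Values" table in the output lists the case numbers and values of the five highest and five lowest observations, which often include outliers.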

2. Using Z-Scores

Z-scores (standardized scores) measure how many standard deviations an observation is from the mean. A common rule of thumb for identifying outliers is a Z-score with an absolute value greater than 3 (or 3.29 for a more conservative approach), indicating a value far from the average.

  • Path: Analyze > Descriptive Statistics > Descriptives...
  • Steps:
    1. Move the variable(s) of interest into the Variable(s) list.
    2. Check the box Save standardized values as variables.
    3. Click OK.
  • Interpretation: SPSS will create new variables in your Data View. Each is named with a Z prefix (e.g., ZVAR_NAME) and labeled Zscore(VAR_NAME). You can then sort by these Z-score variables to quickly find cases with extreme values.
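
The Z-score rule can also be checked outside SPSS. The following is a minimal Python sketch, not SPSS syntax. The function names and the sample incomes are hypothetical, and it assumes the sample standard deviation (n - 1 denominator), which is the standard deviation SPSS Descriptives reports.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values using the sample mean and sample SD (n - 1)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def flag_by_z(values, cutoff=3.0):
    """Return the 1-based positions of values whose |z| exceeds the cutoff."""
    return [i for i, z in enumerate(z_scores(values), start=1) if abs(z) > cutoff]


# Hypothetical data: 19 ordinary incomes and one extreme value (case 20).
incomes = [30, 32, 31, 29, 33, 30, 31, 32, 28, 34,
           30, 31, 29, 33, 32, 30, 31, 29, 32, 200]

flagged = flag_by_z(incomes)  # positions of cases with |z| > 3
```

One caveat when applying the cutoff: the largest possible |z| in a sample of n cases is (n - 1) / sqrt(n), so with 10 or fewer cases no value can exceed 3, however extreme it is.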

3. Using the Interquartile Range (IQR) Rule

The IQR method defines outliers as any data point that falls below Q1 - (1.5 × IQR) or above Q3 + (1.5 × IQR), where Q1 is the first quartile, Q3 is the third quartile, and IQR = Q3 - Q1.

  • Path: Analyze > Descriptive Statistics > Explore...
  • Steps:
    1. Move the variable(s) of interest into the Dependent List.
    2. In the Statistics dialog, check Descriptives and Percentiles.
    3. Click Continue and then OK.
  • Interpretation: The Descriptives table reports the interquartile range, and the Percentiles table reports the 25th (Q1) and 75th (Q3) percentiles. You then calculate the lower and upper bounds by hand and identify the cases that fall outside them.
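
For checking the hand calculation, here is a minimal Python sketch of the fence arithmetic. The function name and the data are hypothetical. It assumes Python's default "exclusive" quantile method; SPSS reports more than one percentile definition (weighted average and Tukey's hinges), so the quartiles here may not match your SPSS output exactly.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _median, q3 = quantiles(values, n=4)  # default method="exclusive"
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical data with one unusually large value.
values = [12, 15, 17, 18, 20, 21, 23, 25, 27, 30, 95]

lower, upper = iqr_fences(values)
outliers = [v for v in values if v < lower or v > upper]
```

The resulting bounds are the numbers you would type into a Select Cases condition such as (VAR_A >= lower AND VAR_A <= upper).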

4. Multivariate Outliers (Advanced)

For multivariate data, outliers might not be obvious in any single variable but stand out when several variables are considered together. Mahalanobis Distance measures how far a case lies from the centroid of the variables, while Cook's Distance measures how strongly a case influences a regression model. Both can be saved as new variables (e.g., MAH_1 and COO_1) from Analyze > Regression > Linear... > Save.
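
To illustrate the idea, here is a minimal Python sketch of the squared Mahalanobis distance for two variables. It is an illustration under stated assumptions, not the SPSS procedure: the function name and the centered scores are hypothetical, it uses the sample covariance (n - 1 denominator), and it writes out the 2x2 matrix inverse by hand. Choosing a cut-off is left to the analyst.

```python
from statistics import mean


def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each (x, y) pair from the centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Quadratic form d' S^-1 d, with the 2x2 inverse written out explicitly.
    return [
        (syy * (x - mx) ** 2
         - 2 * sxy * (x - mx) * (y - my)
         + sxx * (y - my) ** 2) / det
        for x, y in zip(xs, ys)
    ]


# Hypothetical centered scores: seven cases follow the pattern y = x,
# and the last two break it without being extreme on x or y alone.
xs = [-3, -2, -1, 0, 1, 2, 3, -2, 2]
ys = [-3, -2, -1, 0, 1, 2, 3, 2, -2]

d2 = mahalanobis_sq_2d(xs, ys)
most_unusual = max(range(len(d2)), key=d2.__getitem__)  # 0-based index
```

In this data the two off-pattern cases have the largest distances even though their x and y values sit well inside the range of each variable, which is exactly the situation univariate checks miss.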

Step 2: Removing/Excluding Outliers using "Select Cases"

Once you have identified the specific case numbers or established a criterion for outliers (e.g., an absolute Z-score greater than 3), you can use the Select Cases function to exclude them from your analysis.

  • Path: Data > Select Cases...
  • Steps:
    1. Choose If condition is satisfied.
    2. Click the If... button to open the "Select Cases: If" dialog.
    3. Define your outlier exclusion condition:
      • Using Z-scores: If you created Z-score variables (e.g., ZVAR_1), you can enter a condition like: ABS(ZVAR_1) <= 3 (to select cases within 3 standard deviations, thereby excluding those outside). Or, if you want to include all non-outliers for multiple variables: (ABS(ZVAR_1) <= 3 AND ABS(ZVAR_2) <= 3 AND ABS(ZVAR_3) <= 3).
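      • Using Case Numbers: If you identified specific case numbers (e.g., cases 5, 12, 23) from an "Extreme Values" table, note that SPSS has no built-in CASE_N variable. Use the system variable $CASENUM instead: NOT ($CASENUM = 5 OR $CASENUM = 12 OR $CASENUM = 23). Because $CASENUM reflects the current row order, it changes when the file is sorted and can behave unexpectedly with Delete unselected cases, so the same condition written on your own ID variable is the safer choice.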
      • Using IQR Rule (manual bounds): If your variable VAR_A has IQR bounds of 10 and 50, you'd enter: (VAR_A >= 10 AND VAR_A <= 50).
    4. Click Continue.
    5. In the Select Cases dialog, under Output, choose:
      • Filter out unselected cases (Recommended): This option temporarily excludes the identified outliers from analyses but keeps them in your dataset. SPSS will indicate filtered cases with a diagonal line through their row number. You can easily revert this by going back to Select Cases and choosing All cases.
      • Delete unselected cases: This permanently removes the outliers from your active dataset. Use with extreme caution and always save a backup of your data first.
    6. Click OK.
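
The difference between the two Output options can be sketched in a few lines of Python. This is an illustration of the logic only, not SPSS syntax; the function names, the id field, and the Z-score values are hypothetical. The 0/1 flag plays the role of the filter variable that SPSS adds when you choose Filter out unselected cases.

```python
def build_filter(cases, condition):
    """Return a 0/1 flag per case: 1 = selected (kept), 0 = filtered out."""
    return [1 if condition(case) else 0 for case in cases]


def within_3_sd(case):
    """Mirror of the condition ABS(ZVAR_1) <= 3 AND ABS(ZVAR_2) <= 3."""
    return abs(case["ZVAR_1"]) <= 3 and abs(case["ZVAR_2"]) <= 3


# Hypothetical cases with two saved Z-score variables.
cases = [
    {"id": 1, "ZVAR_1": 0.4, "ZVAR_2": -1.2},
    {"id": 2, "ZVAR_1": 3.6, "ZVAR_2": 0.3},
    {"id": 3, "ZVAR_1": -2.9, "ZVAR_2": 2.1},
    {"id": 4, "ZVAR_1": 1.0, "ZVAR_2": -3.4},
]

# "Filter out unselected cases": every row stays in the dataset, and
# analyses simply skip the rows whose flag is 0.
flags = build_filter(cases, within_3_sd)
analysis_rows = [c for c, f in zip(cases, flags) if f == 1]

# "Delete unselected cases": the dataset itself is replaced, so the
# excluded rows cannot be recovered without a backup.
after_delete = [c for c in cases if within_3_sd(c)]
```

With the filter approach the original list still holds all four cases, so switching back to All cases loses nothing.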

Example: Excluding Outliers Based on Z-Score

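Let's say you have a variable Income and you've saved its Z-scores. SPSS names the new variable ZIncome and labels it Zscore(Income); variable names are not case-sensitive, so Zincome refers to the same variable. You want to exclude any Income value with a Z-score greater than 3 or less than -3.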

  1. Go to Data > Select Cases...
  2. Select If condition is satisfied.
  3. Click If...
  4. Enter the condition: ABS(Zincome) <= 3
  5. Click Continue.
  6. Select Filter out unselected cases.
  7. Click OK.

Now, any subsequent analyses will be run only on the cases where Zincome is between -3 and 3 (inclusive).
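
To see what the filter does to a simple statistic, here is a minimal Python sketch that compares the mean with and without the excluded case. The income values are hypothetical, and the code illustrates the logic rather than reproducing SPSS output. Note that the Z-scores are computed once on the full data and are not recomputed after filtering, which matches the workflow above.

```python
from statistics import mean, stdev

# Hypothetical incomes: 19 ordinary values and one extreme value.
incomes = [30, 32, 31, 29, 33, 30, 31, 32, 28, 34,
           30, 31, 29, 33, 32, 30, 31, 29, 32, 200]

# Z-scores on the full data (sample SD, n - 1 denominator).
m, s = mean(incomes), stdev(incomes)
z = [(v - m) / s for v in incomes]

# Equivalent of the condition ABS(Zincome) <= 3.
kept = [v for v, zi in zip(incomes, z) if abs(zi) <= 3]

mean_all = mean(incomes)   # analysis on all cases
mean_kept = mean(kept)     # analysis on the filtered cases only
```

Reporting both results, as suggested earlier, makes it clear how much the conclusion depends on the excluded case.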

Summary Table: Outlier Identification and Selection in SPSS

Method for Identification | SPSS Menu Path (Identification) | How to Identify | SPSS Menu Path (Exclusion) | Exclusion Condition Example
Box Plots | Analyze > Descriptive Statistics > Explore... | Visual inspection; Extreme Values table | Data > Select Cases... > If | NOT ($CASENUM = 12 OR $CASENUM = 45) (using identified case numbers)
Z-Scores | Analyze > Descriptive Statistics > Descriptives... (check "Save standardized values as variables") | New Z-score variable in Data View | Data > Select Cases... > If | ABS(Z_Variable) <= 3
IQR Rule | Analyze > Descriptive Statistics > Explore... | Manual calculation from Q1, Q3, and IQR values | Data > Select Cases... > If | (Variable >= Lower_Bound AND Variable <= Upper_Bound)
Mahalanobis/Cook's Distance | Analyze > Regression > Linear... > Save (or other multivariate procedures) | Saved distance variables; specific cut-offs | Data > Select Cases... > If | MAH_1 <= Critical_Value

By following these steps, you can systematically identify and manage outliers in your SPSS dataset, which helps keep your results from being driven by a few extreme cases. Always document which cases you excluded, the criterion you used, and why.