SPSS has no single "automatic" button for removing outliers. The process has two main steps: identify the outliers, then exclude them from your analysis using the "Select Cases" function.
Each step involves a judgment call: you define what constitutes an outlier, then instruct SPSS to filter those cases.
Understanding Outliers and Why They Matter
Outliers are data points that significantly deviate from other observations in a dataset. They can occur due to measurement errors, data entry mistakes, or genuinely unusual but valid data points. Their presence can distort statistical analyses, affecting means, standard deviations, correlations, and regression models.
Before attempting to remove outliers, it's crucial to:
- Investigate their cause: Is it a data entry error, a measurement error, or a legitimate extreme value?
- Understand the impact: How do they affect your analysis results?
- Consider alternatives: Instead of removal, could you transform the data, use robust statistical methods, or analyze the data with and without outliers?
Step 1: Identifying Outliers in SPSS
Identifying outliers typically involves visual inspection or statistical criteria. Here are common methods in SPSS:
1. Using Box Plots
Box plots are excellent visual tools for identifying outliers, which are usually represented as individual points beyond the "whiskers" of the box plot.
- Path: Analyze > Descriptive Statistics > Explore...
- Steps:
  - Move the variable(s) of interest into the Dependent List.
  - Click Statistics..., check Outliers, and click Continue.
  - Click Plots..., and under Boxplots select Factor levels together or Dependents together, then click Continue.
  - Click OK.
- Interpretation: SPSS will generate box plots and, because Outliers was checked in the Statistics dialog, an "Extreme Values" table listing the case numbers and values of the five highest and five lowest observations, which often include outliers.
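The same output can also be requested from a syntax window. The following is a minimal sketch, assuming a hypothetical variable named Income; the syntax that the Explore dialog pastes may include additional subcommands depending on your SPSS version.

```spss
* Box plot plus the Extreme Values table for a hypothetical variable Income.
EXAMINE VARIABLES=Income
  /PLOT BOXPLOT
  /STATISTICS DESCRIPTIVES EXTREME.
```

The EXTREME keyword corresponds to the Outliers checkbox and lists the five highest and five lowest values by default.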
2. Using Z-Scores
Z-scores (standardized scores) measure how many standard deviations an observation is from the mean. A common rule of thumb for identifying outliers is a Z-score with an absolute value greater than 3, or greater than 3.29 for a stricter cutoff (roughly p < .001, two-tailed, if the data are normally distributed).
- Path: Analyze > Descriptive Statistics > Descriptives...
- Steps:
  - Move the variable(s) of interest into the Variable(s) list.
  - Check the box Save standardized values as variables.
  - Click OK.
- Interpretation: SPSS will add a new variable to your dataset for each variable analyzed. The new variable's name is the original name with a Z prefix (e.g., ZVAR_NAME), and its label is typically Zscore(VAR_NAME). You can then sort these Z-score variables to quickly find cases with extreme values.
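In syntax form, the steps above might look like the following sketch. Income and z_outlier are hypothetical names chosen for illustration; ZIncome is the name SPSS would normally assign to the saved Z-score variable, assuming no variable with that name already exists.

```spss
* Save z-scores for a hypothetical variable Income as ZIncome.
DESCRIPTIVES VARIABLES=Income
  /SAVE.

* Flag cases beyond the |z| > 3 rule of thumb and count them.
COMPUTE z_outlier = (ABS(ZIncome) > 3).
EXECUTE.
FREQUENCIES VARIABLES=z_outlier.
```

A flag variable like this leaves the data untouched, so you can inspect the flagged cases before deciding whether to exclude them.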
3. Using the Interquartile Range (IQR) Rule
The IQR method defines outliers as any data point that falls below Q1 - (1.5 × IQR) or above Q3 + (1.5 × IQR), where Q1 is the first quartile, Q3 is the third quartile, and IQR = Q3 - Q1.
- Path: Analyze > Descriptive Statistics > Explore...
- Steps:
  - Move the variable(s) of interest into the Dependent List.
  - Click Statistics..., ensure Descriptives is checked, and also check Percentiles.
  - Click Continue and then OK.
- Interpretation: The Descriptives table reports the IQR, and the Percentiles table reports the 25th and 75th percentiles (Q1 and Q3). You'll then need to manually calculate the upper and lower bounds for outliers and identify cases that fall outside these bounds.
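As a worked illustration of the manual calculation, the sketch below requests the quartiles and then flags cases outside the bounds. The variable name VAR_A, the flag name iqr_outlier, and the quartile values (Q1 = 20, Q3 = 35) are all made up for the example; substitute the values from your own output.

```spss
* Request the IQR and the quartiles for a hypothetical variable VAR_A.
EXAMINE VARIABLES=VAR_A
  /PLOT NONE
  /STATISTICS DESCRIPTIVES
  /PERCENTILES(25,50,75) HAVERAGE.

* Suppose the output shows Q1 = 20 and Q3 = 35, so IQR = 15 (made-up values).
* Lower bound = 20 - 1.5 * 15 = -2.5 and upper bound = 35 + 1.5 * 15 = 57.5.
COMPUTE iqr_outlier = (VAR_A < -2.5 OR VAR_A > 57.5).
EXECUTE.
```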
4. Multivariate Outliers (Advanced)
For multivariate data, where outliers might not be obvious in individual variables but stand out when considering multiple variables together, measures like Mahalanobis Distance or Cook's Distance can be used. Both can be saved as new variables from the Save dialog in Analyze > Regression > Linear..., or obtained from specific multivariate procedures.
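One way to obtain Mahalanobis distances is sketched below, under several assumptions: VAR_1 to VAR_3 are hypothetical variables to screen, case_id is a hypothetical numeric variable used only as a placeholder dependent variable (the distance depends on the predictors alone), and the cutoff is the commonly used chi-square critical value at p < .001 rather than a rule stated in this article.

```spss
* Save Mahalanobis distances for three hypothetical variables.
REGRESSION
  /DEPENDENT case_id
  /METHOD=ENTER VAR_1 VAR_2 VAR_3
  /SAVE MAHAL.

* SPSS names the saved variable MAH_1 on the first run.
* Compare it with a chi-square critical value, with df = number of variables (3).
COMPUTE mah_outlier = (MAH_1 > IDF.CHISQ(0.999, 3)).
EXECUTE.
```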
Step 2: Removing/Excluding Outliers using "Select Cases"
Once you have identified the specific case numbers or established a criterion for outliers (e.g., an absolute Z-score greater than 3), you can use the Select Cases function to exclude them from your analysis.
- Path: Data > Select Cases...
- Steps:
  - Choose If condition is satisfied.
  - Click the If... button to open the "Select Cases: If" dialog.
  - Define your outlier exclusion condition:
    - Using Z-scores: If you created Z-score variables (e.g., ZVAR_1), you can enter a condition like: ABS(ZVAR_1) <= 3 (to select cases within 3 standard deviations, thereby excluding those outside). To keep only cases that are non-outliers on several variables: (ABS(ZVAR_1) <= 3 AND ABS(ZVAR_2) <= 3 AND ABS(ZVAR_3) <= 3).
    - Using Case Numbers: If you identified specific case numbers (e.g., cases 5, 12, 23) from an "Extreme Values" table, you could enter: NOT ($CASENUM = 5 OR $CASENUM = 12 OR $CASENUM = 23). $CASENUM is the SPSS system variable for a case's current position in the file (SPSS has no built-in CASE_N variable). Because that position changes when the data are re-sorted or cases are deleted, a condition on a stored ID variable is more robust.
    - Using IQR Rule (manual bounds): If your variable VAR_A has IQR bounds of 10 and 50, you'd enter: (VAR_A >= 10 AND VAR_A <= 50).
  - Click Continue.
  - In the Select Cases dialog, under Output, choose:
    - Filter out unselected cases (Recommended): This option temporarily excludes the identified outliers from analyses but keeps them in your dataset. SPSS will indicate filtered cases with a diagonal line through their row number. You can easily revert this by going back to Select Cases and choosing All cases.
    - Delete unselected cases: This permanently removes the outliers from your active dataset. Use with extreme caution and always save a backup of your data first.
  - Click OK.
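The dialog choices above correspond roughly to the following syntax. This is a sketch rather than the exact syntax the dialog pastes: case_id and keep_case are hypothetical helper variables, and the case numbers 5, 12, and 23 are the illustrative ones used in the steps. The row position is copied into case_id first because $CASENUM, the SPSS system variable for a case's position in the file, is not dependable inside SELECT IF.

```spss
* Store the current row position so case-number conditions stay stable.
COMPUTE case_id = $CASENUM.
EXECUTE.

* Option 1: filter (reversible). Cases with keep_case = 0 are ignored in analyses.
COMPUTE keep_case = NOT ANY(case_id, 5, 12, 23).
FILTER BY keep_case.
EXECUTE.

* Undo the filter and return to all cases.
FILTER OFF.
USE ALL.
EXECUTE.

* Option 2: delete (permanent in the active dataset, so save a backup first).
* SELECT IF NOT ANY(case_id, 5, 12, 23).
* EXECUTE.
```

Option 2 is left commented out deliberately, so that running the whole block cannot delete cases by accident.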
Example: Excluding Outliers Based on Z-Score
Let's say you have a variable Income and you've saved its Z-scores as ZIncome (labeled Zscore(Income)). You want to exclude any Income value with a Z-score greater than 3 or less than -3.
- Go to Data > Select Cases...
- Select If condition is satisfied.
- Click If...
- Enter the condition: ABS(ZIncome) <= 3
- Click Continue.
- Select Filter out unselected cases.
- Click OK.
Now, any subsequent analyses will be run only on the cases where ZIncome is between -3 and 3 (inclusive).
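The whole example can be sketched in syntax as follows, assuming the variable is named Income and no variable called ZIncome exists yet. filter_$ is the filter variable name that the Select Cases dialog normally generates; any valid name would work. Running the same descriptives with and without the filter is one simple way to see how much the excluded cases affect the results.

```spss
* Save z-scores for Income as ZIncome.
DESCRIPTIVES VARIABLES=Income
  /SAVE.

* Keep only cases within 3 standard deviations of the mean.
USE ALL.
COMPUTE filter_$ = (ABS(ZIncome) <= 3).
FILTER BY filter_$.
EXECUTE.

* Analyses now run on the filtered cases only.
DESCRIPTIVES VARIABLES=Income.

* Switch the filter off to compare against the full sample.
FILTER OFF.
USE ALL.
DESCRIPTIVES VARIABLES=Income.
```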
Summary Table: Outlier Identification and Selection in SPSS
Method for Identification | SPSS Menu Path (Identification) | How to Identify | SPSS Menu Path (Exclusion) | Exclusion Condition Example |
---|---|---|---|---|
Box Plots | Analyze > Descriptive Statistics > Explore... | Visual; Extreme Values table | Data > Select Cases... > If | NOT ($CASENUM = 12 OR $CASENUM = 45) (using identified case numbers) |
Z-Scores | Analyze > Descriptive Statistics > Descriptives... (check "Save standardized values as variables") | New Z-score variable in Data View | Data > Select Cases... > If | ABS(ZVariable) <= 3 |
IQR Rule | Analyze > Descriptive Statistics > Explore... | Manual calculation from Q1, Q3, IQR values | Data > Select Cases... > If | (Variable >= Lower_Bound AND Variable <= Upper_Bound) |
Mahalanobis/Cook's Distance | Typically from Regression or Advanced Modules | Output tables, specific cut-offs | Data > Select Cases... > If | Mahal_Dist <= Critical_Value |
By following these steps, you can systematically identify and manage outliers in your SPSS dataset so that extreme values do not unduly distort your results. Always document how you handled outliers, including the criterion used and the number of cases excluded.