In SAS, merging "variables" can refer to a couple of distinct operations. Most commonly, it refers to combining two datasets based on one or more common variables. This process is fundamental for data integration and is primarily achieved using the MERGE
statement within a DATA
step. Less commonly, it might refer to combining the values of two existing variables (e.g., concatenating text or performing calculations) to create a new variable.
Let's explore both interpretations to provide a comprehensive understanding.
Merging Datasets Based on Common Variables
The primary method for combining observations from two or more SAS datasets into a single new dataset, based on the values of shared variables, involves the MERGE
statement. This is often referred to as a "match-merge."
To successfully merge two datasets based on common variables in SAS, you follow these key steps:
- Sort Both Datasets: Before you can merge, both datasets must be sorted by the common variable(s) that will be used for matching. This is a critical prerequisite for the
MERGE
statement to work correctly. - Use the
MERGE
andBY
Statements: Within aDATA
step, you use theMERGE
statement to list the datasets you want to combine and theBY
statement to specify the common variable(s) to match observations.
Here's the basic syntax:
DATA New-Dataset-Name (OPTIONS);
MERGE Dataset-Name-1 (OPTIONS) Dataset-Name-2 (OPTIONS);
BY Variable(s);
RUN;
Step-by-Step Example
Let's say you have two datasets: Students
with StudentID
, FirstName
, LastName
and Grades
with StudentID
, Course
, Score
. You want to combine them to see each student's name alongside their grades.
1. Create Sample Data:
DATA Students;
INPUT StudentID FirstName $ LastName $;
DATALINES;
101 Alice Smith
102 Bob Johnson
103 Carol Davis
;
RUN;
DATA Grades;
INPUT StudentID Course $ Score;
DATALINES;
101 Math 90
101 Science 85
102 Math 78
103 History 92
103 Art 88
;
RUN;
2. Sort the Datasets by the Common Variable (StudentID
):
PROC SORT DATA=Students OUT=Sorted_Students;
BY StudentID;
RUN;
PROC SORT DATA=Grades OUT=Sorted_Grades;
BY StudentID;
RUN;
3. Merge the Sorted Datasets:
DATA Merged_Student_Grades;
MERGE Sorted_Students (IN=s) Sorted_Grades (IN=g);
BY StudentID;
* Optional: Keep only observations that exist in both datasets (inner join);
IF s AND g;
RUN;
PROC PRINT DATA=Merged_Student_Grades;
TITLE "Merged Student Grades Dataset";
RUN;
Explanation of the MERGE
Statement:
DATA Merged_Student_Grades;
: This initiates a new DATA step and specifies the name of the output dataset.MERGE Sorted_Students (IN=s) Sorted_Grades (IN=g);
: This statement lists the input datasets to be merged. TheIN=
option creates a temporary boolean variable (e.g.,s
andg
) that is1
if an observation contributes from that dataset, and0
otherwise. This is useful for controlling the merge type (e.g., an inner join usingIF s AND g;
).BY StudentID;
: This crucial statement tells SAS to match observations acrossSorted_Students
andSorted_Grades
based on theStudentID
variable.IF s AND g;
: This conditional statement ensures that only observations present in bothSorted_Students
(indicated bys=1
) andSorted_Grades
(indicated byg=1
) are written to theMerged_Student_Grades
dataset. Without this line, SAS performs a full outer merge, keeping all observations from both datasets and populating missing values with.
.
Types of Merges
The MERGE
statement in SAS performs a match-merge, which by default is a full outer merge. However, by using the IN=
option, you can achieve various join types:
Merge Type | Description | IN= Logic (Example) |
---|---|---|
Inner Join | Includes only observations that have a match in both datasets. | IF s AND g; |
Left Join | Includes all observations from the first (left) dataset and matching observations from the second (right) dataset. If no match, variables from the right dataset are missing. | IF s; |
Right Join | Includes all observations from the second (right) dataset and matching observations from the first (left) dataset. If no match, variables from the left dataset are missing. | IF g; |
Full Outer Join | Includes all observations from both datasets. If an observation exists in one dataset but not the other, missing values are generated for variables from the non-contributing dataset. This is the default behavior without IN= filtering. |
(No IF statement) |
For more detailed information on merging datasets, consult the SAS Support Communities or the official SAS Documentation.
Combining Variables Within or Across Datasets (Other Interpretations)
If "merging two variables" refers to combining the values of variables to create a new one, this is typically done using concatenation for character variables or arithmetic operations for numeric variables.
1. Concatenating Character Variables
To combine two or more character variables into a single new character variable, you can use concatenation operators or functions.
-
Concatenation Operator (
||
):DATA CombinedNames; SET Students; FullName = FirstName || ' ' || LastName; * Combines with a space in between; RUN; PROC PRINT DATA=CombinedNames; VAR StudentID FullName; TITLE "Concatenated FullName"; RUN;
-
CATX
Function: TheCATX
function is particularly useful as it automatically trims leading/trailing blanks and inserts a specified delimiter only between non-missing values.DATA CombinedNames; SET Students; FullName_CATX = CATX(' ', FirstName, LastName); * Automatically handles spaces and missing values; RUN; PROC PRINT DATA=CombinedNames; VAR StudentID FullName_CATX; TITLE "FullName using CATX Function"; RUN;
2. Performing Arithmetic Operations on Numeric Variables
If you need to combine numeric variables, you typically use arithmetic operators (+
, -
, *
, /
) or functions to create a new derived variable.
DATA CalculatedScores;
SET Grades;
* Assuming there's another variable 'HomeworkScore' to combine;
* For this example, let's create a hypothetical 'BonusPoints' variable;
BonusPoints = 5;
TotalScore = Score + BonusPoints;
AverageScore = MEAN(Score, BonusPoints); * Using a function to calculate mean;
RUN;
PROC PRINT DATA=CalculatedScores;
VAR StudentID Course Score BonusPoints TotalScore AverageScore;
TITLE "Calculated Scores";
RUN;
In summary, while the phrase "merge two variables" can be interpreted in a few ways, the most common and powerful application in SAS is the merging of datasets based on common variables, utilizing the MERGE
and BY
statements after ensuring both datasets are sorted. For combining variable values within a dataset, concatenation and arithmetic operations are the go-to methods.