How do you merge two variables in SAS?

In SAS, merging "variables" can refer to a couple of distinct operations. Most commonly, it refers to combining two datasets based on one or more common variables. This process is fundamental for data integration and is primarily achieved using the MERGE statement within a DATA step. Less commonly, it might refer to combining the values of two existing variables (e.g., concatenating text or performing calculations) to create a new variable.

Let's explore both interpretations to provide a comprehensive understanding.

Merging Datasets Based on Common Variables

The primary method for combining observations from two or more SAS datasets into a single new dataset, based on the values of shared variables, involves the MERGE statement. This is often referred to as a "match-merge."

To successfully merge two datasets based on common variables in SAS, you follow these key steps:

Sort Both Datasets: Before you can merge, both datasets must be sorted by the common variable(s) that will be used for matching. This is a critical prerequisite for the MERGE statement to work correctly.
Use the MERGE and BY Statements: Within a DATA step, you use the MERGE statement to list the datasets you want to combine and the BY statement to specify the common variable(s) to match observations.

Here's the basic syntax:

DATA New-Dataset-Name (OPTIONS);
    MERGE Dataset-Name-1 (OPTIONS) Dataset-Name-2 (OPTIONS);
    BY Variable(s);
RUN;

Step-by-Step Example

Let's say you have two datasets: Students with StudentID, FirstName, LastName and Grades with StudentID, Course, Score. You want to combine them to see each student's name alongside their grades.

1. Create Sample Data:

DATA Students;
    INPUT StudentID FirstName $ LastName $;
    DATALINES;
    101 Alice Smith
    102 Bob Johnson
    103 Carol Davis
    ;
RUN;

DATA Grades;
    INPUT StudentID Course $ Score;
    DATALINES;
    101 Math 90
    101 Science 85
    102 Math 78
    103 History 92
    103 Art 88
    ;
RUN;

2. Sort the Datasets by the Common Variable (StudentID):

PROC SORT DATA=Students OUT=Sorted_Students;
    BY StudentID;
RUN;

PROC SORT DATA=Grades OUT=Sorted_Grades;
    BY StudentID;
RUN;

3. Merge the Sorted Datasets:

DATA Merged_Student_Grades;
    MERGE Sorted_Students (IN=s) Sorted_Grades (IN=g);
    BY StudentID;
    * Optional: Keep only observations that exist in both datasets (inner join);
    IF s AND g;
RUN;

PROC PRINT DATA=Merged_Student_Grades;
    TITLE "Merged Student Grades Dataset";
RUN;

Explanation of the MERGE Statement:

DATA Merged_Student_Grades;: This initiates a new DATA step and specifies the name of the output dataset.
MERGE Sorted_Students (IN=s) Sorted_Grades (IN=g);: This statement lists the input datasets to be merged. The IN= option creates a temporary boolean variable (e.g., s and g) that is 1 if an observation contributes from that dataset, and 0 otherwise. This is useful for controlling the merge type (e.g., an inner join using IF s AND g;).
BY StudentID;: This crucial statement tells SAS to match observations across Sorted_Students and Sorted_Grades based on the StudentID variable.
IF s AND g;: This conditional statement ensures that only observations present in both Sorted_Students (indicated by s=1) and Sorted_Grades (indicated by g=1) are written to the Merged_Student_Grades dataset. Without this line, SAS performs a full outer merge, keeping all observations from both datasets and populating missing values with ..

Types of Merges

The MERGE statement in SAS performs a match-merge, which by default is a full outer merge. However, by using the IN= option, you can achieve various join types:

Merge Type	Description	`IN=` Logic (Example)
Inner Join	Includes only observations that have a match in both datasets.	`IF s AND g;`
Left Join	Includes all observations from the first (left) dataset and matching observations from the second (right) dataset. If no match, variables from the right dataset are missing.	`IF s;`
Right Join	Includes all observations from the second (right) dataset and matching observations from the first (left) dataset. If no match, variables from the left dataset are missing.	`IF g;`
Full Outer Join	Includes all observations from both datasets. If an observation exists in one dataset but not the other, missing values are generated for variables from the non-contributing dataset. This is the default behavior without `IN=` filtering.	(No `IF` statement)

For more detailed information on merging datasets, consult the SAS Support Communities or the official SAS Documentation.

Combining Variables Within or Across Datasets (Other Interpretations)

If "merging two variables" refers to combining the values of variables to create a new one, this is typically done using concatenation for character variables or arithmetic operations for numeric variables.

1. Concatenating Character Variables

To combine two or more character variables into a single new character variable, you can use concatenation operators or functions.

Concatenation Operator (||):

DATA CombinedNames;
    SET Students;
    FullName = FirstName || ' ' || LastName; * Combines with a space in between;
RUN;

PROC PRINT DATA=CombinedNames;
    VAR StudentID FullName;
    TITLE "Concatenated FullName";
RUN;

CATX Function: The CATX function is particularly useful as it automatically trims leading/trailing blanks and inserts a specified delimiter only between non-missing values.

DATA CombinedNames;
    SET Students;
    FullName_CATX = CATX(' ', FirstName, LastName); * Automatically handles spaces and missing values;
RUN;

PROC PRINT DATA=CombinedNames;
    VAR StudentID FullName_CATX;
    TITLE "FullName using CATX Function";
RUN;

2. Performing Arithmetic Operations on Numeric Variables

If you need to combine numeric variables, you typically use arithmetic operators (+, -, *, /) or functions to create a new derived variable.

DATA CalculatedScores;
    SET Grades;
    * Assuming there's another variable 'HomeworkScore' to combine;
    * For this example, let's create a hypothetical 'BonusPoints' variable;
    BonusPoints = 5;
    TotalScore = Score + BonusPoints;
    AverageScore = MEAN(Score, BonusPoints); * Using a function to calculate mean;
RUN;

PROC PRINT DATA=CalculatedScores;
    VAR StudentID Course Score BonusPoints TotalScore AverageScore;
    TITLE "Calculated Scores";
RUN;

In summary, while the phrase "merge two variables" can be interpreted in a few ways, the most common and powerful application in SAS is the merging of datasets based on common variables, utilizing the MERGE and BY statements after ensuring both datasets are sorted. For combining variable values within a dataset, concatenation and arithmetic operations are the go-to methods.