
How to find substring in variable in SAS?

Published in SAS String Functions 6 mins read

You can find a substring within a character variable in SAS with the FIND, INDEX, and PRXMATCH functions or, inside WHERE expressions, with the CONTAINS operator. The functions return the position of the match, while CONTAINS gives a simple true/false result.

Understanding Substring Search in SAS

SAS provides robust capabilities for manipulating and searching character strings, which is crucial for data cleaning, transformation, and analysis. When you need to determine if a particular sequence of characters (a substring) exists within a larger text string (your variable), or to find its exact position, SAS offers a range of dedicated functions.

Key SAS Functions for Substring Search

Here are the primary methods for finding substrings in SAS, each with its own advantages:

1. The FIND Function

The FIND function is a versatile tool for locating substrings. It searches a string for the first occurrence of the specified substring and returns the starting position of that substring. If the substring is not found in the string, FIND returns a value of 0. It is case-sensitive by default but can be made case-insensitive.

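  • Syntax: FIND(string, substring <, modifiers> <, startpos>);

    • string: The character variable or literal to search within.
    • substring: The character string or literal to search for.
    • modifiers (optional):
      • 'i' or 'I': Performs a case-insensitive search.
      • 't' or 'T': Trims trailing blanks from the string and substring.
    • startpos (optional): The position at which the search starts. A negative value searches from that position toward the beginning of the string.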
  • Example:

    DATA search_example;
        text_var = 'SAS Programming is fun!';
        position_find = FIND(text_var, 'Programming');
        position_find_case_insensitive = FIND(text_var, 'programming', 'i');
        position_not_found = FIND(text_var, 'SQL');
        PUT 'FIND Position (case-sensitive): ' position_find;
        PUT 'FIND Position (case-insensitive): ' position_find_case_insensitive;
        PUT 'FIND Position (not found): ' position_not_found;
    RUN;
    • Output:
      FIND Position (case-sensitive): 5
      FIND Position (case-insensitive): 5
      FIND Position (not found): 0
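
The 't' modifier matters most when the substring comes from a variable rather than a literal, because the variable's value is padded with blanks to its declared length. The following is a minimal sketch of that situation; the variable names and lengths are illustrative assumptions, not part of the original example.

```sas
DATA _NULL_;
    LENGTH text_var $ 30 keyword $ 15;
    text_var = 'SAS Programming is fun!';
    keyword  = 'Programming';  /* stored as 'Programming' plus 4 trailing blanks */

    /* Without 't', FIND looks for the padded value, which does not occur */
    pos_padded  = FIND(text_var, keyword);

    /* 't' trims trailing blanks from both arguments before searching */
    pos_trimmed = FIND(text_var, keyword, 't');

    /* Modifiers can be combined: case-insensitive and trimmed */
    pos_both    = FIND(text_var, 'PROGRAMMING', 'it');

    PUT pos_padded= pos_trimmed= pos_both=;
    /* Log: pos_padded=0 pos_trimmed=5 pos_both=5 */
RUN;
```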

2. The INDEX Function

The INDEX function is very similar to FIND and often used interchangeably. It also returns the starting position of the first occurrence of a substring within a string. Like FIND, it returns 0 if the substring is not found.

  • Syntax: INDEX(string, substring);

    • string: The character variable or literal to search within.
    • substring: The character string or literal to search for.
  • Key Difference from FIND: INDEX is always case-sensitive and does not have built-in modifiers for case insensitivity or trimming. For case-insensitive searches with INDEX, you would typically convert both the string and substring to the same case using UPCASE or LOWCASE functions before applying INDEX.

  • Example:

    DATA search_example_index;
        description = 'Advanced SAS Analytics';
        pos_analytics = INDEX(description, 'Analytics');
        pos_sas_case_sensitive = INDEX(description, 'sas'); /* Will not find 'sas' */
        pos_sas_case_insensitive = INDEX(UPCASE(description), 'SAS'); /* Use UPCASE for case insensitivity */
        PUT 'INDEX Position (Analytics): ' pos_analytics;
        PUT 'INDEX Position (sas case-sensitive): ' pos_sas_case_sensitive;
        PUT 'INDEX Position (SAS case-insensitive): ' pos_sas_case_insensitive;
    RUN;
    • Output:
      INDEX Position (Analytics): 14
      INDEX Position (sas case-sensitive): 0
      INDEX Position (SAS case-insensitive): 10
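
When the search term is itself a variable, the same-case workaround for INDEX needs to be applied to both arguments, and the term should be trimmed because INDEX has no trimming modifier. A hedged sketch, using hypothetical variable names and lengths:

```sas
DATA _NULL_;
    LENGTH description $ 30 keyword $ 10;
    description = 'Advanced SAS Analytics';
    keyword     = 'sas';  /* padded to 10 characters with trailing blanks */

    /* Fails twice over: wrong case, and the padded keyword never occurs */
    pos_raw   = INDEX(description, keyword);

    /* Normalize case on both sides and trim the keyword */
    pos_fixed = INDEX(UPCASE(description), UPCASE(TRIM(keyword)));

    PUT pos_raw= pos_fixed=;
    /* Log: pos_raw=0 pos_fixed=10 */
RUN;
```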

3. The CONTAINS Operator

For a simple yes/no check on whether a substring exists within a variable, the CONTAINS operator is convenient. It evaluates to true (1) if the substring is found and false (0) if it is not. CONTAINS is valid only in WHERE expressions (the WHERE statement and the WHERE= data set option) and in PROC SQL. It cannot be used in a DATA step IF statement; use FIND or INDEX there instead.

  • Syntax: string CONTAINS substring (in WHERE expressions, ? is an equivalent shorthand)

  • Behavior: It is case-sensitive. To make it case-insensitive, apply UPCASE or LOWCASE to the variable and write the literal in the matching case.

  • Example:

    DATA products;
        LENGTH product_name $ 40;
        product_name = 'SAS Viya Platform'; OUTPUT;
        product_name = 'SAS Base Software'; OUTPUT;
    RUN;

    /* CONTAINS works in a WHERE statement, not in an IF statement */
    DATA cloud_products;
        SET products;
        WHERE product_name CONTAINS 'Viya';
        category = 'Cloud Product';
        PUT 'Category for Viya: ' category;
    RUN;

    DATA core_products;
        SET products;
        WHERE UPCASE(product_name) CONTAINS 'SOFTWARE'; /* Case-insensitive check */
        category = 'Core Product';
        PUT 'Category for Base Software: ' category;
    RUN;
    • Output:
      Category for Viya: Cloud Product
      Category for Base Software: Core Product
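
CONTAINS can also filter rows without a separate DATA step. The sketch below shows it in a WHERE= data set option (using the ? shorthand) and in a PROC SQL WHERE clause. The data set name product_list and its rows are made up for illustration.

```sas
/* Small illustrative data set (hypothetical values) */
DATA product_list;
    LENGTH product_name $ 40;
    product_name = 'SAS Viya Platform';    OUTPUT;
    product_name = 'SAS Base Software';    OUTPUT;
    product_name = 'SAS Enterprise Guide'; OUTPUT;
RUN;

/* WHERE= data set option; ? is shorthand for CONTAINS */
PROC PRINT DATA=product_list(WHERE=(product_name ? 'Viya'));
RUN;

/* PROC SQL WHERE clause with a case-insensitive comparison */
PROC SQL;
    SELECT product_name
    FROM product_list
    WHERE UPCASE(product_name) CONTAINS 'SOFTWARE';
QUIT;
```

Each step should list a single row: the Viya product in the first, the Base Software product in the second.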

4. The PRXMATCH Function (Regular Expressions)

For more complex pattern matching, including wildcards, character classes, and quantifiers, use the PRXMATCH function. It accepts Perl regular expression syntax and returns the position at which the first match begins, or 0 if no match is found.

  • Syntax: PRXMATCH(regular_expression, string);

    • regular_expression: A regular expression pattern enclosed in forward slashes /pattern/ with optional modifiers (e.g., /pattern/i for case-insensitive).
    • string: The character variable or literal to search within.
  • Example:

    DATA regex_search;
        email_address = 'user@example.com'; /* illustrative address */
        phone_number = 'Call 555-123-4567 for support.';
    
        /* Find if it looks like an email address (contains @ and .) */
        is_email = PRXMATCH('/@.+\./', email_address); /* Pattern for '@' followed by chars, then '.' */
    
        /* Find if a 10-digit phone number pattern exists (case-insensitive for 'call') */
        is_phone = PRXMATCH('/call \d{3}-\d{3}-\d{4}/i', phone_number); /* \d{n} for n digits */
    
        /* PRXMATCH returns the match position, so any value > 0 means "found" */
        PUT 'Is Email: ' is_email;
        PUT 'Is Phone: ' is_phone;
    RUN;
    • Output:
      Is Email: 5
      Is Phone: 1
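
Because PRXMATCH returns a position rather than a true/false value, it is often convenient to compare the result with 0 to get a 0/1 flag. This is a minimal sketch with made-up variable names:

```sas
DATA _NULL_;
    note = 'Call 555-123-4567 for support.';

    /* Position where the phone pattern starts (0 if absent) */
    phone_pos = PRXMATCH('/\d{3}-\d{3}-\d{4}/', note);

    /* A comparison yields 1 (true) or 0 (false) */
    has_phone = (phone_pos > 0);
    has_email = (PRXMATCH('/@.+\./', note) > 0);

    PUT phone_pos= has_phone= has_email=;
    /* Log: phone_pos=6 has_phone=1 has_email=0 */
RUN;
```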

Comparison of Substring Search Functions

| Function | Purpose | Case Sensitivity | Returns | Regular Expressions |
|----------|---------|------------------|---------|---------------------|
| FIND | Position of first occurrence, with modifiers | Default: Yes (Can be No with 'i' modifier) | Position (0 if not found) | No |
| INDEX | Position of first occurrence | Yes | Position (0 if not found) | No |
| CONTAINS | Boolean check for existence | Yes | 1 (true) or 0 (false) | No |
| PRXMATCH | Position of first regex match | Default: Yes (Can be No with 'i' modifier) | Position (0 if not found) | Yes |

Practical Considerations

  • Case Sensitivity: Always be mindful of case. Use UPCASE() or LOWCASE() for INDEX and CONTAINS if you need a case-insensitive search. FIND and PRXMATCH offer direct modifiers.
  • Trailing Blanks: Character variables in SAS have fixed lengths, so shorter values are padded with trailing blanks. This mainly causes problems when the substring you search for is itself a variable, because the padding becomes part of the search term. If a search fails unexpectedly, apply TRIM() to the substring argument, or use the 't' modifier with FIND.
  • Performance: For simple existence checks in a WHERE expression, CONTAINS is usually sufficient. For positional information, use FIND or INDEX. Reserve PRXMATCH for patterns that a literal substring search cannot express, since regular-expression matching typically carries more overhead than a plain search.
  • Multiple Occurrences: These functions return only the first occurrence. To find all occurrences, call the function in a loop, either advancing FIND's optional start position past each match or searching the remainder of the string with SUBSTR().
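
One way to loop over every occurrence is to use FIND's optional start-position argument and move it past each match. The sketch below assumes non-overlapping matches are wanted; the sample text and variable names are illustrative.

```sas
DATA _NULL_;
    text_var = 'SAS code, SAS logs, and SAS data sets';
    target   = 'SAS';

    start = 1;
    pos = FIND(text_var, target, 't', start);

    DO WHILE (pos > 0);
        PUT 'Found at position ' pos;
        /* Resume the search just past the current match */
        start = pos + LENGTH(target);
        pos = FIND(text_var, target, 't', start);
    END;
    /* Log: positions 1, 11, and 25 */
RUN;
```

To count overlapping matches as well, advance start by 1 instead of LENGTH(target).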

By understanding these functions and operators, you can effectively search for and identify substrings within your SAS variables, enabling robust data manipulation and analysis.