Ora

How to Filter Rows in Pandas Python?

Published in Pandas Data Filtering 8 mins read

Filtering rows in Pandas DataFrames is a fundamental operation that allows you to select specific data based on one or more conditions, making your data analysis more focused and efficient.

Pandas offers several powerful and flexible methods to achieve this, primarily through boolean indexing, label-based indexing (.loc[]), positional indexing (.iloc[]), and the convenient .query() method. Understanding these techniques is crucial for effective data manipulation in Python.

1. Filtering with Boolean Indexing

Boolean indexing is the most common and powerful way to filter rows in Pandas. It works by creating a boolean Series (a sequence of True/False values) where True indicates rows that meet the condition and False indicates those that don't. When this boolean Series is passed to the DataFrame, only rows corresponding to True values are returned.

Let's use a sample DataFrame for our examples:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 28, 22, 40],
    'City': ['New York', 'London', 'Paris', 'New York', 'London', 'Berlin'],
    'Salary': [70000, 85000, 90000, 72000, 65000, 95000],
    'Department': ['HR', 'IT', 'Marketing', 'HR', 'IT', 'Sales']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Output:

Original DataFrame:
      Name  Age      City  Salary Department
0    Alice   25  New York   70000         HR
1      Bob   30    London   85000         IT
2  Charlie   35     Paris   90000  Marketing
3    David   28  New York   72000         HR
4      Eve   22    London   65000         IT
5    Frank   40    Berlin   95000      Sales

a. Filtering with a Single Condition

You can filter rows based on a single condition applied to a column.

  • Example: Select rows where 'Age' is greater than 30.

    # Create a boolean Series
    condition_age = df['Age'] > 30
    print("\nBoolean Series for Age > 30:")
    print(condition_age)
    
    # Filter the DataFrame
    filtered_df_age = df[condition_age]
    print("\nFiltered by Age > 30:")
    print(filtered_df_age)

    Output:

    Boolean Series for Age > 30:
    0    False
    1    False
    2     True
    3    False
    4    False
    5     True
    Name: Age, dtype: bool
    
    Filtered by Age > 30:
        Name  Age    City  Salary Department
    2  Charlie   35   Paris   90000  Marketing
    5    Frank   40  Berlin   95000      Sales

b. Filtering with Multiple Conditions

You can combine multiple conditions using logical operators:

  • & (AND): Both conditions must be true.
  • | (OR): At least one condition must be true.
  • ~ (NOT): Negates a condition.

Important: Each condition must be enclosed in parentheses () for correct evaluation.

  • Example: Select rows where 'City' is 'New York' AND 'Salary' is greater than 70000.

    filtered_df_and = df[(df['City'] == 'New York') & (df['Salary'] > 70000)]
    print("\nFiltered by City 'New York' AND Salary > 70000:")
    print(filtered_df_and)

    Output:

    Filtered by City 'New York' AND Salary > 70000:
        Name  Age      City  Salary Department
    3  David   28  New York   72000         HR
  • Example: Select rows where 'Department' is 'HR' OR 'IT'.

    filtered_df_or = df[(df['Department'] == 'HR') | (df['Department'] == 'IT')]
    print("\nFiltered by Department 'HR' OR 'IT':")
    print(filtered_df_or)

    Output:

    Filtered by Department 'HR' OR 'IT':
      Name  Age      City  Salary Department
    0  Alice   25  New York   70000         HR
    1    Bob   30    London   85000         IT
    3  David   28  New York   72000         HR
    4    Eve   22    London   65000         IT

c. Filtering using isin() for Multiple Values

The .isin() method is highly efficient for filtering a column based on whether its values are present in a list of desired values.

  • Example: Select rows where 'City' is 'New York' or 'London'.

    filtered_df_isin = df[df['City'].isin(['New York', 'London'])]
    print("\nFiltered by City 'New York' or 'London' using isin():")
    print(filtered_df_isin)

    Output:

    Filtered by City 'New York' or 'London' using isin():
      Name  Age      City  Salary Department
    0  Alice   25  New York   70000         HR
    1    Bob   30    London   85000         IT
    3  David   28  New York   72000         HR
    4    Eve   22    London   65000         IT

    For more on isin(), refer to the Pandas documentation.

d. Filtering Text Data with str Accessor

For string manipulation and filtering, Pandas Series have a .str accessor that provides a variety of string methods.

  • Example: Select rows where 'Name' contains the letter 'a'.

    filtered_df_str = df[df['Name'].str.contains('a', case=False)]
    print("\nFiltered by Name containing 'a' (case-insensitive):")
    print(filtered_df_str)

    Output:

    Filtered by Name containing 'a' (case-insensitive):
        Name  Age      City  Salary Department
    0    Alice   25  New York   70000         HR
    2  Charlie   35     Paris   90000  Marketing
    3    David   28  New York   72000         HR

e. Filtering with isna() and notna() for Missing Data

To filter rows based on the presence or absence of missing (NaN) values, use .isna() or .notna().

  • Example: Create a row with missing data and filter.

    df_missing = df.copy()
    df_missing.loc[2, 'Salary'] = None # Introduce a missing value
    filtered_df_na = df_missing[df_missing['Salary'].isna()]
    print("\nDataFrame with missing value:")
    print(df_missing)
    print("\nRows where Salary is NaN:")
    print(filtered_df_na)

    Output:

    DataFrame with missing value:
        Name  Age      City   Salary Department
    0    Alice   25  New York  70000.0         HR
    1      Bob   30    London  85000.0         IT
    2  Charlie   35     Paris      NaN  Marketing
    3    David   28  New York  72000.0         HR
    4      Eve   22    London  65000.0         IT
    5    Frank   40    Berlin  95000.0      Sales
    
    Rows where Salary is NaN:
        Name  Age   City  Salary Department
    2  Charlie   35  Paris     NaN  Marketing

2. Filtering with .query()

The .query() method provides a more SQL-like syntax for filtering, which can be more readable for complex conditions, especially when dealing with string comparisons or referencing variables.

  • Example: Select rows where 'Age' > 25 AND 'City' is 'London'.

    filtered_df_query = df.query("Age > 25 and City == 'London'")
    print("\nFiltered using .query() (Age > 25 and City == 'London'):")
    print(filtered_df_query)

    Output:

    Filtered using .query() (Age > 25 and City == 'London'):
      Name  Age    City  Salary Department
    1  Bob   30  London   85000         IT

You can also use variables within query() by prefixing them with @.

  • Example: Filter using a variable for age threshold.

    min_age = 30
    filtered_df_query_var = df.query("Age >= @min_age")
    print(f"\nFiltered using .query() with variable (Age >= {min_age}):")
    print(filtered_df_query_var)

    Output:

    Filtered using .query() with variable (Age >= 30):
        Name  Age    City  Salary Department
    1    Bob   30  London   85000         IT
    2  Charlie   35   Paris   90000  Marketing
    5    Frank   40  Berlin   95000      Sales

    Learn more about .query() in the Pandas documentation.

3. Filtering by Row Labels or Position

While boolean indexing is for content-based filtering, you can also select rows based on their explicit labels or integer positions using .loc[] and .iloc[] respectively. This is more about selection than filtering based on data values within the cells.

a. .loc[] (Label-based Indexing)

Use .loc[] when you know the specific index labels of the rows you want to select. It also allows for filtering columns by label.

  • Example: Select rows with index labels 0 and 3.

    filtered_df_loc_labels = df.loc[[0, 3]]
    print("\nSelected rows by label (0 and 3) using .loc[]:")
    print(filtered_df_loc_labels)

    Output:

    Selected rows by label (0 and 3) using .loc[]:
         Name  Age      City  Salary Department
    0   Alice   25  New York   70000         HR
    3   David   28  New York   72000         HR

    loc can also take a boolean Series, effectively combining label-based selection with value-based filtering.

b. .iloc[] (Integer-position based Indexing)

Use .iloc[] when you know the integer positions of the rows (similar to list indexing).

  • Example: Select rows at integer positions 1 and 4.

    filtered_df_iloc_positions = df.iloc[[1, 4]]
    print("\nSelected rows by integer position (1 and 4) using .iloc[]:")
    print(filtered_df_iloc_positions)

    Output:

    Selected rows by integer position (1 and 4) using .iloc[]:
      Name  Age    City  Salary Department
    1  Bob   30  London   85000         IT
    4  Eve   22  London   65000         IT

    For more details, refer to the Pandas .loc and .iloc documentation.

Comparison of .loc[] vs .iloc[]

Feature .loc[] .iloc[]
Index Type Label-based (uses actual row/column names) Integer-position based (0-indexed)
Slicing End Inclusive (e.g., start:end includes end) Exclusive (e.g., start:end excludes end)
Input for Rows Single label, list of labels, boolean array, slice of labels Single integer, list of integers, slice of integers
Input for Cols Single label, list of labels, boolean array, slice of labels Single integer, list of integers, slice of integers

4. Iterating Through Rows (Less Common for Filtering)

While vectorized operations (like boolean indexing and .query()) are almost always preferred for performance, especially on large datasets, there might be specific scenarios where iterating through rows is necessary for complex, row-by-row logic.

Pandas offers methods like iterrows() for this purpose. The iterrows() method generates an iterator object of the DataFrame, allowing you to iterate each row in the DataFrame. Each iteration produces an index object and a row object (a Pandas Series object).

  • Example: Filtering rows based on a complex custom function (demonstrative, not recommended for performance).

    filtered_rows_list = []
    for index, row in df.iterrows():
        # A hypothetical complex condition that's hard to vectorize
        if (row['Age'] * row['Salary']) > 2000000 and 'i' in row['Department'].lower():
            filtered_rows_list.append(row)
    
    filtered_df_iter = pd.DataFrame(filtered_rows_list)
    print("\nFiltered using iterrows() (for demonstration of row access):")
    print(filtered_df_iter)

    Output:

    Filtered using iterrows() (for demonstration of row access):
      Name  Age    City  Salary Department
    1  Bob   30  London   85000         IT

Important Note: Using iterrows() for filtering is generally much slower than vectorized methods for large DataFrames. It should be used judiciously for specific, non-vectorizable tasks where you absolutely need to process each row individually. For most filtering needs, stick to boolean indexing or .query().

Best Practices for Filtering in Pandas

  • Prioritize Vectorized Operations: Always opt for boolean indexing, .query(), .isin(), or .str methods over explicit loops (.iterrows(), .apply(axis=1)) for performance reasons.
  • Use Parentheses for Multiple Conditions: Ensure each condition is enclosed in parentheses when combining with & or |.
  • Readability: For complex filters, consider using .query() for better readability, or break down your boolean conditions into named variables.
  • Chaining Operations: You can chain filtering with other Pandas operations (e.g., df[df['Age'] > 30]['Salary'].mean()).
  • Understand loc vs iloc: Know when to use label-based (.loc) versus integer-position based (.iloc) indexing.

By mastering these techniques, you'll be able to effectively isolate and analyze the specific data you need within your Pandas DataFrames.