Filtering rows in Pandas DataFrames is a fundamental operation that allows you to select specific data based on one or more conditions, making your data analysis more focused and efficient.
Pandas offers several powerful and flexible methods to achieve this, primarily through boolean indexing, label-based indexing (.loc[]
), positional indexing (.iloc[]
), and the convenient .query()
method. Understanding these techniques is crucial for effective data manipulation in Python.
1. Filtering with Boolean Indexing
Boolean indexing is the most common and powerful way to filter rows in Pandas. It works by creating a boolean Series (a sequence of True
/False
values) where True
indicates rows that meet the condition and False
indicates those that don't. When this boolean Series is passed to the DataFrame, only rows corresponding to True
values are returned.
Let's use a sample DataFrame for our examples:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Age': [25, 30, 35, 28, 22, 40],
'City': ['New York', 'London', 'Paris', 'New York', 'London', 'Berlin'],
'Salary': [70000, 85000, 90000, 72000, 65000, 95000],
'Department': ['HR', 'IT', 'Marketing', 'HR', 'IT', 'Sales']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
Name Age City Salary Department
0 Alice 25 New York 70000 HR
1 Bob 30 London 85000 IT
2 Charlie 35 Paris 90000 Marketing
3 David 28 New York 72000 HR
4 Eve 22 London 65000 IT
5 Frank 40 Berlin 95000 Sales
a. Filtering with a Single Condition
You can filter rows based on a single condition applied to a column.
-
Example: Select rows where 'Age' is greater than 30.
# Create a boolean Series condition_age = df['Age'] > 30 print("\nBoolean Series for Age > 30:") print(condition_age) # Filter the DataFrame filtered_df_age = df[condition_age] print("\nFiltered by Age > 30:") print(filtered_df_age)
Output:
Boolean Series for Age > 30: 0 False 1 False 2 True 3 False 4 False 5 True Name: Age, dtype: bool Filtered by Age > 30: Name Age City Salary Department 2 Charlie 35 Paris 90000 Marketing 5 Frank 40 Berlin 95000 Sales
b. Filtering with Multiple Conditions
You can combine multiple conditions using logical operators:
&
(AND): Both conditions must be true.|
(OR): At least one condition must be true.~
(NOT): Negates a condition.
Important: Each condition must be enclosed in parentheses ()
for correct evaluation.
-
Example: Select rows where 'City' is 'New York' AND 'Salary' is greater than 70000.
filtered_df_and = df[(df['City'] == 'New York') & (df['Salary'] > 70000)] print("\nFiltered by City 'New York' AND Salary > 70000:") print(filtered_df_and)
Output:
Filtered by City 'New York' AND Salary > 70000: Name Age City Salary Department 3 David 28 New York 72000 HR
-
Example: Select rows where 'Department' is 'HR' OR 'IT'.
filtered_df_or = df[(df['Department'] == 'HR') | (df['Department'] == 'IT')] print("\nFiltered by Department 'HR' OR 'IT':") print(filtered_df_or)
Output:
Filtered by Department 'HR' OR 'IT': Name Age City Salary Department 0 Alice 25 New York 70000 HR 1 Bob 30 London 85000 IT 3 David 28 New York 72000 HR 4 Eve 22 London 65000 IT
c. Filtering using isin()
for Multiple Values
The .isin()
method is highly efficient for filtering a column based on whether its values are present in a list of desired values.
-
Example: Select rows where 'City' is 'New York' or 'London'.
filtered_df_isin = df[df['City'].isin(['New York', 'London'])] print("\nFiltered by City 'New York' or 'London' using isin():") print(filtered_df_isin)
Output:
Filtered by City 'New York' or 'London' using isin(): Name Age City Salary Department 0 Alice 25 New York 70000 HR 1 Bob 30 London 85000 IT 3 David 28 New York 72000 HR 4 Eve 22 London 65000 IT
For more on
isin()
, refer to the Pandas documentation.
d. Filtering Text Data with str
Accessor
For string manipulation and filtering, Pandas Series have a .str
accessor that provides a variety of string methods.
-
Example: Select rows where 'Name' contains the letter 'a'.
filtered_df_str = df[df['Name'].str.contains('a', case=False)] print("\nFiltered by Name containing 'a' (case-insensitive):") print(filtered_df_str)
Output:
Filtered by Name containing 'a' (case-insensitive): Name Age City Salary Department 0 Alice 25 New York 70000 HR 2 Charlie 35 Paris 90000 Marketing 3 David 28 New York 72000 HR
e. Filtering with isna()
and notna()
for Missing Data
To filter rows based on the presence or absence of missing (NaN) values, use .isna()
or .notna()
.
-
Example: Create a row with missing data and filter.
df_missing = df.copy() df_missing.loc[2, 'Salary'] = None # Introduce a missing value filtered_df_na = df_missing[df_missing['Salary'].isna()] print("\nDataFrame with missing value:") print(df_missing) print("\nRows where Salary is NaN:") print(filtered_df_na)
Output:
DataFrame with missing value: Name Age City Salary Department 0 Alice 25 New York 70000.0 HR 1 Bob 30 London 85000.0 IT 2 Charlie 35 Paris NaN Marketing 3 David 28 New York 72000.0 HR 4 Eve 22 London 65000.0 IT 5 Frank 40 Berlin 95000.0 Sales Rows where Salary is NaN: Name Age City Salary Department 2 Charlie 35 Paris NaN Marketing
2. Filtering with .query()
The .query()
method provides a more SQL-like syntax for filtering, which can be more readable for complex conditions, especially when dealing with string comparisons or referencing variables.
-
Example: Select rows where 'Age' > 25 AND 'City' is 'London'.
filtered_df_query = df.query("Age > 25 and City == 'London'") print("\nFiltered using .query() (Age > 25 and City == 'London'):") print(filtered_df_query)
Output:
Filtered using .query() (Age > 25 and City == 'London'): Name Age City Salary Department 1 Bob 30 London 85000 IT
You can also use variables within query()
by prefixing them with @
.
-
Example: Filter using a variable for age threshold.
min_age = 30 filtered_df_query_var = df.query("Age >= @min_age") print(f"\nFiltered using .query() with variable (Age >= {min_age}):") print(filtered_df_query_var)
Output:
Filtered using .query() with variable (Age >= 30): Name Age City Salary Department 1 Bob 30 London 85000 IT 2 Charlie 35 Paris 90000 Marketing 5 Frank 40 Berlin 95000 Sales
Learn more about
.query()
in the Pandas documentation.
3. Filtering by Row Labels or Position
While boolean indexing is for content-based filtering, you can also select rows based on their explicit labels or integer positions using .loc[]
and .iloc[]
respectively. This is more about selection than filtering based on data values within the cells.
a. .loc[]
(Label-based Indexing)
Use .loc[]
when you know the specific index labels of the rows you want to select. It also allows for filtering columns by label.
-
Example: Select rows with index labels 0 and 3.
filtered_df_loc_labels = df.loc[[0, 3]] print("\nSelected rows by label (0 and 3) using .loc[]:") print(filtered_df_loc_labels)
Output:
Selected rows by label (0 and 3) using .loc[]: Name Age City Salary Department 0 Alice 25 New York 70000 HR 3 David 28 New York 72000 HR
loc
can also take a boolean Series, effectively combining label-based selection with value-based filtering.
b. .iloc[]
(Integer-position based Indexing)
Use .iloc[]
when you know the integer positions of the rows (similar to list indexing).
-
Example: Select rows at integer positions 1 and 4.
filtered_df_iloc_positions = df.iloc[[1, 4]] print("\nSelected rows by integer position (1 and 4) using .iloc[]:") print(filtered_df_iloc_positions)
Output:
Selected rows by integer position (1 and 4) using .iloc[]: Name Age City Salary Department 1 Bob 30 London 85000 IT 4 Eve 22 London 65000 IT
For more details, refer to the Pandas
.loc
and.iloc
documentation.
Comparison of .loc[]
vs .iloc[]
Feature | .loc[] |
.iloc[] |
---|---|---|
Index Type | Label-based (uses actual row/column names) | Integer-position based (0-indexed) |
Slicing End | Inclusive (e.g., start:end includes end ) |
Exclusive (e.g., start:end excludes end ) |
Input for Rows | Single label, list of labels, boolean array, slice of labels | Single integer, list of integers, slice of integers |
Input for Cols | Single label, list of labels, boolean array, slice of labels | Single integer, list of integers, slice of integers |
4. Iterating Through Rows (Less Common for Filtering)
While vectorized operations (like boolean indexing and .query()
) are almost always preferred for performance, especially on large datasets, there might be specific scenarios where iterating through rows is necessary for complex, row-by-row logic.
Pandas offers methods like iterrows()
for this purpose. The iterrows()
method generates an iterator object of the DataFrame, allowing you to iterate each row in the DataFrame. Each iteration produces an index object and a row object (a Pandas Series object).
-
Example: Filtering rows based on a complex custom function (demonstrative, not recommended for performance).
filtered_rows_list = [] for index, row in df.iterrows(): # A hypothetical complex condition that's hard to vectorize if (row['Age'] * row['Salary']) > 2000000 and 'i' in row['Department'].lower(): filtered_rows_list.append(row) filtered_df_iter = pd.DataFrame(filtered_rows_list) print("\nFiltered using iterrows() (for demonstration of row access):") print(filtered_df_iter)
Output:
Filtered using iterrows() (for demonstration of row access): Name Age City Salary Department 1 Bob 30 London 85000 IT
Important Note: Using iterrows()
for filtering is generally much slower than vectorized methods for large DataFrames. It should be used judiciously for specific, non-vectorizable tasks where you absolutely need to process each row individually. For most filtering needs, stick to boolean indexing or .query()
.
Best Practices for Filtering in Pandas
- Prioritize Vectorized Operations: Always opt for boolean indexing,
.query()
,.isin()
, or.str
methods over explicit loops (.iterrows()
,.apply(axis=1)
) for performance reasons. - Use Parentheses for Multiple Conditions: Ensure each condition is enclosed in parentheses when combining with
&
or|
. - Readability: For complex filters, consider using
.query()
for better readability, or break down your boolean conditions into named variables. - Chaining Operations: You can chain filtering with other Pandas operations (e.g.,
df[df['Age'] > 30]['Salary'].mean()
). - Understand
loc
vsiloc
: Know when to use label-based (.loc
) versus integer-position based (.iloc
) indexing.
By mastering these techniques, you'll be able to effectively isolate and analyze the specific data you need within your Pandas DataFrames.