
How do you filter unique rows in pandas?

Published in Pandas Data Filtering · 5 min read

To filter unique rows in a pandas DataFrame, the most direct and widely used method is drop_duplicates(). This function efficiently identifies and removes duplicate rows, returning a new DataFrame containing only the unique entries.

Using the drop_duplicates() Method

The drop_duplicates() method is a quick and easy way to get all unique rows in a DataFrame. It removes duplicate rows, based on either all columns or a specified subset of columns, and by default returns a new DataFrame containing only the unique rows.

Basic Usage

By default, drop_duplicates() considers all columns in the DataFrame to determine if a row is a duplicate. It keeps the first occurrence of any duplicate set and discards the rest.

import pandas as pd

# Sample DataFrame with duplicate rows
data = {
    'product_id': [101, 102, 101, 103, 102, 104],
    'category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Clothing'],
    'price': [500, 25, 500, 120, 25, 75]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Filter unique rows considering all columns
unique_df = df.drop_duplicates()

print("\nDataFrame with unique rows (all columns considered):")
print(unique_df)

In this example, the row [101, 'Electronics', 500] appears twice, and [102, 'Books', 25] also appears twice. The drop_duplicates() method, by default, keeps the first instance of each.

Filtering Unique Rows Based on Specific Columns (subset)

Often, you might define uniqueness based on only a subset of your DataFrame's columns. For instance, you might want to ensure each product_id is unique, even if other column values differ. The subset parameter allows you to specify which columns to consider when identifying duplicates.

# Filter unique rows based only on 'product_id'
unique_by_id_df = df.drop_duplicates(subset=['product_id'])

print("\nDataFrame with unique rows based on 'product_id':")
print(unique_by_id_df)

Here, product_id=101 and product_id=102 each have multiple entries. When subset=['product_id'] is used, only the first row for each unique product_id is retained.
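The subset parameter also accepts multiple column labels, in which case a row counts as a duplicate only when every listed column matches. As a quick illustration, using the same sample data as above:

```python
import pandas as pd

# Same sample data as in the examples above
data = {
    'product_id': [101, 102, 101, 103, 102, 104],
    'category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Clothing'],
    'price': [500, 25, 500, 120, 25, 75]
}
df = pd.DataFrame(data)

# A row is dropped only if BOTH product_id and category match an earlier row
unique_pairs = df.drop_duplicates(subset=['product_id', 'category'])
print(unique_pairs)
```

Rows 2 and 4 are dropped because their (product_id, category) pairs already appeared at rows 0 and 1.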

Controlling Which Duplicate to Keep (keep)

The keep parameter provides control over which occurrence of a duplicate set is retained:

  • 'first' (default): Keeps the first observed row of a duplicate set and drops subsequent duplicates.
  • 'last': Keeps the last observed row of a duplicate set and drops all previous duplicates.
  • False: Drops all occurrences of rows that are duplicated. This means only truly unique rows (those that appear exactly once in the DataFrame) will be kept.

# Keep the last occurrence of duplicates (based on all columns)
unique_last_df = df.drop_duplicates(keep='last')
print("\nDataFrame with unique rows (keeping last occurrence):")
print(unique_last_df)

# Keep only rows that are not duplicated at all (appear exactly once)
truly_unique_df = df.drop_duplicates(keep=False)
print("\nDataFrame with only truly unique rows (appearing once):")
print(truly_unique_df)

Modifying the DataFrame In-Place (inplace)

By default, drop_duplicates() returns a new DataFrame with the unique rows, leaving the original DataFrame unchanged. If you want to modify the original DataFrame directly, set the inplace parameter to True; note that the method then returns None rather than a DataFrame.

# Create a new DataFrame for an inplace demonstration
df_inplace_example = pd.DataFrame(data)
print("\nOriginal DataFrame for inplace operation:")
print(df_inplace_example)

# Apply drop_duplicates in-place
df_inplace_example.drop_duplicates(inplace=True)
print("\nDataFrame after inplace drop_duplicates:")
print(df_inplace_example)

For more detailed information, refer to the official pandas documentation for drop_duplicates().

Summary of drop_duplicates() Parameters

  • subset (default None): Column label or sequence of labels. Only these columns are considered when identifying duplicates. If None, all columns are used.
  • keep (default 'first'): Determines which duplicates to mark/drop. 'first' keeps the first occurrence, 'last' keeps the last, and False drops all occurrences of duplicates (keeping only truly unique rows).
  • inplace (default False): If True, modifies the DataFrame in place and returns None.
  • ignore_index (default False): If True, the resulting axis is relabeled 0, 1, ..., n-1. Useful if you don't care about the original index labels of the kept rows.
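To see what ignore_index does in practice, compare the index labels of the result with and without it (again using the sample data from above):

```python
import pandas as pd

# Same sample data as in the examples above
data = {
    'product_id': [101, 102, 101, 103, 102, 104],
    'category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Clothing'],
    'price': [500, 25, 500, 120, 25, 75]
}
df = pd.DataFrame(data)

# Without ignore_index, kept rows retain their original labels: 0, 1, 3, 5
print(df.drop_duplicates().index.tolist())

# With ignore_index=True, the result is relabeled 0, 1, 2, 3
reindexed = df.drop_duplicates(ignore_index=True)
print(reindexed.index.tolist())
```

This is equivalent to calling reset_index(drop=True) on the result, just in a single step.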

Alternative: Using Boolean Indexing with duplicated()

While drop_duplicates() is generally the go-to method, understanding duplicated() can offer more granular control. The duplicated() method returns a boolean Series indicating whether each row is a duplicate of a previous row (or a later row, depending on keep).

# Create a fresh DataFrame for this example
df_bool = pd.DataFrame(data)

# Identify duplicated rows (True for duplicates, False for unique/first occurrence)
duplicates_series = df_bool.duplicated()
print("\nBoolean Series indicating duplicated rows:")
print(duplicates_series)

# Filter unique rows by selecting non-duplicated rows
unique_rows_via_bool = df_bool[~df_bool.duplicated()]
print("\nDataFrame with unique rows using boolean indexing (keeps first):")
print(unique_rows_via_bool)

# To keep last occurrences, you can pass keep='last' to duplicated()
unique_rows_last_via_bool = df_bool[~df_bool.duplicated(keep='last')]
print("\nDataFrame with unique rows using boolean indexing (keeps last):")
print(unique_rows_last_via_bool)
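Since duplicated() accepts the same keep values as drop_duplicates(), the keep=False pattern from earlier can also be reproduced with boolean indexing. A short sketch, using the same sample data:

```python
import pandas as pd

# Same sample data as in the examples above
data = {
    'product_id': [101, 102, 101, 103, 102, 104],
    'category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Clothing'],
    'price': [500, 25, 500, 120, 25, 75]
}
df_bool = pd.DataFrame(data)

# keep=False marks every member of a duplicate set as True, so negating it
# keeps only rows that appear exactly once in the DataFrame
truly_unique = df_bool[~df_bool.duplicated(keep=False)]
print(truly_unique)
```

This produces the same result as df_bool.drop_duplicates(keep=False), but the intermediate boolean mask can also be reused, e.g. to inspect or count the duplicated rows before dropping them.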

For more information, refer to the official pandas documentation for duplicated().