
How do you filter unique rows in pandas?

Published in Pandas Data Filtering · 5 min read

To filter unique rows in a pandas DataFrame, the most direct and widely used method is drop_duplicates(). This function efficiently identifies and removes duplicate rows, returning a new DataFrame containing only the unique entries.

Using the drop_duplicates() Method

The drop_duplicates() method is a quick and easy way to get all unique rows in a DataFrame. It removes duplicate rows, based on either all columns or a specified subset of columns, and by default returns a new DataFrame containing only the unique rows.

Basic Usage

By default, drop_duplicates() considers all columns in the DataFrame to determine if a row is a duplicate. It keeps the first occurrence of any duplicate set and discards the rest.

import pandas as pd

# Sample DataFrame with duplicate rows
data = {
    'product_id': [101, 102, 101, 103, 102, 104],
    'category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Clothing'],
    'price': [500, 25, 500, 120, 25, 75]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Filter unique rows considering all columns
unique_df = df.drop_duplicates()

print("\nDataFrame with unique rows (all columns considered):")
print(unique_df)

In this example, the row [101, 'Electronics', 500] appears twice, and [102, 'Books', 25] also appears twice. The drop_duplicates() method, by default, keeps the first instance of each.

Filtering Unique Rows Based on Specific Columns (subset)

Often, you might define uniqueness based on only a subset of your DataFrame's columns. For instance, you might want to ensure each product_id is unique, even if other column values differ. The subset parameter allows you to specify which columns to consider when identifying duplicates.

# Filter unique rows based only on 'product_id'
unique_by_id_df = df.drop_duplicates(subset=['product_id'])

print("\nDataFrame with unique rows based on 'product_id':")
print(unique_by_id_df)

Here, product_id=101 and product_id=102 each have multiple entries. When subset=['product_id'] is used, only the first row for each unique product_id is retained.
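The subset parameter also accepts multiple column labels, in which case a row counts as a duplicate only when every listed column matches. As a quick illustration, using the same sample data as above:

```python
import pandas as pd

# Same sample data as in the examples above
data = {
    'product_id': [101, 102, 101, 103, 102, 104],
    'category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Clothing'],
    'price': [500, 25, 500, 120, 25, 75]
}
df = pd.DataFrame(data)

# A row is dropped only if BOTH product_id and category match an earlier row
unique_pairs = df.drop_duplicates(subset=['product_id', 'category'])
print(unique_pairs)
```

Rows 2 and 4 are dropped because their (product_id, category) pairs already appeared at rows 0 and 1.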

Controlling Which Duplicate to Keep (keep)

The keep parameter provides control over which occurrence of a duplicate set is retained:

  • 'first' (default): Keeps the first observed row of a duplicate set and drops subsequent duplicates.
  • 'last': Keeps the last observed row of a duplicate set and drops all previous duplicates.
  • False: Drops all occurrences of rows that are duplicated. This means only truly unique rows (those that appear exactly once in the DataFrame) will be kept.

# Keep the last occurrence of duplicates (based on all columns)
unique_last_df = df.drop_duplicates(keep='last')
print("\nDataFrame with unique rows (keeping last occurrence):")
print(unique_last_df)

# Keep only rows that are not duplicated at all (appear exactly once)
truly_unique_df = df.drop_duplicates(keep=False)
print("\nDataFrame with only truly unique rows (appearing once):")
print(truly_unique_df)

Modifying the DataFrame In-Place (inplace)

By default, drop_duplicates() returns a new DataFrame with the unique rows, leaving the original DataFrame unchanged. If you want to modify the original DataFrame directly, set the inplace parameter to True; note that the method then returns None rather than a DataFrame.

# Create a new DataFrame for an inplace demonstration
df_inplace_example = pd.DataFrame(data)
print("\nOriginal DataFrame for inplace operation:")
print(df_inplace_example)

# Apply drop_duplicates in-place
df_inplace_example.drop_duplicates(inplace=True)
print("\nDataFrame after inplace drop_duplicates:")
print(df_inplace_example)

For more detailed information, refer to the official pandas documentation for drop_duplicates().

Summary of drop_duplicates() Parameters

  • subset (default None): Column label or sequence of labels. Only these columns are considered when identifying duplicates. If None, all columns are used.
  • keep (default 'first'): Determines which duplicates to mark/drop. 'first' keeps the first occurrence, 'last' keeps the last, and False drops all occurrences of duplicates (keeping only truly unique rows).
  • inplace (default False): If True, modifies the DataFrame in place and returns None.
  • ignore_index (default False): If True, the resulting axis is relabeled 0, 1, ..., n-1. Useful if you don't care about the original index labels of the kept rows.
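To see what ignore_index does in practice, compare the index labels of the result with and without it (again using the sample data from above):

```python
import pandas as pd

# Same sample data as in the examples above
data = {
    'product_id': [101, 102, 101, 103, 102, 104],
    'category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Clothing'],
    'price': [500, 25, 500, 120, 25, 75]
}
df = pd.DataFrame(data)

# Without ignore_index, kept rows retain their original labels: 0, 1, 3, 5
print(df.drop_duplicates().index.tolist())

# With ignore_index=True, the result is relabeled 0, 1, 2, 3
reindexed = df.drop_duplicates(ignore_index=True)
print(reindexed.index.tolist())
```

This is equivalent to calling reset_index(drop=True) on the result, just in a single step.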

Alternative: Using Boolean Indexing with duplicated()

While drop_duplicates() is generally the go-to method, understanding duplicated() can offer more granular control. The duplicated() method returns a boolean Series indicating whether each row is a duplicate of a previous row (or a later row, depending on keep).

# Create a fresh DataFrame for this example
df_bool = pd.DataFrame(data)

# Identify duplicated rows (True for duplicates, False for unique/first occurrence)
duplicates_series = df_bool.duplicated()
print("\nBoolean Series indicating duplicated rows:")
print(duplicates_series)

# Filter unique rows by selecting non-duplicated rows
unique_rows_via_bool = df_bool[~df_bool.duplicated()]
print("\nDataFrame with unique rows using boolean indexing (keeps first):")
print(unique_rows_via_bool)

# To keep last occurrences, you can pass keep='last' to duplicated()
unique_rows_last_via_bool = df_bool[~df_bool.duplicated(keep='last')]
print("\nDataFrame with unique rows using boolean indexing (keeps last):")
print(unique_rows_last_via_bool)
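Since duplicated() accepts the same keep values as drop_duplicates(), the keep=False pattern from earlier can also be reproduced with boolean indexing. A short sketch, using the same sample data:

```python
import pandas as pd

# Same sample data as in the examples above
data = {
    'product_id': [101, 102, 101, 103, 102, 104],
    'category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Clothing'],
    'price': [500, 25, 500, 120, 25, 75]
}
df_bool = pd.DataFrame(data)

# keep=False marks every member of a duplicate set as True, so negating it
# keeps only rows that appear exactly once in the DataFrame
truly_unique = df_bool[~df_bool.duplicated(keep=False)]
print(truly_unique)
```

This produces the same result as df_bool.drop_duplicates(keep=False), but the intermediate boolean mask can also be reused, e.g. to inspect or count the duplicated rows before dropping them.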

For more information, refer to the official pandas documentation for duplicated().