To filter unique rows in a pandas DataFrame, the most direct and widely used method is `drop_duplicates()`. This function efficiently identifies and removes duplicate rows, returning a new DataFrame containing only the unique entries.
Using the `drop_duplicates()` Method
The `drop_duplicates()` method is a quick and easy way to get all unique rows in a DataFrame. It removes duplicate rows, based on either a specific subset of columns or all columns, and returns a new DataFrame with only the unique rows.
Basic Usage
By default, `drop_duplicates()` considers all columns in the DataFrame to determine whether a row is a duplicate. It keeps the first occurrence of any duplicate set and discards the rest.
import pandas as pd
# Sample DataFrame with duplicate rows
data = {
'product_id': [101, 102, 101, 103, 102, 104],
'category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Clothing'],
'price': [500, 25, 500, 120, 25, 75]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Filter unique rows considering all columns
unique_df = df.drop_duplicates()
print("\nDataFrame with unique rows (all columns considered):")
print(unique_df)
In this example, the row `[101, 'Electronics', 500]` appears twice, and `[102, 'Books', 25]` also appears twice. The `drop_duplicates()` method, by default, keeps the first instance of each.
Filtering Unique Rows Based on Specific Columns (`subset`)
Often, you might define uniqueness based on only a subset of your DataFrame's columns. For instance, you might want to ensure each `product_id` appears only once, even if other column values differ. The `subset` parameter lets you specify which columns to consider when identifying duplicates.
# Filter unique rows based only on 'product_id'
unique_by_id_df = df.drop_duplicates(subset=['product_id'])
print("\nDataFrame with unique rows based on 'product_id':")
print(unique_by_id_df)
Here, `product_id=101` and `product_id=102` each have multiple entries. When `subset=['product_id']` is used, only the first row for each unique `product_id` is retained.
Controlling Which Duplicate to Keep (`keep`)
The `keep` parameter provides control over which occurrence of a duplicate set is retained:

- `'first'` (default): keeps the first observed row of a duplicate set and drops subsequent duplicates.
- `'last'`: keeps the last observed row of a duplicate set and drops all previous duplicates.
- `False`: drops all occurrences of rows that are duplicated. Only truly unique rows (those that appear exactly once in the DataFrame) are kept.
# Keep the last occurrence of duplicates (based on all columns)
unique_last_df = df.drop_duplicates(keep='last')
print("\nDataFrame with unique rows (keeping last occurrence):")
print(unique_last_df)
# Keep only rows that are not duplicated at all (appear exactly once)
truly_unique_df = df.drop_duplicates(keep=False)
print("\nDataFrame with only truly unique rows (appearing once):")
print(truly_unique_df)
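The `subset` and `keep` parameters can also be combined. As a sketch using the same sample data as above, keeping the last row seen for each `product_id`:

```python
import pandas as pd

# Same sample data as in the earlier examples
data = {
    'product_id': [101, 102, 101, 103, 102, 104],
    'category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Clothing'],
    'price': [500, 25, 500, 120, 25, 75]
}
df = pd.DataFrame(data)

# Keep the last occurrence for each product_id; rows at the original
# index labels 2, 3, 4, 5 survive
last_per_id = df.drop_duplicates(subset=['product_id'], keep='last')
print(last_per_id)
```

This pattern is handy when rows are ordered chronologically and the last entry per key is the most current one.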
Modifying the DataFrame In-Place (`inplace`)
By default, `drop_duplicates()` returns a new DataFrame with the unique rows, leaving the original DataFrame unchanged. If you want to modify the original DataFrame directly, set the `inplace` parameter to `True`.
# Create a new DataFrame for an inplace demonstration
df_inplace_example = pd.DataFrame(data)
print("\nOriginal DataFrame for inplace operation:")
print(df_inplace_example)
# Apply drop_duplicates in-place
df_inplace_example.drop_duplicates(inplace=True)
print("\nDataFrame after inplace drop_duplicates:")
print(df_inplace_example)
For more detailed information, refer to the official pandas documentation for `drop_duplicates()`.
Summary of `drop_duplicates()` Parameters
| Parameter | Description | Default |
|---|---|---|
| `subset` | Column label or sequence of labels. Only these columns are considered when identifying duplicates. If `None`, all columns are used. | `None` |
| `keep` | Determines which duplicates to mark/drop. `'first'` keeps the first, `'last'` keeps the last, and `False` drops all occurrences of duplicates (keeping only truly unique rows). | `'first'` |
| `inplace` | If `True`, modifies the DataFrame in place and returns `None`. | `False` |
| `ignore_index` | If `True`, the resulting axis is labeled `0, 1, ..., n-1`. Useful if you don't care about the original index labels of the kept rows. | `False` |
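The `ignore_index` parameter was not demonstrated above, so here is a brief sketch (it requires pandas 1.0 or newer) using the same sample data:

```python
import pandas as pd

# Same sample data as in the earlier examples
data = {
    'product_id': [101, 102, 101, 103, 102, 104],
    'category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Clothing'],
    'price': [500, 25, 500, 120, 25, 75]
}
df = pd.DataFrame(data)

# Without ignore_index, surviving rows keep their original labels: 0, 1, 3, 5
print(df.drop_duplicates().index.tolist())

# With ignore_index=True, the result is relabeled 0..n-1
reindexed = df.drop_duplicates(ignore_index=True)
print(reindexed.index.tolist())
```

The same effect can be achieved with a chained `.reset_index(drop=True)` on older pandas versions.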
Alternative: Using Boolean Indexing with `duplicated()`
While `drop_duplicates()` is generally the go-to method, understanding `duplicated()` can offer more granular control. The `duplicated()` method returns a boolean Series indicating whether each row is a duplicate of a previous row (or a later row, depending on `keep`).
# Create a fresh DataFrame for this example
df_bool = pd.DataFrame(data)
# Identify duplicated rows (True for duplicates, False for unique/first occurrence)
duplicates_series = df_bool.duplicated()
print("\nBoolean Series indicating duplicated rows:")
print(duplicates_series)
# Filter unique rows by selecting non-duplicated rows
unique_rows_via_bool = df_bool[~df_bool.duplicated()]
print("\nDataFrame with unique rows using boolean indexing (keeps first):")
print(unique_rows_via_bool)
# To keep last occurrences, you can pass keep='last' to duplicated()
unique_rows_last_via_bool = df_bool[~df_bool.duplicated(keep='last')]
print("\nDataFrame with unique rows using boolean indexing (keeps last):")
print(unique_rows_last_via_bool)
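Because `duplicated()` returns a plain boolean Series, it composes with other operations. A small sketch using the same sample data, flagging repeated `product_id` values (the `subset` parameter works here too) and counting them:

```python
import pandas as pd

# Same sample data as in the earlier examples
data = {
    'product_id': [101, 102, 101, 103, 102, 104],
    'category': ['Electronics', 'Books', 'Electronics', 'Home', 'Books', 'Clothing'],
    'price': [500, 25, 500, 120, 25, 75]
}
df_bool = pd.DataFrame(data)

# Flag rows whose product_id was already seen earlier in the DataFrame
is_repeat_id = df_bool.duplicated(subset=['product_id'])
print(is_repeat_id.tolist())   # [False, False, True, False, True, False]

# The boolean Series feeds any downstream logic, e.g. counting duplicates
print(int(is_repeat_id.sum()))  # 2
```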
For more information, refer to the official pandas documentation for `duplicated()`.