Dropping Duplicates in Pandas

Introduction

Handling duplicate data is a common task in data processing and cleaning. Pandas provides the drop_duplicates() method to remove duplicate rows from a DataFrame. Its subset parameter restricts the comparison to specific columns, and its keep parameter controls whether the first occurrence, the last occurrence, or no occurrence of each duplicate is retained. By default the method returns a new DataFrame and leaves the original unchanged.

Example 1: Basic Usage

The simplest way to use drop_duplicates() is to call it without any parameters. Rows are then compared across all columns, and every row that repeats an earlier row is dropped, so the first occurrence of each is kept.


import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie'], 'Age': [25, 30, 25, 35]}
df = pd.DataFrame(data)

# Drop duplicates
df_unique = df.drop_duplicates()

print(df_unique)

Explanation

In this example, the second 'Alice' row (index 2) is identical to the first (index 0) and is removed, because drop_duplicates() keeps the first occurrence by default. The surviving rows keep their original index labels, so the index in the output skips from 1 to 3.

Console Output:

      Name  Age
0    Alice   25
1      Bob   30
3  Charlie   35
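Before dropping anything, it can help to see which rows would be removed. The sketch below uses the companion DataFrame.duplicated() method, which returns a boolean mask that is True for each row repeating an earlier one. The variable names (mask, n_duplicates, duplicate_rows) are illustrative only.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie'], 'Age': [25, 30, 25, 35]}
df = pd.DataFrame(data)

# True for every row that repeats an earlier row (all columns compared)
mask = df.duplicated()

# Count the duplicates and look at the rows that would be dropped
n_duplicates = int(mask.sum())
duplicate_rows = df[mask]

print(n_duplicates)
print(duplicate_rows)
```

Filtering with the inverted mask, df[~mask], should give the same rows as df.drop_duplicates().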

Example 2: Dropping Duplicates Based on Specific Columns

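You can drop duplicates based on specific columns by passing a column name, or a list of column names, to the subset parameter. Only those columns are compared when deciding whether two rows are duplicates.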


import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie'], 'Age': [25, 30, 26, 35]}
df = pd.DataFrame(data)

# Drop duplicates based on 'Name'
df_unique_name = df.drop_duplicates(subset=['Name'])

print(df_unique_name)

Explanation

In this example, only the 'Name' column is compared. Although the two 'Alice' rows have different ages (25 and 26), the second one is considered a duplicate and is removed, so the row with age 26 is discarded.

Console Output:

      Name  Age
0    Alice   25
1      Bob   30
3  Charlie   35
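The subset parameter also accepts several columns, in which case rows count as duplicates only when they match on every listed column. The following is a minimal sketch; the 'City' column and its values are made up for illustration and are not part of the examples above.

```python
import pandas as pd

# 'City' is a hypothetical extra column added for this illustration
data = {
    'Name': ['Alice', 'Alice', 'Alice', 'Bob'],
    'City': ['Paris', 'Paris', 'Rome', 'Paris'],
    'Age': [25, 26, 25, 30],
}
df = pd.DataFrame(data)

# Rows are duplicates only if both 'Name' and 'City' match
df_unique_pair = df.drop_duplicates(subset=['Name', 'City'])

print(df_unique_pair)
```

Here the second row repeats the ('Alice', 'Paris') pair and is dropped, while ('Alice', 'Rome') is kept because the city differs.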

Example 3: Keeping the Last Occurrence

The keep parameter controls which occurrence of each duplicate row is retained: 'first' (the default) or 'last'.


import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie'], 'Age': [25, 30, 25, 35]}
df = pd.DataFrame(data)

# Drop duplicates, keep the last occurrence
df_keep_last = df.drop_duplicates(keep='last')

print(df_keep_last)

Explanation

In this example, the method keeps the last occurrence of each duplicate row. The first 'Alice' row (index 0) is removed and the second one (index 2) is kept.

Console Output:

      Name  Age
1      Bob   30
2    Alice   25
3  Charlie   35
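A common use of keep='last' is to retain the most recent record per key: sort by a timestamp first, then drop duplicates on the key column. This is a sketch under assumptions; the 'Updated' column and its dates are invented for the illustration.

```python
import pandas as pd

# 'Updated' is a hypothetical timestamp column used only for this sketch
data = {
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [26, 30, 25],
    'Updated': pd.to_datetime(['2024-03-01', '2024-01-15', '2024-01-10']),
}
df = pd.DataFrame(data)

# Sort oldest to newest, then keep the last (newest) row for each name
latest = df.sort_values('Updated').drop_duplicates(subset=['Name'], keep='last')

print(latest)
```

Because the rows are sorted before deduplication, "last" means "newest" rather than "last in the original order", so the 'Alice' row with age 26 survives.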

Example 4: Dropping All Duplicates

To drop every row that has a duplicate, without keeping any occurrence, set the keep parameter to False.


import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie'], 'Age': [25, 30, 25, 35]}
df = pd.DataFrame(data)

# Drop all duplicates
df_no_duplicates = df.drop_duplicates(keep=False)

print(df_no_duplicates)

Explanation

In this case, both 'Alice' rows are removed, because each has a duplicate. Only the rows that appear exactly once in the original DataFrame remain.

Console Output:

      Name  Age
1      Bob   30
3  Charlie   35
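To see exactly which rows keep=False discards, the duplicated() method accepts the same keep argument. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie'], 'Age': [25, 30, 25, 35]}
df = pd.DataFrame(data)

# True for every row that belongs to a group of identical rows
in_duplicate_group = df.duplicated(keep=False)

removed = df[in_duplicate_group]   # rows that keep=False would drop
kept = df[~in_duplicate_group]     # rows that appear exactly once

print(removed)
print(kept)
```

Inspecting the removed rows first is a cheap safeguard, since keep=False discards every copy rather than leaving one behind.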

Example 5: Resetting Index After Dropping Duplicates

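After dropping duplicates, the remaining rows keep their original index labels, so the index may contain gaps. You can renumber it from 0 by chaining the reset_index() method. Passing drop=True discards the old index instead of adding it as a new column.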


import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie'], 'Age': [25, 30, 25, 35]}
df = pd.DataFrame(data)

# Drop duplicates and reset index
df_unique_reset = df.drop_duplicates().reset_index(drop=True)

print(df_unique_reset)

Explanation

This example demonstrates how to reset the index after dropping duplicates, resulting in a DataFrame with a continuous index.

Console Output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
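As an alternative to chaining reset_index(), drop_duplicates() itself takes an ignore_index parameter. The sketch below assumes pandas 1.0 or later, where this parameter is available.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie'], 'Age': [25, 30, 25, 35]}
df = pd.DataFrame(data)

# ignore_index=True relabels the result 0, 1, ..., n-1 in one step
# (assumes pandas >= 1.0)
df_unique_reset = df.drop_duplicates(ignore_index=True)

print(df_unique_reset)
```

Both forms should produce the same DataFrame; the chained reset_index(drop=True) version also works on older pandas releases.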