Code Snippets for Efficient Data Processing with Pandas

Question

I'm working on a data analysis project and need to optimize my Pandas code for efficiency. Can you provide some code snippets for common data processing tasks that are known to be efficient?

Jesus.Coleman · Accepted Answer

Here are several code snippets demonstrating efficient data processing techniques using Pandas in Python. These examples cover common tasks such as data cleaning, transformation, and analysis.

🧹 Data Cleaning

Removing Duplicate Rows:

import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'B'],
                   'col2': [1, 1, 2, 3, 2]})

df_unique = df.drop_duplicates()
print(df_unique)
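
If only some columns define a duplicate, `drop_duplicates` accepts `subset` and `keep`. The sketch below is a minimal illustration; the sample values and the names `df_last` and `dup_mask` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'B'],
                   'col2': [1, 5, 2, 3, 2]})

# Treat rows as duplicates when 'col1' matches, keeping the last occurrence
df_last = df.drop_duplicates(subset=['col1'], keep='last')
print(df_last)

# Boolean mask that marks rows repeating an earlier full row
dup_mask = df.duplicated()
print(dup_mask)
```

The mask from `duplicated()` is handy when you want to inspect the repeated rows before deleting them.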

Handling Missing Values (NaN):

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1, 2, np.nan, 4],
                   'col2': [5, np.nan, 7, 8]})

# Fill NaN values with a specific value (e.g., 0)
df_filled = df.fillna(0)
print(df_filled)

# Drop rows with NaN values
df_dropna = df.dropna()
print(df_dropna)
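
Filling everything with 0 or dropping every incomplete row is often too blunt. As a rough sketch, `fillna` also takes a dict of per-column values and `dropna` takes a `subset`. The choice of the column mean for `col1` is only an assumption for illustration, not a recommendation for your data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan, 4],
                   'col2': [5, np.nan, 7, 8]})

# Fill each column with its own value: the column mean for col1, 0 for col2
df_filled_by_col = df.fillna({'col1': df['col1'].mean(), 'col2': 0})
print(df_filled_by_col)

# Drop a row only when 'col1' is missing; NaN in other columns is kept
df_dropna_col1 = df.dropna(subset=['col1'])
print(df_dropna_col1)
```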

📊 Data Transformation

Applying Functions to Columns:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4]})

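# Using apply with a lambda function
# (apply calls the function once per element; for simple arithmetic like this,
# df['col1'] * 2 gives the same result and is usually faster on large columns)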
df['col2'] = df['col1'].apply(lambda x: x * 2)
print(df)

# Using a defined function
def square(x):
    return x ** 2

df['col3'] = df['col1'].apply(square)
print(df)
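
For simple arithmetic like the two examples above, the same columns can be computed without `apply` by operating on the whole Series at once. This is a minimal sketch of that vectorized form; how much faster it is depends on your data size, so time it on your own data.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Arithmetic on a Series is applied to the whole column at once
df['col2'] = df['col1'] * 2
df['col3'] = df['col1'] ** 2
print(df)
```

Keep `apply` for logic that has no vectorized equivalent.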

Vectorized String Operations:

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie']})

# Convert names to uppercase
df['name_upper'] = df['name'].str.upper()
print(df)
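
The `.str` accessor methods can be chained and also produce boolean masks and lengths. The sketch below assumes names with stray whitespace; the sample data and the column names `name_clean`, `has_li`, and `name_len` are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({'name': ['  Alice', 'Bob ', 'Charlie']})

# Chain .str methods: trim whitespace, then lowercase
df['name_clean'] = df['name'].str.strip().str.lower()

# Boolean column from a substring test (regex=False treats 'li' literally)
df['has_li'] = df['name_clean'].str.contains('li', regex=False)

# Length of each cleaned name
df['name_len'] = df['name_clean'].str.len()
print(df)
```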

📈 Data Analysis

Grouping and Aggregating Data:

import pandas as pd

df = pd.DataFrame({'category': ['A', 'A', 'B', 'B', 'A'],
                   'value': [10, 12, 15, 18, 20]})

# Group by 'category' and calculate the sum of 'value'
df_grouped = df.groupby('category')['value'].sum()
print(df_grouped)

# Multiple aggregations
df_agg = df.groupby('category')['value'].agg(['sum', 'mean', 'count'])
print(df_agg)
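
If you want to control the output column names, named aggregation is one option. This sketch reuses the data above; the output names `total`, `average`, and `n_rows` are arbitrary choices for the example.

```python
import pandas as pd

df = pd.DataFrame({'category': ['A', 'A', 'B', 'B', 'A'],
                   'value': [10, 12, 15, 18, 20]})

# Named aggregation: output_column=(input_column, aggregation)
df_named = df.groupby('category').agg(
    total=('value', 'sum'),
    average=('value', 'mean'),
    n_rows=('value', 'count'),
)
print(df_named)
```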

Filtering Data:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [6, 7, 8, 9, 10]})

# Filter rows where 'col1' is greater than 2
df_filtered = df[df['col1'] > 2]
print(df_filtered)

# Multiple conditions
df_filtered_multiple = df[(df['col1'] > 2) & (df['col2'] < 10)]
print(df_filtered_multiple)
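
A few other built-in ways to express filters, shown as a sketch on the same data. `query` is mostly a readability choice here; whether it is faster than a boolean mask depends on the size of the frame, so treat it as an alternative rather than an optimization.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [6, 7, 8, 9, 10]})

# Same filter as the two-condition mask, written as a query string
df_query = df.query('col1 > 2 and col2 < 10')
print(df_query)

# Keep rows whose 'col1' is one of several values
df_isin = df[df['col1'].isin([1, 3, 5])]
print(df_isin)

# Keep rows whose 'col2' lies in a closed range (both ends included)
df_between = df[df['col2'].between(7, 9)]
print(df_between)
```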

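These snippets cover the most common cleaning, transformation, and analysis tasks in Pandas. As a rule of thumb, prefer built-in vectorized methods (column arithmetic, .str accessors, groupby aggregations, boolean masks) over element-by-element apply or Python loops, and measure on your own data before optimizing further. 🚀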
