Code Snippets for Efficient Data Processing with Pandas

I'm working on a data analysis project and need to optimize my Pandas code for efficiency. Can you provide some code snippets for common data processing tasks that are known to be efficient?

1 Answer

✓ Best Answer
Here are several code snippets demonstrating efficient data processing techniques using Pandas in Python. These examples cover common tasks such as data cleaning, transformation, and analysis.

🧹 Data Cleaning

  • Removing Duplicate Rows:

import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'B'],
                   'col2': [1, 1, 2, 3, 2]})

df_unique = df.drop_duplicates()
print(df_unique)
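
If only some columns define what counts as a duplicate, `drop_duplicates` also accepts `subset` and `keep`. The sketch below is a variation on the example above, with the data changed slightly (an assumption for illustration) so the two calls give different results:

```python
import pandas as pd

# Illustrative data: the second 'A' row has a different 'col2' value
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'B'],
                   'col2': [1, 9, 2, 3, 2]})

# Treat rows as duplicates when 'col1' matches, keeping the last occurrence
df_last = df.drop_duplicates(subset=['col1'], keep='last')
print(df_last)

# Count fully duplicated rows before deciding whether to drop them
n_dupes = df.duplicated().sum()
print(n_dupes)
```

Checking `duplicated().sum()` first is a cheap way to see how much data a drop would remove.
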
  • Handling Missing Values (NaN):

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1, 2, np.nan, 4],
                   'col2': [5, np.nan, 7, 8]})

# Fill NaN values with a specific value (e.g., 0)
df_filled = df.fillna(0)
print(df_filled)

# Drop rows with NaN values
df_dropna = df.dropna()
print(df_dropna)
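
Often a single fill value is not right for every column. As a rough sketch using the same data, `fillna` can take a dict of per-column values, and `dropna` can be limited to specific columns with `subset`. The choice of median for `col1` and 0 for `col2` is only an assumption for illustration:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1, 2, np.nan, 4],
                   'col2': [5, np.nan, 7, 8]})

# Fill each column with its own value: the median for col1, a constant for col2
df_per_col = df.fillna({'col1': df['col1'].median(), 'col2': 0})
print(df_per_col)

# Drop rows only when 'col1' is missing; NaN in other columns is kept
df_subset = df.dropna(subset=['col1'])
print(df_subset)
```
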

📊 Data Transformation

  • Applying Functions to Columns:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4]})

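# Using apply with a lambda function
# (apply calls the function once per element, so for simple arithmetic on
# large data a vectorized expression such as df['col1'] * 2 is usually faster)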
df['col2'] = df['col1'].apply(lambda x: x * 2)
print(df)

# Using a defined function
def square(x):
    return x ** 2

df['col3'] = df['col1'].apply(square)
print(df)
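
Since the question is about efficiency, note that `apply` runs a Python function per element. For simple arithmetic and conditions, the same results can usually be computed with vectorized operations. Here is a minimal sketch of that alternative; the column names `doubled`, `squared`, and `label` are just illustrative:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Vectorized arithmetic operates on the whole column at once
df['doubled'] = df['col1'] * 2
df['squared'] = df['col1'] ** 2

# np.where is a vectorized replacement for an if/else inside apply
df['label'] = np.where(df['col1'] > 2, 'high', 'low')
print(df)
```

`apply` is still a reasonable choice when the logic cannot be expressed with built-in column operations.
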
  • Vectorized String Operations:

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie']})

# Convert names to uppercase
df['name_upper'] = df['name'].str.upper()
print(df)
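
The `.str` accessor covers more than case conversion. As a small sketch, the example below adds a missing name (an assumption for illustration) to show how `.str` methods handle missing values:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', None]})

# Length of each name; the missing entry stays missing
df['name_len'] = df['name'].str.len()

# Substring test; na=False makes missing entries evaluate to False
df['has_li'] = df['name'].str.contains('li', na=False)
print(df)
```

Passing `na=False` matters when the result is used as a filter mask, because a mask containing missing values cannot be used for boolean indexing.
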

📈 Data Analysis

  • Grouping and Aggregating Data:

import pandas as pd

df = pd.DataFrame({'category': ['A', 'A', 'B', 'B', 'A'],
                   'value': [10, 12, 15, 18, 20]})

# Group by 'category' and calculate the sum of 'value'
df_grouped = df.groupby('category')['value'].sum()
print(df_grouped)

# Multiple aggregations
df_agg = df.groupby('category')['value'].agg(['sum', 'mean', 'count'])
print(df_agg)
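
When the aggregated columns need readable names, named aggregation is one option. This is a sketch on the same data; the output names `total`, `average`, and `n_rows` are illustrative choices, not anything pandas requires:

```python
import pandas as pd

df = pd.DataFrame({'category': ['A', 'A', 'B', 'B', 'A'],
                   'value': [10, 12, 15, 18, 20]})

# Named aggregation: output_name=(source_column, aggregation)
df_named = df.groupby('category').agg(
    total=('value', 'sum'),
    average=('value', 'mean'),
    n_rows=('value', 'count'),
)
print(df_named)
```
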
  • Filtering Data:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [6, 7, 8, 9, 10]})

# Filter rows where 'col1' is greater than 2
df_filtered = df[df['col1'] > 2]
print(df_filtered)

# Multiple conditions
df_filtered_multiple = df[(df['col1'] > 2) & (df['col2'] < 10)]
print(df_filtered_multiple)
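
A few other filtering forms are worth knowing. The sketch below uses the same data to show `query` for readable multi-condition filters, `isin` for membership tests, and `.loc` for filtering rows and selecting columns in one step:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [6, 7, 8, 9, 10]})

# Same multi-condition filter written as a query string
df_query = df.query('col1 > 2 and col2 < 10')
print(df_query)

# Keep rows whose 'col1' value is in a given list
df_isin = df[df['col1'].isin([1, 5])]
print(df_isin)

# Filter rows and select columns in a single .loc call
df_loc = df.loc[df['col1'] > 2, ['col2']]
print(df_loc)
```
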
These snippets should help you process data more efficiently in Pandas. As a general rule, prefer built-in vectorized methods over element-by-element Python functions, and choose the method that fits your data and task. 🚀
