Pandas Tricks for Data Scientists

When I was starting, I wasted countless hours writing verbose, inefficient code, only to discover elegant, one-line solutions later. Now, after years of wrangling data in the real world, I've collected a handful of powerful Pandas tricks that not only save time but also make your code cleaner, faster, and more professional. In this article, I'll walk you through the Pandas tricks that have upgraded my data science workflow.


This isn’t just about syntax; it’s about a new way of thinking about data transformations. Below are the Pandas tricks that will instantly upgrade your data science workflow.

The Power of .pipe() for Chained Functions

Have you ever had a long chain of operations where each step depends on the previous one? It often looks like this:

# the messy way
df = some_function(df)
df = another_function(df)
df = yet_another_function(df)
df = final_function(df)

This works, but it's repetitive, and the flow of transformations is hard to follow.

.pipe() allows you to chain custom functions together in a clean, readable way, just like you would with built-in Pandas methods. Instead of passing the DataFrame in and out of each function, .pipe() handles it for you. This is perfect for when you have a sequence of pre-processing steps.

Data scientists use it for readability: in a team setting, someone else (or your future self) needs to understand your code, and chaining with .pipe() makes the data flow explicit and easy to follow. Here’s how you do it:

import pandas as pd

# dummy data
data = {
    "id": [1, 2, 2, 3, 4, 4],
    "old_name": ["Alice", "Bob", "Bob", "Charlie", "David", "David"],
    "column": ["10", "20", "20", "30", "40", "40"] 
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)


# your functions
def clean_names(df):
    return df.rename(columns={'old_name': 'new_name'})

def drop_duplicates(df):
    return df.drop_duplicates(subset=['id'])

def convert_types(df):
    return df.astype({'column': 'int'})


# processing
processed_df = (
    df
    .pipe(clean_names)
    .pipe(drop_duplicates)
    .pipe(convert_types)
)

print("\nProcessed DataFrame:")
print(processed_df)

Original DataFrame:
   id old_name column
0   1    Alice     10
1   2      Bob     20
2   2      Bob     20
3   3  Charlie     30
4   4    David     40
5   4    David     40

Processed DataFrame:
   id new_name  column
0   1    Alice      10
1   2      Bob      20
3   3  Charlie      30
4   4    David      40

This looks so much better! The DataFrame df is automatically passed as the first argument to each function, creating a beautiful, logical pipeline.
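A bonus of .pipe() is that it forwards any extra arguments and keyword arguments to your functions, so parameterized steps fit in the same chain. Here's a minimal sketch (the filter_min and add_flag helpers are made up for illustration):

```python
import pandas as pd

def filter_min(df, column, threshold):
    # keep rows where `column` is at least `threshold`
    return df[df[column] >= threshold]

def add_flag(df, flag_name):
    # add a constant marker column (illustrative helper)
    return df.assign(**{flag_name: True})

df = pd.DataFrame({"id": [1, 2, 3], "value": [5, 15, 25]})

# extra arguments after the function are passed straight through by .pipe()
result = (
    df
    .pipe(filter_min, column="value", threshold=10)
    .pipe(add_flag, "validated")
)
print(result)
```

This keeps configurable steps (thresholds, column names) in the pipeline itself instead of hard-coding them inside each function.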

Using .assign() to Add New Columns

Many beginners add new columns to a DataFrame using the standard bracket notation:

df['new_column'] = df['existing_column'] * 2

This works, but it breaks the flow of a method chain. If you’re trying to build a clean pipeline of operations, this approach forces you to do it on a separate line.

df.assign() is a cleaner alternative. It’s a method that returns a new DataFrame with the new columns added, allowing you to seamlessly integrate column creation into a single, fluid chain of operations.

Data scientists use .assign() for one-line transformations: it avoids creating intermediate variables and keeps your code tidy. Here’s a practical example:

# dummy data
data = {
    "category": ["Electronics", "Clothing", "Electronics", "Furniture", "Clothing", "Furniture", "Toys"],
    "sales": [500, 200, 800, 1200, 600, 300, 150]
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# your processing pipeline
processed_df = (
    df
    .groupby('category')
    .agg(total_sales=('sales', 'sum'))
    .reset_index()
    .assign(
        profit_margin=lambda x: x['total_sales'] * 0.15,
        is_high_profit=lambda x: x['profit_margin'] > 1000
    )
)

print("\nProcessed DataFrame:")
print(processed_df)
Original DataFrame:
      category  sales
0  Electronics    500
1     Clothing    200
2  Electronics    800
3    Furniture   1200
4     Clothing    600
5    Furniture    300
6         Toys    150

Processed DataFrame:
      category  total_sales  profit_margin  is_high_profit
0     Clothing          800          120.0           False
1  Electronics         1300          195.0           False
2    Furniture         1500          225.0           False
3         Toys          150           22.5           False

Notice the lambda function. It lets you create new columns based on other columns you just created within the same assign block. This is incredibly powerful and efficient.
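One more property worth knowing: .assign() returns a new DataFrame and leaves the original untouched, which keeps pipelines free of side effects. A small sketch (column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0], "qty": [3, 2]})

# .assign() returns a new DataFrame; `df` itself is not modified
df2 = df.assign(
    revenue=lambda x: x["price"] * x["qty"],
    high=lambda x: x["revenue"] > 35,  # uses the column created just above
)

print(df.columns.tolist())   # original still has only the two columns
print(df2)
```

Because the original DataFrame is untouched, you can branch several derived DataFrames off the same source without worrying about one step corrupting another.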

Vectorization Over Iteration (The apply() Trap)

I can’t stress this one enough. When you need to operate on every row of a DataFrame, your first instinct might be to use .apply() or a loop.

Example of what to avoid:

# The Slow, Painful Way
def calculate_grade(row):
    if row['score'] > 90:
        return 'A'
    elif row['score'] > 80:
        return 'B'
    # ... and so on

df['grade'] = df.apply(calculate_grade, axis=1)

Under the hood, .apply() with axis=1 is just a Python-level loop over rows. It's slow and should be avoided for large datasets.
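If you want to measure the gap on your own machine, here's an illustrative (machine-dependent) timing sketch comparing a row-wise .apply() with a vectorized np.where() on the same data:

```python
import time
import numpy as np
import pandas as pd

# illustrative benchmark: exact timings vary by machine
df = pd.DataFrame({"score": np.random.randint(0, 100, 100_000)})

start = time.perf_counter()
slow = df.apply(lambda row: "A" if row["score"] > 90 else "other", axis=1)
apply_time = time.perf_counter() - start

start = time.perf_counter()
fast = np.where(df["score"] > 90, "A", "other")
vec_time = time.perf_counter() - start

print(f".apply(): {apply_time:.3f}s, np.where(): {vec_time:.4f}s")
```

On a typical laptop the vectorized version wins by two to three orders of magnitude, and the gap grows with the data.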

So, use vectorization. Whenever possible, use built-in Pandas or NumPy operations, which are optimized and run in compiled C code under the hood. For conditional logic, np.where() is your best friend. Here’s the fast and efficient way:

import numpy as np
# dummy data with scores across different ranges
data = {
    "student_id": [1, 2, 3, 4, 5, 6, 7],
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace"],
    "score": [95, 88, 73, 65, 82, 91, 70]  # covers A, B, C, and F cases
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# apply grading logic
df['grade'] = np.where(df['score'] > 90, 'A',
                       np.where(df['score'] > 80, 'B',
                                np.where(df['score'] > 70, 'C', 'F')))

print("\nDataFrame with Grades:")
print(df)
Original DataFrame:
   student_id     name  score
0           1    Alice     95
1           2      Bob     88
2           3  Charlie     73
3           4    David     65
4           5      Eve     82
5           6    Frank     91
6           7    Grace     70

DataFrame with Grades:
   student_id     name  score grade
0           1    Alice     95     A
1           2      Bob     88     B
2           3  Charlie     73     C
3           4    David     65     F
4           5      Eve     82     B
5           6    Frank     91     A
6           7    Grace     70     F

This is not only faster but also much more scalable. Once you start dealing with millions of rows, the difference between a vectorized operation and a loop is the difference between a 1-second task and a 1-hour task.
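Once you have more than two or three branches, nested np.where() calls get hard to read. numpy's np.select() expresses the same logic as a flat list of conditions and choices; here's a sketch using the same grading thresholds:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [95, 88, 73, 65]})

# conditions are checked in order; the first match wins
conditions = [
    df["score"] > 90,
    df["score"] > 80,
    df["score"] > 70,
]
choices = ["A", "B", "C"]

# `default` covers every row no condition matched
df["grade"] = np.select(conditions, choices, default="F")
print(df)
```

It's just as fast as nested np.where() calls, and adding a new grade band is a one-line change instead of another level of nesting.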

Final Words

So, we’ve covered some powerful Pandas tricks:

  1. .pipe() for creating clean, readable method chains.
  2. .assign() to add new columns within a fluid pipeline.
  3. Vectorization (with np.where()) to replace slow loops and .apply() calls.

Don’t just read about these. Open up a Jupyter notebook with a dataset you’re familiar with and try to apply at least one of these tricks. Refactor an old piece of code and see how much cleaner and more efficient you can make it.

I hope you liked this article on Pandas tricks that can instantly upgrade your data science workflow. Feel free to ask your questions in the comments section below. You can follow me on Instagram for many more resources.

Aman Kharwal
AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.