Mastering Pandas DataFrame Operations
I've been working with Pandas long enough to have made most of the classic mistakes - overwriting a DataFrame I needed, spending 20 minutes debugging a shape mismatch that turned out to be a stray NaN, trying to merge two DataFrames on a column that had different dtypes on each side. This post walks through the operations I use most often, and I'll flag the spots where I've burned time so you don't have to.
Getting Started
First, let's import the essential libraries:
```python
import pandas as pd
import numpy as np
```
1. Creating DataFrames
There are a few ways to get a DataFrame. The two I reach for most are dictionaries for quick test data and read_csv for anything real.
Creating from Dictionary
The dictionary approach is my go-to when I'm prototyping or writing examples:
```python
data = {
    'Student_ID': [1, 2, 3, 4, 5],
    'Name': ['Amit', 'Sneha', 'Rahul', 'Pooja', 'Amit'],
    'Marks': [85, 90, None, 78, 85],
    'City': ['Pune', 'Mumbai', 'Pune', 'Nashik', 'Pune']
}
df = pd.DataFrame(data)
print(df)
```
Output:
```
   Student_ID   Name  Marks    City
0           1   Amit   85.0    Pune
1           2  Sneha   90.0  Mumbai
2           3  Rahul    NaN    Pune
3           4  Pooja   78.0  Nashik
4           5   Amit   85.0    Pune
```
The None in the Marks column gets cast to NaN automatically - which is why Marks shows up as float64 instead of int64. That surprises people the first time they see it.
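If you want to keep an integer column despite missing values, pandas' nullable Int64 dtype (capital I) is one option. A minimal sketch of both behaviors:

```python
import pandas as pd

# None in a numeric column is stored as NaN, which forces float64
marks = pd.Series([85, 90, None, 78, 85])
print(marks.dtype)  # float64

# The nullable Int64 dtype keeps integers alongside missing values (<NA>)
marks_int = pd.Series([85, 90, None, 78, 85], dtype="Int64")
print(marks_int.dtype)  # Int64
```

The nullable dtypes behave slightly differently from NumPy-backed ones in some operations, so whether the cleaner dtype is worth it depends on the rest of your pipeline.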
Loading from CSV Files
For real-world projects, you'll often load data from CSV files:
```python
try:
    df_csv = pd.read_csv("students.csv")
    print(df_csv)
except FileNotFoundError:
    print("Error: CSV file not found.")
```
2. Looking at What You've Got
I wasted a lot of time early on jumping straight into analysis without actually inspecting the data first. Don't do that. Run these before you do anything else.
Quick Data Preview
```python
# View first 5 rows
print(df.head())

# View last 5 rows
print(df.tail())

# Get DataFrame dimensions
print(f"Shape: {df.shape}")  # (rows, columns)
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
```
Understanding DataFrame Structure
```python
# Display column names
print(df.columns.tolist())
# Output: ['Student_ID', 'Name', 'Marks', 'City']

# Get detailed information (info() prints directly; no print() needed)
df.info()
```
DataFrame Info Output:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Student_ID  5 non-null      int64
 1   Name        5 non-null      object
 2   Marks       4 non-null      float64
 3   City        5 non-null      object
```
df.info() is underrated. It shows you dtypes and null counts at a glance, which is way more useful than just printing the DataFrame when you're working with something larger.
Statistical Summary
```python
# Get statistical summary of numerical columns
print(df.describe())
```
Output:
```
       Student_ID      Marks
count    5.000000   4.000000
mean     3.000000  84.500000
std      1.581139   4.932883
min      1.000000  78.000000
25%      2.000000  83.250000
50%      3.000000  85.000000
75%      4.000000  86.250000
max      5.000000  90.000000
```
Notice count is 4 for Marks, not 5. That's your NaN showing up. If those counts don't match across columns, you've got missing data somewhere.
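You can turn that eyeball check into code: count() excludes NaN per column, so any column whose non-null count falls short of the row count has gaps. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Student_ID': [1, 2, 3, 4, 5],
    'Marks': [85, 90, None, 78, 85],
})

# count() skips NaN, so mismatched counts reveal missing data
counts = df.count()
print(counts)

# Names of columns that contain at least one missing value
incomplete = counts[counts < len(df)].index.tolist()
print(incomplete)  # ['Marks']
```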
3. Cleaning Up the Mess
Real data is never clean. Here's what I do in roughly this order every time.
Handling Missing Values
```python
# Detect null values
print(df.isnull())

# Count null values per column
print(df.isnull().sum())  # Marks: 1, everything else: 0

# Fill missing values with the column mean
df['Marks'] = df['Marks'].fillna(df['Marks'].mean())
print(df)
```
After filling, the missing mark (previously NaN) becomes 84.5 (the mean of existing marks).
Whether filling with the mean is the right choice depends on your data. For grades, it's often acceptable. For something like income data, it can silently bias your results in ways that are hard to catch later.
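One alternative worth knowing: the median is less sensitive to outliers than the mean, which matters for skewed data like income. A quick comparison on the same column:

```python
import pandas as pd

marks = pd.Series([85, 90, None, 78, 85])

# Mean fill: pulled around by outliers
mean_filled = marks.fillna(marks.mean())

# Median fill: robust to skew and extreme values
median_filled = marks.fillna(marks.median())

print(mean_filled[2], median_filled[2])  # 84.5 85.0
```

On this tiny example the two barely differ; on a long-tailed distribution they can diverge substantially.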
Removing Duplicates
```python
# Check for duplicates
print(f"Number of duplicates: {df.duplicated().sum()}")

# Remove duplicate rows
df = df.drop_duplicates()
print(df)
```
Modifying Data
```python
# Update specific values using conditions
df.loc[df['Name'] == 'Amit', 'Marks'] = 88
print(df)

# Add a new column based on conditions
df['Result'] = df['Marks'].apply(lambda x: 'Pass' if x >= 40 else 'Fail')
print(df)
```
Output with new column:
```
   Student_ID   Name  Marks    City Result
0           1   Amit   88.0    Pune   Pass
1           2  Sneha   90.0  Mumbai   Pass
2           3  Rahul   84.5    Pune   Pass
3           4  Pooja   78.0  Nashik   Pass
4           5   Amit   88.0    Pune   Pass
```
4. Grouping and Aggregating
groupby is where Pandas starts to feel genuinely powerful. This is how you answer questions like "what's the average score by city" or "how many students are in each department."
Basic Grouping
```python
# Calculate average marks by city
grouped_data = df.groupby('City')['Marks'].mean()
print(grouped_data)
```
Output:
```
City
Mumbai    90.000000
Nashik    78.000000
Pune      86.833333
Name: Marks, dtype: float64
```
Multiple Aggregations
```python
# Apply multiple aggregation functions
agg_data = df.groupby('City').agg({
    'Marks': ['mean', 'max', 'min', 'count']
})
print(agg_data)
```
Output:
```
            Marks
             mean   max   min count
City
Mumbai  90.000000  90.0  90.0     1
Nashik  78.000000  78.0  78.0     1
Pune    86.833333  88.0  84.5     3
```
The multi-level column headers from agg can be annoying to work with downstream. I'll often flatten them with agg_data.columns = ['_'.join(col) for col in agg_data.columns] right after.
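Here's that flattening trick in full, on a reduced version of the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Pune', 'Mumbai', 'Pune'],
    'Marks': [88.0, 90.0, 84.5],
})

agg_data = df.groupby('City').agg({'Marks': ['mean', 'max']})
print(agg_data.columns.tolist())  # [('Marks', 'mean'), ('Marks', 'max')]

# Flatten the MultiIndex columns into single underscore-joined strings
agg_data.columns = ['_'.join(col) for col in agg_data.columns]
print(agg_data.columns.tolist())  # ['Marks_mean', 'Marks_max']
```

After flattening, downstream code can refer to plain column names like Marks_mean instead of tuples.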
5. Merging and Joining DataFrames
Combining DataFrames is something I do constantly, and I still occasionally get the join type wrong on the first try.
Creating a Second DataFrame
```python
department = {
    'Student_ID': [1, 2, 3, 4],
    'Department': ['CS', 'IT', 'CS', 'ENTC']
}
df_dept = pd.DataFrame(department)
print(df_dept)
```
Merging DataFrames
If you've used SQL, merge will feel familiar. It's doing the same thing as a JOIN:
```python
# Inner join on Student_ID
merged_df = pd.merge(df, df_dept, on='Student_ID', how='inner')
print(merged_df)
```
Output:
```
   Student_ID   Name  Marks    City Result Department
0           1   Amit   88.0    Pune   Pass         CS
1           2  Sneha   90.0  Mumbai   Pass         IT
2           3  Rahul   84.5    Pune   Pass         CS
3           4  Pooja   78.0  Nashik   Pass       ENTC
```
Student_ID 5 disappears here because it doesn't exist in df_dept. That's inner join behavior. Use how='left' if you want to keep all rows from the left DataFrame.
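When I'm not sure which rows matched, I add indicator=True to the merge: it appends a _merge column telling you where each row came from. A small standalone sketch:

```python
import pandas as pd

df = pd.DataFrame({'Student_ID': [1, 2, 5], 'Name': ['Amit', 'Sneha', 'Amit']})
df_dept = pd.DataFrame({'Student_ID': [1, 2], 'Department': ['CS', 'IT']})

# Left join keeps every row from df; indicator=True flags unmatched rows
merged = pd.merge(df, df_dept, on='Student_ID', how='left', indicator=True)

# Rows marked 'left_only' had no match in df_dept
print(merged[merged['_merge'] == 'left_only'])
```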
Joining DataFrames
Joining is another way to combine DataFrames, typically using index:
```python
# Set Student_ID as index
df_indexed = df.set_index('Student_ID')
df_dept_indexed = df_dept.set_index('Student_ID')

# Left join (keeps all rows from the left DataFrame)
joined_df = df_indexed.join(df_dept_indexed)
print(joined_df)
```
Output:
```
             Name  Marks    City Result Department
Student_ID
1            Amit   88.0    Pune   Pass         CS
2           Sneha   90.0  Mumbai   Pass         IT
3           Rahul   84.5    Pune   Pass         CS
4           Pooja   78.0  Nashik   Pass       ENTC
5            Amit   88.0    Pune   Pass        NaN
```
Student_ID 5 has NaN for Department because they don't exist in the department DataFrame. That's expected with a left join - you keep all your left rows and fill in NaN where there's no match on the right.
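If those NaNs shouldn't stay NaN, you can replace them with an explicit label right after the join. A sketch, where 'Unassigned' is just a placeholder name I'm picking for illustration:

```python
import pandas as pd

# Minimal stand-in for the joined result above: student 5 has no department
joined = pd.DataFrame(
    {'Name': ['Amit'], 'Department': [None]},
    index=pd.Index([5], name='Student_ID'),
)

# Replace missing departments with an explicit placeholder label
joined['Department'] = joined['Department'].fillna('Unassigned')
print(joined)
```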
Things That Will Bite You
A few mistakes I see (and have made) regularly:
- Skipping the null check before doing math. Aggregations on columns with NaN silently skip those rows, so your counts will be off.
- Not checking dtypes before a merge. If Student_ID is int64 in one DataFrame and object in another, older pandas versions silently produce an empty result; newer versions raise a ValueError. Either way, align the key dtypes before merging.
- Modifying a slice of a DataFrame directly instead of using .copy(). Pandas will sometimes raise a SettingWithCopyWarning and sometimes just fail silently. Use df.copy() when you need an independent copy.
- Forgetting to call reset_index(drop=True) after filtering. The row indices stay as they were in the original, which causes confusing behavior when you iterate or do index-based operations later.
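The dtype pitfall is cheap to guard against: compare the key dtypes and cast one side before merging. A defensive sketch:

```python
import pandas as pd

df_left = pd.DataFrame({'Student_ID': [1, 2, 3]})           # int64 key
df_right = pd.DataFrame({'Student_ID': ['1', '2', '3'],     # object (string) key
                         'Department': ['CS', 'IT', 'CS']})

# Align the key dtypes before merging to avoid empty results or errors
if df_left['Student_ID'].dtype != df_right['Student_ID'].dtype:
    df_right['Student_ID'] = df_right['Student_ID'].astype(df_left['Student_ID'].dtype)

merged = pd.merge(df_left, df_right, on='Student_ID', how='inner')
print(len(merged))  # 3 — all three keys match once the dtypes agree
```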
Where to Go Next
Once you've got the basics solid, pivot tables are worth learning - they're the fastest way to reshape data for summary views. Time series operations (resample, rolling, date indexing) are also worth the investment if you're working with any kind of time-stamped data. And if your DataFrames are getting large, switching from apply with a lambda to vectorized operations can make a significant difference in runtime.
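To make the apply-versus-vectorized point concrete, here's the Pass/Fail column from earlier done both ways. The vectorized version runs the comparison in one pass instead of calling a Python lambda per row, which matters once the Series gets large:

```python
import numpy as np
import pandas as pd

marks = pd.Series(np.random.default_rng(0).integers(0, 100, 100_000))

# apply with a lambda evaluates Python code row by row
result_apply = marks.apply(lambda x: 'Pass' if x >= 40 else 'Fail')

# np.where does the same comparison as a single vectorized operation
result_vec = pd.Series(np.where(marks >= 40, 'Pass', 'Fail'))

print(result_apply.equals(result_vec))  # True — same output, far less Python overhead
```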
Final Thoughts
These operations cover maybe 80% of what I do day-to-day with Pandas. The rest is usually domain-specific or edge-case stuff that you pick up as you run into it. The best way to get comfortable is to grab a real dataset, not a toy one, and actually work through a question you care about.
Connect With Me
I'm @amitdevx on GitHub and on LinkedIn. If you're working on something interesting with Pandas or data pipelines, I'd genuinely love to hear about it.