Mastering Pandas DataFrame Operations
I've been working with Pandas long enough to have made most of the classic mistakes - overwriting a DataFrame I needed, spending 20 minutes debugging a shape mismatch that turned out to be a stray NaN, trying to merge two DataFrames on a column that had different dtypes on each side. This post walks through the operations I use most often, and I'll flag the spots where I've burned time so you don't have to.
Getting Started
First, let's import the essential libraries:
```python
import pandas as pd
import numpy as np
```
1. Creating DataFrames
There are a few ways to get a DataFrame. The two I reach for most are dictionaries for quick test data and read_csv for anything real.
Creating from Dictionary
The dictionary approach is my go-to when I'm prototyping or writing examples:
```python
data = {
    'Student_ID': [1, 2, 3, 4, 5],
    'Name': ['Amit', 'Sneha', 'Rahul', 'Pooja', 'Amit'],
    'Marks': [85, 90, None, 78, 85],
    'City': ['Pune', 'Mumbai', 'Pune', 'Nashik', 'Pune']
}
df = pd.DataFrame(data)
print(df)
```
Output:
```
   Student_ID   Name  Marks    City
0           1   Amit   85.0    Pune
1           2  Sneha   90.0  Mumbai
2           3  Rahul    NaN    Pune
3           4  Pooja   78.0  Nashik
4           5   Amit   85.0    Pune
```
The None in the Marks column gets cast to NaN automatically - which is why Marks shows up as float64 instead of int64. That surprises people the first time they see it.
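If you want to keep an integer column despite missing values, pandas' nullable Int64 dtype (capital I) is one option. A minimal sketch of both behaviors:

```python
import pandas as pd

# None in a numeric column is stored as NaN, which forces float64
marks = pd.Series([85, 90, None, 78, 85])
print(marks.dtype)  # float64

# The nullable Int64 dtype keeps integers alongside missing values (<NA>)
marks_int = pd.Series([85, 90, None, 78, 85], dtype="Int64")
print(marks_int.dtype)  # Int64
```

The nullable dtypes behave slightly differently from NumPy-backed ones in some operations, so whether the cleaner dtype is worth it depends on the rest of your pipeline.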
Loading from CSV Files
For real-world projects, you'll often load data from CSV files:
```python
try:
    df_csv = pd.read_csv("students.csv")
    print(df_csv)
except FileNotFoundError:
    print("Error: CSV file not found.")
```
2. Looking at What You've Got
I wasted a lot of time early on jumping straight into analysis without actually inspecting the data first. Don't do that. Run these before you do anything else.
Quick Data Preview
```python
# View first 5 rows
print(df.head())

# View last 5 rows
print(df.tail())

# Get DataFrame dimensions
print(f"Shape: {df.shape}")  # (rows, columns)
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
```
Understanding DataFrame Structure
```python
# Display column names
print(df.columns.tolist())
# Output: ['Student_ID', 'Name', 'Marks', 'City']

# Get detailed information (info() prints directly; no print() needed)
df.info()
```
DataFrame Info Output:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Student_ID  5 non-null      int64
 1   Name        5 non-null      object
 2   Marks       4 non-null      float64
 3   City        5 non-null      object
```
df.info() is underrated. It shows you dtypes and null counts at a glance, which is way more useful than just printing the DataFrame when you're working with something larger.
Statistical Summary
```python
# Get statistical summary of numerical columns
print(df.describe())
```
Output:
```
       Student_ID      Marks
count    5.000000   4.000000
mean     3.000000  84.500000
std      1.581139   4.932883
min      1.000000  78.000000
25%      2.000000  83.250000
50%      3.000000  85.000000
75%      4.000000  86.250000
max      5.000000  90.000000
```
Notice count is 4 for Marks, not 5. That's your NaN showing up. If those counts don't match across columns, you've got missing data somewhere.
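You can turn that eyeball check into code: count() excludes NaN per column, so any column whose non-null count falls short of the row count has gaps. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Student_ID': [1, 2, 3, 4, 5],
    'Marks': [85, 90, None, 78, 85],
})

# count() skips NaN, so mismatched counts reveal missing data
counts = df.count()
print(counts)

# Names of columns that contain at least one missing value
incomplete = counts[counts < len(df)].index.tolist()
print(incomplete)  # ['Marks']
```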
3. Cleaning Up the Mess
Real data is never clean. Here's what I do in roughly this order every time.
Handling Missing Values
```python
# Detect null values
print(df.isnull())

# Count null values per column
print(df.isnull().sum())  # Marks: 1, everything else: 0

# Fill missing values with the column mean
df['Marks'] = df['Marks'].fillna(df['Marks'].mean())
print(df)
```
After filling, the missing mark (previously NaN) becomes 84.5 (the mean of existing marks).
Whether filling with the mean is the right choice depends on your data. For grades, it's often acceptable. For something like income data, it can silently bias your results in ways that are hard to catch later.
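One alternative worth knowing: the median is less sensitive to outliers than the mean, which matters for skewed data like income. A quick comparison on the same column:

```python
import pandas as pd

marks = pd.Series([85, 90, None, 78, 85])

# Mean fill: pulled around by outliers
mean_filled = marks.fillna(marks.mean())

# Median fill: robust to skew and extreme values
median_filled = marks.fillna(marks.median())

print(mean_filled[2], median_filled[2])  # 84.5 85.0
```

On this tiny example the two barely differ; on a long-tailed distribution they can diverge substantially.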
Removing Duplicates
```python
# Check for duplicates
print(f"Number of duplicates: {df.duplicated().sum()}")

# Remove duplicate rows
df = df.drop_duplicates()
print(df)
```
Modifying Data
```python
# Update specific values using conditions
df.loc[df['Name'] == 'Amit', 'Marks'] = 88
print(df)

# Add a new column based on conditions
df['Result'] = df['Marks'].apply(lambda x: 'Pass' if x >= 40 else 'Fail')
print(df)
```
Output with new column:
```
   Student_ID   Name  Marks    City Result
0           1   Amit   88.0    Pune   Pass
1           2  Sneha   90.0  Mumbai   Pass
2           3  Rahul   84.5    Pune   Pass
3           4  Pooja   78.0  Nashik   Pass
4           5   Amit   88.0    Pune   Pass
```
4. Grouping and Aggregating
groupby is where Pandas starts to feel genuinely powerful. This is how you answer questions like "what's the average score by city" or "how many students are in each department."
Basic Grouping
```python
# Calculate average marks by city
grouped_data = df.groupby('City')['Marks'].mean()
print(grouped_data)
```
Output:
```
City
Mumbai    90.000000
Nashik    78.000000
Pune      86.833333
Name: Marks, dtype: float64
```
Multiple Aggregations
```python
# Apply multiple aggregation functions
agg_data = df.groupby('City').agg({
    'Marks': ['mean', 'max', 'min', 'count']
})
print(agg_data)
```
Output:
```
            Marks
             mean   max   min count
City
Mumbai  90.000000  90.0  90.0     1
Nashik  78.000000  78.0  78.0     1
Pune    86.833333  88.0  84.5     3
```
The multi-level column headers from agg can be annoying to work with downstream. I'll often flatten them with agg_data.columns = ['_'.join(col) for col in agg_data.columns] right after.
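Here's that flattening trick in full, on a reduced version of the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Pune', 'Mumbai', 'Pune'],
    'Marks': [88.0, 90.0, 84.5],
})

agg_data = df.groupby('City').agg({'Marks': ['mean', 'max']})
print(agg_data.columns.tolist())  # [('Marks', 'mean'), ('Marks', 'max')]

# Flatten the MultiIndex columns into single underscore-joined strings
agg_data.columns = ['_'.join(col) for col in agg_data.columns]
print(agg_data.columns.tolist())  # ['Marks_mean', 'Marks_max']
```

After flattening, downstream code can refer to plain column names like Marks_mean instead of tuples.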
5. Merging and Joining DataFrames
Combining DataFrames is something I do constantly, and I still occasionally get the join type wrong on the first try.
Creating a Second DataFrame
```python
department = {
    'Student_ID': [1, 2, 3, 4],
    'Department': ['CS', 'IT', 'CS', 'ENTC']
}
df_dept = pd.DataFrame(department)
print(df_dept)
```
Merging DataFrames
If you've used SQL, merge will feel familiar. It's doing the same thing as a JOIN:
```python
# Inner join on Student_ID
merged_df = pd.merge(df, df_dept, on='Student_ID', how='inner')
print(merged_df)
```
Output:
```
   Student_ID   Name  Marks    City Result Department
0           1   Amit   88.0    Pune   Pass         CS
1           2  Sneha   90.0  Mumbai   Pass         IT
2           3  Rahul   84.5    Pune   Pass         CS
3           4  Pooja   78.0  Nashik   Pass       ENTC
```
Student_ID 5 disappears here because it doesn't exist in df_dept. That's inner join behavior. Use how='left' if you want to keep all rows from the left DataFrame.
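When I'm not sure which rows matched, I add indicator=True to the merge: it appends a _merge column telling you where each row came from. A small standalone sketch:

```python
import pandas as pd

df = pd.DataFrame({'Student_ID': [1, 2, 5], 'Name': ['Amit', 'Sneha', 'Amit']})
df_dept = pd.DataFrame({'Student_ID': [1, 2], 'Department': ['CS', 'IT']})

# Left join keeps every row from df; indicator=True flags unmatched rows
merged = pd.merge(df, df_dept, on='Student_ID', how='left', indicator=True)

# Rows marked 'left_only' had no match in df_dept
print(merged[merged['_merge'] == 'left_only'])
```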
Joining DataFrames
Joining is another way to combine DataFrames, typically using index:
```python
# Set Student_ID as index
df_indexed = df.set_index('Student_ID')
df_dept_indexed = df_dept.set_index('Student_ID')

# Left join (keeps all rows from the left DataFrame)
joined_df = df_indexed.join(df_dept_indexed)
print(joined_df)
```
Output:
```
             Name  Marks    City Result Department
Student_ID
1            Amit   88.0    Pune   Pass         CS
2           Sneha   90.0  Mumbai   Pass         IT
3           Rahul   84.5    Pune   Pass         CS
4           Pooja   78.0  Nashik   Pass       ENTC
5            Amit   88.0    Pune   Pass        NaN
```
Student_ID 5 has NaN for Department because they don't exist in the department DataFrame. That's expected with a left join - you keep all your left rows and fill in NaN where there's no match on the right.
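If those NaNs shouldn't stay NaN, you can replace them with an explicit label right after the join. A sketch, where 'Unassigned' is just a placeholder name I'm picking for illustration:

```python
import pandas as pd

# Minimal stand-in for the joined result above: student 5 has no department
joined = pd.DataFrame(
    {'Name': ['Amit'], 'Department': [None]},
    index=pd.Index([5], name='Student_ID'),
)

# Replace missing departments with an explicit placeholder label
joined['Department'] = joined['Department'].fillna('Unassigned')
print(joined)
```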
Things That Will Bite You
A few mistakes I see (and have made) regularly:
- Skipping the null check before doing math. Aggregations on columns with NaN silently skip those rows, so your counts will be off.
- Not checking dtypes before a merge. If Student_ID is int64 in one DataFrame and object in another, older pandas versions silently produce an empty result; newer versions raise a ValueError. Either way, align the key dtypes before merging.
- Modifying a slice of a DataFrame directly instead of using .copy(). Pandas will sometimes raise a SettingWithCopyWarning and sometimes just fail silently. Use df.copy() when you need an independent copy.
- Forgetting to call reset_index(drop=True) after filtering. The row indices stay as they were in the original, which causes confusing behavior when you iterate or do index-based operations later.
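The dtype pitfall is cheap to guard against: compare the key dtypes and cast one side before merging. A defensive sketch:

```python
import pandas as pd

df_left = pd.DataFrame({'Student_ID': [1, 2, 3]})           # int64 key
df_right = pd.DataFrame({'Student_ID': ['1', '2', '3'],     # object (string) key
                         'Department': ['CS', 'IT', 'CS']})

# Align the key dtypes before merging to avoid empty results or errors
if df_left['Student_ID'].dtype != df_right['Student_ID'].dtype:
    df_right['Student_ID'] = df_right['Student_ID'].astype(df_left['Student_ID'].dtype)

merged = pd.merge(df_left, df_right, on='Student_ID', how='inner')
print(len(merged))  # 3 — all three keys match once the dtypes agree
```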
Where to Go Next
Once you've got the basics solid, pivot tables are worth learning - they're the fastest way to reshape data for summary views. Time series operations (resample, rolling, date indexing) are also worth the investment if you're working with any kind of time-stamped data. And if your DataFrames are getting large, switching from apply with a lambda to vectorized operations can make a significant difference in runtime.
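To make the apply-versus-vectorized point concrete, here's the Pass/Fail column from earlier done both ways. The vectorized version runs the comparison in one pass instead of calling a Python lambda per row, which matters once the Series gets large:

```python
import numpy as np
import pandas as pd

marks = pd.Series(np.random.default_rng(0).integers(0, 100, 100_000))

# apply with a lambda evaluates Python code row by row
result_apply = marks.apply(lambda x: 'Pass' if x >= 40 else 'Fail')

# np.where does the same comparison as a single vectorized operation
result_vec = pd.Series(np.where(marks >= 40, 'Pass', 'Fail'))

print(result_apply.equals(result_vec))  # True — same output, far less Python overhead
```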
Final Thoughts
These operations cover maybe 80% of what I do day-to-day with Pandas. The rest is usually domain-specific or edge-case stuff that you pick up as you run into it. The best way to get comfortable is to grab a real dataset, not a toy one, and actually work through a question you care about.
Connect With Me
I'm @amitdevx on GitHub and on LinkedIn. If you're working on something interesting with Pandas or data pipelines, I'd genuinely love to hear about it.