Pandas is one of the most popular and powerful data analysis libraries in Python. At the core of pandas is the dataframe – a two-dimensional tabular data structure with labeled axes.
Mastering dataframes is key to unlocking the full potential of pandas for data manipulation, analysis, and visualization. In this comprehensive guide, we'll cover everything you need to know to work efficiently with pandas dataframes.
What is a pandas DataFrame?
A pandas dataframe is a 2D labeled data structure with columns of potentially different types. It is similar to a SQL table or Excel spreadsheet.
Some key properties of pandas dataframes:
- Potentially heterogeneous column types (each individual column holds a single type)
- Size mutable
- Labeled axes (rows and columns)
- Can perform arithmetic operations on rows and columns
Let's create a simple dataframe about popular superheroes:
import pandas as pd
data = {'Name': ['Batman', 'Superman', 'Wonder Woman'],
        'Alter Ego': ['Bruce Wayne', 'Clark Kent', 'Diana Prince'],
        'First Appearance': [1939, 1938, 1941]}
df = pd.DataFrame(data)
print(df)
Name Alter Ego First Appearance
0 Batman Bruce Wayne 1939
1 Superman Clark Kent 1938
2 Wonder Woman Diana Prince 1941
We passed in a Python dictionary where:
- The keys become the column names
- The values become each column's data
Easy! This simplicity of constructing dataframes is why pandas is so popular for data analysis in Python.
Now let's dive deeper.
Essential DataFrame Attributes and Properties
Shape and Size
Use the .shape attribute to view the dimensions of the dataframe. This returns a tuple with:
- Number of rows
- Number of columns
print(df.shape) # (3, 3)
For just the number of rows or columns, index into the shape tuple (pandas dataframes have no .nrows or .ncols attributes):
print(df.shape[0]) # 3 rows
print(df.shape[1]) # 3 columns
Column Data Types
Use .dtypes to view the data type of each column:
print(df.dtypes)
```
Name                object
Alter Ego           object
First Appearance     int64
dtype: object
```
The dataframe inferred data types from the input data, but you can explicitly convert columns to different types using .astype().
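As a quick sketch of how .astype() conversions behave, reusing the superhero data from above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Batman', 'Superman', 'Wonder Woman'],
    'First Appearance': [1939, 1938, 1941],
})

# Convert the int64 year column to floats
df['First Appearance'] = df['First Appearance'].astype(float)
print(df['First Appearance'].dtype)  # float64

# Converting to str yields an object-dtype column
df['First Appearance'] = df['First Appearance'].astype(str)
print(df['First Appearance'].dtype)  # object
```

Conversions that would lose information (e.g. non-numeric strings to int) raise an error rather than silently coercing.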
Index
Each row has an index label by default (0, 1, 2, etc.). These can be accessed via the .index property:
print(df.index)
# RangeIndex(start=0, stop=3, step=1)
The index is mutable and can be set to any list or array of labels.
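To illustrate, here is a minimal sketch of both ways to change the index, using the same superhero data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Batman', 'Superman', 'Wonder Woman'],
    'First Appearance': [1939, 1938, 1941],
})

# Assign an arbitrary list of labels directly
df.index = ['a', 'b', 'c']
print(df.loc['b', 'Name'])  # Superman

# Or promote an existing column to be the index
named = df.set_index('Name')
print(named.loc['Batman', 'First Appearance'])  # 1939
```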
Columns
Columns can be accessed via the .columns property:
print(df.columns)
# Index(['Name', 'Alter Ego', 'First Appearance'], dtype='object')
This returns an index of all column labels.
Values
The underlying 2D NumPy array containing the data can be accessed via .values (or the newer .to_numpy() method):
print(df.values)
```
[['Batman' 'Bruce Wayne' 1939]
 ['Superman' 'Clark Kent' 1938]
 ['Wonder Woman' 'Diana Prince' 1941]]
```
Info on Null Values
Call .info() to get a concise summary of the dataframe, including the non-null count and data type of each column (note that .info() prints directly and returns None, so there is no need to wrap it in print()):
df.info()
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Name              3 non-null      object
 1   Alter Ego         3 non-null      object
 2   First Appearance  3 non-null      int64
```
This is super useful for initial data validation and cleaning.
We've covered the basics – now let's learn how to manipulate, transform, and analyze the data within our dataframe!
Selecting, Adding and Removing Columns
pandas provides a rich API for selecting, updating, adding and removing columns.
Select Single Column
Use column name string as key:
hero_names = df['Name']
print(hero_names)
# 0 Batman
# 1 Superman
# 2 Wonder Woman
Or use attribute access:
print(df.Name)
Select Multiple Columns
Pass in a list of column names:
two_cols = df[['Name', 'First Appearance']]
print(two_cols)
Name First Appearance
0 Batman 1939
1 Superman 1938
2 Wonder Woman 1941
Add a New Column
Assign the new column to the dataframe:
df['Publisher'] = ['DC', 'DC', 'DC']
print(df)
Name Alter Ego First Appearance Publisher
0 Batman Bruce Wayne 1939 DC
1 Superman Clark Kent 1938 DC
2 Wonder Woman Diana Prince 1941 DC
The column is appended to the end by default. Use df.insert(loc, col_name, value) to insert it at a specific position instead.
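A minimal sketch of insert(), placing a Publisher column between Name and First Appearance:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Batman', 'Superman', 'Wonder Woman'],
    'First Appearance': [1939, 1938, 1941],
})

# Insert 'Publisher' as the second column (integer position 1)
df.insert(1, 'Publisher', ['DC', 'DC', 'DC'])
print(list(df.columns))  # ['Name', 'Publisher', 'First Appearance']
```

Unlike plain assignment, insert() modifies the dataframe in place and raises an error if a column with that name already exists.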
Remove Column
Use .pop() or del df[col_name]:
df.pop('Publisher') # Removes and returns column data
del df['Alter Ego'] # Removes column without returning data
That covers manipulating dataframe columns. Now let's focus on the rows.
Selecting and Indexing Rows
pandas provides a rich API for slicing and dicing dataframe rows including:
- Positional-based (iloc)
- Label-based (loc)
- Boolean indexing
Let's explore examples of each method.
Positional Indexing with iloc
Use .iloc[] to select rows by integer position.
Select row at index 1:
second_row = df.iloc[1]
print(second_row)
# Name Superman
# Alter Ego Clark Kent
# First Appearance 1938
Slice rows between 2 inclusive and 4 exclusive:
two_rows = df.iloc[2:4]
iloc also allows selecting rows and columns simultaneously.
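A short sketch of two-axis iloc selection on the superhero data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Batman', 'Superman', 'Wonder Woman'],
    'Alter Ego': ['Bruce Wayne', 'Clark Kent', 'Diana Prince'],
    'First Appearance': [1939, 1938, 1941],
})

# First two rows, first two columns: .iloc[row_slice, col_slice]
subset = df.iloc[0:2, 0:2]
print(subset.shape)  # (2, 2)

# A single cell by integer row and column position
print(df.iloc[2, 0])  # Wonder Woman
```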
Label Indexing with loc
.loc[] selects rows by index label rather than integer position.
first_row = df.loc[0] # With the default RangeIndex, the labels are integers
# String-label lookups work once a column like 'Name' is set as the index:
wonder_woman_data = df.set_index('Name').loc['Wonder Woman']
Powerful for retaining associations when indexing!
Boolean Indexing
Here we pass in a boolean Series to filter rows.
Select superheroes that first appeared after 1938:
recent_heroes = df[df['First Appearance'] > 1938]
print(recent_heroes)
# Name Alter Ego First Appearance
# 2 Wonder Woman Diana Prince 1941
The filtered dataframe contains only rows where the boolean condition held True.
This just scratches the surface of the filtering capabilities. You can pass multiple boolean conditions, custom functions that return booleans, and more.
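As a sketch of combining conditions: each condition must be parenthesized, with & for "and" and | for "or" (Python's plain and/or do not work element-wise):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Batman', 'Superman', 'Wonder Woman'],
    'First Appearance': [1939, 1938, 1941],
})

# Heroes appearing after 1938 whose name starts with 'W'
mask = (df['First Appearance'] > 1938) & (df['Name'].str.startswith('W'))
print(df[mask]['Name'].tolist())  # ['Wonder Woman']
```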
Now that you know how to slice and filter rows, let's cover some common transformations.
Transforming DataFrames
I frequently need to transform dataframes for analysis by:
- Adding new columns
- Removing missing data
- Formatting columns
- Encoding categoricals
pandas provides vectorized dataframe transformations for speed and concision.
Let's walk through examples of essential transformations:
Add New Columns
This typically involves element-wise operations or using existing columns to derive new data.
Calculate the difference between current year and first appearance:
import datetime as dt
current_year = dt.date.today().year
df['Age'] = current_year - df['First Appearance']
print(df)
Apply custom logic into new columns as well – the world is your oyster!
Formatting Columns
Use pandas vectorized string methods to format column values:
df['Publication Date'] = df['First Appearance'].astype(str) + ' (MD)'
print(df['Publication Date'])
# 0    1939 (MD)
# 1    1938 (MD)
# 2    1941 (MD)
Much more efficient than iteration or list comprehensions.
Filter Out Missing Data
Null values propagate through computations. Use .dropna() to filter rows with NA/missing values:
df = df.dropna() # Removes rows with ANY null values (returns a new dataframe)
More options available to remove rows with certain thresholds of missing values.
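As a sketch of those options, using a small frame with deliberately missing values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Batman', None, 'Wonder Woman'],
    'First Appearance': [1939, 1938, np.nan],
})

# thresh=2: keep rows with at least 2 non-null values
print(len(df.dropna(thresh=2)))  # 1 (only the fully populated row survives)

# subset: only consider nulls in the listed columns
print(len(df.dropna(subset=['Name'])))  # 2
```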
Encode Categoricals
Use .astype('category') together with .cat.codes to convert object columns into encoded integer representations:
df['name_enc'] = df['Name'].astype('category').cat.codes
Encoding is required before feeding categorical data into many machine learning models.
The main takeaway is pandas vectorization allows you to make sweeping dataframe changes with 1-2 lines of code – leading to fast, clean transformations.
Up next, let's go over performing analysis on our dataframe.
Data Analysis Methods
While pandas provides tons of analysis methods, I find myself using 3 the most:
- Aggregates
- Grouping and pivoting
- Merging/joining
These simple methods enable complete exploratory data analysis workflows.
1. Aggregates
Use built-in aggregate methods like .mean() to understand column distributions:
print(df['First Appearance'].mean()) # 1939.3333333333333
Other useful aggregates: count(), max(), min(), sum(), median(), etc.
These generally exclude missing data and operate column-wise by default. Pass axis=1 to aggregate row-wise instead.
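To make the axis distinction concrete, here is a minimal sketch on a hypothetical scores frame:

```python
import pandas as pd

scores = pd.DataFrame({'math': [90, 80], 'physics': [70, 100]})

# Default (column-wise): one aggregate per column
print(scores.mean().tolist())        # [85.0, 85.0]

# axis=1 (row-wise): one aggregate per row
print(scores.mean(axis=1).tolist())  # [80.0, 90.0]
```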
2. Groupby
The .groupby() method is perfect for aggregated analysis on column groupings.
Group by publisher and take the min and max years:
df.groupby('Publisher')[['First Appearance']].agg(['min', 'max'])

          First Appearance
                       min   max
Publisher
DC                    1938  1941
Such a common pattern that groupby is a must-know tool for performant summarized analysis.
3. Joins and Merging
Merge two pandas objects based on one or more keys:
users = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
order_amounts = pd.DataFrame({'id': [1, 2, 3], 'order_total': [34.99, 16.50, 98.43]})
pd.merge(users, order_amounts, on='id')
id name order_total
0 1 Alice 34.99
1 2 Bob 16.50
2 3 Charlie 98.43
This opens the door to joining all kinds of datasets together for analysis – structured and unstructured data alike.
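Real datasets rarely match perfectly, and the how parameter controls what happens to non-matching keys. A sketch, assuming one user has no order:

```python
import pandas as pd

users = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'id': [1, 2], 'order_total': [34.99, 16.50]})

# Inner join (the default) keeps only keys present in both frames
print(len(pd.merge(users, orders, on='id')))  # 2

# Left join keeps every user; missing order totals become NaN
left = pd.merge(users, orders, on='id', how='left')
print(len(left))  # 3
print(int(left['order_total'].isna().sum()))  # 1
```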
Hopefully you now have pandas analysis basics covered end-to-end! We went from creation to transformation to analysis.
Before wrapping up, I want to provide some quick tips and tricks that I constantly use working with pandas.
Top pandas Tips and Tricks
Here are 3 tips to level up your pandas skills:
1. Be Explicit About Copies vs Views
pandas sometimes returns a copy of the dataframe when slicing and sometimes a view, and assigning into the result of chained indexing triggers the SettingWithCopyWarning.
Use df.copy() to make the copy explicit whenever you plan to modify a slice:
recent = df[df['First Appearance'] > 1938].copy()
Note that pd.set_option('mode.chained_assignment', None) only silences the warning – it does not make slicing return views – so explicit copies are the safer habit.
2. Monitor Memory Usage
Keep an eye on memory consumption via:
print(df.memory_usage(index=True, deep=True).sum() / 1024**2) # Total usage in MB
Data exploration workflows create lots of temporary dataframes – important to clean up objects when done to avoid OOM crashes.
Which brings me to…
3. Minimize Memory Overhead with Category Types
Reduce memory usage by converting object columns to category types:
df['col'] = df['col'].astype('category')
Categoricals store labels as integers mapped to a lookup rather than objects. Major savings!
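To see the effect, a sketch comparing memory usage of a repetitive string column before and after conversion:

```python
import pandas as pd

# A low-cardinality string column repeated many times
publishers = pd.Series(['DC', 'Marvel'] * 10_000)

as_object = publishers.memory_usage(deep=True)
as_category = publishers.astype('category').memory_usage(deep=True)

# The category version stores small integer codes plus a 2-entry lookup
print(as_category < as_object)  # True
```

The savings grow with the ratio of rows to distinct values; for high-cardinality columns (e.g. unique IDs) categories can actually cost more.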
These 3 tips will accelerate your pandas proficiency – use them daily!
We've covered a ton of ground on essential pandas dataframe usage. Now over to you! I sincerely hope you feel empowered to start analyzing tabular data programmatically with pandas.