An In-Depth Guide to pandas Dataframes

Pandas is one of the most popular and powerful data analysis libraries in Python. At the core of pandas is the dataframe – a two-dimensional tabular data structure with labeled axes.

Mastering dataframes is key to unlocking the full potential of pandas for data manipulation, analysis, and visualization. In this comprehensive guide, we'll cover everything you need to know to work efficiently with pandas dataframes.

What is a pandas DataFrame?

A pandas dataframe is a 2D labeled data structure with columns of potentially different types. It is similar to a SQL table or Excel spreadsheet.

Some key properties of pandas dataframes:

  • Heterogeneous column types (each column has a single dtype, but different columns can differ)
  • Size mutable
  • Labeled axes (rows and columns)
  • Can perform arithmetic operations on rows and columns

Let's create a simple dataframe about popular superheroes:

import pandas as pd

data = {'Name': ['Batman', 'Superman', 'Wonder Woman'],
        'Alter Ego': ['Bruce Wayne', 'Clark Kent', 'Diana Prince'],
        'First Appearance': [1939, 1938, 1941]}

df = pd.DataFrame(data)

print(df)
           Name     Alter Ego  First Appearance
0        Batman   Bruce Wayne              1939
1      Superman    Clark Kent              1938
2  Wonder Woman  Diana Prince              1941

We passed in a Python dictionary where:

  • The keys become the column names
  • The values become each column's data

Easy! This simplicity of constructing dataframes is why pandas is so popular for data analysis in Python.

Now let's dive deeper.

Essential DataFrame Attributes and Properties

Shape and Size

Use the .shape attribute to view the dimensions of the dataframe. This returns a tuple with:

  • Number of rows
  • Number of columns
print(df.shape) # (3, 3)

For just the number of rows or columns, use len(df) and len(df.columns) (pandas dataframes have no .nrows or .ncols attributes):

print(len(df))         # 3
print(len(df.columns)) # 3

Column Data Types

Use .dtypes to view the data type of each column:

print(df.dtypes)

```
Name                object
Alter Ego           object
First Appearance     int64
dtype: object
```

pandas inferred the data types from the input data, but you can explicitly convert columns to different types using .astype().
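For instance, a minimal sketch of an explicit conversion, reusing the superhero dataframe built above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Batman', 'Superman', 'Wonder Woman'],
                   'First Appearance': [1939, 1938, 1941]})

# Explicitly convert the inferred int64 year column to float64
df['First Appearance'] = df['First Appearance'].astype('float64')
print(df.dtypes)
```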

Index

Each row has an index label by default (0, 1, 2, etc.). These can be accessed via the .index property:

print(df.index)

# RangeIndex(start=0, stop=3, step=1)  

The index is mutable and can be set to any list or array of labels.
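As a short sketch, the default RangeIndex can be replaced by assigning any equal-length sequence of labels (the labels below are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Batman', 'Superman', 'Wonder Woman'],
                   'First Appearance': [1939, 1938, 1941]})

# Assigning to .index replaces the default 0, 1, 2 labels
df.index = ['bat', 'sup', 'ww']
print(df.index)
```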

Columns

Columns can be accessed via the .columns property:

print(df.columns)

# Index(['Name', 'Alter Ego', 'First Appearance'], dtype='object')

This returns an index of all column labels.

Values

The underlying 2D NumPy array containing the data can be accessed via .values (or the newer .to_numpy() method):

print(df.values)

```
[['Batman' 'Bruce Wayne' 1939]
 ['Superman' 'Clark Kent' 1938]
 ['Wonder Woman' 'Diana Prince' 1941]]
```

Info on Null Values

Call .info() to get a concise summary of the dataframe including total non-null values and data types for each column:

df.info()  # .info() prints its summary directly and returns None

```
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Name              3 non-null      object
 1   Alter Ego         3 non-null      object
 2   First Appearance  3 non-null      int64
```

This is super useful for initial data validation and cleaning.

We've covered the basics – now let's learn how to manipulate, transform and analyze the data within our dataframe!

Selecting, Adding and Removing Columns

pandas provides a rich API for selecting, updating, adding and removing columns.

Select Single Column

Use column name string as key:

hero_names = df['Name']
print(hero_names)

# 0          Batman
# 1        Superman
# 2    Wonder Woman
# Name: Name, dtype: object

Or use attribute access (this only works when the column name is a valid Python identifier – a name with a space like 'Alter Ego' has no attribute equivalent):

print(df.Name)

Select Multiple Columns

Pass in a list of column names:

two_cols = df[['Name', 'First Appearance']]

print(two_cols)
           Name  First Appearance
0        Batman              1939
1      Superman              1938
2  Wonder Woman              1941

Add a New Column

Assign the new column to the dataframe:

df['Publisher'] = ['DC', 'DC', 'DC']

print(df)
           Name     Alter Ego  First Appearance Publisher
0        Batman   Bruce Wayne              1939        DC
1      Superman    Clark Kent              1938        DC
2  Wonder Woman  Diana Prince              1941        DC

The column is appended to the end by default. Use df.insert(loc, column, value) to insert at a specific location.
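A quick sketch of df.insert() on a small two-column frame, placing a Publisher column between the existing columns:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Batman', 'Superman'],
                   'Alter Ego': ['Bruce Wayne', 'Clark Kent']})

# Insert at integer position 1, i.e. between Name and Alter Ego
df.insert(1, 'Publisher', ['DC', 'DC'])
print(list(df.columns))
```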

Remove Column

Use pop() or del df[col_name]:

df.pop('Publisher')  # Removes and returns column data
del df['Alter Ego']  # Removes column without returning data

That covers manipulating dataframe columns. Now let's focus on the rows.

Selecting and Indexing Rows

pandas provides a rich API for slicing and dicing dataframe rows including:

  • Positional-based (iloc)
  • Label-based (loc)
  • Boolean indexing

Let's explore examples of each method.

Positional Indexing with iloc

Use .iloc[] to select rows by integer position.

Select row at index 1:

second_row = df.iloc[1]
print(second_row)

# Name                  Superman
# Alter Ego           Clark Kent
# First Appearance          1938
# Name: 1, dtype: object

Slice rows from position 2 (inclusive) to 4 (exclusive) – slicing past the end is allowed and simply returns the rows that exist:

two_rows = df.iloc[2:4]

iloc also allows selecting on both rows and columns simultaneously.
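A small sketch of simultaneous row and column selection with iloc, using a row slice and a list of column positions:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Batman', 'Superman', 'Wonder Woman'],
                   'Alter Ego': ['Bruce Wayne', 'Clark Kent', 'Diana Prince'],
                   'First Appearance': [1939, 1938, 1941]})

# First two rows, and the first and third columns by position
subset = df.iloc[0:2, [0, 2]]
print(subset)
```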

Label Indexing with loc

.loc[] selects rows by index label rather than integer position. With the default RangeIndex, the labels happen to be the integers 0, 1, 2:

first_row = df.loc[0]

Label-based lookup becomes more useful once you set a meaningful index:

by_name = df.set_index('Name')
wonder_woman_data = by_name.loc['Wonder Woman']

Powerful for retaining associations when indexing!

Boolean Indexing

Here we pass in a boolean Series to filter rows.

Select superheroes that first appeared after 1938:

recent_heroes = df[df['First Appearance'] > 1938]

print(recent_heroes)

#            Name     Alter Ego  First Appearance
# 2  Wonder Woman  Diana Prince              1941

The filtered dataframe contains only rows where the boolean condition held True.

This just scratches the surface of the filtering capabilities. You can pass multiple boolean conditions, custom functions that return booleans, and more.
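For example, here is a hedged sketch of combining two conditions (the startswith filter is just an illustrative choice):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Batman', 'Superman', 'Wonder Woman'],
                   'First Appearance': [1939, 1938, 1941]})

# Combine conditions with & (and) or | (or);
# each condition needs its own parentheses
mask = (df['First Appearance'] > 1938) & df['Name'].str.startswith('B')
print(df[mask])
```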

Now that you know how to slice and filter rows, let's cover some common transformations.

Transforming DataFrames

I frequently need to transform dataframes for analysis by:

  • Adding new columns
  • Removing missing data
  • Formatting columns
  • Encoding categoricals

pandas provides vectorized dataframe transformations for speed and concision.

Let's walk through examples of essential transformations:

Add New Columns

This typically involves element-wise operations or using existing columns to derive new data.

Calculate the difference between current year and first appearance:

import datetime as dt

current_year = dt.date.today().year  

df['Age'] = current_year - df['First Appearance']

print(df)

You can apply custom logic to derive new columns as well – the world is your oyster!
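One way to sketch custom per-element logic is .apply() with a lambda (the 1956 cutoff below is just an illustrative assumption, not part of the original example):

```python
import pandas as pd

df = pd.DataFrame({'First Appearance': [1939, 1938, 1941]})

# Label each row with element-wise custom logic
# (the 1956 cutoff is an illustrative assumption)
df['Era'] = df['First Appearance'].apply(
    lambda year: 'Golden Age' if year < 1956 else 'Silver Age')
print(df['Era'].tolist())
```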

Formatting Columns

Use pandas vectorized string methods to format column values:

df['Publication Date'] = df['First Appearance'].astype(str) + ' (MD)'

print(df['Publication Date'])

# 0    1939 (MD)
# 1    1938 (MD)
# 2    1941 (MD)
# Name: Publication Date, dtype: object

Much more efficient than iteration or list comprehensions.

Filter Out Missing Data

Null values propagate through computations. Use .dropna() to filter rows with NA/missing values:

df = df.dropna()  # Removes rows with ANY null values (returns a new dataframe)

Options such as thresh and subset let you drop only rows that fall below a certain count of non-missing values, or consider only certain columns.
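A small sketch of the thresh option, which keeps rows with at least a given number of non-null values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0]})

# Keep only rows that have at least 2 non-null values
kept = df.dropna(thresh=2)
print(kept)
```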

Encode Categoricals

Use .astype() to convert object columns into encoded integer representations:

df['name_enc'] = df['Name'].astype('category').cat.codes

Many models require categorical features to be encoded as integers first.

The main takeaway is pandas vectorization allows you to make sweeping dataframe changes with 1-2 lines of code – leading to fast, clean transformations.

Up next, let's go over performing analysis on our dataframe.

Data Analysis Methods

While pandas provides tons of analysis methods, I find myself using 3 the most:

  1. Aggregates
  2. Grouping and pivoting
  3. Merging/joining

These simple methods enable complete exploratory data analysis workflows.

1. Aggregates

Use builtin aggregate methods like .mean() to understand column distributions:

print(df['First Appearance'].mean()) # 1939.3333333333333

Other useful aggregates: count(), max(), min(), sum(), median() etc.

These generally exclude missing data and operate column-wise by default. Pass axis=1 to aggregate row-wise instead.
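A quick sketch of the two directions on a toy frame (the q1/q2 columns are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'q1': [10, 20], 'q2': [30, 40]})

# Default: aggregate down each column, one value per column
col_sums = df.sum()        # q1 -> 30, q2 -> 70
# axis=1: aggregate across each row, one value per row
row_sums = df.sum(axis=1)  # 40, 60
print(col_sums, row_sums)
```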

2. Groupby

The .groupby() method is perfect for aggregated analysis on column groupings.

Group by publisher and take the min and max years:

df.groupby('Publisher')[['First Appearance']].agg(['min', 'max'])

          First Appearance
                       min   max
Publisher
DC                    1938  1941

Such a common pattern that groupby is a must-know tool for performant summarized analysis.

3. Joins and Merging

Merge two pandas objects based on one or more keys:

users = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})

order_amounts = pd.DataFrame({'id': [1, 2, 3], 'order_total': [34.99, 16.50, 98.43]})

pd.merge(users, order_amounts, on='id')
   id     name  order_total
0   1    Alice        34.99
1   2      Bob        16.50
2   3  Charlie        98.43

This opens the door to joining all kinds of tabular datasets together for analysis.

Hopefully you now have pandas analysis basics covered end-to-end! We went from creation to transformation to analysis.

Before wrapping up, I want to provide some quick tips and tricks that I constantly use working with pandas.

Top pandas Tips and Tricks

Here are 3 tips to level up your pandas skills:

1. Be Deliberate About Copies vs. Views

pandas sometimes returns a copy of the data when slicing rather than a view, so assigning through a chained selection (e.g. df[df['x'] > 0]['y'] = 1) may silently modify a throwaway copy.

Use df.copy() when you want an explicit, independent copy, and route assignments through a single .loc call:

df.loc[df['First Appearance'] > 1938, 'Modern'] = True

Note that pd.set_option('mode.chained_assignment', None) only silences the SettingWithCopyWarning – it does not make slicing return lightweight views.

2. Monitor Memory Usage

Keep an eye on memory consumption via:

print(df.memory_usage(index=True, deep=True).sum() / 1024**2)  # size in MB

Data exploration workflows create lots of temporary dataframes – important to clean up objects when done to avoid OOM crashes.

Which brings me to…

3. Minimize Memory Overhead with Category Types

Reduce memory usage by converting object columns to category types:

df['col'] = df['col'].astype('category')

Categoricals store labels as integers mapped to a lookup rather than objects. Major savings!
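As a rough sketch of the savings, compare a repeated string column against its categorical equivalent (the sizes will vary by platform, but the category version should be far smaller):

```python
import pandas as pd

# A low-cardinality string column repeated many times
s = pd.Series(['DC', 'Marvel'] * 10_000)
cat = s.astype('category')

obj_bytes = s.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)  # the category version is far smaller
```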

These 3 tips will accelerate your pandas proficiency – use them daily!

We've covered a ton of ground on essential pandas dataframe usage. Now over to you! I sincerely hope you feel empowered to start analyzing tabular data programmatically with pandas.