Getting started with Pandas Pivot
Pandas is a powerhouse tool that allows you to do anything and everything with tabular or columnar data and is a vital component of Python’s data science stack. Among its vast array of features, the Pandas Pivot is an incredibly useful method for reshaping your DataFrame. Whether you are a beginner just starting out in data manipulation or an experienced professional looking for ways to make data analysis more efficient, understanding the Pandas Pivot function will be extremely beneficial.
1. Brief Overview of Pandas Pivot
Pandas Pivot is essentially used for data transformation. It provides an easy way to restructure and reorganize your data, converting it from a ’long’ format to a ‘wide’ format, or vice versa. It’s particularly useful when you have repetitive entries in your DataFrame and you want to understand the data from a different angle.
2. Definition and Purpose
The pivot function in Pandas allows you to reshape or pivot the
DataFrame by converting columns into rows and rows into columns. This is
particularly useful for data analysis and visualization, as it allows
for a more structured and readable form of the data.
3. Syntax Overview
The basic syntax of the Pandas Pivot function is:
DataFrame.pivot(index=None, columns=None, values=None)
index: The column whose unique values will become the index of the new DataFrame.columns: The column whose unique values will become the columns of the new DataFrame.values: The column(s) whose values we want to populate in the new DataFrame.
4. Simple Example with Explanation
Suppose we have a DataFrame of sales data like this:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
'Product': ['A', 'B', 'A', 'B'],
'Sales': [100, 150, 200, 50]
})
print(df)
Output:
Date Product Sales
0 2021-01-01 A 100
1 2021-01-01 B 150
2 2021-01-02 A 200
3 2021-01-02 B 50
To get a better view of sales for each product on each date, you can pivot the data.
# Pivot the DataFrame
pivot_df = df.pivot(index='Date', columns='Product', values='Sales')
print(pivot_df)
Output:
Product A B
Date
2021-01-01 100 150
2021-01-02 200 50
In this example, the unique values in the Date column have become the
index of the new DataFrame, the unique Product values have become the
new columns, and the Sales values fill in the new DataFrame.
Parameters used in Pandas Pivot
The pivot function in Pandas is highly flexible, thanks in part to its
various parameters. Understanding these parameters can help you
manipulate your data more effectively.
1. index: Setting Index Column(s)
The index parameter is used to specify the column whose unique values
will become the index of the newly shaped DataFrame. It’s important to
ensure that the column chosen for index doesn’t have duplicate values
for the combination of other columns, or else a ValueError will be
raised.
Example:
# Using 'Date' as the index
pivot_df = df.pivot(index='Date', columns='Product', values='Sales')
2. columns: Specifying Columns
The columns parameter specifies which column’s unique values will be
used as the columns in the reshaped DataFrame.
Example:
# Using 'Product' as the columns
pivot_df = df.pivot(index='Date', columns='Product', values='Sales')
3. values: Choosing Values
The values parameter specifies which column’s values will populate the
cells in the DataFrame. This is the data that you’re typically
interested in analyzing after reshaping.
Example:
# Using 'Sales' as the values
pivot_df = df.pivot(index='Date', columns='Product', values='Sales')
4. aggfunc: Using Aggregate Functions
It’s worth noting that the basic Pandas pivot function doesn’t have an
aggfunc parameter. If you need to perform aggregations on your data
while pivoting, you can use the pivot_table method, which has an
aggfunc parameter to allow this. The aggfunc parameter takes a
function or list of functions to perform on the reshaped data (e.g.,
'sum', 'mean', custom function).
Example:
# Using pivot_table with aggfunc
pivot_df = df.pivot_table(index='Date', columns='Product', values='Sales', aggfunc='sum')
Handling Missing Data in Pivot
When using Pandas Pivot, you may encounter situations where the pivoted
DataFrame has missing data. These are represented as NaN values.
Understanding how to deal with these missing values is crucial for
accurate data analysis.
1. Dealing with NaN Values
After performing a pivot operation, the newly formed DataFrame might
contain NaN values. These typically occur when the original DataFrame
has missing combinations for the specified index and columns.
Example:
import pandas as pd
# Create DataFrame with missing values
df = pd.DataFrame({
'Date': ['2021-01-01', '2021-01-01', '2021-01-02'],
'Product': ['A', 'B', 'A'],
'Sales': [100, 150, 200]
})
# Perform Pandas Pivot
pivot_df = df.pivot(index='Date', columns='Product', values='Sales')
print(pivot_df)
Output:
Product A B
Date
2021-01-01 100 150
2021-01-02 200 NaN
Here, the combination for Date='2021-01-02' and Product='B' is
missing, leading to a NaN value in the pivoted DataFrame.
2. Using fill_value Parameter
The Pandas pivot function itself does not have a fill_value
parameter. However, you can replace NaN values after the pivot
operation using the fillna method on the DataFrame.
Example:
# Fill NaN values with 0
filled_pivot_df = pivot_df.fillna(0)
print(filled_pivot_df)
Output:
Product A B
Date
2021-01-01 100 150
2021-01-02 200 0
Alternatively, if you need to fill missing values while pivoting, you
can use the pivot_table function, which does have a fill_value
parameter.
Example:
# Using pivot_table with fill_value
pivot_table_df = df.pivot_table(index='Date', columns='Product', values='Sales', fill_value=0)
print(pivot_table_df)
Output:
Product A B
Date
2021-01-01 100 150
2021-01-02 200 0
Multi-index Pivoting
Pandas Pivot also provides the capability for multi-index pivoting, allowing you to reshape your data in more complex ways. You can use multiple indexes or specify multiple columns and values to get a hierarchical structure in your pivoted DataFrame.
1. Using Multiple Indexes
If you want to use multiple indexes, the Pandas pivot_table function
is your go-to method. By supplying a list of column names to the index
parameter, you can create a multi-index DataFrame.
Example:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
'City': ['New York', 'Chicago', 'New York', 'Chicago'],
'Product': ['A', 'A', 'B', 'B'],
'Sales': [100, 150, 200, 50]
})
# Perform Pandas Pivot with multiple indexes
multi_index_pivot = pd.pivot_table(df, index=['Date', 'City'], columns='Product', values='Sales')
print(multi_index_pivot)
Output:
Product A B
Date City
2021-01-01 Chicago 150 NaN
New York 100 NaN
2021-01-02 Chicago NaN 50
New York NaN 200
2. Using Multiple Columns and Values
You can also pivot with multiple columns and values by specifying lists
for the columns and values parameters.
Example:
# Pivot with multiple columns and values
multi_col_val_pivot = pd.pivot_table(df, index='Date', columns=['City', 'Product'], values='Sales')
print(multi_col_val_pivot)
Output:
City Chicago New York
Product A B A B
Date
2021-01-01 150.0 NaN 100.0 NaN
2021-01-02 NaN 50.0 NaN 200.0
Aggregate Functions in Pandas Pivot
When it comes to aggregation, the pivot function in Pandas may not
provide a direct way to use aggregate functions. However, you can take
advantage of the pivot_table function, which allows you to use both
built-in and custom aggregate functions.
1. Using Built-in Functions (mean, sum, etc.)
You can easily use built-in functions like mean, sum, max, and
others by passing them to the aggfunc parameter in pivot_table. This
is particularly useful when your DataFrame contains duplicate entries
that you’d like to aggregate.
Example:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
'Product': ['A', 'A', 'B', 'B'],
'Sales': [100, 150, 200, 50]
})
# Perform Pandas Pivot with aggregate function sum
pivot_sum = pd.pivot_table(df, index='Date', columns='Product', values='Sales', aggfunc='sum')
print(pivot_sum)
Output:
Product A B
Date
2021-01-01 250 NaN
2021-01-02 NaN 250
2. Using Custom Functions
If the built-in functions are not sufficient for your needs, you can
create a custom function and pass it to the aggfunc parameter.
Example:
# Custom aggregate function to calculate the range (max - min)
def custom_range(series):
return series.max() - series.min()
# Perform Pandas Pivot with custom aggregate function
pivot_custom = pd.pivot_table(df, index='Date', columns='Product', values='Sales', aggfunc=custom_range)
print(pivot_custom)
Output:
Product A B
Date
2021-01-01 50 NaN
2021-01-02 NaN 150
Combining DataFrames Using Pivot
Pandas offers several ways to combine DataFrames, and you can take
advantage of these methods even after you have pivoted your data. Below
are examples that demonstrate how to combine pivoted DataFrames using
merge() and
<a href="https://www.golinuxcloud.com/pandas-concat-function/" data-type="post" data-id="26953">concat()</a>.
1. Pivot with merge()
If you have multiple DataFrames that you would like to combine based on
a common column, you can first pivot them and then use
<a href="https://www.golinuxcloud.com/pandas-merge-concat-append-join-dataframe/" data-type="link" data-id="https://www.golinuxcloud.com/pandas-merge-concat-append-join-dataframe/">merge()</a>.
Example:
import pandas as pd
# First DataFrame
df1 = pd.DataFrame({
'Date': ['2021-01-01', '2021-01-02'],
'Product': ['A', 'B'],
'Sales': [100, 150]
})
# Second DataFrame
df2 = pd.DataFrame({
'Date': ['2021-01-01', '2021-01-02'],
'Product': ['C', 'D'],
'Sales': [200, 50]
})
# Pivot the DataFrames
pivot_df1 = pd.pivot(df1, index='Date', columns='Product', values='Sales')
pivot_df2 = pd.pivot(df2, index='Date', columns='Product', values='Sales')
# Merge the pivoted DataFrames
merged_pivot = pd.merge(pivot_df1, pivot_df2, on='Date', how='outer')
print(merged_pivot)
Output:
Product A B C D
Date
2021-01-01 100.0 NaN 200.0 NaN
2021-01-02 150.0 NaN NaN 50.0
2. Pivot with concat()
If you would like to concatenate DataFrames along a particular axis, you
can use concat() to achieve this. This method is especially useful if
your DataFrames have the same columns or indices.
Example:
# Concatenate the pivoted DataFrames along columns
concat_pivot = pd.concat([pivot_df1, pivot_df2], axis=1)
print(concat_pivot)
Output:
Product A B C D
Date
2021-01-01 100.0 NaN 200.0 NaN
2021-01-02 150.0 NaN NaN 50.0
In this example, concat() combined the two pivoted DataFrames along
the columns. You can also concatenate along rows by setting axis=0.
Pivot vs. Pivot_Table in Pandas
Both pivot and pivot_table functions in Pandas are used for
reshaping or transforming data. However, they differ in functionality
and flexibility. Understanding their differences can help you choose the
right function for your specific needs.
Comparison of pivot() and pivot_table()
- Uniqueness of Entries:
pivotrequires that the data set has unique index/column combinations. If there are duplicate entries, it will throw a ValueError.pivot_table, on the other hand, is more flexible as it allows duplicate entries by aggregating them using an aggregate function (aggfunc). - Handling Missing Data:
pivot_tableallows you to replace missing values using thefill_valueparameter, butpivotdoes not have this option. - Aggregate Functions:
pivotdoes not support aggregate functions. If you need to perform an aggregation like summing or averaging duplicate entries, you would have to usepivot_tableand itsaggfuncparameter.
When to Use Which?
- Use Pandas
pivotwhen you have a DataFrame with unique index/column pairs and don’t need to perform any aggregation. - Use Pandas
pivot_tablewhen you have duplicate index/column pairs or when you want to perform aggregation or replace missing values.
Examples
Example 1: Using pivot()
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'Date': ['2021-01-01', '2021-01-02'],
'Product': ['A', 'B'],
'Sales': [100, 200]
})
# Pivot DataFrame
pivot_df = df.pivot(index='Date', columns='Product', values='Sales')
print(pivot_df)
Output:
Product A B
Date
2021-01-01 100.0 NaN
2021-01-02 NaN 200.0
Example 2: Using pivot_table()
# Sample DataFrame with duplicate entries
df_duplicate = pd.DataFrame({
'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
'Product': ['A', 'A', 'B', 'B'],
'Sales': [100, 50, 200, 50]
})
# Pivot DataFrame using pivot_table with aggregation
pivot_table_df = pd.pivot_table(df_duplicate, index='Date', columns='Product', values='Sales', aggfunc='sum')
print(pivot_table_df)
Output:
Product A B
Date
2021-01-01 150 NaN
2021-01-02 NaN 250
In summary, Pandas pivot is simpler and faster for basic reshaping
tasks, but if you need more advanced features like handling duplicate
entries and aggregation, then pivot_table is a better choice.
Some Examples using Pandas Pivot
In this section, we’ll tackle some of the most commonly asked questions related to Pandas pivot functionalities on online forums. These questions often come up from users of varying experience levels, so addressing them can provide valuable insights for both beginners and seasoned professionals.
Example 1: Basic Pivot to Reshape Data
This is the simplest form of pivoting and is often asked by beginners.
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
'Date': ['2021-01-01', '2021-01-02', '2021-01-03'],
'Product': ['A', 'B', 'C'],
'Sales': [100, 150, 200]
})
# Basic Pivot
pivot_df = df.pivot(index='Date', columns='Product', values='Sales')
print(pivot_df)
Here, we pivot the ‘Sales’ column around the ‘Date’ and ‘Product’. The resulting DataFrame has dates as rows and products as columns.
Example 2: Reshaping When You Have More Than One Column to Pivot
Frequently asked by users who want to reshape but have more than one column they wish to pivot.
# Create a new DataFrame
df['Profit'] = [10, 20, 30]
# Pivot to reshape for multiple columns
pivot_df = df.pivot(index='Date', columns='Product')
print(pivot_df)
Here, both the ‘Sales’ and ‘Profit’ columns are pivoted around the ‘Date’ and ‘Product’. The result will have a multi-level column index.
Example 3: ‘Index contains duplicate entries, cannot reshape’ Error
This is a common question when users encounter an error while using
Pandas pivot.
# Sample DataFrame with duplicate entries
df_duplicate = pd.DataFrame({
'Date': ['2021-01-01', '2021-01-01'],
'Product': ['A', 'A'],
'Sales': [100, 50]
})
try:
pivot_df = df_duplicate.pivot(index='Date', columns='Product', values='Sales')
except Exception as e:
print(f"An error occurred: {e}")
The error arises because pivot cannot handle duplicate entries. In
such cases, you should use pivot_table with appropriate aggregation.
Common Errors and How to Resolve Them
When working with Pandas pivot and pivot_table functions, you may
encounter some common issues. This section will discuss how to handle
them.
1. ‘Index contains duplicate entries, cannot reshape’
This error often appears when using the Pandas pivot function if your
data has duplicate index/column pairs.
Example:
import pandas as pd
# Sample DataFrame with duplicate entries
df_duplicate = pd.DataFrame({
'Date': ['2021-01-01', '2021-01-01'],
'Product': ['A', 'A'],
'Sales': [100, 50]
})
try:
pivot_df = df_duplicate.pivot(index='Date', columns='Product', values='Sales')
except Exception as e:
print(f"An error occurred: {e}")
Output:
An error occurred: Index contains duplicate entries, cannot reshape
Solution:
Use the pivot_table function with the aggfunc parameter to aggregate
the duplicate entries.
# Using pivot_table to handle duplicates
pivot_table_df = pd.pivot_table(df_duplicate, index='Date', columns='Product', values='Sales', aggfunc='sum')
print(pivot_table_df)
Output:
Product A
Date
2021-01-01 150
2. Memory Errors
Memory errors can occur when you are working with a very large
DataFrame, and you try to pivot it. This is especially true for
pivot_table when aggregation happens, which could create even larger
DataFrames.
Solution:
- Filter your DataFrame to only include the essential columns before pivoting.
- Use inbuilt functions like
chunksizeto read only portions of large files into memory at one time if you are reading from a large file.
Example of Filtering:
# Assume df_large is a large DataFrame
filtered_df = df_large[['Date', 'Product', 'Sales']]
pivot_table_df = pd.pivot_table(filtered_df, index='Date', columns='Product', values='Sales', aggfunc='sum')
Tips and Best Practices
The Pandas pivot function is a versatile tool for reshaping your data.
However, it’s crucial to consider some best practices to make your data
manipulation tasks more efficient and error-free.
1. Performance Considerations
When dealing with large DataFrames, using Pandas pivot can be
computationally intensive and slow. For
better performance, consider the
following tips:
Pre-filter the DataFrame: Filter out unnecessary rows or columns
from the DataFrame before applying the pivot operation to improve
speed.
# Filter the DataFrame to only include relevant data
filtered_df = df[df['Date'] > '2021-01-01']
# Perform pivot operation
pivot_df = filtered_df.pivot(index='Date', columns='Product', values='Sales')
In this example, we first filter the DataFrame to only include rows
where the date is greater than ‘2021-01-01’. Then, we use Pandas
pivot. This will be faster than pivoting the entire DataFrame.
2. Memory Usage
pivot can significantly increase memory usage if your DataFrame has a
large number of unique index or column values. Be cautious when working
with big data sets.
Check memory usage before and after pivot: Use the memory_usage
method to check the DataFrame’s memory consumption before and after the
pivot operation.
# Check memory usage before pivot
print("Memory usage before pivot:", df.memory_usage().sum())
# Perform pivot operation
pivot_df = df.pivot(index='Date', columns='Product', values='Sales')
# Check memory usage after pivot
print("Memory usage after pivot:", pivot_df.memory_usage().sum())
By checking the memory usage before and after the pivot operation, you can have a better understanding of how the operation affects your system’s memory.
Frequently Asked Questions
What is the Difference Between Pandas pivot and pivot_table?
pivot is used for reshaping data and cannot handle duplicate values
for the specified columns. pivot_table can perform aggregations when
you have duplicate entries.
How Do I Handle Duplicate Entries in pivot?
If your DataFrame contains duplicate entries for your specified columns,
you’ll get an error. In this case, you can use pivot_table and specify
an aggregate function like mean or sum to handle duplicates.
Can I Pivot on Multiple Columns?
Yes, you can use multiple columns in the index, columns, or values
parameters using lists. However, this will create multi-index
DataFrames.
What are the Memory Requirements for Using pivot?
The memory requirements can shoot up if there are many unique values.
Always check memory usage using the memory_usage() method before and
after using pivot.
Can I Use Pandas pivot for Time Series Data?
Absolutely. You can pivot time series data to reshape it, but ensure that the date or time column is set as the index for effective pivoting.
How Do I Fill Missing Values in a Pivoted DataFrame?
After using pivot, you can use the fillna() method on the resulting
DataFrame to fill in missing values.
Is It Possible to Use Pandas pivot with String Data?
Yes, pivot works with any data type, but it might not be meaningful to
pivot string data unless you’re organizing it in a particular way.
How Do I Flatten a Multi-index DataFrame After Using pivot?
Use the reset_index() method and then use appropriate DataFrame
methods to rename your columns if needed.
Can I Use a Custom Aggregate Function with pivot?
No, for custom aggregate functions, you would need to use pivot_table,
as pivot doesn’t offer aggregation capabilities.
How Do I Reverse a pivot Operation?
You can use the melt function to reverse a pivot operation and go
from a wide-form DataFrame back to a long-form DataFrame.
Summary
This article has provided a comprehensive guide to using the pivot
function in pandas for Python. We’ve covered everything from basic
syntax and parameters to handling missing data and performance
considerations.
Key Takeaways
Pandaspivotis used for reshaping or reorganizing data in pandas.- Parameters like
index,columns, andvaluescontrol the structure of the pivoted DataFrame. - It’s crucial to be cautious of memory usage and performance, especially when working with large DataFrames.
- For more complex operations and to handle duplicate values,
pivot_tablecan be used as an alternative.
Additional Resources
For those who want to dive deeper into pivoting in pandas, here are some valuable resources:

![Pandas Pivot Simplified [In-Depth Tutorial]](/pandas-pivot/python-pandas-pivot.jpg)
