Pivot Table with Python
Abstract
PivotTables facilitate the exploration of data to uncover insights that would have otherwise remained hidden without the construction thereof or the implementation of some arduous mathematical processes. In this notebook, we will contruct a PivotTable using the Pandas library for the Python programming language. The data we will utilize to perform the data analysis will be a sample of fictitious sales transactions that occurred on December 23rd, 2021.
Python Packages
The packages we will use for the anaylisys of the sales transaction dataset will be NumPy and Pandas. Begin by importing the packages:
import numpy as np
import pandas as pd
Data
The sales transaction dataset can be downloaded at the following link: https://www.kaggle.com/kevinbarreiro/sales-transactions. Once downloaded, we need to provide an appropriate file path to the CSV file to import the dataset. Upon the CSV file being located and read by a function within the Pandas package we imported in the previous step, a DataFrame will be constructed from the CSV file and referenced by a variable we will call stdf
, which means sales transaction dataframe.
stdf = pd.read_csv('sales-transaction.csv')print(stdf)
Data Cleaning
Notice that the data in the Amount column of the stdf
dataframe is of the str
data type because of the dollar symbol attached to each amount. We need to remove the dollar symbol from each amount and cast the column to the float
data type. No other columns will be modified for the sake of this notebook.
stdf['Amount'] = stdf['Amount'].str\
.replace(r'$', '', regex=True)\
.astype(float)print(stdf)
Pivot Table
Given that the data has been imported and cleaned up, we will construct two PivotTables from the stdf
dataframe containing sales transaction data for December 23rd, 2021.
Revenue by Region
table = pd.pivot_table(
stdf,
index=['Region'],
margins=True,
margins_name='Grand Total',
aggfunc=np.sum)print(table.filter(['Amount']))
The Revenue by Region PivotTable displays revenue accumulated from sales transactions on December 23rd, 2021 for each region, with the last row displaying the total revenue collected on that date from sales transactions.
Product Revenue by Region
table = pd.pivot_table(
stdf,
index=['Region'],
columns=['Product'],
margins=True,
margins_name='Grand Total',
aggfunc=np.sum)print(table['Amount'])
The Product Revenue by Region PivotTable displays the revenue accumulated from sales transactions on December 23rd, 2021 for each product, by region. The Grand Total column displays the revenue accumulated for each region for all products in the region, while the Grand Total row displays the total revenue accumulated for each product in all regions. The value in the last row and last column displays the total revenue accumulated in all regions and products.
Region Revenue by Source and Time of Day
table = pd.pivot_table(
stdf,
index=['Source of Payment', 'Occurred At'],
columns=['Region'],
margins=True,
margins_name='Grand Total',
aggfunc=np.sum)print(table['Amount'])
The Region Revenue by Source and Ocurred At PivotTable displays the revenue accumulated from sales transactions on December 23rd, 2021 organized by the source of the payment, namely either through Email, In-Store, or the Web, then by the actual time the transaction took place in each source. The PivotTable then segments the sales transactions further into its corresponding region. The Grand Total column and row is the same concept as the previous PivotTable we constructed with the value in the last row and the last column displaying the total revenue collected from all sales transactions that occurred.
Conclusion
Thanks for reading; I hope it was useful.
Best regards,
Kevin Barreiro [themastersdev]