This Jupyter Notebook is available on GitHub.
Context¶
Today I show to how to generate a report for cross-sell performance. The question is how many customers bought product a and b, which is often presented on board meetings for summarising previous retail performance and giving hint for making decision of product bundles.
To demonstrate the task I make random transaction dataset.
import numpy as np
import pandas as pd
from itertools import combinations
dat = pd.DataFrame(
{
'customer_id': np.random.randint(0, 3, 30),
'product_id': np.random.choice(['a', 'b', 'c'], 30)
}
)
dat.head()
Target¶
The result table will look something like this.
fake_array = np.reshape(np.random.randint(1, 5, 9), (3, 3))
# to make it symmetric
sym_data = fake_array + fake_array.T
# zero elements on the diagnal
sym_data[list(range(3)), list(range(3))] = 0
fake_res = pd.DataFrame(sym_data, columns=['a', 'b', 'c'], index=['a', 'b', 'c'])
fake_res
Note: the result has to be symmetric as "Number of customers who bought a and b is absolutely equal to number of customers who bought b and a."
Unique Products Bought¶
This looks like a simple pivot table. Not so fast! Actually much more complicated than that. For each customer_id
all unique product pairs have to be created separatedly.
First we need to know what unique products each customers have bought.
unique_prods = dat.groupby('customer_id')['product_id'].unique()
unique_prods
Product Pairs¶
In this Series, the index encompasses the customer id's from the raw dataset. Then we need to create the product pairs for each customer. To do this, we apply a lambda function to each element of the Series.
unique_prod_pairs = unique_prods.apply(lambda prods: list(combinations(prods, 2)))
unique_prod_pairs
The final result is just around the corner. We next count the occurance of each pairs. To facilitate that, we make a wholesale list that contains all occurances. Luckily the sum
definition for Python list does just for that.
For example,
[1, 2] + [3, 4]
So we call the sum
method.
all_pairs = unique_prod_pairs.sum()
all_pairs
To take advantage of pandas pivot_table
function, we make a dataframe, in which the combination of two columns is the product pairs.
unique_prod_pairs_split = pd.DataFrame(all_pairs)
unique_prod_pairs_split
Make Pivot Table¶
Finnally this problem has been converted into a privotting task.
raw_pivot_table = pd.pivot_table(unique_prod_pairs_split, index=0, columns=1, aggfunc=len)
raw_pivot_table
Hmm, doesn't really look like what we want.
First, it isn't symmetric.
Second, there isn't actually guarantee that for example "a" is in both index and columns.
Well we have already managed the final result but just need a little tweak.
We can impose the index and columns must have all the products. First let's get all the unique products.
unique_prods_list = sorted(dat.product_id.unique().tolist())
unique_prods_list
This imposition is realised by the reindex
method.
pivot_table = raw_pivot_table.reindex(index=unique_prods_list, columns=unique_prods_list)
pivot_table
Fill the NA values with zeros.
pivot_table = pivot_table.fillna(0)
pivot_table
Done¶
Only thing left is to make it symmetric.
pivot_table = pivot_table + pivot_table.T
pivot_table
Awesome, go ahead and check which two products are better to be bundled!
No comments:
Post a Comment