This Jupyter Notebook is available on GitHub.

Context¶

Today I show to how to generate a report for cross-sell performance. The question is how many customers bought product a and b, which is often presented on board meetings for summarising previous retail performance and giving hint for making decision of product bundles.

To demonstrate the task I make random transaction dataset.

In [1]:

import numpy as np
import pandas as pd
from itertools import combinations

In [2]:

dat = pd.DataFrame(
    {
        'customer_id': np.random.randint(0, 3, 30),
        'product_id': np.random.choice(['a', 'b', 'c'], 30)    
    }
)
dat.head()

Out[2]:

	customer_id	product_id
0	0	c
1	2	b
2	2	a
3	2	c
4	0	c

Target¶

The result table will look something like this.

In [3]:

fake_array = np.reshape(np.random.randint(1, 5, 9), (3, 3))
# to make it symmetric
sym_data = fake_array + fake_array.T
# zero elements on the diagnal
sym_data[list(range(3)), list(range(3))] = 0
fake_res = pd.DataFrame(sym_data, columns=['a', 'b', 'c'], index=['a', 'b', 'c'])
fake_res

Out[3]:

	a	b	c
a	0	4	5
b	4	0	3
c	5	3	0

Note: the result has to be symmetric as "Number of customers who bought a and b is absolutely equal to number of customers who bought b and a."

Unique Products Bought¶

This looks like a simple pivot table. Not so fast! Actually much more complicated than that. For each customer_id all unique product pairs have to be created separatedly.

First we need to know what unique products each customers have bought.

In [4]:

unique_prods = dat.groupby('customer_id')['product_id'].unique()
unique_prods

Out[4]:

customer_id
0    [c, b, a]
1       [a, b]
2    [b, a, c]
Name: product_id, dtype: object

Product Pairs¶

In this Series, the index encompasses the customer id's from the raw dataset. Then we need to create the product pairs for each customer. To do this, we apply a lambda function to each element of the Series.

In [5]:

unique_prod_pairs = unique_prods.apply(lambda prods: list(combinations(prods, 2)))
unique_prod_pairs

Out[5]:

customer_id
0    [(c, b), (c, a), (b, a)]
1                    [(a, b)]
2    [(b, a), (b, c), (a, c)]
Name: product_id, dtype: object

The final result is just around the corner. We next count the occurance of each pairs. To facilitate that, we make a wholesale list that contains all occurances. Luckily the sum definition for Python list does just for that.

For example,

In [6]:

[1, 2] + [3, 4]

Out[6]:

[1, 2, 3, 4]

So we call the sum method.

In [7]:

all_pairs = unique_prod_pairs.sum()
all_pairs

Out[7]:

[('c', 'b'),
 ('c', 'a'),
 ('b', 'a'),
 ('a', 'b'),
 ('b', 'a'),
 ('b', 'c'),
 ('a', 'c')]

To take advantage of pandas pivot_table function, we make a dataframe, in which the combination of two columns is the product pairs.

In [8]:

unique_prod_pairs_split = pd.DataFrame(all_pairs)
unique_prod_pairs_split

Out[8]:

	0	1
0	c	b
1	c	a
2	b	a
3	a	b
4	b	a
5	b	c
6	a	c

Make Pivot Table¶

Finnally this problem has been converted into a privotting task.

In [9]:

raw_pivot_table = pd.pivot_table(unique_prod_pairs_split, index=0, columns=1, aggfunc=len)
raw_pivot_table

Out[9]:

1	a	b	c
0
a	NaN	1.0	1.0
b	2.0	NaN	1.0
c	1.0	1.0	NaN

Hmm, doesn't really look like what we want.

First, it isn't symmetric.

Second, there isn't actually guarantee that for example "a" is in both index and columns.

Well we have already managed the final result but just need a little tweak.

We can impose the index and columns must have all the products. First let's get all the unique products.

In [10]:

unique_prods_list = sorted(dat.product_id.unique().tolist())
unique_prods_list

Out[10]:

['a', 'b', 'c']

This imposition is realised by the reindex method.

In [11]:

pivot_table = raw_pivot_table.reindex(index=unique_prods_list, columns=unique_prods_list)
pivot_table

Out[11]:

1	a	b	c
0
a	NaN	1.0	1.0
b	2.0	NaN	1.0
c	1.0	1.0	NaN

Fill the NA values with zeros.

In [12]:

pivot_table = pivot_table.fillna(0)
pivot_table

Out[12]:

1	a	b	c
0
a	0.0	1.0	1.0
b	2.0	0.0	1.0
c	1.0	1.0	0.0

Done¶

Only thing left is to make it symmetric.

In [13]:

pivot_table = pivot_table + pivot_table.T
pivot_table

Out[13]:

1	a	b	c
0
a	0.0	3.0	2.0
b	3.0	0.0	2.0
c	2.0	2.0	0.0

Awesome, go ahead and check which two products are better to be bundled!

Dr Fei's Cave

05 March 2017

How I Get Cross-sell