05 March 2017

How I Get Cross-sell

This Jupyter Notebook is available on GitHub.

Context

Today I show to how to generate a report for cross-sell performance. The question is how many customers bought product a and b, which is often presented on board meetings for summarising previous retail performance and giving hint for making decision of product bundles.

To demonstrate the task I make random transaction dataset.

In [1]:
import numpy as np
import pandas as pd
from itertools import combinations
In [2]:
dat = pd.DataFrame(
    {
        'customer_id': np.random.randint(0, 3, 30),
        'product_id': np.random.choice(['a', 'b', 'c'], 30)    
    }
)
dat.head()
Out[2]:
customer_id product_id
0 0 c
1 2 b
2 2 a
3 2 c
4 0 c

Target

The result table will look something like this.

In [3]:
fake_array = np.reshape(np.random.randint(1, 5, 9), (3, 3))
# to make it symmetric
sym_data = fake_array + fake_array.T
# zero elements on the diagnal
sym_data[list(range(3)), list(range(3))] = 0
fake_res = pd.DataFrame(sym_data, columns=['a', 'b', 'c'], index=['a', 'b', 'c'])
fake_res
Out[3]:
a b c
a 0 4 5
b 4 0 3
c 5 3 0

Note: the result has to be symmetric as "Number of customers who bought a and b is absolutely equal to number of customers who bought b and a."

Unique Products Bought

This looks like a simple pivot table. Not so fast! Actually much more complicated than that. For each customer_id all unique product pairs have to be created separatedly.

First we need to know what unique products each customers have bought.

In [4]:
unique_prods = dat.groupby('customer_id')['product_id'].unique()
unique_prods
Out[4]:
customer_id
0    [c, b, a]
1       [a, b]
2    [b, a, c]
Name: product_id, dtype: object

Product Pairs

In this Series, the index encompasses the customer id's from the raw dataset. Then we need to create the product pairs for each customer. To do this, we apply a lambda function to each element of the Series.

In [5]:
unique_prod_pairs = unique_prods.apply(lambda prods: list(combinations(prods, 2)))
unique_prod_pairs
Out[5]:
customer_id
0    [(c, b), (c, a), (b, a)]
1                    [(a, b)]
2    [(b, a), (b, c), (a, c)]
Name: product_id, dtype: object

The final result is just around the corner. We next count the occurance of each pairs. To facilitate that, we make a wholesale list that contains all occurances. Luckily the sum definition for Python list does just for that.

For example,

In [6]:
[1, 2] + [3, 4]
Out[6]:
[1, 2, 3, 4]

So we call the sum method.

In [7]:
all_pairs = unique_prod_pairs.sum()
all_pairs
Out[7]:
[('c', 'b'),
 ('c', 'a'),
 ('b', 'a'),
 ('a', 'b'),
 ('b', 'a'),
 ('b', 'c'),
 ('a', 'c')]

To take advantage of pandas pivot_table function, we make a dataframe, in which the combination of two columns is the product pairs.

In [8]:
unique_prod_pairs_split = pd.DataFrame(all_pairs)
unique_prod_pairs_split
Out[8]:
0 1
0 c b
1 c a
2 b a
3 a b
4 b a
5 b c
6 a c

Make Pivot Table

Finnally this problem has been converted into a privotting task.

In [9]:
raw_pivot_table = pd.pivot_table(unique_prod_pairs_split, index=0, columns=1, aggfunc=len)
raw_pivot_table
Out[9]:
1 a b c
0
a NaN 1.0 1.0
b 2.0 NaN 1.0
c 1.0 1.0 NaN

Hmm, doesn't really look like what we want.

First, it isn't symmetric.

Second, there isn't actually guarantee that for example "a" is in both index and columns.

Well we have already managed the final result but just need a little tweak.

We can impose the index and columns must have all the products. First let's get all the unique products.

In [10]:
unique_prods_list = sorted(dat.product_id.unique().tolist())
unique_prods_list
Out[10]:
['a', 'b', 'c']

This imposition is realised by the reindex method.

In [11]:
pivot_table = raw_pivot_table.reindex(index=unique_prods_list, columns=unique_prods_list)
pivot_table
Out[11]:
1 a b c
0
a NaN 1.0 1.0
b 2.0 NaN 1.0
c 1.0 1.0 NaN

Fill the NA values with zeros.

In [12]:
pivot_table = pivot_table.fillna(0)
pivot_table
Out[12]:
1 a b c
0
a 0.0 1.0 1.0
b 2.0 0.0 1.0
c 1.0 1.0 0.0

Done

Only thing left is to make it symmetric.

In [13]:
pivot_table = pivot_table + pivot_table.T
pivot_table
Out[13]:
1 a b c
0
a 0.0 3.0 2.0
b 3.0 0.0 2.0
c 2.0 2.0 0.0

Awesome, go ahead and check which two products are better to be bundled!

No comments:

Post a Comment