24 February 2018

How to Drop Only Local Duplicates

drop_duplicates_while_keeping_order

This notebook is available on GitHub.

Context

The problem encountered was more or less a customer journey. A customer may first do A, then B, then B again, then C three times, and then B again twice: a series like ABBCCCBB. The goal is to remove duplicates among neighbouring events; if there is another element between two identical elements, those two elements are not duplicates. In the example above, the desired final result is ABCB. We first try to achieve this with the drop_duplicates method of a Pandas DataFrame. In this blog, I would like to share the frustration and the lesson I learned from solving this problem.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np

Problem

Unfortunately, there isn't a built-in function in Pandas that does exactly this. Let's first see what we get with the built-in drop_duplicates.

In [2]:
dat = pd.DataFrame({'event': list('ABBCCCBB')})
dat
Out[2]:
event
0 A
1 B
2 B
3 C
4 C
5 C
6 B
7 B
In [3]:
dat.drop_duplicates()
Out[3]:
event
0 A
1 B
3 C

Solution in a Simple Case

The method correctly finds that there are only three unique elements, but it drops the trailing B's we actually want to keep. One vital observation leads to the solution: if a cell differs from the cell directly above it, then it should be kept. So let's shift the data frame downwards, so that each cell and the cell above it sit in the same row.

In [4]:
dat.loc[:, 'event_shifted'] = dat.event.shift()
dat.loc[:, 'is_different'] = dat.event != dat.event_shifted
dat
Out[4]:
event event_shifted is_different
0 A NaN True
1 B A True
2 B B False
3 C B True
4 C C False
5 C C False
6 B C True
7 B B False

Therefore, if we keep only the rows where is_different is True, the problem is solved.

In [5]:
dat.loc[dat.is_different, ['event']]
Out[5]:
event
0 A
1 B
3 C
6 B

Retrospectively, this makes total sense: our definition of "duplicates" is limited to the row directly above, rather than the whole column as drop_duplicates assumes.
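In fact, once this observation is made, the temporary columns are not even needed; the same filter can be written as a single expression. This is just a compact restatement of the same idea, shown here as a sketch:

# Compare each event to the one directly above; keep rows that differ.
dat.loc[dat.event != dat.event.shift(), ['event']]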

A Little More Complex Case

While this simple solution works like magic, what about the situation with multiple columns? For example, instead of one customer we now have two. Running the same solution will occasionally go wrong, as in the example below.

In [6]:
dat = pd.DataFrame({'event': list('ABBCCCBBBBCCCBB'), 'customer_id': [1]*8 + [2]*7})
dat
Out[6]:
customer_id event
0 1 A
1 1 B
2 1 B
3 1 C
4 1 C
5 1 C
6 1 B
7 1 B
8 2 B
9 2 B
10 2 C
11 2 C
12 2 C
13 2 B
14 2 B
In [7]:
dat.loc[:, 'event_shifted'] = dat.event.shift()
dat.loc[:, 'is_different'] = dat.event != dat.event_shifted
dat.loc[dat.is_different, ['customer_id', 'event']]
Out[7]:
customer_id event
0 1 A
1 1 B
3 1 C
6 1 B
10 2 C
13 2 B

The first event of customer 2 was removed because it is the same as the last event of customer 1. Therefore the customer id should also be compared.

In [8]:
dat = pd.DataFrame({'event': list('ABBCCCBBBBCCCBB'), 'customer_id': [1]*8 + [2]*7})
shifted = dat.shift()
is_different = (dat.customer_id != shifted.customer_id) | (dat.event != shifted.event)
dat.loc[is_different]
Out[8]:
customer_id event
0 1 A
1 1 B
3 1 C
6 1 B
8 2 B
10 2 C
13 2 B

Now we get the correct final data set.
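As a side note, the same result can be obtained with a grouped shift, so that the comparison never crosses a customer boundary. This variant is my own addition rather than part of the original walkthrough, but it relies only on the standard GroupBy.shift method:

# Shift within each customer group; the first row of each group
# becomes NaN after shifting and therefore always compares as different.
is_different = dat.event != dat.groupby('customer_id').event.shift()
dat.loc[is_different, ['customer_id', 'event']]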

What We Learned

We found a simple solution that drops duplicates only across neighbouring rows. It uses only Pandas built-in methods, with no iteration over the rows, so it stays fast.
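For reuse, the whole recipe fits into a small helper. The function below is only a sketch of one possible generalisation; the name drop_local_duplicates and the subset parameter are my own invention, not part of Pandas:

import pandas as pd

def drop_local_duplicates(df, subset=None):
    # Keep a row only if it differs from the row directly above
    # in at least one of the compared columns (all columns by default).
    cols = list(subset) if subset is not None else list(df.columns)
    shifted = df[cols].shift()
    is_different = (df[cols] != shifted).any(axis=1)
    return df.loc[is_different]

# Usage on the two-customer example from above:
dat = pd.DataFrame({'event': list('ABBCCCBBBBCCCBB'),
                    'customer_id': [1]*8 + [2]*7})
drop_local_duplicates(dat, subset=['customer_id', 'event'])

One caveat: NaN never compares equal to anything, so rows containing NaN in the compared columns are always kept.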