30 July 2017

Count Number of Customers at Certain Time Points with Merge_asof

This notebook is available on GitHub.

This will probably look like a trivial example but message I would like to deliver is merge_asof is an awesome tool to merge time series datasets.

It's fairly new, recently added from probably 0.19.0.

First off, let's create some dummy data.

In [1]:
import pandas as pd
In [2]:
signup = pd.DataFrame(pd.date_range('2016-01-01', '2017-01-01', freq='1m'), columns=['signup-date'])
In [3]:
signup
Out[3]:
signup-date
0 2016-01-31
1 2016-02-29
2 2016-03-31
3 2016-04-30
4 2016-05-31
5 2016-06-30
6 2016-07-31
7 2016-08-31
8 2016-09-30
9 2016-10-31
10 2016-11-30
11 2016-12-31

Now your task is to find how many customers you had at the time points as the following.

In [4]:
check_date = pd.DataFrame(
    [pd.datetime(2016, 4, 17), pd.datetime(2016, 5, 15), pd.datetime(2016, 6, 10)],
    columns=['check-date']
)
check_date
Out[4]:
check-date
0 2016-04-17
1 2016-05-15
2 2016-06-10

With merge_asof, the joining keys don't have to be equal. By definition, it's a left join. With the default set-up, each row joins to a row in the right dataframe that has the biggest value for the joining key but no greater than the value in the left dataframe.

Hope you can wrap your head around what I just said. Fortunately it's easy to explain it with an example.

Let's merge check_date and signup and focus on the first row having date '2016-04-17'.

When merging happens, it looks up all the dates in signup. Eventually, it finds '2016-03-31'. This is because '2016-03-31' is the last date in signup earlier (smaller) than '2016-04-17'.

Have a look at the mergin results.

In [5]:
pd.merge_asof(check_date, signup, left_on='check-date', right_on='signup-date')
Out[5]:
check-date signup-date
0 2016-04-17 2016-03-31
1 2016-05-15 2016-04-30
2 2016-06-10 2016-05-31

With this in mind, we only need one more column showing the number of customers after the customer in that row signed up.

In [6]:
signup.loc[:, 'count'] = list(range(1, len(signup)+1))
signup
Out[6]:
signup-date count
0 2016-01-31 1
1 2016-02-29 2
2 2016-03-31 3
3 2016-04-30 4
4 2016-05-31 5
5 2016-06-30 6
6 2016-07-31 7
7 2016-08-31 8
8 2016-09-30 9
9 2016-10-31 10
10 2016-11-30 11
11 2016-12-31 12

Then merge_asof will give us the count at the date requested.

In [7]:
pd.merge_asof(check_date, signup, left_on='check-date', right_on='signup-date')
Out[7]:
check-date signup-date count
0 2016-04-17 2016-03-31 3
1 2016-05-15 2016-04-30 4
2 2016-06-10 2016-05-31 5

There are still a lot more options with which you can tune the behaviour of merge_asof to fit your merging goal. For example, it is possible to join one row in the left dataframe to multiple rows in the right dataframe for time series where you can define a tolerant range, like "with 10 days".

In [8]:
pd.merge_asof(check_date, signup, left_on='check-date', right_on='signup-date', tolerance=pd.Timedelta('10days'))
Out[8]:
check-date signup-date count
0 2016-04-17 NaT NaN
1 2016-05-15 NaT NaN
2 2016-06-10 2016-05-31 5.0

Can you see that the first two rows couldn't find any within 10 days backwards?

Now it's your time to explore the usefulness of this tool!