This notebook is available on GitHub.
This will probably look like a trivial example but message I would like to deliver is merge_asof
is an awesome tool to merge time series datasets.
It's fairly new, recently added from probably 0.19.0.
First off, let's create some dummy data.
import pandas as pd
signup = pd.DataFrame(pd.date_range('2016-01-01', '2017-01-01', freq='1m'), columns=['signup-date'])
signup
Now your task is to find how many customers you had at the time points as the following.
check_date = pd.DataFrame(
[pd.datetime(2016, 4, 17), pd.datetime(2016, 5, 15), pd.datetime(2016, 6, 10)],
columns=['check-date']
)
check_date
With merge_asof
, the joining keys don't have to be equal. By definition, it's a left join. With the default set-up, each row joins to a row in the right dataframe that has the biggest value for the joining key but no greater than the value in the left dataframe.
Hope you can wrap your head around what I just said. Fortunately it's easy to explain it with an example.
Let's merge check_date
and signup
and focus on the first row having date '2016-04-17'.
When merging happens, it looks up all the dates in signup
. Eventually, it finds '2016-03-31'. This is because '2016-03-31' is the last date in signup
earlier (smaller) than '2016-04-17'.
Have a look at the mergin results.
pd.merge_asof(check_date, signup, left_on='check-date', right_on='signup-date')
With this in mind, we only need one more column showing the number of customers after the customer in that row signed up.
signup.loc[:, 'count'] = list(range(1, len(signup)+1))
signup
Then merge_asof
will give us the count at the date requested.
pd.merge_asof(check_date, signup, left_on='check-date', right_on='signup-date')
There are still a lot more options with which you can tune the behaviour of merge_asof
to fit your merging goal. For example, it is possible to join one row in the left dataframe to multiple rows in the right dataframe for time series where you can define a tolerant range, like "with 10 days".
pd.merge_asof(check_date, signup, left_on='check-date', right_on='signup-date', tolerance=pd.Timedelta('10days'))
Can you see that the first two rows couldn't find any within 10 days backwards?
Now it's your time to explore the usefulness of this tool!
No comments:
Post a Comment