18 June 2017

How to Find the First Day in a Streak with shift Method

Context

In certain types of businesses, customer visits can be fairly frequent, on daily or weekly basis. Keeping customers maintain their streak can generate constant revenue, which is obviously sweet to business. Marketing professinals would find out what drives customers start their streak. To this end, we may try to figure out what happened on their first day/week. This blog shows you how to locate the first day.

This notebook is available on GitHub.

Prepare Dummy Data

Here we create dummy data for demonstration. Assume a customer visited on the following dates saved in a Pandas series. The current streak started on Feb 1, 2000.

In [1]:
import pandas as pd
dates1 = pd.Series(pd.date_range('2000-01-20', '2000-01-30', freq='2D'))
dates2 = pd.Series(pd.date_range('2000-02-01', '2000-02-10', freq='D'))
dates = pd.concat([dates1, dates2]).reset_index(drop=True)
dates
Out[1]:
0    2000-01-20
1    2000-01-22
2    2000-01-24
3    2000-01-26
4    2000-01-28
5    2000-01-30
6    2000-02-01
7    2000-02-02
8    2000-02-03
9    2000-02-04
10   2000-02-05
11   2000-02-06
12   2000-02-07
13   2000-02-08
14   2000-02-09
15   2000-02-10
dtype: datetime64[ns]

Introduce the Tool

To get that date, we introduce the shift method. Have a read here if you are interested in more details. This method can move the data entity in named directions with named steps. For example, To move the series downwards a row:

In [2]:
dates.shift()
Out[2]:
0           NaT
1    2000-01-20
2    2000-01-22
3    2000-01-24
4    2000-01-26
5    2000-01-28
6    2000-01-30
7    2000-02-01
8    2000-02-02
9    2000-02-03
10   2000-02-04
11   2000-02-05
12   2000-02-06
13   2000-02-07
14   2000-02-08
15   2000-02-09
dtype: datetime64[ns]

Can you see that all the rows are shifted downwards a row? The new first row is filled with a missing value and the old last row disappeared. What if you want it move upwards? For that you need to give it a negative step:

In [3]:
dates.shift(-1)
Out[3]:
0    2000-01-22
1    2000-01-24
2    2000-01-26
3    2000-01-28
4    2000-01-30
5    2000-02-01
6    2000-02-02
7    2000-02-03
8    2000-02-04
9    2000-02-05
10   2000-02-06
11   2000-02-07
12   2000-02-08
13   2000-02-09
14   2000-02-10
15          NaT
dtype: datetime64[ns]

Operation

By now, you probably have already the idea in your mind of what I will do next. After shift, the dates in streak differ in one day, aka, time basis, in the same row. To make this clear, we put the serieses before and after the shifting in one single dataframe. Focus on Row 6.

In [4]:
dates_df = pd.DataFrame(
    {'before': dates, 'after_downwards': dates.shift(), 'after_upwards': dates.shift(-1)},
    columns=['before', 'after_downwards', 'after_upwards']
)
dates_df
Out[4]:
before after_downwards after_upwards
0 2000-01-20 NaT 2000-01-22
1 2000-01-22 2000-01-20 2000-01-24
2 2000-01-24 2000-01-22 2000-01-26
3 2000-01-26 2000-01-24 2000-01-28
4 2000-01-28 2000-01-26 2000-01-30
5 2000-01-30 2000-01-28 2000-02-01
6 2000-02-01 2000-01-30 2000-02-02
7 2000-02-02 2000-02-01 2000-02-03
8 2000-02-03 2000-02-02 2000-02-04
9 2000-02-04 2000-02-03 2000-02-05
10 2000-02-05 2000-02-04 2000-02-06
11 2000-02-06 2000-02-05 2000-02-07
12 2000-02-07 2000-02-06 2000-02-08
13 2000-02-08 2000-02-07 2000-02-09
14 2000-02-09 2000-02-08 2000-02-10
15 2000-02-10 2000-02-09 NaT

Therefore querying the day difference can return the streak area or the non streak area.

Streak area:

In [5]:
dates.loc[(dates.shift(-1) - dates).dt.days == 1]
Out[5]:
6    2000-02-01
7    2000-02-02
8    2000-02-03
9    2000-02-04
10   2000-02-05
11   2000-02-06
12   2000-02-07
13   2000-02-08
14   2000-02-09
dtype: datetime64[ns]

Non-streak area:

In [6]:
dates.loc[(dates - dates.shift()).dt.days != 1]
Out[6]:
0   2000-01-20
1   2000-01-22
2   2000-01-24
3   2000-01-26
4   2000-01-28
5   2000-01-30
6   2000-02-01
dtype: datetime64[ns]

The intersection is of course the first day!

In [7]:
dates.loc[((dates.shift(-1) - dates).dt.days == 1) & ((dates - dates.shift()).dt.days != 1)]
Out[7]:
6   2000-02-01
dtype: datetime64[ns]

Conclusion

This blog shows you how to query the first day of a treak, which is useful for marketing experts to further investigate the drivers for customer's consecutive visits and therefore be able to make the success story bigger.

Note this method can be extended to other time basis like week, month, etc. This method is fast as it uses Pandas idioms instead of for loops.

There is only one streak in the example shown above, however, this method is robust to examples having mutiple streaks.

Hope this trick helps your data mining!