17 February 2017

Use Pandas apply to be a scheduler

Context

At work, I helped a coworker with his task where he needed to build a touchpoint date scheduler for a client.

Suppose the client has two different regular service plans for its customers Service A and Service B. Service A is supposed to be operated every two months while Service B is to be done every three months. We are provided the last touchpoint date by the client. It looks something like the following.

In [1]:
import pandas as pd
In [2]:
touchpoints = pd.DataFrame(
    {
        'customerId': [1, 2, 3],
        'Service A': [pd.datetime(2017, 1, 20), pd.datetime(2017, 1, 28), pd.datetime(2017, 2, 1)],
        'Service B': [pd.datetime(2017, 1, 10), pd.datetime(2017, 1, 1), pd.datetime(2016, 12, 1)]
    },
    columns=['customerId', 'Service A', 'Service B']
)
In [3]:
touchpoints
Out[3]:
customerId Service A Service B
0 1 2017-01-20 2017-01-10
1 2 2017-01-28 2017-01-01
2 3 2017-02-01 2016-12-01

The task is to find the next two services coming up, which heavily depends on the last date when the two services were done. For customer 1, the next service will be Service A on 21.3.2017 followed by Service B on 11.4.2017. In contrast, for customer 3, the next sevice will be Service B on 2.3.2017 followed by Service A on 2.4.2017 one month later.

Pull the up-coming events together for the next half year since the last service the table above would look like the following.

In [4]:
touchpoints.loc[:, 'Service A in two months'] = (touchpoints['Service A'] + pd.Timedelta(2, unit='M')).dt.date
touchpoints.loc[:, 'Service A in four months'] = (touchpoints['Service A'] + pd.Timedelta(4, unit='M')).dt.date
touchpoints.loc[:, 'Service A in six months'] = (touchpoints['Service A'] + pd.Timedelta(6, unit='M')).dt.date
touchpoints.loc[:, 'Service B in three months'] = (touchpoints['Service B'] + pd.Timedelta(3, unit='M')).dt.date
touchpoints.loc[:, 'Service B in six months'] = (touchpoints['Service B'] + pd.Timedelta(6, unit='M')).dt.date
In [5]:
touchpoints
Out[5]:
customerId Service A Service B Service A in two months Service A in four months Service A in six months Service B in three months Service B in six months
0 1 2017-01-20 2017-01-10 2017-03-21 2017-05-21 2017-07-21 2017-04-11 2017-07-11
1 2 2017-01-28 2017-01-01 2017-03-29 2017-05-29 2017-07-29 2017-04-02 2017-07-02
2 3 2017-02-01 2016-12-01 2017-04-02 2017-06-02 2017-08-02 2017-03-02 2017-06-01

So we need a way to figure out the first two earliest dates from column Service A in two months to column Service B in six month and we also have to keep track of which date is for what service. Consequently, four columns are supposed to be added with these information: next service, next service date, second service, second service date.

Apparently, this is a fairly complicated operation. No expectation can be laid that Pandas dev team would have made a ready idiom for you and you have got to run a function that does that through apply method. Nevertheless, we can still use Pandas methods like sort_values. Note that sort_values is a recent term, which used to be called sort.

Core function

First we define the fucntion that will be passed on to apply.

In [6]:
def scheduler(row):
    # sort values
    row = row.sort_values()
    # find the first nevents dates
    nevents = 2
    dates = row.iloc[:nevents].tolist()
    # find the events. Note events are stored in the index of the series `row`.
    # and we only need to keep `Service X` and remove `in x months`.
    events = row.index[:nevents].str.extract(r'(Service \w+) .+', expand=False).tolist()
    # now make a series holding the resulting info
    index = ['next service', 'second service', 'next service date', 'second service date']
    return pd.Series(events+dates, index=index)

Here we go. Let's go ahead and find what we will get. The apply method should work on row by row by setting axis=1, which is along the COLUMN axis. I know it's a bit brain-twisting. You will get used to it then it will actually make sense to you when dealing with axis in other methods like concat below.

In [7]:
next_services = touchpoints.loc[:, 'Service A in two months':'Service B in six months'].apply(scheduler, axis=1)
next_services
Out[7]:
next service second service next service date second service date
0 Service A Service B 2017-03-21 2017-04-11
1 Service A Service B 2017-03-29 2017-04-02
2 Service B Service A 2017-03-02 2017-04-02

Looks like what we want to have. You might tweak the order of the columns with method reindex if you like next service date immediately following next service. Leave it to you or I'll make another blog on how to do this.

Push back

Now we append it to the original dataframe. See how axis=1 makes a lot more sense here as you glue the two dataframe together along y-axis, for which axis=1.

In [8]:
touchpoints = pd.concat([touchpoints, next_services], axis=1)
In [9]:
touchpoints
Out[9]:
customerId Service A Service B Service A in two months Service A in four months Service A in six months Service B in three months Service B in six months next service second service next service date second service date
0 1 2017-01-20 2017-01-10 2017-03-21 2017-05-21 2017-07-21 2017-04-11 2017-07-11 Service A Service B 2017-03-21 2017-04-11
1 2 2017-01-28 2017-01-01 2017-03-29 2017-05-29 2017-07-29 2017-04-02 2017-07-02 Service A Service B 2017-03-29 2017-04-02
2 3 2017-02-01 2016-12-01 2017-04-02 2017-06-02 2017-08-02 2017-03-02 2017-06-01 Service B Service A 2017-03-02 2017-04-02

Let me know anything still unclear to you in the comments. I'll be more than happy to answer it.

You can also ask me other Pandas questions. Tell me what you want to do with Pandas and I probably will make a blog special for you.

This Jupyter Notebook is available on GitHub

No comments:

Post a Comment