26 March 2017

Select Time Series Data within a Window without Date

This notebook is available on GitHub.

Quite often we analyse time series data. Pandas ships with a plethora of out-of-the-box tools for that purpose, some of which are also easily customisable. Useful tutorials can be found in its documentation.

Context

In this blog, I will show you how to select data within a time window regardless of the date, for instance the data points between midnight and 1 am every day. To start off, we create some dummy data.

In [1]:
from datetime import datetime
import numpy as np
import pandas as pd
In [2]:
times = pd.date_range('2010-01-01', periods=48, freq='H')
dat = pd.Series(np.random.random(len(times)), index=times)
dat
Out[2]:
2010-01-01 00:00:00    0.352284
2010-01-01 01:00:00    0.241143
2010-01-01 02:00:00    0.146389
2010-01-01 03:00:00    0.545087
2010-01-01 04:00:00    0.037369
2010-01-01 05:00:00    0.961454
2010-01-01 06:00:00    0.797817
2010-01-01 07:00:00    0.250311
2010-01-01 08:00:00    0.782042
2010-01-01 09:00:00    0.063493
2010-01-01 10:00:00    0.620593
2010-01-01 11:00:00    0.411620
2010-01-01 12:00:00    0.166519
2010-01-01 13:00:00    0.990196
2010-01-01 14:00:00    0.272079
2010-01-01 15:00:00    0.600731
2010-01-01 16:00:00    0.314604
2010-01-01 17:00:00    0.317686
2010-01-01 18:00:00    0.622527
2010-01-01 19:00:00    0.466031
2010-01-01 20:00:00    0.631393
2010-01-01 21:00:00    0.987018
2010-01-01 22:00:00    0.748490
2010-01-01 23:00:00    0.285035
2010-01-02 00:00:00    0.901956
2010-01-02 01:00:00    0.502751
2010-01-02 02:00:00    0.246344
2010-01-02 03:00:00    0.932478
2010-01-02 04:00:00    0.726944
2010-01-02 05:00:00    0.171858
2010-01-02 06:00:00    0.289558
2010-01-02 07:00:00    0.938964
2010-01-02 08:00:00    0.115186
2010-01-02 09:00:00    0.260931
2010-01-02 10:00:00    0.843929
2010-01-02 11:00:00    0.938050
2010-01-02 12:00:00    0.671973
2010-01-02 13:00:00    0.898492
2010-01-02 14:00:00    0.315328
2010-01-02 15:00:00    0.339190
2010-01-02 16:00:00    0.257052
2010-01-02 17:00:00    0.916770
2010-01-02 18:00:00    0.706080
2010-01-02 19:00:00    0.014060
2010-01-02 20:00:00    0.633365
2010-01-02 21:00:00    0.010800
2010-01-02 22:00:00    0.383637
2010-01-02 23:00:00    0.850857
Freq: H, dtype: float64

Data within any particular time window is easy to select. Say I want the data between 1 am and 2 am on 1 January 2010. Here is how to do it.

In [3]:
dat.loc['2010-01-01 01:00:00':'2010-01-01 01:59:59']
Out[3]:
2010-01-01 01:00:00    0.241143
Freq: H, dtype: float64
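
As an aside, the end points do not need to be spelled out to the second. If I remember pandas' partial string indexing rules correctly, a slice made of partial date strings is expanded to cover the whole hour; a quick sketch, not run in this notebook:

dat.loc['2010-01-01 01':'2010-01-01 01']  # both bounds expand to cover 01:00-01:59 on that date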

Attempts

But the problem is that there isn't a way to select the data between 1 am and 2 am every day as straightforwardly as the above. You can't do something like this:

In [4]:
dat.loc['01:00:00':'01:59:59']
---------------------------------------------------------------------------
OutOfBoundsDatetime                       Traceback (most recent call last)
<ipython-input-4-b3315877ca89> in <module>()
----> 1 dat.loc['01:00:00':'01:59:59']

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1310             return self._getitem_tuple(key)
   1311         else:
-> 1312             return self._getitem_axis(key, axis=0)
   1313 
   1314     def _getitem_axis(self, key, axis=0):

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1451         if isinstance(key, slice):
   1452             self._has_valid_type(key, axis)
-> 1453             return self._get_slice_axis(key, axis=axis)
   1454         elif is_bool_indexer(key):
   1455             return self._getbool_axis(key, axis=axis)

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/core/indexing.py in _get_slice_axis(self, slice_obj, axis)
   1332         labels = obj._get_axis(axis)
   1333         indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,
-> 1334                                        slice_obj.step, kind=self.name)
   1335 
   1336         if isinstance(indexer, slice):

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/tseries/index.py in slice_indexer(self, start, end, step, kind)
   1496 
   1497         try:
-> 1498             return Index.slice_indexer(self, start, end, step, kind=kind)
   1499         except KeyError:
   1500             # For historical reasons DatetimeIndex by default supports

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/indexes/base.py in slice_indexer(self, start, end, step, kind)
   2995         """
   2996         start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 2997                                                  kind=kind)
   2998 
   2999         # return a slice

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/indexes/base.py in slice_locs(self, start, end, step, kind)
   3174         start_slice = None
   3175         if start is not None:
-> 3176             start_slice = self.get_slice_bound(start, 'left', kind)
   3177         if start_slice is None:
   3178             start_slice = 0

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/indexes/base.py in get_slice_bound(self, label, side, kind)
   3113         # For datetime indices label may be a string that has to be converted
   3114         # to datetime boundary according to its resolution.
-> 3115         label = self._maybe_cast_slice_bound(label, side, kind)
   3116 
   3117         # we need to look up the label

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/tseries/index.py in _maybe_cast_slice_bound(self, label, side, kind)
   1450                            getattr(self, 'inferred_freq', None))
   1451             _, parsed, reso = parse_time_string(label, freq)
-> 1452             lower, upper = self._parsed_string_to_bounds(reso, parsed)
   1453             # lower, upper form the half-open interval:
   1454             #   [parsed, parsed + 1 freq)

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/tseries/index.py in _parsed_string_to_bounds(self, reso, parsed)
   1281                           hour=parsed.hour, minute=parsed.minute,
   1282                           second=parsed.second)
-> 1283             return (Timestamp(st, tz=self.tz),
   1284                     Timestamp(Timestamp(st + offsets.Second(),
   1285                                         tz=self.tz).value - 1))

pandas/tslib.pyx in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:10447)()

pandas/tslib.pyx in pandas.tslib.convert_to_tsobject (pandas/tslib.c:28700)()

pandas/tslib.pyx in pandas.tslib._check_dts_bounds (pandas/tslib.c:32835)()

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 01:00:00

Nor can you do something like this:

In [5]:
dat.loc[(dat.index.time > '01:00:00') & (dat.index.time < '02:00:00')]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-656590af5d6e> in <module>()
----> 1 dat.loc[(dat.index.time > '01:00:00') & (dat.index.time < '02:00:00')]

TypeError: unorderable types: datetime.time() > str()

New Light

It is not all bad though: the last example shows us a possibility. The time attribute of the index gives us datetime.time objects.

In [6]:
type(dat.index.time[0])
Out[6]:
datetime.time

Therefore the remaining question is how to create a datetime.time object, so that both sides of the comparison operator are of the same type.

The help info isn't that useful.

In [7]:
help(datetime.time)
Help on method_descriptor:

time(...)
    Return time object with same time but with tzinfo=None.

I don't know what its arguments can be, so I just try something reasonable.

In [8]:
datetime.time('01:00:00')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-a8d40eab456a> in <module>()
----> 1 datetime.time('01:00:00')

TypeError: descriptor 'time' requires a 'datetime.datetime' object but received a 'str'

Nope, it doesn't work, but it provides a crucial hint. We know time is a method of the datetime.datetime class, and the error message reveals that it wants a datetime.datetime instance, not a string. Knowing this, we can hack it: to get a time, e.g. 1 am, we create a datetime object without caring about the date and then call its time method!

In [9]:
one_am = datetime(2000, 1, 1, 1).time()
one_am
Out[9]:
datetime.time(1, 0)
In [10]:
two_am = datetime(2000, 1, 1, 2).time()
two_am
Out[10]:
datetime.time(2, 0)

So after all this struggling, we achieve our goal.

In [11]:
dat.loc[(dat.index.time >= one_am) & (dat.index.time < two_am)]
Out[11]:
2010-01-01 01:00:00    0.241143
2010-01-02 01:00:00    0.502751
Freq: 24H, dtype: float64

Conclusion

You can see we get the data in that time window on both available dates! But I don't like this kind of hack, which is unreliable in the long term. I will update this blog once I find a better way, or you can leave a comment if you know how to do it.
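
For what it's worth, here is a sketch of two less hacky routes that I believe work, though neither is verified in this notebook: building the boundaries directly from the time class in the datetime module, and, if your pandas version provides it, letting Series.between_time do the wall-clock filtering.

from datetime import time  # the time class, not the datetime.datetime.time method used above

# construct the boundaries without a dummy date
one_am, two_am = time(1, 0), time(2, 0)
dat.loc[(dat.index.time >= one_am) & (dat.index.time < two_am)]

# or filter by wall-clock time directly, if between_time is available
dat.between_time('01:00', '01:59')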

21 March 2017

R: Check Newline Symbols When Having Weird Issues

This notebook is available on GitHub.

Context

In the last blog I demonstrated, using Pandas, that stray newline symbols can cause issues that don't crash the reading process but result in frustration at a later stage. In this blog, I show that the same issue arises in R as well. So always be aware of this pitfall, no matter which side you are on in the holy war between Python and R.

First off, this is the correct data set.

In [1]:
correct_data_string <- "
a,b,c
1,2,3
4,5,6
"
cat(correct_data_string)
a,b,c
1,2,3
4,5,6

Suppose we would like to have the following data in a dataframe.

In [2]:
correct_data <- read.csv(textConnection(correct_data_string))
correct_data
a b c
1 2 3
4 5 6

However, it is not unusual to end up with a data file like the following, where the cell containing 2 in column "b" unexpectedly has a newline symbol.

In [3]:
wrong_data_string <- "
a,b,c
1,2\n,3
4,5,6
"
cat(wrong_data_string)
a,b,c
1,2
,3
4,5,6

The reading utility isn't smart enough to check the number of elements in each row, so we end up with three rows.

In [4]:
wrong_data <- read.csv(textConnection(wrong_data_string))
wrong_data
a  b  c
1  2  NA
NA 3  NA
4  5  6

This data quality issue doesn't cause a problem at this stage, but it will later. So when you run into something weird, it is a good idea to open up the data file and check whether any newline symbols show up somewhere other than at the end of a line. More subtle is the case of data files created by Microsoft software, which use a different newline symbol that appears as ^M in vim. Check for that symbol as well.

12 March 2017

Check Newline Symbols When Having Weird Issues

This notebook is available on GitHub.

Context

Sometimes there is a frustrating moment when Pandas reports errors and you have no idea what caused them. Here is one I ran into at work. This situation arises when data files have fields that were manually input, e.g. customer residential addresses. The problem is that Pandas starts a new record whenever it encounters a newline symbol.

In [1]:
import pandas as pd
from io import StringIO

Suppose we would like to have the following data in a dataframe.

In [2]:
correct_data_string = StringIO(
    '''
    a,b,c
    1,2,3
    4,5,6
    '''
)
In [3]:
correct_data = pd.read_csv(correct_data_string)
correct_data
Out[3]:
a b c
0 1 2 3
1 4 5 6

However, it is not unusual to end up with a data file like the following, where the cell containing 2 in column "b" unexpectedly has a newline symbol.

In [4]:
wrong_data_string = StringIO(
    '''
    a,b,c
    1,2\n,3
    4,5,6
    '''
)
In [5]:
wrong_data = pd.read_csv(wrong_data_string)
wrong_data
Out[5]:
a b c
0 1.0 2 NaN
1 NaN 3 NaN
2 4.0 5 6.0

This data quality issue doesn't cause a problem at this stage, but it will later. So when you run into something weird, it is a good idea to open up the data file and check whether any newline symbols show up somewhere other than at the end of a line. More subtle is the case of data files created by Microsoft software, which use a different newline symbol that appears as ^M in vim. Check for that symbol as well.
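
As a rough sketch of that check (the file name and the expected column count below are made up for illustration), the raw text can be inspected before handing it to read_csv:

import pandas as pd
from io import StringIO

# 'customers.csv' is a hypothetical file; newline='' keeps carriage returns visible
with open('customers.csv', newline='') as fh:
    raw = fh.read()

print(raw.count('\r'))  # carriage returns are what vim displays as ^M

# flag records that do not have the expected number of separators (here, three columns)
for lineno, line in enumerate(raw.splitlines(), start=1):
    if line and line.count(',') != 2:
        print(lineno, repr(line))

# once the offending lines are understood, strip the carriage returns and re-read
fixed = pd.read_csv(StringIO(raw.replace('\r', '')))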

05 March 2017

How I Get Cross-sell

This Jupyter Notebook is available on GitHub.

Context

Today I show how to generate a report on cross-sell performance. The question is how many customers bought both product a and product b, which is often presented at board meetings to summarise past retail performance and to inform decisions about product bundles.

To demonstrate the task, I make a random transaction dataset.

In [1]:
import numpy as np
import pandas as pd
from itertools import combinations
In [2]:
dat = pd.DataFrame(
    {
        'customer_id': np.random.randint(0, 3, 30),
        'product_id': np.random.choice(['a', 'b', 'c'], 30)    
    }
)
dat.head()
Out[2]:
customer_id product_id
0 0 c
1 2 b
2 2 a
3 2 c
4 0 c

Target

The result table will look something like this.

In [3]:
fake_array = np.reshape(np.random.randint(1, 5, 9), (3, 3))
# to make it symmetric
sym_data = fake_array + fake_array.T
# zero out the elements on the diagonal
sym_data[list(range(3)), list(range(3))] = 0
fake_res = pd.DataFrame(sym_data, columns=['a', 'b', 'c'], index=['a', 'b', 'c'])
fake_res
Out[3]:
a b c
a 0 4 5
b 4 0 3
c 5 3 0

Note: the result has to be symmetric, since the number of customers who bought a and b is of course equal to the number of customers who bought b and a.

Unique Products Bought

This looks like a simple pivot table. Not so fast! It is actually more complicated than that. For each customer_id, all unique product pairs have to be created separately.

First we need to know which unique products each customer has bought.

In [4]:
unique_prods = dat.groupby('customer_id')['product_id'].unique()
unique_prods
Out[4]:
customer_id
0    [c, b, a]
1       [a, b]
2    [b, a, c]
Name: product_id, dtype: object

Product Pairs

In this Series, the index comprises the customer IDs from the raw dataset. Next we need to create the product pairs for each customer. To do this, we apply a lambda function to each element of the Series.

In [5]:
unique_prod_pairs = unique_prods.apply(lambda prods: list(combinations(prods, 2)))
unique_prod_pairs
Out[5]:
customer_id
0    [(c, b), (c, a), (b, a)]
1                    [(a, b)]
2    [(b, a), (b, c), (a, c)]
Name: product_id, dtype: object

The final result is just around the corner. We next count the occurrences of each pair. To facilitate that, we make one big list that contains all occurrences. Luckily, summing Python lists concatenates them, which does exactly that.

For example,

In [6]:
[1, 2] + [3, 4]
Out[6]:
[1, 2, 3, 4]

So we call the sum method.

In [7]:
all_pairs = unique_prod_pairs.sum()
all_pairs
Out[7]:
[('c', 'b'),
 ('c', 'a'),
 ('b', 'a'),
 ('a', 'b'),
 ('b', 'a'),
 ('b', 'c'),
 ('a', 'c')]

To take advantage of the pandas pivot_table function, we make a dataframe in which the combination of the two columns gives the product pairs.

In [8]:
unique_prod_pairs_split = pd.DataFrame(all_pairs)
unique_prod_pairs_split
Out[8]:
0 1
0 c b
1 c a
2 b a
3 a b
4 b a
5 b c
6 a c

Make Pivot Table

Finally, the problem has been converted into a pivoting task.

In [9]:
raw_pivot_table = pd.pivot_table(unique_prod_pairs_split, index=0, columns=1, aggfunc=len)
raw_pivot_table
Out[9]:
1 a b c
0
a NaN 1.0 1.0
b 2.0 NaN 1.0
c 1.0 1.0 NaN

Hmm, that doesn't really look like what we want.

First, it isn't symmetric.

Second, there is actually no guarantee that, for example, "a" appears in both the index and the columns.

Still, we are nearly at the final result; it just needs a little tweak.

We can impose that the index and the columns contain all the products. First, let's get all the unique products.

In [10]:
unique_prods_list = sorted(dat.product_id.unique().tolist())
unique_prods_list
Out[10]:
['a', 'b', 'c']

This imposition is realised by the reindex method.

In [11]:
pivot_table = raw_pivot_table.reindex(index=unique_prods_list, columns=unique_prods_list)
pivot_table
Out[11]:
1 a b c
0
a NaN 1.0 1.0
b 2.0 NaN 1.0
c 1.0 1.0 NaN

Fill the NA values with zeros.

In [12]:
pivot_table = pivot_table.fillna(0)
pivot_table
Out[12]:
1 a b c
0
a 0.0 1.0 1.0
b 2.0 0.0 1.0
c 1.0 1.0 0.0

Done

The only thing left is to make it symmetric.

In [13]:
pivot_table = pivot_table + pivot_table.T
pivot_table
Out[13]:
1 a b c
0
a 0.0 3.0 2.0
b 3.0 0.0 2.0
c 2.0 2.0 0.0

Awesome! Go ahead and check which two products would be best bundled together!
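
As a small follow-up sketch (assuming the symmetric table above), the strongest pair can be read off programmatically; since the diagonal is zero, a genuine pair always wins:

# stack the matrix into a Series keyed by (product, product) pairs and pick the largest entry
best_pair = pivot_table.stack().idxmax()
best_count = pivot_table.stack().max()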