26 March 2017

Select Time Series Data within a Window without Date

This notebook is available on GitHub.

Quite often we analyse time series data. Pandas ships with a plethra of out-of-box tools for that purpose, some of which is also easily customisable. Useful tutorials can be found in its documentation.

Context

In this blog, I will show you how to how to select data in a time window regardless the date. For instance, the data points between 0 am and 1 am everyday. Start off, we create some dummy data.

In [1]:
from datetime import datetime
import numpy as np
import pandas as pd
In [2]:
times = pd.date_range('2010-01-01', periods=48, freq='H')
dat = pd.Series(np.random.random(len(times)), index=times)
dat
Out[2]:
2010-01-01 00:00:00    0.352284
2010-01-01 01:00:00    0.241143
2010-01-01 02:00:00    0.146389
2010-01-01 03:00:00    0.545087
2010-01-01 04:00:00    0.037369
2010-01-01 05:00:00    0.961454
2010-01-01 06:00:00    0.797817
2010-01-01 07:00:00    0.250311
2010-01-01 08:00:00    0.782042
2010-01-01 09:00:00    0.063493
2010-01-01 10:00:00    0.620593
2010-01-01 11:00:00    0.411620
2010-01-01 12:00:00    0.166519
2010-01-01 13:00:00    0.990196
2010-01-01 14:00:00    0.272079
2010-01-01 15:00:00    0.600731
2010-01-01 16:00:00    0.314604
2010-01-01 17:00:00    0.317686
2010-01-01 18:00:00    0.622527
2010-01-01 19:00:00    0.466031
2010-01-01 20:00:00    0.631393
2010-01-01 21:00:00    0.987018
2010-01-01 22:00:00    0.748490
2010-01-01 23:00:00    0.285035
2010-01-02 00:00:00    0.901956
2010-01-02 01:00:00    0.502751
2010-01-02 02:00:00    0.246344
2010-01-02 03:00:00    0.932478
2010-01-02 04:00:00    0.726944
2010-01-02 05:00:00    0.171858
2010-01-02 06:00:00    0.289558
2010-01-02 07:00:00    0.938964
2010-01-02 08:00:00    0.115186
2010-01-02 09:00:00    0.260931
2010-01-02 10:00:00    0.843929
2010-01-02 11:00:00    0.938050
2010-01-02 12:00:00    0.671973
2010-01-02 13:00:00    0.898492
2010-01-02 14:00:00    0.315328
2010-01-02 15:00:00    0.339190
2010-01-02 16:00:00    0.257052
2010-01-02 17:00:00    0.916770
2010-01-02 18:00:00    0.706080
2010-01-02 19:00:00    0.014060
2010-01-02 20:00:00    0.633365
2010-01-02 21:00:00    0.010800
2010-01-02 22:00:00    0.383637
2010-01-02 23:00:00    0.850857
Freq: H, dtype: float64

Data within any time window is easy to select. Let's see I want data between 1 am and 2 am on Jan 1, 2010. Here is how you do it.

In [3]:
dat.loc['2010-01-01 01:00:00':'2010-01-01 01:59:59']
Out[3]:
2010-01-01 01:00:00    0.241143
Freq: H, dtype: float64

Attempts

But the problem is there isn't way to call out data between 1am and 2am everyday as straightward as the above. You can't do something like this.

In [4]:
dat.loc['01:00:00':'01:59:59']
---------------------------------------------------------------------------
OutOfBoundsDatetime                       Traceback (most recent call last)
<ipython-input-4-b3315877ca89> in <module>()
----> 1 dat.loc['01:00:00':'01:59:59']

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1310             return self._getitem_tuple(key)
   1311         else:
-> 1312             return self._getitem_axis(key, axis=0)
   1313 
   1314     def _getitem_axis(self, key, axis=0):

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1451         if isinstance(key, slice):
   1452             self._has_valid_type(key, axis)
-> 1453             return self._get_slice_axis(key, axis=axis)
   1454         elif is_bool_indexer(key):
   1455             return self._getbool_axis(key, axis=axis)

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/core/indexing.py in _get_slice_axis(self, slice_obj, axis)
   1332         labels = obj._get_axis(axis)
   1333         indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,
-> 1334                                        slice_obj.step, kind=self.name)
   1335 
   1336         if isinstance(indexer, slice):

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/tseries/index.py in slice_indexer(self, start, end, step, kind)
   1496 
   1497         try:
-> 1498             return Index.slice_indexer(self, start, end, step, kind=kind)
   1499         except KeyError:
   1500             # For historical reasons DatetimeIndex by default supports

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/indexes/base.py in slice_indexer(self, start, end, step, kind)
   2995         """
   2996         start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 2997                                                  kind=kind)
   2998 
   2999         # return a slice

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/indexes/base.py in slice_locs(self, start, end, step, kind)
   3174         start_slice = None
   3175         if start is not None:
-> 3176             start_slice = self.get_slice_bound(start, 'left', kind)
   3177         if start_slice is None:
   3178             start_slice = 0

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/indexes/base.py in get_slice_bound(self, label, side, kind)
   3113         # For datetime indices label may be a string that has to be converted
   3114         # to datetime boundary according to its resolution.
-> 3115         label = self._maybe_cast_slice_bound(label, side, kind)
   3116 
   3117         # we need to look up the label

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/tseries/index.py in _maybe_cast_slice_bound(self, label, side, kind)
   1450                            getattr(self, 'inferred_freq', None))
   1451             _, parsed, reso = parse_time_string(label, freq)
-> 1452             lower, upper = self._parsed_string_to_bounds(reso, parsed)
   1453             # lower, upper form the half-open interval:
   1454             #   [parsed, parsed + 1 freq)

/home/fei/miniconda3/envs/python3/lib/python3.5/site-packages/pandas/tseries/index.py in _parsed_string_to_bounds(self, reso, parsed)
   1281                           hour=parsed.hour, minute=parsed.minute,
   1282                           second=parsed.second)
-> 1283             return (Timestamp(st, tz=self.tz),
   1284                     Timestamp(Timestamp(st + offsets.Second(),
   1285                                         tz=self.tz).value - 1))

pandas/tslib.pyx in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:10447)()

pandas/tslib.pyx in pandas.tslib.convert_to_tsobject (pandas/tslib.c:28700)()

pandas/tslib.pyx in pandas.tslib._check_dts_bounds (pandas/tslib.c:32835)()

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 01:00:00

Or you can't do something like this:

In [5]:
dat.loc[(dat.index.time > '01:00:00') & (dat.index.time < '02:00:00')]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-656590af5d6e> in <module>()
----> 1 dat.loc[(dat.index.time > '01:00:00') & (dat.index.time < '02:00:00')]

TypeError: unorderable types: datetime.time() > str()

New Light

It is good though, the last example shows us the possibility. The method time returns a datetime.time object.

In [6]:
type(dat.index.time[0])
Out[6]:
datetime.time

Therefore the left question is how to create a datetime.time so on two sides of the binary operator there will be objects of the same type.

The help info isn't that useful.

In [7]:
help(datetime.time)
Help on method_descriptor:

time(...)
    Return time object with same time but with tzinfo=None.

I don't know what can be its arguements. So I just try anything reasonable.

In [8]:
datetime.time('01:00:00')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-a8d40eab456a> in <module>()
----> 1 datetime.time('01:00:00')

TypeError: descriptor 'time' requires a 'datetime.datetime' object but received a 'str'

Nope, it doesn't work but it provides crucial hints. We know time is a method function of datetime.datetime object. The error message reveals that time wants an same object, which is just itself. Knowing this, we can hack it. To get a time, e.g., 1 am, we can create a datetime object, without caring on which date and then we call time function!

In [9]:
one_am = datetime(2000, 1, 1, 1).time()
one_am
Out[9]:
datetime.time(1, 0)
In [10]:
two_am = datetime(2000, 1, 1, 2).time()
two_am
Out[10]:
datetime.time(2, 0)

So after all these struggling, we achieve our goal.

In [11]:
dat.loc[(dat.index.time >= one_am) & (dat.index.time < two_am)]
Out[11]:
2010-01-01 01:00:00    0.241143
2010-01-02 01:00:00    0.502751
Freq: 24H, dtype: float64

Conclusion

You can see we get the data in the time window on both dates available! But I don't like hacking, which is unreliable in long term. I will update this blog once I find a better way. Or you can leave a comment if you know how to do it.

No comments:

Post a Comment