28 February 2017

R: Find Outliers with IQR in Dataframe

Foreword

So far I've only blogged about data analysis done with Pandas in Python. Today's tool is another widely accepted one in the data science landscape: R.

I actually started with R before Pandas. Later, for work, I quickly picked up Pandas and found it awesome, though the tools in R are equally awesome. So in this post I'll redo in R the topic we previously worked on in Python; I think it's interesting to see the differences. A quick observation: R tends to use standalone functions, while Pandas attaches more methods to its data types.

Context

One way to find outliers is to make a boxplot: points are drawn as individual dots when they lie more than 1.5 IQR above the third quartile or more than 1.5 IQR below the first quartile. As in Pandas, R gives us functions to calculate quartiles, computing the IQR from them is a no-brainer, and so finding outliers only takes a few steps.
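
In other words, with IQR = Q3 − Q1, a value x is flagged as an outlier when x < Q1 − 1.5 × IQR or x > Q3 + 1.5 × IQR; these two bounds are often called the lower and upper fences.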

To start with, we make some dummy data.

In [1]:
library(dplyr)
library(matrixStats)
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

matrixStats v0.51.0 (2016-10-08) successfully loaded. See ?matrixStats for help.

Attaching package: ‘matrixStats’

The following object is masked from ‘package:dplyr’:

    count

In [2]:
dat <- data.frame(a=runif(10), b=runif(10), row.names=1:10)
dat
   a          b
1  0.74122347 0.96371549
2  0.41808421 0.67254335
3  0.38853894 0.70647360
4  0.14292460 0.49835354
5  0.74627926 0.16966896
6  0.38096753 0.31313015
7  0.45550653 0.11488518
8  0.89302269 0.06733067
9  0.93957890 0.60182827
10 0.02948339 0.46074236

To make sure there are outliers to find, I manually manipulate some values.

In [3]:
dat[1, 'a'] <- -1.5
dat[1, 'b'] <- 5.2
dat[2, 'a'] <- -3.3
dat
   a           b
1  -1.50000000 5.20000000
2  -3.30000000 0.67254335
3   0.38853894 0.70647360
4   0.14292460 0.49835354
5   0.74627926 0.16966896
6   0.38096753 0.31313015
7   0.45550653 0.11488518
8   0.89302269 0.06733067
9   0.93957890 0.60182827
10  0.02948339 0.46074236
In [4]:
boxplot(dat)

Quartiles

In [5]:
first.quartiles <- colQuantiles(dat, prob=0.25)
first.quartiles
  1. 0.0578436961513944
  2. 0.205534257518593
In [6]:
third.quartiles <- colQuantiles(dat, prob=0.75)
third.quartiles
  1. 0.673586076707579
  2. 0.654864581360016

IQR

In [7]:
iqr <- third.quartiles - first.quartiles
iqr
  1. 0.615742380556185
  2. 0.449330323841423
In [8]:
is.outlier <- (dat < first.quartiles - 1.5 * iqr) | (dat > third.quartiles + 1.5 * iqr)
is.outlier
   a     b
1  TRUE  TRUE
2  TRUE  FALSE
3  FALSE FALSE
4  FALSE FALSE
5  FALSE FALSE
6  FALSE FALSE
7  FALSE FALSE
8  FALSE FALSE
9  FALSE FALSE
10 FALSE FALSE

Here we go. One more step is to decide when to treat a whole row as an outlier relative to the other rows. Two simple criteria:

  1. If any of the elements is an outlier.
  2. If all of the elements are outliers.

For the first case:

In [9]:
apply(is.outlier, 1, any)
    1     2     3     4     5     6     7     8     9    10
 TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Note that the usage pattern of apply is nearly the same as in Pandas, except that in R apply is a standalone function rather than a method. R's apply takes a MARGIN argument (1 for rows, 2 for columns), while Pandas methods take axis (0 for down the rows, 1 for across the columns); here MARGIN=1 and axis=1 both mean one result per row.

For the second case:

In [10]:
apply(is.outlier, 1, all)
    1     2     3     4     5     6     7     8     9    10
 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Can you see the difference?

To pull the outlier rows from the dataframe, we select with logical vectors. For example, here are the outliers for the first case:

In [11]:
dat[apply(is.outlier, 1, any), ]
  a    b
1 -1.5 5.2000000
2 -3.3 0.6725434

Selecting this way is generic across analysis tools. Just as some syntax is "Pythonic", we can do it in an "R-ic" way with filter from the dplyr package:

In [12]:
dat %>% filter(apply(is.outlier, 1, any))
  a    b
1 -1.5 5.2000000
2 -3.3 0.6725434

This notebook is available on GitHub.

24 February 2017

Find Outliers with IQR in Dataframe

Context

One way to find outliers is to make a boxplot: points are drawn as individual dots when they lie more than 1.5 IQR above the third quartile or more than 1.5 IQR below the first quartile. In Pandas, we have methods to calculate quartiles, computing the IQR from them is a no-brainer, and so finding outliers only takes a few steps.

To start with, we make some dummy data.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
In [2]:
dat = pd.DataFrame({'a': np.random.random(10), 'b': np.random.random(10)})
dat
Out[2]:
a b
0 0.226132 0.485436
1 0.780740 0.921301
2 0.145161 0.012100
3 0.714989 0.893943
4 0.785227 0.655811
5 0.999116 0.163410
6 0.979852 0.409046
7 0.137181 0.685542
8 0.571706 0.455979
9 0.643498 0.024336

To make sure there are outliers to find, I manually manipulate some values.

In [3]:
dat.at[0, 'a'] = -1.5
dat.at[0, 'b'] = 5.2
dat.at[1, 'a'] = -3.3
In [4]:
dat.boxplot(grid=False)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f468d556e10>

Quartiles

In [5]:
first_quartiles = dat.quantile(0.25)
first_quartiles
Out[5]:
a    0.139176
b    0.224819
Name: 0.25, dtype: float64
In [6]:
third_quartiles = dat.quantile(0.75)
third_quartiles
Out[6]:
a    0.767668
b    0.841842
Name: 0.75, dtype: float64

IQR

In [7]:
iqr = third_quartiles - first_quartiles
iqr
Out[7]:
a    0.628491
b    0.617023
dtype: float64

Outliers

Now we can compare the values in the dataframe to the quartiles and the IQR to find outliers.

In [8]:
is_outlier = (dat < first_quartiles - 1.5 * iqr) | (dat > third_quartiles + 1.5 * iqr)
is_outlier
Out[8]:
a b
0 True True
1 True False
2 False False
3 False False
4 False False
5 False False
6 False False
7 False False
8 False False
9 False False

Here we go. One more step is to decide when to treat a whole row as an outlier relative to the other rows. Two simple criteria:

  1. If any of the elements is an outlier.
  2. If all of the elements are outliers.

For the first case:

In [9]:
is_outlier.any(axis=1)
Out[9]:
0     True
1     True
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

For the second case:

In [10]:
is_outlier.all(axis=1)
Out[10]:
0     True
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

Can you see the difference?

To pull the outlier rows from the dataframe, we select with boolean arrays. For example, here are the outliers for the first case:

In [11]:
dat.loc[is_outlier.any(axis=1)]
Out[11]:
a b
0 -1.5 5.200000
1 -3.3 0.921301
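
To wrap up, the whole procedure fits into a small helper. This is only a sketch of the steps above, and the name find_outlier_rows is mine, not something from Pandas:

import pandas as pd

def find_outlier_rows(df, how='any', k=1.5):
    """Return the rows of a numerical dataframe flagged as outliers by the IQR rule.

    how='any' flags a row if any element is an outlier;
    how='all' flags it only if every element is an outlier.
    """
    first_quartiles = df.quantile(0.25)
    third_quartiles = df.quantile(0.75)
    iqr = third_quartiles - first_quartiles
    # element-wise comparison; the quartile Series align with the columns
    is_outlier = (df < first_quartiles - k * iqr) | (df > third_quartiles + k * iqr)
    mask = is_outlier.any(axis=1) if how == 'any' else is_outlier.all(axis=1)
    return df.loc[mask]

For the dat above, find_outlier_rows(dat) should return the two manipulated rows.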

This Jupyter Notebook is available on GitHub

22 February 2017

How to Get the Most out of the "describe" Method?

Context

Whenever you get a new dataset, one of the first things you want to do is get an overview. What are the values in each column? How are they distributed? It turns out a Pandas dataframe method called describe can provide this information. It's actually pretty simple to use, yet comprehensive and powerful.

Without further ado, let's get started with creating some dummy data to play with.

In [1]:
import numpy as np
import pandas as pd
In [2]:
dat = pd.DataFrame(
   {
       'num1': np.random.random(10),
       'num2': np.random.random(10),
       'cat1': np.random.choice(['cat', 'dog'], 10),
       'cat2': np.random.choice(['a', 'b', 'c'], 10)
   }
)
In [3]:
dat
Out[3]:
cat1 cat2 num1 num2
0 dog a 0.102079 0.106746
1 dog b 0.603775 0.581288
2 dog a 0.144284 0.345447
3 cat a 0.603948 0.825914
4 dog c 0.051512 0.031002
5 dog b 0.847561 0.749578
6 dog b 0.143133 0.692005
7 cat a 0.541666 0.640678
8 dog c 0.255207 0.587646
9 dog b 0.549595 0.173111

Simplest Usage

Simply calling the method with no arguments already tells you most of the stats you want to know.

In [4]:
dat.describe()
Out[4]:
num1 num2
count 10.000000 10.000000
mean 0.384276 0.473342
std 0.276072 0.286298
min 0.051512 0.031002
25% 0.143421 0.216195
50% 0.398437 0.584467
75% 0.590230 0.679174
max 0.847561 0.825914

Numerical and Categorical cases

Surely you would ask, "Where did the other two columns go?" The issue is that every quantity except count is meaningless for them: they hold categorical content, so there is no mean to be computed. Pandas' strategy is to show the numerical columns and their associated statistics by default. If you really want to see them all under the same roof, the argument include can be set to 'all' for just that purpose.

In [5]:
dat.describe(include='all')
Out[5]:
cat1 cat2 num1 num2
count 10 10 10.000000 10.000000
unique 2 3 NaN NaN
top dog b NaN NaN
freq 8 4 NaN NaN
mean NaN NaN 0.384276 0.473342
std NaN NaN 0.276072 0.286298
min NaN NaN 0.051512 0.031002
25% NaN NaN 0.143421 0.216195
50% NaN NaN 0.398437 0.584467
75% NaN NaN 0.590230 0.679174
max NaN NaN 0.847561 0.825914

This result looks a bit messy. Categorical columns get their own summary statistics: unique, top, and freq. Personally, I don't prefer this view, as it wastes too much valuable printing space.

I like to overview the categorical columns separately. You can probably figure out on your own that include is the argument to tweak: it takes a list of data types to include. 'O', plausibly standing for 'object', selects the string columns. If you like, you can instead exclude the numerical columns for the same effect (see the sketch after the next cell).

In [6]:
dat.describe(include=['O'])
Out[6]:
cat1 cat2
count 10 10
unique 2 3
top dog b
freq 8 4
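
If you prefer to think in terms of what to leave out, the complementary exclude argument gives the same view. A minimal sketch, assuming numpy is imported as np as above:

dat.describe(exclude=[np.number])

This should show only cat1 and cat2, just like include=['O'].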

Quantiles

So far so good, but what if you need quantiles beyond the default first quartile, median, and third quartile? There is another argument, percentiles, ready for the job: you pass the quantiles you want as numbers between 0 and 1. For example, here I want the deciles.

In [7]:
percentiles = np.linspace(0.1, 0.9, 9)
percentiles
Out[7]:
array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])
In [8]:
dat.describe(percentiles=percentiles)
Out[8]:
num1 num2
count 10.000000 10.000000
mean 0.384276 0.473342
std 0.276072 0.286298
min 0.051512 0.031002
10% 0.097022 0.099172
20% 0.134922 0.159838
30.0% 0.143939 0.293746
40% 0.210838 0.486952
50% 0.398437 0.584467
60% 0.544838 0.608859
70% 0.565849 0.656076
80% 0.603810 0.703520
90% 0.628309 0.757211
max 0.847561 0.825914

This drills down into more detail on how the numbers are distributed. Interestingly, the index label for 30% stands out with an unnecessary '.0'; it's probably a tiny floating-point formatting quirk rather than anything to worry about.

I wonder if we can get similar results for categorical values. Actually, it's tempting to ask what happens if we define an order for the categorical values.

In [10]:
dat.cat1.astype('category', categories=['dog', 'cat']).describe()
Out[10]:
count      10
unique      2
top       dog
freq        8
Name: cat1, dtype: object
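
A side note for anyone reading this on a newer Pandas: the astype('category', categories=...) form used above has since been deprecated and removed. Roughly the same thing can be written with CategoricalDtype; a sketch, not tested against every version:

from pandas.api.types import CategoricalDtype

dog_cat = CategoricalDtype(categories=['dog', 'cat'], ordered=True)
dat.cat1.astype(dog_cat).describe()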

I'll explain what categories does in a different blog, or you can refer to this great YouTube video.

Well, no miracle happens. Maybe this could be added as a feature in Pandas later.

An example plan is as follows:

For example, to get the second decile of the dog/cat column: ten examples times 0.2 gives position 2, so look at the second element of the values sorted in category order, which turns out to be a dog since there are eight dogs. Therefore the second decile is dog. Sounds silly, but it could be useful.
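
As a rough sketch of that plan (the helper name categorical_quantile is mine, and this is only one way it could work), we can sort the values by the category order and pick by position:

import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

def categorical_quantile(values, q, categories):
    """Quantile of an ordered categorical, taken by position in the sorted values."""
    cat = pd.Series(values).astype(CategoricalDtype(categories=categories, ordered=True))
    ordered = cat.sort_values().reset_index(drop=True)
    # nearest-rank position of the q-quantile in the sorted sample
    pos = max(int(np.ceil(q * len(ordered))) - 1, 0)
    return ordered.iloc[pos]

Here categorical_quantile(dat.cat1, 0.2, ['dog', 'cat']) would give 'dog', since eight of the ten values are 'dog'.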

This Jupyter Notebook is available on GitHub

17 February 2017

Use Pandas apply as a scheduler

Context

At work, I helped a coworker with a task where he needed to build a touchpoint date scheduler for a client.

Suppose the client has two regular service plans for its customers, Service A and Service B. Service A is supposed to be performed every two months, while Service B is done every three months. The client provides us with the last touchpoint dates, which look something like the following.

In [1]:
import pandas as pd
In [2]:
touchpoints = pd.DataFrame(
    {
        'customerId': [1, 2, 3],
        'Service A': [pd.datetime(2017, 1, 20), pd.datetime(2017, 1, 28), pd.datetime(2017, 2, 1)],
        'Service B': [pd.datetime(2017, 1, 10), pd.datetime(2017, 1, 1), pd.datetime(2016, 12, 1)]
    },
    columns=['customerId', 'Service A', 'Service B']
)
In [3]:
touchpoints
Out[3]:
customerId Service A Service B
0 1 2017-01-20 2017-01-10
1 2 2017-01-28 2017-01-01
2 3 2017-02-01 2016-12-01
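
A small aside before we move on: pd.datetime was just an alias for Python's datetime.datetime and has been deprecated and removed in recent Pandas versions. If you are following along on a new release, the same columns can be built with pd.Timestamp (or the standard datetime class), for example:

'Service A': [pd.Timestamp(2017, 1, 20), pd.Timestamp(2017, 1, 28), pd.Timestamp(2017, 2, 1)]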

The task is to find the next two services coming up, which depends heavily on the dates when the two services were last done. For customer 1, the next service will be Service A on 21.3.2017, followed by Service B on 11.4.2017. In contrast, for customer 3, the next service will be Service B on 2.3.2017, followed by Service A on 2.4.2017, one month later.

Pulling the upcoming events for the half year after the last services into the table, it would look like the following.

In [4]:
touchpoints.loc[:, 'Service A in two months'] = (touchpoints['Service A'] + pd.Timedelta(2, unit='M')).dt.date
touchpoints.loc[:, 'Service A in four months'] = (touchpoints['Service A'] + pd.Timedelta(4, unit='M')).dt.date
touchpoints.loc[:, 'Service A in six months'] = (touchpoints['Service A'] + pd.Timedelta(6, unit='M')).dt.date
touchpoints.loc[:, 'Service B in three months'] = (touchpoints['Service B'] + pd.Timedelta(3, unit='M')).dt.date
touchpoints.loc[:, 'Service B in six months'] = (touchpoints['Service B'] + pd.Timedelta(6, unit='M')).dt.date
In [5]:
touchpoints
Out[5]:
customerId Service A Service B Service A in two months Service A in four months Service A in six months Service B in three months Service B in six months
0 1 2017-01-20 2017-01-10 2017-03-21 2017-05-21 2017-07-21 2017-04-11 2017-07-11
1 2 2017-01-28 2017-01-01 2017-03-29 2017-05-29 2017-07-29 2017-04-02 2017-07-02
2 3 2017-02-01 2016-12-01 2017-04-02 2017-06-02 2017-08-02 2017-03-02 2017-06-01
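
One caveat worth flagging: pd.Timedelta with unit='M' treats a month as a fixed average length, and newer Pandas versions reject the 'M' and 'Y' units altogether because months are not fixed durations. On a recent release, a calendar-aware alternative (a sketch, not what this notebook originally used) is pd.DateOffset:

touchpoints['Service A'] + pd.DateOffset(months=2)  # lands on the same day of month, two months later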

So we need a way to find the two earliest dates among the columns from Service A in two months to Service B in six months, while keeping track of which date belongs to which service. Consequently, four columns should be added with this information: next service, next service date, second service, second service date.

Apparently, this is a fairly complicated operation. You can't expect the Pandas dev team to have shipped a ready-made idiom for it, so you have to run a function that does it through the apply method. Nevertheless, we can still use Pandas methods like sort_values inside that function. Note that sort_values is the newer name for what used to be called sort.

Core function

First we define the function that will be passed to apply.

In [6]:
def scheduler(row):
    # sort values
    row = row.sort_values()
    # find the first nevents dates
    nevents = 2
    dates = row.iloc[:nevents].tolist()
    # find the events. Note events are stored in the index of the series `row`.
    # and we only need to keep `Service X` and remove `in x months`.
    events = row.index[:nevents].str.extract(r'(Service \w+) .+', expand=False).tolist()
    # now make a series holding the resulting info
    index = ['next service', 'second service', 'next service date', 'second service date']
    return pd.Series(events+dates, index=index)

Here we go. Let's run it and see what we get. The apply method works row by row when we set axis=1, i.e. the function is applied along the column axis for each row. I know it's a bit brain-twisting; once you get used to it, it will also make sense when dealing with axis in other methods, like concat below.

In [7]:
next_services = touchpoints.loc[:, 'Service A in two months':'Service B in six months'].apply(scheduler, axis=1)
next_services
Out[7]:
next service second service next service date second service date
0 Service A Service B 2017-03-21 2017-04-11
1 Service A Service B 2017-03-29 2017-04-02
2 Service B Service A 2017-03-02 2017-04-02

This looks like what we want. You might tweak the order of the columns with the reindex method if you'd like next service date to immediately follow next service; a quick sketch follows, or I may write another blog post on how to do this.
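
For completeness, a quick sketch of that reordering, using the column names produced above:

next_services = next_services.reindex(
    columns=['next service', 'next service date', 'second service', 'second service date']
)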

Push back

Now we append it to the original dataframe. See how axis=1 makes more sense here: we glue the two dataframes together side by side, along the column axis, which is axis=1.

In [8]:
touchpoints = pd.concat([touchpoints, next_services], axis=1)
In [9]:
touchpoints
Out[9]:
customerId Service A Service B Service A in two months Service A in four months Service A in six months Service B in three months Service B in six months next service second service next service date second service date
0 1 2017-01-20 2017-01-10 2017-03-21 2017-05-21 2017-07-21 2017-04-11 2017-07-11 Service A Service B 2017-03-21 2017-04-11
1 2 2017-01-28 2017-01-01 2017-03-29 2017-05-29 2017-07-29 2017-04-02 2017-07-02 Service A Service B 2017-03-29 2017-04-02
2 3 2017-02-01 2016-12-01 2017-04-02 2017-06-02 2017-08-02 2017-03-02 2017-06-01 Service B Service A 2017-03-02 2017-04-02

Let me know in the comments if anything is still unclear. I'll be more than happy to answer.

You can also ask me other Pandas questions. Tell me what you want to do with Pandas, and I'll probably write a blog post just for you.

This Jupyter Notebook is available on GitHub

15 February 2017

A Few Words at the Genesis

  I have been working as a data specialist at a data consulting agency for two years. Though my career is still short, I had a high-profile start: I did data analysis and data modeling for renowned international corporations as well as domestic ones.

  As I gain more valuable experience, I love what I am doing more and more. A pattern I see is that data is increasingly becoming a business asset. Companies have piled up heaps of data in their databases, but extracting the data, cleaning it, and distilling actionable insights from it is easier said than done.

  Many organisations still seem to rely on SQL alone for this purpose, which is slow, as only a limited set of analysis tools is available in that toolbox, and debugging is a headache not even worth mentioning.

  Meanwhile, there are several lightweight and powerful off-the-shelf libraries available in data science programming languages like Python and R. Pandas is one of these analysis tools, and with it I have smashed lots of complicated-looking calculations that really impressed our clients. Using it nearly every day, I have become really good at it; I've become the person people turn to with any Pandas-related question.

  One problem I have found with learning libraries like Pandas is that people learn the tools piecemeal from tutorials (by the way, Pandas has awesome documentation that also makes a good tutorial), and few are able to connect the dots. As a result, I often see brilliant minds among data science enthusiasts spending their time reinventing wheels that already exist in Pandas and are generally faster.

  To close this gap, I am happy to share, in the upcoming posts, my knowledge of writing analysis code that reinvents no wheels and uses as many idioms from the toolbox as possible. I am also thinking of making YouTube videos and posting the links here to show you how I do it live.

  Hope everyone enjoys it here at Enjoyable Data Analysis.