24 February 2017

Find Outliers with IQR in Dataframe

Context

One of method to find outliers is to make a boxplot. Outliers are shown as dots when they are 1.5 IQR above the third quartile or below the first quartile. In Pandas, we have methods to calculate quartiles and it's no brainer to calculate IQR from them. Therefore, finding outliers just requires a few steps.

To start with, we make some dummy data.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
In [2]:
dat = pd.DataFrame({'a': np.random.random(10), 'b': np.random.random(10)})
dat
Out[2]:
a b
0 0.226132 0.485436
1 0.780740 0.921301
2 0.145161 0.012100
3 0.714989 0.893943
4 0.785227 0.655811
5 0.999116 0.163410
6 0.979852 0.409046
7 0.137181 0.685542
8 0.571706 0.455979
9 0.643498 0.024336

To make sure there are outliers to find I manually manipulate some values.

In [3]:
dat.at[0, 'a'] = -1.5
dat.at[0, 'b'] = 5.2
dat.at[1, 'a'] = -3.3
In [4]:
dat.boxplot(grid=False)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f468d556e10>

Quartiles

In [5]:
first_quartiles = dat.quantile(0.25)
first_quartiles
Out[5]:
a    0.139176
b    0.224819
Name: 0.25, dtype: float64
In [6]:
third_quartiles = dat.quantile(0.75)
third_quartiles
Out[6]:
a    0.767668
b    0.841842
Name: 0.75, dtype: float64

IQR

In [7]:
iqr = third_quartiles - first_quartiles
iqr
Out[7]:
a    0.628491
b    0.617023
dtype: float64

Outliers

Now we can compare the values in dataframe to quartiles and IQR's to find outliers.

In [8]:
is_outlier = (dat < first_quartiles - 1.5 * iqr) | (dat > third_quartiles + 1.5 * iqr)
is_outlier
Out[8]:
a b
0 True True
1 True False
2 False False
3 False False
4 False False
5 False False
6 False False
7 False False
8 False False
9 False False

Here we go. One more step is how you want to treat a row is an outlier form the other rows. Two simple ways:

  1. If one of the elements is an outlier.
  2. If all of the elements are an outliers.

For the first case:

In [9]:
is_outlier.any(axis=1)
Out[9]:
0     True
1     True
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

For the second case:

In [10]:
is_outlier.all(axis=1)
Out[10]:
0     True
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

Can you see the difference?

To pull the outlier rows from the dataframe, we select by logic arrays. For example, we pull the outliers for the first case:

In [11]:
dat.loc[is_outlier.any(axis=1)]
Out[11]:
a b
0 -1.5 5.200000
1 -3.3 0.921301

This Jupyter Notebook is available on GitHub

No comments:

Post a Comment