Context¶

One of method to find outliers is to make a boxplot. Outliers are shown as dots when they are 1.5 IQR above the third quartile or below the first quartile. In Pandas, we have methods to calculate quartiles and it's no brainer to calculate IQR from them. Therefore, finding outliers just requires a few steps.

To start with, we make some dummy data.

In [1]:

%matplotlib inline
import pandas as pd
import numpy as np

In [2]:

dat = pd.DataFrame({'a': np.random.random(10), 'b': np.random.random(10)})
dat

Out[2]:

	a	b
0	0.226132	0.485436
1	0.780740	0.921301
2	0.145161	0.012100
3	0.714989	0.893943
4	0.785227	0.655811
5	0.999116	0.163410
6	0.979852	0.409046
7	0.137181	0.685542
8	0.571706	0.455979
9	0.643498	0.024336

To make sure there are outliers to find I manually manipulate some values.

In [3]:

dat.at[0, 'a'] = -1.5
dat.at[0, 'b'] = 5.2
dat.at[1, 'a'] = -3.3

In [4]:

dat.boxplot(grid=False)

Out[4]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f468d556e10>

Quartiles¶

In [5]:

first_quartiles = dat.quantile(0.25)
first_quartiles

Out[5]:

a    0.139176
b    0.224819
Name: 0.25, dtype: float64

In [6]:

third_quartiles = dat.quantile(0.75)
third_quartiles

Out[6]:

a    0.767668
b    0.841842
Name: 0.75, dtype: float64

IQR¶

In [7]:

iqr = third_quartiles - first_quartiles
iqr

Out[7]:

a    0.628491
b    0.617023
dtype: float64

Outliers¶

Now we can compare the values in dataframe to quartiles and IQR's to find outliers.

In [8]:

is_outlier = (dat < first_quartiles - 1.5 * iqr) | (dat > third_quartiles + 1.5 * iqr)
is_outlier

Out[8]:

	a	b
0	True	True
1	True	False
2	False	False
3	False	False
4	False	False
5	False	False
6	False	False
7	False	False
8	False	False
9	False	False

Here we go. One more step is how you want to treat a row is an outlier form the other rows. Two simple ways:

If one of the elements is an outlier.
If all of the elements are an outliers.

For the first case:

In [9]:

is_outlier.any(axis=1)

Out[9]:

0     True
1     True
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

For the second case:

In [10]:

is_outlier.all(axis=1)

Out[10]:

0     True
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

Can you see the difference?

To pull the outlier rows from the dataframe, we select by logic arrays. For example, we pull the outliers for the first case:

In [11]:

dat.loc[is_outlier.any(axis=1)]

Out[11]:

	a	b
0	-1.5	5.200000
1	-3.3	0.921301

This Jupyter Notebook is available on GitHub

Dr Fei's Cave

24 February 2017

Find Outliers with IQR in Dataframe

Context¶

Quartiles¶

IQR¶

Outliers¶

No comments:

Post a Comment