Context
One way to find outliers is to draw a boxplot. Points are shown as dots when they lie more than 1.5 × IQR above the third quartile or below the first quartile. Pandas has a method to calculate quantiles, and the IQR is simply the difference between the third and first quartiles. Finding outliers therefore takes only a few steps.
To start with, we make some dummy data.
%matplotlib inline
import pandas as pd
import numpy as np
dat = pd.DataFrame({'a': np.random.random(10), 'b': np.random.random(10)})
dat
To make sure there are outliers to find, I manually overwrite a few values.
dat.at[0, 'a'] = -1.5
dat.at[0, 'b'] = 5.2
dat.at[1, 'a'] = -3.3
dat.boxplot(grid=False)
Quartiles
first_quartiles = dat.quantile(0.25)
first_quartiles
third_quartiles = dat.quantile(0.75)
third_quartiles
IQR
iqr = third_quartiles - first_quartiles
iqr
Outliers
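Now we can compare the values in the dataframe against the quartiles and the IQR of each column to find outliers. The comparison is aligned by column name, so every value is checked against its own column's limits.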
is_outlier = (dat < first_quartiles - 1.5 * iqr) | (dat > third_quartiles + 1.5 * iqr)
is_outlier
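The steps so far can be wrapped into a small helper. This is only a sketch: the function name `iqr_outlier_mask`, the parameter `k`, and the `example` data are made up for illustration and are not part of the notebook. The data is fixed rather than random so the result is reproducible.

```python
import pandas as pd

def iqr_outlier_mask(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a boolean DataFrame marking values outside the IQR limits.

    Hypothetical helper; k is the multiplier on the IQR (1.5 matches
    the rule used in this post).
    """
    q1 = df.quantile(0.25)   # first quartile of each column
    q3 = df.quantile(0.75)   # third quartile of each column
    iqr = q3 - q1            # interquartile range of each column
    # Comparisons align on column names, so each column uses its own limits
    return (df < q1 - k * iqr) | (df > q3 + k * iqr)

# Fixed illustrative data with one planted outlier per column
example = pd.DataFrame({
    'a': [0.1, 0.2, 0.3, 0.4, 0.5, -3.3],
    'b': [0.5, 0.6, 0.7, 0.8, 5.2, 0.9],
})
mask = iqr_outlier_mask(example)
print(mask)
```

Only the planted values (-3.3 in `a` and 5.2 in `b`) should be flagged in this example.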
Here we go. One more step is to decide when a whole row counts as an outlier. Two simple rules:
- If at least one of the elements is an outlier.
- If all of the elements are outliers.
For the first case:
is_outlier.any(axis=1)
For the second case:
is_outlier.all(axis=1)
Can you see the difference?
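To make the difference concrete, here is a tiny sketch using a hand-written boolean mask. The name `toy_mask` and its values are assumptions chosen for illustration, not output from the data above.

```python
import pandas as pd

# Hypothetical mask shaped like is_outlier:
# row 0 is flagged in both columns, row 1 only in 'a', row 2 in neither
toy_mask = pd.DataFrame({
    'a': [True, True, False],
    'b': [True, False, False],
})

any_rows = toy_mask.any(axis=1)  # True if at least one column is flagged
all_rows = toy_mask.all(axis=1)  # True only if every column is flagged

print(pd.DataFrame({'any': any_rows, 'all': all_rows}))
```

Row 1 is where the two rules disagree: it is selected by `any` but not by `all`.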
To pull the outlier rows from the dataframe, we index it with the boolean Series. For example, to pull the outliers for the first case:
dat.loc[is_outlier.any(axis=1)]
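The same mask can also be inverted to keep only the non-outlier rows, or summed to count outliers per column. The sketch below uses fixed data under the hypothetical names `sample`, `mask`, `cleaned` and `counts`, so it runs on its own. Whether dropping outliers is appropriate depends on your analysis.

```python
import pandas as pd

# Fixed illustrative data with one planted outlier per column
sample = pd.DataFrame({
    'a': [0.1, 0.2, 0.3, 0.4, 0.5, -3.3],
    'b': [0.5, 0.6, 0.7, 0.8, 5.2, 0.9],
})
q1 = sample.quantile(0.25)
q3 = sample.quantile(0.75)
iqr = q3 - q1
mask = (sample < q1 - 1.5 * iqr) | (sample > q3 + 1.5 * iqr)

# ~ negates the boolean Series: keep rows with no outlier in any column
cleaned = sample.loc[~mask.any(axis=1)]

# True counts as 1, so summing the mask counts outliers per column
counts = mask.sum()

print(cleaned)
print(counts)
```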
This Jupyter Notebook is available on GitHub