28 February 2017

R: Find Outliers with IQR in Dataframe

Preface

So far I've only blogged about data analysis done with Pandas in Python. Today's tool is another widely used one in the data science landscape: R.

I actually started with R earlier than Pandas. Later, because of my job, I quickly learned Pandas and found it awesome. The tools in R are equally awesome, though. So in this post I'll redo in R a topic we have already worked on in Python. I think it is interesting to see the difference. A quick finding: R tends to use more standalone functions, while Pandas has more methods attached to its data types.

Context

One method to find outliers is to make a boxplot. Outliers are shown as dots when they lie more than 1.5 IQR above the third quartile or below the first quartile. In R, the matrixStats package has functions to calculate quantiles column by column, and it's a no-brainer to calculate the IQR from them. Therefore, finding outliers requires just a few steps.
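
To make the rule concrete before moving to a data frame, here is a minimal base-R sketch on a single vector. The vector `x` is a hypothetical toy example, not data from this post, and the variable names are only illustrative.

```r
# Hypothetical toy vector; 30 is placed far from the rest on purpose.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1 <- quantile(x, 0.25, names = FALSE)  # first quartile
q3 <- quantile(x, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                          # same value as IQR(x)

lower <- q1 - 1.5 * iqr  # anything below this is an outlier
upper <- q3 + 1.5 * iqr  # anything above this is an outlier

outliers <- x[x < lower | x > upper]
outliers
```

Only the value 30 falls outside the two fences here.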

To start with, we make some dummy data.

In [1]:
library(dplyr)
library(matrixStats)
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

matrixStats v0.51.0 (2016-10-08) successfully loaded. See ?matrixStats for help.

Attaching package: ‘matrixStats’

The following object is masked from ‘package:dplyr’:

    count

In [2]:
dat <- data.frame(a=runif(10), b=runif(10), row.names=1:10)
dat
            a          b
1  0.74122347 0.96371549
2  0.41808421 0.67254335
3  0.38853894 0.70647360
4  0.14292460 0.49835354
5  0.74627926 0.16966896
6  0.38096753 0.31313015
7  0.45550653 0.11488518
8  0.89302269 0.06733067
9  0.93957890 0.60182827
10 0.02948339 0.46074236

To make sure there are outliers to find, I manually overwrite some values.

In [3]:
dat[1, 'a'] <- -1.5
dat[1, 'b'] <- 5.2
dat[2, 'a'] <- -3.3
dat
             a          b
1  -1.50000000 5.20000000
2  -3.30000000 0.67254335
3   0.38853894 0.70647360
4   0.14292460 0.49835354
5   0.74627926 0.16966896
6   0.38096753 0.31313015
7   0.45550653 0.11488518
8   0.89302269 0.06733067
9   0.93957890 0.60182827
10  0.02948339 0.46074236
In [4]:
boxplot(dat)
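
If you want the dotted points as numbers rather than as a picture, base R's boxplot.stats reports them. The sketch below uses column a as printed above (so rounded to 8 decimals). Note that boxplot.stats works from hinges, which can differ slightly from the quartiles returned by quantile-style functions, so borderline points may be classified differently.

```r
# Column a as printed in the post (rounded to 8 decimals).
a <- c(-1.5, -3.3, 0.38853894, 0.14292460, 0.74627926,
       0.38096753, 0.45550653, 0.89302269, 0.93957890, 0.02948339)

stats <- boxplot.stats(a)  # coef = 1.5 by default
stats$out                  # points lying beyond the whiskers
```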

Quartiles

In [5]:
first.quartiles <- colQuantiles(as.matrix(dat), probs=0.25)  # colQuantiles expects a matrix
first.quartiles
  1. 0.0578436961513944
  2. 0.205534257518593
In [6]:
third.quartiles <- colQuantiles(as.matrix(dat), probs=0.75)  # colQuantiles expects a matrix
third.quartiles
  1. 0.673586076707579
  2. 0.654864581360016

IQR

In [7]:
iqr <- third.quartiles - first.quartiles
iqr
  1. 0.615742380556185
  2. 0.449330323841423
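
The same numbers can be obtained without matrixStats. This is a minimal base-R sketch, assuming the data frame printed above; the values are copied from that table, so they are rounded to 8 decimals and the results agree only to about that precision.

```r
# Data copied from the post's printed table (rounded to 8 decimals).
dat <- data.frame(
  a = c(-1.5, -3.3, 0.38853894, 0.14292460, 0.74627926,
        0.38096753, 0.45550653, 0.89302269, 0.93957890, 0.02948339),
  b = c(5.2, 0.67254335, 0.70647360, 0.49835354, 0.16966896,
        0.31313015, 0.11488518, 0.06733067, 0.60182827, 0.46074236)
)

# sapply() walks over the columns of a data frame, one vector at a time.
first.quartiles <- sapply(dat, quantile, probs = 0.25, names = FALSE)
third.quartiles <- sapply(dat, quantile, probs = 0.75, names = FALSE)

# IQR() is the base-R shortcut for third quartile minus first quartile.
iqr <- sapply(dat, IQR)
iqr
```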
In [8]:
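# dat < vector would recycle the vector down the rows, so compare column by column with sweep()
m <- as.matrix(dat)
is.outlier <- sweep(m, 2, first.quartiles - 1.5 * iqr, "<") | sweep(m, 2, third.quartiles + 1.5 * iqr, ">")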
is.outlier
       a     b
1   TRUE  TRUE
2   TRUE FALSE
3  FALSE FALSE
4  FALSE FALSE
5  FALSE FALSE
6  FALSE FALSE
7  FALSE FALSE
8  FALSE FALSE
9  FALSE FALSE
10 FALSE FALSE
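
One thing to watch when comparing a data frame against one threshold per column: R recycles a plain vector in column-major order, that is, down the rows, so the thresholds do not automatically line up with the columns. The sketch below uses a hypothetical 3x2 data frame (not data from this post) to show the difference between a plain comparison and one done with sweep.

```r
# Hypothetical 3x2 data frame and one threshold per column.
df <- data.frame(a = c(1, 5, 9), b = c(10, 50, 90))
threshold <- c(a = 4, b = 40)

# Plain comparison: the vector is recycled in column-major order, so
# column a is compared against 4, 40, 4 and column b against 40, 4, 40.
naive <- df < threshold

# sweep() with MARGIN = 2 lines up threshold[j] with column j.
columnwise <- sweep(as.matrix(df), 2, threshold, "<")

naive
columnwise
```

The two results differ in row 2 of column a: 5 is compared with 40 in the plain version and with 4 in the sweep version.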

With the outlier mask in hand, the remaining step is to decide when a whole row counts as an outlier. Two simple ways:

  1. If at least one of its elements is an outlier.
  2. If all of its elements are outliers.

For the first case:

In [9]:
apply(is.outlier, 1, any)
    1     2     3     4     5     6     7     8     9    10
 TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

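Note that the usage pattern of apply is nearly the same as in Pandas, except that in R apply is a standalone function rather than a method. The numbering differs, though: R's margin argument is 1-based, with 1 meaning the function is applied to each row and 2 to each column, whereas Pandas uses axis=0 to apply a function to each column and axis=1 to each row.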
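
To see what the margin argument does, here is a small sketch on a hypothetical logical matrix (not the mask from this post). It also shows rowSums as a vectorized alternative to apply for the two row-wise rules; that alternative is my suggestion, not something the notebook uses.

```r
# Hypothetical logical matrix: rows are observations, columns are variables.
m <- matrix(c(TRUE,  TRUE,
              TRUE,  FALSE,
              FALSE, FALSE),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("1", "2", "3"), c("a", "b")))

by.row <- apply(m, 1, any)  # margin 1: one value per row
by.col <- apply(m, 2, any)  # margin 2: one value per column

# Vectorized equivalents of the two row-wise rules.
# TRUE counts as 1, so rowSums() gives the number of outliers per row.
any.row <- rowSums(m) > 0
all.row <- rowSums(m) == ncol(m)
```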

For the second case:

In [10]:
apply(is.outlier, 1, all)
    1     2     3     4     5     6     7     8     9    10
 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Can you see the difference?

To pull the outlier rows from the data frame, we index it with a logical vector. For example, to pull the outliers for the first case:

In [11]:
dat[apply(is.outlier, 1, any), ]
     a         b
1 -1.5 5.2000000
2 -3.3 0.6725434

Selecting this way is generic across analysis tools. Just as some syntax is called "Pythonic", we can do it in a "Ric" way with filter from the dplyr package:

In [12]:
dat %>% filter(apply(is.outlier, 1, any))
     a         b
1 -1.5 5.2000000
2 -3.3 0.6725434
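
The steps in this post can be bundled into one function. The following is a sketch in base R only; the name outlier.rows and its arguments how and k are hypothetical, made up for illustration, and the data is copied from the table printed earlier (rounded to 8 decimals).

```r
# Hypothetical helper (not from the notebook): returns the rows of a numeric
# data frame that are outliers under the k * IQR rule.
# how = "any": at least one column is an outlier; how = "all": every column is.
outlier.rows <- function(df, how = c("any", "all"), k = 1.5) {
  how <- match.arg(how)
  m <- as.matrix(df)
  q1 <- apply(m, 2, quantile, probs = 0.25, names = FALSE)
  q3 <- apply(m, 2, quantile, probs = 0.75, names = FALSE)
  fence <- k * (q3 - q1)
  # sweep() compares each column against its own lower and upper limit.
  mask <- sweep(m, 2, q1 - fence, "<") | sweep(m, 2, q3 + fence, ">")
  keep <- if (how == "any") rowSums(mask) > 0 else rowSums(mask) == ncol(mask)
  df[keep, , drop = FALSE]
}

# Data copied from the post's printed table (rounded to 8 decimals).
dat <- data.frame(
  a = c(-1.5, -3.3, 0.38853894, 0.14292460, 0.74627926,
        0.38096753, 0.45550653, 0.89302269, 0.93957890, 0.02948339),
  b = c(5.2, 0.67254335, 0.70647360, 0.49835354, 0.16966896,
        0.31313015, 0.11488518, 0.06733067, 0.60182827, 0.46074236)
)

outlier.rows(dat, "any")  # rows with at least one outlying value
outlier.rows(dat, "all")  # rows where every value is outlying
```

With this data, the "any" rule keeps rows 1 and 2 and the "all" rule keeps only row 1, matching the results above.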

This notebook is available on GitHub.
