Prewords¶

So far I've only blogged about data analysis done with Pandas in Python. Today's tool is another widely used one in the data science landscape: R.

I actually started with R earlier than with Pandas. Later, because of my job, I quickly learned Pandas and found it awesome. The tools in R are equally awesome, though. So in this post I'll redo in R a topic we have already worked through in Python. I think it will be interesting to see the difference. A quick finding: R tends to use standalone functions, while Pandas relies more on methods attached to its data types.

Context¶

One method to find outliers is to make a boxplot. Points are drawn as dots, i.e. as outliers, when they lie more than 1.5 IQR above the third quartile or below the first quartile. In R, the matrixStats package has functions to calculate quartiles column by column, and it's a no-brainer to calculate the IQR from them. Therefore, finding outliers requires just a few steps.
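Before moving to a data frame, here is a minimal sketch of the rule on a single numeric vector, using only base R. The vector x is made up for illustration and is not part of the original notebook.

```r
# Toy data (an assumption for illustration): nine ordinary values, one extreme
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# First and third quartiles (quantile() defaults to type 7)
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences: 1.5 IQR below the first quartile and above the third quartile
lower.fence <- q1 - 1.5 * iqr
upper.fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower.fence | x > upper.fence]
outliers
```

Only the value 100 falls outside the fences here.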

To start with, we make some dummy data.

In [1]:

library(dplyr)
library(matrixStats)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

matrixStats v0.51.0 (2016-10-08) successfully loaded. See ?matrixStats for help.

Attaching package: ‘matrixStats’

The following object is masked from ‘package:dplyr’:

    count

In [2]:

dat <- data.frame(a=runif(10), b=runif(10), row.names=1:10)
dat

a	b
0.74122347	0.96371549
0.41808421	0.67254335
0.38853894	0.70647360
0.14292460	0.49835354
0.74627926	0.16966896
0.38096753	0.31313015
0.45550653	0.11488518
0.89302269	0.06733067
0.93957890	0.60182827
0.02948339	0.46074236

To make sure there are outliers to find, I manually overwrite some values.

In [3]:

dat[1, 'a'] <- -1.5
dat[1, 'b'] <- 5.2
dat[2, 'a'] <- -3.3
dat

a	b
-1.50000000	5.20000000
-3.30000000	0.67254335
0.38853894	0.70647360
0.14292460	0.49835354
0.74627926	0.16966896
0.38096753	0.31313015
0.45550653	0.11488518
0.89302269	0.06733067
0.93957890	0.60182827
0.02948339	0.46074236

In [4]:

boxplot(dat)
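To read off numerically what the boxplot draws as dots, base R's boxplot.stats() can be called on each column. The sketch below retypes the manipulated data from the table above so that it runs on its own. One caveat: boxplot() builds its box from Tukey's hinges (see fivenum()), which can differ slightly from the quartiles returned by quantile(), so borderline points may be classified differently by the two approaches.

```r
# Data retyped from the table above
dat <- data.frame(
  a = c(-1.5, -3.3, 0.38853894, 0.14292460, 0.74627926,
        0.38096753, 0.45550653, 0.89302269, 0.93957890, 0.02948339),
  b = c(5.2, 0.67254335, 0.70647360, 0.49835354, 0.16966896,
        0.31313015, 0.11488518, 0.06733067, 0.60182827, 0.46074236)
)

# $out holds the points that lie beyond the whiskers, one vector per column
dots <- lapply(dat, function(col) boxplot.stats(col)$out)
dots
```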

Quartiles¶

In [5]:

# colQuantiles() expects a matrix, and the argument is spelled probs
first.quartiles <- colQuantiles(as.matrix(dat), probs=0.25)
first.quartiles

0.0578436961513944
0.205534257518593

In [6]:

third.quartiles <- colQuantiles(as.matrix(dat), probs=0.75)
third.quartiles

0.673586076707579
0.654864581360016
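As a cross-check that does not depend on matrixStats, both quartiles can also be computed in one call by applying base R's quantile() to each column. This is only a sketch, with the data retyped from the table above so that it runs on its own.

```r
# Data retyped from the table above
dat <- data.frame(
  a = c(-1.5, -3.3, 0.38853894, 0.14292460, 0.74627926,
        0.38096753, 0.45550653, 0.89302269, 0.93957890, 0.02948339),
  b = c(5.2, 0.67254335, 0.70647360, 0.49835354, 0.16966896,
        0.31313015, 0.11488518, 0.06733067, 0.60182827, 0.46074236)
)

# Rows are the requested quantiles, columns are the data frame columns
quartiles <- sapply(dat, quantile, probs = c(0.25, 0.75))
quartiles

first.quartiles <- quartiles["25%", ]
third.quartiles <- quartiles["75%", ]
```

The values agree with the colQuantiles results shown above.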

IQR¶

In [7]:

iqr <- third.quartiles - first.quartiles
iqr

0.615742380556185
0.449330323841423
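Base R also ships an IQR() function, which makes for a quick sanity check of the subtraction. A sketch on column a only, with the values retyped from the table above:

```r
# Column a, retyped from the table above
a <- c(-1.5, -3.3, 0.38853894, 0.14292460, 0.74627926,
       0.38096753, 0.45550653, 0.89302269, 0.93957890, 0.02948339)

# Third quartile minus first quartile, computed by hand
iqr.manual <- unname(quantile(a, 0.75) - quantile(a, 0.25))

# The same quantity from the built-in function
iqr.builtin <- IQR(a)

all.equal(iqr.manual, iqr.builtin)
```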

In [8]:

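# sweep() compares each column with its own fence; a bare vector would be recycled down the rows
m <- as.matrix(dat)
is.outlier <- sweep(m, 2, first.quartiles - 1.5 * iqr, `<`) | sweep(m, 2, third.quartiles + 1.5 * iqr, `>`)
is.outlier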

	a	b
1	TRUE	TRUE
2	TRUE	FALSE
3	FALSE	FALSE
4	FALSE	FALSE
5	FALSE	FALSE
6	FALSE	FALSE
7	FALSE	FALSE
8	FALSE	FALSE
9	FALSE	FALSE
10	FALSE	FALSE
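One thing to be careful about in this step: when a data frame is compared against a plain vector of per-column thresholds, R recycles the vector down each column instead of matching one threshold to each column. With extreme outliers the result may look the same by luck, so it is worth making the column-wise comparison explicit. A sketch with a toy data frame (the names toy and thresholds are made up for illustration):

```r
toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
thresholds <- c(a = 2.5, b = 25)  # intended: one threshold per column

# Naive comparison: the thresholds alternate row by row within each column
naive <- toy > thresholds

# Column-wise comparison: column a against 2.5, column b against 25
by.column <- sweep(as.matrix(toy), 2, thresholds, `>`)

naive
by.column
```

In the naive result, row 4 of column a is compared against 25 and row 1 of column b against 2.5, which is not what was intended.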

Here we go. One more step is to decide when a whole row counts as an outlier relative to the other rows. Two simple ways:

If one of the elements is an outlier.
If all of the elements are outliers.

For the first case:

In [9]:

apply(is.outlier, 1, any)

1: TRUE
2: TRUE
3: FALSE
4: FALSE
5: FALSE
6: FALSE
7: FALSE
8: FALSE
9: FALSE
10: FALSE

Note that the usage pattern of apply is nearly the same as in Pandas, except that in R apply is a standalone function. The axis numbering differs, though: in R, a margin of 1 applies the function to each row and 2 to each column, whereas Pandas uses axis=1 for each row and axis=0 for each column.
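A small sketch of the MARGIN argument on a toy logical matrix (made up for illustration): 1 runs the function once per row, 2 once per column.

```r
# Column a is TRUE, FALSE, FALSE; column b is FALSE, FALSE, TRUE
m <- matrix(c(TRUE, FALSE, FALSE,
              FALSE, FALSE, TRUE),
            nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("a", "b")))

per.row <- apply(m, 1, any)     # one result per row: r1, r2, r3
per.column <- apply(m, 2, any)  # one result per column: a, b

per.row
per.column
```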

For the second case:

In [10]:

apply(is.outlier, 1, all)

1: TRUE
2: FALSE
3: FALSE
4: FALSE
5: FALSE
6: FALSE
7: FALSE
8: FALSE
9: FALSE
10: FALSE

Can you see the difference?
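Because TRUE counts as 1 and FALSE as 0, the same two row-level answers can also be obtained without apply by counting the flags in each row. A sketch on a small hand-written logical matrix shaped like is.outlier (the name flags is made up for illustration):

```r
# Rows are (TRUE, TRUE), (TRUE, FALSE), (FALSE, FALSE)
flags <- matrix(c(TRUE, TRUE, FALSE,
                  TRUE, FALSE, FALSE),
                nrow = 3,
                dimnames = list(NULL, c("a", "b")))

# At least one flag in the row
any.outlier <- rowSums(flags) > 0

# Every column flagged in the row
all.outlier <- rowSums(flags) == ncol(flags)

any.outlier
all.outlier
```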

To pull the outlier rows from the dataframe, we select by a logical vector. For example, to pull the outliers for the first case:

In [11]:

dat[apply(is.outlier, 1, any), ]

a	b
-1.5	5.2000000
-3.3	0.6725434

Selecting rows this way is generic across analysis tools. Just as some syntax is "Pythonic", we can do it in an "R-ic" way with filter from the dplyr library:

In [12]:

dat %>% filter(apply(is.outlier, 1, any))

a	b
-1.5	5.2000000
-3.3	0.6725434
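The steps of this post can be wrapped into one function. The sketch below uses base R only; the helper find.outlier.rows and its arguments k and combine are hypothetical names introduced here, not part of the original notebook or of any package. The data is retyped from the table above so that the block runs on its own.

```r
# Hypothetical helper: returns the rows of a numeric data frame that are
# outliers under the k * IQR rule. combine = any keeps rows with at least one
# flagged element, combine = all keeps rows where every element is flagged.
find.outlier.rows <- function(df, k = 1.5, combine = any) {
  m <- as.matrix(df)
  q <- apply(m, 2, quantile, probs = c(0.25, 0.75))  # row 1: Q1, row 2: Q3
  iqr <- q[2, ] - q[1, ]
  lower <- q[1, ] - k * iqr
  upper <- q[2, ] + k * iqr
  # Compare each column with its own fences
  flags <- sweep(m, 2, lower, `<`) | sweep(m, 2, upper, `>`)
  df[apply(flags, 1, combine), , drop = FALSE]
}

dat <- data.frame(
  a = c(-1.5, -3.3, 0.38853894, 0.14292460, 0.74627926,
        0.38096753, 0.45550653, 0.89302269, 0.93957890, 0.02948339),
  b = c(5.2, 0.67254335, 0.70647360, 0.49835354, 0.16966896,
        0.31313015, 0.11488518, 0.06733067, 0.60182827, 0.46074236)
)

any.rows <- find.outlier.rows(dat)                 # first case
all.rows <- find.outlier.rows(dat, combine = all)  # second case
any.rows
all.rows
```

Passing the combining function as an argument keeps the two cases from the post in one place; a stricter or looser rule only needs a different k.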

This notebook is available on GitHub.

Dr Fei's Cave

28 February 2017

R: Find Outliers with IQR in Dataframe
