Prewords
So far I have only blogged about data analysis done with Pandas in Python. Today's tool is another widely used one in the data science landscape: R.
I actually started with R before Pandas. Later, because of my job, I quickly picked up Pandas and found it awesome. The tools in R are equally awesome, though. So in this post, I'll redo in R a topic we have already worked through in Python. I think it is interesting to see the differences. A quick finding is that R tends to use standalone functions, while Pandas relies more on methods attached to its data types.
Context
One way to find outliers is to draw a boxplot. Points are shown as dots when they lie more than 1.5 times the IQR above the third quartile or below the first quartile. As in Pandas, R has functions to calculate quartiles (here colQuantiles from the matrixStats package), and it's a no-brainer to calculate the IQR from them. Therefore, finding outliers requires just a few steps.
To start with, we make some dummy data.
library(dplyr)
library(matrixStats)
dat <- data.frame(a=runif(10), b=runif(10), row.names=1:10)
dat
To make sure there are outliers to find, I manually overwrite some values.
dat[1, 'a'] <- -1.5
dat[1, 'b'] <- 5.2
dat[2, 'a'] <- -3.3
dat
boxplot(dat)
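The dots drawn by boxplot can also be listed numerically. The sketch below uses base R's boxplot.stats on a made-up vector x (not the notebook's data). Note that boxplot.stats works with hinges rather than quantiles, so its fences can differ slightly from the quartile-based ones computed later in this post.

```r
# Hypothetical toy vector: two low values planted among otherwise regular data.
x <- c(-3.3, -1.5, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)

# boxplot.stats() returns the five-number summary in $stats and the points
# beyond 1.5 * IQR (coef = 1.5 by default) in $out.
stats <- boxplot.stats(x)
stats$out
# -3.3 -1.5
```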
Quartiles
# colQuantiles expects a matrix, so convert the data frame first.
first.quartiles <- colQuantiles(as.matrix(dat), probs=0.25)
first.quartiles
third.quartiles <- colQuantiles(as.matrix(dat), probs=0.75)
third.quartiles
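As a cross-check that needs no extra package, base R's quantile can be applied column by column. This is a minimal sketch on a made-up data frame; toy, q1, and q3 are illustrative names, not part of the original notebook. To my knowledge both quantile and colQuantiles default to the same quantile definition (type 7), so the results should agree.

```r
# Hypothetical toy data with easy-to-check quartiles.
toy <- data.frame(a = c(1, 2, 3, 4, 5), b = c(10, 20, 30, 40, 50))

# sapply() walks over the columns; names = FALSE drops the "25%" label
# so the result is simply named by column.
q1 <- sapply(toy, quantile, probs = 0.25, names = FALSE)
q3 <- sapply(toy, quantile, probs = 0.75, names = FALSE)

q1  # a = 2, b = 20
q3  # a = 4, b = 40
```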
IQR
iqr <- third.quartiles - first.quartiles
iqr
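# Compare each column with its own fences. A plain `dat < vector` would
# recycle the vector down the rows instead of matching it to the columns.
lower <- first.quartiles - 1.5 * iqr
upper <- third.quartiles + 1.5 * iqr
is.outlier <- sweep(as.matrix(dat), 2, lower, "<") | sweep(as.matrix(dat), 2, upper, ">")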
is.outlier
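One detail worth checking when comparing a data frame with one threshold per column is how R recycles the vector. The sketch below, on made-up data (toy and thr are illustrative names), shows that a direct comparison recycles the thresholds down each column, while sweep with MARGIN 2 pairs each threshold with its column.

```r
# Hypothetical toy data and one threshold per column.
toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
thr <- c(a = 2.5, b = 25)

# Direct comparison: thr is recycled row by row within each column,
# so column a is compared against 2.5, 25, 2.5, 25.
naive <- toy < thr

# sweep() with MARGIN = 2 compares column a with 2.5 and column b with 25.
aligned <- sweep(as.matrix(toy), 2, thr, "<")

naive[4, "a"]    # TRUE, because 4 was compared with 25
aligned[4, "a"]  # FALSE, because 4 was compared with 2.5
```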
Here we go. One more step is deciding when a whole row counts as an outlier. Two simple ways:
- If at least one of its elements is an outlier.
- If all of its elements are outliers.
For the first case:
apply(is.outlier, 1, any)
Note that the usage pattern of apply is nearly the same as in Pandas, except that in R apply is a standalone function. The axis numbering differs, though: R counts from 1, so a MARGIN of 1 applies the function to each row and 2 to each column, whereas Pandas uses axis=1 for each row and axis=0 for each column.
For the second case:
apply(is.outlier, 1, all)
Can you see the difference?
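To make the difference concrete, here is a small sketch on a hand-written logical matrix (mask is a made-up stand-in for is.outlier, not the notebook's actual result):

```r
# Hypothetical mask: rows are (TRUE, TRUE), (TRUE, FALSE), (FALSE, FALSE).
mask <- matrix(c(TRUE, TRUE, FALSE,
                 TRUE, FALSE, FALSE),
               ncol = 2, dimnames = list(NULL, c("a", "b")))

any.row <- apply(mask, 1, any)  # at least one outlier in the row
all.row <- apply(mask, 1, all)  # every element of the row is an outlier

any.row  # TRUE  TRUE FALSE
all.row  # TRUE FALSE FALSE
```

Only the second row is treated differently: it has one outlying element, so it is flagged by any but not by all.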
To pull the outlier rows from the data frame, we index it with a logical vector. For example, we pull the outliers for the first case:
dat[apply(is.outlier, 1, any), ]
Selecting rows this way is generic across analysis tools. Just as some syntax is called "Pythonic", we can do it in a "Ric" way with filter from the dplyr package:
dat %>% filter(apply(is.outlier, 1, any))
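The steps of this post can be wrapped into one function. The sketch below is only an illustration: find_outlier_rows and its arguments coef and how are hypothetical names, and it uses base R's quantile instead of colQuantiles so that it runs without extra packages.

```r
# Hypothetical helper: returns the rows of a numeric data frame that contain
# outliers by the 1.5 * IQR rule. `how` is any (default) or all.
find_outlier_rows <- function(df, coef = 1.5, how = any) {
  m <- as.matrix(df)
  q1 <- apply(m, 2, quantile, probs = 0.25)
  q3 <- apply(m, 2, quantile, probs = 0.75)
  iqr <- q3 - q1
  # sweep() pairs each column with its own lower and upper fence.
  mask <- sweep(m, 2, q1 - coef * iqr, "<") | sweep(m, 2, q3 + coef * iqr, ">")
  df[apply(mask, 1, how), , drop = FALSE]
}

# Made-up data: only column a has an outlier (100, in row 5).
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, 20, 30, 40, 50))
res.any <- find_outlier_rows(toy)             # row 5 only
res.all <- find_outlier_rows(toy, how = all)  # no rows
```

Passing the row-wise rule as a function keeps the two cases from above in one place; whether that is worth a helper depends on how often the check is repeated.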
This notebook is available on GitHub.