21 March 2017

R: Check Newline Symbols When Having Weird Issues

This notebook is available on GitHub.

Context

In my last post, I demonstrated with Pandas that stray newline characters can cause problems that don't crash the reading step but lead to frustration at a later stage. In this post, I show that the same issue arises in R as well. So always watch out for this bug, no matter which side you are on in the holy war between Python and R.

First off, this is the correct data set.

In [1]:
correct_data_string <- "
a,b,c
1,2,3
4,5,6
"
cat(correct_data_string)
a,b,c
1,2,3
4,5,6

Suppose we would like to have the following data in a data frame.

In [2]:
correct_data <- read.csv(textConnection(correct_data_string))
correct_data
a b c
1 2 3
4 5 6

However, it is not unusual to get a data file like the following, where the cell containing 2 in column "b" is unexpectedly followed by a newline character.

In [3]:
wrong_data_string <- "
a,b,c
1,2\n,3
4,5,6
"
cat(wrong_data_string)
a,b,c
1,2
,3
4,5,6

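The reading utility doesn't complain when a row has fewer elements than the header. By default, read.csv uses fill = TRUE and pads short rows with NA, so the broken line is read as two separate rows and we end up with three rows instead of two.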

In [4]:
wrong_data <- read.csv(textConnection(wrong_data_string))
wrong_data
a  b  c
1  2  NA
NA 3  NA
4  5  6
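
One way to catch this kind of breakage before it reaches a data frame is to count the fields on each line and compare against the header. The following is only a minimal sketch, assuming a comma-separated file whose first non-blank line is the header; the variable names `n_fields`, `expected` and `short_rows` are made up for illustration.

```r
# Same broken input as above, repeated so the snippet runs on its own.
wrong_data_string <- "
a,b,c
1,2\n,3
4,5,6
"

# count.fields() reports how many fields each non-blank line contains.
con <- textConnection(wrong_data_string)
n_fields <- count.fields(con, sep = ",")
close(con)

# Assumption: the header line defines how many fields every row should have.
expected <- n_fields[1]

# Positions (among the non-blank lines, header included) whose field
# count differs from the header's.
short_rows <- which(n_fields != expected)

print(n_fields)
print(short_rows)
```

Any non-empty `short_rows` is a hint that a line was split or truncated, and those are the lines worth inspecting in an editor.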

This data quality problem doesn't cause an error at this stage, only later. So when you run into some weird issue, it is a good idea to open the data file and check whether any newline characters show up somewhere other than the end of a line. Subtler still are data files created by Microsoft software, which use a different line ending: each line ends with an extra carriage return that appears as ^M in vim. Check for that character as well.
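
If you prefer to check for those ^M characters from R rather than vim, you can look at the raw bytes of the file. This is a rough sketch, not a complete diagnostic: the demo file is a temporary one written here purely for illustration, and `path`, `n_cr` and `n_lf` are hypothetical names.

```r
# Write a small demo file with Windows-style (CRLF) line endings.
path <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b,c\r\n1,2,3\r\n4,5,6\r\n"), path)

# Read the raw bytes so that no line-ending translation can hide the "\r".
bytes <- readBin(path, what = "raw", n = file.info(path)$size)

# 0x0d is the carriage return byte (^M in vim); 0x0a is the line feed byte.
n_cr <- sum(bytes == as.raw(0x0d))
n_lf <- sum(bytes == as.raw(0x0a))

print(c(carriage_returns = n_cr, line_feeds = n_lf))

unlink(path)
```

A non-zero carriage return count means the file contains ^M characters. If the line feed count is higher than the number of rows you expect, some line feeds are probably sitting in the middle of a record.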
