12 March 2017

Check Newline Symbols When Having Weird Issues

This notebook is available on GitHub.

Context

Sometimes Pandas reports an error and you have no idea what caused it. Here is one such case from my own work. It arises when a data file has fields that were typed in by hand, e.g. a customer's residential address. The problem is that Pandas starts a new row whenever it encounters a newline symbol, including one typed into the middle of an unquoted field.

In [1]:
import pandas as pd
from io import StringIO

Suppose we would like to have the following data in a dataframe.

In [2]:
# The lines are left-aligned so that no column name picks up leading spaces.
correct_data_string = StringIO(
'''a,b,c
1,2,3
4,5,6
'''
)
In [3]:
correct_data = pd.read_csv(correct_data_string)
correct_data
Out[3]:
   a  b  c
0  1  2  3
1  4  5  6

However, it is not unusual to get a data file like the following. In column "b", the cell containing 2 unexpectedly ends with a newline symbol.

In [4]:
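# The \n after the 2 becomes a real newline inside the string,
# which splits the second record across two lines.
wrong_data_string = StringIO(
'''a,b,c
1,2\n,3
4,5,6
'''
)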
In [5]:
wrong_data = pd.read_csv(wrong_data_string)
wrong_data
Out[5]:
     a  b    c
0  1.0  2  NaN
1  NaN  3  NaN
2  4.0  5  6.0
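
One rough way to spot this kind of damage after loading is to list the rows that contain missing values, because Pandas pads a broken line with NaN. The sketch below rebuilds the same malformed input so that it runs on its own. It assumes the clean data has no legitimate missing values, which may not hold for real files, and the name `suspect_rows` is only illustrative.

```python
import pandas as pd
from io import StringIO

# Rebuild the malformed input from above: a stray newline after the 2.
wrong_data = pd.read_csv(StringIO('a,b,c\n1,2\n,3\n4,5,6\n'))

# Rows with at least one missing value are candidates for a broken line.
suspect_rows = wrong_data[wrong_data.isna().any(axis=1)]
print(suspect_rows)
```

Here the two halves of the broken record (index 0 and 1) are selected, while the intact row is left out.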

Pandas raises no error at this stage; the bad rows only cause problems later, far from their source. So when you run into a weird issue, it is a good idea to open the data file and check whether any newline symbols appear somewhere other than the end of a line. A more subtle case is a data file created by Microsoft software, which ends each line with a carriage return followed by a newline; the carriage return shows up as ^M in vim. Check for that symbol as well.
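
The check described above can also be done in code rather than by eye. The following is a minimal sketch, not a complete validator: it uses the standard `csv` module to flag lines whose field count differs from the header's, and counts carriage returns in the raw text. The helper names `find_suspect_lines` and `count_carriage_returns` are made up for this example.

```python
import csv
from io import StringIO


def find_suspect_lines(text):
    """Return (line_number, field_count) for lines that do not match the header."""
    reader = csv.reader(StringIO(text))
    header = next(reader)
    suspects = []
    for row in reader:
        # Skip blank lines; flag lines with too few or too many fields.
        if row and len(row) != len(header):
            suspects.append((reader.line_num, len(row)))
    return suspects


def count_carriage_returns(text):
    """Count carriage returns, the character that vim displays as ^M."""
    return text.count("\r")


sample = 'a,b,c\n1,2\n,3\n4,5,6\n'
print(find_suspect_lines(sample))                       # [(2, 2), (3, 2)]
print(count_carriage_returns('a,b,c\r\n1,2,3\r\n'))     # 2
```

When reading from a real file, pass `newline=""` to `open()` so that Python does not translate the line endings before you get to inspect them.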
