This notebook is available on GitHub.
Context
Sometimes there is a frustrating moment when Pandas reports an error and you have no idea what caused it. Here is one I ran into at work. It happens when the data files contain fields that were entered manually, e.g. a customer's residential address. The problem is that Pandas starts a new row whenever it meets a newline symbol.
import pandas as pd
from io import StringIO
Suppose we would like to have the following data in a dataframe.
correct_data_string = StringIO(
'''
a,b,c
1,2,3
4,5,6
'''
)
correct_data = pd.read_csv(correct_data_string)
correct_data
However, it is not unusual to receive a data file like the following. In column "b", the cell containing 2 unexpectedly ends with a newline symbol.
wrong_data_string = StringIO(
'''
a,b,c
1,2\n,3
4,5,6
'''
)
wrong_data = pd.read_csv(wrong_data_string)
wrong_data
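To make the effect concrete, here is a minimal sketch that counts the rows Pandas produces from the broken text and lists the rows padded with missing values. The names `broken`, `quoted`, and `rows_with_gaps` are illustrative only. The second half assumes you can get the file re-exported with the offending field wrapped in double quotes, in which case the embedded newline should stay inside the cell.

```python
import pandas as pd
from io import StringIO

# Same content as the broken example: the record "1,2,3" is split by a stray newline.
broken = "a,b,c\n1,2\n,3\n4,5,6\n"
df = pd.read_csv(StringIO(broken))

# Three rows come back instead of two; the split record is padded with NaN.
rows_with_gaps = df[df.isna().any(axis=1)]
print(len(df), "rows,", len(rows_with_gaps), "with missing values")

# Assumption: the producer of the file can quote the field.
# A newline inside double quotes is treated as part of the cell, not as a row break.
quoted = 'a,b,c\n1,"2\n",3\n4,5,6\n'
df_quoted = pd.read_csv(StringIO(quoted))
print(df_quoted.shape)
```

Counting rows with unexpected missing values is only a heuristic, since genuine gaps in the data look the same, but it is a cheap first check.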
This data quality problem does not raise an error at this stage; it surfaces later. So when you hit a weird issue, it is a good idea to open the data file and check whether any newline symbols show up somewhere other than the end of a line. A subtler case is a data file created by Microsoft software, which uses a different newline symbol, the carriage return, shown as ^M in vim. Check for that symbol as well.
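The manual check can also be roughed out in code. The sketch below is not a full CSV parser: the hypothetical helper `inspect_raw` works on raw bytes, reports whether any carriage return (the ^M symbol) is present, and flags physical lines whose field count differs from the header. It assumes a comma delimiter and no quoted fields containing commas or newlines.

```python
def inspect_raw(raw: bytes, delimiter: bytes = b","):
    """Return (has_carriage_return, suspects) for raw CSV bytes.

    suspects is a list of (line_number, line) pairs whose number of
    fields differs from the first non-blank line (taken as the header).
    """
    has_cr = b"\r" in raw
    expected = None
    suspects = []
    for number, line in enumerate(raw.split(b"\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines
        n_fields = line.count(delimiter) + 1
        if expected is None:
            expected = n_fields  # header defines the expected width
        elif n_fields != expected:
            suspects.append((number, line))
    return has_cr, suspects


# Made-up sample: Windows line endings plus a stray newline inside a record.
sample = b"a,b,c\r\n1,2\n,3\r\n4,5,6\r\n"
has_cr, suspects = inspect_raw(sample)
print(has_cr, [number for number, _ in suspects])
```

For a real file, read it with `open(path, "rb").read()` so that Python does not translate the line endings before you inspect them.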