Context¶
Whenever you get a new dataset, one of the first things you want is an overview. What are the values in each column? How are they distributed? It turns out a Pandas DataFrame method called describe can provide this information. It's simple to use, yet comprehensive and powerful.
Without further ado, let's get started by creating some dummy data to play with.
import numpy as np
import pandas as pd
dat = pd.DataFrame(
{
'num1': np.random.random(10),
'num2': np.random.random(10),
'cat1': np.random.choice(['cat', 'dog'], 10),
'cat2': np.random.choice(['a', 'b', 'c'], 10)
}
)
dat
Simplest Usage¶
Simply calling the method with no arguments already tells you most of the stats you want to know.
dat.describe()
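As a quick check of what the default call returns, here is a minimal sketch. The frame demo and its seeded random generator are assumptions made only so the example is reproducible; they are not the data used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame; seeded only so the example is reproducible.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'num1': rng.random(10),
    'cat1': rng.choice(['cat', 'dog'], 10),
})

summary = demo.describe()

# Rows are the statistic names, columns are the numeric columns only.
print(list(summary.index))
print(list(summary.columns))

# Each row matches the corresponding Series method.
print(np.isclose(summary.loc['mean', 'num1'], demo['num1'].mean()))
```

The result is itself a DataFrame, so individual statistics can be picked out with .loc as above.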
Numerical and Categorical cases¶
Surely you would ask, "Where have the other two columns gone?" Well, the issue is that every quantity except count is meaningless for them: they hold categorical content, so there is no way to compute something like mean. Pandas' strategy is to describe only the numerical columns and their associated quantities by default. If you really want to see everything under the same roof, set the argument include to 'all'.
dat.describe(include='all')
This result looks a bit messy. Categorical columns get their own summary rows: unique (the number of distinct values), top (the most frequent value), and freq (how often the top value occurs). Personally, I don't like this view as it wastes too much valuable printing space.
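To make the wasted space concrete, the sketch below checks that the combined table fills the cells that don't apply with NaN. The frame demo is again a hypothetical stand-in, not the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame; seeded only so the example is reproducible.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'num1': rng.random(10),
    'cat1': rng.choice(['cat', 'dog'], 10),
})

combined = demo.describe(include='all')

# Numeric-only statistics are NaN for the text column ...
print(pd.isna(combined.loc['mean', 'cat1']))
# ... and text-only statistics are NaN for the numeric column.
print(pd.isna(combined.loc['unique', 'num1']))
```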
I like to overview the categorical columns separately. You can probably figure out on your own that include is what you need to tweak. It accepts a list of data types to include. 'O', which stands for 'object', selects the string columns. If you like, you can instead use the exclude argument to leave out the numbers for the same purpose.
dat.describe(include=['O'])
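The exclude route can be sketched like this. The frame demo is a hypothetical stand-in for the data above, and select_dtypes is shown only as one alternative way to get the same view, not necessarily the preferred one.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame; seeded only so the example is reproducible.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'num1': rng.random(10),
    'cat1': rng.choice(['cat', 'dog'], 10),
    'cat2': rng.choice(['a', 'b', 'c'], 10),
})

# Drop the numeric columns instead of naming the string ones.
by_exclude = demo.describe(exclude=[np.number])

# Alternative: filter the columns first, then describe what is left.
by_select = demo.select_dtypes(exclude=[np.number]).describe()

print(by_exclude)
```

Both tables cover only cat1 and cat2, with the rows count, unique, top, and freq.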
Quantiles¶
So far so good, but what if you need quantiles beyond the default first quartile, median, and third quartile? There is another argument, percentiles, ready for this mission. Pass it a list of numbers between 0 and 1. For example, here I want the deciles.
percentiles = np.linspace(0.1, 0.9, 9)
percentiles
dat.describe(percentiles=percentiles)
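This gives a more detailed picture of how the numbers are distributed. Interestingly, the index label for 30% stands out with an unnecessary '.0'. Maybe it's a tiny bug, though it is most likely floating-point noise, since 0.3 cannot be represented exactly and np.linspace may return a value slightly off from it.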
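As a sanity check, the decile rows should agree with Series.quantile. The sketch below uses a hypothetical seeded Series and rounds the linspace output to one decimal, an assumption on my part about how to keep the labels clean.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column; seeded only so the example is reproducible.
rng = np.random.default_rng(0)
num = pd.Series(rng.random(10), name='num1')

# Round to one decimal to strip float noise from the linspace values.
deciles = np.round(np.linspace(0.1, 0.9, 9), 1)
summary = num.describe(percentiles=deciles)

# The '30%' row is the same number quantile(0.3) returns.
print(np.isclose(summary['30%'], num.quantile(0.3)))
```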
I wonder if we can get similar results for categorical values. In fact, it's tempting to ask what happens if we define an order for the categories.
dat.cat1.astype(pd.CategoricalDtype(categories=['dog', 'cat'], ordered=True)).describe()
I'll explain what categories does in a different blog post, or you can refer to this great YouTube video.
Well, no miracle happens: the output still only shows count, unique, top, and freq. Maybe this can be a feature added to Pandas later.
An example plan is as follows:
To get the second decile of dog and cat, multiply the number of examples (10) by 0.2, which gives position 2. Then look at the second element of the values sorted by category order. It turns out to be a dog, as there are eight dogs and dog comes first in the order. Therefore the second decile is dog. Sounds silly, but it's probably useful.
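The plan above can be sketched in a few lines. Note that ordinal_quantile is a hypothetical helper written for this post, not a Pandas function, and the pets list (eight dogs, two cats) is illustrative data chosen to match the counts mentioned above.

```python
import numpy as np
import pandas as pd

def ordinal_quantile(values, categories, q):
    """Hypothetical helper: rank-based quantile of an ordered categorical."""
    cat = pd.Categorical(values, categories=categories, ordered=True)
    # Codes follow the category order; -1 marks missing values, so drop them.
    codes = np.sort(cat.codes[cat.codes >= 0])
    if len(codes) == 0:
        raise ValueError("no non-missing values")
    # 1-based rank, e.g. 0.2 * 10 -> 2; round first to absorb float noise.
    position = max(int(np.ceil(round(q * len(codes), 9))), 1)
    return cat.categories[codes[position - 1]]

# Illustrative data: eight dogs and two cats, with dog ordered before cat.
pets = ['dog'] * 8 + ['cat'] * 2
second_decile = ordinal_quantile(pets, ['dog', 'cat'], 0.2)
ninth_decile = ordinal_quantile(pets, ['dog', 'cat'], 0.9)
print(second_decile, ninth_decile)
```

Taking the ceiling of the rank is only one possible convention. Unlike numeric quantiles, there is no way to interpolate between two categories, so some rule like this has to be picked.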
This Jupyter Notebook is available on GitHub