22 February 2017

How to Get the Most out of the "describe" Method

Context

Whenever you get a new dataset, one of the first things you want is an overview. What are the values in each column? How are they distributed? It turns out that a Pandas dataframe method called describe can give you this information. It is simple to use, yet comprehensive and powerful.

Without further ado, let's get started with creating some dummy data to play with.

In [1]:
import numpy as np
import pandas as pd
In [2]:
dat = pd.DataFrame(
   {
       'num1': np.random.random(10),
       'num2': np.random.random(10),
       'cat1': np.random.choice(['cat', 'dog'], 10),
       'cat2': np.random.choice(['a', 'b', 'c'], 10)
   }
)
In [3]:
dat
Out[3]:
  cat1 cat2      num1      num2
0  dog    a  0.102079  0.106746
1  dog    b  0.603775  0.581288
2  dog    a  0.144284  0.345447
3  cat    a  0.603948  0.825914
4  dog    c  0.051512  0.031002
5  dog    b  0.847561  0.749578
6  dog    b  0.143133  0.692005
7  cat    a  0.541666  0.640678
8  dog    c  0.255207  0.587646
9  dog    b  0.549595  0.173111
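Because the dummy data is random, your numbers will differ from the ones printed here. If you want the same frame on every run, one option is to seed NumPy's global generator first. This is only a sketch: the helper name make_dummy_data and the seed value 0 are my own choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def make_dummy_data(seed=0):
    """Build the same kind of dummy frame as above, reproducibly.

    The function name and the default seed are hypothetical choices.
    """
    # Seeding the global generator makes the np.random.* calls below repeatable.
    np.random.seed(seed)
    return pd.DataFrame(
        {
            'num1': np.random.random(10),
            'num2': np.random.random(10),
            'cat1': np.random.choice(['cat', 'dog'], 10),
            'cat2': np.random.choice(['a', 'b', 'c'], 10),
        }
    )


first = make_dummy_data()
second = make_dummy_data()
# Both frames are built from the same seed, so they should be identical.
print(first.equals(second))
```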

Simplest Usage

Simply calling the method with no arguments already gives you most of the statistics you want to know.

In [4]:
dat.describe()
Out[4]:
            num1       num2
count  10.000000  10.000000
mean    0.384276   0.473342
std     0.276072   0.286298
min     0.051512   0.031002
25%     0.143421   0.216195
50%     0.398437   0.584467
75%     0.590230   0.679174
max     0.847561   0.825914
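To make the rows less of a black box, here is a small sketch that rebuilds the same summary for one column from the individual Series methods. The values are the num1 column printed above, typed in by hand, and the variable names are mine.

```python
import numpy as np
import pandas as pd

# The num1 values from the dummy frame shown earlier.
nums = pd.Series(
    [0.102079, 0.603775, 0.144284, 0.603948, 0.051512,
     0.847561, 0.143133, 0.541666, 0.255207, 0.549595],
    name='num1',
)

summary = nums.describe()

# The same quantities computed one at a time.
manual = pd.Series(
    {
        'count': nums.count(),
        'mean': nums.mean(),
        'std': nums.std(),          # sample standard deviation (ddof=1)
        'min': nums.min(),
        '25%': nums.quantile(0.25),
        '50%': nums.median(),
        '75%': nums.quantile(0.75),
        'max': nums.max(),
    }
)

# Compare the two summaries value by value.
print(np.allclose(summary.values.astype(float), manual.values.astype(float)))
```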

Numerical and Categorical cases

Surely you will ask, "Where have the other two columns gone?" The issue is that every quantity except count is meaningless for them: they hold categorical content, so there is no mean to compute. By default, Pandas summarizes only the numerical columns and their associated quantities. If you really want to see everything under the same roof, set the argument include to 'all'.

In [5]:
dat.describe(include='all')
Out[5]:
       cat1 cat2       num1       num2
count    10   10  10.000000  10.000000
unique    2    3        NaN        NaN
top     dog    b        NaN        NaN
freq      8    4        NaN        NaN
mean    NaN  NaN   0.384276   0.473342
std     NaN  NaN   0.276072   0.286298
min     NaN  NaN   0.051512   0.031002
25%     NaN  NaN   0.143421   0.216195
50%     NaN  NaN   0.398437   0.584467
75%     NaN  NaN   0.590230   0.679174
max     NaN  NaN   0.847561   0.825914

This result looks a bit messy. Categorical columns get their own summary rows: unique (the number of distinct values), top (the most frequent value), and freq (how often the top value occurs). Personally, I don't like this layout, as the NaN padding wastes too much valuable printing space.

I prefer to overview the categorical columns separately. As you can probably guess, include is the argument to tweak: give it a list of data types to include. 'O', which stands for the object dtype, selects string columns. If you like, you can get the same result by excluding the numerical columns with the exclude argument.

In [6]:
dat.describe(include=['O'])
Out[6]:
       cat1 cat2
count    10   10
unique    2    3
top     dog    b
freq      8    4
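For completeness, here is a sketch of the exclude route. It drops the numerical columns instead of selecting the string ones. The small frame is typed in by hand from the first rows of the dummy data so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# A hand-typed slice of the dummy data, just for illustration.
small = pd.DataFrame(
    {
        'num1': [0.102079, 0.603775, 0.144284, 0.603948],
        'cat1': ['dog', 'dog', 'dog', 'cat'],
        'cat2': ['a', 'b', 'a', 'a'],
    }
)

# Excluding every numeric dtype leaves only the string columns in the summary.
cat_summary = small.describe(exclude=[np.number])
print(cat_summary)
```

Which of the two you use is mostly a matter of taste: include names what you want, exclude names what you don't.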

Quantiles

So far so good, but what if you need quantiles beyond the default first quartile, median, and third quartile? Another argument, percentiles, handles that. Pass it a list of numbers between 0 and 1. For example, here I want the deciles.

In [7]:
percentiles = np.linspace(0.1, 0.9, 9)
percentiles
Out[7]:
array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])
In [8]:
dat.describe(percentiles=percentiles)
Out[8]:
            num1       num2
count  10.000000  10.000000
mean    0.384276   0.473342
std     0.276072   0.286298
min     0.051512   0.031002
10%     0.097022   0.099172
20%     0.134922   0.159838
30.0%   0.143939   0.293746
40%     0.210838   0.486952
50%     0.398437   0.584467
60%     0.544838   0.608859
70%     0.565849   0.656076
80%     0.603810   0.703520
90%     0.628309   0.757211
max     0.847561   0.825914

This gives a more detailed view of how the numbers are distributed. Interestingly, the index label for 30% stands out with an unnecessary '.0'. It may be a tiny bug, or simply floating-point noise in the third value produced by np.linspace.

I wonder whether we can get similar results for categorical values. It is tempting to ask what happens if we define an order for the categories.
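One possible explanation for the odd '30.0%' label, which I have not checked against the Pandas source for the version used here, is that the third value coming out of np.linspace is not exactly 0.3. The sketch below shows that difference and builds the deciles from integers instead, which yields the same floats as the literals 0.1 to 0.9. The variable names are mine.

```python
import numpy as np
import pandas as pd

from_linspace = np.linspace(0.1, 0.9, 9)
from_integers = np.arange(1, 10) / 10

# The third decile differs from the literal 0.3 in the linspace version only.
print(from_linspace[2] == 0.3)
print(from_integers[2] == 0.3)

# Using the integer-based deciles with describe on the num1 values from above.
nums = pd.Series(
    [0.102079, 0.603775, 0.144284, 0.603948, 0.051512,
     0.847561, 0.143133, 0.541666, 0.255207, 0.549595],
    name='num1',
)
decile_summary = nums.describe(percentiles=from_integers)
print(decile_summary)
```

Whether the linspace version still prints '30.0%' depends on the Pandas version, so treat this as a workaround to try rather than a diagnosis.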

In [10]:
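# Recent Pandas no longer accepts astype('category', categories=[...]); pass a CategoricalDtype instead
dat.cat1.astype(pd.CategoricalDtype(categories=['dog', 'cat'])).describe()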
Out[10]:
count      10
unique      2
top       dog
freq        8
Name: cat1, dtype: object

I'll explain what categories does in a different post; in the meantime, you can refer to this great YouTube video.

Well, no miracle happens: the summary is the same as before. Maybe this can become a feature in a later version of Pandas.

An example plan is as follows:

To get the second decile of the dog/cat column, multiply the number of examples (10) by 0.2, which gives 2. Then sort the values by the category order and look at the second element. It is a dog, because the first eight sorted values are dogs. Therefore the second decile is dog. It sounds silly, but it could be useful.
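The plan can be sketched in a few lines. To be clear, categorical_quantile is a hypothetical helper I made up for illustration, not something Pandas provides, and rounding the position up with ceil is my own assumption about how to handle positions that are not whole numbers.

```python
import math

import pandas as pd


def categorical_quantile(values, categories, q):
    """Return the value sitting at quantile q once values are sorted by category order.

    Hypothetical helper: not part of Pandas.
    """
    # An ordered categorical sorts by the position of each category in the list.
    ordered = pd.Categorical(values, categories=categories, ordered=True)
    sorted_values = pd.Series(ordered).sort_values().reset_index(drop=True)
    # 1-based position of the quantile, rounded up (an assumption), at least 1.
    position = max(math.ceil(q * len(sorted_values)), 1)
    return sorted_values.iloc[position - 1]


# The cat1 column from the dummy data: eight dogs and two cats.
cat1 = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'dog', 'cat', 'dog', 'dog']

second_decile = categorical_quantile(cat1, ['dog', 'cat'], 0.2)
ninth_decile = categorical_quantile(cat1, ['dog', 'cat'], 0.9)
print(second_decile, ninth_decile)
```

With dog ordered before cat, the second decile lands among the eight dogs and the ninth decile lands on one of the two cats.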

This Jupyter Notebook is available on GitHub
