Context¶

Whenever you get a new dataset, one of the first things you want is an overview. What are the values in each column? How are they distributed? It turns out that a Pandas DataFrame method called describe provides this information. It is simple to use, yet comprehensive and powerful.

Without further ado, let's get started with creating some dummy data to play with.

In [1]:

import numpy as np
import pandas as pd

In [2]:

dat = pd.DataFrame(
   {
       'num1': np.random.random(10),
       'num2': np.random.random(10),
       'cat1': np.random.choice(['cat', 'dog'], 10),
       'cat2': np.random.choice(['a', 'b', 'c'], 10)
   }
)

In [3]:

dat

Out[3]:

	cat1	cat2	num1	num2
0	dog	a	0.102079	0.106746
1	dog	b	0.603775	0.581288
2	dog	a	0.144284	0.345447
3	cat	a	0.603948	0.825914
4	dog	c	0.051512	0.031002
5	dog	b	0.847561	0.749578
6	dog	b	0.143133	0.692005
7	cat	a	0.541666	0.640678
8	dog	c	0.255207	0.587646
9	dog	b	0.549595	0.173111

Simplest Usage¶

Simply calling the method with no arguments already gives you most of the stats you want to know.

In [4]:

dat.describe()

Out[4]:

	num1	num2
count	10.000000	10.000000
mean	0.384276	0.473342
std	0.276072	0.286298
min	0.051512	0.031002
25%	0.143421	0.216195
50%	0.398437	0.584467
75%	0.590230	0.679174
max	0.847561	0.825914
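To make the rows above less of a black box, here is a minimal sketch that rebuilds the same summary from the individual column-wise methods. The frame `demo`, the seed, and the name `by_hand` are made up for this illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up numerical data; the seed is arbitrary and only keeps the run repeatable.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'num1': rng.random(10), 'num2': rng.random(10)})

summary = demo.describe()

# Rebuild each row of the summary with the matching column-wise method.
by_hand = pd.DataFrame({
    'count': demo.count(),
    'mean': demo.mean(),
    'std': demo.std(),            # sample standard deviation (ddof=1)
    'min': demo.min(),
    '25%': demo.quantile(0.25),
    '50%': demo.median(),
    '75%': demo.quantile(0.75),
    'max': demo.max(),
}).T

print(np.allclose(summary.values, by_hand.values))
```

So describe is essentially a convenient bundle of count, mean, std, min, max, and a few quantiles.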

Numerical and Categorical cases¶

Surely you would ask, "Where have the other two columns gone?" The issue is that every quantity except count is meaningless for them: their contents are categorical, so there is no way to compute a mean. Pandas' strategy is to describe only the numerical columns and their associated quantities by default. If you really want to see everything under the same roof, set the include argument to 'all'.

In [5]:

dat.describe(include='all')

Out[5]:

	cat1	cat2	num1	num2
count	10	10	10.000000	10.000000
unique	2	3	NaN	NaN
top	dog	b	NaN	NaN
freq	8	4	NaN	NaN
mean	NaN	NaN	0.384276	0.473342
std	NaN	NaN	0.276072	0.286298
min	NaN	NaN	0.051512	0.031002
25%	NaN	NaN	0.143421	0.216195
50%	NaN	NaN	0.398437	0.584467
75%	NaN	NaN	0.590230	0.679174
max	NaN	NaN	0.847561	0.825914

This result looks a bit messy. Categorical columns get their own summary rows: unique (the number of distinct values), top (the most frequent value), and freq (how often the top value occurs). Personally, I don't like this layout, as it wastes too much valuable printing space.

I prefer to overview the categorical columns separately. You have probably figured out on your own that include is the argument to tweak. It accepts a list of data types to include. 'O', which stands for 'object', selects the string columns. If you like, you can exclude the numerical columns to the same effect.

In [6]:

dat.describe(include=['O'])

Out[6]:

	cat1	cat2
count	10	10
unique	2	3
top	dog	b
freq	8	4
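The exclusion route mentioned above can be sketched as follows, together with a check of what unique, top, and freq mean. The small frame `pets` is made up for this illustration, and `exclude=[np.number]` is one way to spell "everything except numbers".

```python
import numpy as np
import pandas as pd

# Made-up data with one numerical and one string column.
pets = pd.DataFrame({
    'num1': [0.1, 0.6, 0.3, 0.9, 0.5],
    'cat1': ['dog', 'dog', 'cat', 'dog', 'cat'],
})

# Dropping the numerical columns leaves only the categorical summary.
cat_summary = pets.describe(exclude=[np.number])
print(cat_summary)

# The same quantities computed by hand from the value counts.
counts = pets['cat1'].value_counts()   # sorted from most to least frequent
print(pets['cat1'].nunique(), counts.index[0], counts.iloc[0])
```

Here unique is 2, top is dog, and freq is 3, which is what the value counts report as well.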

Quantiles¶

So far so good, but what if you need quantiles beyond the default first quartile, median, and third quartile? The percentiles argument is ready for that job. You pass the quantiles as numbers between 0 and 1. For example, here I want the deciles.

In [7]:

percentiles = np.linspace(0.1, 0.9, 9)
percentiles

Out[7]:

array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])

In [8]:

dat.describe(percentiles=percentiles)

Out[8]:

	num1	num2
count	10.000000	10.000000
mean	0.384276	0.473342
std	0.276072	0.286298
min	0.051512	0.031002
10%	0.097022	0.099172
20%	0.134922	0.159838
30.0%	0.143939	0.293746
40%	0.210838	0.486952
50%	0.398437	0.584467
60%	0.544838	0.608859
70%	0.565849	0.656076
80%	0.603810	0.703520
90%	0.628309	0.757211
max	0.847561	0.825914

This drills down into more detail on how the numbers are distributed. Interestingly, the index label for 30% stands out with an unnecessary '.0'. Maybe it's a tiny bug, or possibly floating-point rounding in the value that np.linspace produces.

I wonder if we can get similar results for categorical values. Actually, it's tempting to ask what happens if we define an order for the categorical values.
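One way to sidestep the odd label, sketched under the assumption that it comes from floating-point rounding in np.linspace: build the deciles by dividing integers by 10, so each value is the closest float to its decimal. The frame `demo` and the seed are made up for this illustration.

```python
import numpy as np
import pandas as pd

# The third linspace value may differ from 0.3 in its last digits.
raw = np.linspace(0.1, 0.9, 9)
print(repr(raw[2]))

# Dividing integers by 10 gives the float closest to each decile.
deciles = np.arange(1, 10) / 10

# Made-up data; the seed is arbitrary.
demo = pd.DataFrame({'num1': np.random.default_rng(0).random(10)})
labels = demo.describe(percentiles=deciles).index.tolist()
print(labels)
```

With these inputs the labels should read 10%, 20%, 30%, and so on, without a trailing '.0'.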

In [10]:

# List the categories as dog, then cat, before describing the column
dat.cat1.astype(pd.CategoricalDtype(categories=['dog', 'cat'])).describe()

Out[10]:

count      10
unique      2
top       dog
freq        8
Name: cat1, dtype: object

I'll explain what categories does in a different blog post, or you can refer to this great YouTube video.

Well, no miracle happens. Maybe this can be a feature added to Pandas later.

An example plan is as follows:

To get the second decile of dog and cat, multiply the number of examples (10) by 0.2, which gives 2. Then look at the second element of the values sorted in category order. It turns out to be a dog, since the first eight elements are dogs. Therefore the second decile is dog. It sounds silly, but it's probably useful.
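The plan above can be sketched in a few lines. This is only an illustration of the idea, not a Pandas feature: `categorical_quantile` is a hypothetical helper, the `pets` list mirrors the cat1 column from the dummy data, and missing values are assumed to be absent.

```python
import math

import numpy as np
import pandas as pd


def categorical_quantile(values, categories, q):
    """Return the category at quantile q of the values sorted in category order.

    Hypothetical helper; assumes no missing values and 0 < q <= 1.
    """
    cat = pd.Categorical(values, categories=categories, ordered=True)
    ordered_codes = np.sort(cat.codes)  # integer codes follow the category order
    # Position of the quantile among the sorted values, counted from 1.
    # The inner round guards against tiny floating-point overshoot in q * n.
    position = max(math.ceil(round(q * len(ordered_codes), 9)), 1)
    return cat.categories[ordered_codes[position - 1]]


# Eight dogs and two cats, as in the cat1 column above.
pets = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'dog', 'cat', 'dog', 'dog']
second_decile = categorical_quantile(pets, ['dog', 'cat'], 0.2)
print(second_decile)  # dog
```

With dog ordered before cat, every decile up to the eighth is dog, and only the ninth and tenth are cat.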

This Jupyter Notebook is available on GitHub

Dr Fei's Cave

22 February 2017

How to Get the Most of "describe" Method?
