4.3 Basic Descriptive Statistics


Duration: 20 minutes

Recall our toy dataset for the BMI study from the last section.

Just like NumPy, Pandas offers basic mathematical and statistical operations. These summary statistics are useful for a quick exploratory assessment of a DataFrame.

We have already seen the .sum() method, which computes the row or column sums of a DataFrame.
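As a quick refresher, the sketch below sums along each axis. The DataFrame is a hypothetical stand-in for the BMI toy dataset; its column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the BMI toy dataset (values are illustrative only).
df = pd.DataFrame(
    {"height_m": [1.60, 1.75, 1.82], "weight_kg": [55.0, 70.0, 90.0]},
    index=["p1", "p2", "p3"],
)

col_sums = df.sum()        # axis=0 is the default: one sum per column
row_sums = df.sum(axis=1)  # axis=1: one sum per row

print(col_sums)
print(row_sums)
```

Summing across a row here mixes metres and kilograms, so the row sums only illustrate the mechanics of the axis argument.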

Caution

By default, NaN values are skipped when summing. If a row or column has all NaN values, 0 is returned as the sum. If any value is not NaN, the result is the sum of the non-NaN values. This behavior can be disabled with the skipna=False option, in which case any NaN value in a row or column makes the corresponding result NaN.
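The sketch below shows how the skipna option changes the result of .sum(). It assumes a reasonably recent pandas version, where the sum of an all-NaN column is 0; the small DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" has one NaN, column "b" is entirely NaN.
df_nan = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, np.nan]}
)

default_sums = df_nan.sum()             # skipna=True by default: NaN values are ignored
strict_sums = df_nan.sum(skipna=False)  # any NaN makes the column's sum NaN

print(default_sums)
print(strict_sums)
```

With the default, column "a" sums its two non-NaN values and column "b" sums to 0. With skipna=False, both columns return NaN.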

The .mean() method handles NaN values the same way, with one difference: the mean of an all-NaN row or column is NaN rather than 0.
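A minimal sketch of .mean() on data with missing values, using a hypothetical DataFrame. Note that an all-NaN column has a mean of NaN, since there are no values to average.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" has one NaN, column "b" is entirely NaN.
df_nan = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, np.nan]}
)

default_means = df_nan.mean()             # NaN values are skipped by default
strict_means = df_nan.mean(skipna=False)  # any NaN makes the column's mean NaN

print(default_means)
print(strict_means)
```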

To get a brief descriptive statistical summary of every numeric column at once, use the .describe() method.
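For illustration, here is .describe() applied to a hypothetical stand-in for the BMI dataset (the column names and values are assumptions, not the original data). For numeric columns it reports the count, mean, standard deviation, minimum, quartiles, and maximum.

```python
import pandas as pd

# Hypothetical stand-in for the BMI toy dataset (values are illustrative only).
df = pd.DataFrame(
    {"height_m": [1.60, 1.75, 1.82], "weight_kg": [55.0, 70.0, 90.0]}
)

summary = df.describe()  # one column of statistics per numeric column
print(summary)

# Individual statistics can be picked out by row label and column name.
median_weight = summary.loc["50%", "weight_kg"]
print(median_weight)
```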

The table below lists the most commonly used summary statistics methods.

Summary Statistics (McKinney 2017)
| Method | Description |
| --- | --- |
| count | Number of non-NA values |
| describe | Compute set of summary statistics |
| min, max | Compute minimum and maximum values |
| argmin, argmax | Compute index locations (integers) at which minimum or maximum value is obtained, respectively; not available on DataFrame objects |
| idxmin, idxmax | Compute index labels at which minimum or maximum value is obtained, respectively |
| quantile | Compute sample quantile ranging from 0 to 1 (default: 0.5) |
| sum | Sum of values |
| mean | Mean of values |
| median | Arithmetic median (50% quantile) of values |
| mad | Mean absolute deviation from mean value (removed in pandas 2.0) |
| prod | Product of all values |
| var | Sample variance of values |
| std | Sample standard deviation of values |
| skew | Sample skewness (third moment) of values |
| kurt | Sample kurtosis (fourth moment) of values |
| cumsum | Cumulative sum of values |
| cummin, cummax | Cumulative minimum or maximum of values, respectively |
| cumprod | Cumulative product of values |
| diff | Compute first arithmetic difference (useful for time series) |
| pct_change | Compute percent changes |
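To make a few of the table entries concrete, the sketch below applies some of them to a small Series. The data and the participant labels p1 to p3 are hypothetical.

```python
import pandas as pd

# Hypothetical weights (kg) for three participants.
weights = pd.Series([55.0, 70.0, 90.0], index=["p1", "p2", "p3"], name="weight_kg")

heaviest = weights.idxmax()         # index label of the maximum value
q75 = weights.quantile(0.75)        # 75% sample quantile (linear interpolation by default)
running_total = weights.cumsum()    # cumulative sum down the Series
step = weights.diff()               # difference from the previous element; first entry is NaN
rel_change = weights.pct_change()   # fractional change from the previous element

print(heaviest, q75)
print(running_total)
print(step)
print(rel_change)
```

Reductions such as idxmax and quantile return a single value per Series, while cumsum, diff, and pct_change return a Series of the same length as the input.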
McKinney, Wes. 2017. Python for Data Analysis. 2nd ed. O’Reilly Media. https://www.oreilly.com/library/view/python-for-data/9781491957653/.