4.3 Basic Descriptive Statistics

Time: 20 minutes

Recall our toy dataset for the BMI study from the last section.

Just like NumPy, Pandas offers basic mathematical and statistical operations. These summary statistics are useful for a quick exploratory assessment of a DataFrame.

We have already seen the .sum() method, which computes the row or column sums of a DataFrame.

Caution

By default (skipna=True), NaN values are skipped when summing. If a row or column has all NaN values, 0 is returned as the sum. If any value is not NaN, the result is the sum of the non-NaN values. This behavior can be disabled with the skipna=False option, in which case any NaN value in a row or column makes the corresponding result NaN.
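As a minimal sketch of how .sum() treats NaN values, the example below uses a small made-up DataFrame. The column names and values are assumptions chosen for illustration; they are not the BMI dataset from the previous section.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, not the BMI dataset from the previous section.
df = pd.DataFrame(
    {
        "weight": [70.0, 82.5, np.nan],
        "height": [1.75, np.nan, 1.62],
        "empty": [np.nan, np.nan, np.nan],
    },
    index=["a", "b", "c"],
)

# Column sums: NaN values are skipped (skipna=True is the default),
# so the all-NaN column "empty" sums to 0.
col_sums = df.sum()

# Row sums: pass axis=1 to sum across the columns of each row.
row_sums = df.sum(axis=1)

# With skipna=False, a single NaN makes the whole result NaN.
strict_sums = df.sum(skipna=False)

print(col_sums)
print(row_sums)
print(strict_sums)
```

Every column here contains at least one NaN, so all three entries of strict_sums come out as NaN.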

The .mean() method follows the same skipna pattern, with one difference: the mean of a row or column that is all NaN is NaN rather than 0.
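The following sketch shows .mean() on a made-up DataFrame with missing values. The data are hypothetical and serve only to illustrate the NaN handling.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {
        "weight": [70.0, 80.0, np.nan],
        "empty": [np.nan, np.nan, np.nan],
    }
)

# Column means: NaN values are skipped by default.
# An all-NaN column has no values to average, so its mean is NaN.
col_means = df.mean()

# Row means across columns.
row_means = df.mean(axis=1)

# skipna=False propagates NaN, as it does for .sum().
strict_means = df.mean(skipna=False)

print(col_means)
print(row_means)
print(strict_means)
```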

To get a brief descriptive statistical summary, use the .describe() method.
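A short sketch of .describe() on a hypothetical numeric DataFrame (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {
        "weight": [60.0, 70.0, 80.0, 90.0],
        "height": [1.60, 1.70, 1.80, 1.90],
    }
)

# For numeric columns, describe() returns one row per statistic:
# count, mean, std, min, the 25%/50%/75% quantiles, and max.
summary = df.describe()
print(summary)

# Individual entries can be looked up by statistic and column label.
mean_weight = summary.loc["mean", "weight"]
```

The result is itself a DataFrame, so it can be indexed, sliced, or transposed like any other.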

The table below lists the most commonly used summary statistics.

Summary Statistics (McKinney 2017)
Method	Description
`count`	Number of non-NA values
`describe`	Compute set of summary statistics
`min`, `max`	Compute minimum and maximum values
`argmin`, `argmax`	Compute index locations (integers) at which minimum or maximum value is obtained, respectively; not available on DataFrame objects
`idxmin`, `idxmax`	Compute index labels at which minimum or maximum value is obtained, respectively
`quantile`	Compute sample quantile ranging from 0 to 1 (default: 0.5)
`sum`	Sum of values
`mean`	Mean of values
`median`	Arithmetic median (50% quantile) of values
`mad`	Mean absolute deviation from mean value (deprecated in pandas 1.5 and removed in pandas 2.0)
`prod`	Product of all values
`var`	Sample variance of values
`std`	Sample standard deviation of values
`skew`	Sample skewness (third moment) of values
`kurt`	Sample kurtosis (fourth moment) of values
`cumsum`	Cumulative sum of values
`cummin`, `cummax`	Cumulative minimum or maximum of values, respectively
`cumprod`	Cumulative product of values
`diff`	Compute first arithmetic difference (useful for time series)
`pct_change`	Compute percent changes
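To make a few of the table entries concrete, here is a minimal sketch on a hypothetical Series. The index labels and values are made up for illustration.

```python
import pandas as pd

# Hypothetical daily measurements for illustration only.
s = pd.Series([10.0, 12.0, 9.0, 15.0], index=["mon", "tue", "wed", "thu"])

max_label = s.idxmax()     # index label of the maximum value
max_pos = s.argmax()       # integer position of the maximum value
median = s.quantile(0.5)   # 50% quantile, the same as s.median()
running_total = s.cumsum() # cumulative sum
change = s.diff()          # difference from the previous element
rel_change = s.pct_change()  # fractional change from the previous element

print(max_label, max_pos, median)
print(running_total)
print(change)
print(rel_change)
```

The first entry of both change and rel_change is NaN, since the first element has no predecessor. Note that .pct_change() returns a fraction (0.2 for a 20% increase), not a percentage.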

McKinney, Wes. 2017. Python for Data Analysis. 2nd ed. O’Reilly Media. https://www.oreilly.com/library/view/python-for-data/9781491957653/.