4.3 Basic Descriptive Statistics
Time: 20 minutes
Recall our toy dataset for the BMI study from the last section:
Just like NumPy, Pandas offers basic mathematical and statistical operations. These basic summary statistics are great for quick exploratory assessments about a DataFrame.
We have already seen the `.sum()` function, which computes the row or column sums of a DataFrame.
If a row or column has all `NaN` values, 0 is returned as the sum. If any value is not `NaN`, the `NaN` values are skipped and the sum is computed from the remaining values. This can be disabled with the `skipna=False` option, in which case any `NaN` value in a row or column makes the corresponding result `NaN`.
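A minimal sketch of how `.sum()` treats missing values. The small DataFrame here is a hypothetical stand-in with made-up column names, not the BMI dataset from the lesson:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; not the BMI dataset from the lesson.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan],
        "b": [np.nan, np.nan, np.nan],
        "c": [4.0, 5.0, 6.0],
    }
)

# Default (skipna=True): NaN values are skipped, so the all-NaN column "b" sums to 0.
col_sums = df.sum()

# skipna=False: any NaN in a column makes that column's sum NaN.
col_sums_strict = df.sum(skipna=False)

# axis=1 adds up the values across the columns of each row instead.
row_sums = df.sum(axis=1)

print(col_sums)
print(col_sums_strict)
print(row_sums)
```

With the default settings, only column `c` is unaffected by the choice of `skipna`, because it is the only column without missing values.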
The `.mean()` function follows the same pattern, except that the mean of a row or column containing only `NaN` values is `NaN` rather than 0.
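As a rough illustration, the sketch below applies `.mean()` to a small hypothetical DataFrame (made-up column names, not the lesson's BMI data) with and without `skipna`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; not the BMI dataset from the lesson.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan],
        "b": [np.nan, np.nan, np.nan],
    }
)

# Default (skipna=True): NaN values are ignored when averaging.
# Column "a" averages its two valid values; column "b" has none, so its mean is NaN.
col_means = df.mean()

# skipna=False: a single NaN is enough to make the column mean NaN.
col_means_strict = df.mean(skipna=False)

print(col_means)
print(col_means_strict)
```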
To get a brief descriptive statistical summary, use the `.describe()` function.
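A short sketch of `.describe()` on numeric columns. The column names `height_cm` and `weight_kg` and their values are assumptions chosen for illustration; they are not taken from the lesson's dataset:

```python
import pandas as pd

# Hypothetical stand-in data; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0, 180.0],
        "weight_kg": [50.0, 60.0, 70.0, 80.0],
    }
)

# One column of statistics per numeric column of df:
# count, mean, std, min, the 25%/50%/75% quantiles, and max.
summary = df.describe()
print(summary)

# Individual statistics can be looked up by row and column label.
print(summary.loc["mean", "weight_kg"])
```

The result is itself a DataFrame, so it can be indexed, sliced, or saved like any other.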
The following table lists the most commonly used summary statistics methods.
| Method | Description |
|---|---|
| `count` | Number of non-NA values |
| `describe` | Compute set of summary statistics |
| `min`, `max` | Compute minimum and maximum values |
| `argmin`, `argmax` | Compute index locations (integers) at which minimum or maximum value is obtained, respectively; not available on DataFrame objects |
| `idxmin`, `idxmax` | Compute index labels at which minimum or maximum value is obtained, respectively |
| `quantile` | Compute sample quantile ranging from 0 to 1 (default: 0.5) |
| `sum` | Sum of values |
| `mean` | Mean of values |
| `median` | Arithmetic median (50% quantile) of values |
| `mad` | Mean absolute deviation from mean value (deprecated in pandas 1.5, removed in pandas 2.0) |
| `prod` | Product of all values |
| `var` | Sample variance of values |
| `std` | Sample standard deviation of values |
| `skew` | Sample skewness (third moment) of values |
| `kurt` | Sample kurtosis (fourth moment) of values |
| `cumsum` | Cumulative sum of values |
| `cummin`, `cummax` | Cumulative minimum or maximum of values, respectively |
| `cumprod` | Cumulative product of values |
| `diff` | Compute first arithmetic difference (useful for time series) |
| `pct_change` | Compute percent changes |
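To make a few of these methods concrete, here is a small sketch on a hypothetical Series. The values and index labels are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in data with string index labels.
s = pd.Series([3.0, 1.0, 4.0, 2.0], index=["w", "x", "y", "z"])

label_of_min = s.idxmin()          # index label of the smallest value
position_of_max = s.argmax()       # integer position of the largest value
first_quartile = s.quantile(0.25)  # linearly interpolated sample quantile

running_total = s.cumsum()         # cumulative sum along the Series
step_change = s.diff()             # difference from the previous element; first is NaN
relative_change = s.pct_change()   # fractional change from the previous element

print(label_of_min, position_of_max, first_quartile)
print(running_total)
print(step_change)
print(relative_change)
```

Note that `idxmin` returns a label while `argmax` returns a position, which matters whenever the index is not the default `0, 1, 2, ...` range.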