3.2 Statistical Methods


Estimated time: 15 minutes

NumPy arrays have accessible methods to compute basic statistics of the entire array or along an axis. Some of the most commonly used statistical methods are listed in the table below.

Mathematical and Statistical Methods (McKinney 2017)
| Method | Description |
| --- | --- |
| sum | Sum of all the elements in the array or along an axis; zero-length arrays have sum 0 |
| mean | Arithmetic mean; invalid (returns NaN) on zero-length arrays |
| std, var | Standard deviation and variance, respectively |
| min, max | Minimum and maximum |
| argmin, argmax | Indices of minimum and maximum elements, respectively |
| cumsum | Cumulative sum of elements starting from 0 |
| cumprod | Cumulative product of elements starting from 1 |

McKinney, Wes. 2017. Python for Data Analysis. 2nd ed. O’Reilly Media. https://www.oreilly.com/library/view/python-for-data/9781491957653/.
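
As a quick illustration of the methods in the table, here is a minimal sketch on a small hand-picked array. The array `a` and the variable names are assumptions made for this example only.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
a = np.array([3, 1, 4, 1, 5])

total = a.sum()                        # 14
variance = a.var()                     # 2.56 (population variance, ddof=0 by default)
spread = a.std()                       # 1.6  (square root of the variance)
smallest, largest = a.min(), a.max()   # 1 and 5
i_min, i_max = a.argmin(), a.argmax()  # 1 and 4 (first occurrence wins on ties)
running_sum = a.cumsum()               # [3, 4, 8, 9, 14]
running_prod = a.cumprod()             # [3, 3, 12, 12, 60]
```

Note that `std` and `var` divide by the number of elements by default; passing `ddof=1` gives the sample versions.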

For example, the mean of an entire array A can be computed using A.mean() without any arguments; equivalently numpy.mean(A). The output, in this case, is a scalar.
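
A minimal sketch of the two equivalent calls, assuming a small 2-by-3 matrix `A` made up for this example:

```python
import numpy as np

# Hypothetical 2x3 matrix used only for illustration.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

m_method = A.mean()      # method form: 3.5
m_function = np.mean(A)  # function form: 3.5

# Both results are scalars (zero-dimensional), not arrays.
print(m_method, m_function)
```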

Rather than a single overall average, we often want averages along a specific axis, say axis=0. If A is a matrix, A.mean(axis=0) (equivalently numpy.mean(A, axis=0)) computes the column averages, producing a vector whose length equals the number of columns.
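
The axis-wise mean can be sketched as follows, again assuming a made-up 2-by-3 matrix `A`. The row averages (axis=1) are included only for contrast.

```python
import numpy as np

# Hypothetical 2x3 matrix used only for illustration.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

col_means = A.mean(axis=0)         # one value per column: [2.5, 3.5, 4.5]
col_means_fn = np.mean(A, axis=0)  # same result via the function form
row_means = A.mean(axis=1)         # one value per row: [2.0, 5.0]

print(col_means.shape)  # (3,), the number of columns
```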

For higher-dimensional arrays, the output has one dimension fewer than the input: its shape is the input shape with the reduced axis removed.
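
To make the shape rule concrete, here is a small sketch with a 3D array. The array `B` and its shape (2, 3, 4) are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical 3D array of shape (2, 3, 4).
B = np.arange(24).reshape(2, 3, 4)

shape_axis0 = B.mean(axis=0).shape  # (3, 4): axis 0 is dropped
shape_axis1 = B.mean(axis=1).shape  # (2, 4): axis 1 is dropped
shape_axis2 = B.mean(axis=2).shape  # (2, 3): axis 2 is dropped

print(shape_axis0, shape_axis1, shape_axis2)
```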

Other Functions

numpy.where

The numpy.where function can come in handy in data science in many ways.

Suppose that you have generated a random array from the standard normal distribution, and that values less than -2 or greater than 2 are too extreme for your purpose. You can use numpy.where to replace every such extreme value (i.e., every x with |x| > 2) with 2.

In this example, there are two such occurrences.

We now replace them with 2.
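
The steps above can be sketched as follows. To keep the result reproducible, this sketch uses a small hand-written array `x` in place of actual random draws; the values are assumptions chosen so that exactly two of them are extreme.

```python
import numpy as np

# Hand-picked values standing in for draws from a standard normal
# distribution (an assumption made for reproducibility).
x = np.array([0.3, -2.5, 1.1, 2.7, -0.4, 0.9])

# Boolean mask marking values below -2 or above 2.
extreme = np.abs(x) > 2
n_extreme = extreme.sum()  # 2

# Replace the extreme values with 2 and keep the others unchanged.
x_capped = np.where(extreme, 2, x)  # [0.3, 2.0, 1.1, 2.0, -0.4, 0.9]
```

Note that, as described in the text, the value -2.5 is also replaced with 2, not with -2.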

In its general form, np.where(cond, arr1, arr2) takes a boolean array cond and, element by element, picks the value from arr1 where cond is True and from arr2 where it is False. Note that one or both of arr1 and arr2 can also be scalars, as in our example above.
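
A minimal sketch of the general form, with made-up arrays `cond`, `arr1`, and `arr2`:

```python
import numpy as np

# Hypothetical inputs used only for illustration.
cond = np.array([True, False, True, False])
arr1 = np.array([1.1, 1.2, 1.3, 1.4])
arr2 = np.array([2.1, 2.2, 2.3, 2.4])

# Take from arr1 where cond is True, otherwise from arr2.
picked = np.where(cond, arr1, arr2)       # [1.1, 2.2, 1.3, 2.4]

# A scalar can replace either array; it is broadcast to every position.
picked_scalar = np.where(cond, arr1, 0.0)  # [1.1, 0.0, 1.3, 0.0]
```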

numpy.sort

A NumPy array arr can be sorted along an axis using the function numpy.sort(arr), which returns a sorted copy, or the method arr.sort(), which sorts the array in place.
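
The difference between the two forms can be sketched on a small made-up array:

```python
import numpy as np

# Hypothetical 1D array used only for illustration.
arr = np.array([3, 1, 2])

# np.sort returns a sorted copy and leaves arr untouched.
sorted_copy = np.sort(arr)
before_inplace = arr.copy()  # still [3, 1, 2]

# arr.sort() sorts arr itself and returns None.
ret = arr.sort()
```

After this runs, `sorted_copy` and `arr` both hold `[1, 2, 3]`, while `before_inplace` keeps the original order.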

The following code sorts a random matrix along the last axis, that is, within each row. This is the default behavior (axis=-1) when no axis argument is given.
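
A minimal sketch of the default behavior. A fixed matrix `M` is assumed here in place of a random one so that the result is reproducible.

```python
import numpy as np

# Fixed matrix standing in for a random one (an assumption for reproducibility).
M = np.array([[3, 1, 2],
              [9, 7, 8]])

# No axis given: sorts along the last axis, i.e. within each row.
rows_sorted = np.sort(M)                # [[1, 2, 3], [7, 8, 9]]
rows_sorted_explicit = np.sort(M, axis=-1)  # identical result
```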

However, a specific axis can also be supplied in the following way. Here, axis=0 sorts each column.
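
Column-wise sorting can be sketched like this, again with a fixed matrix `M` assumed for reproducibility:

```python
import numpy as np

# Fixed 3x3 matrix used only for illustration.
M = np.array([[3, 7, 2],
              [1, 9, 8],
              [2, 0, 5]])

# axis=0: each column is sorted independently, top to bottom.
cols_sorted = np.sort(M, axis=0)
# [[1, 0, 2],
#  [2, 7, 5],
#  [3, 9, 8]]
```

Because each column is sorted on its own, the rows of the result no longer correspond to rows of the original matrix.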

numpy.clip

The numpy.clip() function limits the values in an array to a specified range. It takes an array and a minimum and maximum value as arguments. Any elements in the array that are less than the specified minimum value are replaced by the minimum value, and any elements greater than the specified maximum value are replaced by the maximum value.
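
A minimal sketch of clipping to the range [-2, 2], using a made-up array `x`:

```python
import numpy as np

# Hypothetical values used only for illustration.
x = np.array([-3.0, -0.5, 0.0, 1.5, 4.2])

# Values below -2 become -2; values above 2 become 2.
clipped = np.clip(x, -2, 2)    # [-2.0, -0.5, 0.0, 1.5, 2.0]

# The method form gives the same result.
clipped_method = x.clip(-2, 2)
```

Unlike the earlier numpy.where example, which replaced every extreme value with 2, clipping keeps the sign: -3.0 becomes -2.0.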

numpy.stack

A list of NumPy arrays of the same shape can be stacked along a new axis using the numpy.stack function. The function takes a list of arrays to stack and (optionally) the position of the new axis; the default is axis=0.

Let us first consider stacking two 1D arrays or vectors vertically along axis=0, to form a (fat-short) 2D array.
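
A minimal sketch of vertical stacking, assuming two made-up vectors `u` and `v` of length 3:

```python
import numpy as np

# Hypothetical 1D arrays used only for illustration.
u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

# Each input vector becomes a row of the result.
stacked_rows = np.stack([u, v], axis=0)
# [[1, 2, 3],
#  [4, 5, 6]]

print(stacked_rows.shape)  # (2, 3)
```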

Let us now stack them horizontally along axis=1, to form a (skinny-tall) 2D array.
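
Horizontal stacking can be sketched with the same kind of made-up vectors `u` and `v`:

```python
import numpy as np

# Hypothetical 1D arrays used only for illustration.
u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

# Each input vector becomes a column of the result.
stacked_cols = np.stack([u, v], axis=1)
# [[1, 4],
#  [2, 5],
#  [3, 6]]

print(stacked_cols.shape)  # (3, 2)
```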

Next, we stack two matrices along different axes.
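
As a sketch, assume two made-up 3-by-4 matrices `P` and `Q`. Stacking always adds one new axis of length 2 (the number of stacked arrays); the axis argument decides where that new axis goes.

```python
import numpy as np

# Hypothetical 3x4 matrices used only for illustration.
P = np.zeros((3, 4))
Q = np.ones((3, 4))

S0 = np.stack([P, Q], axis=0)  # shape (2, 3, 4): S0[0] is P, S0[1] is Q
S1 = np.stack([P, Q], axis=1)  # shape (3, 2, 4): S1[:, 0, :] is P
S2 = np.stack([P, Q], axis=2)  # shape (3, 4, 2): S2[:, :, 0] is P

print(S0.shape, S1.shape, S2.shape)
```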