5.1 Handling Missing Values


: 20 minutes

In this section, we deal with missing data in a acquired dataset. But, we first import Pandas and Numpy:

Pandas represents a missing entry by the designated floating-point value NaN.

For this exposition, we use synthetic Retail Store Sales data from Kaggle. The data is stored locally as retail_store_sales.csv. We import our data into a Pandas DataFrame sales:

By running the above code, we gather that the dataset has 11 columns. For a basic summary, use .describe() function.

Detecting

The function .isna(), when applied on a Series, returns a boolean Series, where True indicates a corresponding missing value in the Series. In case of a DataFrame, the function .isna() simply returns corresponding boolean DataFrame, applying the function on each column. Run the code below to see for yourself.

In order to know how many missing values each column contains, we additionally apply .sum().

We see that the columns Item, Price Per Unit, Quantity, Total Spent, and Discount Applied have, respectively, 1213, 609, 604, 604, and 4199 missing values.

In addition, the .info() function provides a convenient way to summarize the non-null entires in each column.

Filtering Out

Filtering out the missing values at once, Pandas offers the .dropna() function, which returns a new (Series or) DataFrame.

We see by running the code above that after dropping all missing values, only 7579 rows (about 60\%) have survived.

Filling In

In some cases, you may want to fill in the missing entries, rather than completely wiping out data rows containing at least on missing entry. For example, filtering out NaNs from the sales data trims off a little less than half the entire data! It may be wise to fill in the missing values with meaningful data.

The .fillna(fill_value) function fills the missing entires with a constant fill_value.

The .fillna() function also let’s us use a dictionary containing a fill value for each column.

Instead of providing fill values, the missing values can also be interpolated by making use of the method argument in .fillna() similar to reindexing.