5.2 Data Transformation

: 30 minutes

We still use the same sales DataFrame from last section for most of our examples here.

Duplicates

The data may often contain the same row (with exactly same column values) multiple times. Below is an example DataFrame df with duplicate rows at index 1 and 5.

Use the .duplicated() function to know which rows have already appeared before.

If you have detected repeated rows, you may want to delete them using .drop_duplicates(). Again, the output is a new DataFrame.

The default behavior of the .drop_duplicates() function is to find duplicates using all the columns. However, you want to scope the search within a subset of columns, use the subset as an argument.

Lastly, the function keeps only first occurrence by default. To keep the last occurrence, use the attribute keep='last' as shown below.

Transformation through Mapping

Sometimes, your applications demands transforming a column entirely to prepare the data for analysis. For example, our sales data (see previous chapter) has a string column named Item.

The entries of the column are prepended with Item_* redundantly. We can use the .map() map to transform the Item column by dropping the leading Item_ string by passing a function. Note that the output is a new Series and the action does not alter the original DataFrame.

The .map() also accepts a Python dictionary as an argument. Notice tha that Category column has 8 categories.

If using numeric codes for them is more convenient, we can encode the transformation as a dictionary and supply it to the .map() function.

Renaming Index and Columns

In order to rename the index and the columns, use the .rename() function. For example, the following code downcases the column names of sales.

A subset of the columns can be renames by passing a dictionary argument.