5.2 Data Transformation
Duration: 30 minutes
We continue to use the sales DataFrame from the last section for most of the examples here.
Duplicates
The data may often contain the same row (with exactly the same column values) multiple times. Below is an example DataFrame df with duplicate rows at index 1 and 5.
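The original example data is not reproduced here, so the following is a minimal stand-in. The column names name and price and all of the values are assumptions chosen only so that rows 1 and 5 repeat row 0 exactly.

```python
import pandas as pd

# Hypothetical data (not the original example): rows 1 and 5 repeat row 0.
df = pd.DataFrame({
    "name": ["apple", "apple", "banana", "cherry", "banana", "apple"],
    "price": [1.0, 1.0, 0.5, 3.0, 0.75, 1.0],
})
print(df)
```

Row 4 shares its name with row 2 but has a different price, so it is not a full duplicate.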
Use the .duplicated() method to find out which rows have already appeared earlier. It returns a Boolean Series that is True for every row that repeats a previous one.
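As a sketch of how this might look, the block below builds a small hypothetical df (column names and values are assumptions) in which rows 1 and 5 repeat row 0, and then flags the repeats.

```python
import pandas as pd

# Hypothetical data: rows 1 and 5 repeat row 0 exactly.
df = pd.DataFrame({
    "name": ["apple", "apple", "banana", "cherry", "banana", "apple"],
    "price": [1.0, 1.0, 0.5, 3.0, 0.75, 1.0],
})

# Boolean Series: True where a row repeats an earlier row.
dup_mask = df.duplicated()
print(dup_mask)

# The mask can be used to look at the repeated rows themselves.
print(df[dup_mask])
```

The first occurrence (row 0) is not flagged; only the later copies are.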
If you have detected repeated rows, you may want to delete them using .drop_duplicates(). Again, the output is a new DataFrame.
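A minimal sketch of dropping the repeats, again on a hypothetical df whose columns and values are assumptions:

```python
import pandas as pd

# Hypothetical data: rows 1 and 5 repeat row 0 exactly.
df = pd.DataFrame({
    "name": ["apple", "apple", "banana", "cherry", "banana", "apple"],
    "price": [1.0, 1.0, 0.5, 3.0, 0.75, 1.0],
})

# Returns a new DataFrame; df itself still has all six rows.
deduped = df.drop_duplicates()
print(deduped)
```

The surviving rows keep their original index labels, so the result has gaps in its index.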
By default, .drop_duplicates() compares all the columns when looking for duplicates. If you want to restrict the comparison to a subset of columns, pass them via the subset argument.
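The sketch below restricts the comparison to a single column. The df and its name and price columns are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical data: row 4 matches row 2 on "name" only.
df = pd.DataFrame({
    "name": ["apple", "apple", "banana", "cherry", "banana", "apple"],
    "price": [1.0, 1.0, 0.5, 3.0, 0.75, 1.0],
})

# Rows count as duplicates if they agree on "name", whatever the price.
by_name = df.drop_duplicates(subset=["name"])
print(by_name)
```

Row 4 is now dropped as well, because its name already appeared in row 2.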
Lastly, the function keeps only the first occurrence by default. To keep the last occurrence instead, pass the argument keep='last'.
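A short sketch of keeping the last occurrence, using the same kind of hypothetical df (columns and values are assumptions):

```python
import pandas as pd

# Hypothetical data: rows 0, 1 and 5 are identical.
df = pd.DataFrame({
    "name": ["apple", "apple", "banana", "cherry", "banana", "apple"],
    "price": [1.0, 1.0, 0.5, 3.0, 0.75, 1.0],
})

# Keep the last copy of each repeated row instead of the first.
last_kept = df.drop_duplicates(keep="last")
print(last_kept)
```

Of the three identical rows, only the one at index 5 survives.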
Transformation through Mapping
Sometimes, your application demands transforming an entire column to prepare the data for analysis. For example, our sales data (see previous chapter) has a string column named Item.
The entries of the column are redundantly prefixed with Item_. We can use the .map() method to transform the Item column by passing a function that drops the leading Item_ string. Note that the output is a new Series and the action does not alter the original DataFrame.
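The sales data is not reproduced in this section, so the sketch below uses a tiny stand-in with the same column name. The item values and the helper name strip_item_prefix are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the sales DataFrame; values are made up.
sales = pd.DataFrame({"Item": ["Item_Pen", "Item_Desk", "Item_Lamp"]})

def strip_item_prefix(value):
    """Drop a leading 'Item_' if present; otherwise return value unchanged."""
    prefix = "Item_"
    return value[len(prefix):] if value.startswith(prefix) else value

# .map() applies the function to every entry and returns a new Series.
items = sales["Item"].map(strip_item_prefix)
print(items)

# The original column is untouched.
print(sales["Item"])
```

To make the change permanent, the result could be assigned back, for example sales["Item"] = items.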
The .map() method also accepts a Python dictionary as an argument. Notice that the Category column has 8 categories.
If using numeric codes for them is more convenient, we can encode the transformation as a dictionary and supply it to the .map() function.
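A minimal sketch of the dictionary form follows. The real Category column has 8 categories; the three category names and their codes used here are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical stand-in for sales; the category names are made up.
sales = pd.DataFrame({"Category": ["Toys", "Books", "Garden", "Toys"]})

# Assumed encoding: one integer code per category.
category_codes = {"Toys": 0, "Books": 1, "Garden": 2}

# Each entry is looked up in the dictionary; the result is a new Series.
codes = sales["Category"].map(category_codes)
print(codes)
```

Any value that is missing from the dictionary is mapped to NaN, so the dictionary should cover every category that occurs in the column.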
Renaming Index and Columns
To rename the index and the columns, use the .rename() method. For example, passing columns=str.lower converts the column names of sales to lowercase.
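As a sketch, lowercasing the column names might look like this. The sales stand-in below has only the two columns named in this section, with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for sales; values are made up.
sales = pd.DataFrame({
    "Item": ["Item_Pen", "Item_Desk"],
    "Category": ["Office", "Furniture"],
})

# A function passed as columns= is applied to every column name.
lowered = sales.rename(columns=str.lower)
print(lowered.columns.tolist())
```

Like the other methods in this section, .rename() returns a new DataFrame and leaves sales unchanged. The index can be renamed the same way through the index argument.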
A subset of the columns can be renamed by passing a dictionary that maps old names to new names.
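A short sketch of renaming a single column with a dictionary. The stand-in data and the new name Product are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for sales; values are made up.
sales = pd.DataFrame({
    "Item": ["Item_Pen", "Item_Desk"],
    "Category": ["Office", "Furniture"],
})

# Only the columns listed as keys are renamed; the rest stay as they are.
renamed = sales.rename(columns={"Item": "Product"})
print(renamed.columns.tolist())
```

Keys that do not match any column are silently ignored by default.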