4.2 Essential Functionality


Time: 30 minutes

We now explore the core functionality for manipulating DataFrames. These features are what make Pandas such a powerful tool for data analysis.

Recall our toy dataset for the BMI study from the last section:

Reindexing

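The primary use of the DataFrame.reindex() method is to rearrange the rows and/or columns of an existing DataFrame. The method returns a new DataFrame object and leaves the original unchanged.

Notice that any index label or column name not present in the original DataFrame introduces NaN entries. Pass fill_value=value to reindex() to substitute a value of your choice for these missing entries.

One can also pass method="ffill" or method="bfill" to fill the newly introduced labels by carrying the nearest existing value forward or backward, respectively. This is a fill, not an interpolation, and it requires the existing index to be monotonically increasing or decreasing.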
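As a minimal sketch of reindexing, the example below uses a small hypothetical stand-in for the BMI dataset; the names, columns, and values are made up for illustration and are not the original data.

```python
import pandas as pd

# Hypothetical stand-in for the BMI toy dataset (values are made up).
df = pd.DataFrame(
    {"height": [1.70, 1.82, 1.65], "weight": [68.0, 90.0, 54.0]},
    index=["alice", "bob", "carol"],
)

# Reorder the rows and request a label ("dave") that does not exist yet.
reordered = df.reindex(["carol", "alice", "dave"])
print(reordered)  # the "dave" row contains only NaN

# Same request, but fill the new row with 0 instead of NaN.
filled = df.reindex(["carol", "alice", "dave"], fill_value=0)

# Columns can be reindexed too; "age" is new, so it is NaN in every row.
with_age = df.reindex(columns=["weight", "height", "age"])
```

The original df is untouched in all three cases; each call returns a new object.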
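The forward and backward fill options can be sketched on a short Series. The readings and their integer index below are hypothetical.

```python
import pandas as pd

# Hypothetical readings taken at positions 0, 2, and 4 (monotonic index).
s = pd.Series([10.0, 20.0, 30.0], index=[0, 2, 4])

# Forward fill: each new label takes the last existing value before it.
forward = s.reindex(range(6), method="ffill")

# Backward fill: each new label takes the next existing value after it.
# Label 5 has no later value, so it stays NaN.
backward = s.reindex(range(6), method="bfill")

print(forward)
print(backward)
```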

Drop

The DataFrame.drop method removes one or more rows and/or columns (given as a single label or a list of labels) from a DataFrame and returns a new object.
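A short sketch of dropping rows and columns, again on a hypothetical stand-in for the BMI data:

```python
import pandas as pd

# Hypothetical stand-in for the BMI toy dataset (values are made up).
df = pd.DataFrame(
    {"height": [1.70, 1.82, 1.65], "weight": [68.0, 90.0, 54.0]},
    index=["alice", "bob", "carol"],
)

# Drop rows by index label (the default axis is the row index).
without_rows = df.drop(["alice", "carol"])

# Drop a column, either with the columns keyword or with axis="columns".
without_weight = df.drop(columns=["weight"])
same_result = df.drop("weight", axis="columns")
```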

Indexing

Indexing and slicing work much as they do for NumPy arrays. However, instead of relying only on integer positions, Pandas provides the label-based indexer loc[row_list], which selects rows from a list of index labels. Unlike positional slices, label slices with loc include the end label.

Similarly, iloc[row_list] selects rows by integer position, whatever the type of the index labels.

Rows and columns can be selected together by passing a second list: column labels in loc[row_list, column_list], or integer column positions in iloc[row_list, column_list].
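The difference between label-based and position-based selection can be sketched as follows. The DataFrame is a hypothetical stand-in for the BMI data.

```python
import pandas as pd

# Hypothetical stand-in for the BMI toy dataset (values are made up).
df = pd.DataFrame(
    {"height": [1.70, 1.82, 1.65], "weight": [68.0, 90.0, 54.0]},
    index=["alice", "bob", "carol"],
)

# Rows by label versus rows by integer position: same result here.
by_label = df.loc[["alice", "carol"]]
by_position = df.iloc[[0, 2]]

# Rows and columns together.
sub_label = df.loc[["alice", "carol"], ["weight"]]
sub_position = df.iloc[[0, 2], [1]]

# Label slices include the end label; positional slices exclude the end.
label_slice = df.loc["alice":"bob"]  # two rows: alice and bob
position_slice = df.iloc[0:1]        # one row: alice
```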

Indexing Options (McKinney 2017)

| Method | Description |
|---|---|
| df[column] | Select single column or sequence of columns from the DataFrame; special case conveniences: Boolean array (filter rows), slice (slice rows), or Boolean DataFrame (set values based on some criterion) |
| df.loc[rows] | Select single row or subset of rows from the DataFrame by label |
| df.loc[:, cols] | Select single column or subset of columns by label |
| df.loc[rows, cols] | Select both row(s) and column(s) by label |
| df.iloc[rows] | Select single row or subset of rows from the DataFrame by integer position |
| df.iloc[:, cols] | Select single column or subset of columns by integer position |
| df.iloc[rows, cols] | Select both row(s) and column(s) by integer position |
| df.at[row, col] | Select a single scalar value by row and column label |
| df.iat[row, col] | Select a single scalar value by row and column position (integers) |
| reindex | Select either rows or columns by labels |
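A few of the remaining options from the table, sketched on the same kind of hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical stand-in for the BMI toy dataset (values are made up).
df = pd.DataFrame(
    {"height": [1.70, 1.82, 1.65], "weight": [68.0, 90.0, 54.0]},
    index=["alice", "bob", "carol"],
)

weights = df["weight"]                # single column, returned as a Series
two_cols = df[["weight", "height"]]   # sequence of columns, a DataFrame
first_two = df[:2]                    # a plain slice selects rows

# Scalar access by label (at) and by integer position (iat).
bob_weight = df.at["bob", "weight"]
same_value = df.iat[1, 1]
```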

Boolean DataFrame

A boolean comparison of a DataFrame with a scalar returns a boolean DataFrame.

The comparison can also be scoped to a set of rows or columns.

As in NumPy, a boolean array can be passed as an index to select the data that satisfy a boolean condition.
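The three uses of boolean comparisons described above can be sketched as follows. The data and the thresholds are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the BMI toy dataset (values are made up).
df = pd.DataFrame(
    {"height": [1.70, 1.82, 1.65], "weight": [68.0, 90.0, 54.0]},
    index=["alice", "bob", "carol"],
)

# Comparing the whole DataFrame with a scalar gives a boolean DataFrame.
mask = df > 70

# Scoping the comparison to one column gives a boolean Series.
is_heavy = df["weight"] > 60

# The boolean Series can then be used as an index to filter rows.
heavy_rows = df[is_heavy]

# The same filter combined with a column selection through loc.
heavy_heights = df.loc[df["weight"] > 60, ["height"]]
```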

Arithmetic Operations

Arithmetic operations on two Pandas Series containing numeric values are performed after aligning their indices. As you might have already guessed, any label that appears in only one of the two Series yields NaN in the result.

For two DataFrames, Pandas aligns both the rows and the columns.
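Index alignment can be sketched with two small objects whose labels only partly overlap. All labels and values here are hypothetical.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Labels "a" and "d" appear in only one Series, so their sums are NaN.
total = s1 + s2

df1 = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["a", "b"])
df2 = pd.DataFrame({"y": [10.0, 20.0], "z": [30.0, 40.0]}, index=["b", "c"])

# Only the cell (row "b", column "y") exists in both frames.
frame_total = df1 + df2
print(frame_total)
```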

The table below lists arithmetic operations commonly used for Series and DataFrames.

Arithmetic Operations (McKinney 2017)

| Method | Description |
|---|---|
| add, radd | Methods for addition (+) |
| sub, rsub | Methods for subtraction (-) |
| div, rdiv | Methods for division (/) |
| floordiv, rfloordiv | Methods for floor division (//) |
| mul, rmul | Methods for multiplication (*) |
| pow, rpow | Methods for exponentiation (**) |
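One reason to prefer the method forms over the operators is that they accept a fill_value argument, and the r-prefixed variants swap the operands. A minimal sketch with hypothetical values:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 4.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Treat a label missing from one side as 0 instead of producing NaN.
filled_total = s1.add(s2, fill_value=0)

# rdiv swaps the operands: s1.rdiv(1) computes 1 / s1.
reciprocal = s1.rdiv(1)
```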
McKinney, Wes. 2017. Python for Data Analysis. 2nd ed. O’Reilly Media. https://www.oreilly.com/library/view/python-for-data/9781491957653/.

Functions and Mappings

Almost all NumPy universal functions are similarly applicable to Pandas Series and DataFrames.
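For instance, NumPy's sqrt and log can be applied directly, and the result keeps the Pandas labels. The data below is a hypothetical stand-in for the BMI dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the BMI toy dataset (values are made up).
df = pd.DataFrame(
    {"height": [1.70, 1.82, 1.65], "weight": [68.0, 90.0, 54.0]},
    index=["alice", "bob", "carol"],
)

# Applied to a Series, the ufunc returns a Series with the same index.
roots = np.sqrt(df["weight"])

# Applied to a DataFrame, it returns a DataFrame with the same labels.
logs = np.log(df)
```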

The Pandas .apply() method applies a user-defined function to each column or each row of a DataFrame. For example, to produce the average of a Series, we can define the following avg function.

Instead of columns, our custom function can be applied to each row by passing axis="columns".

For element-wise custom operations, use the .map() method. Series.map is available in all versions; DataFrame.map requires Pandas 2.1 or later, where it replaces the older applymap.
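One possible definition of avg is sketched below; the original function is not reproduced in this text, so both the body of avg and the data are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the BMI toy dataset (values are made up).
df = pd.DataFrame(
    {"height": [1.70, 1.82, 1.65], "weight": [68.0, 90.0, 54.0]},
    index=["alice", "bob", "carol"],
)


def avg(series):
    """Return the arithmetic mean of a Series (assumed definition)."""
    return series.sum() / len(series)


# By default, apply calls avg once per column and collects the results
# in a Series indexed by the column names.
column_averages = df.apply(avg)
print(column_averages)
```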
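A sketch of row-wise application, assuming a hypothetical table of repeated weight measurements so that a per-row average is meaningful. The avg function is the same assumed definition as before.

```python
import pandas as pd

# Hypothetical weigh-ins in kg over three weeks (values are made up).
weights = pd.DataFrame(
    {"week1": [68.0, 90.0], "week2": [67.0, 91.0], "week3": [66.0, 92.0]},
    index=["alice", "bob"],
)


def avg(series):
    """Return the arithmetic mean of a Series (assumed definition)."""
    return series.sum() / len(series)


# axis="columns" passes each row to avg, so there is one result per row.
row_averages = weights.apply(avg, axis="columns")
print(row_averages)
```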
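An element-wise operation can be sketched with Series.map, which works across Pandas versions. The weights and the threshold of 60 kg are hypothetical.

```python
import pandas as pd

# Hypothetical weights in kg (values are made up).
weights = pd.Series([68.0, 90.0, 54.0], index=["alice", "bob", "carol"])

# The function is called once per element and returns one value each time.
labels = weights.map(lambda kg: "above 60" if kg > 60 else "60 or below")

# map also accepts a dictionary that translates old values into new ones.
codes = pd.Series(["f", "m", "f"], index=weights.index)
spelled_out = codes.map({"f": "female", "m": "male"})
```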

Sorting and Ranking

To sort the index of a Series or DataFrame lexicographically, Pandas offers the sort_index() function.

Use the axis="columns" argument to sort the columns.

To sort by values instead, use the sort_values method. A Series is sorted by its own values; for a DataFrame, name the column to sort by with the by argument:
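Sorting by row labels and by column labels can be sketched on a deliberately unsorted hypothetical frame:

```python
import pandas as pd

# Hypothetical data with unsorted row and column labels (values are made up).
df = pd.DataFrame(
    {"weight": [54.0, 68.0, 90.0], "height": [1.65, 1.70, 1.82]},
    index=["carol", "alice", "bob"],
)

rows_sorted = df.sort_index()                 # alice, bob, carol
cols_sorted = df.sort_index(axis="columns")   # height, weight
rows_reversed = df.sort_index(ascending=False)
```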

To sort by multiple columns, provide a list of column names.
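Sorting by one column and by several columns can be sketched as follows. The age column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical data (values are made up); two people share the same age.
df = pd.DataFrame(
    {"age": [30, 25, 30], "weight": [68.0, 90.0, 54.0]},
    index=["alice", "bob", "carol"],
)

# A Series is sorted by its own values.
weights_sorted = df["weight"].sort_values()

# A DataFrame is sorted by the column named in the by argument.
by_weight = df.sort_values(by="weight")

# Several columns: sort by age first, then break ties by descending weight.
by_age_then_weight = df.sort_values(by=["age", "weight"], ascending=[True, False])
```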

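Similarly, to rank the data in each column, use the .rank method. Ranks start at 1 for the smallest value, and by default tied values share the average of their ranks.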
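Ranking, including the treatment of ties, can be sketched on hypothetical data in which two weights are equal:

```python
import pandas as pd

# Hypothetical data (values are made up); alice and carol tie on weight.
df = pd.DataFrame(
    {"height": [1.70, 1.82, 1.65], "weight": [68.0, 90.0, 68.0]},
    index=["alice", "bob", "carol"],
)

# Default: the tied weights occupy ranks 1 and 2, so each gets 1.5.
ranks = df.rank()

# method="min" gives every member of a tie the lowest rank in the group.
min_ranks = df.rank(method="min")
```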

Duplicate Labels

As mentioned before, index labels need not be unique. To detect duplicate labels, check the is_unique attribute of DataFrame.index.
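A minimal sketch of detecting duplicate labels, using a hypothetical frame in which one label occurs twice:

```python
import pandas as pd

# Hypothetical data (values are made up); the label "alice" is repeated.
df = pd.DataFrame(
    {"weight": [68.0, 90.0, 54.0]},
    index=["alice", "bob", "alice"],
)

unique = df.index.is_unique          # False, because "alice" repeats

# duplicated() flags every occurrence of a label after the first one.
repeats = df.index.duplicated()

# Selecting a duplicated label returns all matching rows as a DataFrame.
alice_rows = df.loc["alice"]
```

Because selecting a duplicated label returns several rows rather than one, code that expects a single row per label may want to check is_unique first.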