4.2 Essential Functionality
: 30 minutes
We now explore the core functionality for manipulating a DataFrame. These features are what make Pandas such a powerful tool for data analysis.
Recall from the last section our toy dataset for the BMI study:
Reindexing
The primary use of the DataFrame.reindex() function is to rearrange the rows and/or columns of an existing DataFrame. The function returns a new DataFrame object.
Notice that a new index or column label introduces NaN entries. Pandas offers the fill_value=value argument to substitute a value for the missing data.
One can also pass method="ffill" or method="bfill" to forward-fill or backward-fill the missing entries, respectively. Both options require the existing index to be sorted (monotonic).
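The reindexing options above can be sketched as follows. The DataFrame here is a hypothetical stand-in for the BMI dataset (the names and values are made up for illustration), and the Series at the end is likewise invented to show the fill methods on a sorted integer index.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the BMI dataset (names and values are made up).
df = pd.DataFrame(
    {"height": [1.70, 1.82, 1.65], "weight": [68.0, 90.0, 55.0]},
    index=["alice", "bob", "carol"],
)

# Rearrange the rows and request a label that does not exist yet ("dave").
# The new row is filled with NaN; df itself is left unchanged.
reordered = df.reindex(["carol", "alice", "dave"])

# Substitute a value for the missing entries instead of NaN.
filled = df.reindex(["carol", "alice", "dave"], fill_value=0.0)

# Reorder the columns and add a new column label; "age" is NaN in every row.
with_age = df.reindex(columns=["weight", "height", "age"])

# Forward/backward filling needs a sorted (monotonic) index.
s = pd.Series([10.0, 20.0, 30.0], index=[0, 2, 4])
forward = s.reindex(range(6), method="ffill")   # carry the previous value forward
backward = s.reindex(range(6), method="bfill")  # pull the next value backward
```

Note that the backward fill leaves the last position as NaN, because no later value exists to fill it from.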
Drop
The DataFrame.drop method drops a row or column label (or a list of labels) from a DataFrame and returns a new object.
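A minimal sketch of dropping rows and columns, again using a made-up stand-in for the BMI dataset:

```python
import pandas as pd

# Hypothetical stand-in for the BMI dataset (names and values are made up).
df = pd.DataFrame(
    {"height": [1.70, 1.82, 1.65], "weight": [68.0, 90.0, 55.0]},
    index=["alice", "bob", "carol"],
)

no_bob = df.drop("bob")                    # drop a single row by label
only_bob = df.drop(["alice", "carol"])     # drop a list of rows
no_weight = df.drop(columns=["weight"])    # drop a column by label

# Each call returns a new object; the original df still has all rows and columns.
```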
Indexing
Indexing and slicing work much like NumPy array indexing. In addition to integer positions, Pandas provides the label-based indexer loc[row_list] to select rows from a list of index labels.
Similarly, one may use iloc[row_list] for integer-position row selection, even when the index labels are not integers.
Both rows and columns can be selected by passing a second list, of column labels or of integer positions, as in loc[row_list, column_list] and iloc[row_list, column_list], respectively.
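The difference between label-based and position-based selection can be illustrated with a small sketch. The data is hypothetical; the "age" column is added only to have a third column to select from.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"height": [1.70, 1.82, 1.65],
     "weight": [68.0, 90.0, 55.0],
     "age": [30, 45, 27]},
    index=["alice", "bob", "carol"],
)

by_label = df.loc[["alice", "carol"]]   # rows selected by index label
by_position = df.iloc[[0, 2]]           # the same rows, selected by position

# Rows and columns together.
sub_label = df.loc[["alice", "carol"], ["height", "weight"]]
sub_position = df.iloc[[0, 2], [0, 1]]

# Label slices include both endpoints; positional slices exclude the stop.
label_slice = df.loc["alice":"bob"]     # two rows
position_slice = df.iloc[0:1]           # one row

# Single scalar access by label and by position.
bob_weight = df.at["bob", "weight"]
same_value = df.iat[1, 1]
```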
Method | Description |
---|---|
df[column] | Select single column or sequence of columns from the DataFrame; special case conveniences: Boolean array (filter rows), slice (slice rows), or Boolean DataFrame (set values based on some criterion) |
df.loc[rows] | Select single row or subset of rows from the DataFrame by label |
df.loc[:, cols] | Select single column or subset of columns by label |
df.loc[rows, cols] | Select both row(s) and column(s) by label |
df.iloc[rows] | Select single row or subset of rows from the DataFrame by integer position |
df.iloc[:, cols] | Select single column or subset of columns by integer position |
df.iloc[rows, cols] | Select both row(s) and column(s) by integer position |
df.at[row, col] | Select a single scalar value by row and column label |
df.iat[row, col] | Select a single scalar value by row and column position (integers) |
reindex | Select either rows or columns by labels |
Boolean DataFrame
A boolean comparison of a DataFrame with a scalar returns a boolean DataFrame.
The comparison can also be scoped to a set of rows or columns.
Just like in NumPy, a boolean array can be passed as an index to select data based on boolean conditions.
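As a sketch of the three ideas above (a comparison of the whole DataFrame, a comparison scoped to one column, and boolean selection), consider the following hypothetical data. The thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"height": [1.70, 1.82, 1.65], "weight": [68.0, 90.0, 55.0]},
    index=["alice", "bob", "carol"],
)

# Comparing the whole DataFrame with a scalar gives a boolean DataFrame
# of the same shape.
mask = df > 60

# Scoping the comparison to one column gives a boolean Series.
tall = df["height"] > 1.68

# A boolean Series used as an index keeps the rows where it is True.
selected = df[tall]

# Conditions are combined with & and |, with each condition in parentheses.
tall_and_heavy = df.loc[(df["height"] > 1.68) & (df["weight"] > 80)]
```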
Arithmetic Operations
Arithmetic operations on two Pandas Series containing numeric values are performed after aligning their indices. As you might have guessed, labels that appear in only one of the two Series produce NaN in the result.
In case of two DataFrames, Pandas aligns both rows and columns.
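The alignment behavior can be seen in a short sketch. The labels and values below are made up; the point is only that the result covers the union of the labels and is NaN wherever one operand has no entry.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Indices are aligned first: "a" and "d" exist on one side only, so they are NaN.
total = s1 + s2

df1 = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["a", "b"])
df2 = pd.DataFrame({"y": [10.0, 20.0], "z": [30.0, 40.0]}, index=["b", "c"])

# For DataFrames, both the row labels and the column labels are aligned.
# Only the cell (row "b", column "y") exists in both operands.
df_total = df1 + df2
```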
The table below lists arithmetic operations commonly used for Series and DataFrames.
Method | Description |
---|---|
add, radd | Methods for addition (+) |
sub, rsub | Methods for subtraction (-) |
div, rdiv | Methods for division (/) |
floordiv, rfloordiv | Methods for floor division (//) |
mul, rmul | Methods for multiplication (*) |
pow, rpow | Methods for exponentiation (**) |
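One reason to use these methods rather than the operators is that they accept extra arguments such as fill_value, which treats a label missing on one side as the given value instead of producing NaN. The r-prefixed variants swap the operands. A small sketch with made-up values:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Like s1 + s2, but a label missing on one side counts as 0 instead of NaN.
total = s1.add(s2, fill_value=0)

# rsub reverses the operands: this computes 10 - s1, not s1 - 10.
reversed_diff = s1.rsub(10)
```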
Functions and Mappings
Almost all NumPy universal functions are similarly applicable to Pandas Series and DataFrames.
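For instance, np.sqrt can be applied directly to a DataFrame. The result is again a DataFrame with the same row and column labels. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 4.0], "b": [9.0, 16.0]}, index=["r1", "r2"])

# The ufunc is applied element-wise and the labels are preserved.
roots = np.sqrt(df)
```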
The Pandas .apply() function applies a user-defined operation to each row or column. For example, to compute the average of each column, we can define a custom avg function and pass it to .apply().
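A minimal sketch of such an avg helper follows. The function body and the data are assumptions made for illustration; in practice the built-in mean() method does the same job.

```python
import pandas as pd

def avg(series):
    """Return the arithmetic mean of a Series (assumes no missing values)."""
    return series.sum() / len(series)

# Made-up data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# By default, apply passes each column to the function as a Series.
column_means = df.apply(avg)
```

The result is a Series indexed by the column labels, with one average per column.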
Instead of each column, the custom function can be applied to each row by passing axis="columns".
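Row-wise application might look like the sketch below, which uses the same kind of hypothetical avg helper and made-up data.

```python
import pandas as pd

def avg(series):
    """Return the arithmetic mean of a Series (assumes no missing values)."""
    return series.sum() / len(series)

# Made-up data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])

# axis="columns" passes each row to the function as a Series.
row_means = df.apply(avg, axis="columns")
```

Here the result is indexed by the row labels rather than the column labels.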
For element-wise custom operations, use the .map() function.
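The sketch below shows element-wise mapping on a Series, which is available in all Pandas versions. The 80 kg threshold and the labels are arbitrary choices for illustration. For a whole DataFrame, recent Pandas releases provide DataFrame.map, while older releases name the same operation applymap, so it is worth checking the installed version.

```python
import pandas as pd

# Made-up weights for illustration.
weights = pd.Series([68.0, 90.0, 55.0], index=["alice", "bob", "carol"])

# The function is called once per element.
labels = weights.map(lambda kg: "heavy" if kg > 80 else "light")

# A dict can also be used as the mapping; unmatched values become NaN.
codes = labels.map({"heavy": 1, "light": 0})
```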
Sorting and Ranking
To sort the index of a Series or DataFrame lexicographically, Pandas offers the sort_index() function.
Use the axis="columns" argument to sort the columns.
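A short sketch of sorting by row labels and by column labels, using a made-up DataFrame whose labels are deliberately out of order:

```python
import pandas as pd

df = pd.DataFrame(
    {"c": [1, 2, 3], "a": [4, 5, 6], "b": [7, 8, 9]},
    index=["z", "x", "y"],
)

by_rows = df.sort_index()                      # sort by row labels
by_columns = df.sort_index(axis="columns")     # sort by column labels
descending = df.sort_index(ascending=False)    # reverse order of row labels
```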
To sort by values, use the sort_values method. It works on a Series directly; for a DataFrame, pass the name of the column to sort by.
To sort by multiple columns, provide a list of column names.
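Sorting by values can be sketched as follows. The column names "group" and "score" are hypothetical. With a list of columns, rows are ordered by the first column and ties are broken by the next one.

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=["a", "b", "c"])
s_sorted = s.sort_values()          # ascending by value

# Made-up data for illustration.
df = pd.DataFrame({"group": ["b", "a", "b", "a"], "score": [2, 4, 1, 3]})

by_score = df.sort_values("score")                     # one column
by_group_then_score = df.sort_values(["group", "score"])  # several columns
```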
Similarly, to rank the data in each column, use .rank.
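Ranking assigns each value its position in sorted order, starting from 1. By default, tied values share the average of their ranks; method="first" instead breaks ties by order of appearance. A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series([7, 3, 7, 1])

average_rank = s.rank()                 # the two 7s share rank (3 + 4) / 2
first_rank = s.rank(method="first")     # ties broken by order of appearance

# On a DataFrame, each column is ranked independently.
df = pd.DataFrame({"a": [2, 1], "b": [1, 2]})
column_ranks = df.rank()
```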
Duplicate Labels
As mentioned before, index labels need not be unique. To detect duplicate labels, check the is_unique attribute of DataFrame.index.
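A minimal sketch with a made-up DataFrame whose index repeats the label "a". Besides is_unique, the duplicated() method of the index marks which entries repeat an earlier label.

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "a", "b"])

unique = df.index.is_unique             # False, since "a" appears twice
repeats = df.index.duplicated()         # marks the second "a" only

# Selecting a duplicated label returns every matching row.
rows_a = df.loc["a"]

# One way to keep only the first occurrence of each label.
deduplicated = df[~df.index.duplicated()]
```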