4.1 Essential Data Structures


Estimated time: 40 minutes

We first import Pandas into the workspace in the following way:
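A minimal sketch of the import, assuming the conventional alias pd that the rest of this section relies on:

```python
import pandas as pd  # pd is the conventional alias for Pandas

# The printed version string depends on the local installation.
print(pd.__version__)
```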

Since Pandas and NumPy go hand-in-hand, it’s wise to import NumPy as well for later use.
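The NumPy import might look as follows, assuming the conventional alias np. The last line is only an illustration of how closely the two libraries are linked.

```python
import numpy as np  # np is the conventional alias for NumPy
import pandas as pd

# The values of a Series can be exported as a NumPy array.
values = pd.Series([1, 2, 3]).to_numpy()
print(type(values))
```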

Most of the heavy lifting in Pandas is done by two major data structures: Series and DataFrame.

If a dataframe is thought of as a spreadsheet, then each column can be thought of as a series.

Series

The Series data structure provides the building block for DataFrame. A Series object is like a column in a spreadsheet: a one-dimensional vector containing objects of the same (NumPy) dtype. In that sense, Series provides a container for a homogeneous data column.

From list

The simplest way to form a Series is by passing a Python list.
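As a small sketch, the list below holds illustrative values (they are not taken from any dataset), and the name col is chosen to match the discussion that follows.

```python
import pandas as pd

# Build a Series from a plain Python list of illustrative values.
col = pd.Series([1.5, 2.0, 3.25, 4.0])

# Printing shows the index (left), the values (right), and the dtype.
print(col)
```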

Printing a Series object like col above also prints the numpy.dtype of the elements it contains. The most visible difference between a NumPy 1D array and a Pandas Series is the presence of an index in the latter; see the mysterious first column in the output of the above code.

Index is a Pandas data structure to store and manipulate indices for Series and DataFrame.

For a Series containing N objects, if no explicit index is supplied, Pandas creates an index from 0 to N-1. One can also set the index explicitly by passing it through the index argument. Note that the indices don’t have to be integers or unique.
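The two cases can be sketched as follows. The values and labels are purely illustrative; note that the label "a" is deliberately repeated.

```python
import pandas as pd

# No index supplied: Pandas creates the labels 0, 1, 2.
default = pd.Series([10, 20, 30])

# Explicit index: labels can be strings and need not be unique.
labelled = pd.Series([10, 20, 30], index=["a", "b", "a"])

print(default.index)
print(labelled)
```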

From dict

Another convenient way to create a Series (along with the index) is from a Python dictionary. Let us consider the following Python dictionary of the six most populous countries [1].

[1] Source: Wikipedia

countries = {
    "India" : 1_417_492_000,
    "China" : 1_408_280_000, 
    "USA" : 340_110_988, 
    "Indonesia" : 284_438_782,
    "Pakistan" : 241_499_431,
    "Nigeria" : 223_800_000
}

When initializing a Series from a dictionary, the keys become the index labels and the values become the corresponding data.
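A short sketch of this, using an abridged copy of the countries dictionary so that the block runs on its own (the variable name population is our own choice):

```python
import pandas as pd

# Abridged copy of the countries dictionary defined above.
countries = {
    "India": 1_417_492_000,
    "China": 1_408_280_000,
    "USA": 340_110_988,
}

# Keys become index labels; values become the data.
population = pd.Series(countries)
print(population)
```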

Note that the inferred NumPy dtype is int64 to accommodate long integers.

Now, let us imagine that the data dictionary is very large and we would like to use only a selected subset of keys. This can be achieved by passing the selected keys to the index argument. The same trick is often used to rearrange the data.
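One possible illustration, again with an abridged copy of countries. The selected labels are our own choice, and Ireland is deliberately included even though it is not a key of the dictionary.

```python
import pandas as pd

# Abridged copy of the countries dictionary defined above.
countries = {
    "India": 1_417_492_000,
    "China": 1_408_280_000,
    "USA": 340_110_988,
}

# Only the listed labels are kept, in the order given.
selected = pd.Series(countries, index=["Ireland", "USA", "India"])
print(selected)
```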

In the above example, the passed index forces the Series to create an entry for Ireland. Since there was no such key in the dictionary countries, the value NaN (not a number) is entered in its place. An observant reader would notice that the dtype has now changed to float64 due to the presence of NaN. NaN indicates a missing value; to detect such values, use the functions pd.isna() and pd.notna().
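A brief sketch of how missing values could be detected and filtered out, using a two-entry Series built for this purpose:

```python
import pandas as pd

countries = {"India": 1_417_492_000, "USA": 340_110_988}
selected = pd.Series(countries, index=["Ireland", "USA"])

# Element-wise masks: True where the value is (or is not) missing.
missing = pd.isna(selected)
present = pd.notna(selected)

# A boolean mask can be used to keep only the non-missing entries.
print(selected[present])
```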

Both a Series and its index have a name attribute, which can be updated as follows:
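For instance (the names "population" and "country" are illustrative choices, not prescribed by Pandas):

```python
import pandas as pd

population = pd.Series({"India": 1_417_492_000, "China": 1_408_280_000})

# Name the Series itself and, separately, its index.
population.name = "population"
population.index.name = "country"
print(population)
```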

Use the index attribute to update the existing index.
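A minimal sketch, in which the short labels are hypothetical replacements; the new index must have the same length as the Series.

```python
import pandas as pd

population = pd.Series(
    {"India": 1_417_492_000, "China": 1_408_280_000, "USA": 340_110_988}
)

# Replace the whole index in place; one new label per existing entry.
population.index = ["IN", "CN", "US"]
print(population)
```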

Accessing Data

To access a value from a Series, you can use the corresponding index in the following way.
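With an abridged population Series, label-based access could look like this:

```python
import pandas as pd

population = pd.Series(
    {"India": 1_417_492_000, "China": 1_408_280_000, "USA": 340_110_988}
)

# Square brackets with an index label return the matching value.
india = population["India"]
print(india)
```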

Caution: Index may not be unique!

Note that the index may not contain unique values. In that case, selection may get you more than one element.
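The following sketch, with made-up labels, shows how a repeated label yields a Series rather than a single value:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "a"])

# "a" occurs twice, so the selection is itself a Series of two entries.
repeated = s["a"]

# "b" occurs once, so the selection is a single value.
single = s["b"]

print(repeated)
print(single)
```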

A list of values from the index can be passed to retrieve the corresponding objects from a Series; the result is again a Series.
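As an illustration with an abridged population Series (note the double brackets: the outer pair performs the selection, the inner pair is the Python list):

```python
import pandas as pd

population = pd.Series(
    {"India": 1_417_492_000, "China": 1_408_280_000, "USA": 340_110_988}
)

# Entries are returned in the order of the requested labels.
subset = population[["USA", "India"]]
print(subset)
```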

You can also access values by their integer positions using the .iloc[] indexer.
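A short sketch of position-based access, which follows the usual Python conventions for negative positions and slices:

```python
import pandas as pd

population = pd.Series(
    {"India": 1_417_492_000, "China": 1_408_280_000, "USA": 340_110_988}
)

first = population.iloc[0]        # first entry
last = population.iloc[-1]        # last entry
first_two = population.iloc[0:2]  # slice; the end position is excluded

print(first, last)
print(first_two)
```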

NumPy-like Operations

Almost all NumPy operations (e.g. boolean indexing, arithmetic operations) can equivalently be applied to Series.
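A few such operations are sketched below on an abridged population Series; the threshold and the choice of functions are illustrative only.

```python
import numpy as np
import pandas as pd

population = pd.Series(
    {"India": 1_417_492_000, "China": 1_408_280_000, "USA": 340_110_988}
)

# Boolean indexing: keep entries above one billion.
large = population[population > 1_000_000_000]

# Arithmetic is applied element-wise.
in_millions = population / 1_000_000

# NumPy functions work too, and the index is preserved.
log_pop = np.log10(population)

print(large)
print(in_millions)
```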

DataFrame

The DataFrame data structure represents a table, containing columns (Series) of possibly different data types.

From tabular data

The easiest way to initialize DataFrame is by supplying the column names and associated data.
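A minimal sketch, assuming the data is given as a list of rows; the values and column names are illustrative.

```python
import pandas as pd

# Each inner list is one row of the table.
rows = [
    [22, 155],
    [24, 165],
    [31, 162],
]

# The columns argument supplies one name per column.
df = pd.DataFrame(rows, columns=["age", "height"])
print(df)
```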

From dict

DataFrames can also be created from a dictionary containing column names as keys and the corresponding data (for example lists or Series) as values. Let us consider the following toy data in a Python dictionary.

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196]
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]

For our toy BMI study, we can load the dictionary data into DataFrame, with an optional argument to set the index.
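This could be written as follows; the toy data is repeated so that the block runs on its own, and the name df matches the discussion below.

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]

# Dictionary keys become columns; the index argument labels the rows.
df = pd.DataFrame(BMI_data, index=BMI_respondents)
print(df)
```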

For a large dataset, the functions DataFrame.head() and DataFrame.tail() display only the first and last five rows, respectively. An optional argument changes the number of rows shown.
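To illustrate, the sketch below uses a synthetic 100-row table (not part of the BMI study):

```python
import pandas as pd

# A synthetic table with 100 rows, used only to show head() and tail().
big = pd.DataFrame({"x": range(100)})

first_rows = big.head()   # first five rows by default
last_rows = big.tail(3)   # last three rows

print(first_rows)
print(last_rows)
```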

The columns can be rearranged and restricted to a subset using the columns argument during initialization.

Like Series, supplying a column name that does not exist as a key in the data dictionary creates a column with NaN or missing values.
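Both behaviours can be seen in one sketch. The column name bmi is a hypothetical addition that is not a key of BMI_data, while height is deliberately left out.

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]

# Reorder (weight before age), drop height, and request a missing column.
df = pd.DataFrame(
    BMI_data,
    index=BMI_respondents,
    columns=["weight", "age", "bmi"],
)
print(df)
```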

Accessing data

When retrieving a column, a DataFrame can be thought of as a dictionary whose keys are the column names.
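For example, with the toy BMI data repeated so that the block is self-contained:

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
df = pd.DataFrame(BMI_data, index=BMI_respondents)

# Dictionary-style lookup by column name.
ages = df["age"]
print(ages)
```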

Note that the returned object is a Series, inheriting the index from df.

As with NumPy arrays, columns can be modified through direct assignment.
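A small sketch of such an assignment; incrementing every age by one is an arbitrary modification chosen for illustration.

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
df = pd.DataFrame(BMI_data, index=BMI_respondents)

# Overwrite an existing column with element-wise updated values.
df["age"] = df["age"] + 1
print(df)
```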

Rows of DataFrame can be accessed primarily by .loc[] and .iloc[] for label-based and integer-based indices, respectively.

Note: Adding a Column

If df refers to the DataFrame from the above code, what is the expected outcome of the code below?

col = pd.Series(['M', 'F', 'M'], index=['Student 1', 'Student 3', 'Student 4'])
df['sex'] = col

Answer: A new column sex is added to df without error, and the sex of Student 2 is set to NaN.
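This outcome can be checked with a self-contained sketch that rebuilds df from the toy BMI data. The assignment aligns the new Series with df on index labels rather than on position.

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
df = pd.DataFrame(BMI_data, index=BMI_respondents)

col = pd.Series(['M', 'F', 'M'], index=['Student 1', 'Student 3', 'Student 4'])
df['sex'] = col  # values are matched to rows by label

print(df)
```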

.loc[]

When the argument is a single label, the returned object is a Series having the columns of df as its indices.
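A sketch of single-label access on the toy BMI data:

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
df = pd.DataFrame(BMI_data, index=BMI_respondents)

# One row label in, one Series out; its index is the column names of df.
row = df.loc["Student 1"]
print(row)
```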

In case of batch access, however, the returned object is a DataFrame.
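Passing a list of labels, for instance, might look like this:

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
df = pd.DataFrame(BMI_data, index=BMI_respondents)

# A list of row labels returns a DataFrame with those rows.
rows = df.loc[["Student 1", "Student 3"]]
print(rows)
```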

.iloc[]

Similarly, the integer-based access of rows is shown below.
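A sketch on the toy BMI data, covering both a single position and a slice of positions:

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
df = pd.DataFrame(BMI_data, index=BMI_respondents)

# A single position returns a Series; a slice returns a DataFrame.
first_row = df.iloc[0]
middle_rows = df.iloc[1:3]  # positions 1 and 2; the end is excluded

print(first_row)
print(middle_rows)
```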

Just like the index of a Series, the columns and index of a DataFrame each have a name attribute, which can be updated as follows:
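A minimal sketch, in which the names "respondent" and "measurement" are illustrative choices:

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
df = pd.DataFrame(BMI_data, index=BMI_respondents)

# Name the row index and the column index separately.
df.index.name = "respondent"
df.columns.name = "measurement"
print(df)
```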