4.1 Essential Data Structures
Duration: 40 minutes
We first import Pandas into the workspace, conventionally under the alias pd.
Since Pandas and NumPy go hand-in-hand, it’s wise to import NumPy as well for later use.
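A minimal sketch of the two imports. The aliases pd and np are a widely used community convention rather than a requirement.

```python
# Conventional aliases: pd for Pandas, np for NumPy.
import numpy as np
import pandas as pd

print(pd.__version__)
print(np.__version__)
```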
Most of the heavy-lifting in Pandas is done by two major data structures: Series and DataFrame.
If a dataframe is thought of as a spreadsheet, then each column can be thought of as a series.
Series
The Series data structure provides the building block for DataFrame. A Series object is like a column in a spreadsheet: a one-dimensional vector containing objects of the same (NumPy) dtype. In that sense, Series provides a container for a homogeneous data column.
From list
The simplest way to create a Series is by passing a Python list.
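As a small illustration, the following sketch builds a Series from a list. The values are made up for this example, and the variable name col is chosen to match the name used in the text.

```python
import pandas as pd

# Illustrative values only; any list of same-typed items works.
col = pd.Series([0.5, 1.2, 3.4, 2.1])

# Printing shows the index (left), the values (right), and the dtype.
print(col)
```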
Printing a Series object like col above also prints the numpy.dtype of the elements it contains. The most visible difference between a NumPy 1D array and a Pandas Series is the presence of an index in the latter; see the mysterious first column in the output of the above code.
Index is a Pandas data structure to store and manipulate indices for Series and DataFrame. For a Series containing N objects, if no explicit index is supplied, Pandas creates an index from 0 to N-1. One can also set the index explicitly using the index argument. Note that the indices don’t have to be integers or unique.
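The sketch below contrasts the default index with an explicit one. The values and labels are invented for illustration, and the label "b" is repeated on purpose to show that duplicates are allowed.

```python
import pandas as pd

values = [10, 20, 30, 40]

# No index supplied: Pandas creates 0, 1, ..., N-1.
default_idx = pd.Series(values)

# Explicit, non-integer, non-unique labels (illustrative).
labelled = pd.Series(values, index=["a", "b", "b", "d"])

print(default_idx.index)
print(labelled.index)
print(labelled.index.is_unique)
```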
From dict
Another convenient way to create a Series (along with the index) is from a Python dictionary. Let us consider the following Python dictionary of the six most populous countries (source: Wikipedia).
countries = {
    "India": 1_417_492_000,
    "China": 1_408_280_000,
    "USA": 340_110_988,
    "Indonesia": 284_438_782,
    "Pakistan": 241_499_431,
    "Nigeria": 223_800_000,
}
When initializing a Series from a dictionary, the keys become the index labels and the values become the corresponding data.
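A minimal sketch of this construction follows. The dictionary is repeated so the snippet runs on its own, and the variable name population is an assumption introduced here.

```python
import pandas as pd

countries = {
    "India": 1_417_492_000,
    "China": 1_408_280_000,
    "USA": 340_110_988,
    "Indonesia": 284_438_782,
    "Pakistan": 241_499_431,
    "Nigeria": 223_800_000,
}

# Keys become index labels; values become the data.
population = pd.Series(countries)
print(population)
```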
Note that the inferred NumPy dtype is int64 to accommodate long integers.
Now, let us imagine that the data dictionary is very large and we would like to use only a selected subset of keys. This can be achieved by passing the selected keys to the index argument. The same trick is often used to rearrange the data.
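The following sketch selects and reorders a few keys through the index argument. The variable name selected is hypothetical, and "Ireland" is included deliberately even though it is not a key of the dictionary.

```python
import pandas as pd

countries = {
    "India": 1_417_492_000,
    "China": 1_408_280_000,
    "USA": 340_110_988,
    "Indonesia": 284_438_782,
    "Pakistan": 241_499_431,
    "Nigeria": 223_800_000,
}

# Only the listed keys are kept, in the listed order.
# "Ireland" is not a key of `countries`.
selected = pd.Series(countries, index=["Nigeria", "India", "Ireland"])
print(selected)
```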
In the above example, the passed index forces the Series to create an entry for Ireland. Since there was no such key in the dictionary countries, the value NaN (not a number) is entered in its place. An observant reader would notice that the dtype has now changed to float64 due to the presence of NaN. NaN indicates a missing value; to detect missing values, use the functions pd.isna() and pd.notna().
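As a short sketch of missing-value detection, the snippet below builds a two-entry Series with one NaN (a cut-down stand-in for the example in the text) and applies both functions.

```python
import numpy as np
import pandas as pd

# Cut-down illustration: one real value, one missing value.
s = pd.Series({"India": 1_417_492_000, "Ireland": np.nan})

# Boolean mask marking missing entries.
print(pd.isna(s))

# Keep only the entries that are present.
present = s[pd.notna(s)]
print(present)
```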
Both a Series and its index have a name attribute, which can be updated as follows:
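A minimal sketch, using a shortened version of the population data; the names "population" and "country" are illustrative choices.

```python
import pandas as pd

population = pd.Series({"India": 1_417_492_000, "China": 1_408_280_000})

# Name the Series itself and, separately, its index.
population.name = "population"
population.index.name = "country"

print(population)
```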
Use the index attribute to update the existing index.
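For instance, the sketch below replaces country names with two-letter codes. The new labels are an illustrative choice, and the replacement must have the same length as the Series.

```python
import pandas as pd

population = pd.Series({"India": 1_417_492_000, "China": 1_408_280_000})

# Assign a new list of labels; its length must match the Series.
population.index = ["IN", "CN"]
print(population)
```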
Accessing Data
To access a value from a Series, you can use the corresponding index in the following way.
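A small sketch of label-based access on a shortened population Series; both forms shown return the same value.

```python
import pandas as pd

population = pd.Series(
    {"India": 1_417_492_000, "China": 1_408_280_000, "USA": 340_110_988}
)

# Square-bracket access by label, and the explicit label-based indexer.
print(population["China"])
print(population.loc["China"])
```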
Note that the index may not contain unique values. In that case, selection may get you more than one element.
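The following sketch uses an invented Series with a repeated label to show that the selection then returns a Series rather than a single value.

```python
import pandas as pd

# The label "a" appears twice (illustrative data).
s = pd.Series([1, 2, 3], index=["a", "b", "a"])

# Selecting a repeated label returns every matching element.
dupes = s["a"]
print(dupes)
```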
A list of values from the index can be passed to retrieve a list of objects from a Series.
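As a sketch, passing a list of labels returns a new Series in the order the labels were given; the variable name subset is hypothetical.

```python
import pandas as pd

population = pd.Series(
    {"India": 1_417_492_000, "China": 1_408_280_000, "USA": 340_110_988}
)

# Note the double brackets: the inner list holds the labels.
subset = population[["USA", "India"]]
print(subset)
```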
You can also access values by their integer positions using the .iloc[] indexer.
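A brief sketch of position-based access, again on a shortened population Series; positions start at 0 and slices work as they do for Python lists.

```python
import pandas as pd

population = pd.Series(
    {"India": 1_417_492_000, "China": 1_408_280_000, "USA": 340_110_988}
)

first = population.iloc[0]       # value at position 0
last_two = population.iloc[-2:]  # Series with the last two entries

print(first)
print(last_two)
```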
NumPy-like Operations
Almost all NumPy operations (e.g. boolean indexing, arithmetic operations) can equivalently be applied to Series.
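The sketch below illustrates three such operations on a shortened population Series. The 300 million threshold and the variable names are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

population = pd.Series(
    {"India": 1_417_492_000, "USA": 340_110_988, "Nigeria": 223_800_000}
)

# Boolean indexing: keep entries above an arbitrary threshold.
large = population[population > 300_000_000]

# Element-wise arithmetic: convert to millions.
in_millions = population / 1e6

# NumPy universal functions apply element-wise and keep the index.
log_pop = np.log10(population)

print(large)
print(in_millions)
print(log_pop)
```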
DataFrame
The DataFrame data structure represents a table containing columns (Series) of possibly different data types.
From tabular data
The easiest way to initialize a DataFrame is by supplying the column names and associated data.
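A minimal sketch of this form, assuming the data arrive as a list of rows; the numbers and column names are illustrative.

```python
import pandas as pd

# Each inner list is one row; `columns` names the fields.
rows = [
    [22, 155, 188],
    [24, 165, 202],
]
df = pd.DataFrame(rows, columns=["age", "height", "weight"])
print(df)
```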
From dict
DataFrames can also be created from a dictionary containing column names as keys and the corresponding data (for example, lists or Series) as values. Let us consider the following toy data in a Python dictionary.
BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
For our toy BMI study, we can load the dictionary data into a DataFrame, with an optional argument to set the index.
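A sketch of this step, with the toy data repeated so the snippet is self-contained; the variable name df matches the name used later in the text.

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]

# Dictionary keys become column names; `index` labels the rows.
df = pd.DataFrame(BMI_data, index=BMI_respondents)
print(df)
```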
For a large dataset, the functions DataFrame.head() and DataFrame.tail() display only the first and last five rows, respectively. An optional argument can be set to change the default.
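To illustrate, the sketch below uses a hypothetical 100-row DataFrame, since the toy BMI data have only four rows.

```python
import pandas as pd

# Hypothetical larger dataset: one column with 100 rows.
big = pd.DataFrame({"x": range(100)})

print(big.head())   # first five rows (default)
print(big.tail(3))  # last three rows
```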
The columns can be rearranged, restricted to a subset, and modified using the columns argument during initialization.
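As a sketch, the following call keeps only two of the three columns of the toy data and reverses their order.

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]

# Only the listed columns are kept, in the listed order.
df = pd.DataFrame(BMI_data, index=BMI_respondents, columns=["weight", "age"])
print(df)
```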
As with a Series, supplying a column name that does not exist as a key in the data dictionary creates a column filled with NaN, that is, missing values.
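In the sketch below, the column name "bmi" is a hypothetical addition that is not a key of the toy dictionary, so the resulting column contains only missing values.

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}

# "bmi" is not a key of BMI_data, so that column is entirely missing.
df = pd.DataFrame(BMI_data, columns=["age", "height", "weight", "bmi"])
print(df)
```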
Accessing data
While retrieving a column, a DataFrame can be thought of as a dictionary whose keys are the column names.
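A short sketch of dictionary-style column access on the toy BMI DataFrame; the variable name ages is hypothetical.

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
df = pd.DataFrame(BMI_data, index=BMI_respondents)

# Retrieve one column by its name, as with a dictionary key.
ages = df["age"]
print(ages)
```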
Note that the returned object is a Series, inheriting the index from df.
As with NumPy arrays, columns can be modified through direct assignment.
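The sketch below overwrites one column and derives a new one. The updated ages and the column name age_months are invented for illustration.

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
df = pd.DataFrame(BMI_data, index=BMI_respondents)

# Overwrite an existing column (hypothetical updated ages).
df["age"] = [23, 25, 32, 28]

# Assigning to a new name creates a new column.
df["age_months"] = df["age"] * 12
print(df)
```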
Rows of a DataFrame can be accessed primarily by .loc[] and .iloc[] for label-based and integer-based indexing, respectively.
.loc[]
When the argument is a single label, the returned object is a Series having the columns of df as its index.
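A minimal sketch of single-label row access on the toy BMI DataFrame; the variable name row is hypothetical.

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
df = pd.DataFrame(BMI_data, index=BMI_respondents)

# One row label in, one Series out; the column names become its index.
row = df.loc["Student 2"]
print(row)
```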
In case of batch access, however, the returned object is a DataFrame.
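For comparison, the sketch below passes a list of labels to .loc[], which yields a DataFrame containing the requested rows; the variable name batch is hypothetical.

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
df = pd.DataFrame(BMI_data, index=BMI_respondents)

# A list of labels returns a DataFrame with those rows, in that order.
batch = df.loc[["Student 1", "Student 4"]]
print(batch)
```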
.iloc[]
Similarly, integer-based access of rows is shown below.
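A sketch of position-based row access on the toy BMI DataFrame. A single position returns a Series, while a slice returns a DataFrame; the variable names are hypothetical.

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
df = pd.DataFrame(BMI_data, index=BMI_respondents)

first_row = df.iloc[0]     # Series for the row at position 0
middle = df.iloc[1:3]      # DataFrame with positions 1 and 2

print(first_row)
print(middle)
```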
Just like a Series and its index, the columns and index of a DataFrame each have a name attribute, which can be updated as follows:
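A minimal sketch on the toy BMI DataFrame; the names "respondent" and "measurement" are illustrative choices.

```python
import pandas as pd

BMI_data = {
    "age": [22, 24, 31, 27],
    "height": [155, 165, 162, 159],
    "weight": [188, 202, 178, 196],
}
BMI_respondents = ["Student 1", "Student 2", "Student 3", "Student 4"]
df = pd.DataFrame(BMI_data, index=BMI_respondents)

# Name the row index and the column index separately.
df.index.name = "respondent"
df.columns.name = "measurement"
print(df)
```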