5.4 Merge and Combine Datasets
: 30 minutes
In this section, we discuss merging and combining datasets in Pandas. Like SQL join, the pd.merge
function merge the rows of two datasets on columns, traditionally called keys
. Once merged, complex queries can be performed on the data.
For the ease of exposition, we use two real datasets: death-rates.csv
and drug-overdose.csv
. Feel free to check the Kaggle pages to learn more about the datasets and their metadata.
We first import the datasets.
Take a quick look at the first few rows:
As always, we run a quick summary on the imported DataFrames.
We first choose a common column to join the datasets. Both datasets have a common column named YEAR
.
Note that the merged dataset contains all columns of both datasets—except for YEAR
.
Note that the years in the result are the intersection, or the common set found in both datasets. This is an instance of inner
join. There are other variants (left
, right
, outer
), which can be mentioned using the how
argument of .merge
.
Find the total number of suicides for male drug consumers younger than 25 years (YEAR_NUM_drug < 25) and who were younger than 25 years at the time they committed suicide (YEAR_NUM_suicide < 25).