5.4 Merge and Combine Datasets


: 30 minutes

In this section, we discuss merging and combining datasets in Pandas. Like SQL join, the pd.merge function merge the rows of two datasets on columns, traditionally called keys. Once merged, complex queries can be performed on the data.

For the ease of exposition, we use two real datasets: death-rates.csv and drug-overdose.csv. Feel free to check the Kaggle pages to learn more about the datasets and their metadata.

We first import the datasets.

Take a quick look at the first few rows:

As always, we run a quick summary on the imported DataFrames.

We first choose a common column to join the datasets. Both datasets have a common column named YEAR.

Note that the merged dataset contains all columns of both datasets—except for YEAR.

Note that the years in the result are the intersection, or the common set found in both datasets. This is an instance of inner join. There are other variants (left, right, outer), which can be mentioned using the how argument of .merge.

Find the total number of suicides for male drug consumers younger than 25 years (YEAR_NUM_drug < 25) and who were younger than 25 years at the time they committed suicide (YEAR_NUM_suicide < 25).

Concatenation