Lab 5 – Introduction to Data Mining

Series

Exercise 1

What is the difference between Pandas and NumPy?

Exercise 2

What is the difference between pandas Series and DataFrame?

Exercise 3

Create a Series from a Python list. Print the Series and interpret the output.
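One possible starting point is sketched below. The list contents are hypothetical placeholders; any Python list would do.

```python
import pandas as pd

# Hypothetical example values; replace with any list you like.
values = [10, 20, 30, 40]

# pd.Series wraps the list and attaches a default integer index (0, 1, 2, ...).
s = pd.Series(values)

# The printout shows the index on the left, the values on the right,
# and the inferred dtype on the last line.
print(s)
```

When interpreting the output, note that the left-hand column is the index, not part of the data.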

Exercise 4

Create a Series containing the numbers (e.g. 6001) of the courses you are taking this semester, with the index indicating the instructors. Print only the data type of your Series.
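A minimal sketch of this pattern follows. The instructor names and all course numbers other than 6001 are made up for illustration.

```python
import pandas as pd

# Course numbers as data, instructors as index labels.
# "Instructor A/B/C" and the numbers 6103 and 6202 are hypothetical.
courses = pd.Series(
    [6001, 6103, 6202],
    index=["Instructor A", "Instructor B", "Instructor C"],
)

# The dtype attribute gives the data type without printing the values.
print(courses.dtype)
```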

Exercise 5

Create a Series containing 25 random numbers from the standard normal distribution.
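One way to draw the numbers is with a NumPy random generator, as sketched here. The seed value and the variable names are arbitrary choices, not part of the exercise.

```python
import numpy as np
import pandas as pd

# A seeded generator makes the draw reproducible; the seed 0 is arbitrary.
rng = np.random.default_rng(seed=0)

# standard_normal(25) returns 25 samples with mean 0 and standard deviation 1.
normal_series = pd.Series(rng.standard_normal(25))

print(normal_series.head())
```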

DataFrame

Exercise 6

Create a DataFrame from a list of lists so that at least one column is integer, one column is categorical, and one column is float.
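The sketch below shows one way this could look. The column names and row values are hypothetical, and it assumes that "categorical" means the pandas category dtype; plain string columns would be another reasonable reading.

```python
import pandas as pd

# Each inner list is one row: integer, label, float. Values are made up.
rows = [
    [1, "red", 0.5],
    [2, "blue", 1.25],
    [3, "red", 2.0],
]

df = pd.DataFrame(rows, columns=["id", "color", "score"])

# Strings are stored as generic objects by default; convert explicitly
# if a true categorical column is wanted.
df["color"] = df["color"].astype("category")

print(df.dtypes)
```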

Exercise 7

Create a DataFrame from a dictionary of lists so that the two columns contain, respectively, normally and uniformly distributed random numbers.
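A possible sketch is given below. The column names, the sample size of 10, and the distribution parameters (standard normal, uniform on [0, 1)) are assumptions, since the exercise does not fix them.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility
n = 10  # hypothetical number of rows

# .tolist() turns each NumPy array into a plain Python list,
# so the dictionary really is a "dictionary of lists".
data = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=n).tolist(),
    "uniform": rng.uniform(low=0.0, high=1.0, size=n).tolist(),
}

random_df = pd.DataFrame(data)
print(random_df.head())
```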

Exercise 8

Create a DataFrame from the Series created in Exercise 5.

Exercise 9

Import the file cities.csv into a DataFrame.

Exercise 10

Save the DataFrame from Exercise 9 as a colon-delimited CSV file without the header.
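The relevant to_csv options can be tried on a small stand-in DataFrame, as sketched here. The data below is invented and is not the contents of cities.csv. Dropping the row index with index=False is also an assumption, since the exercise only asks for the header to be omitted.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from cities.csv.
cities_df = pd.DataFrame(
    {"city": ["Springfield", "Shelbyville"], "population": [30000, 20000]}
)

# sep=":" sets the delimiter, header=False omits the column names.
# With no path given, to_csv returns the CSV text as a string.
csv_text = cities_df.to_csv(sep=":", header=False, index=False)

print(csv_text)
```

Passing a file path as the first argument writes the same text to disk instead of returning it.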

BMI Study

For our BMI study, we will generate fabricated data using numpy.random. As we discussed before, our features are weight, height, and age.

A good model for weight is a uniform distribution supported on [105, 230] lbs. Similarly, age and height can be assumed to be uniformly distributed over [15, 85] and [60, 75], respectively. Recall that taking measurements of 25 respondents every day for a year makes the shape of our data tensor (365, 25, 3).

Exercise 11

Write code to generate data just for a day for our BMI study. Create a DataFrame called bmi_df.
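A minimal sketch using the ranges stated above follows. The seed and the column order (weight, height, age) are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed
n_respondents = 25

# One day of data: 25 respondents, 3 features, each drawn uniformly
# from the interval given in the study description.
bmi_df = pd.DataFrame(
    {
        "weight": rng.uniform(105, 230, size=n_respondents),
        "height": rng.uniform(60, 75, size=n_respondents),
        "age": rng.uniform(15, 85, size=n_respondents),
    }
)

print(bmi_df.shape)
```

Each column is a one-day slice of the (365, 25, 3) tensor described earlier, so the resulting frame has 25 rows and 3 columns.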

Exercise 12

Rename the index of bmi_df to Student.

Exercise 13

Rearrange the columns of bmi_df in this order: height, age, weight.

Exercise 14

Access the age column first.

Exercise 15

Access both age and height columns together.

Exercise 16

Change the indices to Student 1, Student 2, …, Student 25.
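One way to build the labels is with a list comprehension, as in the sketch below. The DataFrame is regenerated here only so the example runs on its own; the seed is arbitrary.

```python
import numpy as np
import pandas as pd

# Rebuild a one-day BMI DataFrame so this example is self-contained.
rng = np.random.default_rng(seed=0)
bmi_df = pd.DataFrame(
    {
        "weight": rng.uniform(105, 230, size=25),
        "height": rng.uniform(60, 75, size=25),
        "age": rng.uniform(15, 85, size=25),
    }
)

# Assigning a list of the right length replaces the index labels in place.
bmi_df.index = [f"Student {i}" for i in range(1, len(bmi_df) + 1)]

print(bmi_df.index[:3].tolist())
```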

Exercise 17

Create a new column called Date filled with the value Jan 1.

Exercise 18

Return the features of Student 20. Describe the output type.

Exercise 19

Return the features of students 15 through 20.
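Label-based selection with .loc is one way to approach both of the preceding lookups. The sketch below assumes the index has already been set to Student 1 through Student 25 and rebuilds the data so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild a labeled one-day BMI DataFrame (arbitrary seed).
rng = np.random.default_rng(seed=0)
bmi_df = pd.DataFrame(
    {
        "weight": rng.uniform(105, 230, size=25),
        "height": rng.uniform(60, 75, size=25),
        "age": rng.uniform(15, 85, size=25),
    },
    index=[f"Student {i}" for i in range(1, 26)],
)

# A single label returns one row as a Series indexed by column name.
student_20 = bmi_df.loc["Student 20"]
print(type(student_20))

# A label slice includes BOTH endpoints, unlike positional slicing.
students_15_to_20 = bmi_df.loc["Student 15":"Student 20"]
print(students_15_to_20.shape)
```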

Exercise 20

Round the age column up to the nearest integer.

Exercise 21

Select the last five rows and the first two columns.
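Position-based selection with .iloc is a natural fit here. The sketch below regenerates the data with an arbitrary seed and assumes the column order height, age, weight from Exercise 13.

```python
import numpy as np
import pandas as pd

# Rebuild a one-day BMI DataFrame (arbitrary seed, assumed column order).
rng = np.random.default_rng(seed=0)
bmi_df = pd.DataFrame(
    {
        "height": rng.uniform(60, 75, size=25),
        "age": rng.uniform(15, 85, size=25),
        "weight": rng.uniform(105, 230, size=25),
    }
)

# Rows -5: are the last five; columns :2 are the first two.
subset = bmi_df.iloc[-5:, :2]

print(subset)
```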

Exercise 22

Drop the Date column.

Exercise 23

Select the students with their height between 68 and 72.
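Boolean masking is one way to filter the rows, as sketched below. The example treats both 68 and 72 as included, which is an assumption about what "between" means in the exercise.

```python
import numpy as np
import pandas as pd

# Rebuild a one-day BMI DataFrame (arbitrary seed).
rng = np.random.default_rng(seed=0)
bmi_df = pd.DataFrame(
    {
        "weight": rng.uniform(105, 230, size=25),
        "height": rng.uniform(60, 75, size=25),
        "age": rng.uniform(15, 85, size=25),
    }
)

# Series.between builds a True/False mask; by default both bounds are inclusive.
mask = bmi_df["height"].between(68, 72)

# Indexing with the mask keeps only the rows where it is True.
selected = bmi_df[mask]

print(selected)
```

An equivalent mask can be written with comparison operators: (bmi_df["height"] >= 68) & (bmi_df["height"] <= 72).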

Exercise 24

Sort bmi_df by age.

Exercise 25

Compute the standard deviation of each column of bmi_df. Describe the output type.
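A short sketch of the column-wise computation follows, again on regenerated data with an arbitrary seed and with only the three numeric columns present.

```python
import numpy as np
import pandas as pd

# Rebuild a one-day BMI DataFrame (arbitrary seed).
rng = np.random.default_rng(seed=0)
bmi_df = pd.DataFrame(
    {
        "weight": rng.uniform(105, 230, size=25),
        "height": rng.uniform(60, 75, size=25),
        "age": rng.uniform(15, 85, size=25),
    }
)

# DataFrame.std() works column by column and returns a Series
# whose index holds the column names.
col_std = bmi_df.std()

print(type(col_std))
print(col_std)
```

Note that pandas computes the sample standard deviation (ddof=1) by default, whereas numpy.std defaults to the population version (ddof=0), so the two can give slightly different numbers.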