Introduction to Data Mining
Welcome
This course introduces the basic concepts of data mining as applied to data science, using Python as the programming language. Python Bootcamp is a prerequisite for the course, so we begin directly with advanced topics of the language. This is followed by a more thorough treatment of linear algebra, NumPy, and pandas. With weekly hands-on activities (called labs) and homework exercises, we delve into the fundamental concepts and the practical know-how needed to perform data mining in real-world situations. We use various data wrangling and visualization techniques to prepare data for typical exploratory data analysis (EDA). We then use the data in a variety of machine-learning (ML) schemes. As an introductory class, we aim to cover a broad spectrum of this emerging field rather than a specialized area. Students will continue to use these concepts and skills in the more advanced courses later in the program.
Instructor(s)
Name: Sushovan Majhi
Email: s.majhi@gwu.edu
Learning Outcomes
As a result of completing this course, students will be able to:
- See the world of data science through the lens of matrices
- Apply data mining concepts and techniques to real-life problems
- Demonstrate knowledge of Python programming and basic object-oriented programming concepts by creating Python code for common tasks
- Write Python code to perform data analysis, including data pre-processing, data wrangling, and model building with various machine learning algorithms
- Produce visual charts and graphs of real-world data using Python
- Analyze data using Python to find information that is relevant and consequential
- Synthesize knowledge gained through collaboration with peers on group projects
- Communicate data analysis results to a general audience through presentations to peers and the instructor
Course Prerequisites
We don’t go through the basics of Python programming. You will find the class more rewarding if you have a good working knowledge of Python or other programming experience yourself. If you feel the need to catch up more on the programming side, consider spending a few hours completing an online course or two on Python programming. In addition to Python Bootcamp, I strongly recommend completing one of the following:
- (Trinket Tutorials) https://hourofpython.com/#tutorials
- (Official Tutorials) https://docs.python.org/3/tutorial/index.html
- (eBook) Python for Data Analysis (Ch 1–3) https://wesmckinney.com/book/
Course Format
The primary format of this course is a lecture followed by a lab or discussion. In general, lectures with labs are the main learning tools. These labs are an important part of the course. It is the student’s responsibility to complete the in-class labs, which will be graded.
Textbooks
The second half of the course will loosely follow these books:
- An Introduction to Statistical Learning (the Python version) by Gareth M. James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. This book is good for coding exercises. Download a free online copy from here: https://www.statlearning.com/
- Python for Data Analysis, 3E by Wes McKinney. The e-book is available here: https://wesmckinney.com/book/ (This is a free e-book with …)
- Learning from Data, Abu-Mostafa et al.
Exams
There is an in-class exam on Oct 3.
Assignments
Assignments will be due weekly (sometimes bi-weekly) and will be individual. More details will be posted on Blackboard.
Participation
Student participation in the course is key to fully grasping the class material. The following factors will be taken into account when calculating the participation grade:
- Being present in class
- Asking questions
- Making contributions to class discussions and when called upon
- Staying on task during lab exercises
Readings
Most class days have a corresponding reading assigned, which is indicated within the class syllabus. It is the student’s responsibility to complete these readings before class on the date they appear on the schedule.
Labs
Students are expected to be present during labs and to complete all labs during class lectures. Labs will be due at the end of class and will be submitted on Blackboard. Labs will be graded and constitute 30% of the overall grade.
Grading
Your final grade will be determined by:
- Assignments (40%)
- Labs (30%)
- Exams (30%)
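As a rough illustration of how these weights combine, here is a minimal sketch. It assumes each component is scored on a 0–100 scale and that the final grade is a plain weighted average; the syllabus does not state either. The function name `final_grade` and the example scores are hypothetical.

```python
# Weights taken from the grading breakdown above.
WEIGHTS = {"assignments": 0.40, "labs": 0.30, "exams": 0.30}


def final_grade(scores):
    """Return the weighted average of component scores (assumed 0-100 each)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {"assignments": 90.0, "labs": 80.0, "exams": 70.0}
print(round(final_grade(example), 2))
```

With these hypothetical scores, the contributions are 36, 24, and 21 points respectively.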
Schedule
Download a printable syllabus here.
Date | Topic | Reading | Assignment Due |
---|---|---|---|
Module: Linear Algebra & NumPy | | | |
Aug 26 | Getting to know each other; object-oriented programming, classes; instance, static, and class methods; inheritance, subclassing; Lab: implementing Python classes | Ch 1–3 | Cheers! |
Sep 2 | Matrices, basic matrix operations; determinant, inverse; introduction to NumPy ndarray; Lab: problem-solving using NumPy | Ch 4 | HW Python |
Sep 9 | Diagonalization; matrix decompositions (QR, SVD); applications; Lab: solve systems of equations, image compression | TBD | HW Matrices |
Sep 16 | PCA, applications; LU, orthogonality, eigenvalues; application | TBD | HW Matrices |