COMP_SCI 326: Introduction to the Data Science Pipeline

Prerequisites

(CS 212 and CS 214) or graduate student or instructor consent

Description

This course covers tools used throughout the data science process: obtaining, cleaning, visualizing, modeling, and interpreting data. Most of the tools introduced are Python-based, although the ideas carry over to similar tools in other programming languages. The goal is not to teach the foundations of the underlying technologies but rather when and how to use them in the data science pipeline. Each student will complete a quarter-long, self-defined course project that exercises the data science tools covered in lecture. By the end of the course, students should be able to work independently on large, real-life datasets and gain insights from them.

This course fulfills the Technical Elective area.
Formerly COMP_SCI 396 (last offered Spring 2022)

COURSE INSTRUCTOR: Huiling Hu or Joshua D'Arcy

COURSE COORDINATOR: Huiling Hu

Related Materials

“Python Data Science Handbook: Essential Tools for Working with Data” by Jake VanderPlas
“Learning Data Mining with Python” by Robert Layton

Grading

Grades are weighted as described below. Letter grades are assigned using a percentage-to-letter-grade mapping.

Homework assignments (35%)
- 5 individual assignments
Midterm exam (25%)
Course Project (40%): students define their own topic. The project includes:
- Proposal
- Milestone
- Presentation
- Final Report

Course Outline

Main Topics Include

Course overview and logistics
Obtaining and managing data
Data cleaning
Exploratory Data Analysis
Statistics
- Correlation, Independence and Association
- Hypothesis Testing
Basic machine learning
- Basic concepts and algorithms
- Assessment and Overfitting
- Feature selection
Text mining
Data Visualization and Storytelling
Ethics in Data Science