DATA_ENG 300: Data Engineering Studio

Prerequisites

DATA_ENG 200 and one unit from each of the following core areas: Statistics Foundations, Intermediate Programming/Algorithmic Skills, and Applied Machine Learning.

Description

Data Science Studio Courses

DATA_ENG courses are only open to students who have been admitted to the Machine Learning and Data Science Minor. To apply for admission to the minor, please see the application and selection information.

Offered: Winter (TTh 12:30-1:50 p.m.) and Spring (TTh 9:30-10:50 a.m.)

Data Engineering Studio teaches how to build a sustainable data science lifecycle. Students will analyze data in multiple contexts (e.g., querying with SQL, building machine learning models), share their findings with peers, and practice iteratively refining the analysis based on feedback. They will become acquainted with the common pitfalls of applying data analytics to real-world datasets. The course covers several modern data engineering tools, such as Docker containers, Spark, Airflow, and MLflow. This course is reserved for students pursuing the McCormick Machine Learning and Data Science Minor. We encourage students to take it at the end of their studies in the minor. It is the second part of a two-part sequence with DATA_ENG 200.

Learning Objectives

(General Overview)

Students will understand the core concepts and scope of data engineering.
Students will understand the stages of the data engineering lifecycle and the common tools and techniques used.
Students will understand and be able to conduct exploratory data analysis to uncover insights from data.
Students will know how to design and manage relational and non-relational databases effectively.
Students will understand and be able to apply the principles of distributed (cloud) computing.
Students will be able to use Spark to build extract-transform-load (ETL) and extract-load-transform (ELT) workflows.
Students will be able to automate data and machine learning pipelines to enhance efficiency and reproducibility.

(If time permits)

Students will know how to design and implement A/B tests to evaluate hypotheses.
Students will understand and be able to apply transfer learning techniques to improve model performance with limited data.

Topics

Introduction to data engineering
Containerization
Exploratory data analysis
Distributed (cloud) computing
ETL and ELT via Spark
Automation of data pipelines
NoSQL databases