Data Science Studio Courses
DATA_ENG courses are only open to students who have been admitted to the Machine Learning and Data Science minor and are on the Data Engineering or Hybrid tracks. To apply for admission to the minor, please see the application and selection information.
DATA_ENG 200 Foundations of Data Science
Offered: Winter (TTh 9:30-10:50 a.m.) and Spring (TTh 12:30-1:50 p.m.)
Foundations of Data Science will cover the fundamentals of data science and the context within which this field operates. This course will introduce the steps of the data science lifecycle and the associated data tools and techniques, through implementation in languages such as Python. This course is reserved for students pursuing the McCormick Machine Learning and Data Science Minor. We encourage students to take this early in their studies for the minor. It is the first part of a two-part sequence with DATA_ENG 300.
Prerequisite: COMP_SCI 150
Learning Objectives
(General overview)
- Students will understand the core concepts and scope of data science.
- Students will understand the stages of the data science lifecycle and the common tools and techniques used.
- Students will be able to formulate and scope innovative, relevant, or scientific questions that can be addressed with data.
- Students will be able to utilize computational thinking for problem-solving in data science.
- Students will be able to present data findings in writing and with visual aids, through homework assignments and a project presentation.
(Related to specific topics)
- Students will be able to conduct exploratory data analysis to uncover insights.
- Students will know and be able to apply principles of data cleaning and manipulation.
- Students will know and be able to apply the principles of algorithmic data collection and joining of multiple data sources.
- Students will know and be able to identify and avoid common pitfalls in data analytics, such as algorithmic bias.
- Students will know and be able to construct reproducible data science pipelines to ensure replicability of analyses.
(If time permits)
- Students will understand and apply best practices for handling and protecting sensitive data.
- Students will be able to implement version control to manage and track changes in data projects.
Topics
- Introduction to data science
- Data exploration and visualization
- Data manipulation, transformation, and standardization
- Algorithmic data retrieval methods
- Statistical modeling and machine learning
- Introduction to cloud computing
- Ethics and algorithmic bias
(If time permits)
- Data security and privacy
- Version control
DATA_ENG 300 Data Engineering Studio
Offered: Winter (TTh 12:30-1:50 p.m.) and Spring (TTh 9:30-10:50 a.m.)
Data Engineering Studio teaches how to build a sustainable data science lifecycle. Students will analyze data in multiple contexts (e.g., SQL, building machine learning models), share the findings with peers, and practice iteratively refining the analysis based on feedback. They will become acquainted with the common pitfalls in applying data analytics to real-world datasets. Several modern data engineering tools, such as Docker containers, Spark, Airflow, and MLflow, will be covered. This course is reserved for students pursuing the McCormick Machine Learning and Data Science Minor. We encourage students to take this course at the end of their studies in the minor. It is the second part of a two-part sequence with DATA_ENG 200.
Prerequisite: DATA_ENG 200 and 1 unit from each of the following core areas: Statistics Foundations, Intermediate Programming/Algorithmic Skills, and Applied Machine Learning.
Learning Objectives
(General overview)
- Students will understand the core concepts and scope of data engineering.
- Students will understand the stages of the data engineering lifecycle and the common tools and techniques used.
- Students will understand and be able to conduct exploratory data analysis to uncover insights from data.
- Students will know and be able to design and manage relational and non-relational databases effectively.
- Students will understand and be able to apply the principles of distributed (cloud) computing.
- Students will be able to use Spark to perform extract-transform-load (ETL) and extract-load-transform (ELT) operations on data.
- Students will be able to automate data and machine learning pipelines to enhance efficiency and reproducibility.
(If time permits)
- Students will know and be able to design and implement A/B tests to evaluate hypotheses.
- Students will understand and be able to apply transfer learning techniques to improve model performance with limited data.
Topics
- Introduction to data engineering
- Containerization
- Exploratory data analysis
- Distributed (cloud) computing
- ETL and ELT via Spark
- Automation of data pipelines
- NoSQL databases
(If time permits)
- A/B testing
- Transfer learning
More information on required materials will be coming soon.