This 16-week (4-month) course covers Python, SQL, Azure, AWS, Apache Airflow, Kafka, Spark, Flink, and Delta Lake, with a capstone project in the final month.
- Learning Days: Monday to Thursday (theory and practice).
- Friday: Job shadowing or peer projects.
- Saturday: Hands-on lab sessions and project-based learning.
Month 1: Foundations of Data Engineering
Week 1: Onboarding and Environment Setup
-
Monday:
- Onboarding, course overview, career pathways, tools introduction.
-
Tuesday:
- Introduction to cloud computing (Azure and AWS).
-
Wednesday:
- Data governance, security, compliance, and access control.
-
Thursday:
- Introduction to SQL for data engineering and PostgreSQL setup.
-
Friday:
- Peer Project: Environment setup challenges.
-
Saturday (Lab):
- Mini Project: Build a basic pipeline with PostgreSQL and Azure Blob Storage.
Week 2: SQL Essentials for Data Engineering
-
Monday:
- Core SQL concepts (SELECT, WHERE, JOIN, GROUP BY).
-
Tuesday:
- Advanced SQL techniques: recursive queries, window functions, and CTEs.
-
Wednesday:
- Query optimization and execution plans.
-
Thursday:
- Data modeling: normalization, denormalization, and star schemas.
-
Friday:
- Job Shadowing: Observe senior engineers writing and optimizing SQL queries.
-
Saturday (Lab):
- Mini Project: Create a star schema and analyze data using SQL.
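To give a feel for the Week 2 mini project, here is a minimal sketch of a star schema with one fact table and one dimension table, queried with the core clauses from Monday's session. It assumes an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without any setup; the table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        category   TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        quantity   INTEGER NOT NULL,
        amount     REAL NOT NULL
    );
""")

# Hypothetical sample data: three products, four sales.
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Laptop", "Electronics"), (2, "Phone", "Electronics"), (3, "Desk", "Furniture")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 1200.0), (2, 2, 2, 1600.0), (3, 3, 1, 300.0), (4, 1, 1, 1200.0)],
)


def revenue_by_category(connection):
    """Join the fact table to its dimension and aggregate revenue per category."""
    rows = connection.execute("""
        SELECT p.category, SUM(s.amount) AS revenue
        FROM fact_sales AS s
        JOIN dim_product AS p ON p.product_id = s.product_id
        WHERE s.quantity > 0
        GROUP BY p.category
        ORDER BY revenue DESC
    """)
    return rows.fetchall()


print(revenue_by_category(conn))
# → [('Electronics', 4000.0), ('Furniture', 300.0)]
```

The fact table holds only measures and foreign keys, while descriptive attributes such as the category live in the dimension. That split is what keeps the analytical query down to a single JOIN and GROUP BY.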
Week 3: Introduction to Data Pipelines
-
Monday:
- Theory: Introduction to ETL/ELT workflows.
-
Tuesday:
- Lab: Create a simple Python-based ETL pipeline for CSV data.
-
Wednesday:
- Theory: Extract, transform, load (ETL) concepts and best practices.
-
Thursday:
- Lab: Build a Python ETL pipeline for batch data processing.
-
Friday:
- Peer Project: Collaborate to design a basic ETL workflow.
-
Saturday (Lab):
- Mini Project: Develop a simple ETL pipeline to process sales data.
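The Week 3 labs could start from something like the sketch below, which splits a small sales pipeline into separate extract, transform, and load functions. It is only an illustration under stated assumptions: the CSV content is invented, the cleaning rule (skip rows whose amount is missing or non-numeric) is one possible choice, and SQLite replaces PostgreSQL as the load target.

```python
import csv
import io
import sqlite3

# Hypothetical sample input; a real run would read a CSV file instead.
RAW_CSV = """order_id,region,amount
1,north,100.50
2,south,
3,north,20.25
4,east,abc
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing or non-numeric amount, normalize types."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        clean.append((int(row["order_id"]), row["region"].strip().upper(), amount))
    return clean


def load(records, connection):
    """Load: write cleaned records to the target table and return its row count."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    connection.commit()
    return connection.execute("SELECT COUNT(*) FROM sales").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = load(transform(extract(RAW_CSV)), conn)
print(loaded)  # → 2
```

Keeping the three stages as separate functions makes each one testable on its own, and it maps naturally onto one task per stage once the pipeline moves to an orchestrator.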
Week 4: Introduction to Apache Airflow
-
Monday:
- Theory: Introduction to Apache Airflow, DAGs, and scheduling.
-
Tuesday:
- Lab: Set up Apache Airflow and create a basic DAG.
-
Wednesday:
- Theory: DAG best practices and scheduling in Airflow.
-
Thursday:
- Lab: Integrate Airflow with PostgreSQL and Azure Blob Storage.
-
Friday:
- Job Shadowing: Observe real-world Airflow pipelines.
-
Saturday (Lab):
- Mini Project: Automate an ETL pipeline with Airflow for batch data processing.
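Before working with Airflow itself, it can help to see the idea behind a DAG in plain Python. The sketch below does not use Airflow: it relies on the standard library's `graphlib` module (Python 3.9+) to order tasks so that each one runs only after its upstream tasks. The task names and the `run` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# Task names are hypothetical and mirror a small ETL job.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_order(deps):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def run(deps, tasks):
    """Call each task function in dependency order and collect the results."""
    results = {}
    for name in run_order(deps):
        results[name] = tasks[name]()
    return results


# Placeholder task bodies; real tasks would do the actual work.
tasks = {name: (lambda n=name: f"{n} done") for name in dependencies}

print(run_order(dependencies))
# → ['extract', 'transform', 'load', 'notify']
```

A scheduler such as Airflow adds timing, retries, and monitoring on top of this ordering, but the dependency graph is the part a DAG definition describes.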
Month 2: Intermediate Tools and Concepts
Week 5: Data Warehousing and Data Lakes
-
Monday:
- Theory: Introduction to data warehousing (OLAP vs. OLTP, partitioning, clustering).
-
Tuesday:
- Lab: Work with Amazon Redshift and Snowflake for data warehousing.
-
Wednesday:
- Theory: Data lakes and Lakehouse architecture.
-
Thursday:
- Lab: Set up Delta Lake for raw and curated data.
-
Friday:
- Peer Project: Implement a data warehouse model and data lake for sales data.
-
Saturday (Lab):
- Mini Project: Design and implement a basic Lakehouse architecture.
Week 6: Data Governance and Security
-
Monday:
- Theory: Data governance frameworks and data security principles.
-
Tuesday:
- Lab: Use AWS Lake Formation for access control and security enforcement.
-
Wednesday:
- Theory: Managing sensitive data and compliance (GDPR, HIPAA).
-
Thursday:
- Lab: Implement security policies in S3 and Azure Blob Storage.
-
Friday:
- Job Shadowing: Observe senior engineers applying governance policies.
-
Saturday (Lab):
- Mini Project: Secure data in the cloud using AWS and Azure.
Week 7: Real-Time Data Processing with Kafka
-
Monday:
- Theory: Introduction to Apache Kafka for real-time data streaming.
-
Tuesday:
- Lab: Set up a Kafka producer and consumer.
-
Wednesday:
- Theory: Kafka topics, partitions, and message brokers.
-
Thursday:
- Lab: Integrate Kafka with PostgreSQL for real-time updates.
-
Friday:
- Peer Project: Build a real-time Kafka pipeline for transactional data.
-
Saturday (Lab):
- Mini Project: Create a pipeline to stream e-commerce data with Kafka.
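Wednesday's material on topics and partitions can be illustrated without a running broker. The sketch below is a toy model, not the Kafka client API: it assumes keyed messages are assigned to a partition by hashing the key, and it uses CRC32 as a stand-in for the murmur2 hash used by Kafka's default partitioner. The event data and function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a message key to a partition index (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(messages, num_partitions):
    """Append (key, value) messages to per-partition logs in arrival order."""
    logs = defaultdict(list)
    for key, value in messages:
        logs[partition_for(key, num_partitions)].append((key, value))
    return logs


# Hypothetical e-commerce events keyed by user.
events = [
    ("user-1", "view"),
    ("user-2", "view"),
    ("user-1", "add_to_cart"),
    ("user-1", "checkout"),
]

logs = produce(events, num_partitions=3)
user1_partition = partition_for("user-1", 3)
user1_events = [value for key, value in logs[user1_partition] if key == "user-1"]

print(user1_events)
# → ['view', 'add_to_cart', 'checkout']
```

Because every message with the same key lands in the same partition, a consumer of that partition sees one user's events in the order they were produced. Ordering across different partitions is not guaranteed.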
Week 8: Batch vs. Stream Processing
-
Monday:
- Theory: Introduction to batch vs. stream processing.
-
Tuesday:
- Lab: Batch processing with PySpark.
-
Wednesday:
- Theory: Combining batch and stream processing workflows.
-
Thursday:
- Lab: Real-time processing with Apache Flink and Spark Streaming.
-
Friday:
- Job Shadowing: Observe a real-time processing pipeline.
-
Saturday (Lab):
- Mini Project: Build a hybrid pipeline combining batch and real-time processing.
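The contrast at the heart of Week 8 can be sketched in plain Python before moving to PySpark or Flink. The example below computes the same 10-second tumbling-window totals two ways: once over a complete dataset (batch) and once event by event (stream). It assumes events arrive in timestamp order with no late data, and the event values and class name are hypothetical.

```python
from collections import defaultdict

# Hypothetical events: (timestamp_seconds, amount)
EVENTS = [(1, 10.0), (4, 5.0), (12, 7.5), (14, 2.5), (21, 1.0)]


def batch_totals(events, window_seconds):
    """Batch: all events are available up front; aggregate them in one pass."""
    totals = defaultdict(float)
    for ts, amount in events:
        totals[ts // window_seconds * window_seconds] += amount
    return dict(totals)


class StreamingWindowSum:
    """Stream: events arrive one at a time; a window is emitted once the next starts.

    Assumes events arrive in timestamp order (no late data).
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.current_start = None
        self.current_total = 0.0

    def process(self, ts, amount):
        """Consume one event; return (window_start, total) if a window just closed."""
        start = ts // self.window_seconds * self.window_seconds
        closed = None
        if self.current_start is not None and start != self.current_start:
            closed = (self.current_start, self.current_total)
            self.current_total = 0.0
        self.current_start = start
        self.current_total += amount
        return closed

    def flush(self):
        """Emit the still-open window, for example at shutdown."""
        if self.current_start is None:
            return None
        return (self.current_start, self.current_total)


stream = StreamingWindowSum(10)
emitted = [r for r in (stream.process(ts, amt) for ts, amt in EVENTS) if r is not None]
emitted.append(stream.flush())
streamed = dict(emitted)

print(batch_totals(EVENTS, 10))  # → {0: 15.0, 10: 10.0, 20: 1.0}
print(streamed == batch_totals(EVENTS, 10))  # → True
```

Both approaches reach the same totals here. The streaming version holds only the current window in memory and can emit results as soon as a window closes, which is the trade-off the hybrid mini project explores.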
Month 3: Advanced Data Engineering
Week 9: Machine Learning Integration in Data Pipelines
-
Monday:
- Theory: Overview of ML workflows in data engineering.
-
Tuesday:
- Lab: Preprocess data for machine learning using Pandas and PySpark.
-
Wednesday:
- Theory: Feature engineering and automated feature extraction.
-
Thursday:
- Lab: Automate feature extraction using Apache Airflow.
-
Friday:
- Peer Project: Build a simple pipeline that integrates ML models.
-
Saturday (Lab):
- Mini Project: Build an ML-powered recommendation system in a pipeline.
Week 10: Spark and PySpark for Big Data
-
Monday:
- Theory: Introduction to Apache Spark for big data processing.
-
Tuesday:
- Lab: Set up Spark and PySpark for data analysis.
-
Wednesday:
- Theory: Spark RDDs, DataFrames, and SQL.
-
Thursday:
- Lab: Analyze large datasets using Spark SQL.
-
Friday:
- Peer Project: Build a PySpark pipeline for large-scale data processing.
-
Saturday (Lab):
- Mini Project: Analyze big data sets with Spark and PySpark.
Week 11: Advanced Apache Airflow Techniques
-
Monday:
- Theory: Advanced Airflow features (XCom, task dependencies).
-
Tuesday:
- Lab: Implement dynamic DAGs and task dependencies in Airflow.
-
Wednesday:
- Theory: Airflow scheduling, monitoring, and error handling.
-
Thursday:
- Lab: Create complex DAGs for multi-step ETL pipelines.
-
Friday:
- Job Shadowing: Observe advanced Airflow pipeline implementations.
-
Saturday (Lab):
- Mini Project: Design an advanced Airflow DAG for complex data workflows.
Week 12: Data Lakes and Delta Lake
-
Monday:
- Theory: Data lakes, Lakehouses, and Delta Lake architecture.
-
Tuesday:
- Lab: Set up Delta Lake on AWS for data storage and management.
-
Wednesday:
- Theory: Managing schema evolution in Delta Lake.
-
Thursday:
- Lab: Implement batch and real-time data loading to Delta Lake.
-
Friday:
- Peer Project: Design a Lakehouse architecture for an e-commerce platform.
-
Saturday (Lab):
- Mini Project: Implement a scalable Delta Lake architecture.
Month 4: Capstone Projects
Week 13: Batch Data Pipeline Development
-
Monday to Thursday:
- Design and Implementation:
- Build an end-to-end batch data pipeline for e-commerce sales analytics.
- Tools: PySpark, SQL, PostgreSQL, Airflow, S3.
-
Friday:
- Peer Review: Present progress and receive feedback.
-
Saturday (Lab):
- Project Milestone: Finalize and present batch pipeline results.
Week 14: Real-Time Data Pipeline Development
-
Monday to Thursday:
- Design and Implementation:
- Build an end-to-end real-time data pipeline for IoT sensor monitoring.
- Tools: Kafka, Spark Streaming, Flink, S3.
-
Friday:
- Peer Review: Present progress and receive feedback.
-
Saturday (Lab):
- Project Milestone: Finalize and present real-time pipeline results.
Week 15: Final Project Integration
-
Monday to Thursday:
- Design and Implementation:
- Integrate both batch and real-time pipelines for a comprehensive end-to-end solution.
- Tools: Kafka, PySpark, Airflow, Delta Lake, PostgreSQL, and S3.
-
Friday:
- Job Shadowing: Observe senior engineers integrating complex pipelines.
-
Saturday (Lab):
- Project Milestone: Showcase integrated solution for review.
Week 16: Capstone Project Presentation
-
Monday to Thursday:
- Final Presentation Preparation:
- Polish, test, and document the final project.
-
Friday:
- Peer Review: Present final projects to peers and receive feedback.
-
Saturday (Lab):
- Capstone Presentation: Showcase completed capstone projects to industry professionals and instructors.