warestack/bda: Big Data Analytics lab repository with weekly tutorials, exercises and other resources.
Welcome to the Big Data Analytics lab repository. This repo contains weekly materials including tutorials, exercises, quizzes, homework, and reference solutions.
Structure
You can use this repository in two ways:
- Clone (recommended for most students)
Use this if you only want to access materials and work locally.
- Fork (recommended for reorganizing work)
Create your own copy of the repository on GitHub. This allows you to:
- track your progress
- commit your solutions
- share your work easily
You can also pull updates from the original repository (upstream) when new materials are released.
Sessions
- Session 1 README
Setup, Python fundamentals, loops/indexing, and CSV basics.
- Session 2 README
csv.DictReader, key-based data access, and practical data cleaning flow.
- Session 3 README
Iterators/generators, streaming vs loading, complexity, and intro RAG concepts.
- Session 4 README
Serial vs multiprocessing basics, process management, and Pool-based parallel image processing.
- Session 5 README
Capstone activity instructions and link to the capstone repository.
- Session 6 README
Mutexes, semaphores, controlled concurrency, and an optional parallel printer simulation project.
- Session 7 README
Introduction to pandas, data cleaning basics, NumPy arrays, and pandas homework practice.
- Session 8 README
Introduction to Apache Spark with PySpark, Google Colab practice, Spark SQL analytics, and optional local Spark setup.
- Session 9 README
Advanced PySpark practice with CSV schemas, derived columns, time features, window rankings, and final summary outputs.
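As a rough taste of the Session 2 and Session 3 topics listed above, the sketch below combines csv.DictReader, key-based access, a small cleaning step, and a generator that streams rows instead of loading them all at once. It is not taken from the session materials: the sample data, the column names (city, temp_c), and the function name stream_clean_rows are all assumptions made up for illustration.

```python
import csv
import io

# Hypothetical sample data; the real session datasets are not shown here.
SAMPLE_CSV = """city,temp_c
Athens,31.5
London,
Oslo,12.0
"""


def stream_clean_rows(lines):
    """Yield cleaned rows one at a time instead of building a full list."""
    reader = csv.DictReader(lines)  # each row is a dict keyed by header name
    for row in reader:
        temp = row["temp_c"].strip()
        if not temp:
            continue  # cleaning step: skip rows with a missing temperature
        yield {"city": row["city"].strip(), "temp_c": float(temp)}


# io.StringIO stands in for an open file; any iterable of lines works.
rows = list(stream_clean_rows(io.StringIO(SAMPLE_CSV)))
print(rows)
```

With a real file, passing the open file object to stream_clean_rows would let you process rows in a for loop without holding the whole dataset in memory, which is the streaming vs loading distinction Session 3 refers to.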
Notes
- Complete all tutorials, exercises, quizzes, and homework each week
- Use one public GitHub repository (e.g. bda-homeworks) for your work
- Keep your repository updated regularly
Communication
- Discussion forum: Microsoft Teams
- Submission: Share your repository link in the Teams forum with the instructor and class