Big Data Parallel Programming, 7.5 credits
Parallelldatorprogrammering för bearbetning av stora datamängder, 7,5 hp
Course code: DT8034
School of Information Technology
Level: Second cycle
Finalized by: Forsknings- och utbildningsnämnden, 2024-09-18. Valid for students admitted for spring semester 2025.
Main field of study with advanced study
Computer Science and Engineering, Second cycle, has second-cycle course/s as entry requirements. (A1F)
Entry requirements
The course Edge Computing and Internet of Things, 7.5 credits, and English 6. Exemption from the requirement in Swedish is granted.
Placement in the Academic System
The course is mandatory for students of the Master’s Programme in Information Technology, 120 credits, and is an elective course in the Programme Computer Science and Engineering, 300 credits. The course is also offered as a freestanding course.
Objectives
Processing huge amounts of data is at the core of data mining, deep learning and real-time autonomous decision making. All these are in turn at the core of modern artificial intelligence applications. Data can reside more or less permanently in the cloud and be accessed via distributed file systems, and/or be streamed in real time from multiple sensors at very high rates.
Both data access and processing rely on well-engineered frameworks in which storage and processing are distributed and processing is done in parallel. The purpose of this course is to introduce you to this infrastructure, including the parallel programming used to implement these frameworks. This should enable you to choose a suitable framework for your applications, identify its pros and cons, suggest improvements and even implement them.
Following successful completion of the course the student should be able to:
Knowledge and understanding
- explain distributed file systems, distributed systems processing concepts, parallel programming concepts
- describe tools and programming languages for storing and processing data in machine learning and data mining applications
Skills and ability
- use specialized frameworks and methods for programming machine learning and data mining applications, such as Hadoop and Spark
- use specialized programming languages for GPUs, such as CUDA and/or OpenCL, to achieve high performance
Judgement and approach
- discuss suitable methods and tools for storing and processing data in machine learning and data mining applications.
Content
The course includes modern techniques, methods and tools for distributed storage of static and streamed massive data, for example distributed and fault-tolerant key-value tables, including replication and coordination mechanisms.
The course also includes modern techniques, methods and tools for distributed processing of static and streamed massive data, including frameworks such as MapReduce and Spark. Finally, the course includes concepts, methods and tools for parallel programming for computing clusters, including GPUs.
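As an illustration only, the MapReduce processing model named above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the API of Hadoop or Spark: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a local thread pool stands in for the workers of a computing cluster.

```python
from collections import defaultdict
from multiprocessing.dummy import Pool  # thread pool; stands in for cluster workers


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every mapper's output."""
    groups = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            groups[key].append(value)
    return groups


def reduce_phase(item):
    """Reduce: sum the counts collected for one word."""
    key, values = item
    return key, sum(values)


def word_count(lines, workers=2):
    """Run map and reduce in parallel over a pool of (hypothetical) workers."""
    with Pool(workers) as pool:
        mapped = pool.map(map_phase, lines)
        reduced = pool.map(reduce_phase, shuffle(mapped).items())
    return dict(reduced)


if __name__ == "__main__":
    print(word_count(["big data big", "data parallel"]))
```

In a real framework the shuffle step moves data between machines and the input is read from a distributed file system; here everything stays in one process so that the three phases are easy to follow.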
Language of Instruction
Teaching Formats
The course includes lectures on distributed file systems, distributed systems processing concepts, parallel programming paradigms, specialized methods, frameworks and tools. The lectures introduce the material in the literature.
Lab exercises on data-intensive computing and parallel programming are used to allow the students to develop skills and abilities in the area.
A project on data-intensive computing allows the students to put their knowledge in the context of a realistic application.
Grading scale
Examination formats
To pass the course, the students need to pass the labs and the project. The project includes a program, a report and a presentation, and is the basis of an individual oral examination.
1901: Laboratory Work, 2.5 credits
Two-grade scale (UG): Fail (U), Pass (G)
1902: Project, 5 credits
Four-grade scale, digits (TH): Fail (U), Pass (3), Pass with credit (4), Pass with distinction (5)
Exceptions from the specified examination format
If there are special reasons, the examiner may make exceptions from the specified examination format and allow a student to be examined in another way. Special reasons can include, for example, study support for students with disabilities.
Course evaluation
Course evaluation is part of the course. This evaluation offers guidance in the future development and planning of the course. Course evaluation is documented and made available to the students.
Course literature and other materials
Literature list 2025-01-20 – Until further notice
The course will use several sources, in many cases only fragments of these. The material will be extended with tutorials for the tools and frameworks to be used in the course as well as with research publications.
Zhang, Y, Cao, T, Li, S, Tian, X, Yuan, L, Jia, H, Vasilakos, A. “Parallel Processing Systems for Big Data: A Survey” http://ieeexplore.ieee.org/document/7547948/
Leskovec, J, Rajaraman, A, Ullman, J. “Mining of Massive Datasets” http://infolab.stanford.edu/~ullman/mmds/book.pdf
Hennessy, J. L., & Patterson, D. A. “Computer architecture: a quantitative approach”. Elsevier. Chapter on The Warehouse-Scale Computer
Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee. Learning Spark: Lightning-Fast Data Analytics, 2nd Edition, O’Reilly, 2020