Big Data Parallel Programming, 7.5 credits

Parallelldatorprogrammering för bearbetning av stora datamängder, 7,5 hp

Course code: DT8034

School of Information Technology

Level: Second cycle

Version

Finalized by: Forsknings- och utbildningsnämnden, 2024-09-18. Valid for students admitted for the spring semester 2026.

Main field of study with advanced study

Computer Science and Engineering, second cycle, has second-cycle course(s) as entry requirements (A1F).

Entry requirements

The course Edge Computing and Internet of Things, 7.5 credits. English 6 or English level 2. Exemption from the Swedish language requirement is granted for applicants with foreign grades.

Placement in the Academic System

The course is mandatory for students of the Master’s Programme in Information Technology, 120 credits, and is an elective course in the programme Computer Science and Engineering, 300 credits. The course is also offered as a freestanding course.

Objectives

Processing huge amounts of data is at the core of data mining, deep learning and real-time autonomous decision making, which in turn are at the core of modern artificial intelligence applications. Data can reside more or less permanently in the cloud and be accessed via distributed file systems, and/or be streamed in real time from multiple sensors at very high rates.

Both data access and processing rely on well-engineered frameworks in which storage and processing are distributed and processing runs in parallel. The purpose of this course is to introduce you to this infrastructure, including the parallel programming used to implement these frameworks. This should enable you to choose a framework for your applications, identify its pros and cons, and suggest or even implement improvements.

Following successful completion of the course the student should be able to:

Knowledge and understanding

explain distributed file systems, distributed systems processing concepts and parallel programming concepts
describe tools and programming languages for storing and processing data in machine learning and data mining applications

Skills and ability

use specialized frameworks and methods for programming machine learning and data mining applications, such as Hadoop and Spark
use specialized programming languages for GPUs, such as CUDA and/or OpenCL, to achieve high performance

Judgement and approach

discuss suitable methods and tools for storing and processing data in machine learning and data mining applications.

Content

The course includes modern techniques, methods and tools for distributed storage of static and streamed massive data, for example distributed, fault-tolerant key-value tables, including replication and coordination mechanisms.
The course also includes modern techniques, methods and tools for distributed processing of static and streamed massive data, including frameworks such as MapReduce and Spark. Finally, the course includes concepts, methods and tools for parallel programming for computing clusters, including GPUs.

Language of Instruction

Teaching is conducted in English.

Teaching Formats

The course includes lectures on distributed file systems, distributed systems processing concepts, parallel programming paradigms, specialized methods, frameworks and tools. The lectures introduce the material in the literature.

Lab exercises on data-intensive computing and parallel programming allow the students to develop skills and abilities in the area.

A project on data-intensive computing allows the students to put their knowledge in the context of a realistic application.

Grading scale

Four-grade scale, digits (TH): Fail (U), Pass (3), Pass with credit (4), Pass with distinction (5)

Examination formats

To pass the course, the students need to pass the labs and the project. The project comprises a program, a report and a presentation, and is the basis of an individual oral examination.

1901: Laboratory Work, 2.5 credits
Two-grade scale (UG): Fail (U), Pass (G)

1902: Project, 5 credits
Four-grade scale, digits (TH): Fail (U), Pass (3), Pass with credit (4), Pass with distinction (5)

Exceptions from the specified examination format

If there are special reasons, the examiner may make exceptions from the specified examination format and allow a student to be examined in another way. Special reasons can include, for example, study support for students with disabilities.

Course evaluation

Course evaluation is part of the course. The evaluation guides the future development and planning of the course. It is documented and made available to the students.

Course literature and other materials

Literature list 2025-01-20 – Until further notice

Finalized by: Forsknings- och utbildningsnämnden, 2024-09-18.

The course will use several sources, in many cases only fragments of these. The material will be extended with tutorials for the tools and frameworks to be used in the course as well as with research publications.

Zhang, Y, Cao, T, Li, S, Tian, X, Yuan, L, Jia, H, Vasilakos, A. “Parallel Processing Systems for Big Data: A Survey” http://ieeexplore.ieee.org/document/7547948/

Leskovec, J, Rajaraman, A, Ullman, J. “Mining of Massive Datasets” http://infolab.stanford.edu/~ullman/mmds/book.pdf

Hennessy, J. L., Patterson, D. A. “Computer Architecture: A Quantitative Approach”. Elsevier. Chapter on The Warehouse-Scale Computer.

Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee. Learning Spark: Lightning-Fast Data Analytics, 2nd Edition, O’Reilly, 2020