Introduction to GPU Programming
CMPUT 382-B1
Winter 2022
Machine learning models require loading, transforming, and processing extremely large datasets to glean critical insights. With up to 1.3 TB of unified memory and all-to-all GPU communications with NVSwitch, HGX A100 powered by A100 80GB GPUs can load and perform calculations on enormous datasets to derive actionable insights quickly.
General Information
o Instructor: Prof. Pierre Boulanger
o Tel: 780-492-3031
o Email: pierreb@ualberta.ca
o URL: www.cs.ualberta.ca/~pierreb
o Office: Virtual
o Office hours: By appointment only
o Lectures: MWF 14:00 - 14:50 (Virtual Class Over Zoom)
o Winter Term 2022 - LAB H01: Tuesday 11:00 - 13:50
o Winter Term 2022 - LAB H02: Thursday 14:00 - 16:50 (T B 104)
TAs
o Hong Zu Li hongzu@ualberta.ca
o Rafsanjany Kushol kushol@ualberta.ca
Course Goals
o Learn how to program heterogeneous parallel computing systems such as GPUs
    o CUDA Language
    o Functionality and maintainability of GPU
    o How to deal with scalability
    o Portability issues
o Other subjects
    o Parallel programming API, tools, and techniques
    o Principles and patterns of parallel algorithms
    o Processor architecture features and constraints
Prerequisites
It is assumed that you already have some familiarity with the C and C++ languages.
Course Content
M1: Introduction (Jan. 5-7)
Topics:
o 1.1 Course Introduction and Overview
o 1.2 Introduction to Heterogeneous Parallel Computing
o 1.3 Portability and Scalability in Heterogeneous Parallel Computing
o No labs this week
Materials and assignments:
o Lecture-1-2-heterogeneous-computing
o Lecture-1-3-portability-scalability
M2: Introduction to CUDA C (Jan. 10-14)
Topics:
o 2.1 CUDA C vs. Thrust vs. CUDA Libraries
o 2.2 Memory Allocation and Data Movement API Functions
o 2.3 Threads and Kernel Functions
o 2.4 Introduction to the CUDA Toolkit
Materials and assignments:
o Brief Introduction to C programming
o Chapter 2 - Data Parallel Computing.pdf
o Lecture-2-1-cuda-thrust-libs
o Lecture-2-2-cuda-data-allocation-API
o Lecture-2-3-cuda-parallelism-threads
o Quiz 1 opens on e-Class Friday at 9h55 and closes Monday at 23h55
o Module 2 Lab: see e-Class
M3: CUDA Parallelism Model (Jan. 17-21)
Topics:
o 3.1 Kernel-Based SPMD Parallel Programming
o 3.2 Multidimensional Kernel Configuration
o 3.3 Color-to-Grayscale Image Processing Example
o 3.4 Image Blur Example
o 3.5 Thread Scheduling
Materials and assignments:
o Chapter 3 - Scalable Parallel Execution.pdf
o Lecture-3-1-kernel-SPMD-parallelism
o Lecture-3-2-kernel-multidimension
o Lecture-3-3-color-to-greyscale-image-processing-example
o Lecture-3-5-thread-scheduling
o Quiz 2 opens on e-Class Friday at 9h55 and closes Monday at 23h55
o Module 3 Lab: see e-Class
M4: Memory Model and Locality (Jan. 24-26)
Topics:
o 4.1 CUDA Memories
o 4.2 Tiled Parallel Algorithms
o 4.3 Tiled Matrix Multiplication
o 4.4 Tiled Matrix Multiplication Kernel
o 4.5 Handling Arbitrary Matrix Sizes in Tiled Algorithms
Materials and assignments:
o Chapter 4 - Memory and Data Locality
o Lecture-4-1 CUDA Memories
o Lecture-4-2-Tiled Parallel Algorithms
o Lecture-4-3-Tiled Matrix Multiplication
o Lecture-4-4-Tiled Matrix Multiplication Kernel
o Lecture-4-5 Handling Arbitrary Matrix Sizes in Tiled Algorithms
o Quiz 3 opens on e-Class Friday at 9h55 and closes Monday at 23h55
o Module 4 Lab: see e-Class
M5: Thread Execution and Efficiency (Jan. 28)
Topics:
o 5.1 Warps and SIMD Hardware
o 5.2 Performance Impact of Control Divergence
Materials and assignments:
o Chapter 5 - Performance Considerations.pdf
o Lecture-5-1-Warps and SIMD Hardware
o Lecture-5-2-Performance Impact of Control Divergence
o Quiz 4 opens on e-Class Friday at 9h55 and closes Monday at 23h55
o Module 5 Lab: see e-Class
M6: Memory Access Performance (Jan. 31)
Topics:
o 6.1 DRAM Bandwidth
o 6.2 Memory Coalescing in CUDA
Materials and assignments:
o Chapter 5 - Performance Considerations.pdf
o Lecture-6-2-Memory Coalescing in CUDA
o Quiz 5 opens on e-Class Friday at 9h55 and closes Monday at 23h55
M7: Parallel Computation Patterns (Histogram) (Feb. 2-4)
Topics:
o 7.1 Histogramming
o 7.2 Introduction to Data Races
o 7.3 Atomic Operations in CUDA
o 7.4 Atomic Operation Performance
o 7.5 Privatization Technique for Improved Throughput
Materials and assignments:
o Chapter 9 - Parallel Patterns-Parallel Histogram Computation
o Lecture-7-2-Introduction to Data Races
o Lecture-7-3-Atomic Operations in CUDA
o Lecture-7-4-Atomic Operation Performance
o Lecture-7-5-Privatization Technique for Improved Throughput
o Quiz 6 opens on e-Class Thursday at 9h55 and closes Monday at 23h55
o Module 7 Lab: see e-Class
M8: Parallel Computation Patterns (Stencil) (Feb. 7-11)
Topics:
o 8.1 Convolution
o 8.2 Tiled Convolution
o 8.3 Tile Boundary Conditions
o 8.4 Analyzing Data Reuse in Tiled Convolution
Materials and assignments:
o Chapter 7 - Parallel Patterns: Convolution
o Lecture-8-2-Tiled Convolution
o Lecture-8-3-Tile Boundary Conditions
o Lecture-8-4-Analyzing Data Reuse in Tiled Convolution
o Quiz 7 opens on e-Class Thursday at 9h55 and closes Monday at 23h55
o Module 8 Lab: see e-Class
M9: Parallel Computation Patterns (Reduction) (Feb. 14-18)
Topics:
o 9.1 Reduction
o 9.2 Reduction Kernel
o 9.3 Better Reduction Kernel
Materials and assignments:
o Lecture-9-2-reduction-kernel
o Lecture-9-3-better-reduction-kernel
o Quiz 8 opens on e-Class Thursday at 9h55 and closes Monday at 23h55
o Module 9 Lab: see e-Class
Reading Week (Feb. 21-25)
o No class
M10: Parallel Computation Patterns (Scan) (Feb. 28-Mar. 2)
Topics:
o 10.1 Prefix Sum
o 10.2 A Work-inefficient Scan Kernel
o 10.3 A Work-Efficient Parallel Scan Kernel
o 10.4 More on Parallel Scan
o 10.5 Scan Applications
Materials and assignments:
o Chapter 9 - Parallel Patterns: Prefix Sum
o Lecture-10-2-A Work-inefficient Scan Kernel
o Lecture-10-3-A Work-Efficient Parallel Scan Kernel
o Lecture-10-4-More on Parallel Scan
o Quiz 9 opens on e-Class Thursday at 9h55 and closes Monday at 23h55
o Module 10 Lab: see e-Class
M11: Floating-Point Considerations (Mar. 4)
Topics:
o 11.0 GPU Internal Architecture
o 11.1 Floating-Point Precision and Accuracy
o 11.2 Numerical Stability
Materials and assignments:
o Lecture-11-0-GPU-struct-basics
M12: GPU as Part of the PC Architecture (Mar. 7-9)
Topics:
o 12.1 GPU as Part of the PC Architecture
M13: Efficient Host-Device Data Transfer (Mar. 14-16)
Topics:
o 13.1 Pinned Host Memory
o 13.2 Task Parallelism in CUDA
o 13.3 Overlapping Data Transfer with Computation
Materials and assignments:
o Lecture-13-1-Pinned Host Memory
o Lecture-13-2-Task Parallelism in CUDA
o Lecture-13-3-Overlapping Data Transfer with Computation
o Quiz 10 opens on e-Class Thursday at 9h55 and closes Monday at 23h55
o Module 13 Lab: see e-Class
M14: Application Case Study: Electrostatic Potential Calculation and CNN (Mar. 21-23)
Topics:
o 14.1 Electrostatic Potential Calculation Part 1
o 14.2 Electrostatic Potential Calculation Part 2
o 14.3 DNN and Convolutional Neural Networks
Materials and assignments:
o Lecture-14-1-VMD-case-study-Part1
o Lecture-14-2-VMD-case-study-Part2
M15: Computational Thinking for Parallel Programming (Mar. 25)
Topics:
o 15.1 Computational Thinking
o 15.2 Multi-GPU Programming
Materials and assignments:
o 2nd-Edition-Chapter13-Computational-Thinking
o Lecture-15-1-Computational-Thinking
M16: Related Programming Models: OpenCL (Mar. 28)
Topics:
o 16.1 OpenCL Data Parallelism Model
o 16.2 OpenCL Device Architecture
o 16.3 OpenCL Host Code
Materials and assignments:
o 2nd-Edition-Chapter14-OpenCL
o Lecture-16-1-OpenCL Data Parallelism Model
o Lecture-16-2-OpenCL Device Architecture
o Lecture-16-3-OpenCL Host Code
o An Introduction to OpenCL using AMD GPUs
o Quiz 11 opens on e-Class Thursday at 9h55 and closes Monday at 23h55
o Module 17 Lab: see e-Class
M17: Related Programming Models: MPI (Mar. 30-Apr. 1)
Topics:
o 17.1 Introduction to Heterogeneous Supercomputing and MPI
Materials and assignments:
o 2nd-Edition-Chapter19-Cluster
o Lecture-17-1-Introduction to Heterogeneous Supercomputing and MPI
M18: Related Programming Models: OpenACC (Apr. 1)
Topics:
o 18.1 Introduction to OpenACC
o 18.2 OpenACC Subtleties
Materials and assignments:
o 2nd-Edition-Chapter15-OpenACC
o Lecture-18-2-openACC-subtleties
o Accelerating HPC Applications on NVIDIA GPUs with OpenACC
o Quiz 12 opens on e-Class Thursday at 9h55 and closes Monday at 23h55
M19: Dynamic Parallelism (Apr. 4)
Topics:
o 19.1 Dynamic Parallelism
M20: Class Review (Apr. 8)
Topics:
o Class Review
Final Exam (April 25, 9h00 to 11h00)
o Open-book final exam, delivered virtually
Quizzes
There will be 12 quizzes, each distributed after a module. You are required to complete only 10 of them, and together they count for 20% of the final mark.
The quizzes will generally be handed out electronically once a module is finished and are due in class at the beginning of the next module.
Final Exam
At the end of the term there will be a 2-hour final exam in the form of a quiz. The exam will be administered using ExamLock.
The exam will cover the material discussed in class and will closely resemble the quizzes. The exam will be open book and will count for 30% of the final mark.
Exam date: April 25, 9h00 to 11h00
Laboratory
There will be 10 labs of varying complexity distributed during the term. The labs will count for 50% of the final score. The lab instructions will be distributed on the website at the end of each module. Students are not expected to finish the entire lab in the allocated lab period; they can also work from home or whenever the lab is free. Marks will be based on the functionality of the program and on speed-up (10% extra).
Course Grade
o Quizzes: 20%
o Labs: 50%
o Final Exam: 30%
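As an illustration of how the three weights above combine, here is a minimal sketch in Python. It assumes each component has already been reduced to a percentage between 0 and 100; the function name `final_mark` and its parameters are hypothetical, and the 10% speed-up bonus mentioned for the labs is not modelled because the syllabus does not say how it is applied.

```python
# Weights copied from the Course Grade section (percent of the final mark).
WEIGHTS = {"quizzes": 20, "labs": 50, "final_exam": 30}


def final_mark(quizzes, labs, final_exam):
    """Combine component scores (each 0-100) into a final mark out of 100.

    Hypothetical helper for illustration only; not an official calculator.
    """
    scores = {"quizzes": quizzes, "labs": labs, "final_exam": final_exam}
    for name, score in scores.items():
        if not 0 <= score <= 100:
            raise ValueError(f"{name} score must be between 0 and 100")
    # Weighted sum: each component contributes weight * score / 100.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    # 80% on quizzes, 90% on labs, 70% on the final exam.
    print(final_mark(80, 90, 70))  # prints 82.0
```

With these sample scores the quizzes contribute 16 points, the labs 45, and the final exam 21, for a total of 82.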
References