Introduction to GPU Programming

CMPUT 382-B1

Winter 2022


Machine learning models require loading, transforming, and processing extremely large datasets to glean critical insights. With up to 1.3 TB of unified memory and all-to-all GPU communications with NVSwitch, HGX A100 powered by A100 80GB GPUs can load and perform calculations on enormous datasets to derive actionable insights quickly.

 

 

General Information

 

o   Instructor: Prof. Pierre Boulanger, Tel: 780-492-3031, Email: pierreb@ualberta.ca

o   URL: www.cs.ualberta.ca/~pierreb, Office: Virtual, Office hours: by appointment only

o   Lectures: MWF 14:00:00 - 14:50:00 (Virtual Class Over Zoom)

o   Lab Sections (can be accessed remotely)

o   Winter Term 2022 - LAB H01

o   Tuesday 11:00:00 - 13:50:00

o   Winter Term 2022 - LAB H02

o   Thursday 14:00:00 - 16:50:00 (T B 104) 

 

o   TAs

o  Hong Zu Li hongzu@ualberta.ca

o  Rafsanjany Kushol kushol@ualberta.ca

 

Course Goals

o   Learn how to program heterogeneous parallel computing systems such as GPUs

 

o   CUDA Language

o   Functionality and maintainability of GPU programs

o   How to deal with scalability

o   Portability issues

o   Other subjects

o   Parallel programming API, tools, and techniques

o   Principles and patterns of parallel algorithms

o   Processor architecture features and constraints

 

Prerequisites

It is assumed that you already have some familiarity with the C and C++ languages.

Course Content

M1: Introduction

 

Jan. 5-7

o  1.1 Course Introduction and Overview

o  1.2 Introduction to Heterogeneous Parallel Computing

o  1.3 Portability and Scalability in Heterogeneous Parallel Computing

o   No Labs this week

 

o   Chapter01-introduction.pdf

o   Lecture-1-1-overview

o   Lecture-1-2-heterogeneous-computing

o   Lecture-1-3-portability-scalability

 

M2: Introduction to CUDA C

 

Jan. 10-14

o   2.1 CUDA C vs. Thrust vs. CUDA Libraries

o   2.2 Memory Allocation and Data Movement API Functions

o   2.3 Threads and Kernel Functions

o   2.4 Introduction to the CUDA Toolkit

 

o   Brief Introduction to C programming

o   Chapter 2 - Data Parallel Computing.pdf

o   Lecture-2-1-cuda-thrust-libs

o   Lecture-2-2-cuda-data-allocation-API

o   Lecture-2-3-cuda-parallelism-threads

o   Lecture-2-4-cuda-toolkit

o   Quiz1 opens on e-Class Friday at 9h55 and closes Monday at 23h55

o   Module 2 Lab: see e-Class

 

M3: CUDA Parallelism Model

 

Jan. 17-21

 

 

o   3.1 Kernel-Based SPMD Parallel Programming

o   3.2 Multidimensional Kernel Configuration

o   3.3 Color-to-Grayscale Image Processing Example

o   3.4 Image Blur Example

o   3.5 Thread Scheduling

o   Chapter 3 - Scalable Parallel Execution.pdf

o   Lecture-3-1-kernel-SPMD-parallelism

o   Lecture-3-2-kernel-multidimension

o   Lecture-3-3-color-to-greyscale-image-processing-example

o   Lecture-3-4-blur-kernel

o   Lecture-3-5-thread-scheduling

o   Quiz2 opens on e-Class Friday at 9h55 and closes Monday at 23h55

o   Module 3 Lab: see e-Class

M4: Memory Model and Locality

 

Jan. 24-26

o   4.1 CUDA Memories

o   4.2 Tiled Parallel Algorithms

o   4.3 Tiled Matrix Multiplication

o   4.4 Tiled Matrix Multiplication Kernel

o   4.5 Handling Arbitrary Matrix Sizes in Tiled Algorithms

o   Chapter 4 - Memory and Data Locality

o   Lecture-4-1 CUDA Memories

o   Lecture-4-2-Tiled Parallel Algorithms

o   Lecture-4-3-Tiled Matrix Multiplication

o   Lecture-4-4-Tiled Matrix Multiplication Kernel

o   Lecture-4-5 Handling Arbitrary Matrix Sizes in Tiled Algorithms

o   Quiz3 opens on e-Class Friday at 9h55 and closes Monday at 23h55

o   Module 4 Lab: see e-Class

M5: Thread Execution and Efficiency

 

Jan. 28

o   5.1 Warps and SIMD Hardware

o   5.2 Performance Impact of Control Divergence

o   Chapter 5 - Performance Considerations.pdf

o   Lecture-5-1-Warps and SIMD Hardware

o   Lecture-5-2-Performance Impact of Control Divergence

o   Quiz4 opens on e-Class Friday at 9h55 and closes Monday at 23h55

o   Module 5 Lab: see e-Class

o   Excellent paper on Thread Execution in CUDA

M6: Memory Access Performance

Jan. 31

o   6.1 DRAM Bandwidth

o   6.2 Memory Coalescing in CUDA

o   Chapter 5 - Performance Considerations.pdf

o   Lecture-6-1-DRAM Bandwidth

o   Lecture-6-2-Memory Coalescing in CUDA

o   Quiz5 opens on e-Class Friday at 9h55 and closes Monday at 23h55

M7: Parallel Computation Patterns (Histogram)

 Feb. 2 - 4

o   7.1 Histogramming

o   7.2 Introduction to Data Races

o   7.3 Atomic Operations in CUDA

o   7.4 Atomic Operation Performance

o   7.5 Privatization Technique for Improved Throughput

o   Chapter 9 - Parallel Patterns: Parallel Histogram Computation

o   Lecture-7-1-Histogramming

o   Lecture-7-2-Introduction to Data Races

o   Lecture-7-3-Atomic Operations in CUDA

o   Lecture-7-4-Atomic Operation Performance

o   Lecture-7-5-Privatization Technique for Improved Throughput

o   Quiz6 opens on e-Class Thursday at 9h55 and closes Monday at 23h55

o   Module 7 Lab: see e-Class

M8: Parallel Computation Patterns (Stencil)

Feb. 7-11

o   8.1 Convolution

o   8.2 Tiled Convolution

o   8.3 Tile Boundary Conditions

o   8.4 Analyzing Data Reuse in Tiled Convolution

 

 

o   Chapter 7 - Parallel Patterns: Convolution

o   Lecture-8-1-Convolution

o   Lecture-8-2-Tiled Convolution

o   Lecture-8-3-Tile Boundary Conditions

o   Lecture-8-4-Analyzing Data Reuse in Tiled Convolution

o   Convolution Optimization

o   Quiz7 opens on e-Class Thursday at 9h55 and closes Monday at 23h55

o   Module 8 Lab: see e-Class

M9: Parallel Computation Patterns (Reduction)

Feb. 14-18

o   9.1 Reduction

o   9.2 Reduction Kernel

o   9.3 Better Reduction Kernel

o   Lecture-9-1-reduction

o   Lecture-9-2-reduction-kernel

o   Lecture-9-3-better-reduction-kernel

o   Reduction Optimization

o   Quiz8 opens on e-Class Thursday at 9h55 and closes Monday at 23h55

o   Module 9 Lab: see e-Class

Reading Week

Feb. 21-25

o   No class

 

M10: Parallel Computation Patterns (Scan)

Feb. 28-Mar. 2

 

o   10.1 Prefix Sum

o   10.2 A Work-inefficient Scan Kernel

o   10.3 A Work-Efficient Parallel Scan Kernel

o   10.4 More on Parallel Scan

o   10.5 Scan applications

o   Chapter 9 - Parallel Patterns: Prefix Sum

o   Lecture-10-1-Prefix Sum

o   Lecture-10-2-A Work-inefficient Scan Kernel

o   Lecture-10-3-A Work-Efficient Parallel Scan Kernel

o   Lecture-10-4-More on Parallel Scan

o   Scan Paper  

o   Quiz9 opens on e-Class Thursday at 9h55 and closes Monday at 23h55

o   Module 10 Lab: see e-Class

M11: Floating-Point Considerations

Mar. 4

o   11.0 GPU Internal Architecture

o   11.1 Floating-Point Precision and Accuracy

o   11.2 Numerical Stability

o   Lecture-11-0-GPU-struct-basics

o   Lecture-11-1-Floating-Point Precision and Accuracy

o   Lecture-11-2-Numerical Stability

M12: GPU as Part of the PC Architecture

Mar. 7-9

o   12.1 GPU as Part of the PC Architecture

o   Lecture-12-1-GPU as Part of the PC Architecture

M13: Efficient Host-Device Data Transfer

Mar. 14-16

o   13.1 Pinned Host Memory

o   13.2 Task Parallelism in CUDA

o   13.3 Overlapping Data Transfer with Computation

o   Lecture-13-1-Pinned Host Memory

o   Lecture-13-2-Task Parallelism in CUDA

o   Lecture-13-3-Overlapping Data Transfer with Computation

o   Quiz10 opens on e-Class Thursday at 9h55 and closes Monday at 23h55

o   Module 13 Lab: see e-Class

M14: Application Case Study: Electrostatic Potential Calculation and CNN

Mar. 21-23

o   14.1 Electrostatic Potential Calculation Part 1

o   14.2 Electrostatic Potential Calculation Part 2

o   14.3 DNN and Convolutional Neural Networks

o   Lecture-14-1-VMD-case-study-Part1

o   Lecture-14-2-VMD-case-study-Part2

o   Lecture-14-3-Convolutional-Neural-Networks

o   NVIDIA Lecture on cuDNN

M15: Computational Thinking for Parallel Programming

Mar. 25

o   15.1 Computational Thinking

o   15.2 Multi-GPU-Programming

o   2nd-Edition-Chapter13-Computational-Thinking

o   Lecture-15-1-Computational-Thinking

o   Lecture-15-2-Multi-GPU-Programming

o   Programming Methods for Summit Multi-GPU Nodes

M16: Related Programming Models: OpenCL

Mar. 28

o   16.1 OpenCL Data Parallelism Model

o   16.2 OpenCL Device Architecture

o   16.3 OpenCL Host Code

o   2nd-Edition-Chapter14-OpenCL

o   Lecture-16-1-OpenCL Data Parallelism Model

o   Lecture-16-2-OpenCL Device Architecture

o   Lecture-16-3-OpenCL Host Code

o   An Introduction to OpenCL using AMD GPUs

o   Quiz11 opens on e-Class Thursday at 9h55 and closes Monday at 23h55

o   Module 17 Lab: see e-Class

M17: Related Programming Models: MPI

Mar. 30-Apr. 1

o   17.1 Introduction to Heterogeneous Supercomputing and MPI

o   2nd-Edition-Chapter19-Cluster

o   Lecture-17-1-Introduction to Heterogeneous Supercomputing and MPI

o   Lecture-17-2-MPI-CUDA-Part2

o   Lecture-17-3-MPI-CUDA-Part3

o   GPUDirect, CUDA Aware MPI, and CUDA IPC

M18: Related Programming Models: OpenACC

Apr. 1

o   18.1 Introduction to OpenACC

o  18.2 OpenACC Subtleties

o   2nd-Edition-Chapter15-OpenACC

o   Lecture-18-1-openACC-intro

o   Lecture-18-2-openACC-subtleties

o   Accelerating HPC Applications on NVIDIA GPUs with OpenACC

o   Quiz12 opens on e-Class Thursday at 9h55 and closes Monday at 23h55

M19: Dynamic Parallelism

Apr. 4

o  19.1 Dynamic Parallelism

 

 

o   3rd-Edition-Chapter13-cuda-dynamic-parallelism

o   Lecture-19-Dynamic-parallelism

M20: Class Review

Apr. 8

o   Class Review

 

 

o   Class Review Notes

 

Final Exam

April 25, 9h00 to 11h00

o   Open-book final exam, delivered virtually

 

 

Quizzes

 

There will be 12 quizzes, each released at the end of a module. You are required to do only 10 of them, and together they count for 20% of the final mark.

The quizzes will generally be handed out electronically once a module is completed and are due at the beginning of the next module.

 

Final Exam

At the end of the term there will be a 2-hour final exam in the form of a quiz, administered using ExamLock. The exam is open book, covers the material discussed in class, and will closely resemble the quizzes. It counts for 30% of the final mark.

Exam date: April 25, 9h00 to 11h00

Laboratory

There will be 10 labs of varying complexity distributed during the term. The labs count for 50% of the final mark. The lab instructions will be posted on the website at the end of each module. Students are not expected to finish an entire lab in the allocated lab period; they can also work from home or whenever the lab is free. Marks will be based on the functionality of the program, with up to 10% extra for speed-up.

Course Grade

Quizzes: 20%

Labs: 50%

Final Exam: 30%
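
As a rough illustration of how the weights above combine, the following Python sketch computes a final mark from percentage scores. The function name `final_mark` and its parameters are hypothetical. It assumes that the best 10 quiz scores are the ones counted, that all labs carry equal weight, and that a missing quiz or lab scores zero; none of these details is stated in this outline, and the speed-up bonus for labs is not modelled.

```python
def final_mark(quiz_scores, lab_scores, exam_score,
               quizzes_counted=10, labs_total=10):
    """Return the weighted final mark in percent.

    All scores are percentages in the range 0-100.
    """
    # Assumption: only the best `quizzes_counted` quizzes count,
    # and a quiz that was not written scores 0.
    best = sorted(quiz_scores, reverse=True)[:quizzes_counted]
    quiz_avg = sum(best) / quizzes_counted

    # Assumption: every lab has equal weight; a missing lab scores 0.
    # The 10% speed-up bonus is deliberately left out.
    lab_avg = sum(lab_scores) / labs_total

    # Weights from the Course Grade section: 20% / 50% / 30%.
    return (20 * quiz_avg + 50 * lab_avg + 30 * exam_score) / 100


if __name__ == "__main__":
    # Ten quizzes at 80%, two skipped; ten labs at 90%; exam at 70%.
    print(final_mark([80] * 10 + [0, 0], [90] * 10, 70))  # prints 82.0
```

Under these assumptions, skipping two quizzes does not lower the quiz component, since only ten scores enter the average.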

 

References