Introduction to GPU Programming
CMPUT 396 LEC B3
Winter 2016
General Information
o Instructor: Pierre Boulanger Tel: 780-492-3031 Email: pierreb@cs.ualberta.ca
o URL: www.cs.ualberta.ca/~pierreb Office: 411 Athabasca Hall Office hours: By appointment only.
o Lectures: Every Monday 14h00 to 15h00 in ATH 411
Course Goals
o Learn how to program heterogeneous parallel computing systems such as GPUs
o CUDA Language
o Functionality and maintainability of GPU
o How to deal with scalability
o Portability issues
o Technical subjects
o Parallel programming API, tools and techniques
o Principles and patterns of parallel algorithms
o Processor architecture features and constraints
Prerequisites
It is assumed that you already have some familiarity with the C and C++ languages.
Course Content
1: Introduction
Topics:
o Course Introduction and Overview
o Introduction to Heterogeneous Parallel Computing
o Portability and Scalability in Heterogeneous Parallel Computing
Materials:
o Lecture-1-2-heterogeneous-computing.pptx
o Lecture-1-3-portability-scalability.pptx
2: Introduction to CUDA C
Topics:
o CUDA C vs. CUDA Libs vs. OpenACC
o Memory Allocation and Data Movement API Functions
o Data Parallelism and Threads
o Introduction to CUDA Toolkit
Materials:
o Chapter03-cuda-programming-model.pdf
o Lecture-2-1-cuda-thrust-libs.pptx
o Lecture-2-2-cuda-data-allocation-API.pptx
o Lecture-2-3-cuda-parallelism-threads.pptx
o Lecture-2-4-cuda-toolkit.pptx
3: CUDA Parallelism Model
Topics:
o Kernel-Based SPMD Parallel Programming
o Multidimensional Kernel Configuration
o Color-to-Greyscale Image Processing Example
o Blur Image Processing Example
Materials:
o Chapter04-cuda-parallelism-model.pdf
o Lecture-3-1-kernel-SPMD-parallelism.pptx
o Lecture-3-2-kernel-multidimension.pptx
o Lecture-3-3-color-to-greyscale-image-processing-example.pptx
4: Memory Model and Locality
Topics:
o CUDA Memories
o Tiled Matrix Multiplication
o Tiled Matrix Multiplication Kernel
o Handling Boundary Conditions in Tiling
o Tiled Kernel for Arbitrary Matrix Dimensions
Materials:
o Programming Massively Parallel Processors
o Hands-on Approach - Copy.pdf
o lecture5-6-CUDA-memory-model-2015.pptx
o Video1
o Video2
o Video3
o Video4
o Video5
o Video6
5: Kernel-based Parallel Programming
Topics:
o Memory Coalescing
o Convolution
o Faster Convolution
o 2D Convolution
Materials:
o PPT
6: Performance Considerations: Scan Applications
Topics:
o Scan Applications: Per-thread Output Variable Allocation
o Scan Applications: Radix Sort
o Performance Considerations (Histogram (Atomics) Example)
o Performance Considerations (Histogram (Scan) Example)
Materials:
o PPT
o PPT, Video1, Video2, Video3, Video4
o PPT
o Assignment 4
7: Floating Point Considerations
Topics:
o Floating Point Precision Considerations
o Numerical Stability
Materials:
o PPT
o PPT
o PPT
8: GPU as part of the PC Architecture
Topics:
o GPU as part of the PC Architecture
Materials:
o PPT
o Assignment 5
9: Efficient Host-Device Data Transfer
Topics:
o Data Movement API vs. Unified Memory
o Pinned Host Memory
o Task Parallelism/CUDA Streams
o Overlapping Transfer with Computation
Materials:
o PPT
10: Application Case Study: Advanced MRI Reconstruction
Topics:
o Advanced MRI Reconstruction and Field Calculations
Materials:
o PPT
11: Scan and Prefix Sum
Topics:
o Scan and Prefix Sum
Materials:
o PPT
o PPT
o PPT
o PPT
12: OpenCL
Topics:
o OpenCL and
Materials:
o PPT
o PPT
o PPT
13: OpenACC
Topics:
o OpenACC
Materials:
o PPT
o PPT
o Assignment 6
14: Multi-GPU
Topics:
o Multi-GPU
Materials:
o PPT
o PPT
o Video1
15: Using CUDA Libraries
Topics:
o Example Applications Using CUDA Libraries
Materials:
o PPT
o PPT
Homework
Homework will generally be handed out in lecture and be due in lecture the following week. Most assignments involve CUDA programming. There will be approximately 5 problem sets.
Course Project
There will be an individual semester project, culminating in a final 8-page report in IEEE format and a presentation at a one-day workshop. Progress and checkpoints before the final due date will count toward the final grade.
Course Grade
The final grade for the course is based on our best assessment of your understanding of the material, as well as your commitment and participation. The problem sets and final projects are combined to give a final grade:
o https://wiki.cites.illinois.edu/wiki/display/ece408fa15/Class+Schedule?src=spaceshortcut