Introduction to GPU Programming

CMPUT 396 LEC B3

Winter 2016

General Information

o Instructor: Pierre Boulanger

o Tel: 780-492-3031

o Email: pierreb@cs.ualberta.ca

o URL: www.cs.ualberta.ca/~pierreb

o Office: 411 Athabasca Hall

o Office hours: By appointment only.

o Lectures: Every Monday 14h00 to 15h00 in ATH 411

Course Goals

o Learn how to program heterogeneous parallel computing systems such as GPUs

    o CUDA language
    o Functionality and maintainability of GPU code
    o How to deal with scalability
    o Portability issues

o Technical subjects

    o Parallel programming API, tools and techniques
    o Principles and patterns of parallel algorithms
    o Processor architecture features and constraints

Prerequisites

It is assumed that you already have some familiarity with the C and C++ languages.

Course Content

1: Introduction
Topics:
o Course Introduction and Overview
o Introduction to Heterogeneous Parallel Computing
o Portability and Scalability in Heterogeneous Parallel Computing
Materials:
o Quick Start Guide.pdf
o Chapter01-introduction.pdf
o Lecture-1-1-overview.pptx
o Lecture-1-2-heterogeneous-computing.pptx
o Lecture-1-3-portability-scalability.pptx

2: Introduction to CUDA C
Topics:
o CUDA C vs. CUDA Libs vs. OpenACC
o Memory Allocation and Data Movement API Functions
o Data Parallelism and Threads
o Introduction to CUDA Toolkit
Materials:
o Chapter03-cuda-programming-model.pdf
o Lecture-2-1-cuda-thrust-libs.pptx
o Lecture-2-2-cuda-data-allocation-API.pptx
o Lecture-2-3-cuda-parallelism-threads.pptx
o Lecture-2-4-cuda-toolkit.pptx
o Module 2 Quiz.pdf
o Module 2 Lab

3: CUDA Parallelism Model
Topics:
o Kernel-Based SPMD Parallel Programming
o Multidimensional Kernel Configuration
o Color-to-Greyscale Image Processing Example
o Blur Image Processing Example
Materials:
o Chapter04-cuda-parallelism-model.pdf
o Lecture-3-1-kernel-SPMD-parallelism.pptx
o Lecture-3-2-kernel-multidimension.pptx
o Lecture-3-3-color-to-greyscale-image-processing-example.pptx
o Lecture-3-5-transparent-scaling.pptx
o Module 3 Quiz.pdf
o Module 3 Lab

4: Memory Model and Locality
Topics:
o CUDA Memories
o Tiled Matrix Multiplication
o Tiled Matrix Multiplication Kernel
o Handling Boundary Conditions in Tiling
o Tiled Kernel for Arbitrary Matrix Dimensions
Materials:
o Programming Massively Parallel Processors
o Hands-on Approach - Copy.pdf
o lecture5-6-CUDA-memory-model-2015.pptx
o Video1
o Video2
o Video3
o Video4
o Video5
o Video6

5: Kernel-based Parallel Programming
Topics:
o Memory Coalescing
o Convolution
o Faster Convolution
o 2D Convolution
Materials:
o PPT, PPT
o PPT
o PPT, Video1, Video2
o PPT, Video1, Video2

6: Performance Considerations: Scan Applications
Topics:
o Scan Applications: Per-thread Output Variable Allocation
o Scan Applications: Radix Sort
o Performance Considerations (Histogram (Atomics) Example)
o Performance Considerations (Histogram (Scan) Example)
Materials:
o PPT, Video1, Video2
o PPT
o PPT, Video1, Video2, Video3, Video4
o PPT
o Assignment 4: http://www.webgpu.com/mp/11

7: Floating Point Considerations
Topics:
o Floating Point Precision Considerations
o Numerical Stability
Materials:
o PPT
o PPT
o PPT

8: GPU as part of the PC Architecture
Topics:
o GPU as part of the PC Architecture
Materials:
o PPT
o Assignment 5: http://www.webgpu.com/mp/85

9: Efficient Host-Device Data Transfer
Topics:
o Data Movement API vs. Unified Memory
o Pinned Host Memory
o Task Parallelism/CUDA Streams
o Overlapping Transfer with Computation
Materials:
o PPT

10: Application Case Study: Advanced MRI Reconstruction
Topics:
o Advanced MRI Reconstruction and Field Calculations
Materials:
o PPT
o PPT, PPT

11: Scan and Prefix Sum
Topics:
o Scan and Prefix Sum
Materials:
o PPT
o PPT
o PPT
o PPT

12: OpenCL
Topics:
o OpenCL and
Materials:
o PPT
o PPT
o PPT

13: OpenACC
Topics:
o OpenACC
Materials:
o PPT
o PPT
o Assignment 6: http://www.webgpu.com/mp/88

14: Multi-GPU
Topics:
o Multi-GPU
Materials:
o PPT
o PPT
o Video1

15: Using CUDA Libraries
Topics:
o Example Applications Using CUDA Libraries
Materials:
o PPT
o PPT
o CUBLAS library
o CUFFT library
o CUSPARSE library
o CURAND library
o Nsight Visual Studio

o Online CUDA documentation

o PTX (low-level instructions)

o Floating point accuracy on NVIDIA GPUs

o CUDA SDK examples

Homework

Homework will generally be handed out in lecture and be due in lecture the following week. Most assignments involve CUDA programming. There will be approximately 5 problem sets.

Course Project

There will be an individual semester project, culminating in a final 8-page report in IEEE format and a presentation at a one-day workshop. Progress at checkpoints before the final due date will count toward the final grade.

Course Grade

The final grade for the course is based on our best assessment of your understanding of the material, as well as your commitment and participation. The problem sets and the final project are combined to give the final grade.

o https://wiki.cites.illinois.edu/wiki/display/ece408fa15/Class+Schedule?src=spaceshortcut