Introduction to GPU Programming

CMPUT 398-A1

Fall 2016




NVIDIA GP100 Block Diagram

NVIDIA GPU

New Pascal Architecture

The GP100 GPU comprises 3840 CUDA cores, 240 texture units, and a 4096-bit memory interface arranged in eight 512-bit segments.

The 3840 CUDA cores are organized into six Graphics Processing Clusters (GPCs), each containing 10 Pascal Streaming Multiprocessors (SMs).

 

 

General Information

 

o   Instructor: Pierre Boulanger

o   Tel: 780-492-3031

o   Email: pierreb@cs.ualberta.ca

o   URL: www.cs.ualberta.ca/~pierreb

o   Office: 411 Athabasca Hall

o   Office hours: By appointment only

o   Lectures: Monday, Wednesday, and Friday, 9h00 to 9h50 in NRE 2 127

o   Labs: Group 1: Thursday 17h00-19h50; Group 2: Friday 14h00-16h50

o   TAs: Michael Feist (mdfeist@ualberta.ca) and Sankalp Prabhakar (sankalp@ualberta.ca)

 

 

Course Goals

o   Learn how to program heterogeneous parallel computing systems such as GPUs

 

o   CUDA Language

o   Functionality and maintainability of GPU code

o   How to deal with scalability

o   Portability issues

 

o   Technical subjects

 

o   Parallel programming API, tools and techniques

o   Principles and patterns of parallel algorithms

o   Processor architecture features and constraints

 

Prerequisites

It is assumed that you already have some familiarity with the C and C++ languages.

Course Content

M1: Introduction

Sept. 7 and 9

o   1.1 Course Introduction and Overview

o   1.2 Introduction to Heterogeneous Parallel Computing

o   1.3 Portability and Scalability in Heterogeneous Parallel Computing

 

o   Chapter01-introduction.pdf

o   Lecture-1-1-overview

o   Lecture-1-2-heterogeneous-computing

o   Lecture-1-3-portability-scalability

 

M2: Introduction to CUDA C

Sept. 12, 14, and 16

o   2.1 CUDA C vs. Thrust vs. CUDA Libraries

o   2.2 Memory Allocation and Data Movement API Functions

o   2.3 Threads and Kernel Functions

o   2.4 Introduction to the CUDA Toolkit

 

o   Brief Introduction to C programming

o   Chapter 2 - Data Parallel Computing.pdf

o   Lecture-2-1-cuda-thrust-libs

o   Lecture-2-2-cuda-data-allocation-API

o   Lecture-2-3-cuda-parallelism-threads

o   Lecture-2-4-cuda-toolkit

o   Quiz1 opens on eClass Friday at 9h55 and closes Monday at 23h55

o   Module 2 Lab1 Lab1.zip

o   Lab due Sept. 19

 

M3: CUDA Parallelism Model

 

Sept. 19, 21, and 23

o   3.1 Kernel-Based SPMD Parallel Programming

o   3.2 Multidimensional Kernel Configuration

o   3.3 Color-to-Grayscale Image Processing Example

o   3.4 Image Blur Example

o   3.5 Thread Scheduling

o   Chapter 3 - Scalable Parallel Execution.pdf

o   Lecture-3-1-kernel-SPMD-parallelism

o   Lecture-3-2-kernel-multidimension

o   Lecture-3-3-color-to-greyscale-image-processing-example

o   Lecture-3-4-blur-kernel

o   Lecture-3-5-thread-scheduling

o   Quiz2 opens on eClass Friday at 9h55 and closes Monday at 23h55

o   Module 3 Lab2 Lab2.zip

o   Lab due Sept. 26

M4: Memory Model and Locality

 

Sept. 26, 28, and 30

o   4.1 CUDA Memories

o   4.2 Tiled Parallel Algorithms

o   4.3 Tiled Matrix Multiplication

o   4.4 Tiled Matrix Multiplication Kernel

o   4.5 Handling Arbitrary Matrix Sizes in Tiled Algorithms

o   Chapter 4 - Memory and Data Locality

o   Lecture-4-1 CUDA Memories

o   Lecture-4-2-Tiled Parallel Algorithms

o   Lecture-4-3-Tiled Matrix Multiplication

o   Lecture-4-4-Tiled Matrix Multiplication Kernel

o   Lecture-4-5 Handling Arbitrary Matrix Sizes in Tiled Algorithms

o   Quiz3 opens on eClass Friday at 9h55 and closes Monday at 23h55

o   Module 4 Lab3 Lab3.zip

o   Lab due Oct. 3

M5: Thread Execution and Efficiency

 

Oct. 3 and 5

o   5.1 Warps and SIMD Hardware

o   5.2 Performance Impact of Control Divergence

o   Chapter 5 - Performance Considerations.pdf

o   Lecture-5-1-Warps and SIMD Hardware

o   Lecture-5-2-Performance Impact of Control Divergence

o   Quiz4 opens on eClass Friday at 9h55 and closes Monday at 23h55

o   Module 5 Lab4 Lab4.zip

o   Lab due Oct. 10

o   Excellent paper on Thread Execution in CUDA

M6: Memory Access Performance

Oct. 7

o   6.1 DRAM Bandwidth

o   6.2 Memory Coalescing in CUDA

o   Chapter 5 - Performance Considerations.pdf

o   Lecture-6-1-DRAM Bandwidth

o   Lecture-6-2-Memory Coalescing in CUDA

o   Quiz5 opens on eClass Friday at 9h55 and closes Monday at 23h55

M7: Parallel Computation Patterns (Histogram)

 

Oct. 12 and 14

o   7.1 Histogramming

o   7.2 Introduction to Data Races

o   7.3 Atomic Operations in CUDA

o   7.4 Atomic Operation Performance

o   7.5 Privatization Technique for Improved Throughput

o   Chapter 11 - Parallel Patterns-Parallel Histogram Computation

o   Lecture-7-1-Histogramming

o   Lecture-7-2-Introduction to Data Races

o   Lecture-7-3-Atomic Operations in CUDA

o   Lecture-7-4-Atomic Operation Performance

o   Lecture-7-5-Privatization Technique for Improved Throughput

o   Quiz6 opens on eClass Thursday at 9h55 and closes Monday at 23h55

o   Module 7 Lab5 Lab5.zip

o   Lab due Oct. 17

M8: Parallel Computation Patterns (Stencil)

Oct. 17 and 19

o   8.1 Convolution

o   8.2 Tiled Convolution

o   8.3 Tile Boundary Conditions

o   8.4 Analyzing Data Reuse in Tiled Convolution

 

 

o   Chapter 7 - Parallel Patterns: Convolution

o   Lecture-8-1-Convolution

o   Lecture-8-2-Tiled Convolution

o   Lecture-8-3-Tile Boundary Conditions

o   Lecture-8-4-Analyzing Data Reuse in Tiled Convolution

o   Convolution Optimization

o   Quiz7 opens on eClass Thursday at 9h55 and closes Monday at 23h55

o   Module 8 Lab6 Lab6.zip

o   Lab due Oct. 24

 

M9: Parallel Computation Patterns (Reduction)

Oct. 21 and 24

o   9.1 Reduction

o   9.2 Reduction Kernel

o   9.3 Better Reduction Kernel

o   Lecture-9-1-reduction

o   Lecture-9-2-reduction-kernel

o   Lecture-9-3-better-reduction-kernel

o   Reduction Optimization

o   Quiz8 opens on eClass Thursday at 9h55 and closes Monday at 23h55

o   Module 9 Lab7 Lab7.zip

o   Lab due Oct. 31

M10: Parallel Computation Patterns (Scan)

 

Oct. 26 and 28

o   10.1 Prefix Sum

o   10.2 A Work-inefficient Scan Kernel

o   10.3 A Work-Efficient Parallel Scan Kernel

o   10.4 More on Parallel Scan

o   10.5 Scan applications

o   Chapter 9 - Parallel Patterns: Prefix Sum

o   Parallel Prefix Sum (Scan) with CUDA

o   Lecture-10-1-Prefix Sum

o   Lecture-10-2-A Work-inefficient Scan Kernel

o   Lecture-10-3-A Work-Efficient Parallel Scan Kernel

o   Lecture-10-4-More on Parallel Scan

o   Quiz9 opens on eClass Thursday at 9h55 and closes Monday at 23h55

o   Module 10 Lab8 Lab8.zip

o   Lab due Nov. 21

M11: Floating-Point Considerations

Oct. 31

o   11.1 Floating-Point Precision and Accuracy

o   11.2 Numerical Stability

o   Lecture-11-1-Floating-Point Precision and Accuracy

o   Lecture-11-2-Numerical Stability

M12: GPU as Part of the PC Architecture

Nov. 2

o   12.1 GPU as Part of the PC Architecture

 

o   Lecture-12-1-GPU as Part of the PC Architecture

M13: Efficient Host-Device Data Transfer

Nov. 4

o   13.1 Pinned Host Memory

o   13.2 Task Parallelism in CUDA

o   13.3 Overlapping Data Transfer with Computation

o   Lecture-13-1-Pinned Host Memory

o   Lecture-13-2-Task Parallelism in CUDA

o   Lecture-13-3-Overlapping Data Transfer with Computation

o   Quiz10 opens on eClass Thursday at 9h55 and closes Monday at 23h55

o   Module 13 Lab9 Lab9.zip

o   Lab due Nov. 28

Fall Reading Week

Nov. 7-11

o   No class

 

M14: Application Case Study: Electrostatic Potential Calculation and CNN

Nov. 14-16

o   14.1 Electrostatic Potential Calculation Part 1

o   14.2 Electrostatic Potential Calculation Part 2

o   14.3 DNN and Convolutional Neural Networks

o   Lecture-14-1-VMD-case-study-Part1

o   Lecture-14-2-VMD-case-study-Part2

o   Lecture-14-3-Convolutional-Neural-Networks

M15: Computational Thinking for Parallel Programming

Nov. 21

o   15.1 Computational Thinking

o   15.2 Multi-GPU-Programming

o   2nd-Edition-Chapter13-Computational-Thinking

o   Lecture-15-1-Computational-Thinking

o   Lecture-15-2-Multi-GPU-Programming

M16: Related Programming Models: MPI

Nov. 23

o   16.1 Introduction to Heterogeneous Supercomputing and MPI

o   2nd-Edition-Chapter19-Cluster

o   Lecture-16-1-Introduction to Heterogeneous Supercomputing and MPI

o   Lecture-16-2-MPI-CUDA-Part2

o   Lecture-16-3-MPI-CUDA-Part3

M17: Related Programming Models: OpenCL

Nov. 25

o   17.1 OpenCL Data Parallelism Model

o   17.2 OpenCL Device Architecture

o   17.3 OpenCL Host Code

o   2nd-Edition-Chapter14-OpenCL

o   Lecture-17-1-OpenCL Data Parallelism Model

o   Lecture-17-2-OpenCL Device Architecture

o   Lecture-17-3-OpenCL Host Code

o   Quiz11 opens on eClass Thursday at 9h55 and closes Monday at 23h55

o   Module 17 Lab10 Lab10.zip

o   Lab due Dec. 5

M18: Related Programming Models: OpenACC

Nov. 28

o   18.1 Introduction to OpenACC

o   18.2 OpenACC Subtleties

o   2nd-Edition-Chapter15-OpenACC

o   Lecture-18-1-openACC-intro

o   Lecture-18-2-openACC-subtleties

o   Quiz12 opens on eClass Thursday at 9h55 and closes Monday at 23h55

M19: Class Review

Dec. 7

o   Class Review

o   New Pascal Architecture

o   New GPU-based Super Computers

o   Class Review and Answers to Quiz Questions

 

Final Exam

Dec. 16

o   Friday, December 16 at 9h00 NRE 2 127

o   The exam will be 2 hours

 

 

Quizzes

 

There will be twelve quizzes, one after each module. You are required to complete only ten of them, and together they count for 20% of the final mark. The quizzes will generally be posted electronically after a module concludes and are due, on paper, in class at the beginning of the next module.

 

Final Exam

There will be a final exam at the end of the term. The exam will count for 30% of the final mark. The date of the final exam is December 16 at 9h00 in NRE 2-127.

Laboratory

There will be 10 labs of varying complexity distributed over the term, counting for 50% of the final mark. Lab instructions will be posted on the website at the end of each module. The labs take place in room CSC 167, in two groups: Thursday 17h00-19h50 and Friday 14h00-16h50. Students have three hours to compile and get their program working, but they are not expected to finish the entire lab in the allocated period; they can also work from home or whenever the lab is free. Marks are based on the program's functionality and speed-up.

Course Grade

The final grade for the course is based on our best assessment of your understanding of the material, as well as your commitment and participation. The quizzes, the labs, and the final exam are combined to give the final grade:

 

o   Final Exam: 30%

o   Quizzes: 20%

o   Labs: 50%

 

References