
ECE 498 AL : Programming Massively Parallel Processors

 

http://courses.ece.illinois.edu/ece498/al/


 

Course Objectives

Virtually all semiconductor market domains, including PCs, game consoles, mobile handsets, servers, supercomputers, and networks, are converging to concurrent platforms. There are two important reasons for this trend. First, these concurrent processors can potentially offer more effective use of chip area and power than traditional monolithic microprocessors for many demanding applications. Second, an increasing number of applications that traditionally used Application-Specific Integrated Circuits (ASICs) are now implemented with concurrent processors in order to improve functionality and reduce engineering cost. The real challenge is to develop application software that uses these concurrent processors effectively to achieve efficiency and performance goals.

The aim of this course is to give students knowledge and hands-on experience in developing application software for processors with massively parallel computing resources. In general, we refer to a processor as massively parallel if it can complete more than 64 arithmetic operations per clock cycle. Today's NVIDIA processors already exhibit this capability, and processors from Intel, AMD, and IBM will begin to qualify as massively parallel in the next several years. Programming these processors effectively requires in-depth knowledge of parallel programming principles, as well as the parallelism models, communication models, and resource limitations of these processors.

The target audience of the course is students who want to develop exciting applications for these processors, as well as those who want to develop programming tools and future implementations for these processors. We will be using NVIDIA processors and the CUDA programming tools in the lab section of the course. Many have reported success in performing non-graphics parallel computation, as well as traditional graphics rendering computation, on these processors.
You will work through structured programming assignments before being turned loose on the final project, with each assignment exercising successively more sophisticated programming skills. The final project will be of your own design, with the requirement that it involve a demanding application, such as a mathematics- or physics-intensive simulation or other data-intensive computation, followed by some form of visualization and display of results.

This is a course in programming massively parallel processors for general computation. We are fortunate to have the support and presence of David Kirk, the Chief Scientist of NVIDIA and one of the main driving forces behind NVIDIA's CUDA technology. Building on architecture knowledge from ECE 411 and general C programming knowledge, we will expose you to the tools and techniques you will need to attack a real-world application for the final project. The final projects will be supported by real application groups at UIUC and around the country, in areas such as biomedical imaging and physical simulation.
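To give a flavor of the CUDA programming model used in the lab section, the sketch below shows a minimal vector-addition program in the style of the early machine problems. It is illustrative only (the kernel name, array size, and launch configuration are not taken from any course assignment) and requires an NVIDIA GPU and the nvcc compiler to run:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Each thread computes one element; the grid supplies enough
// threads to cover all n elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard against the final partial block
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (CPU) arrays
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Device (GPU) arrays
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch one thread per element, rounding the block count up
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %f\n", h_c[100]);   // 100 + 200 = 300

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

The separate host and device allocations, explicit memory copies, and the `<<<blocks, threads>>>` launch syntax are the core concepts introduced in the first few lectures and exercised in MP1.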

 

Instructor:
Wen-Mei Hwu. Office: 215 CSL, E-mail: w-hwu@illinois.edu, Phone: 244-8270

Teaching Assistant:
John Stratton. Office: 223 CSL, E-mail: stratton@illinois.edu, Phone: 333-4171

 

Lecture and Office Hours
Lecture: 9:30-10:50 AM, Tu Th, 1105 Siebel Center.
Regular makeup lecture time: Wed 5:00-6:30 PM (selected weeks)
Instructor office hours: Wednesdays 2-3:30 PM

Course TA office hours: Mondays 2-3 PM, Fridays 3-4 PM

 

 "If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?"

-- Seymour Cray

 

"640 K ought to be enough for anybody."

-- Bill Gates, 1981

 

 

 

Spring 2009 Syllabus (Tentative)

Date

Lecture

Material

Assignments

Week 1: Tu, 1/20

Lecture 1 - Introduction

Slides (ppt)
Voice (mp3)

Read Chapter 1 of Textbook

Th 1/22

Lecture 2 - GPU Computing and CUDA Programming Model Intro

Slides (ppt)
Voice (mp3)

Read Chapter 2 of Textbook

Week 2: Tu, 1/27

Lecture 3 - CUDA Example and CUDA Threads

Slides (ppt)
Voice (mp3)

Read Chapter 3 of Textbook
AC accounts and MP1 available

Th, 1/29

Lecture 4 - CUDA Threads Part 2 and API Details

Slides (ppt)
Voice (mp3)
Joke (mp3)

Read Chapter 3 of Textbook
Work on MP1

Week 3: Tu, 2/3

Lecture 5 - CUDA Memory

Slides (ppt)
Voice (mp3)

Read Chapter 4 of Textbook

Th, 2/5

Lecture 6 - CUDA Memory Example

Slides (ppt)
Voice (mp3)

Week 4: Mon, 2/9

MP1 (parts 1 and 2) due

Tu 2/10

Lecture 7 - GPU as Part of the PC Architecture

Slides (ppt)
Voice (mp3)

Read Chapter 5 of Textbook

Th, 2/12

Lecture 8 - CUDA Threading Hardware

Slides (ppt)
Voice (mp3)

Week 5: Tu, 2/17

Lecture 9 - CUDA Memory Hardware

Slides (ppt)
Voice (mp3)

Th, 2/19

Lecture 10 - Control Flow in CUDA

Slides (ppt)
Voice (mp3)

Fri, 2/20

MP2 due

Week 6: Tu, 2/24

Lecture 11 - Floating Point Performance, Precision, and Accuracy

Prof. Hwu's Floating Point notes (doc)

Slides (ppt)
Voice (mp3)

Read Chapter 6 of Textbook

Th, 2/26

Lecture 12 - Parallel Programming Basics

Slides (ppt)
Voice (mp3)

Week 7: Tu, 3/3

Lecture 13 - Parallel Algorithm Basics

John Stratton's methodology for computing bank conflicts in Scan (doc).

Slides (ppt)
Voice (mp3)

Wed, 3/4

MP3 due

Th, 3/5

Lecture 14 - Final Project Kickoff

Slides (ppt)
Voice (mp3)

Week 8: Tue, 3/10

Lecture 15 - (TA lecture, John Stratton) Reductions and Their Implementation

Slides (ppt)
Voice (mp3)

Th, 3/12

Lecture 16 - Application Case Studies - MRI

Slides (ppt)
Voice (mp3)

Read Chapter 7 of Textbook

Fri, 3/13

MP4 due

Week 9: Tu, 3/17

Lecture 17 - Application Case Studies - MRI part 2

Slides (ppt)
Voice (mp3)

Th, 3/19

Lecture 18 - Application Case Studies - MRI part 3

Slides (ppt)
Voice (mp3)

Week 10: Spring Break, no class

Week 11: Mon, 3/30

MP5 due

Tu, 3/31

Lecture 19: The rest of the semester

Voice (mp3)

Th, 4/2

NO LECTURE: Please attend one of the Accelerator Conference Talks

Week 12: Tu, 4/7

Lecture 20: (Guest lecture, John Stone) Application performance insights: Direct Summation Potential Grids

Voice (mp3)

Th, 4/9

Lecture 21: (Guest lecture, John Stone) Application performance insights part 2

Slides (pdf)
Voice (mp3)

Project Proposals due

Week 13: Tu, 4/14

Lecture 22: Guest Lecturer Aaron Shin - Computational Fluid Dynamics Case Study

Slides (pdf)
Voice (mp3)

Th, 4/16

Lecture 23: John Stratton - Successful CUDA application patterns

Slides (ppt)
Voice (mp3)

Week 14: Tu, 4/21

Lecture 24 John Stratton - More CUDA features and tools

Slides (ppt)

Th, 4/23

NO LECTURE

Week 15: Tu, 4/28

Lecture 25: David Kirk - GPU computing history

Voice (mp3)

Th, 4/30

NO LECTURE

Exam: specific date TBD

Week 16: Wed, 5/6

Final Project Presentation Symposium

 

Archived lectures/recordings from previous semester(s):

Spring 2007 - First-time course offering by Prof. Hwu (UIUC) and Prof. Kirk (NVIDIA)!

 

Fall 2007 Hall of Fame

This coveted title is earned by the top five students who wrote the fastest code for the final, and arguably most challenging, machine problem: MP5, parallel sort. Their results have been independently confirmed by a rigorous TA test suite, and the code has been manually checked and corrected for any irregularities or inconsistencies in the timing mechanism. Note that the students are ranked by absolute CUDA processing time, not by speedup, although the two rankings happen to agree. Congratulations!

MP5 Top 5:

 

Name

Program Output

1. Alan Kaatz
Source Code / Report

**===--------------- Grading kaatz -------------------===**
Processing 16000000 elements...
Host CPU Processing time: 8084.667969 (ms)
G80 CUDA Processing time: 311.354004 (ms)
Speedup: 25.966160X
Test PASSED
diffing TA and student outputs test 1
diffing TA and student outputs test 2
diffing TA and student outputs test 3

PASSED: kaatz passed all tests.
**===-------------------------------------------------===**

2. Thomas Shen
Source Code / Report

**===--------------- Grading tbshen -------------------===**
Processing 16000000 elements...
Host CPU Processing time: 8011.271973 (ms)
G80 CUDA Processing time: 322.955994 (ms)
Speedup: 24.806079X
Test PASSED
diffing TA and student outputs test 1
diffing TA and student outputs test 2
diffing TA and student outputs test 3

PASSED: tbshen passed all tests.
**===-------------------------------------------------===**

3. Lingling Miao
Source Code / Report

**===--------------- Grading lmiao2-------------------===**
Processing 16000000 elements...
Host CPU Processing time: 8018.119141 (ms)
G80 CUDA Processing time: 461.638000 (ms)
Speedup: 17.368846X
Test PASSED
diffing TA and student outputs test 1
diffing TA and student outputs test 2
diffing TA and student outputs test 3

PASSED: lmiao2 passed all tests.
**===-------------------------------------------------===**

4. Michael Connor
Source Code / Report

**===--------------- Grading connor2-------------------===**
Processing 16000000 elements...
Host CPU Processing time: 8034.141113 (ms)
G80 CUDA Processing time: 504.471985 (ms)
Speedup: 15.925842X
Test PASSED
diffing TA and student outputs test 1
diffing TA and student outputs test 2
diffing TA and student outputs test 3

PASSED: connor2 passed all tests.
**===-------------------------------------------------===**

5. Faycal Benmlih
Source Code / Report

**===--------------- Grading benmlih2-------------------===**
Processing 16000000 elements...
Host CPU Processing time: 7643.923828 (ms)
Number of elements: 16000000
Number of Cuda elements: 16000000
Number of Cuda Memory elements: 16777216
Number of padded elements: 777216
G80 CUDA Processing time: 578.578003 (ms)
Speedup: 13.211570X
Test PASSED
diffing TA and student outputs test 1
diffing TA and student outputs test 2
diffing TA and student outputs test 3

PASSED: benmlih2 passed all tests.
**===-------------------------------------------------===**

All speedup results are computed against a 2.2 GHz AMD Opteron 248 processor with 1 GB of system memory.