Contents (hide)

1.   1.  Mon 2015-01-26 (L1)
2.   2.  Thurs 1/29/15 (L2)
3.   3.  Mon 2/2/15
4.   4.  Thurs 2/5/15 (L3)
5.   5.  Mon 2/9/15 (L4)
1.   5.1  NVidia device architecture, start of CUDA
6.   6.  Thurs 2/12/15 (L5)
7.   7.  Mon 2/16/15
8.   8.  Thurs 2/19/15 (L5)
9.   9.  Mon 2/23/15 (L6)
1.   9.1  GTC highlights
2.   9.2  CUDA ctd
10. 10.  Thurs 2/26/15 (L7)
1. 10.1  Nvidia conceptual hierarchy
2. 10.2  CUDA versions
3. 10.3  GPU summary
4. 10.4  CUDA FAQs
5. 10.5  Useful commands etc
6. 10.6  Recent CUDA Changes
7. 10.7  Lecture material
11. 11.  Mon 3/2/15 (L8)
12. 12.  Thurs 3/5/15 (L9)
13. 13.  Mon 3/9/15 (L10)
1. 13.1  Homework note
2. 13.2  Videos
3. 13.3  CUDA Unified Memory
4. 13.4  Stanford lecture on Thrust
14. 14.  Thurs 3/12/15 (L11)
1. 14.1  Videos
2. 14.2  Thrust experiments
3. 14.3  Sorting with Thrust on device
15. 15.  Mon 3/16/2015 (L12)
1. 15.1  Videos
2. 15.2  Thrust notes
16. 16.  Thurs 3/19/15 (L11)
1. 16.1  Thrust
2. 16.2  Videos
3. 16.3  Homeworks
4. 16.4  Term project
17. 17.  Mon 3/30/15 (L12)
1. 17.1  Thrust
2. 17.2  Error about missing CUDA 5 files etc.
3. 17.3  GPU Techology Conference talk
4. 17.4  cuFFT Notes
5. 17.5  GTC 2015 talk
6. 17.6  cuBLAS etc Notes
7. 17.7  Term project dates
18. 18.  Mon 4/6/15 (L14)
1. 18.1  Supercomputing videos
2. 18.2  Matlab
3. 18.3  Mathematica in parallel.
19. 19.  Thurs 4/9/15 (L15)
20. 20.  Mon 4/13/15 (L16)
21. 21.  Thurs 4/16/15 (L17)
1. 21.1  Papers selected by students
22. 22.  Mon 4/20/15 (L18)
23. 23.  Thurs 4/23/15 (L19)
24. 24.  Mon 4/27/15 (L20)
25. 25.  Thurs 4/30/15 - no class
26. 26.  Mon 5/4/15 (L21)
1. 26.1  Term project
2. 26.2  New material
27. 27.  Wed 5/6/15
28. 28.  Thurs 5/7/15 (L22)
29. 29.  Mon 5/11/15 (L23)
1. 29.1  Term project presentations
2. 29.2  Course recap
3. 29.3  Term projects due.
4. 29.4  Remember your course survey
5. 29.5  After the semester

2015 Week 1

## 1.  Mon 2015-01-26 (L1)

2. ssh.
3. Read https://computing.llnl.gov/tutorials/parallel_comp/ for an intro to parallel computing.
4. Some points:
1. Physics limits to processor speed.
2. History of Nvidia.
3. Intel CPUs vs Nvidia CUDA cores.
5. OpenMP vs CUDA.
6. Rate-limiting cost usually I/O not computation.
5. Homework 1 is online, due next week.
6. Notes written during the lecture.

## 2.  Thurs 1/29/15 (L2)

1. I (WRF) will also usually stay after lectures to answer questions.
2. Why you should also take the Supercomputer course: It's your chance to use a Blue Gene.
3. Sandia Red Sky supercomputer build timelapse
4. LMS summary if necessary.
5. Geoxeon uses ZFS, which snapshots. Find your old files in /home/parcomp/.zfs/snapshot/*/rcsid .
6. On geoxeon, I will put class files in /home/parcomp/Class/ .
8. Intro to OpenMP by Mattson
9. Start OpenMP by running sample programs on geoxeon /home/parcomp/Class/openmp/llnl. Some of the programs show bugs.
10. Compiling openmp programs and Makefile.
11. Environment variables to set number of threads etc.
12. My example programs were from LLNL. The tutorial is here. Sample programs that I used are linked from here. Search down the page for omp_hello.c.
13. The Intel Mattson tutorial also has programs, and PDF slides and Youtube videos.
14. Today we saw that running even a short OpenMP program could be unpredictable. The reason for that was today's big lesson.
This happened during the first space shuttle launch in 1981. A 1-in-70 chance event prevented the computers from synchonizing. This had been observed once before during testing. However when they couldn't get it to happen again, they ignored it.
This interlacing of different threads can happen at the machine instruction level. The C statement K=K+3 can translate into the machine instructions
Let's color the threads red and blue.
then K increases by 6. If they interlace thus:
then K increases by 3.
15. This can be really hard to debug, particularly when you don't know where this is happening.
One solution is to serialize the offending code, so that only one thread executes it at a time. The limit would be to serialize the whole program and not to use parallel processing.
16. OpenMP has several mechanisms to help.
17. A critical pragma serializes the following block. There are two considerations.
1. You lose the benefits of parallelization on that block.
2. There is an overhead to set up a critical block that I estimate might be 100,000 instructions.
18. An atomic pragma serializes one short C statement, which must be one of a small set of allowed operations, such as increment. It matches certain atomic machine instructions, like test-and-set. The overhead is much smaller.
19. If every thread is summing into a common total variable, the reduction pragma causes each thread to sum into a private subtotal, and then sum the subtotals. This is very fast.
20. Another lesson is that sometimes you can check your program's correctness with an independent computation. For the trivial example of summing {$i^2$}, use the formula
{$\sum_{i=1}^N i^2 = N(N+1)(2N+1)/6$}.
There is a lot of useful algebra if you have the time to learn it. I, ahem, learned this formula in high school.
21. Another lesson is that even when the program gives the same answer every time, it might still be consistently wrong.
22. Another is that just including OpenMP facilities, like -fopenmp, into your program slows it down even if you don't use them.
23. Another is that the only meaningful time metric is elapsed real time. One reason that CPU time is meaningless is the OpenMP sometimes pauses a thread's progress by making it do a CPU-bound loop. (That is also a common technique in HW.)
24. Also note that CPU times can vary considerably with successive executions.
25. Also, using too many threads might increase the real time.
26. Finally, floating point numbers have their own problems. They are an approximation of the mathematical concept called the real number field. That is defined by 11 axioms that state obvious properties, like
A+B=B+A (commutativity) and
A+(B+C)=(A+B+C) (associativity).
This is covered in courses on modern algebra or abstract algebra.
The problem is that most of these are violated, at least somewhat, with floats. E.g.,
{$\left(10^{20}-10^{20}\right)+1=0+1=1$} but
{$10^{20}+\left(-10^{20}+1\right)=10^{20}-10^{20}=0$}.
Therefore when threads execute in a different order, floating results might be different.
There is no perfect solution, though using double is a start.
1. On modern CPUs, double is just as fast as float.
2. However it's slower on GPUs.
3. It also requires moving twice the data, and data movement is often more important than CPU time.
The large field of numerical analysis is devoted to finding solutions, with more robust and stable algorithms.
Summing an array by first sorting, and then summing the absolutely smaller elements first is one technique. Inverting a matrix by pivoting on the absolutely largest element, instead of on {$a_{11}$} is another.

2015 Week 2

## 3.  Mon 2/2/15

No class - snow day.

## 4.  Thurs 2/5/15 (L3)

1. We'll see programs in Gfiles:.
3. Debug some programs in https://computing.llnl.gov/tutorials/openMP/samples/C

2015 Week 3

## 5.  Mon 2/9/15 (L4)

1. Look at some more programs in Gfiles:. .
1. stacksize.cc - get and set stack size. Can also do ulimit -s.
2. omps.cc - read assorted OMP variables.
2. Homework 2 is online in Homeworks due next Mon.
3. OpenMP sections - my example is sections.cc with some variants.
1. Note my cool way to print an expression's name followed by its value.
2. Since I/O is not thread-safe, it has to be protected by a critical and also followed by a barrier.
3. Note the 3 required levels of pragmas: parallel, sections, section.
4. The assignment of sections to threads is nondeterministic.
5. IMNSHO, OpenMP considerably easier than pthreads, fork, etc.
1. While inside a pragma parallel, you queue up lots of tasks - they're executed as threads are available.
3. My example is tasks.cc with some variants.
4. taskfib.cc is modified from an example in the OpenMP spec.
2. This is only an example; you would never implement fibonacci that way. (The fastest way is the following closed form. In spite of the sqrts, the expression always gives an integer. )
{$$F_n = \frac{\left( \frac{1+\sqrt{5}}{2} \right) ^n - \left( \frac{1-\sqrt{5}}{2} \right) ^n }{\sqrt{5}}$$}
3. Spawning a task is expensive; we can calculate the cost.
5. What you should have learned from the OpenMP lectures:
1. How to use a well-designed and widely used API that is useful on almost all current CPUs.
2. Shared memory: the most common parallel architecture.
3. How to structure a program to be parallelizable.
4. Issues in parallel programs, like nondeterminism, race conditions, critical regions, etc.

### 5.1  NVidia device architecture, start of CUDA

1. The above shared memory model hits a wall; CUDA handles the other side of the wall.
2. I'll lecture from the Stanford notes for awhile. stanford/lectures/. Today, lecture 1, through slide 30.

## 6.  Thurs 2/12/15 (L5)

1. To people still having trouble connecting to geoxeon, try the following:
1. ssh-keygen -t rsa
2. ssh-keygen -t rsa1
3. ssh-keygen -t dsa
4. ssh-keygen -t ed25519
5. ssh-keygen -t ecdsa
6. cat ~/.ssh/*.pub > all
7. Then email file all to me, along with your rcsid.
If some of those formats are unrecognized, don't worry. The hope is that one of them will work.
2. One advantage of ssh is that you can mount your geoxeon directory on your local machine. On my laptop, I run caja, then do File - Connect to server, choose type ssh, server geoxeon.ecse.rpi.edu, use my RCSID as the user name, leave the password blank, and give my private key passphrase if asked. Any other program to mount network shares would work as well.
Where the share is actually mounted varies. On my laptop, it's here: /var/run/user/1000/gvfs/sftp:host=geoxeon.ecse.rpi.edu,user=wrf
As with any network mount, fancy filesystem things like attributes, simultaneous reading and writing from different machines, etc., probably won't work.
3. To avoid retyping your ssh private key passphrase in the same shell, do this
4. With ssh, you can also copy files back and forth w/o mounting or typing passwords:
1. scp localfile geoxeon.ecse.rpi.edu:
2. scp -r localdir geoxeon.ecse.rpi.edu:
3. scp geoxeon.ecse.rpi.edu:remotefile .
It even does filename completion on remote files.
5. You can also run single commands:
ssh geoxeon.ecse.rpi.edu hostname
6. Geoxeon implements AFS, so your RCS files are accessible (read and write) at
/afs/rpi/edu/home/??/RCSUID.
Search for your home dir thus:
ls -ld /afs/rpi.edu/home/*/RCSID.
Authenticate yourself with klog.
7. On geoxeon, your files are transparently compressed, so there's no need to explicitly use gzip etc.
8. Remote graphics over ssh is very slow. I'll give hints later about improving this.
9. Stanford lecture 1, slide 31-end. stanford/lectures/
10. Homework 2 due date is postponed to next Thurs because of no lecture on Mon.

2015 Week 4

## 7.  Mon 2/16/15

No classes - Presidents' Day.

## 8.  Thurs 2/19/15 (L5)

### 8.1  New TA

Hanchao Liu, l i u h 9, is our new 10-hour TA. His main job will be grading and maintaining LMS. I (WRF) will be your prime question-answerer.

### 8.2  Types of memory allocation

Here's a brief overview of my understanding of the various places that you can assign memory in a program.

1. Static. Define a fixed-size array global array. The variable is constructed at compile time, so accesses might perhaps be faster. Global vars with non default initial values increase the executable file size. If they're large enough, you need to use the compiler option -mcmodel=medium or -mcmodel=large. I don't know the effect on the program's size or speed.
2. Stack. Define local arrays, that are created and freed as the routine is entered and exited. Their addresses relative to the base of this call frame may be constant. The default stack size is 8MB. You can increase this with the command ulimit or in the program as shown in stacksize.cc. I believe that in OpenMP, the max stacksize may be allocated when each thread is created. Then, a really big stackssize might have a penalty.
3. Heap. You use new and destroy. Variables are constructed whenever you want. The more objects on the heap, the more time that each new or destroy takes. If you have lots of objects consider using placement new or creating an array of them.

I like to use static, then stack, and heap only when necessary.

Google's allocator is noticeably better than the default one. To use it, link your programs with -ltcmalloc. You can often use it on an existing executable foo thus:

I found it to save 15% to 30% in time.

Another memory concern is speed. Geoxeon has a NUMA (Non Uniform Memory Architecture). It has two 8-core Xeons. Each core has 64GB of main memory. Although all 128GB are in a common address space, accessing memory on same core as the thread is running on is faster.

The following is what I think based on some research, but may be wrong: A 4KB page of memory is assigned to a specific core when it is first written (not when it is reserved). So, each page of a large array may be on a different core. This can be used to optimize things. This gets more fun with 8-processor systems.

All that is separate from cache issues.

You can also assign your OpenMP threads to specific cores. This affects speed in ways I don't understand. The issues are resource sharing vs conflicts.

2015 Week 4b

### 8.3  VPN requirement for geoxeon

When accessing geoxeon from off-campus, you must go through the VPN.

### 8.5  New CUDA material

1. Stanford lectures:
lecture_2/gpu_history_and_cuda_programming_basics.pdf
2. Demo programs on geoxeon:
/home/parcomp/Class/cuda/stanford/tutorials

2015 Week 5

## 9.  Mon 2/23/15 (L6)

### 9.1  GTC highlights

The annual [[http://www.gputechconf.com/|GPU Technology Conference]] is in March. You can register for only $1400. However many talks are online as videos, with PDF slide sets, for free. I'll show extracts in class (the full talks are 1 hour). 1. Smashing galaxies Smashing Galaxies: A Hierarchical Road from Top to Bottom Jeroen Bedorf (Leiden Observatory, Leiden University), Evghenii Gaburov (SARA, Amsterdam, the Netherlands) Find out how one can leverage massive GPU parallelism to assemble fast sparse octree construction and traverse methods by combining parallel primitives such as scan and sort algorithms. These techniques have culminated in Bonsaia hierarchical gravitational N-body code which is used to study the formation and mergers of galaxies in the present day Universe. With the advent of Kepler''s dynamic parallelism, we explore the new venues that this technology opens for scalable implementations of such hierarchical algorithms. We conclude the session with cutting edge simulations complemented with spectacular visualizations that are produced in collaboration with NVIDIA''s visualization experts. ### 9.2 CUDA ctd 1. With CUDA, the dominant problem in program optimization is optimizing the data flow. Getting the data quickly to the cores is harder than processing it. It helps big to have regular arrays, where each core reads or writes a successive entry. This is analogous to the hardware fact that wires are bigger (hence, more expensive) than gates. 2. That is the opposite optimization to OpenMP, where having different threads writing to adjacent addresses will cause the false sharing problem. 3. lecture_3/cuda_threads_and_atomics.pdf slides 19-end. 4. lecture_4/cuda_memories.pdf, slides 1-28. 5. More tutorial programs from tutorials/. 2015 Week 5b ## 10. Thurs 2/26/15 (L7) ### 10.1 Nvidia conceptual hierarchy As always, this is as I understand it, and could be wrong. 1. At the bottom is the hardware micro-architecture. This is an API that defines things like the available operations. The last several Nvidia micro-architecture generations are, in order, Tesla (which introduced unified shaders), Fermi, Kepler, Maxwell (introduced in 2014), Pascal. The major number of the CUDA capability version corresponds to the micro-architecture generation. Kepler is 3.x. The K20xm is 3.5. 2. Each micro-architecture is implemented in several different microprocessors. E.g., the Kepler micro-architecture is embodied in the GK107, GK110, etc. The second letter describes the micro-architecture. Different microprocessors with the same micro-architecture may have different amounts of various resources, like the number of processors and clock rate. 3. To be used, microprocessors are embedded in modules, which are grouped into series such as GeForce, Quadro, etc. Confusingly, there is a Tesla computing module that may use any of the Tesla, Fermi, or Kepler micro-architectures. Two different modules using the same microprocessor may have different amounts of memory and other resources. 4. The term GPU sometimes refers to the microprocessor and sometimes to the module. 5. There are three families of modules: GeForce for gamers, Quadro for professionals, and Tesla for computation. 6. Geoxeon has a Quadro K5000 and a Tesla K20xm. 7. Since the highest-end modules don't have video out, they are called something like compute modules. ### 10.2 CUDA versions 1. The CUDA capability version defines the hardware. 2. The CUDA driver and runtime also have a software version, defining things like available C++ functions. The latest is 7.0. ### 10.3 GPU summary Here's a summary of the GPU architecture as I understand it. It's more compact than I've found elsewhere. I'll add to it from time to time. Quoted sizes are for Kepler GK110 with Compute Capability 3.5. Many of the features are new with 3.5. 1. The host is the CPU. 2. The device is the GPU. 3. The device contains 14 streaming multiprocessor extremess (SMXs). (This is for Kepler. For Fermi, they were called streaming multiprocessors (SMs)). 4. A thread is a sequential program with private and shared memory, program counter, etc. 5. Threads are grouped, 32 at a time, into warps. 6. Warps of threads are grouped into blocks. Often the warps are only implicit, and we consider that the threads are grouped directly into blocks. 7. Blocks of threads are grouped into a grid, which is all the threads in the kernel. 8. A kernel is a parallel program executing on the device. 1. The kernel runs potentially thousands of threads. 2. A kernel can create other kernels and wait for their completion. 3. There may be a limit, e.g., 5 seconds, on a kernel's run time. 9. Thread-level resources: 1. Each thread can use up to 255 fast registers. Registers are private to the thread. All the threads' registers are allocated from a fixed pool. The more registers that each thread uses, the fewer threads that can run simultaneously. 2. Each thread has 512KB slow local memory, allocated from the global memory. 3. Local memory is used when not enough registers are available, and to store thread-local arrays. 10. Warp-level resources: 1. Threads are grouped, 32 at a time, into warps. 2. Each warp executes as a SIMD, with one instruction register. At each cycle, every thread in a warp is either executing the same instruction, or is disabled. If the 32 threads want to execute 32 different instructions, then they will execute one after the other, sequentially. If you read in some NVidia doc that threads in a warp run independently, then continue reading the next page to get the info mentioned in the previous paragraph. 3. If successive instructions in a warp do not depend on each other, then, if there are enough warp schedulers available, they may be executed in parallel. This is called Instruction Level Parallelism (ILP). 4. For an array in local memory, which means that each thread will have its private copy, the elements for all the threads in a warp are interleaved to potentially increase the I/O rate. Therefore your program should try to have successive threads read successive words of arrays. 5. A thread can read variables from other threads in the same warp, with the shuffle instruction. Typical operation are to read from the K-th next thread, to do a butterfly permutation, or to do an indexed read. This happens in parallel for the whole warp, and does not use shared memory. 6. A warp vote combines a bit computed by each thread to report results like all or any. 11. Block-level resources: 1. A block may contain up to 1024 threads. 2. Each block has access to 65536 fast 32-bit registers, for the use of its threads. 3. Each block can use up to 49152 bytes of the SMX's fast shared memory. The block's shared memory is shared by all the threads in the block, but is hidden from other blocks. Shared memory is basically a user-controllable cache of some global data. The saving comes from reusing that shared data several times after you loaded it from global memory once. 4. Warps in a block run asynchronously and run different instructions. They are scheduled and executed as resources are available. 5. The threads in a block can be synchonized with __syncthreads(). Because of how warps are scheduled, that can be slow. 6. The threads in a block can be arranged into a 3D array, up to 1024x1024x64. That is for convenience, and does not increase performance. 7. I'll talk about textures later. 12. Streaming Multiprocessor (SMX) - level resources: 1. Each SMX has 192 single-precision CUDA cores, 64 double-precision units, 32 special function units, and 32 load/store units. 2. A CUDA core is akin to an ALU. The cores, and all the units, are pipelined. 3. A CUDA core is much less powerful than one core of an Intel Xeon. My guess is 1/20th. 4. Beware that, in the CUDA C Programming Guide, NVidia sometimes calls an SMX a core. 5. The limited number of, e.g., double precision units means that an DP instruction will need to be scheduled several times for all the threads to execute it. That's why DP is slower. 6. Each SMX has 32K fast 4-byte registers, to be divided among its threads. (Yes this sounds inconsistent with the number of registers per block. When I figure it out, I'll edit this. I'm synthesizing this doc from several sources, some of which are older.) 7. Each SMX has 4 warp schedulers and 8 instruction dispatch units. 8. 64 warps can simultaneously reside in an SMX. 9. Therefore up to 32x64=2048 threads can be executed in parallel by an SMX. 10. Up to 16 blocks that can simultaneously be resident in an SMX. However, if each block uses too many resources, like shared memory, then this number is reduced. Each block sits on only one SMX; no block is split. However a block's warps are executed asynchronously (until synced). 11. Each SMX has 64KiB fast memory to be divided between shared memory and an L1 cache. Typically, 48KiB is used for the shared memory, to be divided among its resident blocks, but that can be changed. 12. The L1 cache can cache local or global memory. 13. Each SMX has a read-only data cache of 48KB to cache the global constant memory. 14. Each SMX has 8 profile counters that can be incremented by any thread. 13. Grid-level resources: 1. The blocks in a grid can be arranged into a 3D array. up to {$ (2^{31}-1, 2^{16}-1, 2^{16}-1) $}. 2. Blocks in a grid are queued and executed as resources are available, in an unpredictable parallel or serial order. Therefore they should be independent of each other. Trying to sync blocks can cause a deadlock. 3. The number of instructions in a kernel is limited. 2M. 4. Any thread can stop the kernel by calling assert. 14. Device-level resources: 1. There is a large and slow 6GB global memory, which persists from kernel to kernel. Transactions to global memory are 128 bytes. Host memory can also be memory-mapped into global memory, although the I/O rate will be lower. Reading from global memory can take hundreds of cycles. A warp that does this will be paused and another warp started. Such context switching is very efficient. Therefore device throughput stays high, although there is a latency. This is called '''Thread Level Parallelism (TLP)''' and is a major reason for GPU performance. That assumes that an SMX has enough active warps that there is always another warp available for execution. That is a reason for having warps that do not use all the resources (registers etc) that they're allowed to. 2. There is a 1536KB L2 cache, for sharing data between SMXs. 3. There is a 64KiB Small and fast global constant memory, , which also persists from kernel to kernel. It is implemented as a piece of the global memory, made fast with caches. (Again, I'm still resolving this apparent contradiction). 4. Grid Management Unit (GMU) schedules (pauses, executes, etc) grids on the device. This is more important because grids can now start other grids (Dynamic Parallelism). 5. Hyper-Q: 32 simultaneous CPU tasks can launch kernels into the queue; they don't block each other. If one kernel is waiting, another runs. 6. CUDA Work Distributor (CWD) dispatches 32 active grids at a time to the SMXs. There may be 1000s of grids queued and waiting. 7. GPU Direct: Other devices can DMA the GPU memory. 15. Refs: 1. http://www.geeks3d.com/20100606/gpu-computing-nvidia-cuda-compute-capability-comparative-table/ 2. The CUDA program deviceDrv. 3. http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf 4. Better Performance at Lower Occupancy, Vasily Volkov, UC Berkeley, 2010. 5. https://www.pgroup.com/lit/articles/insider/v2n1a5.htm - well written but old. 6. http://www.theregister.co.uk/2012/05/15/nvidia_kepler_tesla_gpu_revealed/ (I'll keep adding to this. Suggestions are welcome.) 1. CUDA function qualifiers: 1. __global__ device function called from host, starting a kernel. 2. __device__ device function called from device function. 3. __host__ (default) host function called from host function. 2. CUDA variable qualifiers: 1. __shared__ 2. __device__ global 3. __constant__ 4. (nothing) register if scalar, or local if array or if no more registers available. 2015 Week 5d ### 10.4 CUDA FAQs Here are some that I've found useful. ### 10.5 Useful commands etc 1. nvidia-smi 2. nvcc -o foo -lineinfo -G -O3 -arch=sm_35 --ptxas-options=-v foo.cu -lglut Don't always use all of the above options at once. 2015 Week 5e ### 10.6 Recent CUDA Changes Since the Stanford slides, CUDA has been enhanced in a few ways. 1. Unified Virtual Addressing (UVA). Memory can be allocated on the host, page-locked or pinned into real memory, and directly referenced by a device function via a special pointer. Properties: 1. Typed pointers. Pointers contain info labeling them as on the host or device, so that copying is easier. 2. No need to copy data from the host into device global memory. That saves your time. 3. However each access to host memory is even slower than to device global memory. 4. At some point, copying the host data to device memory becomes advisable. 5. The above two points become less important if the cache is effective. 6. Geoxeon almost never pages, so pinning is not the problem that it might be on other machines. One of my themes is that virtual memory is mostly obsolete since you can buy enough real memory not to have to need to swap. Also, because I/O speeds haven't increased as fast as primary memory got bigger, paging takes too much time. VM's main current use is for memory-mapping files, devices etc. My preferred way to do I/O to a binary file is to simply map the whole file into address space. Linux will then treat the file as it treats virtual memory pages - paging chunks of it into real memory on demand, and paging them out when something else needs the real memory. Reading a writing a file that is in real memory is really fast. On geoxeon, you can also put files in /ds, which is a temp file system in real memory. /ds/ is cleared when geoxeon is rebooted. 7. On geoxeon, /home/parcomp/Class/cuda/stanford/tutorials/vector_addition2.cu is an example. 2. Unified memory. This is a year old. It supplements UVA by automatically caching host data in device global memory. 1. You do not need to copy data between host and device. Programming is easier. 2. Because of the cache, this may execute more quickly than a simply-written program that explicitly copies data, but more slowly than an expertly-written program. That's the usual case with machine optimization. 3. Unified memory builds on, and obsoletes the explicit use of, unified addressing. 4. I recommend that you always use unified memory. 5. On geoxeon, /home/parcomp/Class/cuda/stanford/tutorials/vector_addition3.cu is an example. 2015 Week 5f ### 10.7 Lecture material 1. Review tutorials/sum_reduction.cu 2. lecture_4/cuda_memories.pdf, slides 29-end. 3. Tutorial programs from tutorials/. 1. matrix_multiplication.cu 2. shared_variables.cu 4. Misc CUDA advice for installing it on your own system. They are based on my experience getting things working. 1. When compiling the programs in NVIDIA_CUDA-7.0_Samples, if you get errors about missing libraries libGLU.so, etc, set: GLPATH=/usr/bin Then do make. If you do make -j, it will run in parallel. 2. Assuming that you've installed CUDA into /local/cuda: You cannot simply link the executables in /local/cuda/bin into a dir in your path. nvcc looks for libraries in a directory relative to the location of nvcc. Therefore, you have to add /local/cuda/bin to PATH. 2015 Week 6 ## 11. Mon 3/2/15 (L8) 1. Homework 3 is online, due in 10 days. 2. Nvidia's CUDA occupancy calculator. 3. (Part of) Pixar's keynote address at the 2014 GTC, Using NVIDIA GPUs for Feature Film Production at Pixar. http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php 4. lecture_5/performance_considerations.pdf/ The important point is the general ideas. In your own research, your first job is to get your CUDA project to run correctly. Only then, and only if speed is a problem, do you worry about speed.  lecture_6/parallel_patterns_1.pdf, slides 1-25. This shows good paradigms for parallel algorithms, generally useful. They make your problem faster on any parallel architecture.  ## 12. Thurs 3/5/15 (L9) ### 12.1 Freeze decisions early: SW design paradigm 1. One of my rules is to push design decisions to take effect as early in the process execution as possible. Constructing variables at compile time is best, at function call time is 2nd, and on the heap is worst. 1. If I have to construct variables on the heap, I construct few and large variables, never many small ones. 2. Often I compile the max dataset size into the program, which permits constructing the arrays at compile time. Recompiling for a larger dataset is quick (unless you're using CUDA). Accessing this type of variable uses one less level of pointer than accessing a variable on the heap. I don't know whether this is faster with a good optimizing compiler, but it's probably not slower. 3. If the data will require a dataset with unpredictably sized components, such as a ragged array, then I may do the following. 1. Read the data once to accumulate the necessary statistics. 2. Construct the required ragged array. 3. Reread the data and populate the array. 2015 Week 6b ### 12.2 Lecture material 1. Investigating New Numerical Techniques for Reservoir Simulations on GPUs Hatem Ltaief (Supercomputing Laboratory, KAUST), Ahmad Abdelfattah (Center of Extreme Computing, KAUST), Rio Yokota (Center of Extreme Computing, KAUST) Reservoir simulation involve sparse iterative solvers for linear systems that arise from implicit discretizations of coupled PDEs from high-fidelity reservoir simulators. One of the major bottlenecks in these solvers is the sparse matrix-vector product. Sparse matrices are usually compressed in some format (e.g., CSR, ELL) before being processed. In this talk, we focus on the low-level design of a sparse matrix-vector (SpMV) kernel on GPUs. Most of the relevant contributions focus on introducing new formats that suit the GPU architecture such as the diagonal format for diagonal matrices and the blocked-ELL format for sparse matrices with small dense blocks. However, we target both generic and domain-specific implementations. Generic implementations basically target the CSR and ELL formats, in order to be part of the KAUST-BLAS library. More chances for further optimizations appear when the matrix has specific structure. In the talk, we will present the major design challenges and outlines, and preliminary results. The primary focus will be on the CSR format, where some preliminary results will be shown. The other bottleneck of reservoir simulations is the preconditioning in the sparse matrix solver. We investigate the possibility of a Fast Multipole Method based technique on GPUs as a compute-bound preconditioner. 2. Deploying General Purpose GPUs in a Manufacturing Environment: Software and CFD Development in Bicycle Manufacturing Joe Dutka (Acer Inc.) The session will share the work of the bicycle company Velocite and researchers at NCKU in Taiwan as they used GPU computing to design their next generation of bicycles. The platform was 2 Acer AT350 F2 servers with both Quadro and Tesla cards and Intel Xeon CPUs for hybrid computation. Rather than a theoretical presentation, the session will focus on real-world implementation and the difficulties overcome to develop software and bike design. Taking place at the same time, the bike will debut in the Taipei Bike Show on March 20. 3. lecture_6/parallel_patterns_1.pdf, slides 26-end. 1. work with GPU memory access implementation: Fast to read a block of data and give successive words to successive threads. Interleaving data (between what the threads want) must be read but is not used. prefer SOA (structure of arrays) to AOS (array of structures). This is the opposite of normal good encapsulated SW design. However you could hide this with C++ classes. 2. Linked lists don't parallelize well. 3. Recursion as taught in low-level CS courses often doesn't parallelize well. Nevertheless, since a year ago, a CUDA kernel can start another kernel. The new kernel goes into the queue to be run as resources dictate. 4. Starting a kernel has a cost. 5. reduce 6. split 7. compact / expand 4. lecture_7/parallel_patterns_2.pdf 5. Textures: 1. You read entries from a 1-D, 2-D, or 3-D array. If your subscripts do not hit exactly on an entry, or are outside bounds, the system interpolates a value. 2. You choose the interpolation rule: nearest neighbor, linear, etc. 3. You choose the out-of-bounds rule: clamp, wrap, etc. 4. This is useful whenever interpolation is desired, which is more than just for images. 5. http://www.math.ntu.edu.tw/~wwang/mtxcomp2010/download/cuda_04_ykhung.pdf, slides 21 and up. 6. 0_Simple/simpleTexture/simpleTexture.cu from the CUDA SDK. 7. A texture can have many layers, all of the same size. This is not a mipmap. 8. A cube-map texture has a texture for each face of a cube. 9. 2D and 3D cuda arrays, used for textures, are stored in a space-filling curve order, to increase the probability that two elements with similar indices are close in memory. 10. Surface memory is like a texture than can be written to. However changes may not be visible until a new kernel is executed. 6. More texture refs; to read on your own 7. CUDA and OpenGL 1. You can create a buffer object in OpenGL and then access it from CUDA. 2015 Week 7 ## 13. Mon 3/9/15 (L10) ### 13.1 Homework note For programming questions: please hand in 1. your source code, and 2. sample outputs to show its workings These will be used for grading. We will not run your code. That would be impossible because you're allowed to use any platforms ### 13.2 Videos These are online here: http://www.gputechconf.com/gtcnew/on-demand-gtc.php. 1. CUDA-based Monte Carlo Simulation for Radiation Therapy Dosimetry Nicholas Henderson (Stanford University) Learn about a CUDA adaptation of Geant4, a large-scale Monte Carlo particle physics toolkit. Geant4 was originally designed to support the needs of high energy physics experiments at SLAC, CERN and other places around the world. Geant4 is an extensive toolkit which facilitates every aspect of the simulation process and has been successfully used in many other domains. Current interest is radiation therapy dosimetry. For this application the geometry is simple and the model physics is limited to low energy electromagnetics. These features allow efficient tracking of many particles in parallel on the GPU. Let's watch the whole video. ### 13.3 CUDA Unified Memory 1. http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/ Now data structures with pointers can be migrated between host and device (w/o updating the pointers). The total amount of managed memory is limited to the amount of GPU memory. 2. My code to add two vectors: add_unimem.cu. In this simple example, unified memory is slower. I'm now testing it on a more compute-intensive example (matrix multiplication), and will report when there are results. 3. Nvidia demo code: https://github.com/parallel-forall/code-samples/tree/master/posts/unified-memory. My cache: cuda/git/code-samples-master/posts/unified-memory ### 13.4 Stanford lecture on Thrust Thrust: 1. Thrust is an API that looks like STL. Its backend can be CUDA, OpenMP, or sequential host-based code. 2. Functional-programming philosophy. 3. Easier programming, once you get used to it. 4. Code is efficient. 5. Uses some unusual C++ techniques, like overloading operator(). 6. Since the Stanford slides were created, Thrust has adopted unified addressing, so that pointers know whether they are host or device. 7. Various little demo programs I'll upload to the web, and which will be on geoxeon in /pc/thrust . 8. CUDACast videos on Thrust: CUDACast #15 - Introduction to Thrust CUDACast #16 - Thrust Algorithms and Custom Operators 2015 Week 7a ## 14. Thurs 3/12/15 (L11) ### 14.1 Videos These are online here: http://www.gputechconf.com/gtcnew/on-demand-gtc.php. 1. (Your suggestions are welcome.) 2. CUDA Debugging with Command Line Tools Vyas Venkataraman (NVIDIA) CUDA debugging tools CUDA-GDB and CUDA-MEMCHECK provide a whole new feature set to help improve your CUDA application development cycle. This session is a detailed walk-through of the key debugger features and advanced techniques on using printf, CUDA-GDB and MEMCHECK together to improve overall code productivity on Linux and MacOS platforms. 3. CUDA Debugging Tools: CUDA-GDB and CUDA-MEMCHECK Vyas Venkataraman (NVIDIA) Advance debugging techniques with CUDA-GDB and CUDA-MEMCHECK will be covered that can assist even the most advanced CUDA developer in locating program correctness issues. - Watch this on your own if you want more info. 4. Exploiting the GPU for High Performance Geospatial Situational Awareness Involving Massive and Dynamic Data Sets Bart Adams (Luciad) Geospatial Situational Awareness(SA)engines face stringent accuracy and performance requirements. Large volumes of static and dynamic data need to be analyzed and visualized, in both 2D and 3D in various geographic projections, at sub-centimeter accuracy and interactive update rates. In contrast to game engines where this data can be pre-processed and stored in optimized data structures, the data comes in any form and needs to be interpreted on-the-fly. This talk will discuss these challenges and the advanced GPU rendering techniques and algorithms that address them. We will show that by exploiting the GPU, terabytes of terrain and imagery data, in combination with highly dynamic data streams that can contain millions of tracks and multiple radar feeds as well as orthorectified UAV video streams, can be handled on a world-scale theater at update rates of over 60Hz 2015 Week 7b ### 14.2 Thrust experiments Thrust files on geoxeon are in /pc/thrust/ . 1. thrust_examples contains files from the thrust distribution. 2. rpi contains files I wrote. 3. There is also some doc. For students using their own computer, I've mirrored this in thrust . Using /thrust/rpi/sortreduce.cu, I did some experiments on sorting and reducing a 500M element vector on the host and on the device. The times, in seconds: Operation Time on host Time on device Create a vector .74 .014 Transform a vector with a simple functor .8 .012 Sort 11. .66 Reduce .5 .013 Copy from device to host .6 The device is 20x to 50x faster than the host on these simple operations. ### 14.3 Sorting with Thrust on device Even if most of your work is on the host, it can pay to offload some work to the device. E.g., sorting 1,000,000,000 ints takes 6.2 seconds (floats take 8.4 secs). That excludes time to create the array, which you'd have to do anyway, but includes time for the device to read and write the data (via unified memory; this is too much data to fit in the device). A complete program is on geoxeon in /pc/thrust/rpi/devicesort3.cu . The critical code is this:  int *p; cudaHostAlloc(&p, N*sizeof(T), cudaHostAllocMapped); // here write your data into p thrust::sort(thrust::device, p, p+N);  This requires Thrust 1.7 (released in 2013), and uses Unified Memory. Getting the program this small took a lot of time. Sort has a new argument saying to use the device. There is no need now for distinguished host and device pointers. This is only one way that CUDA and Thrust are getting easier to use. Even a simple op like comparing two arrays is a few times faster on the device (including the access time back to the host). If the data is already on the device, it's many times faster. One problem with using these new features is that they are sometimes buggy. Last year I found an error in thrust::tabulate. A few hours after I reported it, the error was acknowledged and fixed in the development version. That's service! Earlier, I found an error in boost that was acknowledged and fixed. SW seems to break when I use it. Thrust intro documentation is here: http://docs.nvidia.com/cuda/thrust/. The git page, with complete documentation, intro documentation, and many examples is here: http://thrust.github.io/. Note that rewriting the examples to use unified memory and version 1.7 would make them shorter and clearer. Nevertheless, the examples teach you how to easily do things that you mightn't have thought easy at all. 2015 Week 7c ### 14.4 Faster graphical access to geoxeon X over ssh is very slow. Here are some things I've discovered that help. 1. Use a faster, albeit less secure, cipher: ssh -c arcfour,blowfish-cbc -C geoxeon.ecse.rpi.edu 2. Use xpra; here's an example: 1. On geoxeon: xpra start :77; DISPLAY=:77 xeyes&  Don't everyone use 77, pick your own numbers in the range 20-99. 2. On server, i.e., your machine: xpra attach ssh:geoxeon.ecse.rpi.edu:77 3. I suspect that security is weak. When you start an xpra session, I suspect that anyone on geoxeon can display to it. I suspect that anyone with ssh access to geoxeon can try to attach to it, and the that 1st person wins. 3. Use nx, which needs a server, e.g., FreeNX. 2015 Week 8 ## 15. Mon 3/16/2015 (L12) ### 15.1 Videos These are online here: http://www.gputechconf.com/gtcnew/on-demand-gtc.php. 1. CUDA Application Development Life Cycle Using NVIDIA(R) Nsight(TM), Eclipse Edition Satish Salian (NVIDIA) NVIDIA(R) Nsight(TM), Eclipse Edition for Linux and Mac is an all-in-one development environment that lets you develop, debug and optimize CUDA code in an integrated UI environment. This session demonstrates all of Nsight's capabilities, including the CUDA aware source editor, build integration of the CUDA tool chain, graphical debugger for both CPU and GPU, and graphical profiler. New features in the upcoming release will be revealed. ### 15.2 Thrust notes 1. Thrust is fast because the functions that look like they would need linear time really take only log time in parallel. 2. In functions like reduce and transform, you often see an argument like thrust::multiplies<float>(). The syntax is as follows: 1. thrust::multiplies<float> is a class. 2. It overloads operator(). 3. However, in the call to reduce, thrust::multiplies<float>() is calling the default constructor to construct a variable of class thrust::multiplies<float>, and passing it to reduce. 4. reduce will treat its argument as a function name and call it with an argument, triggering operator(). 5. You may also create your own variable of that class, e.g., thrust::multiplies<float> foo. Then you just say foo in the argument list, not foo(). 6. The optimizing compiler will replace the operator() function call with the defining expression and then continue optimizing. So, there is no overhead, unlike if you passed in a pointer to a function. 3. Sometimes, e.g., in saxpy.cu, you see saxpy_functor(A). 1. The class saxpy_functor has a constructor taking one argument. 2. saxpy_functor(A) constructs and returns a variable of class saxpy_functor and stores A in the variable. 3. The class also overloads operator(). 4. (Let's call the new variable foo). foo() calls operator() for foo; its execution uses the stored A. 5. Effectively, we did a closure of saxpy_functor; this is, we bound a property and returned a new, more restricted, variable or class. 4. The Thrust examples teach several non-intuitive paradigms. As I figure them out, I'll describe a few. My descriptions are modified and expanded versions of the comments in the programs. This is not a list of all the useful programs, but only of some where I am adding to their comments. 1. arbitrary_transformation.cu and dot_products_with_zip.cu. show the very useful zip_iterator. Using it is a 2-step process. 1. Combine the separate iterators into a tuple. 2. Construct a zip iterator from the tuple. Note that operator() is now a template. 2. boundingbox.cu finds the bounding box around a set of 2D points. 1. The main idea is to to a reduce. However, the combining operation, instead of addition, is to combine two bounding boxes to find the box around them. 3. bucket_sort2d.cu overlays a grid on a set of 2D points and finds the points in each grid cell (bucket). 1. The tuple is an efficient class for a short vector of fixed length. 2. Note how random numbers are generated. You combine an engine that produces random output with a distribution. However you might need more complicated coding to make the numbers good when executing in parallel. See monte_carlo_disjoint_sequences.cu. 3. The problem is that the number of points in each cell is unpredictable. 4. The cell containing each point is computed and that and the points are sorted to bring together the points in each cell. 5. Then lower_bound and upper_bound are used to find each bucket in that sorted vector of points. 6. See the lower_bound description in http://thrust.github.io/doc/group__vectorized__binary__search.html . 4. expand.cu takes a vector like V= [0, 10, 20, 30, 40] and a vector of repetition counts, like C= [2, 1, 0, 3, 1]. Expand repeats each element of V the appropriate number of times, giving [0, 0, 10, 30, 30, 30, 40]. The process is as follows. 1. Since the output vector will be longer than the input, the main program computes the output size, byt reduce summing C, and constructs a vector to hold the output. 2. Exclusive_scan C to obtain output offsets for each input element: C2 = [0, 2, 3, 3, 6]. 3. Scatter_if the nonzero counts into their corresponding output positions. A counting iterator, [0, 1, 2, 3, 4] is mapped with C2, using C as the stencil, giving C3 = [0, 0, 1, 3, 0, 0, 4]. 4. An inclusive_scan with max fills in the holes in C3, to give C4 = [0, 0, 1, 3, 3, 3, 4]. 5. Gather uses C4 to gather elements of V: [0, 0, 10, 30, 30, 30, 40]. 5. mode.cu shows: 1. Counting the number of unique keys in a vector. 1. Sort the vector. 2. Do an inner_product. However, instead of the operators being times and plus, they are not equal to the next element and plus. 2. Counting their multiplicity. 1. Construct vectors, sized at the number of unique keys, to hold the unique keys and counts. 2. Do a reduce_by_keys on a constant_iterator using the sorted vector as the keys. For each range of identical keys, it sums the constant_iterator. That is, it counts the number of identical keys. 3. Write a vector of unique keys and a vector of the counts. 3. Finding the most used key (the mode). 1. Do max_element on the counts vector. 6. repeated_range.cu repeats each element of an N-vector K times: repeated_range([0, 1, 2, 3], 2) -> [0, 0, 1, 1, 2, 2, 3, 3]. It's a lite version of expand.cu, but uses a different technique. 1. Here, N=4 and K=2. 2. The idea is to construct a new iterator, repeated_range, that, when read and incremented, will return the proper output elements. 3. The construction stores the relevant info in structure components of the variable. 4. Treating its value like a subscript in the range [0,N*K), it divides that value by K and returns that element of its input. See also strided_range.cu and tiled_range.cu. 7. set_operations.cu. This shows methods of handling an operation whose output is of unpredictable size. The question is, is space or time more important? 1. If the maximum possible output size is reasonable, then construct an output vector of that size, use it, and then erase it down to its actual size. 2. Or, run the operation twice. The 1st time, write to a discard_iterator, and remember only the size of the written data. Then, construct an output vector of exactly the right size, and run the operation again. I use this technique a lot with ragged arrays in sequential programs. 8. sparse_vector.cu represents and sums sparse vectors. 1. A sparse vector has mostly 0s. 2. The representation is a vector of element indices and another vector of values. 3. Adding two sparse vectors goes as follows. 1. Allocate temporary index and element vectors of the max possible size (the sum of the sizes of the two inputs). 2. Catenate the input vectors. 3. Sort by index. 4. Find the number of unique indices by applying inner_product with addition and not-equal-to-next-element to the indices, then adding one. E.g., applied to these indices: [0, 3, 3, 4, 5, 5, 5, 8], it gives 5. 5. Allocate exactly enough space for the output. 6. Apply reduce_by_key to the indices and elements to add elements with the same keys. The size of the output is the number of unique keys. 9. More Thrust tutorials; the different wording might be more attractive. FYI. 1. Boston U 2. Thrust - A Productivity-Oriented Library for CUDA. This was listed on http://code.google.com/p/thrust/downloads/list . 3. Thrust by example, part 2. Some old stuff, well reviewed, and some new stuff. 10. Today I'll cover more thrust examples. Online site: http://code.google.com/p/thrust/source/browse/#hg%2Fexamples%253Fstate%253Dclosed Geoxeon copy: /pc/thrust/thrust_examples . ## 16. Thurs 3/19/15 (L11) ### 16.1 Thrust We'll look at some examples, like expand and bucket_sort2d. ### 16.2 Videos Let's watch the Nvidia GTC keynote speech of Jeff Dean, Google Senior Fellow. Agreed, at 26 hours old (given Wednesday, March 18, 11:00 AM PT), it is obsolete in internet terms, but it's still interesting. ### 16.3 Homeworks If people want, we can extend HW4. Homework 5 is online in Homeworks, due April 2. ### 16.4 Term project It's time to start thinking of your term projects. They can be anything demonstrating a knowledge of topics covered in this course -- either an implementation or a research paper. Combining your own research with a term project is fine. 2015 Week 9 ## 17. Mon 3/30/15 (L12) ### 17.1 Thrust #### How to sort many sets of numbers 1. What's the best way to sort 16000 sets of 1000 numbers each? E.g., sort the rows of a 16000x1000 array? On geoxeon, /pc/thrust/rpi/comparesorts.cu, which I copied from stackoverflow, compares three methods. 2. Call the thrust sort 16000 times, once per set. That took 10 secs. 3. Sort the whole list of 16,000,000 numbers together. Then sort it again by key, with the keys being the set number, to bring the elements of each set together. Since the sort is stable, this maintains the order within each set. (This is also how radix sort works.) That took 0.04 secs. 4. Call a thrust function (to sort a set) within another thrust function (that applies to each set). This is new in Thrust 1.8. That took 0.3 secs. This is a surprising and useful paradigm. It works because 1. There's an overhead to starting each thrust function, and 2. Radix sort, which thrust uses for ints, takes linear time. #### Mapped or managed memory, and trapping errors The new version of /pc/thrust/rpi/sortreduce.cu shows: 1. How to access mapped or managed memory from thrust functions. My current opinion is: don't. Explicitly copying host to device is often faster, and is easier to code. 2. Trapping Cuda errors. (code is in the new mycuda.h). Note that a thrust function can return asynchronously before finishing, so that errors won't be reported until later. Note that sortreduce calls some unnecessary functions, such as generate, just to measure their time. #### More sample programs 2015 Week 9a ### 17.2 Error about missing CUDA 5 files etc. 1. A missing library may be caused by running a program compiled last year. We're now on CUDA 7, and the old libraries were removed. Tell me which the problem program is and I'll recompile it. Or, you can copy it to your dir and do that. 2. Alternatively, your Makefile may be explicitly trying to use 5, or perhaps PATH or LD_LIBRARY_PATH do. Change it. 3. Some demo programs may be trying to include a particular obsolete version of a file. Changing the version number in the source usually works. 4. The Makefile for the demo programs may be trying to generate code for the very obsolete sm_10 version. Changing to sm_35 works. ### 17.3 GPU Techology Conference talk 1. http://www.gputechconf.com/gtcnew/on-demand-gtc.php Accelerated JavaScript: How to Access the GPU Without Leaving the Comfort of JavaScript Norman Rubin (NVIDIA) JavaScript is often considered a low performance scripting language but lately more and more developers are recognizing that the portability of JavaScript makes it an ideal target for web applications. This talk will describe a research prototype that allows JavaScript to access accelerators using natural JavaScript idioms, no GPU knowledge required. Even on lower performance GPU's, the implementation delivers substantial performance for many applications. The talk will include a demo. 2015 Week 9b ### 17.4 cuFFT Notes 1. GPU Computing with CUDA, Lecture 8 - CUDA Libraries - CUFFT, PyCUDA from Christopher Cooper, BU 2. CUDACast #8 - CUDA 5.5 cuFFT FFTW API Support. 3. cuFFT is inspired by FFTW (the fastest Fourier transform in the west), which they say is so fast that it's as fast as commercial FFT packages. 4. I.e., sometimes commercial packages may be worth the money. 5. Although the FFT is taught for N a power of two, users often want to process other dataset sizes. 6. The problem is that the optimal recursion method, and the relevant coefficients, depends on the prime factors of N. 7. FFTW and cuFFT determine the good solution procedure for the particular N. 8. Since this computation takes time, they store the method in a plan. 9. You can then apply the plan to many datasets. 10. If you're going to be processing very many datasets, you can tell FFTW or cuFFT to perform sample timing experiments on your system, to help in devising the best plan. 11. That's a nice strategy that some other numerical SW uses. 12. One example is Automatically Tuned Linear Algebra Software (ATLAS) . ### 17.5 GTC 2015 talk Multiphysics Simulation Using GPUs Arman Pazouki (University of Wisconsin-Madison) We present a GPU-based framework for the fully-resolved simulation of interacting rigid and deformable solid objects that move in fluid flow. The fluid dynamics is based on a meshless approach. Moving Lagrangian markers, distributed in the fluid domain as well as on the solid surfaces, are used to capture the fluid dynamics, fluid-solid, and solid-solid interactions. Mass and momentum exchange between neighbor markers are determined in a parallel spatial subdivision algorithm. The solid objects' distributed forces are reduced in parallel via thrust reduction algorithms and used later for temporal update via lightweight GPU kernels. Scenarios containing tens of thousands of floating rigid and flexible objects were exercised on several GPU architectures and the linear scalability was shown. Stereovision and the Future of Autonomous Machines Edwin Azzam (STEREOLABS) Discover how stereovision and 3D depth sensing on mobile GPUs enable the development of future autonomous cars, drones and robots. We will discuss the benefits and challenges of using stereo cameras as depth sensing sensors, and how to leverage the power of embedded GPU to overcome these challenges. We will also show demonstrations of how the technology can be used to create 3D surrounding reconstruction, detect obstacles and navigate autonomously. CUDA 7 and Beyond Mark Harris (NVIDIA) CUDA is NVIDIA's parallel computing platform and programming model. In this talk you'll learn how new support for C++11 in CUDA 7, along with new features and performance improvements in the Thrust C++ parallel algorithms library, and support for runtime compilation, makes parallel C++ more productive than ever. CUDA 7 also includes cuSOLVER, a new direct linear solver library, as well as new features and improved performance in other CUDA libraries. In this talk you'll hear about these features and get insight into the philosophy driving the development of CUDA and how it will take advantage of current and future GPUs. You will learn about NVIDIA's vision for CUDA and the challenges for the future of parallel software development. 2015 Week 9c ### 17.6 cuBLAS etc Notes 1. BLAS is an API for a set of simple matrix and vector functions, such as multiplying a vector by a matrix. 2. These functions' efficiency is important since they are the basis for widely used numerical applications. 3. Indeed you usually don't call BLAS functions directly, but use higher-level packages like LAPACK that use BLAS. 4. There are many implementations, free and commercial, of BLAS. 5. cuBLAS is one. 6. One reason that Fortran is still used is that, in the past, it was easier to write efficient Fortran programs than C or C++ programs for these applications. 7. There are other, very efficient, C++ numerical packages. (I can list some, if there's interest). 8. Their efficiency often comes from aggressively using C++ templates. 9. matrix-multiplication-cuda-cublas-curand-thrust/ ### 17.7 Term project dates 1. Mon 4/7 Proposal 1. Title 2. Team members - name, email 3. What you propose to do: implement (what, on what computer?), survey, ... 4. Why this is interesting. 2. Mon 4/20 2-minute progress report (1 per team) to class. 3. Thurs 5/7, Mon 5/11: 5-10 minute class presentation per team. Email Hanchao with your preferred date and length (5 or 10 minutes). 4. Mon 5/11: project due. 5. by mutual agreement: optional demos to Hanchao. 2015 Week 10 ## 18. Mon 4/6/15 (L14) Lecture notes that I wrote on paper are online here: notes1.pdf. 2015 Week 10a ### 18.1 Supercomputing videos I'll show that start of various videos in class. If you're interested, you can watch the rest on your own. 1. Jack Dongarra: Algorithmic and Software Challenges For Numerical Libraries at Exascale This was presented at the Exascale Computing in Astrophysics Conference 2013, whose Youtube channel is here . 2. High Performance Computing Innovation Center at LLNL from the 2014 HPCAC Stanford HPC & Exascale Conference. 3. Petros Koumoutsakos: Fast Machines and Cool Algorithms for (Exascale) Flow Simulations from the 2014 HPCAC Stanford HPC & Exascale Conference. 4. Architecture-aware Algorithms for Peta and Exascale Computing - Dongarra at SC13. 5. Lorena Barba Presents: Flying Snakes on GPUs at SC13. 6. Joel Primack: Supercomputing the Universe, Conference Exascale Computing in Astrophysics. 7. http://insidehpc.com/2015-stanford-hpc-conference-video-gallery/ Scientific computing with Intel Xeon Phi coprocessors 2015 Week 10b ### 18.2 Matlab 1. Good for applications that look like matrices. Considerable contortions required for, e.g., a general graph. You'd represent that with a large sparse adjacency matrix. 2. Using explicit for loops is slow. 3. Efficient execution when using builtin matrix functions, but can be difficult to write your algorithm that way, and difficult to read the code. 4. Very expensive and getting more so. Many separately priced apps. 5. Uses state-of-the-art numerical algorithms. E.g., to solve large sparse overdetermined linear systems. Better than Mathematica. 6. Most or all such algorithms also freely available as C++ libraries. However, which library to use? Complicated calling sequences. Obscure C++ template error messages. 7. Graphical output is mediocre. Mathematica is better. 8. Various ways Matlab can execute in parallel 1. Operations on arrays can execute in parallel. E.g. B=SIN(A) where A is a matrix. 2. Automatic multithreading by some functions Various functions, like INV(a), automatically use perhaps 8 cores. The '8' is a license limitation. Which MATLAB functions benefit from multithreaded computation? 3. PARFOR Like FOR, but multithreaded. However, FOR is slow. Many restrictions, e.g., cannot be nested. http://www.mathworks.com/help/distcomp/introduction-to-parallel-solutions.html Start pools first with: MATLABPOOL OPEN 12 Limited to 12 threads. Can do reductions. 4. Parallel Computing Server This runs on a parallel machine, including Amazon EC2. Your client sends batch or interactive jobs to it. Many Matlab toolboxes are not licensed to use it. This makes it much less useful. 5. GPU computing Create an array on device with gpuArray Run builtin functions on it. http://www.mathworks.com/help/distcomp/run-built-in-functions-on-a-gpu.html 2015 Week 10c ### 18.3 Mathematica in parallel. You terminate an input command with shift-enter. Some Mathematica commands;  Sin[1.] Plot[Sin[x],{x,-2,2}] a=Import[ "/opt/parallel/mathematica/mtn1.dat"] Information[a] Length[a] b=ArrayReshape[a,{400,400}] MatrixPlot[b] ReliefPlot[b] ReliefPlot[b,Method->"AspectBasedShading"] ReliefPlot[MedianFilter[b,1]] Dimensions[b] Eigenvalues[b] When you get bored waiting, type alt-. Eigenvalues[b+0.0] Table[ {x^i y^j,x^j y^i},{i,2},{j,2}] Flatten[Table[ {x^i y^j,x^j y^i},{i,2},{j,2}],1] StreamPlot[{x*y,x+y},{x,-3,3},{y,-3,3}]$ProcessorCount
$ProcessorType Select Parallel Kernel Configuration'' and Status in the Evaluation menu'' ParallelEvaluate[$ProcessID]
PrimeQ[101]
Parallelize[Table[PrimeQ[n!+1],{n,400,500}]]
merQ[n_]:=PrimeQ[2^n-1]
Select[Range[5000],merQ]
ParallelSum[Sin[x+0.],{x,0,100000000}]
Parallelize[  Select[Range[5000],merQ]]
CUDAInformation[]
Manipulate[n, {n, 1.1, 20.}]
Plot[Sin[x], {x, 1., 20.}]
Manipulate[Plot[Sin[x], {x, 1., n}], {n, 1.1, 20.}]
Integrate[Sin[x]^3, x]
Manipulate[Integrate[Sin[x]^n, x], {n, 0, 20}]
Manipulate[{n, FactorInteger[n]}, {n, 1, 100, 1}]
Manipulate[Plot[Sin[a x] + Sin[b x], {x, 0, 10}],
{a, 1, 4}, {b, 1, 4}]


Unfortunately there's a problem that I'm still debugging with the Mathematica - CUDA interface.

2015 Week 10d

## 19.  Thurs 4/9/15 (L15)

Homework 6 is online at Homeworks. It's easy. Present next Thurs.

1. http://insidehpc.com/2015-stanford-hpc-conference-video-gallery/
1. HPC in an Hour: Tips and tricks to build your own HPC Cluster (Steve Jones, Stanford)
2. Accelerating Big Data Processing with Hadoop, Spark and Memcached (Dhabaleswar K. Panda, Ohio State University)
2. GPU Computing with CUDA Lecture 7 - CUDA Libraries - Cusp by Christopher Cooper Boston University

2015 Week 11

## 20.  Mon 4/13/15 (L16)

The material is from Wikipedia, which appeared better than any other sources that I could find.

1. Hierarchy:
1. IaaS (Infrastructure as a Service)
1. Sample functionality: VM, storage
2. Examples:
2. Amazon_Web_Services
3. OpenStack : compute, storage, networking, dashboard
2. PaaS (Platform ...)
1. Sample functionality: OS, Web server, database server
2. Examples:
1. OpenShift
2. Cloud_Foundry
1. distributed FS, Map Reduce
2. derived from Google FS, map reduce
3. SaaS (Software ...)
1. Sample functionality: email, gaming, CRM, ERP
2. Cloud_computing_comparison
3. Virtual machine
4. Distributed storage
1. VNC
2. Grid_computing
1. decentralized, heterogeneous
2. used for major projects like protein folding

### 20.2  GTC 2015 talks

Three Ways to Debug Parallel CUDA Applications: Interactive, Batch, and Corefile (26 min)

Far Cry 4 and Assassin's Creed Unity: Spicing Up PC Graphics with NVIDIA GameWorks (25 min)

Thrust++ : Portable, Abstract Library for Medical Imaging Applications (22 min)

## 21.  Thurs 4/16/15 (L17)

### 21.1  Papers selected by students

1. Hui Lin and Li Dong. Survey on computational finance with GPU. https://people.maths.ox.ac.uk/gilesm/talks/pp10_key.pdf
2. Peter Katlic.
GTC 2015 – ID S5612
Title: Fighting Malware With GPUs in Real Time
Presenter: Peter Kovac (Avast Software)
Abstract:
Dive deep into the problematic of protecting electronic devices such as PC's, smartphones or tablets against malicious software. In this talk we will show you how we handle the ever increasing number of malware samples produced by the malware ecosystem every day. To leverage similarities between the samples for automated classification we built a distributed database engine relying on GPUs. With query times of fraction of a second even using a compound distance function, this system is able to classify incoming samples in real time. Samples classified as malware are directly used to generate rules to identify similar samples in machines of our customers.

2015 Week 12

## 22.  Mon 4/20/15 (L18)

### 22.1  Revised notes on the potential energy homework

1. My files are in potential-energy-hw/, with a copy on geoxeon in /pc/potential-energy-hw/ .
2. Double precision is necessary for the total. The problem is that there are a few large components (when 2 points are close) and many small ones. Collectively, the small ones add up to a lot. However, with single precision, once one big component is added in, adding individual small components gets mostly lost.
If double precision wasn't available, you might sort the components by absolute value, and then add them in order from small to large. However sorting takes time. It might be faster, and sufficient, to approximately sort them.
3. I have several implementations showing different techniques:
After the lecture, I renamed the programs more logically: omp->omp-double, omp2->omp-single, thr->thr-double, thr2->thr-single.
1. omp-double.cc uses OpenMP with two nested for loops.
However the iterations of the outer loop different amount of work, so all the processors do not stay fully used.
Using schedule dynamic helps here.
Also, using more OpenMP threads than can execute simultaneously has the same effect. It means that there are always some threads queued ready to run. Do this by setting in the shell before running the program, say,
However in most cases, don't worry about wasting perhaps half of the processors.
2. omp-single.cc shows a different OpenMP solution to avoid some iterations of the outer loop taking longer. Here, I use only one loop, which iterates over all the pairs of points. However, it is not much faster. The problem is that converting from the index of a pair to the numbers of the two points takes too long. E.g., it calls sqrt. A web search found ways to make this a little faster. However, why?
Note the long long int required for the number of pairs.
3. thr-single.cu uses Thrust. Again, I use one loop and convert from a pair index to the numbers of the two points.
The problem here is that Thrust apparently doesn't handle ''long long int'', so I can't have more than 2G pairs, or 60,000 points.
That might be related to the atomic operations implemented for 4-byte variables but not on 8-byte ones.
I reported the issue on stackoverflow and github.
4. thr-double.cu uses Thrust with nested transform-reduces. I copied the idea from the comparesorts.cu example that I got from stackoverflow, which is I mentioned earlier.
Note how to get global variables into a device function. The idea is the usual defining a new class with the usual overloaded operator(). Then a variable of that class is constructed to store the global variable locally. It's weird and I don't fully understand why it works.
I often found it easier to use macros rather than overloading operator() in new classes. Macros do have their uses.
thr-double gives a different answer by 0.1% than the others. I don't know why, but probably there's a roundoff problem somewhere.
4. I converted a string containing n to an int by reading from the string. There are other ways to do this.
5. Here are elapsed times on geoxeon, which can run 32 CPU threads or 2688 CUDA cores on the K20Xm GPU. The times varied somewhat; these are the minima from several runs.
n    omp-    omp-    thr-    thr-
double    single    single    double
------- ------- ------- -------
60,000 2.36 2.17 0.45 0.55
200,000 26. - - 2.24
1,000,000 655. - - 46.
2,000,000 - - - 174.

Conclusions (for this problem):

1. Using one loop is a little better than two if n is small enough.
2. Otherwise, Thrust in CUDA is over 10x faster than OpenMP.
It processes 10,000,000,000 point pairs per second. That's isn't bad.

### 22.2  Continue student presentations of interesting papers

1. The Future of human vision presented by Zhongqing Han.
2. Augmented reality presented by Kaitlyn Taboada.
3. S5398-Stephen-Jones-Adam-Lichtl.pdf presented by Clayton Rayment and Nick White.
4. S5715-Keynote-Jen-Hsun-Huang.pdf presented by Mohit Tyagi.
5. True CMYK and N-Channel Rendering on the GPU by Nathan Carr, Principal Research Scientist, Adobe Systems. The GPU Tech Conference ID is S5285, presented by Christopher Ngai.
6. Gfiles:hw6-presentations/)S4359-rt-em-wave-propagation-optix-sims-car-to-car-communication.pdf presented by Huiye Liu.

2015 Week 12a

## 23.  Thurs 4/23/15 (L19)

### 23.1  Arrayfire

1. This is a set of C++ classes and functions, which you can combine with your other C++ code.
2. The backend is CPU, CUDA, or OpenCL.
3. It claims to be very fast for matrix ops, including FFT etc.
4. This presumably calls cuFFT and cuBLAS, and so gives a unified interface to them.
5. Use this as a free access to fast parallel matrix ops, when your data format is naturally a matrix.
6. The web site is http://www.arrayfire.com/ . It has a lot of documentation.
7. There is useful info on stackoverflow.
8. I've installed it, with the CPU and CUDA backends, on geoxeon in standard places.
9. To compile a program, put this line into ~/.bashrc
export LD_LIBRARY_PATH=/local/lib:/local/cuda-7.0/lib64
start a new shell and then do one of these:
1. g++ foo.cc -lafcpu or
2. g++ foo.cc -lafcuda
(That info was not in the documentation.)
10. You might sometimes need to add -lpthread.
11. I copied the helloworld, blas, and fft programs to /pc/arrayfire/ .
12. The package is in /opt/arrayfire/ . It has many examples. The sources are in /opt/arrayfire/examples/ and the executables are in /opt/arrayfire/build/examples/
14. They prefer that you use cmake to compile your programs, but you don't have to for simple programs.
15. As with most C++ programs that aggressively use templates, small errors can cause big lists of obscure errors. E.g., I added a wrong include directory when compiling helloworld and got 2000 lines of errors. I omitted -lafcpu, i.e., omitted a link time option, and got errors in the compilation stage.
16. Being new, there are some bugs. However, it does look useful.

### 23.2  Term project progress reports

Per the term project dates announced earlier, each team should give a 2-minute summary of progress to date. We'll do those today (if time) and Thurs.

### 23.3  Another interesting recent conference

2015 SIAM Computational Science and Engineering, March 2015.

This is the list of SIAM conferences with available material.

2015 Week 13

## 24.  Mon 4/27/15 (L20)

1. Final progress reports
2. NVidia GTC talks from, e.g., https://registration.gputechconf.com/form/session-listing
3. S5566 - GPU Errors on HPC Systems: Characterization, Quantification and Implications for Architects and Operations http://on-demand.gputechconf.com/gtc/2015/presentation/S5566-James-Rogers.pdf
4. S5530 - Featured Talk: Memory Management Tips, Tricks and Techniques http://on-demand.gputechconf.com/gtc/2015/presentation/S5530-Stephen-Jones.pdf
5. S5820B - Featured Talk: CUDA 7 and Beyond http://on-demand.gputechconf.com/gtc/2015/video/S5820B.html

## 25.  Thurs 4/30/15 - no class

I was at Harvard at the 2015 CGA Conference: The Lab for Computer Graphics and Spatial Analysis (1965-1991) and Its Legacy. "1965 was a seminal year in the history of GIS, as the Harvard Laboratory for Computer Graphics (subsequently Computer Graphics and Spatial Analysis, LCGSA ) embarked on a 20+ year journey of research and development in theoretical geography, computer cartography, spatial analysis, and environmental design, which gave us many of today's essential ideas and early versions of tools now embedded in GIS, remote sensing, geospatial science, geodesign, and online culture."

Harvard closed the lab in the early 1980s. Can't have mere tools (i.e., computers) compromising pure academic research.

2015 Week 14

## 26.  Mon 5/4/15 (L21)

### 26.1  Term project

#### Talks

Contact Hanchao Liu to reserve a spot for your talk on Thurs or Mon. Your talk's length may be from 8 to 15 minutes (per team). Monday is reserved for 8-minute talks. Thurs talks may be any length.

#### Working?

Your project does not need to have accomplished what you originally proposed. However, then you have to convince us, in detail, that you worked hard and learned a lot. It would be good to get something smaller working.

### 26.2  New material

1. Jack Dongarra: Fault Tolerance in Numerical Library Routines
2. Jack Dongarra: HPCG One Year Later
3. http://www.vrworld.com/2015/03/23/jack-dongarra-on-the-great-exascale-challenge-and-rising-hpc-powers/
4. NEC's vector processor HC26-S1: High Performance Computing

## 27.  Wed 5/6/15

Consider attending Chris Carothers's talk on Center Computational Innovations at 4pm in DCC 324. This is part of ECSE's Mercer lecture series.

## 28.  Thurs 5/7/15 (L22)

### 28.2  Inspiration for finishing your term projects

1. The Underhanded C Contest
"The goal of the contest is to write code that is as readable, clear, innocent and straightforward as possible, and yet it must fail to perform at its apparent function. To be more specific, it should do something subtly evil. Every year, we will propose a challenge to coders to solve a simple data processing problem, but with covert malicious behavior. Examples include miscounting votes, shaving money from financial transactions, or leaking information to an eavesdropper. The main goal, however, is to write source code that easily passes visual inspection by other programmers."
2. The International Obfuscated C Code Contest
3. https://www.awesomestories.com/asset/view/Space-Race-American-Rocket-Failures
Moral: After early disasters, sometimes you can eventually get things to work.
4. THE 'WRONG' BROTHERS AVIATION'S FAILURES (1920s)
5. Early U.S. rocket and space launch failures and explosion
6. Numerous US Launch Failures

### 28.3  Memory

Since memory limits the performance of many current programs, here are some interesting refs.

1. http://stackoverflow.com/questions/16699247/what-is-cache-friendly-code?lq=1
4. http://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software
5. http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array
6. http://en.wikipedia.org/wiki/Branch_predictor
7. Bit Twiddling Hacks by Sean Eron Anderson
8. http://stackoverflow.com/questions/21497976/why-filtering-an-unsorted-list-is-faster-than-filtering-a-sorted-list?rq=1
9. http://stackoverflow.com/questions/17451765/why-maintaining-sorted-array-is-faster-than-vector-in-c?rq=1
10. What every programmer should know about memory, Part 1, part 2, etc. This is also available in PDF. It is a little dated, e.g., the part about Northbridge and Southbridge, but still quite good.
11. Optimizing software in C++ An optimization guide for Windows, Linux and Mac platforms. Current as of last year.
12. DOOM 3 BFG Technical Note, by J.M.P. van Waveren. This technical note describes some of the performance issues encountered during the development of DOOM 3.
13. http://www.research.scea.com/research/pdfs/GDC2003_Memory_Optimization_18Mar03.pdf
14. Testing cache time: I wrote a little program, on geoxeon in /pc/cache/array.cc, duplicated online in cache/array.cc. It compares reading and writing a 10000x10000 array in order as the elements are stored compared to the other order.
The former is at least 1/3 faster than the latter. Compiling with -O3 makes both tests run about 3x faster.
Note: I've duplicated most (but not quite all) code between geoxeon, in /pc/` and the web, in Gfiles:.
15. Nevertheless: Don't go overboard with this, for several reasons. E.g., Intel is very good at reducing many of these problems by fetching ahead, running another thread during the wait time, etc. In one test some time ago, I saw no penalty from using an extra level of indirection to retrieve data.

2015 Week 15

## 29.  Mon 5/11/15 (L23)

### 29.1  Term project presentations

1. Jake Lowey
2. Dan Foster
3. Mohit Tyagi
4. Rongcui Dong
5. Huiye Liu
6. Erik Schembor
7. Nick White/Clayton Rayment
8. Zhongqing Han
9. Peter Katlic
10. Ben Mizera
11. Matthew Holmes
13. Sam Brown
14. Chris Ngai

### 29.2  Course recap

1. My teaching style is to work from particulars to the general.
2. You've seen OpenMP, a tool for shared memory parallelism.
3. You've seen the architecture of NVidia's GPU, a widely used parallel system, and CUDA, a tool for programming it.
4. You've seen Thrust, a tool on top of CUDA, built in the C++ STL style.
5. You've seen how widely used numerical tools like BLAS and FFT have versions built on CUDA.
6. You've had a chance to program in all of them on geoxeon, my research lab machine with dual 8-core Xeons and K20Xm and K5000 NVidia boards.
7. You seen talks by leaders in high performance computing, such as Jack Dongarra.
8. You've seen quick references to parallel programming using Arrayfire, Matlab, Mathematica, and the cloud.
9. Now, you can inductively reason towards general design rules for shared and non-shared parallel computers, and for the SW tools to exploit them.

### 29.3  Term projects due.

Submit to LMS or email to Hanchao. If the file is big, then upload to your favorite server, or put on geoxeon, and send a link.

### 29.4  Remember your course survey

If you liked the course, then please officially tell RPI by completing the survey. Thanks.

### 29.5  After the semester

1. Before or after you leave RPI, when I'm not too busy, you're welcome to talk to me about most any legal topic.
2. You're welcome to keep using your geoxeon accounts, including for your research for awhile. Tell me if you want to keep them longer. Understand that this an unsupported research machine. If it breaks, it might not get fixed immediately. Some capabilities might never get replaced. If the disk crashes, you lose your files. If one of your accounts is used to attack geoxeon, the rules will change. Etc, etc.
If geoxeon is of massive help to your research, I and my NSF grant would appreciate acknowledgements in your papers.