Hand in your solution on RPILMS, unless instructions say otherwise. Each team should submit its solution under only 1 student's name. The other student's submission should just name the lead student. (This makes it easier for us to avoid grading it twice.)

If you have problems, then ask for help. The goal is to learn the material.

# 1.  Homework 1, due Mon Feb 2, 2015

1. Some students prefer to use non-RPI email addresses. Some prefer names that are different from RPI's records. If either is true, then give details.
2. Do you expect to use your own computer? If so, run these commands and report the output:
1. `lspci | grep -i nvidia`
2. `nvidia-smi`
3. `g++ -v`
4. `nvcc -V`
3. For this question only, email the information to WRF.
To allow you to access my computer, create a public ssh key if you don't already have one, then email it to me.
On Linux, the command is `ssh-keygen`; the public key is stored in `id_rsa.pub`.
In your email, mention your RCS ID, which is approximately your last name, first initial, and a sequence number. Mine is frankwr.
4. Read https://computing.llnl.gov/tutorials/parallel_comp/ for an intro to parallel computing. From it, or from your own research, answer the following questions.
5. Why have machine cycle speeds stopped increasing?
6. What architectures do the top 3 machines on the Top 500 list use?
7. Which one of the following four categories describes most GPUs: SISD, SIMD, MISD, or MIMD?
8. Which one of the following four categories describes most current multicore CPUs: SISD, SIMD, MISD, or MIMD?
9. Per Amdahl's law, if a program is 10% sequential and 90% parallelizable, what is the maximum speedup that can be obtained with an infinite number of parallel processors?
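
For reference, Amdahl's law gives the speedup {$S$} on {$n$} processors for a program whose sequential fraction is {$s$}:

{$$S(n) = \frac{1}{s + (1-s)/n}, \qquad \lim_{n\to\infty} S(n) = \frac{1}{s}$$}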

# 2.  Homework 2, due Thurs Feb 19, 2015

1. Debug these programs in `openmp/llnl/`: `omp_bug{1-6}.cc`.
2. Devise a test program to estimate the time to execute a `critical` pragma.
3. Look at `openmp/rpi/sum_reduc2.cc`. Run it with varying numbers of threads to estimate what fraction of the program is serial and is not being parallelized.
4. Write a program to multiply two 1000x1000 matrices. Do it the conventional way, not using anything fancy like Strassen's algorithm. Now, see how much improvement you can get with OpenMP. Measure only the elapsed time for the multiplication, not for the matrix initialization.
5. Sometimes parallelizing a program can increase its elapsed time. Try to create such an example, with 2 threads being slower than 1.

# 3.  Homework 3, due Thurs Mar 12, 2015

As always, you may do this in pairs. One student should upload the answers to LMS and mention the other student's name. The other student should upload only a statement naming his/her partner.

## 3.1  Paper questions

1. Research and then describe the main changes from Nvidia's Kepler architecture to Maxwell.
2. Although a thread can use 255 registers, that might be bad for performance. Why?
3. If a thread needs more local variables than it has registers, where do the extras go?
4. Give a common way that the various threads in a block can share data with each other.
5. If a thread writes a word that another thread will then read, what must you do between the write and the read?
6. Reading a word from global memory might take 400 cycles. Does that mean that a thread that reads many words from global memory will always take hundreds of times longer to complete?
7. Since the threads in a warp are executed in a SIMD fashion, how can an if-then-else block be executed?
8. What is unified memory and how does it make CUDA programming easier?

## 3.2  Programming

Implement each of the following questions three times: once with sequential C++ code, once with OpenMP, and once with CUDA. Report the elapsed times.

1. Generate 1,000,000 pseudo-random floats in [0,1] (I don't care how). Then compute their mean and standard deviation.
2. Generate 1,000,000 random 2D points in {$[0,1]^2$}. Then, assuming that each has a mass of 1, and is gravitationally attracted to the others, compute the potential energy of the system. The formula is this:
{$$U = - \sum_{i=1}^{N-1} \sum_{j=i+1}^N \frac{1}{r_{ij}}$$}
where {$r_{ij}$} is the distance between points {$i$} and {$j$}. (This assumes that {$G=1$}).

# 4.  Homework 4, due Thurs Apr 2, 2015 (originally Mar 19)

Redo homework 3, but now using Thrust on the device.

# 5.  Homework 5, due Thurs Apr 2, 2015

1. Write a technical comparison of the new Nvidia Tesla K80 to the K20Xm in geoxeon.
2. Extend `bucket_sort2d` to 3D. Write out the points in each bucket.

# 6.  Homework 6, due Thurs Apr 16, 2015

Pick a presentation from a recent parallel computing conference and summarize it in class for 5-7 minutes. You may use slides from that presentation; you do not need to prepare new slides.

You may do this in teams of 2-3 if you wish. E.g., this may also be your project team.