Friday, May 23, 2008

CUDA = Awesome

I'm finally back with an update on the CUDA project. Read on for the details, but long story short: my project ran about 33 times faster on the GPU.

When you download the CUDA SDK, it comes with about 50 sample projects to help you get started. One of these is a template that does nothing but an array multiplication - this includes all the boilerplate code necessary to allocate memory on the GPU, copy data to it, and run the CUDA kernel. It has one file for C++ code with a reference CPU implementation, another file for C code with the GPU implementation, and a .cu CUDA file with the boilerplate. There's also a Makefile included that calls nvcc (the NVIDIA CUDA compiler) and links in the libraries appropriately.

My little program wasn't very complex; it was a naive implementation of the Stochastic Simulation Algorithm (SSA), also known as the Gillespie algorithm. Basically, it takes a set of chemical reactions and an initial set of chemical quantities, then randomly chooses reactions to fire until some condition is reached. The algorithm is run hundreds or thousands of times, and you observe the probabilities of the various outcomes. Since it runs so many times, and each run is independent of the others, it's a fantastic candidate for a parallel implementation.

The first thing I did was create a reference implementation of the SSA to run on the CPU. The algorithm is pretty simple; implementing it took less than 100 lines of C. The cool thing about CUDA is that the kernel itself is just regular C code. I wanted to make sure I was running the exact same code on the CPU and GPU, so I used a little preprocessor magic. I put all the actual code in a separate file, and #include-d that file in both the CPU and GPU implementations, leaving only the processor-specific code separate.
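In sketch form, the trick looks something like this (the file name, the DEVICE macro, and the function shown are illustrative, not my actual layout):

```c
/* ssa_core.c -- the shared algorithm body. DEVICE is defined by
 * whichever file includes this one: empty for the CPU build, and
 * __device__ for the GPU build, i.e. the .cu file does
 *     #define DEVICE __device__
 *     #include "ssa_core.c"
 */
#ifndef DEVICE
#define DEVICE   /* CPU build: expands to nothing */
#endif

/* Pick a reaction index with probability proportional to its
 * propensity a[j], given a uniform random number r in [0,1). */
DEVICE int pick_reaction(const float *a, int n, float r)
{
    float a0 = 0.0f;
    for (int i = 0; i < n; i++)
        a0 += a[i];

    float threshold = r * a0;
    float sum = 0.0f;
    for (int j = 0; j < n - 1; j++) {
        sum += a[j];
        if (sum >= threshold)
            return j;
    }
    return n - 1;   /* fall through to the last reaction */
}
```

Because the body is plain C, nvcc compiles it as device code and gcc compiles it as host code, and any divergence between the two implementations is impossible by construction.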

There was a catch to implementing this on the GPU - there's no rand() library function, and you can't call into regular C libraries like SPRNG. This is obviously a fundamental limitation in writing special code for the GPU - code written for a CPU cannot be reused. Fortunately, one of the CUDA examples is a Mersenne Twister, which runs 4096 random number generators in parallel. I copied and pasted that into my implementation and it worked perfectly!
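If you don't need the statistical quality of the Mersenne Twister, another option (not what I used) is to give each thread its own tiny linear congruential generator. A rough sketch in plain C - the constants are the classic ANSI C rand() parameters, and the seeding scheme is just illustrative:

```c
#include <stdint.h>

/* Minimal per-thread PRNG: each thread keeps its own 32-bit LCG
 * state, seeded from its thread index so the streams differ. */
typedef struct { uint32_t state; } lcg_t;

static void lcg_seed(lcg_t *g, uint32_t tid)
{
    /* Scramble the thread index so nearby threads don't start
     * with nearly identical states (Knuth multiplicative hash). */
    g->state = tid * 2654435761u + 1u;
}

/* Return a uniform float in [0, 1). */
static float lcg_uniform(lcg_t *g)
{
    g->state = g->state * 1103515245u + 12345u;
    /* Use the high-order bits; the low bits of an LCG cycle fast. */
    return ((g->state >> 16) & 0x7fff) / 32768.0f;
}
```

On the GPU the state would live in a per-thread register or array slot; the tradeoff is that a 32-bit LCG has a short period and correlated streams, so it's only suitable for quick experiments.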

I benchmarked the CUDA implementation in two ways - by the number of threads, and by the number of reactions to run in the SSA. I've added a couple graphs of performance based on these two metrics below. You can see performance topped out at 4096 threads (handy - exactly the number of Mersenne Twisters I could actually use) and around 100,000 chemical reactions. If you've used CUDA, or read anything about it, that's probably not surprising: you need to do a lot of work to really get the performance gains.




Finally, the cool part - comparing GPU to CPU performance. The table below compares some of the simulation runs I did. Overall, the GPU gave about a 33x speedup; put another way, it ran 4,096 simulations in less time than the CPU took to run 156.


                    GPU             CPU            Speedup
# of Simulations    4,096           156
# of Reactions      42,580,318,728  1,704,430,718
Average             10,395,586      10,925,838
Max                 28,165,306      25,737,976
Time (s)            985             1,259
Reactions/s         43,228,750      1,353,797      31.93
Max R/s             117,121,922     1,353,797      86.51
Simulations/s       4.16            0.12           33.56


Time for some final thoughts. Obviously the best part of CUDA is the massive speedup. 33 times faster is amazing, especially since I spent no time optimizing the GPU code. What really shocked me, though, was how easy it was to develop with CUDA (once I had it set up, anyway). I had never used it before, but it took me just a couple of hours to adapt the CPU version of my (embarrassingly parallel) program to run on the GPU.

There were two main difficulties in developing with CUDA. First, as I mentioned before, you can't use any external libraries. There is a device emulation mode so you can do things like debug printing if necessary, but everything that will actually run on the GPU has to be written from scratch. The second problem I had (which isn't really a CUDA problem per se) is that X will hang on Linux if your CUDA program ties up the GPU for more than about five seconds. Having to restart X over and over gets annoying fast. Fortunately, if you switch to a text console (Ctrl+Alt+F1), that limitation goes away.

Bottom line: CUDA lives up to the hype. If you're spending time waiting for your CPU to crunch numbers...you're insane. Go buy an 8800GT and rewrite your code in CUDA.

2 comments:

Curran said...

I love the idea of #include-ing the code in both places, it's something I'd wanted to try out.

I also ran into the random-number-in-CUDA issue and came up with a way to do it using a Linear Congruential Generator (with the same parameters as ANSI C, so it's effectively rand()). Here's the code for a cellular automata implementation where each cell gets a random number each iteration:

initialization:
cells[i] = (i + 1) * i * i;

state iteration:
nextCells[i] = cells[i] * 1103515245 + 12345;

number extraction:
float rand = (float)((cellState / 65536) % 32768) / 32768;

Glad to hear CUDA is working out for you! I am also loving it.

timglas said...

^ How did you get "cellState"?