Friday, May 23, 2008

CUDA = Awesome

I'm finally back with an update on the CUDA project. Read on for the details, but long story short: my project ran about 33 times faster on my GPU than on my CPU.

When you download the CUDA SDK, it comes with about 50 sample projects to help you get started. One of these is a template that does nothing but an array multiplication - this includes all the boilerplate code necessary to allocate memory on the GPU, copy data to it, and run the CUDA kernel. It has one file for C++ code with a reference CPU implementation, another file for C code with the GPU implementation, and a .cu CUDA file with the boilerplate. There's also a Makefile included that calls nvcc (the NVIDIA CUDA compiler) and links in the libraries appropriately.

My little program wasn't very complex; it was a naive implementation of the Stochastic Simulation Algorithm (SSA). Basically, this just takes a set of chemical reactions, an initial set of chemical quantities, and randomly chooses reactions to run until some condition is reached. The algorithm is run hundreds or thousands of times, and you can observe the probabilities of various conditions. Since it's run so many times, and each run is independent of the other runs, it is a fantastic candidate for a parallel implementation.
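
To make that description concrete, here is a minimal sketch of the reaction-selection loop in C. It is not my project's code: the two-reaction system, the rate constants, and the names `ssa_run` and `propensities` are made up for illustration, and the stopping condition is simply a cap on the number of reactions. It also skips the part of the SSA that advances simulated time, so it only shows how reactions get picked at random in proportion to how likely they are.

```c
#include <stdlib.h>

/* Hypothetical system, for illustration only:
 *   reaction 0:  A + B -> C   (rate constant k[0])
 *   reaction 1:  C -> A + B   (rate constant k[1])
 */
enum { NUM_SPECIES = 3, NUM_REACTIONS = 2 };
enum { A = 0, B = 1, C = 2 };

/* Uniform random number strictly between 0 and 1, built on rand(). */
static double uniform01(void)
{
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
}

/* Fill prop[] with each reaction's propensity for the current counts. */
static void propensities(const long x[NUM_SPECIES],
                         const double k[NUM_REACTIONS],
                         double prop[NUM_REACTIONS])
{
    prop[0] = k[0] * (double)x[A] * (double)x[B];
    prop[1] = k[1] * (double)x[C];
}

/* Run one simulation: repeatedly choose a reaction with probability
 * proportional to its propensity and apply it to the counts in x[].
 * Stops after max_reactions, or earlier if no reaction can fire.
 * Returns the number of reactions that fired. */
long ssa_run(long x[NUM_SPECIES], const double k[NUM_REACTIONS],
             long max_reactions)
{
    /* Change in each species count when a given reaction fires. */
    static const int change[NUM_REACTIONS][NUM_SPECIES] = {
        { -1, -1, +1 },
        { +1, +1, -1 },
    };
    long fired = 0;

    while (fired < max_reactions) {
        double prop[NUM_REACTIONS];
        propensities(x, k, prop);

        double total = prop[0] + prop[1];
        if (total <= 0.0)
            break; /* nothing left that can react */

        /* Pick reaction 0 with probability prop[0] / total, else 1. */
        double target = uniform01() * total;
        int r = (target < prop[0]) ? 0 : 1;

        for (int s = 0; s < NUM_SPECIES; s++)
            x[s] += change[r][s];
        fired++;
    }
    return fired;
}
```

Calling `ssa_run` many times with the same starting counts and different `srand` seeds gives the independent runs described above; each call touches only its own `x[]` array, which is what makes the problem so easy to split across threads.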

The first thing I did was create a reference implementation of the SSA to run on the CPU. The algorithm is pretty simple; implementing it took less than 100 lines of C. The cool thing about CUDA is that the kernel itself is just regular C code. I wanted to make sure I was running the exact same code on the CPU and GPU, so I used a little preprocessor magic. I put all the actual code in a separate file, and #include-d that file in both the CPU and GPU implementations, leaving only the processor-specific code separate.
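
One way to set up that kind of sharing looks roughly like this. The file name `ssa_core.h`, the macro `SSA_FN`, and the function `choose_reaction` are all hypothetical; the only CUDA-specific pieces are the `__CUDACC__` macro, which nvcc defines when it compiles CUDA source, and the `__host__ __device__` qualifiers.

```c
/* ssa_core.h (hypothetical name): code shared by the CPU and GPU builds. */
#ifndef SSA_CORE_H
#define SSA_CORE_H

/* Under nvcc, mark shared functions as callable from both host and
 * device code. Under a plain C compiler the qualifiers disappear. */
#ifdef __CUDACC__
#define SSA_FN static __host__ __device__
#else
#define SSA_FN static
#endif

/* Given n propensities and a uniform random number u in [0, 1),
 * return the index of the chosen reaction. Each reaction is chosen
 * with probability proportional to its propensity. */
SSA_FN int choose_reaction(const float *prop, int n, float u)
{
    float total = 0.0f;
    for (int i = 0; i < n; i++)
        total += prop[i];

    float target = u * total;
    float running = 0.0f;
    for (int i = 0; i < n; i++) {
        running += prop[i];
        if (target < running)
            return i;
    }
    return n - 1; /* guard against floating-point rounding */
}

#endif /* SSA_CORE_H */

/* The CPU source file and the .cu file would each contain
 *     #include "ssa_core.h"
 * and then add only their own processor-specific setup code. */
```

With this layout the algorithm exists in exactly one place, so any difference in results between the two builds has to come from the processor-specific wrappers or from the hardware, not from two drifting copies of the code.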

There was a catch to implementing this on the GPU - there's no rand() library function, and you can't call into regular C libraries like SPRNG. This is a fundamental limitation of writing code for the GPU - a kernel can't call code that was compiled for the CPU, so existing libraries can't simply be reused. Fortunately, one of the CUDA examples is a Mersenne Twister, which runs 4096 random number generators in parallel. I copied and pasted that into my implementation and it worked perfectly!
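
To show why a generator has to be restructured for this, here is a small sketch of a random number generator that keeps all of its state in a struct the caller owns, so every thread can have its own. Note that this is a plain xorshift32 generator, not the Mersenne Twister from the SDK sample, and the names `rng_t`, `rng_seed`, `rng_next`, and `rng_uniform` are made up. It is far too simple for real simulation work, and seeding neighbouring threads with consecutive integers is a shortcut I would not trust for anything serious.

```c
#include <stdint.h>

/* All generator state lives here, one instance per thread. There are
 * no globals, which is what rand() relies on and a kernel cannot use. */
typedef struct {
    uint32_t state;
} rng_t;

/* Seed the generator. xorshift32 gets stuck at zero, so a zero seed
 * is replaced with an arbitrary nonzero constant. */
static void rng_seed(rng_t *rng, uint32_t seed)
{
    rng->state = seed ? seed : 0x9E3779B9u;
}

/* Advance the generator and return the next 32-bit value (xorshift32). */
static uint32_t rng_next(rng_t *rng)
{
    uint32_t x = rng->state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng->state = x;
    return x;
}

/* Return a float in [0, 1) built from the top 24 bits of the output. */
static float rng_uniform(rng_t *rng)
{
    return (float)(rng_next(rng) >> 8) * (1.0f / 16777216.0f);
}
```

In a parallel version each thread would call `rng_seed` once on its own `rng_t` and then draw from it with `rng_uniform`, never touching another thread's state.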

I benchmarked the CUDA implementation in two ways - by the number of threads, and by the number of reactions to run in the SSA. I've added a couple graphs of performance based on these two metrics below. You can see performance topped out at 4096 threads (handy - exactly the number of Mersenne Twisters I could actually use) and around 100,000 chemical reactions. If you've used CUDA, or read anything about it, that's probably not surprising: you need to do a lot of work to really get the performance gains.




Finally, the cool part - comparing the GPU to CPU performance. The table below shows a comparison of some simulation runs I did. Overall, it was about a 33x speedup. Put another way, the GPU ran 4096 simulations in less time than it took the CPU to run 156.











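| Metric | GPU | CPU | Speedup |
|---|---|---|---|
| # of Simulations | 4096 | 156 | |
| # of Reactions | 42,580,318,728 | 1,704,430,718 | |
| Average | 10,395,586 | 10,925,838 | |
| Max | 28,165,306 | 25,737,976 | |
| Time (s) | 985 | 1259 | |
| Reactions/s | 43,228,750 | 1,353,797 | 31.93 |
| Max R/s | 117,121,922 | 1,353,797 | 86.51 |
| Simulations/s | 4.16 | 0.12 | 33.56 |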


Time for some final thoughts. Obviously the best part of CUDA is the massive speedup. 33 times faster is amazing, especially since I spent no time optimizing the GPU code. What really shocked me, though, was how easy it was to develop with CUDA (once I had it set up, anyway). I had never used it before, but it took me just a couple of hours to adapt the CPU version of my (embarrassingly parallel) program to run on the GPU.

There were two main difficulties in developing with CUDA. First, as I mentioned before, you can't use any external libraries. There is a CUDA simulator so you can do stuff like debug printing if necessary, but everything that will actually run on the GPU has to be written from scratch. The second problem I had (which isn't really a CUDA problem per se) is that X will hang on Linux if your CUDA program ties up the GPU for more than about 5 seconds. Having to continually restart X gets annoying fast. Fortunately, if you switch over to terminal mode (Ctrl+Alt+F1) you no longer have this limitation.

Bottom line: CUDA lives up to the hype. If you're spending time waiting for your CPU to crunch numbers...you're insane. Go buy an 8800GT and rewrite your code in CUDA.

Friday, May 2, 2008

Getting Started with CUDA on Linux

For the last three years, I've been going to school part-time on and off to get my master's degree in Computer Science. I'm in my last semester of real classes now, and thankfully it's nearly the end of the semester. However, that means end-of-semester projects...d'oh!

For one of my projects, which I'll explain in more detail later, I am planning to write some CUDA code to simulate chemical reactions. I'm currently running Ubuntu 7.10 with an NVIDIA 8800GT video card. I'm a little bit worried about getting this set up correctly - I've had some issues getting my video card working. In fact, I have to reinstall my video card drivers every time I reboot my computer. Ah, the joys of Linux.

I originally intended to write this as a half-assed attempt at a live-blog while getting the CUDA examples to compile and run. However, I had so many problems that it just got too long. Here's the abridged list of problems I had, plus their solutions. I like to think of myself as moderately competent (although not a Linux genius), but this process was still pretty painful. Hopefully this list will make things easier for somebody out there:
  1. Problem: Permission errors building the SDK. Solution: Don't install the SDK as root; it just goes under your home folder, anyway.
  2. Problem: "GL/glu.h: No such file or directory." Solution: `apt-get install libglu1-mesa-dev`
  3. Problem: "gcc: installation problem, cannot exec `cc1plus.'" Solution: make sure that your gcc and g++ compilers are the same version. (WTF?!)
  4. Problem: "cannot find -lglut". Solution: `apt-get install libglut3-dev`
  5. Problem: "error while loading shared libraries: libcudart.so". Solution: Add /usr/local/cuda/lib to /etc/ld.so.conf and run ldconfig

Good luck with CUDA. If you run across different problems or different solutions, post them in the comments!

PS - Wouldn't it be nice if NVIDIA had just released this as an apt-gettable package so I wouldn't have to guess which prerequisite packages I needed?