{ “metadata”: { “name”: “” }, “nbformat”: 3, “nbformat_minor”: 0, “worksheets”: [ { “cells”: [ { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “# Accelerating C/C++ code with Libraries on GPUs\n”, “\n”, “In this self-paced, hands-on lab, we will use libraries to accelerate code on NVIDIA GPUs.\n”, “\n”, “Lab created by Mark Ebersole (Follow @CUDAHamster on Twitter)” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “The following timer counts down to a five minute warning before the lab instance shuts down. You should get a pop up at the five minute warning reminding you to save your work!” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “<script src="files/countdown_v5.0/countdown.js"></script>\n”, “<div id="clock" align="center"></div>\n”, “” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “—\n”, “Before we begin, let’s verify WebSockets are working on your system. To do this, execute the cell block below by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or pressing the play button in the toolbar above. If all goes well, you should see get some output returned below the grey cell. If not, please consult the Self-paced Lab Troubleshooting FAQ to debug the issue.” ] }, { “cell_type”: “code”, “collapsed”: false, “input”: [ “print "The answer should be three: " + str(1+2)” ], “language”: “python”, “metadata”: {}, “outputs”: [] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “Let’s execute the cell below to display information about the GPUs running on the server.” ] }, { “cell_type”: “code”, “collapsed”: false, “input”: [ “!nvidia-smi” ], “language”: “python”, “metadata”: {}, “outputs”: [] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “—\n”, “<p class="hint_trigger">If you have never before taken an IPython Notebook based self-paced lab from NVIDIA, click this green box.\n”, “ <div class="toggle_container"><div class="input_area box-flex1"><div class=\"highlight\">The following video will explain the infrastructure we are using for this self-paced lab, as well as give some tips on it’s usage. If you’ve never taken a lab on this system before, it’s highly encourage you watch this short video first.

\n”, “<div align="center"><iframe width="640" height="390" src="http://www.youtube.com/embed/ZMrDaLSFqpY" frameborder="0" allowfullscreen></iframe></div>\n”, “
\n”, “<h2 style="text-align:center;color:red;">Attention Firefox Users</h2><div style="text-align:center; margin: 0px 25px 0px 25px;">There is a bug with Firefox related to setting focus in any text editors embedded in this lab. Even though the cursor may be blinking in the text editor, focus for the keyboard may not be there, and any keys you press may be applying to the previously selected cell. To work around this issue, you’ll need to first click in the margin of the browser window (where there are no cells) and then in the text editor. Sorry for this inconvenience, we’re working on getting this fixed.</div></div></div></div></p>” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “## Introduction to GPU Libraries\n”, “\n”, “GPU-accelerated libraries are a great and easy way to quickly speed-up the computationally intensive portions of your code. You are able to access highly-tuned and GPU optimized algorithms, without having to write any of that code.\n”, “\n”, “This lab consists of three tasks that will require you to modify some code, compile and execute it. For each task, a solution is provided so you can check your work or take a peek if you get lost.\n”, “\n”, “If you are still confused now, or at any point in this lab, you can consult the <a href="#FAQ">FAQ</a> located at the bottom of this page.” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “### Task #1\n”, “\n”, “For this first task, we’ll be using the cuBLAS GPU-accelerated library to do a basic matrix multiply. The specific API we’ll be using is cublasSgemm (the S stands for single and the gemm for GEneric Matrix-Matrix Multiply) and you’ll want to use the documentation for this call located at docs.nvidia.com here.\n”, “\n”, “It is important to realize that the GPU has its own physical memory, just like the CPU uses system RAM for its memory. For the library we’ll be using in this task, we have to ensure the data required is first copied across the PCI-Express bus to the GPU’s memory before we call a library API. For this task, we will manage the GPU memory with the following calls (detailed documentation here at docs.nvidia.com):\n”, “\n”, “* cudaMalloc ( void** devPtr, size_t size ) - this API call is used to allocate memory on the GPU, and is very similar to using malloc on the CPU. You provide the addres of a pointer that will point to the memory after the call completes successfully, as well as the number of bytes to allocate.\n”, “* cudaMemcpy ( void* dst, const void* src, size_t count, cudaMemcpyKind kind ) - also very similar the standard memcpy, this API call is used to copy data between the CPU and GPU. It takes a destination pointer, a source pointer, the number of bytes to copy, and the fourth parameter indicates which direction the data is traveling: GPU->CPU, CPU->GPU, or GPU->GPU. The two constants we’ll use for Task #2 are:\n”, “ * cudaMemcpyDeviceToHost - data is traveling from the GPU to the CPU\n”, “ * cudaMemcpyHostToDevice - data is traveling from the CPU to the GPU \n”, “* cudaFree ( void* devPtr ) - we use this API call to free any memory we allocated on the GPU.\n”, “\n”, “In the code below, there already exists a single-threaded CPU version of matrix multiply we’ll use to compare our GPU library call answer with. Your goal is to replace the ## FIXME: ... ## regions of code to successfully call cublasSgemm and do a basic matrix multiply. If you get stuck, there are a number of hints provided below. In addition, I encourage you to look at the task1_solution.cu file (click on the task1 folder in the editor below to see the file) and compare your work, or use this if you are lost.\n”, “\n”, “When you are ready to compile and run, there are two executable cells below the text editor. The first one compiles task1.cu, and the second runs the resulting program. Remember you execute these cells with Ctrl-Enter, or by pressing the Play button in the toolbar at the top of the window.\n”, “\n”, “<p class="hint_trigger">Hint #1\n”, “ <div class="toggle_container"><div class="input_area box-flex1"><div class=\"highlight\">The cublasHandle_t required by the cublasSgemm call has already been created for you in the code.</div></div></div></p>\n”, “\n”, “<p class="hint_trigger">Hint #2\n”, “ <div class="toggle_container"><div class="input_area box-flex1"><div class=\"highlight\">You will need to use cudaMalloc to allocate space for the d_a, d_b, and d_c arrays on the GPU. These pointers have already been declared for you.</div></div></div></p>\n”, “ \n”, “<p class="hint_trigger">Hint #3\n”, “ <div class="toggle_container"><div class="input_area box-flex1"><div class=\"highlight\">Once you have allocated the arrays on the GPU, you will need to copy the a and b arrays to the GPU using cudaMemcpy calls. After the cublasSgemm call has returned, you’ll want to use cudaMemcpy to copy the resulting d_c array back to the host.</div></div></div></p>\n”, “ \n”, “<p class="hint_trigger">Hint #4\n”, “ <div class="toggle_container"><div class="input_area box-flex1"><div class=\"highlight\">You’ll need to use the cublasSgemm <a href="http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-gemm">documentation</a> to determine the remaining parameters needed in the call.</div></div></div></p>” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “<iframe id="task1" src="task1" width="100%" height="500px">\n”, “ <p>Your browser does not support iframes.</p>\n”, “</iframe>” ] }, { “cell_type”: “code”, “collapsed”: false, “input”: [ “# Execute this cell to compile ttask1\n”, “!nvcc -arch=sm_20 -lcublas -o task1_out task1/task1.cu && echo Compiled Successfully!” ], “language”: “python”, “metadata”: {}, “outputs”: [] }, { “cell_type”: “code”, “collapsed”: false, “input”: [ “!./task1_out” ], “language”: “python”, “metadata”: {}, “outputs”: [] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “With just a few lines of code, we were able to call a highly optimized version of matrix-multiply, instead of having to write that code ourselves. And with each new generation of GPUs, the libraries will be updated to continually take advantage of enhanced performance and features.\n”, “\n”, “In the above task, we only used a single library call. However, the real power of libraries tends to come from stringing multiple calls together to build a more in-depth algorithm. We’ll explore using multiple GPU libraries in the next task.” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “### Task #2\n”, “\n”, “For this task, we’re going to make use of the GPU to create the random numbers used in our matrix multiplication. While this is a contrived example, it’s a good simplistic way to show the concepts of using multiple GPU libraries in tandem.\n”, “\n”, “The CPU-side matrix multiply has been removed, and we’ll just being using the cublasSgemm to do the multiply.\n”, “\n”, “Your objective in this task is replace the for-loop and rand() calls with two host-side version of the cuRAND curandGenerateNormal API call. You’ll want to look at the cuRAND documentation for this to see what you need to do. \n”, “\n”, “Before you modify any code, you should compile and execute task2.cu using the two cells below the text editor. This will give you a benchmark on how long the original code takes to run.” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “<p class="hint_trigger">Hint #1\n”, “ <div class="toggle_container"><div class="input_area box-flex1"><div class=\"highlight\">For this task, make sure to use the Host version of the curandCreateGeneratorHost call. Well explore the device-side version in the next task.</div></div></div></p>” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “<p class="hint_trigger">Hint #2\n”, “ <div class="toggle_container"><div class="input_area box-flex1"><div class=\"highlight\">There’s a great example showing how to use the cuRAND library located <a href="http://docs.nvidia.com/cuda/curand/host-api-overview.html#host-api-example">here</a>.</div></div></div></p>” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “<p class="hint_trigger">Hint #3\n”, “ <div class="toggle_container"><div class="input_area box-flex1"><div class=\"highlight\">There is no need to add any cudaMempcy calls in this task.</div></div></div></p>” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “<iframe id="task2" src="task2" width="100%" height="500px">\n”, “ <p>Your browser does not support iframes.</p>\n”, “</iframe>” ] }, { “cell_type”: “code”, “collapsed”: false, “input”: [ “# Execute this cell to compile task2.cu\n”, “!nvcc -arch=sm_20 -lcublas -lcurand -o task2_out task2/task2.cu && echo Compiled Successfully!” ], “language”: “python”, “metadata”: {}, “outputs”: [] }, { “cell_type”: “code”, “collapsed”: false, “input”: [ “!./task2_out” ], “language”: “python”, “metadata”: {}, “outputs”: [] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “If you have successfully used the cuRAND library to create random data for matrices h_a and h_b, and used the curandCreateGeneratorHost call, you will probably have noticed that the whole program takes longer to run than the original. That’s definitely not what we want or expect when moving computation to the GPU. So what gives?\n”, “\n”, “This is a great opportunity to demonstrate the command-line profile provided in the CUDA Toolkit from NVIDIA; nvprof. Execute the below cell to profile your task2_out program and we’ll see if we can figure out what the problem is. This is called profiler-driven optimization.” ] }, { “cell_type”: “code”, “collapsed”: false, “input”: [ “!nvprof ./task2_out” ], “language”: “python”, “metadata”: {}, “outputs”: [] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “If your modified task2.cu file is similar to the task2_solution.cu code, the nvprof will show that the 2nd and 3rd areas consuming most of our application time are transfers between the Device and Host memory across the PCI-Express bus:\n”, “\n”, “ 6.48% 498.69ms 1 498.69ms 498.69ms 498.69ms [CUDA memcpy DtoH]\n”, “ 5.94% 456.66ms 3 152.22ms 1.2160us 228.77ms [CUDA memcpy HtoD]\n”, “ \n”, “So, our next task is to see if we can minimize some of these transfers. The reason for these excessive data transfers lies in the fact that we used the host-side version of curandCreateGeneratorHost. Random number creation using this generator will create random numbers on the GPU, and then copy the data across the PCI-Express bus to host-side arrays. So what’s currently happening is:\n”, “\n”, “1. Tell the GPU to create random numbers\n”, “2. The cuRAND library then copies these random numbers across the PCI-Express bus to host memory\n”, “3. We then re-copy these random numbers back to the GPU to be used by our cublasSgemm call.\n”, “\n”, “So we’re doing four extra transfers (2x arrays 2 times) of data than is required!” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “### Task #3\n”, “\n”, “Your final task will be to modify task3.cu (the same as task2_solution.cu in Task #2) to minimize the number of transfers across the PCI-Express bus. Please make use of the cuRAND documentation provided above, if needed.” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “<p class="hint_trigger">Hint #1\n”, “ <div class="toggle_container"><div class="input_area box-flex1"><div class=\"highlight\">To have the cuRAND library fill a device-side array, you’ll want to use curandCreateGenerator without the Host on the end.</div></div></div></p>” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “<p class="hint_trigger">Hint #2\n”, “ <div class="toggle_container"><div class="input_area box-flex1"><div class=\"highlight\">You’ll no longer need the cudaMemcpy to move the h_a and h_b arrays to the device since the curandGenerateNormalDouble will have already filled in the device arrays!</div></div></div></p>” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “<iframe id="task3" src="task3" width="100%" height="500px">\n”, “ <p>Your browser does not support iframes.</p>\n”, “</iframe>” ] }, { “cell_type”: “code”, “collapsed”: false, “input”: [ “# Execute this cell to compile task3.cu\n”, “!nvcc -arch=sm_20 -lcublas -lcurand -o task3_out task3/task3.cu && echo Compiled Successfully!” ], “language”: “python”, “metadata”: {}, “outputs”: [] }, { “cell_type”: “code”, “collapsed”: false, “input”: [ “!./task3_out” ], “language”: “python”, “metadata”: {}, “outputs”: [] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “In my case, with the modification to the random number generation, our contrived example is now running about three times as fast as we started with - and we did not have to write any GPU specific code, just make a few library API calls.\n”, “\n”, “You can now see how leaving data on the GPU for as long as possible, and stringing together API calls from various libraries, can provide great flexibility in creating extremely fast and powerful algorithms!” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “## Learn More\n”, “\n”, “If you are interested in learning more, you can use the following resources:\n”, “\n”, “* Learn more at the CUDA Developer Zone.\n”, “* See a list of available libraries here.\n”, “* If you have an NVIDIA GPU in your system, you can download and install the CUDA tookit to access NVIDIA’s GPU-accelerated libraries.\n”, “* Search or ask questions on Stackoverflow using the cuda tag.” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “<a id="post-lab"></a>\n”, “## Post-Lab\n”, “\n”, “Finally, don’t forget to save your work from this lab before time runs out and the instance shuts down!!\n”, “\n”, “1. Save this IPython Notebook by going to File -> Download as -> IPython (.ipynb) at the top of this window\n”, “2. You can execute the following cell block to create a zip-file of the files you’ve been working on, and download it with the link below.” ] }, { “cell_type”: “code”, “collapsed”: false, “input”: [ “%%bash\n”, “rm -f library_c_files.zip\n”, “zip -r library_c_files.zip task1/.cu task1/.h task2/.cu task2/.h task3/.cu task3/.h” ], “language”: “python”, “metadata”: {}, “outputs”: [] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “After executing the above cell, you should be able to download the zip file here” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “<a id="FAQ"></a>\n”, “—\n”, “# Lab FAQ\n”, “\n”, “Q: I’m encountering issues executing the cells, or other technical problems?
\n”, “A: Please see this infrastructure FAQ.” ] }, { “cell_type”: “markdown”, “metadata”: {}, “source”: [ “\n”, “” ] } ], “metadata”: {} } ] }