GSoC: CuPy backend for QuTiP 2nd entry

Felipe Bivort Haiek
5 min read · Jun 22, 2021


I spent the past week mostly reading about how to effectively program CUDA-enabled devices, looking at CuPy’s documentation, and studying how QuTiP’s data layer works.

One does not simply optimize GPU code

Even though I will not be working on low-level CUDA code, at least in the coming weeks, since I will be leveraging CuPy and its use of high-level libraries such as cuBLAS and cuSPARSE, I think it is interesting to delve into what makes programming a GPU special and where I think we stand to make gains. I will mostly rely on what I learned reading “Programming Massively Parallel Processors” by David B. Kirk and Wen-mei W. Hwu.

The wrapped code that we send to a GPU is called a kernel. There are two main parameters specified when a kernel is launched: the dimensions of each block of threads and the dimensions of the grid of blocks.

dim3 dimBlock(5, 5, 5);   // threads per block, along x, y and z
dim3 dimGrid(4, 2, 2);    // blocks in the grid, along x, y and z
KernelFunction<<<dimGrid, dimBlock>>>(/* arguments */);

Threads are the computational units that perform the operations. A simple picture is given by vector addition: say we hold vectors A and B and want to compute C = A + B; each thread handles a single index of both vectors, i.e. thread i computes C[i] = A[i] + B[i]. Blocks hold contiguous threads that share access to the same shared memory, which is smaller than the global and constant device memory but much faster to access.
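
To make this concrete from the CuPy side, here is a minimal sketch of the same vector addition written as a raw CUDA kernel launched through CuPy (the kernel body, sizes and launch parameters are just illustrative choices):

import cupy as cp

# Element-wise addition: thread i computes C[i] = A[i] + B[i].
add_kernel = cp.RawKernel(r'''
extern "C" __global__
void vector_add(const float* A, const float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the threads past the end
        C[i] = A[i] + B[i];
}
''', 'vector_add')

n = 1 << 20
A = cp.random.random(n).astype(cp.float32)
B = cp.random.random(n).astype(cp.float32)
C = cp.empty_like(A)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel((blocks,), (threads_per_block,), (A, B, C, cp.int32(n)))

assert cp.allclose(C, A + B)  # same result as plain CuPy array addition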

The maximum number of threads and blocks that can compute at the same time is limited by the GPU architecture.
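
For reference, CuPy lets you query those limits from Python through the device attributes; the attribute names below are assumed to follow the CUDA driver naming, so treat this only as a quick sketch:

import cupy as cp

attrs = cp.cuda.Device(0).attributes  # dict of CUDA device attributes

# A few of the limits relevant to choosing dimBlock and dimGrid:
print(attrs['MaxThreadsPerBlock'])                  # threads allowed per block
print(attrs['MaxBlockDimX'], attrs['MaxBlockDimY'],
      attrs['MaxBlockDimZ'])                        # per-axis block limits
print(attrs['MaxGridDimX'])                         # blocks along the grid's x axis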

Even though we think of GPUs as executing the code in parallel on all threads, the fine structure of a GPU makes this more subtle. Threads are grouped into warps of 32 that execute instructions together, so an algorithm in which thread 1 takes a branch while the rest of its warp does not is slowed down: the whole warp has to step through the branch even when most of its threads do nothing. Take the case where you need to sum all the elements of an array A (the code examples are taken from “Programming Massively Parallel Processors”). A naive approach is a loop that adds pairs of elements with a growing stride: in the first pass only even-numbered threads do work (each adding the element one position away from its own and storing the result in its own slot), in the next pass only threads whose index is a multiple of 4 (adding the element two positions away), then multiples of 8 (adding the element four positions away), and so on.

// Naive reduction: the active threads become sparser and non-contiguous each pass.
extern __shared__ float partialSum[];
// ... each thread first loads one element of the input into partialSum[t] ...
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2 * stride) == 0)
        partialSum[t] += partialSum[t + stride];
}

This means that the last passes are performed by threads that are far apart: almost every warp still has to evaluate the condition even though most of its threads carry out no computation at all, and that slows down the whole algorithm. The better approach is to always have consecutive threads perform the work, by having them add elements that are not contiguous but that do lie in the same block.

// Improved reduction: the active threads are always the first `stride` ones,
// so they stay packed into as few warps as possible.
extern __shared__ float partialSum[];
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}

Notice an important detail shared by the two kernels: the call to __syncthreads(). It guarantees that all threads in the same block have finished the previous iteration, so that it is safe to read the updated values in the shared partialSum array.

Attaining highly performant GPU code rests on being aware of the peculiarities mentioned in the past paragraphs: the different levels of memory (shared, global, constant), the execution “order”, and the specific hardware constraints. An interesting way to avoid writing different code for each architecture is currently being developed in CuPy by leveraging the library Optuna. Optuna performs different kinds of Bayesian search, which means it can be used to search for the best algorithm and launch parameters for a specific function on a given GPU.
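
As a standalone illustration of the idea (this is not CuPy's actual Optuna integration, just a sketch with a toy kernel and made-up candidate block sizes), one could let Optuna search over launch parameters like this:

import time

import cupy as cp
import optuna

# A toy kernel whose launch configuration we want to tune.
scale_kernel = cp.RawKernel(r'''
extern "C" __global__
void scale(const double* x, double* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0 * x[i];
}
''', 'scale')

n = 1 << 22
x = cp.random.random(n)
y = cp.empty_like(x)

def objective(trial):
    block = trial.suggest_categorical('block_size', [32, 64, 128, 256, 512, 1024])
    grid = (n + block - 1) // block
    cp.cuda.Device().synchronize()   # make sure previous work has finished
    start = time.perf_counter()
    scale_kernel((grid,), (block,), (x, y, cp.int32(n)))
    cp.cuda.Device().synchronize()   # wait for the kernel before stopping the clock
    return time.perf_counter() - start

study = optuna.create_study(direction='minimize')  # TPE, a Bayesian sampler, by default
study.optimize(objective, n_trials=20)
print(study.best_params)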

QuTiP’s data layer

QuTiP’s data layer was developed by Jake Lishman during GSoC 2020; you can read more here. It provides Cython-level definitions for the most used data structures in QuTiP, namely Dense and CSR. Dense implements dense matrices and has some optimizations over a plain NumPy ndarray: it enforces arrays to have exactly two dimensions (which we can take as representing the input and output Hilbert spaces), grants a better memory allocation scheme by enforcing contiguity along one of the dimensions, and pays a smaller time penalty because it needs to acquire the GIL fewer times.
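
As a rough sketch of how this looks in code (assuming the development version of QuTiP, where the data layer is exposed under qutip.core.data), a Dense matrix is built directly from a two-dimensional array:

import numpy as np
from qutip.core import data

# The data layer only handles two-dimensional data, matching the picture of
# operators mapping one Hilbert space into another.
sigmax = data.Dense(np.array([[0.0, 1.0], [1.0, 0.0]], dtype=complex))

print(sigmax.shape)       # (2, 2)
print(sigmax.to_array())  # copy back out as a plain NumPy array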

CSR, on the other hand, is QuTiP’s specialization of the compressed sparse row matrices found in SciPy. Many of the performance benefits seen for Dense also apply to this data type.

The data layer also includes a To singleton class, which is in charge of converting between the different data types: it is used both when conversion functions are called explicitly and when a function is called on a data type that has no specialized definition for it. A user who wants to add a new data type needs to define and register at least one conversion from a known data type to the new one and one from the new one back to a known data type. The dispatcher then decides how to traverse the conversion graph.
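
In code, and with heavy caveats (CuPyDense, the converter helpers and the exact registration call are illustrative guesses based on the description above, not the final API), the registration step could look roughly like this:

import cupy as cp
from qutip.core import data

class CuPyDense(data.Data):
    """Toy sketch of a GPU-backed dense matrix, not the real implementation."""
    def __init__(self, array):
        self._cp = cp.asarray(array, dtype=cp.complex128)
        super().__init__(self._cp.shape)

def dense_to_cupydense(dense):
    # Move a host-side Dense matrix onto the GPU.
    return CuPyDense(dense.to_array())

def cupydense_to_dense(cupydense):
    # Copy back to the host and wrap it in the existing Dense type.
    return data.Dense(cp.asnumpy(cupydense._cp))

# With both directions registered, the dispatcher can reach CuPyDense from any
# other data type by chaining conversions through Dense.
data.to.add_conversions([
    (CuPyDense, data.Dense, dense_to_cupydense),
    (data.Dense, CuPyDense, cupydense_to_dense),
])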

Plans for this and the following week

One decision we made at the beginning was to focus first on a fully fleshed-out CuPy specialization of the Dense type, CuPyDense, and leave the sparse data class for later. There are two main reasons for this: it is of particular interest to have a Hierarchical Equations of Motion solver (see more here) that works on the GPU, and it relies mostly on dense matrix computations; and sparse-matrix algorithms do not show as significant a performance benefit on the GPU as dense ones (in part because most GPU sparse-matrix algorithms sort the rows into chunks with a similar number of non-zero elements and pad them with dummy zeros).

Having a working CuPyDense class means defining the conversion operations to and from the previously defined data types (in this case Dense) and implementing the adjoint, transpose, conjugate and trace functions.
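
On a cupy.ndarray those operations are essentially one-liners, so a first sketch of what the methods have to do (the class and method names here just mirror the list above; they are not the final implementation) is:

import cupy as cp

class CuPyDenseSketch:
    """Toy stand-in for CuPyDense: a two-dimensional complex matrix on the GPU."""

    def __init__(self, array):
        self._array = cp.asarray(array, dtype=cp.complex128)
        if self._array.ndim != 2:
            raise ValueError("the data layer only handles two-dimensional data")

    def transpose(self):
        return CuPyDenseSketch(self._array.T)

    def conj(self):
        return CuPyDenseSketch(self._array.conj())

    def adjoint(self):
        # Conjugate transpose, i.e. the dagger of the operator.
        return CuPyDenseSketch(self._array.conj().T)

    def trace(self):
        # Sum of the diagonal, returned as a host-side scalar.
        return cp.trace(self._array).item()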

Another important task for this week is getting CuPy to install properly in the GitHub Actions environment, so that our CI cycle is complete. The main difficulty is that we do not have access to a GPU in GitHub Actions, so I will be trying to have CuPy build with the emulator option set.
