In the last weeks I made progress on implementing using CuPy a number of functions found in QuTiP dev.major.
You can find a list of functions, the state they where found in in QuTiP and their currentstate in qutip-cupy here.
For most functions to be specialized creating tests was easy enough, it meant inheriting from the testing classes in the QuTiP module test.data.mathematics_tests, creating a random CuPyDense array generator ( line 10), registering it (line 17)and passing around the new specialization and its expected input and output class/types (line 24).
In some cases tests had to be built from scratch but we could still leverage the nice Mixins you can see in QuTiP tests.core.data.test_mathematics. This meant defining a proper naive NumPy implementation of the functionality we wanted to test. In this way the result of our own function specializations got checked against NumPy’s result. And one should pay attention that the inputs are suitable and span all edge cases.
For some special non-trivial specializations we tested with different implementations. We covered implementations using CuPy main functions directly, using functions from cupy.cublas that CuBLAS’ handles and using raw cuda kernels.
Expectation on kets
In the case of implementing expectation for states represented as kets , i.e. <state|op|state> , I tried out the following three different implementations
Notice that the last implementation passes the argument “H” which means taking the hermitian of the left array, the second function uses cp.vdot which does the same thing by default.
Ther results we obtained after benchmarking for random 1 dimensional kets and operators (square arrays) of compatible size can be seen in the next figure.
As the GPU timings are very similar here I referred to the CPU timings to make a final decision.
Checking if a matrix is hermitian involves checking that each item from thre transpose conjugates is equal up to a tolerance factor to the original matrix.
To do this I tried using a cuda kernels. The key word in the past definition is each, if we distribute the computation among different threads in a cuda kernel, they will make computations in an asynchronous fashion ( as long as they do not belong to the same warp) . Which means that two different threads may try to update the output at the same time and you enter into a race condition, thread1 reads current output, thread20 reads the output, thread1 computes and updates, but thread 20 computes on the original value and updates opoutput without ever seeing thread1's update. A solution to this is to use atomic operations which “serializes” the updates.
Atomic operation vs. cupy reduction
A better way to do this is use a reduction routine, instead of writting my own Cuda reduction kernel I relied on CuPy’s .all() . You can see the difference for yourself in the next plot.
But improvements did not finish there, we could obtain better result by just checking the upper triangular region of the matrix. Additionaly by taking care of balancing the loads between threads, ensuring that contiguous threads load contiguous memory, and switching to a 1D-block kernel, we got additional improvements.
Notice that the improvement we get is around two orders of magnitude.
In the future we can think of reimplementing some other functions as cuda kernels, for example the diags_cupydense creator.
For the last week I am focusing on getting documentation to look right, although it will probably not be possible to have the documentation build on ReadTheDocs becuase of our dependences, it is built each time as part of our CI as an artifact and we can get it from there.
Equally important is taking time to describe the functionality and documenting all the pending improvements I want to do but did not have time so that the work can be continued in the future.