Saturday, 18 September 2021

Implementing word2vec on the GPU

Some time ago I finished my master's thesis on word2vec. For the thesis I used the gensim implementation of word2vec, which is well known and robust. The thesis itself was not about implementing word2vec; I was working on my own implementation on the side, but did not have time to complete it at that point, as it was not a priority.

After taking a small break from everything word2vec-related, I recently finished my own implementation. The code is here: https://github.com/tsaastam/myw2v

I wanted to specifically implement word2vec on the GPU, unlike the original reference code and gensim, which both run on the CPU only. One reason for this was that I couldn't find many GPU implementations of word2vec. There's a paper by Gupta & Khare, BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs, which describes a multi-GPU implementation, but that is a closed-source proprietary system owned by Amazon. There are some other papers as well, but I couldn't find an established open-source GPU version of word2vec. There are several partial GPU implementations to be found, but none of them seem to be complete, or very performant.

A straightforward way to write a GPU implementation would be to use one of the modern machine learning libraries, such as TensorFlow or PyTorch. However, these libraries (at least by default) use a "synchronous" implementation of backpropagation, whereas word2vec has always been implemented in a "non-thread-safe" manner; see the gensim explanation for example. The difference is that the latter style, so-called asynchronous stochastic gradient descent, is massively faster, at least in the case of word2vec. I did not measure precisely how much the "accuracy" suffers from this thread-unsafety, but my GPU implementation seems to be reasonably competitive, scoring some 10% lower on the word analogy task than gensim on the same material while being noticeably faster to train.

In more detail, the common "synchronous" method of training is basically: take a small batch of input-output pairs (or similar), compute the loss gradient for all of these in parallel, take the average, adjust the weights by that, and repeat. But the backpropagation part of word2vec seems to be simple enough that this way of computing it is not very efficient, whether on GPU or CPU. Basically the gradient involves two vectors of length D (the embedding dimensionality, usually 100-300 or so) and their dot product, and that's about it. Now one can calculate, say, one thousand such backpropagations in parallel (synchronously), but - based on my experiments at least - the overhead involved is too large compared to the actual duration of the backpropagation calculation. One needs to generate a bunch of input-output pairs, put them in some kind of array, transfer the array to the GPU (if using a GPU), run the parallel calculation, then perhaps synchronise the GPU "threads" and copy the resulting data back, and finally calculate the average and adjust the weights. The problem is that the calculation itself probably takes less time than all the generating, copying and averaging. (I didn't measure the individual durations of these steps very accurately, but in practice an implementation like this is slow as molasses.)
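
To make that concrete, here is a minimal sketch of what a single backpropagation step might look like for one (input, output) pair, assuming skip-gram with negative sampling. The function and variable names here are illustrative, not taken from my actual code:

    // One skip-gram negative-sampling update for a single (input, output)
    // pair. "label" is 1.0 for the true context word and 0.0 for a
    // negative sample. Names are illustrative, not from myw2v.
    __device__ void update_pair(float* w_in, float* w_out, int D,
                                float lr, float label) {
        // Dot product of the two D-dimensional vectors.
        float dot = 0.0f;
        for (int i = 0; i < D; ++i) dot += w_in[i] * w_out[i];

        // Sigmoid of the dot product and the shared gradient scale.
        float sigma = 1.0f / (1.0f + expf(-dot));
        float g = lr * (label - sigma);

        // Adjust both vectors; note that there is no locking and no
        // atomics, so concurrent updates to the same vectors may race.
        for (int i = 0; i < D; ++i) {
            float tmp = w_in[i];
            w_in[i]  += g * w_out[i];
            w_out[i] += g * tmp;
        }
    }

That really is the whole per-pair computation - one dot product, one sigmoid and two scaled vector additions - which is why the surrounding data generation and copying can easily dominate the runtime.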

One could also generate the input-output pairs on the GPU itself, so that there's less copying and transferring, and then use them in a batched-and-averaged way. I was tempted to try this, and made some preliminary attempts, but the amount of bookkeeping required made it annoying. It is much more straightforward to do what I ended up doing: as soon as each input-output pair is generated from the input data (on the GPU), it is immediately used in the backpropagation step, and that's it. This way one does not need to wait for a certain number of input-output pairs to be generated first - how would you even know, in your CUDA code, that all the items in your "n input-output pairs" batch array have been generated successfully? - before running the next step of the calculation on them, averaging, and going back to the generation phase. Of course, this immediate, completely unsynchronised backpropagation step means that some updates are lost - see "asynchronous" above.
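
For illustration, here is roughly what that generate-and-update loop could look like as a CUDA kernel, reusing the update_pair sketch from above. The corpus layout, the handling of sentence boundaries and the negative sampling are all simplified; this is a sketch of the approach, not the exact scheme my code uses:

    // Each thread walks its own positions in the corpus, generates
    // (center, context) pairs on the fly and applies the update
    // immediately - no batching, no averaging, no synchronisation.
    __global__ void train_kernel(const int* corpus, int corpus_len,
                                 float* w_in, float* w_out,
                                 int D, int window, float lr) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;

        for (int pos = idx; pos < corpus_len; pos += stride) {
            int center = corpus[pos];
            for (int off = -window; off <= window; ++off) {
                if (off == 0) continue;
                int ctx = pos + off;
                if (ctx < 0 || ctx >= corpus_len) continue;
                // Positive example; negative samples would be drawn and
                // updated here in the same way, with label 0.0.
                update_pair(&w_in[(size_t)center * D],
                            &w_out[(size_t)corpus[ctx] * D],
                            D, lr, 1.0f);
            }
        }
    }

Each thread applies its updates directly to the shared weight matrices, so concurrent updates to the same word vectors can overwrite each other - which is exactly the "lost updates" trade-off discussed next.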

It is interesting that this method works fine, despite its utterly non-thread-safe nature - though perhaps it is not all that surprising. If one were to calculate, precisely and synchronously, a batch of say 128 gradients, and then average them and use that average as the adjustment, would not some of those 128 gradients (namely the more extreme ones) be "lost" in the averaging as well? With the asynchronous algorithm the lost updates are more or less random, rather than being the most extreme ones, and it is somewhat tricky to ascertain even what fraction of the updates are lost; but it turns out that this does not matter all that much. The dot products still get nicely trained towards where we want them, on average.
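
To spell the comparison out in standard SGD notation (nothing here is specific to my code): with a synchronous batch of size B the weights move by the average of the B gradients, whereas the asynchronous version applies each per-pair gradient on its own, possibly computed from slightly stale weights:

    w_{t+1} = w_t - \frac{\eta}{B} \sum_{i=1}^{B} g_i(w_t)   % synchronous, averaged per batch
    w_{t+1} = w_t - \eta \, g_i(\tilde{w}_t)                 % asynchronous, per pair

where g_i is the gradient for the i-th input-output pair and \tilde{w}_t is a possibly stale view of the weights, read without synchronisation.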

Here's to word2vec, a truly optimised algorithm.

P.S. I did not experiment with different ways of dividing up the input material across CUDA thread blocks, as Gupta and Khare do in their paper. That could be one avenue for optimising and improving my code. And of course, proper "online" batching of the data into GPU memory, rather than loading it all in at once and then computing everything at once on the GPU, would be a significant improvement one would want in a real production system.