Saturday, 18 September 2021

Implementing word2vec on the GPU

Some time ago I finished my master's thesis on word2vec. For the thesis I used the gensim implementation of word2vec, which is well known and robust. The thesis was not really about implementing word2vec; I was working on my own implementation on the side, but did not have time to complete it, as it was not a priority.

After taking a small break from everything word2vec-related, I recently finished my own implementation. The code is here: https://github.com/tsaastam/myw2v

I wanted to specifically implement word2vec on the GPU, unlike the original reference code and gensim, which both run on the CPU only. One reason I wanted to do this was that I couldn't find many GPU implementations of word2vec. There's a paper by Gupta & Khare, BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs, where a multi-GPU implementation is discussed, but it's a closed-source proprietary system owned by Amazon. There are some other papers as well, but I couldn't find an established open-source GPU version of word2vec. There are several partial GPU implementations to be found, but none of them seems complete or particularly performant.

A straightforward way to write a GPU implementation would be to use one of the modern machine learning libraries, such as TensorFlow or PyTorch. However, these libraries (at least by default) use a "synchronous" implementation of backpropagation, whereas word2vec has always been implemented in a "non-thread-safe" manner instead; see the gensim explanation for example. The difference is that the latter implementation style, i.e. so-called asynchronous stochastic gradient descent, is massively faster, at least in the case of word2vec. I did not in fact check how much the "accuracy" suffers from this thread-unsafety, but my GPU implementation seems to be reasonably competitive, scoring some 10% lower on the word analogy task than gensim on the same material while being noticeably faster to train.

In more detail, the common "synchronous" method of training is basically: take a small batch of input-output pairs (or similar), compute the loss gradient for all of these in parallel, take the average and adjust the weights by that, then repeat. But it seems that the backpropagation part of word2vec is simple enough that this way of computing it is not very efficient, whether on GPU or CPU. Basically the gradient involves two vectors of length D (embedding dimensionality, usually 100-300 or so) and their dot product, and that's about it. Now one can calculate, say, one thousand such backpropagations in parallel (synchronously), but - based on my experiments at least - it seems that this involves too much overhead compared to the actual duration of the backpropagation calculation. One needs to generate a bunch of input-output pairs, put these in some kind of array, transfer the array to the GPU (if using a GPU), run the parallel calculation, then perhaps synchronise the GPU "threads" and copy the resulting data back, and then calculate the average and adjust. The problem is that the calculation takes, probably, less time than all the generating and copying and averaging. (I didn't measure the individual durations of these things very accurately, but basically, an implementation like this in practice is slow as molasses.)
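To make the "not much computation per pair" point concrete, a single skip-gram update for one (centre, context) pair looks roughly like the following plain numpy sketch (an illustration only, not the code from my repository; real word2vec also draws several negative samples per pair and accumulates the input-side update across them):

import numpy as np

def sgns_pair_update(W_in, W_out, center, context, label, lr=0.025):
    # W_in, W_out: the two embedding matrices, shape (vocabulary size, D).
    # label is 1.0 for a true context word, 0.0 for a negative sample.
    v = W_in[center]                              # D-dimensional vector
    u = W_out[context]                            # D-dimensional vector
    sigma = 1.0 / (1.0 + np.exp(-np.dot(u, v)))   # predicted probability
    g = lr * (label - sigma)                      # scalar coefficient
    dv = g * u                                    # step for the input vector
    du = g * v                                    # step for the output vector
    W_in[center] += dv                            # two rank-1 updates,
    W_out[context] += du                          # and that's about it

That is one dot product, one exponential and a couple of scaled vector additions per pair - not much work over which to amortise all the batching and copying overhead.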

One could also generate the input-output pairs on the GPU itself so that there's less copying and transferring, and then use them in a batched-and-averaged way. I was tempted to try this, and made some preliminary attempts, but the amount of bookkeeping required makes this annoying. It is much more straightforward to instead do what I ended up doing: as soon as each input-output pair is generated from the input data (on the GPU), it is immediately used in the backpropagation step, and that's it. This way one does not need to wait for a certain number of input-output pairs to be generated first - how would you know (in your CUDA code) that all the items in your "n input-output pairs" batch array are now generated successfully? - and then run the next step of the calculation on them, then average, and go back to the generation phase. Of course, the immediate, completely unsynchronised backpropagation step means that some updates are lost - see "asynchronous" above.
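To give a flavour of the "use each pair immediately" approach, here is a rough sketch using Numba's CUDA support (my actual implementation is structured differently and does more - negative sampling, subsampling and so on - so treat the names and details here as made up for illustration). Each thread takes one centre position in the corpus, produces its (centre, context) pairs and applies each update on the spot, with no batch array, no averaging and no locking:

import math
from numba import cuda

@cuda.jit
def async_skipgram_kernel(corpus, W_in, W_out, window, lr):
    # corpus: an array of word indices, already resident in GPU memory.
    # One thread per centre position; every (centre, context) pair is used
    # for an update the moment it is generated. Concurrent updates to the
    # same vectors can overwrite each other ("asynchronous" SGD).
    i = cuda.grid(1)
    n = corpus.shape[0]
    if i >= n:
        return
    c = corpus[i]
    D = W_in.shape[1]
    for j in range(max(0, i - window), min(n, i + window + 1)):
        if j == i:
            continue
        o = corpus[j]
        dot = 0.0
        for k in range(D):
            dot += W_in[c, k] * W_out[o, k]
        g = lr * (1.0 - 1.0 / (1.0 + math.exp(-dot)))   # positive pair, target 1
        for k in range(D):
            v_k = W_in[c, k]
            W_in[c, k] += g * W_out[o, k]
            W_out[o, k] += g * v_k

# hypothetical launch, one thread per corpus position:
# async_skipgram_kernel[(corpus.shape[0] + 255) // 256, 256](corpus, W_in, W_out, 5, 0.025)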

It is interesting that this method works fine, despite its utterly non-thread-safe nature. But this is probably not all that surprising. If one were to calculate, precisely and synchronously, a batch of say 128 gradients, and then average them and use that as the adjustment, would not some of these 128 gradients (namely the more extreme ones) be "lost" as well in the averaging? Obviously with the asynchronous algorithm the lost values are more or less random, rather than being the most extreme values; and it is somewhat tricky to ascertain even what fraction of the updates were lost; but it turns out that this does not matter all that much. The dot products still get nicely trained towards where we want them, on average.
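As a quick sanity check of the "lost updates don't matter much" intuition, one can simulate gradient descent where a random fraction of the updates is simply thrown away (a toy illustration, nothing to do with the actual word2vec code):

import numpy as np

rng = np.random.default_rng(0)

def noisy_descent(drop_fraction, steps=2000, lr=0.05, dim=100):
    # Minimise f(x) = ||x||^2 with noisy gradients, randomly discarding
    # a fraction of the updates to mimic overwritten asynchronous writes.
    x = rng.normal(size=dim)
    for _ in range(steps):
        grad = 2 * x + rng.normal(scale=0.1, size=dim)
        if rng.random() > drop_fraction:   # this update "survives"
            x -= lr * grad
    return np.linalg.norm(x)

print(noisy_descent(0.0))   # no updates lost
print(noisy_descent(0.3))   # ~30% of updates lost: still converges

Losing a random subset of the updates mostly just means taking somewhat fewer effective steps, which is a much milder problem than it might first sound.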

Here's to word2vec, a truly optimised algorithm.

P.S. I did not experiment with different ways of dividing up the input material across CUDA thread blocks, as Gupta and Khare do in their paper. That could be one avenue for optimising and improving my code. And of course, proper "online" batching of data into GPU memory, rather than loading it all in at once and then doing all the computation at once on the GPU, would be a significant improvement one would want in a real production system.

Monday, 30 August 2021

Simple Wikipedia parsing

I'm still not sure how to completely parse Wikipedia data, but I think it involves installing a local copy of the MediaWiki software and a proper database, so that you can deal with all the special syntax, links to other pages, categories, templates etc.

However, just extracting the text of Wikipedia pages seems to be pretty easy. One way is to use gensim's Wikipedia utility:

from gensim.scripts import segment_wiki

segment_wiki.segment_and_write_all_articles(full_path, full_out_path)

Here "full_path" is the path to a Wikipedia XML dump file, which is bz2-compressed XML, such as one of the .bz2 files from here. This will turn the XML dump into a .json.gz dump which is a little bit easier to work with. You can use gensim's "utils" package to open the .json.gz file and read article by article:

import json
from gensim import utils

with utils.open(json_gz_file, 'rb') as jf:
    for line in jf:
        # one article per line, as JSON
        article = json.loads(line)
        do_something_with(article["title"])
        for section_title, section_text in zip(article['section_titles'],
                                               article['section_texts']):
            do_something_with(section_title)
            do_something_with(section_text)

See here for full example code (albeit with over-simplistic parsing of the resulting text). Note that the code does some other things as well.

Sunday, 8 August 2021

Word2vec and overfitting

After getting my bachelor's in statistics in 2016, I continued studying and finished my master's in late 2020. My master's thesis was titled Word2vec and its application to examining the changes in word contexts over time (direct download link here).

An excerpt from the thesis follows.

[----]

-- Nevertheless, an evaluation of the "closeness" or "packed-togetherness" of the word2vec clustering can be attempted. It turns out that models trained with different dimensionalities of embeddings, all other parameters and the data set being the same, end up with embeddings where the average similarity between each word and its nearest neighbour, as measured by cosine similarity, differs. More precisely, given a model with embedding vectors $v_i$, denote by $\hat{v}_i$ the embedding vector that is closest to $v_i$ as measured by cosine similarity, i.e. $\hat{v}_i := \arg \max_{v_j,\, j \neq i} \mathrm{cos}(v_i, v_j)$. We examine how the distribution of the nearest-neighbour similarity $\mathrm{cos}(v_i, \hat{v}_i)$ varies for models trained with different dimensionalities of embeddings, the source material and all other hyperparameters being the same.

For an arbitrarily selected year, namely year 2014 of the Yle corpus, this distribution of closest-neighbour similarity by embedding dimension is depicted in figure 3.8. The average and the central 95-percentile range (i.e. from the 2.5th to the 97.5th percentile) of $\mathrm{cos}(v_i, \hat{v}_i)$ are shown. Note that in the word2vec implementation, the closeness function uses normalised vectors, which means that the maximum cosine similarity is 1.

Figure 3.8: Effect of embedding dimension on the distribution of nearest-neighbour similarity. The models were trained on year 2014 of the Yle corpus.

It is interesting that while the 97.5th percentile of the nearest-neighbour similarity remains fairly high regardless of embedding dimension, it still decreases as the dimension grows, and the 2.5th percentile decreases very significantly, from 0.814 in the 25-dimensional case all the way to 0.409 for the 300-dimensional model. This means that the model is unable to pack the embeddings as tightly as dimensionality grows, which could indicate overfitting. However, it must be noted that while interesting, this measure of spread does not directly inform us of how well each of the models depicted will perform on the actual task they were trained for.
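The nearest-neighbour similarity distribution itself is straightforward to compute; here is a rough sketch (not from the thesis; the file name is made up) using gensim and numpy, which works as long as the pairwise similarity matrix fits in memory - for a large vocabulary one would process it in chunks:

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load("yle2014_w2v.kv")   # any trained word2vec vectors

V = kv.get_normed_vectors()       # shape (vocabulary size, D); unit length,
                                  # so dot products are cosine similarities
sims = V @ V.T                    # all pairwise cosine similarities
np.fill_diagonal(sims, -np.inf)   # exclude each word itself
nearest = sims.max(axis=1)        # cos(v_i, v_hat_i) for every word i

print(nearest.mean(), np.percentile(nearest, [2.5, 97.5]))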

[----]

It appears from figure 3.10 that the model begins to overfit as dimensionality grows beyond 100--150. Overfitting, of course, is traditionally defined as related to the generalisation error of a model, i.e. a model's ability to produce consistently decent predictions when faced with new data [43, 15]. As such, the concept of overfitting does not completely apply to word2vec, since the word2vec model is not used for predictions. However, overfitting refers more generally to a situation where a model of too high complexity is used, such that the model can perform well on the training data simply by memorising it. This can clearly occur with the word2vec model, which has a very large number of parameters, regardless of whether said model is then used for predictions or not. The non-predictive nature of the model, then, merely makes it more tricky to ascertain whether overfitting has occurred.

Figure 3.10: Accuracy on analogy task by embedding dimension: five arbitrarily selected data sets.


[15] Dietterich, T. (1995). Overfitting and undercomputing in machine learning. ACM computing surveys (CSUR), 27(3), 326–327.

[43] Kearns, M., Mansour, Y., Ng, A. Y., & Ron, D. (1997). An experimental and theoretical comparison of model selection methods. Machine Learning, 27(1), 7–50.