Monday 30 August 2021

Simple Wikipedia parsing

I'm still not sure how to completely parse Wikipedia data, but I think it involves installing a local copy of the MediaWiki software and a proper database, so that you can deal with all the special syntax, links to other pages, categories, templates etc.

However, just extracting the text of Wikipedia pages seems to be pretty easy. One way is to use gensim's Wikipedia utility:

from gensim.scripts import segment_wiki

# Convert a bz2-compressed Wikipedia XML dump into gzipped JSON, one article per line.
segment_wiki.segment_and_write_all_articles(full_path, full_out_path)

Here "full_path" is the path to a Wikipedia XML dump file, which is bz2-compressed XML, such as one of the .bz2 files from here. This will turn the XML dump into a .json.gz dump which is a little bit easier to work with. You can use gensim's "utils" package to open the .json.gz file and read article by article:

import json

from gensim import utils

with utils.open(json_gz_file, 'rb') as jf:
    for line in jf:
        # Each line is one article as a JSON record.
        article = json.loads(line)
        do_something_with(article['title'])
        # Section titles and texts are parallel lists.
        for section_title, section_text in zip(article['section_titles'],
                                               article['section_texts']):
            do_something_with(section_title)
            do_something_with(section_text)

See here for full example code (albeit with over-simplistic parsing of the resulting text); note that the code also does some other, unrelated things.
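
Incidentally, the same conversion can also be run from the command line, without writing any Python; something like this should work (the dump file name is just an example):

python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz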

Sunday 8 August 2021

Word2vec and overfitting

After getting my bachelor's in statistics in 2016, I continued studying and finished my master's in late 2020. My master's thesis was titled Word2vec and its application to examining the changes in word contexts over time (direct download link here).

An excerpt from the thesis follows.

[----]

-- Nevertheless, an evaluation of the "closeness" or "packed-togetherness" of the word2vec clustering can be attempted. It turns out that models trained with different dimensionalities of embeddings, all other parameters and the data set being the same, end up with embeddings in which the average similarity between each word and its nearest neighbour, as measured by cosine similarity, differs. More precisely, given a model with embedding vectors $v_i$, denote by $\hat{v}_i$ the embedding vector closest to $v_i$ as measured by cosine similarity, i.e. $\hat{v}_i := \arg \max_{v_j,\, j \neq i} \mathrm{cos}(v_i, v_j)$, and write $s_i := \mathrm{cos}(v_i, \hat{v}_i)$ for the resulting nearest-neighbour similarity. We examine how the distribution of $s_i$ varies for models trained with different dimensionalities of embeddings, the source material and all other hyperparameters being the same.

For an arbitrarily selected year, namely Yle corpus 2014, this distribution of nearest-neighbour similarity by embedding dimension is depicted in figure 3.8. The average and the central 95% range (i.e. from the 2.5th to the 97.5th percentile) of $s_i$ are shown. Note that the word2vec implementation computes this closeness on normalised vectors, so the maximum possible cosine similarity is 1.


Figure 3.8: Effect of embedding dimension on the distribution of nearest-neighbour similarity. The models were trained on year 2014 of the Yle corpus.

It is interesting that while the 97.5th percentile of the nearest-neighbour similarity remains fairly high regardless of embedding dimension, it still decreases as the dimension grows, and the 2.5th percentile decreases very significantly, from 0.814 in the 25-dimensional case all the way down to 0.409 for the 300-dimensional model. In other words, as the dimensionality grows, the model becomes unable to pack the embeddings as tightly, which could indicate overfitting. However, it must be noted that while interesting, this measure of spread does not directly tell us how well each of the models depicted will perform on the actual task it was trained for.

[----]
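
As an aside from the excerpt: these nearest-neighbour similarity statistics are easy to compute for a trained gensim model. A minimal numpy sketch follows; the model file name is a hypothetical placeholder, and the chunking is only there to avoid materialising the full similarity matrix at once:

import numpy as np

from gensim.models import Word2Vec

model = Word2Vec.load('yle_2014.model')  # hypothetical file name

# L2-normalised embedding matrix; dot products are then cosine similarities.
vectors = model.wv.get_normed_vectors()

def nearest_neighbour_similarities(vectors, chunk=1024):
    sims = np.empty(len(vectors))
    for start in range(0, len(vectors), chunk):
        # Cosine similarities between this chunk of words and every vector.
        block = vectors[start:start + chunk] @ vectors.T
        rows = np.arange(block.shape[0])
        block[rows, start + rows] = -np.inf  # exclude each word's self-similarity
        sims[start:start + chunk] = block.max(axis=1)
    return sims

sims = nearest_neighbour_similarities(vectors)
print(sims.mean(), np.percentile(sims, [2.5, 97.5]))

The printed values correspond to the mean and the central 95% range of $s_i$ shown in figure 3.8.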

It appears from figure 3.10 that the model begins to overfit as dimensionality grows beyond 100--150. Overfitting, of course, is traditionally defined in terms of the generalisation error of a model, i.e. a model's ability to produce consistently decent predictions when faced with new data [43, 15]. As such, the concept does not apply to word2vec wholesale, since the word2vec model is not used for predictions. However, overfitting refers more generally to a situation where a model of excessive complexity is used, such that it can perform well on the training data simply by memorising it. This can clearly occur with the word2vec model, which has a very large number of parameters, regardless of whether the model is then used for predictions or not. The non-predictive nature of the model merely makes it trickier to ascertain whether overfitting has occurred.

Figure 3.10: Accuracy on analogy task by embedding dimension: five arbitrarily selected data sets.
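
For reference, gensim has a built-in evaluator for this kind of analogy task, so accuracy by embedding dimension can be tabulated in a few lines. A sketch; the model file names are made up, and gensim's bundled English questions-words.txt merely stands in for an analogy set in the corpus language:

from gensim.models import Word2Vec
from gensim.test.utils import datapath

for dim in (25, 50, 100, 150, 200, 300):
    model = Word2Vec.load(f'yle_2014_dim{dim}.model')  # hypothetical file names
    # evaluate_word_analogies returns (overall accuracy, per-section details).
    accuracy, sections = model.wv.evaluate_word_analogies(
        datapath('questions-words.txt'))
    print(dim, accuracy)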


[15] Dietterich, T. (1995). Overfitting and undercomputing in machine learning. ACM Computing Surveys, 27(3), 326–327.

[43] Kearns, M., Mansour, Y., Ng, A. Y., & Ron, D. (1997). An experimental and theoretical comparison of model selection methods. Machine Learning, 27(1), 7–50.