Monday 30 August 2021

Simple Wikipedia parsing

I'm still not sure how to completely parse Wikipedia data, but I think it involves installing a local copy of the MediaWiki software and a proper database, so that you can deal with all the special syntax: links to other pages, categories, templates, etc.

However, just extracting the text of Wikipedia pages seems to be pretty easy. One way is to use gensim's Wikipedia utility:

from gensim.scripts import segment_wiki

# full_path: the bz2-compressed XML dump; full_out_path: where to write the .json.gz output
segment_wiki.segment_and_write_all_articles(full_path, full_out_path)

Here "full_path" is the path to a Wikipedia XML dump file, which is bz2-compressed XML, such as one of the .bz2 files from here. This will turn the XML dump into a .json.gz dump which is a little bit easier to work with. You can use gensim's "utils" package to open the .json.gz file and read article by article:

import json
from gensim import utils

with utils.open(json_gz_file, 'rb') as jf:
    for line in jf:
        # each line is one article, serialised as a JSON object
        article = json.loads(line)
        do_something_with(article['title'])
        for section_title, section_text in zip(article['section_titles'],
                                               article['section_texts']):
            do_something_with(section_title)
            do_something_with(section_text)

See here for full example code (albeit with over-simplistic parsing of the resulting text). Note that the code does some other things as well.
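
For completeness, here's a minimal self-contained sketch that strings the two steps together. The file names are made up for illustration, and all it does is count articles and sections, but it shows the whole pipeline:

import json

from gensim import utils
from gensim.scripts import segment_wiki

dump_path = "enwiki-latest-pages-articles.xml.bz2"  # hypothetical dump file
json_gz_path = "enwiki-latest.json.gz"              # hypothetical output file

# Step 1: convert the bz2 XML dump into one JSON object per line, gzipped
# (this takes a while for the full English dump).
segment_wiki.segment_and_write_all_articles(dump_path, json_gz_path)

# Step 2: stream the articles back and tally them.
n_articles = 0
n_sections = 0
with utils.open(json_gz_path, 'rb') as jf:
    for line in jf:
        article = json.loads(line)
        n_articles += 1
        n_sections += len(article['section_titles'])

print(f"{n_articles} articles, {n_sections} sections")

If I remember right, segment_wiki also has a command-line entry point (python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz) that does the same conversion as step 1.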
