Wikipedia Data Science: Working with the World’s Largest Encyclopedia
Finding and Downloading Data Programmatically
The first step in any data science project is accessing your data! While we could make individual requests to Wikipedia pages and scrape the results, we’d quickly run into rate limits and unnecessarily tax Wikipedia’s servers. Instead, we can access a dump of all of Wikipedia.
The English version is at dumps.wikimedia.org/enwiki. We can view the available versions of the database using the following code:
import requests
# Library for parsing HTML
from bs4 import BeautifulSoup

base_url = 'https://dumps.wikimedia.org/enwiki/'
index = requests.get(base_url).text
soup_index = BeautifulSoup(index, 'html.parser')

# Find the links on the page
dumps = [a['href'] for a in soup_index.find_all('a') if a.has_attr('href')]
dumps
['../', '20180620/', '20180701/', '20180720/', '20180801/', '20180820/', '20180901/', '20180920/', 'latest/']
This code makes use of the BeautifulSoup library for parsing HTML. Given that HTML is the standard markup language for web pages, this is an invaluable library for working with web data.
For this project, we’ll use the dump from September 1, 2018 (some of the dumps are incomplete, so make sure to choose one with the data you need). To find all the files available in that dump, we use the following code:
dump_url = base_url + '20180901/'
# Retrieve the html
dump_html = requests.get(dump_url).text

# Convert to a soup
soup_dump = BeautifulSoup(dump_html, 'html.parser')

# Find list elements with the class file
soup_dump.find_all('li', {'class': 'file'})[:3]
[<li class="file"><a href="/enwiki/20180901/enwiki-20180901-pages-articles-multistream.xml.bz2">enwiki-20180901-pages-articles-multistream.xml.bz2</a> 15.2 GB</li>, <li class="file"><a href="/enwiki/20180901/enwiki-20180901-pages-articles-multistream-index.txt.bz2">enwiki-20180901-pages-articles-multistream-index.txt.bz2</a> 195.6 MB</li>, <li class="file"><a href="/enwiki/20180901/enwiki-20180901-pages-meta-history1.xml-p10p2101.7z">enwiki-20180901-pages-meta-history1.xml-p10p2101.7z</a> 320.6 MB</li>]
Again, we parse the webpage using BeautifulSoup to find the files. We could go to https://dumps.wikimedia.org/enwiki/20180901/ and look for the files to download manually, but that would be inefficient. Knowing how to parse HTML and interact with websites programmatically is an extremely useful skill considering how much data is on the web. Learn a little web scraping and vast new data sources become accessible. (Here’s a tutorial to get you started.)
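To make those list elements easier to work with later, we can pull each file’s name and reported size into a plain Python list. Below is a minimal sketch that reuses the soup_dump object from the code above; splitting on whitespace is an assumption based on the layout shown in the output.

# Collect (file name, reported size) for every file listed on the dump page
files = []
for li in soup_dump.find_all('li', {'class': 'file'}):
    parts = li.text.strip().split()
    files.append((parts[0], ' '.join(parts[1:])))

print(f'Found {len(files)} files in the dump.')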
Deciding What to Download
The above code finds all of the files in the dump. This includes several options for download: the current version of only the articles, the articles along with the current discussion, or the articles along with all past edits and discussion. If we go with the latter option, we are looking at several terabytes of data! For this project, we’ll stick to the most recent version of only the articles. This page is useful for determining which files to get given your needs.
The current version of all the articles is available as a single file. However, if we get the single file, then when we parse it, we’ll be stuck going through all the articles sequentially — one at a time — a very inefficient approach. A better option is to download partitioned files, each of which contains a subset of the articles. Then, as we’ll see, we can parse through multiple files at a time through parallelization, speeding up the process significantly.
When I’m dealing with files, I would rather have many small files than one large file because then I can parallelize operations on the files.
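As a concrete illustration, here is one way to pick out just the partitioned article files from the files list built above. This is a sketch rather than a definitive filter: the regular expression reflects the typical partition naming convention (for example, enwiki-20180901-pages-articles1.xml-p10p30302.bz2) and deliberately skips the single combined file and the multistream variants.

import re

# Keep only bz2-compressed, partitioned pages-articles dumps.
# Assumes `files` is the list of (name, size) pairs collected earlier.
partition_pattern = re.compile(r'pages-articles\d+\.xml-p\d+p\d+\.bz2$')

files_to_download = [name for name, size in files
                     if partition_pattern.search(name)]

print(f'{len(files_to_download)} partitions to download.')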
The partitioned files are available as bz2-compressed XML (eXtensible Markup Language). Each partition is around 300–400 MB, for a total compressed size of 15.4 GB. We won’t need to decompress the files, but if you choose to do so, the full size is around 58 GB. That actually doesn’t seem too large for all of human knowledge! (Okay, not all knowledge, but still.)
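Because bz2 is a stream-oriented format, we can also read a compressed partition line by line without ever writing the decompressed XML to disk. Here is a minimal sketch using Python’s built-in bz2 module; the file name is just an example partition.

import bz2

# Example partition name; substitute one of the files you downloaded
path = 'enwiki-20180901-pages-articles1.xml-p10p30302.bz2'

# Open the compressed file in text mode and peek at the first few lines of XML
with bz2.open(path, 'rt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        print(line.rstrip())
        if i >= 5:
            break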
Downloading Files
To actually download the files, the Keras utility get_file is extremely useful. It downloads a file from a URL and saves it to disk.
from keras.utils import get_file

# Download the file at `url` and save it to disk under the name `file`
saved_file_path = get_file(file, url)
The files are saved in ~/.keras/datasets/, the default save location for Keras. Downloading all of the files one at a time takes a little over 2 hours. (You can try to download in parallel, but I ran into rate limits when I tried to make multiple requests at the same time.)
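Putting the pieces together, a sequential download loop might look like the sketch below. It assumes the files_to_download list and dump_url from the earlier sketches, and it skips any partition already on disk so the loop can be re-run safely.

import os
from keras.utils import get_file

# Default Keras download directory
keras_home = os.path.expanduser('~/.keras/datasets/')

for name in files_to_download:
    # Skip partitions that have already been downloaded
    if not os.path.exists(os.path.join(keras_home, name)):
        # Saves the file as ~/.keras/datasets/<name>
        get_file(name, dump_url + name)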