[Wiktionary] Importing wiki XML dumps into MySQL

By pim on Monday 25 March 2013 08:36

Update: This tutorial is outdated.
The only way to get the right SQL tables is to install a local version of MediaWiki:
https://www.mediawiki.org/wiki/Manual:Installation_guide

Try this link: https://www.xarg.org/2016...ire-wikipedia-into-mysql/
Or this one: http://www.joe0.com/2013/...articles-multistream-xml/

-----------------------------
I was trying to import an XML dump of Wiktionary into MySQL. It all worked well until I tried some non-Latin languages such as Arabic and Chinese: the character encoding got messed up.

This post is mainly a reminder for myself, and maybe useful to other people who are struggling to import Wikipedia XML dumps as UTF-8 into MySQL. These are the steps I took on a WAMP (Windows, Apache, MySQL, PHP) configuration.

For example, for importing the Dutch Wiktionary:
1. Create a database; in my case I created `wiktionarynl` (nl is the language code for Dutch).
2. Create the tables using MediaWiki's "SQL to create the initial tables for the MediaWiki database" script: https://phabricator.wikim...er/maintenance/tables.sql

Run the following queries to make sure the `page` and `text` tables use utf8 with the utf8_bin collation (they are truncated first, so do this before importing):
TRUNCATE TABLE `page`;
TRUNCATE TABLE `text`;
ALTER TABLE `page` DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
ALTER TABLE `page` CHANGE `page_title` `page_title` VARCHAR( 255 ) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;
ALTER TABLE `text` DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
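
If you create the database itself with utf8 defaults in step 1, newly created tables inherit the right character set as well. A minimal sketch (the database name is just the one from this example):

-- give the database utf8/utf8_bin defaults so new tables inherit them
CREATE DATABASE wiktionarynl DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;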

3. Change the unique page_title index on the `page` table into a regular index (otherwise you can get 'duplicate entry' errors while importing the data); see the sketch below.
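
In MediaWiki's tables.sql the unique index on `page` is typically called name_title and covers (page_namespace, page_title); check SHOW INDEX FROM `page` if it is named differently in your version. Roughly:

-- drop the unique index and re-add it as a regular (non-unique) index
ALTER TABLE `page` DROP INDEX name_title;
ALTER TABLE `page` ADD INDEX name_title (page_namespace, page_title);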

4. Download the Dutch XML dump. Go to http://dumps.wikimedia.org/backup-index.html and search for the link 'nlwiktionary'. Click it, and on that page look for pages-articles.xml.bz2. Download it and extract the XML file.

5. Download and open the wiki XML importer mwdumper.jar.

6. Run mwdumper.jar, select the Dutch .xml file, set the database to wiktionarynl and start the import.

Now you should have a database where the tables page, text and revision contain data.
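
A quick sanity check that the import actually filled the tables (exact counts will differ per dump):

-- row counts should be well above zero after the import
SELECT COUNT(*) FROM `page`;
SELECT COUNT(*) FROM `revision`;
SELECT COUNT(*) FROM `text`;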

7. The titles of pages are stored in the table 'page', while the article text itself is stored in the table 'text'. The table 'revision' links the two and keeps track of which row in 'text' holds the latest version of a page.
To get the text that belongs to a record in the page table, do the following:

PHP:
// Connect to the database that holds the imported dump (credentials are just example values, adjust to your setup).
$db = mysql_connect('localhost', 'root', '');
mysql_select_db('wiktionarynl', $db);
mysql_set_charset('utf8', $db);

// Get a page row by its title.
$q = "SELECT * FROM `page` WHERE `page_title` = 'Amsterdam'";
$results_page = mysql_query($q);
$row_page = mysql_fetch_array($results_page);

// page_latest points to the current revision; look it up in the revision table to get rev_text_id.
$q = "SELECT * FROM `revision` WHERE `rev_id` = ".intval($row_page['page_latest'])." LIMIT 1";
$results_revision = mysql_query($q);
$row_revision = mysql_fetch_array($results_revision);

// rev_text_id matches old_id in the text table, which holds the raw wiki markup.
$q = "SELECT * FROM `text` WHERE `old_id` = ".intval($row_revision['rev_text_id'])." LIMIT 1";
$results_text = mysql_query($q);
$row_text = mysql_fetch_array($results_text);

echo '<li>'.$row_page['page_title'];
echo '<li>'.$row_text['old_text'];


With this script you can retrieve thousands to millions of page titles plus the corresponding raw wiki markup.
The next step would probably be to convert the raw wiki markup to plain text or html.
I tried several converters, but all of them seem to be buggy. Even if they work reasonably well for the English and Dutch wikis, they don't work well for other languages like Vietnamese.
However, Wikipedia itself shows the data without bugs in 100 different languages, so somewhere in their code the conversion is done properly.

I parsed the data for 20 different languages, and ended up with a lot of corrupted articles that still contained wiki markup.
For most articles I only needed the first paragraph, but many articles start with the 'infobox' that you often find on the right.

As a solution to get clean data, I used Wikipedia's API.
For example, for the 'nl' page for 'Amsterdam':
http://nl.wikipedia.org/w...=json&prop=text&section=0

Get the HTML text out of the JSON data:
$json = json_decode($jsondata);
$content = $json->{'parse'}->{'text'}->{'*'};

And then the regular expression <p>(.*?)</p> is enough to extract the first paragraph without any problems. They don't like it if you scrape their whole website though; that's what the wiki dumps are for.
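
Put together, a minimal sketch of this API approach could look like the code below. The full api.php URL is an assumption based on the standard MediaWiki parse endpoint (action=parse, prop=text, section=0, format=json); adjust the language subdomain and page title as needed.

// Fetch the intro section of an article via the MediaWiki API and print its first paragraph.
// Assumes allow_url_fopen is enabled; otherwise use cURL.
$page = 'Amsterdam';
$url = 'http://nl.wikipedia.org/w/api.php?action=parse'
     . '&page='.urlencode($page)
     . '&format=json&prop=text&section=0';
$jsondata = file_get_contents($url);

$json = json_decode($jsondata);
$content = $json->{'parse'}->{'text'}->{'*'};

// The first <p>...</p> block is the first paragraph of the rendered article.
if (preg_match('/<p>(.*?)<\/p>/s', $content, $matches)) {
    echo strip_tags($matches[1]);
}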