DBpedia Abstract Extraction step by step guide
- MySQL
- PHP with xml and apc
- Scala
- Maven
- MediaWiki
- Web server (nginx seems to perform much better than Apache)
Please download or pull the extraction framework using git:
git clone https://github.com/dbpedia/extraction-framework.git
If you want to download the DBpedia dump files then please do as follows:
cd dump
../clean-install-run download config=download.minimal.properties
The extraction framework already ships some configuration files (e.g. download.minimal.properties). Customize the file according to your needs and run the above command to download the dumps you need. The download configuration file (i.e. download.minimal.properties) has a property named base-dir, which specifies the directory where the dump files will be stored. The DBpedia extraction framework uses the following structure when storing dump files:
/path_to_download_folder/yyyymmdd/[language_code]wiki-yyyymmdd-pages-articles.xml.bz2
NOTE: If you have already downloaded the pages-articles dump manually (without using the DBpedia extraction framework), you can skip this step. In that case, make sure the directory structure and naming convention above have been followed; if not, create the directory structure manually.
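If you downloaded the dump by hand, a small check like the following can confirm the file sits where the framework is said to look for it. This is only a sketch: dump_path and check_dump are hypothetical helper names, the file name is assumed to follow the usual Wikimedia pattern [language_code]wiki-yyyymmdd-pages-articles.xml.bz2, and the directory layout is taken from the description above. Your framework version may nest the files differently, so adjust the format string if needed.

```shell
# Hypothetical helper (not part of the framework): build the dump path
# described above from a base directory, language code and dump date.
dump_path() {
    printf '%s/%s/%swiki-%s-pages-articles.xml.bz2\n' "$1" "$3" "$2" "$3"
}

# Report whether the expected dump file is present under the base directory.
check_dump() {
    path=$(dump_path "$1" "$2" "$3")
    if [ -f "$path" ]; then
        echo "found: $path"
    else
        echo "missing: $path" >&2
        return 1
    fi
}
```

For example, check_dump /path_to_download_folder en 20120601 reports whether the English dump for that date is in place.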
You need to install MySQL, PHP, Apache and other software.
- step 1: install MySQL
- step 2: open the my.cnf file (in the MySQL root directory if installed by hand, or in /etc/mysql/ if installed from Ubuntu packages)
- step 3: add these parameters to the [mysqld] section to use utf8 encoding by default: character-set-server=utf8 and skip-character-set-client-handshake
- step 4: change max_allowed_packet=16M to max_allowed_packet=1G
- step 5: change key_buffer=16M to key_buffer=1G
- step 6: change query_cache_size=16M to query_cache_size=1G
The next steps are only for those who installed MySQL by hand. If you installed MySQL from the repositories of your Linux distribution, you can skip them.
- step 7: set the socket parameter to $MYDIR/mysqld.sock
- step 8: set the datadir parameter to $MYDIR/data
- step 9: open your ~/.bashrc file and add: export MYDIR=/path/where/you/installed/mysql
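To review the settings from the steps above in one place, a helper like this prints them as a [mysqld] fragment. It is a sketch only: print_mysqld_settings is a hypothetical name, and it writes the expanded install path into the socket and datadir lines on the assumption that my.cnf does not expand shell variables such as $MYDIR. It prints to standard output and does not modify my.cnf.

```shell
# Hypothetical helper: print the [mysqld] settings from steps 3-8.
# Pass the MySQL install directory to also emit the socket and datadir
# lines (steps 7-8); omit it for a distribution-packaged MySQL.
print_mysqld_settings() {
    mydir=${1:-}
    cat <<'EOF'
[mysqld]
character-set-server=utf8
skip-character-set-client-handshake
max_allowed_packet=1G
key_buffer=1G
query_cache_size=1G
EOF
    if [ -n "$mydir" ]; then
        printf 'socket=%s/mysqld.sock\n' "$mydir"
        printf 'datadir=%s/data\n' "$mydir"
    fi
}
```

Running print_mysqld_settings "$MYDIR" gives a fragment you can compare against your my.cnf before editing it by hand.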
Now you need to install PHP and Apache. Installing these tools is out of scope for this guide; please refer to their documentation. (Note: php5-mysql is missing from the lamp-server^ metapackage in Ubuntu.)
You should also install php-xml and php-apc to avoid some errors and performance issues described later in this document.
NOTE: On some Linux/Unix distros php-apc might be named php-pecl-apc.
There is also a script that may be used for this setup. It has not been tested, but should work:
Finally, download MediaWiki from http://www.mediawiki.org/wiki/Download . It is recommended to use the latest stable release. (Note: I have tried most of the 1.2x versions. I would recommend the MediaWiki 1.19.11 legacy LTS release since it seems to work best; early 1.2x releases seem to work too, but they might require some changes.)
You can also download the latest version from git (Note: the current Git version does not work, do not use it):
git clone https://gerrit.wikimedia.org/r/p/mediawiki/core.git
In order to generate clean abstracts from Wikipedia articles one needs to render wiki templates as they would be rendered in the original Wikipedia instance. So in order for the DBpedia Abstract Extractor to work, a running MediaWiki instance with Wikipedia data in a MySQL database is necessary.
To import the data, you need to run the Scala import launcher.
Before importing, you have to adapt the settings for the import launcher in dump/pom.xml as below:
(Note: dump/pom.xml may be found in extraction-framework/pom.xml)
<launcher>
    <id>import</id>
    <mainClass>org.dbpedia.extraction.dump.sql.Import</mainClass>
    <jvmArgs>
        <jvmArg>-server</jvmArg>
    </jvmArgs>
    <args>
        <arg>path_to_download_folder</arg>
        <arg>/path_to_wikimedia_parent_dir/mediawiki/maintenance/tables.sql</arg>
        <!-- inside pom.xml, "&" in the JDBC URL must be written as "&amp;" -->
        <arg>jdbc:mysql://machine_name:mysql_port/?characterEncoding=UTF-8&amp;user=myuser&amp;password=mypass</arg>
        <arg>false</arg><!-- require-download-complete -->
        <arg>language-code</arg><!-- languages and article count ranges, comma-separated -->
    </args>
</launcher>
If you downloaded the dump file manually, set require-download-complete to false, because the marker file that indicates a successful download does not exist in that case.
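Because the JDBC URL lives inside an XML file, each "&" that separates URL parameters has to be written as "&amp;" in pom.xml. The following sketch builds the arg element with that escaping applied. jdbc_arg is a hypothetical helper name, and it escapes only "&", so a password containing "<" or similar characters would need further escaping.

```shell
# Hypothetical helper: print the JDBC <arg> line for dump/pom.xml.
# Arguments: host, port, user, password. Only "&" is XML-escaped.
jdbc_arg() {
    url="jdbc:mysql://$1:$2/?characterEncoding=UTF-8&user=$3&password=$4"
    # In the sed replacement, "\&" is a literal ampersand.
    printf '%s\n' "$url" | sed -e 's/&/\&amp;/g' -e 's/^/<arg>/' -e 's/$/<\/arg>/'
}
```

For example, jdbc_arg machine_name 3306 myuser mypass prints a line that can be pasted into the args section of the import launcher.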
Now, to import the data into MySQL, run:
../clean-install-run import
NOTE: If you get the following error while importing:
ERROR 1283: Column 'si_title' cannot be part of FULLTEXT index
then a collation should be specified for the table searchindex. In the file /path_to_wikimedia_parent_dir/mediawiki/maintenance/tables.sql, change the definition of the searchindex table from ENGINE=MyISAM to ENGINE=MyISAM COLLATE='utf8_general_ci';
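The searchindex change can also be applied with a small filter instead of a manual edit. This is a sketch under the assumption that the table is declared with a line containing both CREATE TABLE and searchindex, and that its closing line contains ENGINE=MyISAM; fix_searchindex is a hypothetical name. The function prints the patched file to standard output and leaves the original untouched.

```shell
# Hypothetical helper: print tables.sql with the searchindex collation fix.
# Only the ENGINE clause of the searchindex CREATE TABLE is changed.
fix_searchindex() {
    awk '
        # Track whether the current CREATE TABLE statement is searchindex.
        /CREATE TABLE/ { in_search = ($0 ~ /searchindex/) }
        in_search && /ENGINE=MyISAM/ && !/COLLATE/ {
            sub(/ENGINE=MyISAM/, "ENGINE=MyISAM COLLATE='\''utf8_general_ci'\''")
            in_search = 0
        }
        { print }
    ' "$1"
}
```

A cautious way to use it is fix_searchindex tables.sql > tables.fixed.sql, followed by a diff of the two files before replacing the original.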
Download MediaWiki & extensions
git clone https://gerrit.wikimedia.org/r/p/mediawiki/core.git mediawiki
cd mediawiki/extensions
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/timeline.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/CharInsert.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/MobileFrontend.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/CategoryTree.git
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/Cite.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Interwiki.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/SyntaxHighlight_GeSHi.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/php/luasandbox.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/InputBox.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/GeoData.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ExpandTemplates.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Babel.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Scribunto.git
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/ParserFunctions.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Poem.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/TextExtracts.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/ImageMap.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Math.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/wikihiero.git
git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Mantle.git
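The repeated clone commands can be condensed into a loop. The sketch below assumes every extension lives under the same Gerrit base URL used in most of the commands above; luasandbox sits under a different path (mediawiki/php) and still has to be cloned separately. clone_extensions and the DRY_RUN variable are hypothetical names. By default the function only prints the commands so that you can review them first.

```shell
# Hypothetical helper: print (or run) the clone command for each extension.
# Set DRY_RUN=0 to actually clone; the default only prints the commands.
clone_extensions() {
    base=https://gerrit.wikimedia.org/r/p/mediawiki/extensions
    for name in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "git clone $base/$name.git"
        else
            # Keep going on failure, but report which extension failed.
            git clone "$base/$name.git" || echo "failed: $name" >&2
        fi
    done
}
```

Run from mediawiki/extensions, for example: DRY_RUN=0 clone_extensions timeline CharInsert Cite Scribunto ParserFunctions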
Set up MediaWiki
You need to adjust your LocalSettings.php according to https://github.com/dbpedia/extraction-framework/blob/master/dump/src/main/mediawiki/LocalSettings.php.
To make Lua faster, please read the Scribunto instructions: http://www.mediawiki.org/wiki/Extension:Scribunto
Configure your MediaWiki directory as a web directory by adding the following configuration to the Apache httpd.conf:
Alias /mediawiki /path_to_mediawiki_parent_dir/mediawiki
<Directory /path_to_mediawiki_parent_dir/mediawiki>
    # Apache 2.2 syntax; on Apache 2.4 use "Require all granted" instead
    Allow from all
</Directory>
Now visit the following URL with your browser
http://localhost/mediawiki/api.php?uselang=en
If you get usage instructions in your browser, the MediaWiki configuration is correct and you can move to the next step.
If you do not get usage information, resolve each error and check the URL again until you get a valid web page.
You can also check the Apache error log for further details on how to troubleshoot the errors that appear.
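The browser check can also be scripted, which is handy on a headless server. This is a sketch that assumes curl is installed and that an HTTP 200 response from api.php is a good enough sign that MediaWiki is up; it does not inspect the page content. check_mediawiki_api is a hypothetical name.

```shell
# Hypothetical helper: succeed only if api.php answers with HTTP 200.
# The URL defaults to the one used in this guide.
check_mediawiki_api() {
    url=${1:-http://localhost/mediawiki/api.php?uselang=en}
    # curl prints the status code; treat any curl failure as status 000.
    status=$(curl -s -o /dev/null -w '%{http_code}' "$url") || status=000
    if [ "$status" = "200" ]; then
        echo "ok: $url"
    else
        echo "failed (HTTP $status): $url" >&2
        return 1
    fi
}
```

If the check fails, the web server error log is still the place to look for the actual cause.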
Below is a list of possible errors together with some solutions:
- Class 'DOMDocument' not found in LocalisationCache.php
To solve this you need to install the php-xml module as specified in Install required software.
- Set $wgShowExceptionDetails = true; in LocalSettings.php
Change LocalSettings.php as suggested. This setting makes MediaWiki show full debugging information.
- CACHE_ACCEL requested but no suitable object cache is present. You may want to install APC.
Example backtrace:
Backtrace:
#0 [internal function]: ObjectCache::newAccelerator(Array)
#1 /mnt/ebs/framework/media_wiki/wikimedia/includes/objectcache/ObjectCache.php(85): call_user_func('ObjectCache::ne...', Array)
#2 /mnt/ebs/framework/media_wiki/wikimedia/includes/objectcache/ObjectCache.php(72): ObjectCache::newFromParams(Array)
#3 /mnt/ebs/framework/media_wiki/wikimedia/includes/objectcache/ObjectCache.php(44): ObjectCache::newFromId(3)
#4 /mnt/ebs/framework/media_wiki/wikimedia/includes/GlobalFunctions.php(3780): ObjectCache::getInstance(3)
#5 /mnt/ebs/framework/media_wiki/wikimedia/includes/Setup.php(464): wfGetMainCache()
#6 /mnt/ebs/framework/media_wiki/wikimedia/includes/WebStart.php(157): require_once('/mnt/ebs/framew...')
#7 /mnt/ebs/framework/media_wiki/wikimedia/api.php(47): require('/mnt/ebs/framew...')
#8 {main}
This means you have not installed php-apc, a PHP accelerator (opcode cache) that speeds up the process about 4-5 times.
If you really do not want to use php-apc, set $wgMainCacheType = CACHE_ANYTHING; in LocalSettings.php (not recommended).
Execute the following command after making the appropriate changes to the extraction.abstracts.properties configuration file:
../clean-install-run extraction extraction.abstracts.properties
Install nginx server
# For Debian-based systems run
sudo apt-get install nginx nginx-extras lua-nginx-memcached php5-fpm
If you use the LuaSandbox option for the Scribunto MediaWiki extension (recommended), keep in mind that the nginx/php-fpm php.ini file is located at /etc/php5/fpm/php.ini.
Add the following configuration in /etc/nginx/sites-enabled. Change the port to e.g. 81 if you have another server running on 80, and place the MediaWiki installation in a subfolder.
server {
    listen 81 default_server;
    listen [::]:81 default_server ipv6only=on;
    root /var/www/abstracts;
    index index.html index.htm index.php;
    # Make site accessible from http://localhost/
    server_name localhost;
    client_max_body_size 5m;
    client_body_timeout 60;
    location / {
        try_files $uri $uri/ @rewrite;
    }
    location @rewrite {
        rewrite ^/(.*)$ /index.php?title=$1&$args;
    }
    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_index index.php;
        try_files $uri =404;
        fastcgi_split_path_info ^(.+\.php)(/.+)$;
        fastcgi_pass unix:/var/run/php5-fpm.sock;
        fastcgi_buffers 32 16k;
    }
}
Here are a few more notes I took when I ran the abstract extraction in summer 2012.
If possible, use the MySQL, PHP and MediaWiki versions shown at Special:Version. This is probably most important for MediaWiki, not so much for MySQL and PHP.
The default MySQL installation of Ubuntu didn't work for me. MySQL bug 34981 caused problems. I don't remember exactly what else went wrong. In the end, I just downloaded and unzipped the appropriate MySQL version and removed the Ubuntu version because it somehow interfered with my installation.
I also wrote a little script that uses this MySQL installation to create the necessary data directories and start/stop the server. This script is not well documented and not really finished. :-(
Here's the rest of my notes from summer 2012. Version numbers and a few other things may have changed by now.
Clone production version of MediaWiki in your projects folder:
- mkdir mediawiki
- cd mediawiki/
- git clone https://gerrit.wikimedia.org/r/p/mediawiki/core.git
- cd core/
- git branch -r # list branches / tags
- git checkout origin/wmf/1.20wmf4 # current tag as of 2012-06-13
- git submodule update --init # gets all the extensions
Install MySQL, create tables:
- DO NOT install the Ubuntu MySQL package. If it is installed, remove it. (Note: The bug has been fixed in newer versions of Ubuntu and Debian. You don't need to follow these steps anymore and can just use the version from the packages.)
- Install MySQL 5.1.63 or a similar version: Download and unzip the tarball.
- ./mysql.sh install
- ./mysql.sh run /home/release/mysql test <../mediawiki/core/maintenance/tables.sql
Install Ubuntu packages php, php-cli, php-apc, php-intl
Add symlink in /var/www: ln -s …./mediawiki/core mediawiki
Install Ubuntu package php-mysql. This also installs mysql-common, but that doesn’t seem to interfere with our local MySQL (see above).
Install Ubuntu package ploticus (for Timeline extension)
Add APC setting to php.ini: apc.stat = 0
To make extraction faster, we should try HipHop, the PHP compiler / virtual machine created by Facebook. I tried to follow the instructions at https://github.com/facebook/hiphop-php/wiki/Building-and-installing-on-ubuntu-10.10 but didn't succeed.