GSoC2022_Progress_Celian_RINGWALD

NEW DBPEDIA ABSTRACT EXTRACTOR

Description

DBpedia provides monthly releases produced by the DBpedia Extraction Framework. They are composed of various data artifacts that mainly stem from the wiki dumps. However, some of them also rely on API calls for rendering dynamic contents, which is the case of the DBpedia abstracts. The large amount of data requested from APIs couldn't be extracted entirely within a month today. We suggest solving this issue by a strategy composed of four steps: - a study based on the data recorded during the last abstract extraction - the test and implement the use of the TextExtracts extension and the improvement the error management - the reduction the number of possible calls - the integration into the framework of the possibility to appeal to more than one API Each step of the project will be developed into a new dedicated GitHub branch of the DBpedia extractor framework, which could be documented and used for working on the project.

Link to the project seed : https://forum.dbpedia.org/t/developing-a-new-dbpedia-abstract-extraction-gsoc2022/1620
Link to the proposal : https://summerofcode.withgoogle.com/media/user/8129e10aed83/proposal/d8mMiYASojjvUVPv.pdf
Project Tracker : https://docs.google.com/spreadsheets/d/1kMGiDM71Qz4cZNdw86UfpqIDR6dIrieeeOV8b3PimQk/edit#gid=1703594761
Link to Pull request : https://github.com/dbpedia/extraction-framework/pull/740
Final Report : https://docs.google.com/document/d/10xvukZVeKNA1n_VT_q2pWtuPWznEl2Hz/edit?usp=sharing&ouid=104536663383791851600&rtpof=true&sd=true

Mentors

Mykola Medynskyi
Marvin Hofer
Dimitris Kontokostas

Quick presentation

I am Célian Ringwald, research engineer in charge of the French DBpedia chapter at Inria in the Wimmics team. My topic of research is mainly related to NLP and Semantic Web questions, having an access to abstract of Wikipedia (and more broadly to the textual content of it) trough DBpedia is a very important milestone in my perspective

My working fork and the associated branch

https://github.com/datalogism/extraction-framework

PROGRESS

Bonding period

intiate a working space : github fork + wiki preparation
first play with Wikimedia on docker > https://github.com/datalogism/mediawiki_docker
First meeting : 8^th June

Week 1 (June 19 - June 24)

Kick-off meeting : How to compare it / Metrics repports
- Focus on English and French chapter
- Find a solution for avoiding data traffic jam
- Using Marvin for testing
Done during the week :
- reading documentation on first Mykola/Marvin exploration of the problem 1 / 2
- test to extract with Marvin Framework

Test of Mykola/Marvin summary

Parralel benchmark : parallelizing seems useless due to rate limits
On all the dataset for el, sh, ro ,tr :

extractor	parallel-process	time
nif	1	9h
plain	1	3h40
nif	2	3h55
plain	2	3h
nif	4	4
plain	4	3h38

=> Seem to be faster but is it giving more data ??

Week 2 (June 27 - July 01)

Monday : GSOC Meeting 2
Creation of a script for creating test set of pages based on clickstream for a given lang : https://github.com/datalogism/extraction-framework/blob/gsoc-celian/dump/src/test/bash/create_custom_sample.sh
First TestSuite development for testing abstract extraction : https://github.com/datalogism/extraction-framework/blob/gsoc-celian/dump/src/test/scala/org/dbpedia/extraction/dump/ExtractionTestAbstract2.scala

Week 3 (July 03 - July 11)

Docker media wiki

https://github.com/datalogism/mediawiki_docker > Ok
But could be very long on large wiki > "English wikipedia" need 6months for being loaded ....

Abstract extraction test logs :

output line by lines
count nb of FailedIOException / FailedOutOfMemoryError / FailedNullPointerException

Change of sample strategy :

from the most views pages to the least views
Results are the same > extraction failed after the first 10% wikipages

Open a talk with Wikimedia community :

https://www.mediawiki.org/wiki/Topic:Wypyihe84x1a1okt :

=> Can we have a view of the current rate limit ? NO

=> Are the differents MediaWiki Api endpoint share the same limits ? YES

=> Is the https://en.wikipedia.org/api/rest_v1/ API a best solution ? YES

Solving the parsing error obtained by the big mini-dump extraction

=> javax.xml.stream.XMLStreamException: ParseError at [row,col]:[517851,1]" Message: expected <title>, found => It was caused by specials characters in URIs

First tests on requests parameters : using retry-after / max lag / User agent & gzip options

Week 4 (July 11 - July 17)

Done

Romanian mediawiki clone :
- management of problem due to original language parameter error > NEED TO BE "RO"
- Problem concerning Campaigns extension
I added the 2 following parameters :
- I needed to fix universal config max lag / User agent
test of new API

Week 5 (July 18 - July 22)

Test of the REST API

I created a script that call the RESTAPI on the main dataset that i created by following the guidelines and more specifically the rate limit
I recorded the results here
I aggregated them here : in a nutshell > we got almost everything but we receive some disambiguisation errors (only type of error recorded) > Pb related to redirect pages parameter ?

LOG OF API ANSWERS

I created a dedicated log appender here, this one is logging into the logdir path definded log file composed of json rows, one for each call

Comparison of max-lag and user-agent parameters

First lesson learn : we need to give a user-agent header param > not really a good new because depending of our calling process we can be banished.

ISSUES :

plain abstract problem : we have only a part of the process that are found in the stat for plain abstract
- Reason : we cannot run together both test on plain and HTML without causing it...

Mediawiki experiments

Romanian wiki :
- Id of page given in dump are not the same than the id in the original wiki
- By runing the 1000 sample i freeze my app
- I got 100% success rate on 100 page BUT... i deleted it by mistake (shame on me)
Some fact arround the wikiclone :
- The abstract is generally not victim of it but : Wikipedia is loading external data from Wikidata in Infoboxes here a romanian example and also in Wikibase models As Authority control one
- As mentioned in this page : this is not possible to get it from a mediawiki clone, for getting it we must also mirroring wikidata ....
Using the Mediawiki only as parser :
- Only load Modules (ns 828) and templates (ns 10) into the mediawiki and use it for parsing wikicode !
- For Romanian it took less than 2h for Fr it seems to be less than a day (not finished to load)
- Find here some API answers
- And here the request shape

Week 6 (July 25 - July 29)

Parsing of entire french dump with mediawiki clone done in 4 days
Experiment records : https://docs.google.com/spreadsheets/d/1JCMozvQ7oC_AkDuoCS1ZNlasaaUm8tWg/edit#gid=1770381193
MidWay repport : https://docs.google.com/document/d/101OvYuKvD4o9UPmLvuDkN5hgfupyfpVO/edit#

Week 7 (August 1 - August 5)

Mediawiki clone documentation ok : https://github.com/datalogism/mediawiki_docker
Implementation test of a "retry after" sensitive pipeline > see MediaWikiConnector4.scala
Comparison of the plain text answers of the old and the new APIs
Implementation and test of the gzip parameter

Week 8 (August 8 - August 12)

NEW API IMMPLEMENTATION

MediaWikiConnector3.scala
HTML answer parsing ok > readInAbstractHTML
but problems with parameter "redirect=true" and OutputStreamWriter

Comparison of the HTML code answers

Midway repport - HTML content part
Different structures of HTML
Links parsing problem solved
Test en sample 1000 > 988 abstracts only ok with parsing error : http://dbpedia.org/resource/Javier_Bardem http://dbpedia.org/ontology/abstract "Javier Ángel Encinas Bardem ("},"2":{"wt":"lang"}},"i":0}}]}' id="mwDQ">Spanish: ; born 1 March 1969) is a Spanish actor. Known for his roles in and foreign films, he has received , including an , a , and a . Bardem won the for his performance as the assassin in the ' modern western drama film (2007). He also received critical acclaim for his roles in films such as (1992), (1995), (1997), (2002), and (2004). He has also starred in 's romantic drama (2008), 's spy film (2012), 's drama (2013), 's film (2017), 's mystery drama (2018) and 's science fiction drama (2021). Bardem's other Oscar-nominated performances include 's (2000), 's (2010), and 's (2021). He is the first Spanish actor to be nominated for an Academy Award ( for Before Night Falls in 2001), as well as the first and only Spanish actor to win one ( for in 2008). He is also the recipient of a , two , and six . In January 2018, Bardem became ambassador of for the protection of ."@en .
Adaptation of the HTMLNifExtractor / WikipediaNifExtractor / LinkExtractor

Week 9 (August 22 - August 26)

Parsing error of mw-data included into REST API answer still not fixed
Commit all the changes to https://github.com/datalogism/extraction-framework/tree/gsoc-celian_clean branch

Week 10 (August 15 - August 19)

Benchmark the APIs with different configurations on a sample of 1000 english pages > cf LastTestMadeOnEN sheet
Merging different MWC old API implementations (retry-after, max-lag incrementation process) into MediaWikiConnector2.scala

Week 11 (September 4 - September 9)

Fix the parsing error of the REST API answer by deleting mw-data attributes into getJsoupDoc function of HtmlNifExtractor.scala
Parsing of the new HTML structure is now ok > we are able to run the entire NIF extraction if needed WikipediaNifExtractor2
Clean code of WikipediaNifExtractor by creating an abstract class WikipediaNifExtractor extended for the REST API case into WikipediaNifExtractor2
Clean code of the MediaWikiConnectors by creating an abstract class MediaWikiConnectorAbstract, extended for MWC API and for the REST MWC API
Pushing all API parameters into the config files : extraction.nif.abstracts.properties and extraction.plain.abstracts.properties

Week 12 (September 12 - September 15)

Correction of parsing errors due to data-mw parsing with Jsoup : HTML entites were parsed before as result the simple quote inside the data-mw attributes. I firstly fixed it with a unstable regexpr and i finally found a way to fix it simply by deleting a html placed in the wrong place...
Correction of an other parsing errors in Plain abstract extraction parsing process : in some case ?> header is misteriously return without the first caracters...
Test of every functions on RO, FR and EN language
Cleaning the code and final pull request
End of report writing

What i have done

A benchmark of the different possible API configuration
I created a MediaWiki Clone for creating a local API
I adapted the "old API" to the guidelines allowing us to avoid the rate limits problem we got at the beginning of the project
I also added to the "old API" a better retry-after and an incremental maxlag mechanisms
I implemented a way to use the mediawiki rest API into the DIEF and i solved the different problems related to the new structure of it answer

Possible extension of the current work

Concerning the "Old Wikimedia" API, I didn't implemented a solution integrating the generator (cf rate limit paragraph of the guidelines), because of the DIEF process. Indeed each Wikipedia articles is proceed one by one for enabling parallelization.
I didn't worked at all on a smart update strategy as exposed into the proposal. It is still possible to think a solution taking account of the last release
For the moment the chose of a given API is given by the configuration, we could imagine to dynamically call one API depending on the answers time...
The Rest API could also imitate a maxlag mechanism if we control and play on the request calls

Provide feedback

Saved searches

Use saved searches to filter your results more quickly