
GSoC2022_Progress_Celian_RINGWALD


NEW DBPEDIA ABSTRACT EXTRACTOR

Description

DBpedia provides monthly releases produced by the DBpedia Extraction Framework. They are composed of various data artifacts that mainly stem from the wiki dumps. However, some of them also rely on API calls for rendering dynamic content, which is the case of the DBpedia abstracts. The large amount of data requested from the APIs currently cannot be extracted entirely within a month. We suggest solving this issue with a strategy composed of four steps:

  • a study based on the data recorded during the last abstract extraction
  • testing and implementing the use of the TextExtracts extension, and improving error management
  • reducing the number of API calls needed
  • integrating into the framework the possibility of calling more than one API

Each step of the project will be developed in a new dedicated GitHub branch of the DBpedia Extraction Framework, which can be documented and used for working on the project.
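
As a rough illustration of what a TextExtracts call looks like (the title, parameter set and the absence of headers here are illustrative, not the framework's actual request builder):

```scala
import scala.io.Source
import java.net.URLEncoder

// Minimal sketch of a TextExtracts call (action=query & prop=extracts).
// NB: Wikimedia may require a User-Agent header; see the notes on request
// parameters further down this page.
object TextExtractsExample {
  def fetchPlainAbstract(lang: String, title: String): String = {
    val url = s"https://$lang.wikipedia.org/w/api.php?action=query&prop=extracts" +
      "&exintro=1&explaintext=1&redirects=1&format=json" +
      "&titles=" + URLEncoder.encode(title, "UTF-8")
    val src = Source.fromURL(url, "UTF-8")
    try src.mkString finally src.close()
  }

  def main(args: Array[String]): Unit =
    println(fetchPlainAbstract("en", "Javier Bardem"))
}
```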

Mentors

  • Mykola Medynskyi
  • Marvin Hofer
  • Dimitris Kontokostas

Quick presentation

I am Célian Ringwald, research engineer in charge of the French DBpedia chapter at Inria, in the Wimmics team. My research topics are mainly related to NLP and Semantic Web questions; having access to the abstracts of Wikipedia (and more broadly to its textual content) through DBpedia is a very important milestone from my perspective.

My working fork and the associated branch

PROGRESS

Bonding period

Week 1 (June 19 - June 24)

  • Kick-off meeting: how to compare the results / metrics reports

    • Focus on the English and French chapters
    • Find a solution to avoid data traffic jams
    • Use Marvin for testing
  • Done during the week:

    • read the documentation on Mykola's and Marvin's first explorations of the problem 1 / 2
    • tested extraction with the Marvin framework

Summary of the Mykola/Marvin tests

  • Parallel benchmark: parallelizing seems useless due to rate limits

  • On the whole dataset for el, sh, ro, tr:

| extractor | parallel processes | time |
| --- | --- | --- |
| nif | 1 | 9h |
| plain | 1 | 3h40 |
| nif | 2 | 3h55 |
| plain | 2 | 3h |
| nif | 4 | 4h |
| plain | 4 | 3h38 |

=> Seems to be faster, but does it yield more data?

Week 2 (June 27 - July 01)

Week 3 (July 03 - July 11)

Docker MediaWiki

Abstract extraction test logs:

  • output the log line by line
  • count the number of FailedIOException / FailedOutOfMemoryError / FailedNullPointerException occurrences (see the sketch after this list)
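
A minimal sketch of this counting step, assuming a plain-text log file read line by line (the log path is illustrative):

```scala
import scala.io.Source

// Count how often each failure type appears in an extraction log.
object CountFailures {
  val failureTypes = Seq("FailedIOException", "FailedOutOfMemoryError", "FailedNullPointerException")

  def main(args: Array[String]): Unit = {
    val logPath = args.headOption.getOrElse("abstract-extraction.log") // illustrative path
    val src = Source.fromFile(logPath, "UTF-8")
    try {
      val counts = src.getLines().foldLeft(failureTypes.map(_ -> 0).toMap) { (acc, line) =>
        failureTypes.filter(t => line.contains(t))
          .foldLeft(acc)((m, t) => m.updated(t, m(t) + 1))
      }
      counts.foreach { case (t, n) => println(s"$t: $n") }
    } finally src.close()
  }
}
```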

Change of sampling strategy:

  • from the most viewed pages to the least viewed
  • results are the same > extraction fails after the first 10% of wiki pages

Opened a discussion with the Wikimedia community:

=> Can we have a view of the current rate limit? NO

=> Do the different MediaWiki API endpoints share the same limits? YES

=> Is the https://en.wikipedia.org/api/rest_v1/ API a better solution? YES (an example call is sketched below)
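
A sketch of a call to the REST API summary endpoint (the User-Agent value is illustrative, not the one used in the framework):

```scala
import java.net.{HttpURLConnection, URL, URLEncoder}
import scala.io.Source

// Fetch the page summary JSON from https://{lang}.wikipedia.org/api/rest_v1/.
object RestSummaryExample {
  def fetchSummary(lang: String, title: String): String = {
    val encoded = URLEncoder.encode(title.replace(' ', '_'), "UTF-8")
    val url = new URL(s"https://$lang.wikipedia.org/api/rest_v1/page/summary/$encoded")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestProperty("User-Agent", "DBpediaAbstractExtractor/0.1 (contact@example.org)") // illustrative
    conn.setRequestProperty("Accept", "application/json")
    val src = Source.fromInputStream(conn.getInputStream, "UTF-8")
    try src.mkString finally { src.close(); conn.disconnect() }
  }

  def main(args: Array[String]): Unit =
    println(fetchSummary("en", "Javier Bardem"))
}
```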

Solving the parsing error obtained during the big mini-dump extraction

=> javax.xml.stream.XMLStreamException: ParseError at [row,col]:[517851,1]" Message: expected <title>, found => it was caused by special characters in URIs

  • First tests on request parameters: using the retry-after / maxlag / User-Agent and gzip options (sketched below)
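
A minimal sketch of these request parameters together, assuming the API URL already carries a query string; the maxlag value, User-Agent string and retry count are illustrative:

```scala
import java.net.{HttpURLConnection, URL}
import java.util.zip.GZIPInputStream
import scala.io.Source

// Polite request: maxlag on the query string, User-Agent and gzip headers,
// and a pause based on Retry-After when the server throttles us.
object PoliteRequest {
  def get(apiUrl: String, maxRetries: Int = 3): String = {
    val conn = new URL(apiUrl + "&maxlag=5").openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestProperty("User-Agent", "DBpediaAbstractExtractor/0.1 (contact@example.org)")
    conn.setRequestProperty("Accept-Encoding", "gzip")
    val code = conn.getResponseCode
    if ((code == 429 || code == 503) && maxRetries > 0) {
      // Retry-After is assumed to be given in seconds here
      val wait = Option(conn.getHeaderField("Retry-After")).map(_.trim.toLong).getOrElse(5L)
      Thread.sleep(wait * 1000)
      get(apiUrl, maxRetries - 1)
    } else {
      val raw = conn.getInputStream
      val in = if ("gzip".equalsIgnoreCase(conn.getContentEncoding)) new GZIPInputStream(raw) else raw
      val src = Source.fromInputStream(in, "UTF-8")
      try src.mkString finally src.close()
    }
  }
}
```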

Week 4 (July 11 - July 17)

Done

  • Romanian MediaWiki clone:

    • handled a problem due to an error in the original-language parameter > it NEEDS TO BE "RO"
    • problem concerning the Campaigns extension
  • I added the two following parameters:

    • I needed to fix the universal config: maxlag / User-Agent
  • test of the new API

Week 5 (July 18 - July 22)

Test of the REST API

  • I created a script that calls the REST API on the main dataset I created, following the guidelines and more specifically the rate limit (a pacing sketch follows this list)
  • I recorded the results here
  • I aggregated them here: in a nutshell > we got almost everything back, but we received some disambiguation errors (the only type of error recorded) > problem related to the redirect-pages parameter?
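
A sketch of how such a script can pace its calls; the per-second cap is a parameter here (the actual value should come from the API guidelines), and the fetch function is assumed to exist:

```scala
// Space out successive calls so that we stay under a requests-per-second cap.
object ThrottledCalls {
  def callAll(titles: Seq[String], fetch: String => String, maxPerSecond: Int): Seq[String] =
    titles.map { title =>
      val start = System.nanoTime()
      val result = fetch(title)
      val minDelayMs = 1000L / maxPerSecond
      val elapsedMs = (System.nanoTime() - start) / 1000000L
      if (elapsedMs < minDelayMs) Thread.sleep(minDelayMs - elapsedMs) // keep below the limit
      result
    }
}
```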

LOG OF API ANSWERS

  • I created a dedicated log appender here; it logs, into a log file under the defined logdir path, JSON rows, one for each call (a minimal sketch of the idea is given below)
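
A minimal sketch of the idea behind this appender, one JSON row per API call; the field names, file name and lack of escaping are illustrative, not the appender's actual schema:

```scala
import java.io.{File, FileWriter, PrintWriter}
import java.time.Instant

// Append one JSON row per API call to a file under the configured log directory.
class ApiCallLog(logDir: String) {
  private val file = new File(logDir, "api-calls.jsonl") // illustrative file name

  def log(url: String, status: Int, millis: Long): Unit = {
    // Sketch only: values are not JSON-escaped here.
    val row = s"""{"time":"${Instant.now()}","url":"$url","status":$status,"ms":$millis}"""
    val out = new PrintWriter(new FileWriter(file, true)) // append mode
    try out.println(row) finally out.close()
  }
}
```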

Comparison of max-lag and user-agent parameters

  • First lesson learned: we need to send a User-Agent header parameter > not really good news, because depending on our calling process we can get banned.

ISSUES :

  • plain abstract problem: only part of the processed pages appear in the statistics for plain abstracts
    • reason: we cannot run the plain and HTML tests together without causing this...

MediaWiki experiments

  • Romanian wiki:

    • the page ids given in the dump are not the same as the ids in the original wiki
    • running the 1000-page sample froze my app
    • I got a 100% success rate on 100 pages BUT... I deleted the result by mistake (shame on me)
  • Some facts around the wiki clone:

    • The abstract is generally not affected by this, but: Wikipedia loads external data from Wikidata in infoboxes (here a Romanian example) and also in Wikibase models such as the Authority control one
    • As mentioned in this page: it is not possible to get this from a MediaWiki clone; to get it we would also have to mirror Wikidata...
  • Using the MediaWiki clone only as a parser:

    • only load modules (ns 828) and templates (ns 10) into the MediaWiki and use it for parsing wikicode!
    • for Romanian it took less than 2h; for French it seems to be less than a day (loading not finished)
    • find here some API answers
    • and here the request shape (sketched below)
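
A sketch of the request shape when the clone is used purely as a wikitext parser; the clone URL is illustrative:

```scala
import java.net.{HttpURLConnection, URL, URLEncoder}
import scala.io.Source

// Ask the local MediaWiki clone to render a piece of wikitext via action=parse.
object LocalParse {
  def parseWikitext(wikitext: String, cloneApi: String = "http://localhost:8080/api.php"): String = {
    val body = "action=parse&format=json&prop=text&contentmodel=wikitext" +
      "&text=" + URLEncoder.encode(wikitext, "UTF-8")
    val conn = new URL(cloneApi).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")                       // POST: wikitext can be long
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded")
    conn.getOutputStream.write(body.getBytes("UTF-8"))
    val src = Source.fromInputStream(conn.getInputStream, "UTF-8")
    try src.mkString finally { src.close(); conn.disconnect() }
  }
}
```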

Week 6 (July 25 - July 29)

Week 7 (August 1 - August 5)

Week 8 (August 8 - August 12)

NEW API IMPLEMENTATION

  • MediaWikiConnector3.scala
  • HTML answer parsing OK > readInAbstractHTML
  • but problems with the "redirect=true" parameter and OutputStreamWriter (a fetch sketch follows this list)
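  
A sketch of the HTML fetch behind this step, using the REST API /page/html endpoint with the redirect parameter that caused trouble; the User-Agent value is illustrative and this is not the code of MediaWikiConnector3.scala itself:

```scala
import java.net.{HttpURLConnection, URL, URLEncoder}
import scala.io.Source

// Fetch the rendered page HTML from the REST API, letting HTTP redirects be followed.
object RestHtmlExample {
  def fetchHtml(lang: String, title: String): String = {
    val encoded = URLEncoder.encode(title.replace(' ', '_'), "UTF-8")
    val url = new URL(s"https://$lang.wikipedia.org/api/rest_v1/page/html/$encoded?redirect=true")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setInstanceFollowRedirects(true)               // follow redirects returned for redirect pages
    conn.setRequestProperty("User-Agent", "DBpediaAbstractExtractor/0.1 (contact@example.org)")
    val src = Source.fromInputStream(conn.getInputStream, "UTF-8")
    try src.mkString finally { src.close(); conn.disconnect() }
  }
}
```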

Comparison of the HTML code answers

  • Midway report - HTML content part
  • Different structures of HTML
  • Links parsing problem solved
  • Test on the English 1000-page sample > only 988 abstracts OK, with parsing errors such as: http://dbpedia.org/resource/Javier_Bardem http://dbpedia.org/ontology/abstract "Javier Ángel Encinas Bardem ("},"2":{"wt":"lang"}},"i":0}}]}' id="mwDQ">Spanish: ; born 1 March 1969) is a Spanish actor. Known for his roles in and foreign films, he has received , including an , a , and a . Bardem won the for his performance as the assassin in the ' modern western drama film (2007). He also received critical acclaim for his roles in films such as (1992), (1995), (1997), (2002), and (2004). He has also starred in 's romantic drama (2008), 's spy film (2012), 's drama (2013), 's film (2017), 's mystery drama (2018) and 's science fiction drama (2021). Bardem's other Oscar-nominated performances include 's (2000), 's (2010), and 's (2021). He is the first Spanish actor to be nominated for an Academy Award ( for Before Night Falls in 2001), as well as the first and only Spanish actor to win one ( for in 2008). He is also the recipient of a , two , and six . In January 2018, Bardem became ambassador of for the protection of ."@en .
  • Adaptation of the HTMLNifExtractor / WikipediaNifExtractor / LinkExtractor (a clean-up sketch follows this list)
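
The JSON fragments visible in the Javier Bardem abstract above come from Parsoid's data-mw attributes. As a rough sketch of the kind of clean-up this implies (not the exact fix applied in the extractor):

```scala
import org.jsoup.Jsoup

// Drop Parsoid metadata attributes before extracting abstract text,
// so their JSON content cannot leak into the plain-text output.
object StripDataMw {
  def cleanAbstractHtml(html: String): String = {
    val doc = Jsoup.parse(html)
    doc.select("[data-mw]").removeAttr("data-mw")
    doc.select("[data-parsoid]").removeAttr("data-parsoid")
    doc.body().html()
  }
}
```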

Week 9 (August 15 - August 19)

Week 10 (August 22 - August 26)

Week 11 (September 4 - September 9)

Week 12 (September 12 - September 15)

  • Correction of parsing errors due to data-mw parsing with Jsoup: HTML entities were parsed too early, which broke the single quotes inside the data-mw attributes. I first fixed it with an unstable regular expression, and I finally found a way to fix it simply by deleting an html element placed in the wrong place...
  • Correction of another parsing error in the plain abstract extraction process: in some cases the ?> header is mysteriously returned without the first characters...
  • Test of every function on the RO, FR and EN languages
  • Cleaning of the code and final pull request
  • Finishing the report writing

What I have done

  • A benchmark of the different possible API configurations
  • I created a MediaWiki clone to provide a local API
  • I adapted the "old API" connector to the guidelines, allowing us to avoid the rate-limit problems we had at the beginning of the project
  • I also added to the "old API" connector a better retry-after and an incremental maxlag mechanism
  • I implemented a way to use the MediaWiki REST API in the DIEF and solved the different problems related to the new structure of its answers

Possible extensions of the current work

  • Concerning the "old" Wikimedia API, I didn't implement a solution using the generator (cf. the rate-limit paragraph of the guidelines) because of the DIEF process: each Wikipedia article is processed one by one to enable parallelization.
  • I didn't work at all on a smart update strategy as exposed in the proposal. It is still possible to think of a solution taking the last release into account.
  • For the moment the choice of a given API is made in the configuration; we could imagine dynamically calling one API depending on the answer times...
  • The REST API could also imitate a maxlag mechanism if we control and pace the request calls (sketched below)
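
A sketch of this last idea, widening the delay between calls when answers get slow, as a maxlag-like back-off; the thresholds, step sizes and the fetch function are illustrative:

```scala
// Adapt the pause between successive calls to the observed answer time.
object AdaptivePacing {
  def callWithBackoff(urls: Seq[String], fetch: String => String): Seq[String] = {
    var delayMs = 0L
    urls.map { url =>
      if (delayMs > 0) Thread.sleep(delayMs)
      val start = System.currentTimeMillis()
      val result = fetch(url)
      val elapsed = System.currentTimeMillis() - start
      delayMs =
        if (elapsed > 1000) math.min(delayMs + 500, 5000) // slow answer: back off
        else math.max(delayMs - 100, 0)                   // fast answer: speed up again
      result
    }
  }
}
```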