
New extractor requirement #700

Open · mubashar1199 opened this issue Jun 2, 2021 · 7 comments


@mubashar1199 (Contributor)

Hello,

I want to create a new extractor, but I am unable to understand the following:

1. I want to create a new output dataset file; just creating a new dataset in Dataset.scala is not working for me.

2. I want to iterate over all the RDF triples in the mappingbased-objects-uncleaned.ttl.bz2 file, perform some processing, and then generate new RDF triples in a newly created dataset file. This also needs to run last, after all other extraction is done.
In the gender extractor, the following comment appears:
// Even better: in the first extraction pass, extract all types. Use them in the second pass.
How can this multi-pass functionality be implemented?

Please tell me how I can perform the above operations.

Thanks

@jimkont (Member)

jimkont commented Jun 2, 2021

Hi @mubashar1199, typically an extractor is a Scala class that is run on every Wikipedia page and tries to extract specific information from that page. For example, the existing LabelExtractor extracts the page name, and the HomepageExtractor tries to detect the homepage of the person/organization that a Wikipedia page is about.

Each extractor writes the extracted triples into specific datasets. It is usually a 1-1 mapping, e.g. LabelExtractor -> label dataset, but some extractors that gather a lot of information may split the data across multiple datasets.

Are you trying to write a new extractor, or to post-process the existing datasets to form a new dataset?
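For orientation, a minimal extractor has roughly the shape sketched below. PageNodeExtractor, Quad and DBpediaDatasets are real framework names, but the exact trait and constructor signatures have changed across framework versions, so treat the details here as an approximation rather than verbatim API:

```scala
package org.dbpedia.extraction.mappings

import org.dbpedia.extraction.config.provenance.DBpediaDatasets
import org.dbpedia.extraction.transform.Quad
import org.dbpedia.extraction.util.Language
import org.dbpedia.extraction.wikiparser.PageNode

// Sketch of a page-level extractor: it is invoked once per Wikipedia page and
// returns the quads (triple + target dataset + provenance) it extracted.
// Signatures are simplified and may differ between framework versions.
class MyExtractor(context: { def language: Language }) extends PageNodeExtractor {

  // Declare the dataset(s) this extractor writes into.
  override val datasets = Set(DBpediaDatasets.Labels) // replace with your own dataset

  override def extract(page: PageNode, subjectUri: String): Seq[Quad] = {
    // Derive whatever information you need from the parsed page ...
    val label = page.title.decoded
    Seq(new Quad(context.language, DBpediaDatasets.Labels, subjectUri,
      "http://www.w3.org/2000/01/rdf-schema#label", label, page.sourceIri,
      "http://www.w3.org/2001/XMLSchema#string"))
  }
}
```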

@JJ-Author (Contributor)

This seems like a post-processing step. Check this out: http://dev.dbpedia.org/Post-Processing

@mubashar1199 (Contributor, Author)

> Are you trying to write a new extractor, or to post-process the existing datasets to form a new dataset?

Yes, I want to post-process the dataset to generate new triples, and either append these triples to an existing dataset or create a new dataset for the newly created triples. How can that be done using the extraction framework?

@mubashar1199 (Contributor, Author)

> This seems like a post-processing step. Check this out: http://dev.dbpedia.org/Post-Processing

OK, I will take a look.

@JJ-Author (Contributor)

> Yes, I want to post-process the dataset to generate new triples, and either append these triples to an existing dataset or create a new dataset for the newly created triples. How can that be done using the extraction framework?

The approach here is definitely to create a new "dataset". However, this post-processing does not necessarily have to be fully integrated into the extraction framework; it can also be derived from the MARVIN extraction on the Databus (https://databus.dbpedia.org/marvin/mappings/mappingbased-objects-uncleaned/). Please tell us what triples you would like to generate and what tools you are going to use (and any other external data dependencies); then @Vehnem can help you with how and where to integrate.
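If you take the dump-based route, the simplest way to iterate the triples is to stream the bzip2 file line by line; the DBpedia .ttl.bz2 dumps are serialized one triple per line. A minimal sketch, assuming Apache Commons Compress as the decompression dependency and naive N-Triples-style whitespace splitting (both are illustrative choices, not prescribed by this thread):

```scala
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream

object PostProcessDump {
  def main(args: Array[String]): Unit = {
    // Stream the compressed dump without unpacking it to disk.
    val in = new BZip2CompressorInputStream(
      new FileInputStream("mappingbased-objects-uncleaned.ttl.bz2"))
    val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
    try {
      Iterator.continually(reader.readLine()).takeWhile(_ != null)
        .filter(line => line.nonEmpty && !line.startsWith("#"))
        .foreach { line =>
          // Naive N-Triples split: subject, predicate, object (keeps trailing " .").
          val Array(s, p, o) = line.split(" ", 3)
          // ... apply your rules here and write the new triples out ...
        }
    } finally reader.close()
  }
}
```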

@mubashar1199 (Contributor, Author)

> Please tell us what triples you would like to generate and what tools you are going to use (and any other external data dependencies); then @Vehnem can help you with how and where to integrate.

I want to use Wikipedia infobox properties and, based on some predefined rules, infer new information from those properties and append it to the already existing dataset. I want the results to appear in the public SPARQL endpoint. Please tell me how and where to integrate it.
Thanks

@Vehnem added the "priority (issues to be discussed by the dev-team)" label and removed the "question" label on Sep 27, 2021
@kurzum self-assigned this on Sep 27, 2021
@Vehnem removed the "priority (issues to be discussed by the dev-team)" label on Sep 27, 2021
@kurzum (Member)

kurzum commented Sep 27, 2021

> This seems like a post-processing step. Check this out: http://dev.dbpedia.org/Post-Processing

@JJ-Author post-processing is pretty much the worst place to add anything. We discussed this a lot, and the plan is to implement post-processing via the Databus and thus remove it completely.

@mubashar1199 these are the insertion points for new data into DBpedia:

More info from Wikipedia

If you think there is information in Wikipedia that is not yet covered by the extraction:

  1. Fix or slightly extend an existing extractor (slightly, because major extensions might be better suited to a new extractor).
  2. Write a new extractor; in this case you need to add a new dataset and write Scala code, as @jimkont explained (see the sketch after this list).
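On the "add a new dataset" part (question 1 of the opening post): datasets are declared centrally in the framework's DBpediaDatasets object, roughly as below. In more recent framework versions a dataset may also need accompanying metadata before the extraction accepts it, which could explain why simply adding a line to Dataset.scala appeared not to work. This sketch shows the older, simpler pattern, and the dataset name is hypothetical:

```scala
// In DBpediaDatasets.scala (older framework versions; exact location varies):
// declare the new target dataset so extractors can reference it.
val InferredInfoboxFacts = new Dataset("inferred-infobox-facts") // hypothetical name
```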

Adding extensions based on the extracted data

This is very similar to post-processing, i.e. you work on one of the extracted datasets, such as the mappingbased extraction.
In this case it is simple: you use the Databus to read the dataset, process it, and write a new artifact to the Databus. We can then include it in the snapshot collection. Ideally, you wrap it in Docker (https://hub.docker.com/u/dbpedia) and then we could run it every three months. An example is LHD, which takes the abstracts and produces https://databus.dbpedia.org/propan/lhd, or SDTypes.
A note here: what kind of rules are you talking about? The mappings-based extraction is already a rule-based approach from dbr: to dbo:, so the rules might already be covered on mappings.dbpedia.org.
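To make the "predefined rules" question concrete: if the rules are simple structural inferences over already-extracted triples, each one can be expressed as a small function from a triple to derived triples, as in the hypothetical sketch below (the spouse-symmetry rule is purely illustrative and not from this thread):

```scala
// Hypothetical example of a predefined rule over extracted (s, p, o) triples:
// treat dbo:spouse as symmetric, so X dbo:spouse Y also yields Y dbo:spouse X.
val DboSpouse = "<http://dbpedia.org/ontology/spouse>"

def applyRules(triple: (String, String, String)): Seq[(String, String, String)] =
  triple match {
    case (s, DboSpouse, o) => Seq((o, DboSpouse, s)) // emit the inverse direction
    case _                 => Seq.empty              // no rule matched
  }
```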
