
New extractor requirement #700

Open · mubashar1199 opened this issue Jun 2, 2021 · 7 comments


@mubashar1199 (Contributor)

Hello,

I want to create a new extractor, but I am unable to understand the following:

1. I want to create a new output dataset file; just creating a new dataset in Dataset.scala is not working for me.

2. I want to iterate over all the RDF triples in the mappingbased-objects-uncleaned.ttl.bz2 file, perform some processing, and then generate new RDF triples in a newly created dataset file. This also needs to run last, after all other extraction is done.
In the gender extractor, the following comment appears:
// Even better: in the first extraction pass, extract all types. Use them in the second pass.
How can this multi-pass functionality be implemented?

Please tell me how I can perform the above operations.

Thanks

@jimkont (Member)

jimkont commented Jun 2, 2021

Hi @mubashar1199, typically an extractor is a Scala class that is run on every Wikipedia page and tries to extract specific information from that page. For example, the existing LabelExtractor extracts the page name, and the HomepageExtractor tries to detect the homepage of the person/organization that a Wikipedia page is about.

Each extractor writes the extracted triples into specific datasets. It is usually a 1-1 mapping, e.g. LabelExtractor -> label dataset, but some extractors that gather a lot of information may split the data across multiple datasets.

Are you trying to write a new extractor, or to post-process the existing datasets to form a new dataset?
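For orientation, a minimal extractor has roughly the shape sketched below. PageNodeExtractor, Quad and DBpediaDatasets are real framework names, but the exact trait and constructor signatures have changed across framework versions, so treat the details here as an approximation rather than verbatim API:

```scala
package org.dbpedia.extraction.mappings

import org.dbpedia.extraction.config.provenance.DBpediaDatasets
import org.dbpedia.extraction.transform.Quad
import org.dbpedia.extraction.util.Language
import org.dbpedia.extraction.wikiparser.PageNode

// Sketch of a page-level extractor: it is invoked once per Wikipedia page and
// returns the quads (triple + target dataset + provenance) it extracted.
// Signatures are simplified and may differ between framework versions.
class MyExtractor(context: { def language: Language }) extends PageNodeExtractor {

  // Declare the dataset(s) this extractor writes into.
  override val datasets = Set(DBpediaDatasets.Labels) // replace with your own dataset

  override def extract(page: PageNode, subjectUri: String): Seq[Quad] = {
    // Derive whatever information you need from the parsed page ...
    val label = page.title.decoded
    Seq(new Quad(context.language, DBpediaDatasets.Labels, subjectUri,
      "http://www.w3.org/2000/01/rdf-schema#label", label, page.sourceIri,
      "http://www.w3.org/2001/XMLSchema#string"))
  }
}
```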

@JJ-Author (Contributor)

This seems like a post-processing step. Check this out: http://dev.dbpedia.org/Post-Processing

@mubashar1199 (Contributor, Author)

> Are you trying to write a new extractor, or to post-process the existing datasets to form a new dataset?

Yes, I want to post-process the dataset to generate new triples, and either append these triples to an existing dataset or create a new dataset for the newly created triples. How can that be done using the extraction framework?

@mubashar1199 (Contributor, Author)

> This seems like a post-processing step. Check this out: http://dev.dbpedia.org/Post-Processing

OK, I will take a look.

@JJ-Author (Contributor)

> Yes, I want to post-process the dataset to generate new triples, and either append these triples to an existing dataset or create a new dataset for the newly created triples. How can that be done using the extraction framework?

The approach here is definitely to create a new "dataset". However, this post-processing does not necessarily have to be fully integrated into the extraction framework; it can also be derived from the MARVIN extraction on the Databus (https://databus.dbpedia.org/marvin/mappings/mappingbased-objects-uncleaned/). Please tell us what triples you would like to generate and what tools you are going to use (and any other external data dependencies); then @Vehnem can help you with how and where to integrate.
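If you take the dump-based route, the simplest way to iterate the triples is to stream the bzip2 file line by line; the DBpedia .ttl.bz2 dumps are serialized one triple per line. A minimal sketch, assuming Apache Commons Compress as the decompression dependency and naive N-Triples-style whitespace splitting (both are illustrative choices, not prescribed by this thread):

```scala
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream

object PostProcessDump {
  def main(args: Array[String]): Unit = {
    // Stream the compressed dump without unpacking it to disk.
    val in = new BZip2CompressorInputStream(
      new FileInputStream("mappingbased-objects-uncleaned.ttl.bz2"))
    val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
    try {
      Iterator.continually(reader.readLine()).takeWhile(_ != null)
        .filter(line => line.nonEmpty && !line.startsWith("#"))
        .foreach { line =>
          // Naive N-Triples split: subject, predicate, object (keeps trailing " .").
          val Array(s, p, o) = line.split(" ", 3)
          // ... apply your rules here and write the new triples out ...
        }
    } finally reader.close()
  }
}
```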

@mubashar1199 (Contributor, Author)

> Please tell us what triples you would like to generate and what tools you are going to use (and any other external data dependencies); then @Vehnem can help you with how and where to integrate.

I want to use Wikipedia infobox properties and, based on some predefined rules, infer new information from those properties and append it to the already existing dataset. I want the results to appear in the public SPARQL endpoint. Please tell me how and where to integrate it.
Thanks

@Vehnem added the "priority (issues to be discussed by the dev-team)" label and removed the "question" label on Sep 27, 2021
@kurzum self-assigned this on Sep 27, 2021
@Vehnem removed the "priority (issues to be discussed by the dev-team)" label on Sep 27, 2021
@kurzum (Member)

kurzum commented Sep 27, 2021

> This seems like a post-processing step. Check this out: http://dev.dbpedia.org/Post-Processing

@JJ-Author post-processing is pretty much the worst place to add anything. We discussed this a lot, and the plan is to implement post-processing via the Databus and thus remove it completely.

@mubashar1199 these are the insertion points for new data into DBpedia:

More info from Wikipedia

If you think there is information in Wikipedia that is not yet covered by the extraction:

  1. Fix or slightly extend an existing extractor (slightly, because major extensions might be better suited to a new extractor).
  2. Write a new extractor; in this case you need to add a new dataset and write Scala code, as @jimkont explained (see the sketch after this list).
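On the "add a new dataset" part (question 1 of the opening post): datasets are declared centrally in the framework's DBpediaDatasets object, roughly as below. In more recent framework versions a dataset may also need accompanying metadata before the extraction accepts it, which could explain why simply adding a line to Dataset.scala appeared not to work. This sketch shows the older, simpler pattern, and the dataset name is hypothetical:

```scala
// In DBpediaDatasets.scala (older framework versions; exact location varies):
// declare the new target dataset so extractors can reference it.
val InferredInfoboxFacts = new Dataset("inferred-infobox-facts") // hypothetical name
```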

Adding extensions based on the extracted data

This is very similar to post-processing, i.e. you work on one of the extracted datasets, such as the mappingbased extraction.
In this case it is simple: you use the Databus to read the dataset, process it, and write a new artifact to the Databus. We can then include it in the snapshot collection. Ideally, you wrap it in Docker (https://hub.docker.com/u/dbpedia) and then we could run it every three months. An example is LHD, which takes the abstracts and produces https://databus.dbpedia.org/propan/lhd, or SDTypes.
A note here: what kind of rules are you talking about? The mappings-based extraction is already a rule-based approach from dbr: to dbo:, so the rules might already be covered on mappings.dbpedia.org.
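To make the "predefined rules" question concrete: if the rules are simple structural inferences over already-extracted triples, each one can be expressed as a small function from a triple to derived triples, as in the hypothetical sketch below (the spouse-symmetry rule is purely illustrative and not from this thread):

```scala
// Hypothetical example of a predefined rule over extracted (s, p, o) triples:
// treat dbo:spouse as symmetric, so X dbo:spouse Y also yields Y dbo:spouse X.
val DboSpouse = "<http://dbpedia.org/ontology/spouse>"

def applyRules(triple: (String, String, String)): Seq[(String, String, String)] =
  triple match {
    case (s, DboSpouse, o) => Seq((o, DboSpouse, s)) // emit the inverse direction
    case _                 => Seq.empty              // no rule matched
  }
```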
