About the output

Samuel Depardieu edited this page Jun 6, 2018 · 1 revision

The output file contains a number of different fields, which can vary depending on the scraped provider.

It will always have the following attributes, though:

title: a string containing the document title
uri: the URL of the document
pdf: the name of the PDF file
sections: a JSON object keyed by section name, containing the text extracted from each matching section
keywords: a JSON object keyed by keyword, containing the text extracted around each match
hash: an MD5 digest of the file
provider: the provider the file was downloaded from
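
As an illustration of the attributes listed above, here is a minimal Python sketch of what one output record might look like. The helper names (`build_record`, `missing_fields`), the sample values, and the exact shape of `sections` and `keywords` (a name mapped to extracted text) are assumptions made for this example, not the scraper's actual code.

```python
import hashlib
import json

# The attributes this page says are always present.
REQUIRED_FIELDS = {"title", "uri", "pdf", "sections", "keywords", "hash", "provider"}


def build_record(title, uri, pdf_name, pdf_bytes, sections, keywords, provider):
    """Assemble one record (hypothetical helper, for illustration only)."""
    return {
        "title": title,
        "uri": uri,
        "pdf": pdf_name,
        "sections": sections,   # assumed shape: section name -> extracted text
        "keywords": keywords,   # assumed shape: keyword -> extracted text
        "hash": hashlib.md5(pdf_bytes).hexdigest(),  # MD5 digest of the file contents
        "provider": provider,
    }


def missing_fields(record):
    """Return the always-present attributes that a record lacks."""
    return sorted(REQUIRED_FIELDS - set(record))


# Sample values are invented for the example.
record = build_record(
    title="Example guidance document",
    uri="https://example.org/docs/guidance.pdf",
    pdf_name="guidance.pdf",
    pdf_bytes=b"%PDF-1.4 example bytes",
    sections={"Recommendations": "Text found under the Recommendations heading."},
    keywords={"evidence": "Sentence in which the keyword 'evidence' appears."},
    provider="example_provider",
)

line = json.dumps(record)
print(line)
print(missing_fields(record))
```

A check like `missing_fields` can be handy when consuming the output, since provider-specific fields may come and go while these seven are expected to stay.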