Add other reference section names #138

Open
lizgzil opened this issue Mar 1, 2019 · 1 comment
@lizgzil

lizgzil commented Mar 1, 2019

We need to add the other names of sections that we want to scrape references from, based on my manual tagging of 123 policy documents.

I suggest adding "Endnotes".

We should also investigate why "Bibliography" and "References" sections aren't scraped in some cases. Is it something to do with the capitalisation or formatting of the section name, or are there multiple references sections?
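To make the capitalisation/formatting question concrete, here is a minimal sketch of a heading match that ignores case, leading numbering and trailing punctuation. This is not the scraper's actual code: `SECTION_NAMES`, `normalise_heading` and `is_reference_heading` are hypothetical names I made up for illustration, and the normalisation rules are assumptions about what might be going wrong.

```python
import re

# Hypothetical list of section names; "endnotes" is the addition suggested above.
SECTION_NAMES = {"references", "bibliography", "endnotes"}


def normalise_heading(text):
    """Lowercase a heading and strip numbering, trailing punctuation and extra spaces."""
    text = text.strip().lower()
    # Drop leading numbering such as "7." or "7.1 ".
    text = re.sub(r"^\d+(\.\d+)*\.?\s*", "", text)
    # Drop trailing punctuation such as ":" or ".".
    text = re.sub(r"[\s:.\-]+$", "", text)
    # Collapse runs of internal whitespace into a single space.
    return re.sub(r"\s+", " ", text)


def is_reference_heading(text):
    """Return True if the normalised heading is one of the known section names."""
    return normalise_heading(text) in SECTION_NAMES
```

With this sketch, "REFERENCES", " 7. Bibliography:" and "Endnotes" would all match, while "Introduction" would not. It does not address the case of a document with multiple references sections.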

You may find my manually tagged data set and notes interesting (they include URLs so you can look at the documents yourself):

This file gives the 13 examples where the references were not in a section called "References" but in a section with a different name. You might find the notes of interest.
doc_sample_20190218-1253 for issue.xlsx

This file gives the 11 examples where either a reference was scraped but I didn't think there was a references section, or a reference wasn't scraped but I did think there was one:
doc_sample_20190218-1253 for issue all mismatch.xlsx

@SamDepardieu
Contributor

Cool, I can do this next sprint!
The system is changing a bit because we are splitting the scraping process and the PDF parsing process into two different tasks, and next week I'm likely to work on the PDF parsing task.
Adding these is as simple as just writing them in a file 👍
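
As a rough illustration of the "names in a file" idea, here is a sketch that assumes a plain-text file with one section name per line and `#` for comments. That file format, the function name `load_section_names` and the example contents are all assumptions made for illustration, not the project's actual keyword file or loader.

```python
import io


def load_section_names(lines):
    """Read section names from an iterable of lines, one name per line.

    Blank lines and lines starting with "#" are skipped; names are
    lowercased and duplicates are dropped while keeping the original order.
    """
    names = []
    for line in lines:
        name = line.strip().lower()
        if name and not name.startswith("#") and name not in names:
            names.append(name)
    return names


# Hypothetical file contents; a real file would be opened with open(path).
example_file = io.StringIO(
    "references\nbibliography\n# added for this issue\nEndnotes\n"
)
section_names = load_section_names(example_file)
```

Under these assumptions, adding "Endnotes" is a one-line change to the file and needs no code change.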
