Use less characters in text to be split #506

lizgzil · 2020-04-30T10:30:06Z

Description

This is a temporary fix for the version of the deep reference parser currently used by Reach, so that it no longer raises the error

Traceback (most recent call last):
  File "./extract_refs_task.py", line 104, in <module>
    extracter.execute()
  File "/opt/reach/hooks/sentry.py", line 21, in wrapped_f
    return f(*args, **kwargs)
  File "./extract_refs_task.py", line 56, in execute
    for split_references, parsed_references in refs:
  File "/opt/reach/refparse/refparse.py", line 183, in yield_structured_references
    doc.section
  File "/usr/local/lib/python3.6/site-packages/deep_reference_parser/split_section.py", line 78, in split
    doc = nlp(text)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 392, in __call__
    Errors.E088.format(length=len(text), max_length=self.max_length)
ValueError: [E088] Text of length 1154040 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

when trying to split very large reference sections into separate references.
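
Going by the commit message, the workaround passes only the first 1 million characters of a document to the splitter. The snippet below is a minimal sketch of that idea, not the actual Reach code: the names `MAX_SPLIT_CHARS` and `truncate_for_split` are hypothetical, and the limit is taken from the E088 message above.

```python
# Assumed limit, copied from the E088 message (spaCy's default nlp.max_length).
MAX_SPLIT_CHARS = 1_000_000


def truncate_for_split(text, max_chars=MAX_SPLIT_CHARS):
    """Return at most the first max_chars characters of text.

    Hypothetical helper: anything past the limit is dropped, so references
    that appear after that point are never split or parsed.
    """
    return text[:max_chars]


# A stand-in section with the same length as the text in the traceback.
section = "x" * 1_154_040
trimmed = truncate_for_split(section)
print(len(section), len(trimmed))
```

A text whose length equals the limit passes spaCy's check, since E088 is raised only when `len(text)` exceeds `nlp.max_length`.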

Next steps for a better fix:

Merge "Integrate multitask splitter+parser model" (#505)
Make this correction to the deep reference parser: "Increase maximum length in nlp" (deep_reference_parser#36)
Update deep reference parser release with above change
Remove this change from Reach
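
The longer-term fix named above is to raise the length limit inside the deep reference parser. One way that could look is sketched here, as an assumption rather than the change made in deep_reference_parser#36: `raise_max_length` is a hypothetical helper, and `fake_nlp` is a stand-in object so the example runs without spaCy installed. With a real pipeline, `nlp.max_length` is the attribute named in the E088 message.

```python
from types import SimpleNamespace


def raise_max_length(nlp, text):
    """Grow nlp.max_length so that text passes the length check.

    Hypothetical helper. Per the E088 message, this is only likely to be
    safe when the pipeline does not run the parser or NER components.
    """
    if len(text) > nlp.max_length:
        nlp.max_length = len(text)
    return nlp


# Stand-in for a spaCy Language object; only the max_length attribute is used.
fake_nlp = SimpleNamespace(max_length=1_000_000)
raise_max_length(fake_nlp, "x" * 1_154_040)
print(fake_nlp.max_length)
```

Unlike truncation, this keeps the whole reference section, at the cost of the extra memory the error message warns about.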

Assumptions:

It's rare for the reference section of a doc to have >1000000 characters, so this change shouldn't affect results too much.

Type of change

Please delete options that are not relevant.

🐛 Bug fix (Add Fix #(issue) to your PR)
✨ New feature
🔥 Breaking change
📝 Documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can run the tests. Please also list any relevant details for your test configuration:

Checklist:

My code follows the style guidelines of this project (pep8 AND pyflakes)
I have commented my code, particularly in hard-to-understand areas
If needed, I changed related parts of the documentation
I included tests in my PR
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules
If my PR aims to fix an issue, I referenced it using #(issue)

Only use first 1mil characters of a documents as text for split, to avoid the nlp max length warning

aaaa01a

lizgzil requested review from jdu and SamDepardieu April 30, 2020 10:30

SamDepardieu approved these changes Apr 30, 2020

lizgzil merged commit c30c0a9 into master Apr 30, 2020

lizgzil deleted the max-nlp-length-hack branch April 30, 2020 10:51

jdu pushed a commit that referenced this pull request Jun 2, 2020

Only use first 1mil characters of a documents as text for split, to avoid the nlp max length warning (#506)

efe8ab1
