This repository has been archived by the owner on Feb 4, 2022. It is now read-only.

Use less characters in text to be split #506

Merged 1 commit on Apr 30, 2020

@lizgzil (Contributor) commented Apr 30, 2020

Description

This is a temporary fix for the current version of the deep reference parser used by Reach, to avoid the following error:

Traceback (most recent call last):
  File "./extract_refs_task.py", line 104, in <module>
    extracter.execute()
  File "/opt/reach/hooks/sentry.py", line 21, in wrapped_f
    return f(*args, **kwargs)
  File "./extract_refs_task.py", line 56, in execute
    for split_references, parsed_references in refs:
  File "/opt/reach/refparse/refparse.py", line 183, in yield_structured_references
    doc.section
  File "/usr/local/lib/python3.6/site-packages/deep_reference_parser/split_section.py", line 78, in split
    doc = nlp(text)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 392, in __call__
    Errors.E088.format(length=len(text), max_length=self.max_length)
ValueError: [E088] Text of length 1154040 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

The error is raised when trying to split very large reference sections into separate references.

Next steps for a better fix:

Assumptions:

  • It's rare for the reference section of a doc to have more than 1,000,000 characters, so this change shouldn't affect results much.
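
For illustration only, here is a minimal sketch of the kind of workaround described above: shortening the text before it reaches the splitter so it stays under spaCy's default limit. This is not the code in this PR. The function name `truncate_for_splitting`, the constant `MAX_CHARS`, and the choice to cut at the last newline are all assumptions made for the example.

```python
# Hypothetical limit, taken from the default quoted in the E088 message.
MAX_CHARS = 1_000_000


def truncate_for_splitting(text: str, max_chars: int = MAX_CHARS) -> str:
    """Return `text` cut down to at most `max_chars` characters.

    Assumption for this sketch: cutting at the last newline before the
    limit is preferable, so that a reference is not chopped mid-line.
    If no usable newline exists, fall back to a hard cut.
    """
    if len(text) <= max_chars:
        return text
    # Search for a newline only within the first `max_chars` characters.
    cut = text.rfind("\n", 0, max_chars)
    return text[:cut] if cut > 0 else text[:max_chars]


if __name__ == "__main__":
    section = "ref one\nref two\nref three"
    # With a 12-character budget, the text is cut at the first line break.
    print(repr(truncate_for_splitting(section, 12)))
```

Anything past the cut is simply not split into references, which is why the assumption above about section length matters.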

Type of change

Please delete options that are not relevant.

  • 🐛 Bug fix (Add Fix #(issue) to your PR)
  • ✨ New feature
  • 🔥 Breaking change
  • 📝 Documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can run the tests. Please also list any relevant details for your test configuration:

Checklist:

  • My code follows the style guidelines of this project (pep8 AND pyflakes)
  • I have commented my code, particularly in hard-to-understand areas
  • If needed, I changed related parts of the documentation
  • I included tests in my PR
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • If my PR aims to fix an issue, I referenced it using #(issue)

@lizgzil requested review from jdu and SamDepardieu April 30, 2020 10:30
@lizgzil merged commit c30c0a9 into master Apr 30, 2020
@lizgzil deleted the max-nlp-length-hack branch April 30, 2020 10:51
jdu pushed a commit that referenced this pull request Jun 2, 2020