Using all_pairs() on a 14k dataset causes MemoryErrors #4
Comments
Quick and wholly unscientific benchmark: …

So I understand why running it on the full 14k dataset would cause MemoryErrors. Still, I really need to be able to run `all_pairs()` on the whole corpus.
Thanks for your issue! The exact all-pair search algorithm builds an in-memory data structure (posting lists), and the size can be as big as your original input. Is your input text corpus also in memory? What is the average size of the documents?
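For intuition, here is a minimal sketch (not the library's actual implementation) of the kind of posting-list index an exact all-pairs algorithm builds over its input; it shows why the index can grow to roughly the size of the original input, since every token occurrence in every set lands in some posting list:

```python
from collections import defaultdict

def build_posting_lists(sets):
    """Map each token to the ids of the sets that contain it."""
    index = defaultdict(list)
    for set_id, tokens in enumerate(sets):
        for token in set(tokens):
            index[token].append(set_id)
    return index

# Three tiny "documents" represented as sets of character shingles.
sets = [{"ab", "bc", "cd"}, {"bc", "cd", "de"}, {"cd", "de", "ef"}]
index = build_posting_lists(sets)

# The total number of posting-list entries equals the total number of
# token occurrences across all sets, i.e. the index can be on the
# order of the input size.
print(sum(len(ids) for ids in index.values()))  # -> 9
```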
You can also try the Go version, which is probably more efficient due to the programming language used.
Issue #5 is also relevant to your problem; you can take a look at my response there.
Hi, thanks for your answer!

No, my input text corpus is not in memory; the documents are all text files saved on disk (after conversion from a corpus of documents of various formats using Tika). But even if it were, that would not explain the runaway memory usage: after conversion to text, I truncate all the large documents down to about 100kB, so that's the maximum size of any document in my corpus prior to using SetSimilaritySearch. That means the corpus can never be larger than 1.4GB. As for the average size of the documents, after truncating the larger ones it's about 20kB, so in total the corpus shouldn't be more than about 300MB. The only two things I keep in memory are a list of file paths for all the files in the corpus and the files' character shingles, which I naturally need in order to run `all_pairs()`.

Unfortunately, I can't use the Go version: I'm not familiar with the language at all, and my set-up is Python-only.

If you have any idea where the runaway memory usage comes from, I would very much like to hear it.
Hi again,

Your reply made me think a little more, and I tried to assess the size of the character shingles rather than the corpus itself, since it's the character shingles that are kept in memory and used by `all_pairs()`. They turn out to be nowhere near large enough to explain the memory consumption, so something else must be going on that causes the runaway memory issue.
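One rough way to make that kind of estimate, sketched here under assumptions (the shingle size `k=5` and the synthetic ~20kB document are placeholders, and `sys.getsizeof` only gives ballpark figures):

```python
import sys

def shingles(text, k=5):
    """Return the set of character k-shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def approx_set_bytes(s):
    """Very rough in-memory size of a set of strings, in bytes."""
    return sys.getsizeof(s) + sum(sys.getsizeof(x) for x in s)

doc = "some document text, repeated as a stand-in " * 500  # roughly 20kB
sh = shingles(doc)
print(len(sh), "distinct shingles,", approx_set_bytes(sh), "bytes (approx.)")
```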
BTW, would it be possible to have …
Interesting. The likely memory bottleneck is in the function …

Because …
@tsela can you try the new branch …?
Hi @ekzhu,

Thanks for all your work! I think you've diagnosed the problem right: using the dataset I have at hand (which contains about 7k of the 14k documents in the corpus), both …

I'll try your recommendation of decreasing the shingle size, as I agree it should have a strong effect on the size of the intermediary structures, and I'll try your new branch afterwards. Unfortunately, I have to work on other projects right now, so I won't be able to get on it immediately. I'll get back to you as soon as I am able.
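For intuition on why decreasing the shingle size should shrink the intermediary structures: smaller k means fewer possible distinct shingles, so the global vocabulary that any index is built over is smaller. A quick, purely illustrative comparison on synthetic text:

```python
def shingles(text, k):
    """Return the set of character k-shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# Synthetic stand-in for a truncated text document.
text = "the quick brown fox jumps over the lazy dog " * 500

for k in (2, 3, 5, 9):
    print(f"k={k}: {len(shingles(text, k))} distinct shingles")
# Smaller k -> fewer distinct shingles per document and a much smaller
# global vocabulary across the corpus, so whatever intermediate
# structures are built over the shingles shrink as well.
```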
Hi,
I'm trying to use the all_pairs() function to find all the (near-)duplicates in a set of about 14,000 text documents (after first turning them into n-gram shingles). However, I'm running into MemoryErrors that crash my script, despite the virtual machine I work on having 16GB of RAM. I've checked, and indeed the entire RAM and swap get maxed out before the script stops working.
Do you have any advice on how to reduce RAM usage, or any indication of how much memory the algorithm uses? I don't run into any memory issues when I use LSH with datasketch, but I'd rather have an exact list of duplicates.
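For reference, a minimal sketch of this kind of setup, assuming the `all_pairs` interface from the SetSimilaritySearch Python package; the file list, shingle size and similarity threshold below are placeholders, not the actual values used:

```python
# Sketch only: paths, k and the threshold are illustrative placeholders.
from SetSimilaritySearch import all_pairs

def shingles(text, k=5):
    """Return the set of character k-shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

paths = ["doc1.txt", "doc2.txt", "doc3.txt"]  # hypothetical corpus
sets = [shingles(open(p, encoding="utf-8").read()) for p in paths]

# all_pairs yields (index_x, index_y, similarity) tuples for every
# pair of sets whose similarity reaches the threshold.
for x, y, sim in all_pairs(sets, similarity_func_name="jaccard",
                           similarity_threshold=0.9):
    print(paths[x], paths[y], sim)
```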