Please consider adding multi-thread support #60
Comments
Thank you for your interest in VAtools. All of the tools in this toolkit are just IO for the most part so shouldn't take very long. Is there a specific tool that has been running slow for you?
Hi,
ah ok. I've definitely never run it on that large of a file. I'll have a look to see how things can be improved.
I am encountering the same issue. A VCF of 11GB has been running for over 12 hours now.
How big is your TSV? That file is being read into memory so you'll want to make sure that you have at least that amount of memory available. Your process is probably stuck swapping memory and not actually doing any/much work.
The TSV is only 3MB, the GTF around 300MB. It seems the CPU is maxed out, as is the memory (64GB total). The exact command that I am using is I am using the Docker container.
That's strange. I'm not sure why you are seeing multiple processes either. Do you see the same behavior when you run the two steps as separate commands? The GTF parsing library we are using relies on pandas under the hood, which, unfortunately, can use up a lot of memory (more than expected) because of the way it stores some data. Would you mind sending me your GTF file so I can play around with it?
When using the && separator these commands are executed separately. I will email you these files.
Ok, this is definitely not an issue with the GTF file. It's able to read it in just fine but you have over 4.5 million VCF entries so processing just takes a while. I'm not sure if there is a good programmatic way to fix this, tbh, while still preserving the ordering of the VCF. You could try manually splitting the VCF into smaller subsets before running them through the annotator.
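For anyone who wants to try the manual split suggested above, here is a minimal sketch in Python. It is not part of VAtools; the function name `split_vcf` and the `records_per_chunk` parameter are made up for illustration. It assumes an uncompressed VCF whose header lines (those starting with `#`) all come before the first record, and it copies that header into every chunk so each subset is a valid VCF on its own.

```python
from itertools import chain, islice
from typing import Iterable, Iterator, List


def split_vcf(lines: Iterable[str], records_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of VCF lines, each list starting with a copy of the header.

    Hypothetical helper, not a VAtools function. Records keep their
    original order within and across chunks.
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")

    it = iter(lines)
    header: List[str] = []
    first_record = None

    # Collect header lines until the first record line is seen.
    for line in it:
        if line.startswith("#"):
            header.append(line)
        else:
            first_record = line
            break

    # A header-only file yields no chunks.
    if first_record is None:
        return

    records = chain([first_record], it)
    while True:
        chunk = list(islice(records, records_per_chunk))
        if not chunk:
            break
        yield header + chunk


if __name__ == "__main__":
    demo = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\n",
        "1\t10\t.\tA\tT\n",
        "1\t20\t.\tG\tC\n",
        "2\t5\t.\tT\tA\n",
    ]
    for i, chunk in enumerate(split_vcf(demo, 2)):
        print(f"chunk {i}: {len(chunk)} lines")
```

Each chunk can be written to its own file and annotated separately; concatenating the annotated records in chunk order (keeping only one header) restores the original ordering.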
Ah thank you.
Being able to scatter/gather the work over multiple CPU cores would really help speed up your script.
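To make the request concrete, here is a rough sketch of what an order-preserving scatter/gather could look like in Python. Nothing here is VAtools code: `scatter_gather` and `fake_annotate` are hypothetical names, and `fake_annotate` is only a stand-in for whatever per-record work the real annotator does. The sketch relies on `Executor.map` from `concurrent.futures`, which returns results in the order the inputs were submitted, so the output keeps the VCF ordering even if chunks finish out of order.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, List, Sequence


def fake_annotate(records: Sequence[str]) -> List[str]:
    """Hypothetical stand-in for the real annotation step."""
    return [r.rstrip("\n") + "\tANNOTATED\n" for r in records]


def scatter_gather(
    records: Sequence[str],
    worker: Callable[[Sequence[str]], List[str]],
    chunk_size: int,
    executor: Executor,
) -> List[str]:
    """Split records into chunks, process them in parallel, merge in order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    # Scatter: contiguous slices, so chunk order equals record order.
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

    # Gather: Executor.map yields results in submission order.
    merged: List[str] = []
    for result in executor.map(worker, chunks):
        merged.extend(result)
    return merged


if __name__ == "__main__":
    records = [f"1\t{pos}\t.\tA\tT\n" for pos in range(1, 11)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        annotated = scatter_gather(records, fake_annotate, 3, pool)
    print("".join(annotated), end="")
```

The demo uses a thread pool only to keep the example simple. For CPU-bound annotation in CPython, passing a `ProcessPoolExecutor` instead would likely be needed to use multiple cores, with the worker defined at module level so it can be pickled.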