Full outer join of very large files using low resources.
- Required
  - bash
  - GNU sort
  - gzip (zcat)
  - One of the following:
    - g++ version 8 with libboost
    - python3
- Optional
  - xz (compressing output file)
  - md5sum
  - R
  - R packages
    - doParallel
    - R.utils
    - dplyr
    - readr
This program is implemented in both Python 3 and C++, so you can choose whichever suits the resources available to you. The Python version takes about 65 percent more time per file, but it requires no extra tooling to build an executable from source.
By default, the program will use a binary executable to do the join operation, if it is available. Otherwise, it will fall back to the Python 3 script. To create the binary executable, run make in the root project directory.
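The fallback rule described above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not the code in full_join.sh; the function name `choose_join_command` and both path arguments are hypothetical.

```python
import os
import sys


def choose_join_command(binary_path, script_path):
    """Return the command prefix for the join step.

    Prefer the compiled executable when it exists and is executable;
    otherwise fall back to running the Python 3 script with the
    current interpreter. Both paths are hypothetical placeholders.
    """
    if os.path.isfile(binary_path) and os.access(binary_path, os.X_OK):
        return [binary_path]
    return [sys.executable, script_path]
```

The returned list can be extended with the input file names and passed to `subprocess.run`.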
On Ubuntu 18.04, you will likely need to install a few extra things:
sudo apt-get update
sudo apt-get install g++-8 libboost-dev
On earlier versions of Ubuntu, you first need to add the Ubuntu toolchain repository. Run this before the commands above:
sudo add-apt-repository ppa:ubuntu-toolchain-r/test -y
usage: ./full_join.sh [-h] [-p] [-d DIR] [-S BUFFER_SIZE] [-o OUT_FILE]
                      FILE [FILE ...]

Do a full outer join of tab-separated methylation files.

positional arguments:
  FILE            file(s) to be joined. These must be gz compressed.

required arguments:
  -d DIR          working directory (will be created if doesn't exist)
  -o OUT_FILE     file name to be output to

optional arguments:
  -h              show this help message and exit
  -p              do sorting operations using GNU parallel
  -S BUFFER_SIZE  buffer size allocated to sorting operation
NOTE: The working directory should be empty.
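To make the term concrete, here is a minimal Python sketch of a streaming full outer join over inputs that are already sorted by key. It is an illustration under stated assumptions, not the project's actual implementation: the `(key, value)` record layout, the function names, and the `NA` placeholder for missing values are all hypothetical. Because the inputs are sorted, the merge only holds one output row in memory at a time.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def _tag(stream, idx):
    """Attach the input's index to each (key, value) record."""
    for key, value in stream:
        yield key, idx, value


def full_outer_join(sorted_inputs, missing="NA"):
    """Yield (key, row) for every key present in at least one input.

    sorted_inputs: list of iterables of (key, value) pairs, each sorted
    by key with unique keys (an assumption of this sketch).
    row: one value per input, with `missing` where an input lacks the key.
    """
    n = len(sorted_inputs)
    # Lazily merge all tagged streams into one stream ordered by key.
    merged = heapq.merge(*(_tag(s, i) for i, s in enumerate(sorted_inputs)))
    # Consecutive records that share a key form one output row.
    for key, group in groupby(merged, key=itemgetter(0)):
        row = [missing] * n
        for _, idx, value in group:
            row[idx] = value
        yield key, row


if __name__ == "__main__":
    a = [("chr1:100", "0.5"), ("chr1:200", "0.9")]
    b = [("chr1:200", "0.1"), ("chr1:300", "0.7")]
    for key, row in full_outer_join([a, b]):
        print(key, *row, sep="\t")
```

In the demo, `chr1:200` appears in both inputs and gets both values, while `chr1:100` and `chr1:300` each get one value and one `NA`.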
The full dataset is too big to be produced in R, but small subsets of the data can be managed, so we can use random selection to verify results. This check is imperfect: because the keys are sampled from the output file itself, a key that is missing from the output can never be selected and would go undetected.
A random sample of lines from the output file (1000 by default) can be read in, and a dataset matching those lines can then be reproduced by reading in the source data. Use prod_check.r in the test/ folder to do this QC check.
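The idea behind this check can be sketched in Python as follows. This is not a port of prod_check.r; it is a rough illustration that assumes each record reduces to a `(key, value)` pair and that missing values are written as `NA`. The function names `sample_lines` and `verify_sample` are hypothetical.

```python
import random


def sample_lines(lines, k=1000, seed=None):
    """Reservoir-sample up to k items from an iterable in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir


def verify_sample(sampled_rows, sources, missing="NA"):
    """Return the sampled keys whose row disagrees with the source data.

    sampled_rows: dict mapping key -> list of values (one per source),
                  as read from the sampled output lines.
    sources: list of iterables of (key, value) pairs, one per input file.
    """
    # Rebuild the expected row for each sampled key from the sources.
    rebuilt = {key: [missing] * len(sources) for key in sampled_rows}
    for idx, source in enumerate(sources):
        for key, value in source:
            if key in rebuilt:
                rebuilt[key][idx] = value
    return [key for key, row in sampled_rows.items() if rebuilt[key] != row]
```

An empty list from `verify_sample` means every sampled row matched the sources. Keys that never made it into the output file cannot be sampled, so this check cannot detect them.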
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE43857
If you use this work to generate data for publication, please cite it. A possible citation is as follows.
Egeler, PW (2019). MethylMallet. GitHub repository: https://github.com/pegeler/MethylMallet. Commit <put hash here>.