Low resource full outer join of gene methylation data

bnovotny/MethylMallet

 
 

MethylMallet

Full outer join of very large files using low resources.

System Requirements

  • Required
    • bash
    • GNU sort
    • gzip (zcat)
    • One of the following:
      • g++ version 8 with libboost
      • python3
  • Optional
    • xz (compressing output file)
    • md5sum
    • R
    • R packages
      • doParallel
      • R.utils
      • dplyr
      • readr

MethylMallet is implemented in both Python 3 and C++; choose whichever suits the resources available to you. The Python version takes about 65 percent longer per file, but it needs no extra tooling because there is no executable to build from source.

Setup

By default, the program uses a binary executable for the join operation if one is available; otherwise, it falls back to the Python 3 script. To build the binary executable, run make in the root project directory.
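The fallback rule described above can be sketched as follows. This is only an illustration of the selection logic, not the code in full_join.sh; the function name `choose_join_command` and the file names `join` and `join.py` are hypothetical placeholders, not the project's actual names.

```python
import os
import sys


def choose_join_command(project_dir, binary_name="join", script_name="join.py"):
    """Return the command prefix to use for the join step.

    binary_name and script_name are hypothetical placeholders; substitute
    the names the project actually uses.
    """
    binary = os.path.join(project_dir, binary_name)
    # Prefer the compiled executable when it exists and is executable.
    if os.path.isfile(binary) and os.access(binary, os.X_OK):
        return [binary]
    # Otherwise fall back to running the Python 3 script.
    return [sys.executable, os.path.join(project_dir, script_name)]
```

Checking for the binary at run time means nothing needs to be configured after running make: once the executable exists, it is picked up automatically.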

Tips on Ubuntu

On Ubuntu 18.04, you will likely need to install a few extra packages:

sudo apt-get update
sudo apt-get install g++-8 libboost-dev

On earlier versions of Ubuntu, you need to add the Ubuntu toolchain repository first. Do this before running the commands above:

sudo add-apt-repository ppa:ubuntu-toolchain-r/test -y

Usage

usage: ./full_join.sh [-h] [-p] [-d DIR] [-S BUFFER_SIZE] [-o OUT_FILE]
                      FILE [FILE ...]

Do a full outer join of tab-separated methylation files.

positional arguments:
  FILE            file(s) to be joined. These must be gz compressed.

required arguments:
  -d DIR          working directory (will be created if doesn't exist)
  -o OUT_FILE     file name to be output to

optional arguments:
  -h              show this help message and exit
  -p              do sorting operations using GNU parallel
  -S BUFFER_SIZE  buffer size allocated to sorting operation

NOTE: The working directory should be empty.
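To make the term concrete, here is a minimal sketch of a full outer join over inputs that are already sorted by key, which is one common way to join large files with little memory. It is not the project's implementation: the two-field `(key, value)` record layout, the `"NA"` placeholder for missing values, and the function names are all assumptions made for illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def _tag(table, idx):
    """Attach the index of the source table to each (key, value) record."""
    for key, value in table:
        yield key, idx, value


def full_outer_join(sorted_tables, missing="NA"):
    """Full outer join of tables that are each sorted ascending by key.

    Each table is an iterable of (key, value) pairs with unique keys.
    Yields (key, row), where row has one slot per table and `missing`
    fills the slots of tables that lack the key.
    """
    n = len(sorted_tables)
    streams = [_tag(table, idx) for idx, table in enumerate(sorted_tables)]
    # heapq.merge reads the sorted streams lazily, one record per stream
    # at a time, so memory use does not grow with file size.
    merged = heapq.merge(*streams, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        row = [missing] * n
        for _, idx, value in group:
            row[idx] = value
        yield key, row


# Two tiny sorted inputs with made-up keys and values.
a = [("chr1:100", "0.9"), ("chr1:200", "0.1")]
b = [("chr1:200", "0.4"), ("chr1:300", "0.7")]
joined = list(full_outer_join([a, b]))
```

Every key that appears in any input appears exactly once in the result, which is what distinguishes a full outer join from an inner join (keys present in all inputs only).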

Quality Control

The full dataset is too large to reproduce in R, but small subsets are manageable, so results can be verified by random sampling. The check is imperfect because the keys are sampled from the output file itself: a key that is missing from the output can never be selected.

The QC check reads a random sample of lines from the output file (1000 by default), then reproduces a dataset matching those lines from the source data. Use prod_check.r in the test/ folder to run it.
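The idea behind this check can be sketched in a few lines. This is not a port of prod_check.r; the `(key, value)` record layout, the `"NA"` placeholder, and the function names `sample_rows` and `rebuild_rows` are assumptions made for illustration.

```python
import random


def sample_rows(joined_rows, k=1000, seed=0):
    """Reservoir-sample up to k rows from an iterable of (key, row) pairs.

    The output is read once and never held in memory in full.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(joined_rows):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def rebuild_rows(sample_keys, source_tables, missing="NA"):
    """Recompute the joined rows for the sampled keys from the sources.

    Each source table is streamed once; only sampled keys are kept.
    """
    wanted = set(sample_keys)
    rows = {key: [missing] * len(source_tables) for key in wanted}
    for idx, table in enumerate(source_tables):
        for key, value in table:
            if key in wanted:
                rows[key][idx] = value
    return rows


# Made-up source tables and the joined output they should produce.
src_a = [("chr1:100", "0.9"), ("chr1:200", "0.1")]
src_b = [("chr1:200", "0.4"), ("chr1:300", "0.7")]
output = [
    ("chr1:100", ["0.9", "NA"]),
    ("chr1:200", ["0.1", "0.4"]),
    ("chr1:300", ["NA", "0.7"]),
]
sample = sample_rows(output, k=2)
expected = rebuild_rows([key for key, _ in sample], [src_a, src_b])
mismatches = [(key, row) for key, row in sample if expected[key] != row]
```

An empty `mismatches` list means every sampled row agrees with the sources. As the text notes, this cannot detect keys that are absent from the output altogether.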

Test Data

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE43857

Citation

If you use this work to generate data for publication, please cite it. A possible citation is as follows.

Egeler, PW (2019). MethylMallet. GitHub repository: https://github.com/pegeler/MethylMallet. Commit put hash here.
