Team 2


Hilmar Lapp edited this page Mar 23, 2015 · 18 revisions

Streamlining VCF data flow

Members:

Thibaut Jombart (team lead)
Emmanuel Paradis
Klaus Schliep
Jerome Goudet

Goals

goals go here

Status

Day 2:

Reviewed what is available in R for working with VCF files. There is one package (PopGenome), but it is not very easy to use.
Feedback sought: how big will your data files be 5 years from now?
- my guess is 10⁶ loci, for hundreds to thousands of individuals (we are already at 10⁵ loci, and sequencing costs keep falling) [jerome]
Next, plan to optimize the genind code to reduce memory consumption.
Plan to interface hierfstat with adegenet and pegas by making use of the classes genind and loci.
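
To give a feel for why memory consumption matters at the data sizes guessed above, here is a rough back-of-the-envelope calculation in base R. It assumes a dense table with exactly one value per individual per locus, 10⁶ loci, and 1000 individuals (the upper end of the guess); the storage types compared are illustrative assumptions, not a description of how any of the packages store data. A table with one column per allele would need proportionally more.

```r
# Assumed data size: 10^6 loci for 10^3 individuals (upper end of the guess above).
n_loci <- 1e6
n_ind  <- 1e3

# Bytes needed per stored value for three base R storage types.
bytes_per_cell <- c(double = 8, integer = 4, raw = 1)

# Dense footprint in GiB, one value per individual per locus.
gib <- n_loci * n_ind * bytes_per_cell / 2^30
round(gib, 2)
#  double integer     raw
#    7.45    3.73    0.93
```

Under these assumptions, moving from doubles to integers halves the footprint, and a one-byte encoding cuts it by a factor of eight.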

Day 3:

Able to read VCF files from the 1000 Genomes Project in less than a minute (not including genotypes).
reworking package '5' to load data faster, including ploidy
now looking into how data can be moved faster and more seamlessly into hierfstat from adegenet
Can perhaps also look into supporting VCF files that lack individual-level data (such as those from the 1001 Arabidopsis genomes)?
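
As an illustration of what reading a VCF file "not including genotypes" can look like, here is a minimal sketch in base R. It is not the team's implementation: `read_vcf_sites` is a hypothetical helper written for this example, and the toy file is made up. It keeps only the eight fixed VCF columns (CHROM to INFO) and drops FORMAT and all per-individual columns, which also works for files that have no individual columns at all.

```r
# Toy VCF written to a temporary file so the example is self-contained.
vcf_lines <- c(
  "##fileformat=VCFv4.1",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2",
  "1\t100\trs1\tA\tG\t50\tPASS\t.\tGT\t0|1\t1|1",
  "1\t250\trs2\tC\tT\t99\tPASS\t.\tGT\t0|0\t0|1"
)
vcf_file <- tempfile(fileext = ".vcf")
writeLines(vcf_lines, vcf_file)

# Hypothetical helper: return the eight fixed VCF columns as a data frame.
read_vcf_sites <- function(file) {
  con <- file(file, "r")
  on.exit(close(con))

  # Count the "##" meta lines that precede the "#CHROM" column header.
  n_meta <- 0
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) stop("no #CHROM header line found")
    if (startsWith(line, "#CHROM")) break
    n_meta <- n_meta + 1
  }
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]

  # "NULL" in colClasses makes read.table discard a column,
  # so FORMAT and the genotype columns are never stored.
  col_classes <- c("character", "integer", rep("character", 6),
                   rep("NULL", length(fields) - 8))

  read.table(file, sep = "\t", skip = n_meta + 1, header = FALSE,
             comment.char = "", quote = "", colClasses = col_classes,
             col.names = sub("^#", "", fields), stringsAsFactors = FALSE)
}

sites <- read_vcf_sites(vcf_file)
dim(sites)  # 2 loci, 8 columns
```

The sketch only shows the idea of skipping genotype columns; a fast reader for files with millions of loci would also need to avoid parsing whole lines in R.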

Day 4:

Finalizing fast scanning and reading of VCF files. 1M loci in just a few seconds. Cleaning up code.
Simplified data structure in adegenet. May break some code that depended on earlier versions. Need help for testing. If you find problems, please file an issue on GitHub.
Added function for genetic distances in hierfstat. Also discovered a numeric type bug that is being fixed now. hierfstat is on GitHub now.

Products

New R packages:

apex: Extension of the R package ape for multiple genes

Updates to R packages:

adegenet: R Package for the Multivariate Analysis of Genetic Markers
hierfstat: Estimation and tests of hierarchical F-statistics
pegas: Population and Evolutionary Genetics Analysis System