Hilmar Lapp edited this page Mar 23, 2015 · 18 revisions

Streamlining VCF data flow

Members:

  • Thibaut Jombart (team lead)
  • Emmanuel Paradis
  • Klaus Schliep
  • Jerome Goudet

Goals

  • goals go here

Status

Day 2:

  • Reviewed what is available in R for working with VCF files. There is one package (PopGenome), which is not very easy to use.
  • Feedback sought: how big will your data files be 5 years from now?
    • my guess is 10^6 loci, on hundreds to thousands of individuals (we are already at 10^5 loci, and sequencing costs keep falling) [jerome]
  • Next, plan to optimize the genind code to reduce memory consumption.
  • Plan to interface hierfstat with adegenet and pegas by making use of the classes genind and loci.
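
To make the memory concern concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a dense individuals-by-loci table with one value per genotype, at roughly a million loci and a thousand individuals. The function name and the storage widths are hypothetical choices for illustration; this is not a description of how genind actually stores data.

```python
def genotype_matrix_bytes(n_individuals, n_loci, bytes_per_value):
    """Size in bytes of a dense individuals x loci table (illustrative assumption)."""
    return n_individuals * n_loci * bytes_per_value

# Assumed scale: about a thousand individuals and a million loci.
n_individuals, n_loci = 1000, 10**6

# Hypothetical storage widths: 8-byte double, 4-byte integer, 1-byte code.
sizes = {
    "double (8 bytes)": genotype_matrix_bytes(n_individuals, n_loci, 8),
    "integer (4 bytes)": genotype_matrix_bytes(n_individuals, n_loci, 4),
    "byte (1 byte)": genotype_matrix_bytes(n_individuals, n_loci, 1),
}

for label, size in sizes.items():
    print(f"{label}: {size / 1e9:.0f} GB")
```

Under these assumptions the same table shrinks by a factor of eight when moving from doubles to single bytes, which is why the choice of storage type matters at this scale.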

Day 3:

  • Able to read a VCF file from the 1000 Genomes Project in less than a minute (not including genotypes).
  • Reworking package '5' to load data faster, including ploidy.
  • Now looking into how data can be moved faster and more seamlessly from adegenet into hierfstat.
  • Could perhaps also look into compatibility with VCF files that lack individuals (such as those from the 1001 Arabidopsis genomes project)?
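
As an illustration of why reading a VCF file without the genotypes can be fast, the following Python sketch parses only the fixed site columns and leaves the per-individual genotype fields unsplit. It is a minimal sketch and not the code written by the team, which works in R; the function name `read_vcf_sites` and the demo data are made up for this example. Because sample names are simply whatever follows the FORMAT column in the header, the same loop also accepts files with no individuals.

```python
import io

def read_vcf_sites(handle):
    """Return (sample_names, sites) from a VCF text stream, ignoring genotypes."""
    samples = []
    sites = []
    for line in handle:
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            header = line.rstrip("\n").split("\t")
            samples = header[9:]  # sample names follow the FORMAT column, if any
            continue
        # Split off the 8 fixed fields only; FORMAT and genotypes stay unparsed.
        fields = line.rstrip("\n").split("\t", 8)
        chrom, pos, variant_id, ref, alt = fields[:5]
        sites.append((chrom, int(pos), variant_id, ref, alt.split(",")))
    return samples, sites

# Tiny made-up example with two individuals and two sites.
demo = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2\n"
    "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n"
    "1\t250\t.\tC\tT,G\t.\tPASS\t.\tGT\t0|0\t1|2\n"
)
samples, sites = read_vcf_sites(io.StringIO(demo))
print(samples)
print(sites)
```

Limiting the split to the first eight tab-separated fields means the cost per line does not grow with the number of individuals beyond reading the line itself.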

Day 4:

  • Finalizing fast scanning and reading of VCF files. 1M loci in just a few seconds. Cleaning up code.
  • Simplified the data structure in adegenet. This may break code that depended on earlier versions. Help with testing is needed; if you find problems, please file an issue on GitHub.
  • Added a function for genetic distances in hierfstat. Also discovered a numeric type bug that is being fixed now. hierfstat is on GitHub now.

Products

New R packages:

  • apex: Extension of the R package ape for multiple genes

Updates to R packages:

  • adegenet: R Package for the Multivariate Analysis of Genetic Markers
  • hierfstat: Estimation and tests of hierarchical F-statistics
  • pegas: Population and Evolutionary Genetics Analysis System