Support pedigreed populations #18
Comments
Leaving some notes here for myself: There needs to be a way to deal with errors in the pedigree, since greenhouse mixups, wayward pollen, and unexpected self-fertilization are so common. Maybe assign a prior probability that each connection in the pedigree is correct, then do a Bayesian comparison of the hypothesis that the connection is correct against the hypothesis that the individual is just a random member of the population. Alternatively, get a set of inter-individual distances using read depth ratios, and let the user interactively identify pedigree errors. For missing parents, we can simply add individuals with zero read depth.
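The Bayesian comparison suggested above could look roughly like the following. This is only a sketch in Python, not code from the package: the function names, the per-marker list layout, and the small floor that guards against `log(0)` are all assumptions made for illustration.

```python
import math


def marker_likelihood(genotype_prior, genotype_likelihood, floor=1e-12):
    """Likelihood of the reads at one marker, marginalized over genotypes.

    Both arguments are lists indexed by allele copy number. The floor is an
    assumption of this sketch, there to keep the logarithm finite.
    """
    total = sum(p * l for p, l in zip(genotype_prior, genotype_likelihood))
    return max(total, floor)


def pedigree_link_posterior(prior_correct, pedigree_priors, population_priors,
                            likelihoods):
    """Posterior probability that one pedigree connection is correct.

    pedigree_priors: per-marker genotype priors implied by the stated parents.
    population_priors: per-marker genotype priors for a random individual.
    likelihoods: per-marker genotype likelihoods from the individual's reads.
    """
    log_ped = sum(math.log(marker_likelihood(p, l))
                  for p, l in zip(pedigree_priors, likelihoods))
    log_pop = sum(math.log(marker_likelihood(p, l))
                  for p, l in zip(population_priors, likelihoods))
    log_odds = (math.log(prior_correct) - math.log(1.0 - prior_correct)
                + log_ped - log_pop)
    # Numerically stable logistic transform of the posterior log-odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)
```

For example, if the stated parents imply that a diploid offspring must be heterozygous at five markers but its reads favor a homozygote at all five, `pedigree_link_posterior(0.9, ...)` drops close to zero even with a 0.9 prior on the connection.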
All individuals start with uniform priors, then as information is added across the pedigree, priors get multiplied by the new information and normalized to sum to one.

The unit of analysis should be a single pair of parents and their offspring. Have a list that indicates the sample names for parents and offspring for each family. Then for each marker and each family, we need to jointly estimate the probability of both parent genotypes at the same time, using what we already know about parent and offspring genotypes.

For each ploidy combination, have a list already set up for every possible parental genotype combination, listing the possible progeny genotypes as well. The probability of a given genotype combination being the true one is the product of the probability of each parent being that genotype, and the probability of each offspring having a genotype that is possible under that cross (ignoring expected genotype frequencies, because we could have segregation distortion!).

Then that goes back to inform the priors of individuals; basically the probability of each genotype under each parental genotype combination, weighted by the probability of the parental genotype combination. So in essence
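One way to read the family-level update described above, as a minimal sketch in Python rather than the package's actual R/Rcpp code. Everything named here is hypothetical. The sketch assumes even ploidy, no double reduction, and that every progeny genotype possible under a cross is treated as equally likely (one reading of "ignoring expected genotype frequencies"). It also returns a single shared offspring prior and does not remove an offspring's own evidence before updating it.

```python
from itertools import product


def gamete_range(copies, ploidy):
    """Lowest and highest allele copy number a gamete can carry.

    Assumes even ploidy and no double reduction.
    """
    half = ploidy // 2
    return max(0, copies - half), min(copies, half)


def possible_progeny(g1, ploidy1, g2, ploidy2):
    """Set of progeny allele copy numbers possible from the cross g1 x g2."""
    lo1, hi1 = gamete_range(g1, ploidy1)
    lo2, hi2 = gamete_range(g2, ploidy2)
    return set(range(lo1 + lo2, hi1 + hi2 + 1))


def update_family(parent1, parent2, offspring):
    """One update for one marker in one family.

    parent1, parent2: current genotype probabilities, indexed by copy number.
    offspring: list of such probability lists, one per offspring.
    Returns updated parent1 probabilities, parent2 probabilities, and a
    shared genotype prior for the offspring.
    """
    ploidy1, ploidy2 = len(parent1) - 1, len(parent2) - 1
    weights = {}
    for g1, g2 in product(range(ploidy1 + 1), range(ploidy2 + 1)):
        allowed = possible_progeny(g1, ploidy1, g2, ploidy2)
        w = parent1[g1] * parent2[g2]
        for probs in offspring:
            # Probability that this offspring has any genotype the cross allows.
            w *= sum(probs[k] for k in allowed)
        weights[(g1, g2)] = w
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no parental genotype combination fits this family")

    new_p1 = [0.0] * (ploidy1 + 1)
    new_p2 = [0.0] * (ploidy2 + 1)
    new_off = [0.0] * (ploidy1 // 2 + ploidy2 // 2 + 1)
    for (g1, g2), w in weights.items():
        w /= total  # normalize so the joint probabilities sum to one
        new_p1[g1] += w
        new_p2[g2] += w
        allowed = possible_progeny(g1, ploidy1, g2, ploidy2)
        for k in allowed:
            new_off[k] += w / len(allowed)
    return new_p1, new_p2, new_off
```

With two diploid parents under uniform priors and three offspring known to carry 0, 1, and 2 copies, the only compatible cross is heterozygote by heterozygote, so both parents end up with all their probability on one copy. A `ValueError` here would be a natural place to flag a possible pedigree error.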
Perform as many iterations as the maximum number of generations separating any two individuals, or find some other way to make sure that grandparents are influenced by their grandchildren's genotypes, and so on.
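The iteration count suggested above could be derived from the pedigree itself. A minimal Python sketch, assuming the pedigree is stored as a hypothetical dictionary mapping each non-founder to a tuple of its parents' sample names; anyone not listed as a key is treated as a founder:

```python
def max_generation_span(pedigree):
    """Longest ancestor-to-descendant chain, counted in generations.

    pedigree: dict mapping each non-founder to a tuple of parent names.
    Raises ValueError if an individual is its own ancestor.
    """
    depth = {}
    in_progress = set()

    def visit(ind):
        if ind in depth:
            return depth[ind]
        if ind in in_progress:
            raise ValueError(f"pedigree loop involving {ind!r}")
        in_progress.add(ind)
        parents = [p for p in pedigree.get(ind, ()) if p is not None]
        # Founders sit at depth 0; everyone else is one below their deepest parent.
        d = 1 + max(visit(p) for p in parents) if parents else 0
        in_progress.discard(ind)
        depth[ind] = d
        return d

    return max((visit(ind) for ind in pedigree), default=0)
```

For a pedigree with an F1, a selfed F2, and a backcross of the F2 to one original parent, this gives three generations, so three passes over the families would let information from the backcross progeny reach the founders. The recursion is fine for typical breeding pedigrees; a very deep pedigree would need an iterative version.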
The internal Rcpp function
I'd like to make a new pipeline that uses pedigree information. Genotype estimates of parents and offspring will iteratively influence genotype priors of parents and offspring. Even for biparental populations, this could perform much better than the existing pipeline, which doesn't handle segregation distortion well.
I probably won't tackle this until I have added support for multiploid populations (Issue #17), in order to avoid having to rewrite a lot of code after the fact.
If you have a good test dataset for this sort of population, please contact me!