Genotype imputation, or simply imputation in the context of our database, estimates unobserved genotypes and fills in the missing genotypes in a given dataset. Our imputation service is designed to meet three kinds of imputation requests:
- achieving the best imputation results for Han population data, using a reference panel based on our NGS datasets of Han Chinese;
- carrying out classical imputation tasks with public reference panels of global populations;
- estimating and replacing the missing genotypes in users' own data.
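To make "missing genotypes" concrete, the sketch below counts how many genotype calls in a VCF file are missing, i.e. written with a `.` allele such as `./.`. It is only an illustration and not part of the service: the function name `missing_rate` is hypothetical, and the code assumes uncompressed, tab-delimited VCF lines whose FORMAT column contains a `GT` key.

```python
def missing_rate(vcf_lines):
    """Return the fraction of genotype calls that are missing (contain '.')."""
    total = 0
    missing = 0
    for line in vcf_lines:
        # Skip meta lines, the column header line, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 9 (index 8) is FORMAT; sample columns start at index 9.
        gt_index = fields[8].split(":").index("GT")
        for sample in fields[9:]:
            gt = sample.split(":")[gt_index]
            # Treat phased ('|') and unphased ('/') separators alike.
            alleles = gt.replace("|", "/").split("/")
            total += 1
            if "." in alleles:
                missing += 1
    return missing / total if total else 0.0
```

A dataset with a non-zero missing rate is the kind of input the third request type is meant for.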
Our imputation service is implemented with commonly used tools: SHAPEIT4, IMPUTE2, Minimac3, Beagle5, and PBWT (only Beagle4 and PBWT can impute genotypes without a reference panel). Five reference panels are available in our imputation service. Currently, imputation is limited to biallelic SNV data.
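Because only biallelic SNV data is accepted, it may help to filter a VCF file before submitting it. The following is a minimal sketch under stated assumptions, not the service's own preprocessing: the helper names `is_biallelic_snv` and `keep_biallelic_snvs` are hypothetical, the input is assumed to be uncompressed, tab-delimited VCF text, and a record is kept only when REF and ALT are each a single, different A/C/G/T base.

```python
BASES = {"A", "C", "G", "T"}

def is_biallelic_snv(ref, alt):
    """True when REF and ALT are single, different nucleotides.

    A multi-allelic ALT such as 'G,T', an indel such as 'AT', and symbolic
    or missing alleles ('*', '.', '<DEL>') all fail the membership test.
    """
    ref, alt = ref.upper(), alt.upper()
    return ref in BASES and alt in BASES and ref != alt

def keep_biallelic_snvs(vcf_lines):
    """Yield header lines unchanged and only biallelic SNV records."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        # Columns 4 and 5 (indices 3 and 4) are REF and ALT.
        if len(fields) >= 5 and is_biallelic_snv(fields[3], fields[4]):
            yield line
```

For example, `list(keep_biallelic_snvs(open("input.vcf")))` would return the header plus the retained records, where `input.vcf` is a placeholder file name. This sketch does not detect multi-allelic sites that were split into several biallelic records at the same position.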
Software and references:
Delaneau, O., Zagury, J.-F., Robinson, M.R., Marchini, J., and Dermitzakis, E. (2018). Integrative haplotype estimation with sub-linear complexity. BioRxiv 493403.
Howie, B.N., Donnelly, P., and Marchini, J. (2009). A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5.
Das, S., Forer, L., Schönherr, S., Sidore, C., Locke, A.E., Kwong, A., Vrieze, S.I., Chew, E.Y., Levy, S., McGue, M., et al. (2016). Next-generation genotype imputation service and methods. Nature Genetics 48, 1284–1287.
Browning, B.L., Zhou, Y., and Browning, S.R. (2018). A One-Penny Imputed Genome from Next-Generation Reference Panels. American Journal of Human Genetics 103, 338–348.
Durbin, R. (2014). Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics 30, 1266–1272.
Reference panels:
Reference panel from the SHAPEIT2 and IMPUTE2 websites, based on 1000 Genomes Phase 3 data (26 global populations, 2,504 individuals, 81,706,022 variants)
Reference panel based on the CONVERGE dataset, keeping only the sites that passed the filter recommended by the authors of the paper “11,670 whole genome sequences representative of the Han Chinese population from the CONVERGE project” (10,640 Han females, 5,814,870 variants)
Reference panel of the Haplotype Reference Consortium (HRC) release 1.1 (22,691 individuals for chromosome 1, 27,165 individuals for the other chromosomes, 39,131,578 variants)
Reference panel based on the Fermikit uniting variants of the SGDP dataset (“The Simons Genome Diversity Project: 300 genomes from 142 diverse populations”) from https://github.com/lh3/sgdp-fermi (263 individuals, 29,543,030 variants)
Reference panel of pure Han Chinese genomes, combining high-coverage WGS of 319 Han individuals, low-coverage WGS of 11,878 Han individuals, and 102,586 individuals with 8,056,973 variants genotyped or partially imputed