Summary:

Genotype imputation (simply "imputation" in the context of our database) estimates unobserved genotypes and fills in the missing genotypes in a given dataset. Our imputation service is designed to meet three kinds of imputation request:
  • achieving the best imputation results for Han population data, using a reference panel based on our NGS datasets of Han Chinese;
  • carrying out classical imputation tasks with public reference panels of global populations;
  • estimating and replacing the missing genotypes in users' data.
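
To make "missing genotypes" concrete, the following minimal sketch counts the samples with a missing call in one VCF data line. It assumes tab-separated VCF input in which a missing allele is written as "." (for example "./." or ".|."). The function name missing_genotype_count and the example record are hypothetical and are not part of the service.

```python
import re


def missing_genotype_count(vcf_line: str) -> int:
    """Count samples whose GT field contains at least one missing allele ('.')."""
    fields = vcf_line.rstrip("\n").split("\t")
    # Columns 0-8 are CHROM..FORMAT; sample columns start at index 9.
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        raise ValueError("FORMAT column has no GT key")
    gt_index = format_keys.index("GT")

    missing = 0
    for sample in fields[9:]:
        parts = sample.split(":")
        # A truncated sample field is treated as a missing genotype.
        genotype = parts[gt_index] if gt_index < len(parts) else "."
        # Alleles are separated by '/' (unphased) or '|' (phased).
        alleles = re.split(r"[/|]", genotype)
        if "." in alleles:
            missing += 1
    return missing


# Hypothetical record with four samples, two of which have missing alleles.
example_line = "1\t1000\trs1\tA\tG\t.\tPASS\t.\tGT\t0/1\t./.\t1|1\t.|0"
print(missing_genotype_count(example_line))
```

Records with a high count are the ones for which imputation has the most to fill in.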

Our imputation service is implemented with commonly used tools: SHAPEIT4, IMPUTE2, Minimac3, Beagle5, and PBWT (only Beagle4 and PBWT can impute genotypes without a reference panel). Five reference panels are available in our imputation service. Currently, imputation is limited to biallelic SNV data.
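
Because imputation is limited to biallelic SNV data, it can help to check a dataset for other variant types before submitting it. The sketch below is one possible pre-check, not the filter used by the service. It assumes tab-separated VCF lines, and the helper names is_biallelic_snv and keep_biallelic_snvs are hypothetical. It tests each record on its own, so a multiallelic site that has been split across several records would still pass.

```python
from typing import Iterable, Iterator

BASES = {"A", "C", "G", "T"}


def is_biallelic_snv(ref: str, alt: str) -> bool:
    """True if REF and ALT are each a single, different nucleotide."""
    ref, alt = ref.upper(), alt.upper()
    # A comma-separated ALT (multiallelic), an indel, '.', or '*' fails here.
    return ref in BASES and alt in BASES and ref != alt


def keep_biallelic_snvs(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and only those records that are biallelic SNVs."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 is REF and column 4 is ALT.
        if is_biallelic_snv(fields[3], fields[4]):
            yield line


# Hypothetical records: an SNV, a multiallelic site, and an insertion.
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "1\t200\t.\tC\tT,G\t.\tPASS\t.",
    "1\t300\t.\tG\tGA\t.\tPASS\t.",
]
kept = list(keep_biallelic_snvs(records))
print(len(kept))
```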

Software and references:

SHAPEIT4
Delaneau, O., Zagury, J.-F., Robinson, M.R., Marchini, J., and Dermitzakis, E. (2018). Integrative haplotype estimation with sub-linear complexity. bioRxiv 493403.

IMPUTE2
Howie, B.N., Donnelly, P., and Marchini, J. (2009). A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5.

Minimac3
Das, S., Forer, L., Schönherr, S., Sidore, C., Locke, A.E., Kwong, A., Vrieze, S.I., Chew, E.Y., Levy, S., McGue, M., et al. (2016). Next-generation genotype imputation service and methods. Nature Genetics 48, 1284–1287.

Beagle5
Browning, B.L., Zhou, Y., and Browning, S.R. (2018). A One-Penny Imputed Genome from Next-Generation Reference Panels. American Journal of Human Genetics 103, 338–348.

PBWT
Durbin, R. (2014). Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics 30, 1266–1272.

Reference panels:

1KG
Reference panel obtained from the SHAPEIT2 and IMPUTE2 websites, based on 1000 Genomes Project Phase 3 data (26 global populations, 2,504 individuals, 81,706,022 variants).

CONVERGE
Reference panel based on the CONVERGE dataset, keeping only the sites that passed the filter recommended by the authors of the paper “11,670 whole genome sequences representative of the Han Chinese population from the CONVERGE project” (10,640 Han females, 5,814,870 variants).

HRC
Reference panel of the Haplotype Reference Consortium (HRC), release 1.1 (22,691 individuals for chromosome 1, 27,165 individuals for the other chromosomes, 39,131,578 variants).

SGDP
Reference panel based on FermiKit uniting variants of the SGDP dataset (“The Simons Genome Diversity Project: 300 genomes from 142 diverse populations”) from https://github.com/lh3/sgdp-fermi (263 individuals, 29,543,030 variants).

Han100K
Reference panel of pure Han Chinese genomes, combining high-coverage WGS of 319 Han individuals, low-coverage WGS of 11,878 Han individuals, and 102,586 individuals with 8,056,973 variants that were genotyped or partially imputed.