The genome-wide association study (GWAS) is an approach to find genetic variations associated with a particular trait or disease by scanning genome-wide genetic markers (typically SNPs) of many samples.
Here, we provide a platform for GWAS analysis, together with the largest control cohort from the Han Chinese population (Han100K). Users only need to provide genotype data in binary plink format, covariate files, and phenotype files.
The entire pipeline is conducted in three steps:
- Quality control. The genotype missing rate and the p-value of the Hardy-Weinberg equilibrium test are calculated for each site using Plink1.9, and population structure is analyzed using flashPCA. Rather than filtering sites and samples directly, the pipeline provides lists of the sites and samples flagged by these checks, and the user decides whether to apply the filtering.
- Association analysis. Associations between the phenotype(s) and genetic variants are tested using Plink1.9, adjusting for the covariates provided by the user together with the top 10 principal components.
- Visualization and report. The report includes both a Manhattan plot and a QQ plot. The association statistics for each site are written to a plain text file, which the user can download.
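The quality control and association steps above can be sketched as PLINK 1.9 command lines. This is a minimal illustration, not the platform's actual implementation: the function names, the file prefixes, and the `plink` executable name are hypothetical, and the choice of `--linear` versus `--logistic` is an assumption about how quantitative and case/control traits would be handled. The covariate file is assumed to already contain the user covariates merged with the top 10 principal components.

```python
import shlex


def build_qc_command(bfile, out_prefix, plink="plink"):
    """Per-site missing rate (--missing) and Hardy-Weinberg test (--hardy)."""
    return [plink, "--bfile", bfile, "--missing", "--hardy", "--out", out_prefix]


def build_assoc_command(bfile, pheno_file, covar_file, out_prefix,
                        binary_trait=False, plink="plink"):
    """Regression-based association test adjusted for covariates.

    Assumption: --logistic for case/control traits, --linear otherwise.
    The hide-covar modifier drops the covariate rows from the output.
    """
    model = "--logistic" if binary_trait else "--linear"
    return [plink, "--bfile", bfile,
            "--pheno", pheno_file,
            "--covar", covar_file,
            model, "hide-covar",
            "--out", out_prefix]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(shlex.join(build_qc_command("study", "study_qc")))
    print(shlex.join(build_assoc_command("study", "pheno.txt", "covar.txt",
                                         "study_assoc")))
```

The commands are built as argument lists so that they can be passed to `subprocess.run` without shell quoting issues once the inputs have been checked.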
File format for covariate and phenotype files:
- The first line contains “FID” and “IID”, followed by the covariate/phenotype names. “FID” and “IID” stand for “family ID” and “individual ID”, respectively.
- The remaining lines contain the family ID, individual ID, and covariate/phenotype values for each sample. The IDs of each sample should be the same as those in the “fam” file.
- Columns should be tab separated.
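The format rules above can be checked before upload with a short script. The following is a minimal sketch, assuming a hypothetical set of helper names (`read_sample_table`, `read_fam_ids`, `missing_from_fam`); it is not part of the platform. It verifies the header, the column count of every line, and that each sample ID pair also appears in the first two columns of the “fam” file.

```python
import csv


def read_sample_table(path):
    """Read a tab-separated covariate/phenotype file.

    Returns (names, rows), where names are the columns after FID and IID
    and rows maps (FID, IID) to the list of values for that sample.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        if header[:2] != ["FID", "IID"]:
            raise ValueError("header must start with FID and IID")
        rows = {}
        for line_no, fields in enumerate(reader, start=2):
            if len(fields) != len(header):
                raise ValueError(
                    f"line {line_no}: expected {len(header)} columns, "
                    f"found {len(fields)}"
                )
            rows[(fields[0], fields[1])] = fields[2:]
    return header[2:], rows


def read_fam_ids(path):
    """Collect (FID, IID) pairs from the first two columns of a fam file."""
    with open(path) as handle:
        return {tuple(line.split()[:2]) for line in handle if line.strip()}


def missing_from_fam(rows, fam_ids):
    """Return the sample IDs in the table that are absent from the fam file."""
    return sorted(key for key in rows if key not in fam_ids)
```

An empty list from `missing_from_fam` means every sample in the covariate or phenotype file has a matching entry in the “fam” file.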