GWAS
Inputs
Data
For this workflow, both bed and bgen genotypes should be available. They will likely be the respective outputs of the Combining Datasets and Genotypes Imputation workflows, optionally merged using the Combining with UK Biobank workflow, depending on your project of interest.
GWAS Variables
The variables to be used for the GWAS are specified in a YAML file, for example:
group: ["ANCESTRY"]
phenotype: "SEVERE_COVID_19"
covariates: ["AGE", "SEX", "AGE_x_SEX", "AGE_x_AGE"]
where:
- A GWAS will be run independently for each group defined by group.
- The phenotype is the trait of interest; currently only SEVERE_COVID_19 is supported.
- The covariates are additional variables that enter the GWAS to improve precision.
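As a rough illustration of what these three entries mean, the sketch below partitions a handful of samples by the group columns and evaluates covariates by name. It is not the workflow's code: the sample records, the helper names `split_by_group` and `covariate_value`, and in particular the reading of `A_x_B` as the product of columns `A` and `B` are assumptions made for this example.

```python
# Illustrative sketch only; not the workflow's implementation.
from collections import defaultdict

# Same content as the YAML example, written as a Python dict.
variables = {
    "group": ["ANCESTRY"],
    "phenotype": "SEVERE_COVID_19",
    "covariates": ["AGE", "SEX", "AGE_x_SEX", "AGE_x_AGE"],
}


def split_by_group(samples, group_columns):
    """Partition sample records by their values in the group columns."""
    groups = defaultdict(list)
    for sample in samples:
        key = tuple(sample[column] for column in group_columns)
        groups[key].append(sample)
    return dict(groups)


def covariate_value(sample, name):
    """Evaluate a covariate by name.

    Assumption: a name of the form 'A_x_B' denotes the product of
    columns A and B; a plain name is read directly from the sample.
    """
    value = 1.0
    for column in name.split("_x_"):
        value *= sample[column]
    return value


# Hypothetical samples, invented for this example.
samples = [
    {"ANCESTRY": "EUR", "AGE": 50, "SEX": 1, "SEVERE_COVID_19": 0},
    {"ANCESTRY": "EUR", "AGE": 60, "SEX": 0, "SEVERE_COVID_19": 1},
    {"ANCESTRY": "AFR", "AGE": 40, "SEX": 1, "SEVERE_COVID_19": 1},
]

# One independent analysis would be run per key of this dict.
groups = split_by_group(samples, variables["group"])
```

With this grouping, each key of `groups` (here one per ancestry label) would correspond to one independent GWAS run on the listed phenotype and covariates.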
Workflow Parameters
- GENOTYPES_PREFIX: Prefix to plink.bed genotypes.
- BGEN_GENOTYPES_PREFIX: Prefix to imputed.bgen genotypes.
- COVARIATES: Path to covariate file (likely the output of Combining Datasets).
- N_PCS (default: 10): Number of principal components to compute.
- PCA_APPROX (default: true): Whether PCA is performed via approximation.
- MIN_GROUP_SIZE (default: 100): Minimum number of samples in a group to proceed to effect size estimation.
- VARIABLES_CONFIG (default: assets/variables.yaml): File containing the declaration of groups, phenotypes and covariates for the GWAS (see GWAS Variables).
- REGENIE_MAF (default: 0.01): Minor allele frequency for a variant to enter the GWAS.
- REGENIE_MAC (default: 10): Minor allele count for a variant to enter the GWAS.
- REGENIE_BSIZE (default: 1000): Regenie's block size (see the regenie docs).
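To make the three threshold parameters concrete, here is a minimal sketch of how the defaults could be read. It assumes biallelic variants, diploid samples, inclusive thresholds, and that both the frequency and the count threshold must be met; the function names are hypothetical and the actual filtering is done by the workflow's tools, not by this code.

```python
# Illustrative sketch only; thresholds mirror the documented defaults.

def minor_allele_stats(alt_allele_count, n_samples):
    """Return (MAC, MAF) for a biallelic variant in diploid samples."""
    total_alleles = 2 * n_samples
    # The minor allele is whichever of the two alleles is rarer.
    mac = min(alt_allele_count, total_alleles - alt_allele_count)
    return mac, mac / total_alleles


def variant_passes(alt_allele_count, n_samples, maf=0.01, mac=10):
    """Assumption: a variant must meet both REGENIE_MAF and REGENIE_MAC."""
    observed_mac, observed_maf = minor_allele_stats(alt_allele_count, n_samples)
    return observed_maf >= maf and observed_mac >= mac


def group_is_large_enough(n_samples, min_group_size=100):
    """Assumption: groups with at least MIN_GROUP_SIZE samples proceed."""
    return n_samples >= min_group_size
```

For example, under these assumptions a variant with 5 minor alleles among 200 samples has a frequency of 0.0125, which clears the frequency threshold, but it would still be excluded because its count is below 10.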
Running The Workflow
If the previous steps have been completed successfully, you can run:
./run.sh GWAS
Outputs
All outputs are produced in PUBLISH_DIR. The main outputs of the workflow are: