Combining Datasets

This workflow combines the various genetic data sources available into a unified dataset.

Inputs

GenOMICC is an ongoing project in which data is collected continuously, so the data has arrived, and will continue to arrive, in different formats. This section describes the input data; the corresponding workflow parameters are described below.

Genetic Data

The pipeline requires genotyping array data and, optionally, whole-genome sequencing (WGS) data.

Genotyping Arrays

There are three filesets and three corresponding subfolders:

The r8 release: Corresponds to genotyping data generated before 2021. The genotyping chip was the Illumina GSA-MD-24v3-0_A1 and the associated genome build is GRCh37. The corresponding subfolder is usually named `wp5-gwas-r8-under90excl_2021Sep16`.
The 2021-2023 release: Corresponds to genotyping data generated between 2021 and 2023. It was also genotyped using the Illumina GSA-MD-24v3-0_A1 chip and the genome build is also GRCh37. The corresponding subfolder is usually named `2021092020231206QC_VFinal`.
The 2024-now release: Corresponds to the latest fileset. The Illumina GSA-MD-48v4-0_A1 chip was used and the genome build is GRCh38. The corresponding subfolder is usually named `2024060420240610QC_VFinal`.

Each subfolder contains a PLINK subfolder that holds the actual genotypes. The following table summarises the above:

Brief Description	Period Begin	Period End	Genotyping Array	Genome Build	Directory	Genotypes Prefix
Prehistoric r8 release	04/05/2020	30/08/2021	GSA-MD-24v3-0_A1	GRCh37	wp5-gwas-r8-under90excl_2021Sep16	PLINK_190921_0906/wp5-gwas-r8-under90excl_2021Sep16
Before 2024 microarray	20/09/2021	06/12/2023	GSA-MD-24v3-0_A1	GRCh37	2021092020231206QC_VFinal	PLINK0407240954/2021092020231206QC_VFinal
Since 2024 microarray	04/06/2024	10/06/2024	GSA-MD-48v4-0_A1	GRCh38	2024060420240610QC_VFinal	PLINK0407240114/2024060420240610QC_VFinal

So, in the example above, the R8_GENOTYPES workflow parameter (see below) should point to wp5-gwas-r8-under90excl_2021Sep16/PLINK_190921_0906/wp5-gwas-r8-under90excl_2021Sep16
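Before launching the workflow, it can help to confirm that a genotypes prefix actually resolves to files on disk. The sketch below assumes that each prefix points to a PLINK binary fileset (`.bed`, `.bim`, `.fam`), as is the case for the workflow outputs; the function name `check_plink_prefix` is hypothetical and not part of the pipeline.

```shell
# Check that a PLINK fileset prefix resolves to .bed, .bim and .fam files.
# Assumption: the input genotypes are stored in PLINK binary format.
check_plink_prefix() {
    plink_prefix="$1"
    plink_missing=0
    for ext in bed bim fam; do
        if [ ! -f "${plink_prefix}.${ext}" ]; then
            # Report every missing file on stderr rather than stopping at the first.
            echo "missing: ${plink_prefix}.${ext}" >&2
            plink_missing=1
        fi
    done
    # Return 0 when all three files exist, 1 otherwise.
    return "$plink_missing"
}
```

For example, `check_plink_prefix wp5-gwas-r8-under90excl_2021Sep16/PLINK_190921_0906/wp5-gwas-r8-under90excl_2021Sep16` returns a non-zero status and lists the missing files if the prefix is wrong.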

Whole Genome Sequencing

The WGS GVCF files are all located in a `wgs-reheadered` folder.

External Resources

As well as our in-house data, the pipeline depends on external reference data. In principle these files should already be present on ODAP and there is nothing you need to do.

The 1000 GP

All VCF files and indexes present in this FTP folder
The associated 1000 GP pedigree file.

These files should all be stored in a single folder, defined by the KGP_DIR Nextflow parameter (default: /mnt/odap-beegfs/software/gwas-resources/1000-genomes-HC).

GATK

The reference genome published by the Broad Institute.

This should be in the folder defined by the GATK_DIR Nextflow parameter (default: /mnt/odap-beegfs/software/gwas-resources/gatk).

Running The Workflow

If the previous steps have been completed successfully, you can run:

taskset -c 998 nextflow run main.nf -entry CombineGeneticDatasets -c run.config -profile odap -resume -with-report -with-trace

Outputs

All outputs are written to PUBLISH_DIR (default: results). The main outputs of the workflow are:

report.md: A report of the pipeline execution
genotypes.aggregated.qced.final.{bed,bim,fam}: The aggregated genotypes
covariates.inferred.csv: The covariates inferred from the genotypes (ancestry, PCs).
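A quick sanity check on the aggregated genotypes is to count samples and variants. In the PLINK binary format, the `.fam` file has one line per sample and the `.bim` file has one line per variant, so line counts are enough. The function name `summarise_plink_fileset` below is hypothetical; this is a minimal sketch, not something the pipeline provides.

```shell
# Print the number of samples and variants in a PLINK binary fileset.
# The .fam file has one line per sample; the .bim file has one line per variant.
summarise_plink_fileset() {
    fileset_prefix="$1"
    n_samples=$(wc -l < "${fileset_prefix}.fam")
    n_variants=$(wc -l < "${fileset_prefix}.bim")
    # Arithmetic expansion strips any padding that wc adds on some systems.
    echo "samples=$((n_samples)) variants=$((n_variants))"
}
```

With the default PUBLISH_DIR, this could be called as `summarise_plink_fileset results/genotypes.aggregated.qced.final`.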

Workflow Parameters

This is the list of all the pipeline's parameters. They can be set in the run.config file under the params section.

Input Files

These are project specific and need to be provided:

R8_GENOTYPES: Prefix to release r8 genotypes (before 2021).
BEFORE_2024_GENOTYPES: Prefix to genotypes released between 2021 and 2023.
SINCE_2024_GENOTYPES: Prefix to genotypes released since 2024.
WGS_GVCFS (optional): Prefix to whole genome sequencing files.
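As an illustration, the input parameters above could be written to a run.config file as follows. This is a sketch under stated assumptions: the function name `write_run_config` and the root folder argument are hypothetical, the subfolder names are copied from the table in the Genotyping Arrays section and should be adapted to the actual folder names on your system, and the form of the WGS_GVCFS value is a guess, which is why it is left commented out.

```shell
# Write an example Nextflow config. Every path is a placeholder to adapt.
write_run_config() {
    data_dir="$1"   # hypothetical root folder holding the three filesets
    out_file="$2"   # where to write the config, e.g. run.config
    cat > "$out_file" <<EOF
params {
    R8_GENOTYPES = "${data_dir}/wp5-gwas-r8-under90excl_2021Sep16/PLINK_190921_0906/wp5-gwas-r8-under90excl_2021Sep16"
    BEFORE_2024_GENOTYPES = "${data_dir}/2021092020231206QC_VFinal/PLINK0407240954/2021092020231206QC_VFinal"
    SINCE_2024_GENOTYPES = "${data_dir}/2024060420240610QC_VFinal/PLINK0407240114/2024060420240610QC_VFinal"
    // WGS_GVCFS is optional; the value below is an assumed form, adapt before use.
    // WGS_GVCFS = "${data_dir}/wgs-reheadered/"
}
EOF
}
```

Calling `write_run_config /path/to/data run.config` produces a file that can be passed to Nextflow with `-c run.config`.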

External Inputs Parameters

These are already set if you are using the odap profile.

RESOURCES_DIR (default: "./assets/resources"): Path to all external resources.
KGP_DIR: Path to the 1000 Genomes Project specific resources (see The 1000 GP).
GATK_DIR: Path to GATK specific resources (see GATK).
GRC37_TO_GRC38_CHAIN_FILE (default: "./assets/hg19ToHg38.over.chain.gz"): Path to the chain file used to lift over the GRCh37 genotypes to GRCh38.

QC Parameters

QC_GENOTYPE_MISSING_RATE (default: 0.02): Maximum missing rate per variant across all individuals. Variants above the threshold are dropped.
QC_INDIVIDUAL_MISSING_RATE (default: 0.02): Maximum missing rate per individual across genotypes. Individuals above the threshold are dropped.
QC_HWE_P (default: 1e-10): Hardy-Weinberg equilibrium p-value threshold, used to identify potential technical artifacts. Variants failing the test are dropped.
QC_HWE_K (default: 0.001): Used together with QC_HWE_P.
PCA_APPROX (default: true): Whether PCA is performed via approximation.
FILTER_HIGH_LOADINGS_VARIANTS (default: false): Whether to drop variants with high PCA loadings. If the loadings plot exhibits a high peak, you may want to enable this.
ANCESTRY_THRESHOLD (default: 0.8): For each individual, the most likely ancestry estimate should be greater than this threshold; otherwise the individual is marked as admixed.
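The ANCESTRY_THRESHOLD rule can be illustrated with a small awk sketch. It is not the pipeline's implementation: the input layout (a CSV with a header row, a sample identifier in the first column and one ancestry probability per remaining column), the `ADMIXED` label and the function name `label_ancestry` are all assumptions made for this example.

```shell
# Label each individual with its most likely ancestry, or ADMIXED when the
# highest estimate does not exceed the threshold.
# $1: CSV file (hypothetical layout: sample_id, then one column per ancestry)
# $2: threshold (defaults to 0.8, like ANCESTRY_THRESHOLD)
label_ancestry() {
    awk -F',' -v threshold="${2:-0.8}" '
        # Header row: remember the ancestry name of each column.
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
        {
            # Find the column holding the largest estimate.
            best = 2
            for (i = 3; i <= NF; i++) if ($i + 0 > $best + 0) best = i
            # Keep the label only if the estimate is strictly above the threshold.
            label = ($best + 0 > threshold + 0) ? name[best] : "ADMIXED"
            print $1 "," label
        }
    ' "$1"
}
```

With the default threshold, an individual whose estimates are 0.9 for one ancestry and 0.1 spread over the others keeps that ancestry label, while an individual split 0.5/0.4/0.1 is marked as admixed.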

Output Directories Parameters

PUBLISH_DIR (default: "results"): Top-level directory where outputs will be written.
KGP_PUBLISH_DIR (default: "results/kgp"): Where 1000 Genomes Project data will be output.
ARRAY_GENOTYPES_PUBLISH_DIR (default: "results/array_genotypes"): Where data associated with the genotyping arrays will be output.
WGS_PUBLISH_DIR (default: "results/wgs"): Where data associated with the whole-genome sequencing data will be output.
GATK_PUBLISH_DIR (default: "results/gatk"): Where data associated with GATK requirements will be output.
MERGED_PUBLISH_DIR (default: "results/merged"): Where the merged genetic data will be output.

Current Limitations

These are current limitations of the aggregation workflow:

Only chromosomes 1 to 22 are processed.
Only bi-allelic SNPs are used.