Combining Datasets
This workflow combines the various genetic data sources available into a unified dataset.
Since GenOMICC is an ongoing project in which data is collected continuously, the data has arrived, and will continue to arrive, in different formats. This section describes the inputs to the pipeline.
Genetic Data
The pipeline requires genotyping arrays and, optionally, whole-genome sequencing (WGS) data.
Genotyping Arrays
There are three filesets and three corresponding subfolders:
- The r8 release: Corresponds to genotyping data generated before 2021. The genotyping chip was the Illumina GSA-MD-24v3-0_A1 and the associated genome build is GRCh37. The corresponding subfolder is usually named `wp5-gwas-r8-under90excl_2021Sep16`.
- The 2021-2023 release: Corresponds to genotyping data generated between 2021 and 2023. It was also genotyped using the Illumina GSA-MD-24v3-0_A1 chip and the genome build is also GRCh37. The corresponding subfolder is usually named `2021092020231206QC_VFinal`.
- The 2024-now release: Corresponds to the latest fileset. The Illumina GSA-MD-48v4-0_A1 chip was used and the genome build is GRCh38. The corresponding subfolder is usually named `2024060420240610QC_VFinal`.
Each subfolder contains a PLINK subfolder which holds the actual genotypes. The following table summarises the above:
| Brief Description | Period Begin | Period End | Genotyping Array | Genome Build | Directory | Genotypes Prefix |
| --- | --- | --- | --- | --- | --- | --- |
| Prehistoric r8 release | 04/05/2020 | 30/08/2021 | GSA-MD-24v3-0_A1 | GRCh37 | wp5-gwas-r8-under90excl_2021Sep16 | PLINK_190921_0906/wp5-gwas-r8-under90excl_2021Sep16 |
| Before 2024 microarray | 20/09/2021 | 06/12/2023 | GSA-MD-24v3-0_A1 | GRCh37 | 2021092020231206QC_VFinal | PLINK0407240954/2021092020231206QC_VFinal |
| Since 2024 microarray | 04/06/2024 | 10/06/2024 | GSA-MD-48v4-0_A1 | GRCh38 | 2024060420240610QC_VFinal | PLINK0407240114/2024060420240610QC_VFinal |
So, in the example above, the `R8_GENOTYPES` workflow parameter (see below) should point to `wp5-gwas-r8-under90excl_2021Sep16/PLINK_190921_0906/wp5-gwas-r8-under90excl_2021Sep16`.
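The prefix is the table's "Directory" column joined with its "Genotypes Prefix" column. As a minimal sketch (not part of the pipeline), the following Python builds such a prefix and reports which PLINK files are missing for it. The helper names and the `/path/to/data` root are hypothetical, and the sketch assumes a PLINK binary fileset made of `.bed`, `.bim` and `.fam` files.

```python
from pathlib import Path

# Assumption: a PLINK binary fileset is made of these three files.
PLINK_EXTENSIONS = (".bed", ".bim", ".fam")


def genotypes_prefix(data_root, directory, prefix_in_directory):
    """Join the 'Directory' and 'Genotypes Prefix' columns into a full prefix."""
    return Path(data_root) / directory / prefix_in_directory


def missing_plink_files(prefix):
    """Return the files expected for `prefix` that do not exist on disk."""
    prefix = str(prefix)
    return [prefix + ext for ext in PLINK_EXTENSIONS if not Path(prefix + ext).exists()]


if __name__ == "__main__":
    prefix = genotypes_prefix(
        "/path/to/data",  # hypothetical root folder
        "wp5-gwas-r8-under90excl_2021Sep16",
        "PLINK_190921_0906/wp5-gwas-r8-under90excl_2021Sep16",
    )
    for path in missing_plink_files(prefix):
        print("missing:", path)
```

An empty list means all three files were found, so the prefix is a plausible value for `R8_GENOTYPES`.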
Whole Genome Sequencing
The WGS GVCF files are all located in a `wgs-reheadered` folder.
External Resources
As well as our in-house data, the pipeline depends on external reference data. In principle these files should already be present on ODAP and there is nothing you need to do.
The 1000 GP
- All VCF files and indexes present in this FTP folder
- The associated 1000 GP pedigree file.
These files should be stored in the same folder, which is defined by the `KGP_DIR` (default: `/mnt/odap-beegfs/software/gwas-resources/1000-genomes-HC`) Nextflow parameter.
GATK
- The reference genome published by the Broad Institute.
This should be in a folder defined by the `GATK_DIR` (default: `/mnt/odap-beegfs/software/gwas-resources/gatk`) Nextflow parameter.
Running The Workflow
If the previous steps have been completed successfully you can run:
./ CombineGeneticDatasets
All outputs are produced in `PUBLISH_DIR` (defaults to `results`). The main outputs of the workflow are:
- A report of the pipeline.
- `{bed,bim,fam}`: The aggregated genotypes.
- `covariates.merged.csv`: The covariates inferred from the genotypes (ancestry, PCs).
Pipeline parameters
This is the list of all the pipeline's parameters. In principle they don't need to be changed if the conventions in this documentation have been respected and are up to date. Otherwise, please feel free to open an issue.
Input Files
These are project specific and need to be provided:
- `R8_GENOTYPES`: Prefix to release r8 genotypes (before 2021).
- `BEFORE_2024_GENOTYPES`: Prefix to genotypes released between 2021 and 2023.
- `SINCE_2024_GENOTYPES`: Prefix to genotypes released since 2024.
- `WGS_GVCFS` (optional): Prefix to whole-genome sequencing files.
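One way to keep these project-specific inputs together is a parameters file. Nextflow can read parameters from a JSON file through its `-params-file` option, but whether this workflow's launcher forwards that option is an assumption. The sketch below only assembles the four input parameters named above; the helper names and all paths except the r8 example are hypothetical placeholders.

```python
import json


def build_params(r8, before_2024, since_2024, wgs_gvcfs=None):
    """Collect the project-specific input parameters in a dictionary."""
    params = {
        "R8_GENOTYPES": r8,
        "BEFORE_2024_GENOTYPES": before_2024,
        "SINCE_2024_GENOTYPES": since_2024,
    }
    # WGS data is optional, so the key is only set when a prefix is given.
    if wgs_gvcfs is not None:
        params["WGS_GVCFS"] = wgs_gvcfs
    return params


def write_params(params, path):
    """Write the parameters as JSON, a format Nextflow's -params-file accepts."""
    with open(path, "w") as handle:
        json.dump(params, handle, indent=2)


if __name__ == "__main__":
    params = build_params(
        r8="wp5-gwas-r8-under90excl_2021Sep16/PLINK_190921_0906/wp5-gwas-r8-under90excl_2021Sep16",
        before_2024="/path/to/before-2024/prefix",  # hypothetical placeholder
        since_2024="/path/to/since-2024/prefix",    # hypothetical placeholder
    )
    write_params(params, "params.json")
```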
External Inputs Parameters
These are already set if you are using ODAP:
- (default: "./assets/resources"): Path to all external resources.
- `KGP_DIR`: Path to the 1000 Genomes Project specific resources (see The 1000 GP).
- `GATK_DIR`: Path to GATK specific resources (see GATK).
- `GRC37_TO_GRC38_CHAIN_FILE` (default: "./assets/hg19ToHg38.over.chain.gz"): Path to the chain file used to lift over the GRCh37 genotypes to GRCh38.
QC Parameters
- (default: 0.02): Maximum missing rate per variant across all individuals. Variants above the threshold are dropped.
- `QC_INDIVIDUAL_MISSING_RATE` (default: 0.02): Maximum missing rate per individual across genotypes. Individuals above the threshold are dropped.
- `QC_HWE_P` (default: 1e-10): Used to identify potential technical artifacts and drop variants.
- `QC_HWE_K` (default: 0.001): Used together with `QC_HWE_P`.
- (default: true): Whether PCA is performed via approximation.
- `FILTER_HIGH_LOADINGS_VARIANTS` (default: false): Whether to drop variants with high PCA loadings. If the loadings plot exhibits a high peak, you may want to turn this on.
- `ANCESTRY_THRESHOLD` (default: 0.8): For each individual, the most likely ancestry estimate should be greater than this threshold, otherwise the individual is marked as admixed.
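To make the threshold semantics concrete, here is a minimal Python sketch of the two missing-rate filters and of the ancestry rule. It is an illustration under stated assumptions, not the pipeline's implementation: the function names, the `"ADMIXED"` label, the ancestry labels and the order of the filters (variants first, then individuals) are all hypothetical.

```python
def missing_rate(calls):
    """Fraction of missing calls (None) in a sequence; 0.0 if it is empty."""
    if not calls:
        return 0.0
    return sum(call is None for call in calls) / len(calls)


def filter_by_missingness(genotypes, variant_max=0.02, individual_max=0.02):
    """Return the indices of the individuals and variants that are kept.

    `genotypes` holds one row per individual and one call per variant.
    Assumption: variants are filtered first, then individuals are
    evaluated on the remaining variants.
    """
    n_variants = len(genotypes[0])
    kept_variants = [
        j for j in range(n_variants)
        if missing_rate([row[j] for row in genotypes]) <= variant_max
    ]
    kept_individuals = [
        i for i, row in enumerate(genotypes)
        if missing_rate([row[j] for j in kept_variants]) <= individual_max
    ]
    return kept_individuals, kept_variants


def assign_ancestry(estimates, threshold=0.8):
    """Return the most likely ancestry, or "ADMIXED" if it is not above threshold.

    `estimates` maps an ancestry label to its estimated proportion.
    """
    label, value = max(estimates.items(), key=lambda item: item[1])
    return label if value > threshold else "ADMIXED"
```

With the default threshold, an individual estimated at 0.9 for one ancestry keeps that label, while one estimated at 0.6 is marked as admixed.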
Output Directories Parameters
- `PUBLISH_DIR` (default: "results"): Top-level directory where outputs will be written.
- `KGP_PUBLISH_DIR` (default: "results/kgp"): Where 1000 Genomes Project data will be written.
- `ARRAY_GENOTYPES_PUBLISH_DIR` (default: "results/array_genotypes"): Where data associated with the genotyping arrays will be written.
- `WGS_PUBLISH_DIR` (default: "results/wgs"): Where data associated with the whole-genome sequencing data will be written.
- `GATK_PUBLISH_DIR` (default: "results/gatk"): Where data associated with GATK requirements will be written.
- `MERGED_PUBLISH_DIR` (default: "results/merged"): Where the merged genetic data will be written.
## Current Limitations
These are current limitations of the aggregation workflow:
- Only chromosomes 1 to 22 are processed.
- Only bi-allelic SNPs are used.
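As an illustration of what these two limitations mean for a single variant, the sketch below tests one line of a PLINK `.bim` file. It is not how the workflow applies the restriction. It assumes the standard six-column `.bim` layout and treats a variant as a bi-allelic SNP when both alleles are distinct single bases. The function name is hypothetical.

```python
AUTOSOMES = {str(chrom) for chrom in range(1, 23)}
BASES = {"A", "C", "G", "T"}


def is_autosomal_snp(bim_line):
    """Check one .bim line: chromosome 1 to 22 and two distinct single-base alleles.

    Columns: chromosome, variant id, genetic position, base-pair position,
    allele 1, allele 2.
    """
    chrom, _variant_id, _genetic_pos, _bp, allele_1, allele_2 = bim_line.split()
    # Accept both "22" and "chr22" chromosome codes.
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return (
        chrom in AUTOSOMES
        and allele_1 in BASES
        and allele_2 in BASES
        and allele_1 != allele_2
    )
```

Under these assumptions, a variant on chromosome X or an indel such as `AT`/`A` is rejected.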