Genetic quality control

Genetic quality control (QC) was performed by adapting the general SGDP genetic QC pipeline, which can be found here. A previous publication by Joni Coleman that used the same approach can be found here.

This pipeline outlines:

A wrapper script for genetic quality control
A script to identify participants of European ancestry using 4-means clustering on the first two genetic principal components
The script for genetic quality control that is called by the wrapper script (Genotyped_Data_QC_JRIC_240119.bash)
A script to thin imputed data
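To illustrate the idea behind the ancestry step listed above: k-means clustering with k = 4 assigns each participant to the nearest of four centroids in the space of the first two principal components, then recomputes the centroids until assignments stop changing. The following is a minimal sketch of that procedure (Lloyd's algorithm) in awk. It is not the SGDP pipeline's own script; the function name `kmeans_2d`, the input layout (ID, PC1, PC2), the toy values, and seeding the centroids with the first k rows are all assumptions made for illustration. How the European cluster is then selected is not covered here.

```shell
# usage: kmeans_2d K FILE   (FILE columns: ID PC1 PC2; hypothetical layout)
kmeans_2d() {
    awk -v k="$1" '
    {
        id[NR] = $1; x[NR] = $2; y[NR] = $3
        if (NR <= k) { cx[NR] = $2; cy[NR] = $3 }   # seed centroids with the first k rows
    }
    END {
        n = NR
        for (iter = 1; iter <= 100; iter++) {
            changed = 0
            # assignment step: nearest centroid by squared Euclidean distance
            for (i = 1; i <= n; i++) {
                best = 1; bestd = -1
                for (c = 1; c <= k; c++) {
                    d = (x[i] - cx[c])^2 + (y[i] - cy[c])^2
                    if (bestd < 0 || d < bestd) { bestd = d; best = c }
                }
                if (cl[i] != best) { cl[i] = best; changed = 1 }
            }
            if (!changed) break
            # update step: move each centroid to the mean of its members
            for (c = 1; c <= k; c++) { sx[c] = 0; sy[c] = 0; m[c] = 0 }
            for (i = 1; i <= n; i++) { sx[cl[i]] += x[i]; sy[cl[i]] += y[i]; m[cl[i]]++ }
            for (c = 1; c <= k; c++) if (m[c] > 0) { cx[c] = sx[c] / m[c]; cy[c] = sy[c] / m[c] }
        }
        for (i = 1; i <= n; i++) print id[i], cl[i]
    }' "$2"
}

# Toy principal components (made-up values forming four well-separated groups)
pcs=$(mktemp)
printf '%s\n' 'a 0 0' 'b 10 0' 'c 0 10' 'd 10 10' 'e 1 0' 'f 10 1' 'g 0 11' 'h 11 10' > "$pcs"
kmeans_2d 4 "$pcs"
rm -f "$pcs"
```

Each output line gives a participant ID and its cluster index. With real data the result depends on how the centroids are seeded, so a production script would typically use several random starts.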


Preparation

Before running this quality control pipeline, the full UKB genetic data set was reduced to the 39,947 participants with complete phenotypic data. This step was performed in PLINK.

plink \
--bed /ukb_binary_v2.bed  \
--bim /ukb_binary_v2.bim \
--fam /wukb18177_cal_chr1_v2_s488264.fam \
--make-bed \
--keep /participants_of_interest_QC.txt \
--out /wukb18177_input_neuroimaging
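Not every participant listed in a keep file necessarily has genotype data, so it can be useful to compare the keep file against the .fam file before filtering. The sketch below counts how many requested participants are present in a .fam file. It assumes the keep file holds family ID and individual ID in its first two columns (the layout PLINK's `--keep` expects); the function name `count_with_genotypes` and the toy files are hypothetical.

```shell
# usage: count_with_genotypes KEEP_FILE FAM_FILE
count_with_genotypes() {
    awk 'NR == FNR { fam[$1 " " $2]; next }              # first file: index FID/IID pairs in the .fam
         { requested++; if (($1 " " $2) in fam) found++ } # second file: look up each requested pair
         END { printf "%d requested, %d with genotype data\n", requested, found }' "$2" "$1"
}

# Toy files with made-up IDs
fam=$(mktemp); keep=$(mktemp)
printf '%s\n' '1 1 0 0 1 -9' '2 2 0 0 2 -9' '3 3 0 0 1 -9' > "$fam"
printf '%s\n' '1 1' '3 3' '4 4' > "$keep"
count_with_genotypes "$keep" "$fam"   # prints "3 requested, 2 with genotype data"
rm -f "$fam" "$keep"
```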

Wrapper script used in this analysis

This wrapper script runs the SGDP genetic QC pipeline referenced at the top of this document.

#!/bin/bash -l

#SBATCH --output=/geno_qc_UKB18177.out
#SBATCH --job-name=geno_qc_UKB
#SBATCH --mem=30G

module add apps/plink/1.9.0b6.10
module add apps/flashpca/2.0

bash ~/Genotyped_Data_QC_JRIC_240119.sh \
-b wukb18177_input_neuroimaging.bed \
-j wukb18177_input_neuroimaging.bim \
-f wukb18177_input_neuroimaging.fam \
-m 0.01 \
-v 0.02 \
-s 0.02 \
-a 1 \
-q 1 \
-e ukb1817_het_miss_excl_rows_v3.txt \
-g 1 \
-x ukb1817_reported_sex_from_sqc_v2.txt \
-r 0.044 \
-l ukb1817_rel_pairs.txt \
-t 0 \
-u participants_of_interest_QC.txt \
-h 0.00000001 \
-d 0 \
-p 1 \
-c ukb1817_Caucasians_4Means_PLINK.txt \
-o geneticQC_UKB_15042021_


Outcome genotype quality control

Genetic quality control excluded the following participants and variants.

Out of 39,947 participants, we retained a total of 36,778 participants.

Process	Removed	Remaining
No genetic data	990	38,957
European clustering	919	38,038
Quality assurance by UKB	72	37,966
Missingness rates	204	37,762
Relatedness	956	36,806
Sex check	28	36,778

Out of 805,426 variants, we retained a total of 587,583 variants.

Process	Removed	Remaining
Missing genotype data	104,771	700,655
Hardy-Weinberg exact test	9,935	690,720
Minor allele frequency	103,137	587,583
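The "Remaining" columns in both tables follow from running subtraction of the "Removed" counts from the starting totals. The sketch below reproduces that arithmetic from the numbers reported above; the helper name `running_remainder` is illustrative and not part of the pipeline.

```shell
# usage: running_remainder START REMOVED_1 REMOVED_2 ...
# Prints "removed remaining" after each exclusion step.
running_remainder() {
    local remaining=$1
    shift
    local removed
    for removed in "$@"; do
        remaining=$((remaining - removed))
        echo "$removed $remaining"
    done
}

# Participants: the last line reads "28 36778"
running_remainder 39947 990 919 72 204 956 28

# Variants: the last line reads "103137 587583"
running_remainder 805426 104771 9935 103137
```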

Cleaning imputed data

Refer to the last script in the SGDP quality control pipeline for the code used to clean imputed genotype data.

The UKB imputed data as received are far larger than most users need for common-variant analyses. The data were therefore thinned to variants with MAF >= 0.01 and INFO >= 0.4.

# Keep the rsIDs (column 2) of variants with MAF (column 6) >= 0.01 and INFO (column 8) >= 0.4
for i in {1..22} X XY
do
    awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr${i}_v3.txt > ukb_mfi_chr${i}_v3_MAF1_INFO4.txt
done

for i in {22..1} X XY
do
    bgen_tools/bin/bgenix -g ukb_imp_chr${i}_v3.bgen -i ukb_imp_chr${i}_v3.bgen.bgi -incl-rsids ukb_mfi_chr${i}_v3_MAF1_INFO4.txt > ukb_imp_chr${i}_v3_MAF1_INFO4.bgen

    bgen_tools/bin/bgenix -g ukb_imp_chr${i}_v3_MAF1_INFO4.bgen -index
done
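The MAF and INFO filter used above can be tried on a small toy file before running it on the full MFI files. The sketch below assumes the column layout that the filter itself relies on (column 2 = rsID, column 6 = MAF, column 8 = INFO); the function name `filter_mfi` and all values in the toy file are made up for illustration.

```shell
# Print the rsIDs of variants passing MAF >= 0.01 and INFO >= 0.4
filter_mfi() {
    awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' "$1"
}

# Toy stand-in for an MFI file (hypothetical variants)
mfi=$(mktemp)
cat > "$mfi" <<'EOF'
1:100_A_G rs1 100 A G 0.25 G 0.99
1:200_C_T rs2 200 C T 0.005 T 0.95
1:300_G_A rs3 300 G A 0.10 A 0.30
1:400_T_C rs4 400 T C 0.01 C 0.4
EOF

filter_mfi "$mfi"   # keeps rs1 and rs4; rs2 fails the MAF threshold, rs3 fails INFO
rm -f "$mfi"
```

Because both thresholds use `>=`, variants sitting exactly on a threshold (rs4 here) are retained.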

By Anna Elisabeth Fürtjes

anna.furtjes@kcl.ac.uk