Genetic quality control (QC) was performed by adapting the general SGDP genetic QC pipeline, which can be found here. A previous publication by Joni Coleman using the same approach can be found here.


This pipeline outlines:

  • A wrapper script for genetic quality control
  • A script to identify participants of European ancestry using 4-means clustering on the first two genetic principal components
  • The script for genetic quality control that is called by the wrapper script (Genotyped_Data_QC_JRIC_240119.bash)
  • A script to thin imputed data




Preparation

Before running this quality control pipeline, the full UKB genetic data set was reduced to the 39,947 participants with complete phenotypic data. This step was performed in PLINK.

plink \
--bed /ukb_binary_v2.bed  \
--bim /ukb_binary_v2.bim \
--fam /wukb18177_cal_chr1_v2_s488264.fam \
--make-bed \
--keep /participants_of_interest_QC.txt \
--out /wukb18177_input_neuroimaging
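As a sanity check on this subsetting step, one could count how many IDs from the keep list actually appear in a .fam file. The following is a minimal sketch, not part of the SGDP pipeline: the function name count_keep_overlap is hypothetical, and it assumes both files carry FID and IID in their first two whitespace-separated columns.

```shell
#!/bin/bash
# Hypothetical helper (not part of the SGDP pipeline).
# Reports how many FID/IID pairs from a PLINK --keep list are present in a .fam file.
# Assumption: FID and IID are the first two columns of both files.
count_keep_overlap() {
    local keep_file=$1 fam_file=$2
    awk 'NR == FNR { fam[$1 FS $2]; next }                  # first file: remember every .fam ID pair
         { total++; if (($1 FS $2) in fam) found++ }        # second file: look up each keep-list pair
         END { printf "%d found, %d missing\n", found, total - found }' \
        "$fam_file" "$keep_file"
}

# Example call with the file names used above:
# count_keep_overlap /participants_of_interest_QC.txt /wukb18177_input_neuroimaging.fam
```

IDs reported as missing would correspond to participants in the phenotype list for whom no genotype record exists.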



Wrapper script used in this analysis

This wrapper script runs the SGDP genetic QC pipeline referenced at the top of this document.

#!/bin/bash -l

#SBATCH --output=/geno_qc_UKB18177.out
#SBATCH --job-name=geno_qc_UKB
#SBATCH --mem=30G

module add apps/plink/1.9.0b6.10
module add apps/flashpca/2.0

bash ~/Genotyped_Data_QC_JRIC_240119.sh \
-b wukb18177_input_neuroimaging.bed \
-j wukb18177_input_neuroimaging.bim \
-f wukb18177_input_neuroimaging.fam \
-m 0.01 \
-v 0.02 \
-s 0.02 \
-a 1 \
-q 1 \
-e ukb1817_het_miss_excl_rows_v3.txt \
-g 1 \
-x ukb1817_reported_sex_from_sqc_v2.txt \
-r 0.044 \
-l ukb1817_rel_pairs.txt \
-t 0 \
-u participants_of_interest_QC.txt \
-h 0.00000001 \
-d 0 \
-p 1 \
-c ukb1817_Caucasians_4Means_PLINK.txt \
-o geneticQC_UKB_15042021_


Outcome genotype quality control

Based on genetic quality control, the following participants and variants have been excluded:

Out of 39,947 participants, we retained a total of 36,778 participants.

Process                     Removed   Remaining
No genetic data                 990      38,957
European clustering             919      38,038
Quality assurance by UKB         72      37,966
Missingness rates               204      37,762
Relatedness                     956      36,806
Sex check                        28      36,778


Out of 805,426 variants, we retained a total of 587,583 variants.

Process                     Removed   Remaining
Missing genotype data       104,771     700,655
Hardy-Weinberg exact test     9,935     690,720
Minor allele frequency      103,137     587,583
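
The running totals in the two tables above can be verified with simple subtraction. The sketch below is illustrative only: check_cascade is a hypothetical helper that takes a starting count followed by removed:remaining pairs copied from the tables, and reports a mismatch if any step does not add up.

```shell
#!/bin/bash
# Hypothetical helper: check_cascade START REMOVED:REMAINING [REMOVED:REMAINING ...]
# Subtracts each "removed" count in turn and compares the result
# with the "remaining" count reported for that step.
check_cascade() {
    local remaining=$1; shift
    local step removed reported
    for step in "$@"; do
        removed=${step%%:*}     # part before the colon
        reported=${step##*:}    # part after the colon
        remaining=$((remaining - removed))
        if [ "$remaining" -ne "$reported" ]; then
            echo "mismatch: reported $reported, computed $remaining" >&2
            return 1
        fi
    done
    echo "$remaining"
}

# Participants (counts taken from the table above); prints 36778
check_cascade 39947 990:38957 919:38038 72:37966 204:37762 956:36806 28:36778
# Variants (counts taken from the table above); prints 587583
check_cascade 805426 104771:700655 9935:690720 103137:587583
```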




Cleaning imputed data

Refer to the last script in the SGDP quality control pipeline for the code used to clean imputed genotype data.

The UKBB imputed data as received are far larger than most common-variant analyses require. Thin the UKBB data to variants with MAF >= 0.01 and INFO >= 0.4.

# For each chromosome, list the rsIDs (column 2) of variants with MAF (column 6) >= 0.01 and INFO (column 8) >= 0.4
for i in {1..22} X XY
do
    awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr${i}_v3.txt > ukb_mfi_chr${i}_v3_MAF1_INFO4.txt
done

for i in {22..1} X XY
do
    bgen_tools/bin/bgenix -g ukb_imp_chr${i}_v3.bgen -i ukb_imp_chr${i}_v3.bgen.bgi -incl-rsids ukb_mfi_chr${i}_v3_MAF1_INFO4.txt > ukb_imp_chr${i}_v3_MAF1_INFO4.bgen

    bgen_tools/bin/bgenix -g ukb_imp_chr${i}_v3_MAF1_INFO4.bgen -index
done
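
To illustrate what the MAF/INFO filter keeps, the sketch below wraps the same awk condition in a small function and can be tried on a toy file before running it on the real MFI files. The function name filter_mfi and any toy input are hypothetical; the column positions ($2 = rsID, $6 = MAF, $8 = INFO) are taken from the awk commands above.

```shell
#!/bin/bash
# Hypothetical helper: print the rsID of every variant passing the MAF and INFO thresholds.
# Column layout follows the awk commands above: $2 = rsID, $6 = MAF, $8 = INFO.
# Thresholds default to the values used in this pipeline (0.01 and 0.4).
filter_mfi() {
    local mfi_file=$1 min_maf=${2:-0.01} min_info=${3:-0.4}
    awk -v maf="$min_maf" -v info="$min_info" \
        '$6 >= maf && $8 >= info { print $2 }' "$mfi_file"
}

# Example call on one of the files used above:
# filter_mfi ukb_mfi_chr22_v3.txt > ukb_mfi_chr22_v3_MAF1_INFO4.txt
```

Because both comparisons use >=, variants sitting exactly on a threshold (MAF of 0.01 or INFO of 0.4) are retained.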
 

By Anna Elisabeth Fürtjes

anna.furtjes@kcl.ac.uk