Genetic quality control was done by adapting the general SGDP pipeline for genetic QC, which can be found here. A previous publication by Joni Coleman using the same approach can be found here.
This pipeline outlines:
\[\\[0.5in]\]
Before running this quality control pipeline, the full UKB genetic data set has been reduced to the 39,947 participants with complete phenotypic data. This step is performed in PLINK.
plink \
--bed /ukb_binary_v2.bed \
--bim /ukb_binary_v2.bim \
--fam /wukb18177_cal_chr1_v2_s488264.fam \
--make-bed \
--keep /participants_of_interest_QC.txt \
--out /wukb18177_input_neuroimaging
This wrapper script allows to run the SGDP genetic QC pipeline outlined at the top of this document.
#!/bin/bash -l
#SBATCH --output=/geno_qc_UKB18177.out
#SBATCH --job-name=geno_qc_UKB
#SBATCH --mem=30G
module add apps/plink/1.9.0b6.10
module add apps/flashpca/2.0
bash ~/Genotyped_Data_QC_JRIC_240119.sh \
-b wukb18177_input_neuroimaging.bed \
-j wukb18177_input_neuroimaging.bim \
-f wukb18177_input_neuroimaging.fam \
-m 0.01 \
-v 0.02 \
-s 0.02 \
-a 1 \
-q 1 \
-e ukb1817_het_miss_excl_rows_v3.txt \
-g 1 \
-x ukb1817_reported_sex_from_sqc_v2.txt \
-r 0.044 \
-l ukb1817_rel_pairs.txt \
-t 0 \
-u participants_of_interest_QC.txt \
-h 0.00000001 \
-d 0 \
-p 1 \
-c ukb1817_Caucasians_4Means_PLINK.txt \
-o geneticQC_UKB_15042021_
\[\\[0.5in]\]
Based on genetic quality control, the following participants and variants have been excluded:
Out of 39,947 participants, we retain a total of 36,778 participants.
Process | Removed | Remaining |
---|---|---|
No genetic data | 990 | 38,957 |
European clustering | 919 | 38,038 |
Quality assurance by UKB | 72 | 37,966 |
Missingness rates | 204 | 37,762 |
Relatedness | 956 | 36,806 |
Sex check | 28 | 36,778 |
Out of 805,426 variants, we retained a total of 587,583 variants.
Process | Removed | Remaining |
---|---|---|
Missing genotype data | 104,771 | 700,655 |
Hardy-Weinberg exact test | 9,935 | 690,720 |
Minor allele frequency | 103,137 | 587,583 |
Refer to the last script in the SGDP quality control pipeline for the code used to clean imputed genotype data.
UKBB data as received is vast and larger than most users will use for common-variant analyses. Thin the UKBB data to MAF >= 0.01 and INFO >= 0.4.
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr1_v3.txt > ukb_mfi_chr1_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr2_v3.txt > ukb_mfi_chr2_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr3_v3.txt > ukb_mfi_chr3_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr4_v3.txt > ukb_mfi_chr4_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr5_v3.txt > ukb_mfi_chr5_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr6_v3.txt > ukb_mfi_chr6_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr7_v3.txt > ukb_mfi_chr7_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr8_v3.txt > ukb_mfi_chr8_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr9_v3.txt > ukb_mfi_chr9_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr10_v3.txt > ukb_mfi_chr10_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr11_v3.txt > ukb_mfi_chr11_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr12_v3.txt > ukb_mfi_chr12_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr13_v3.txt > ukb_mfi_chr13_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr14_v3.txt > ukb_mfi_chr14_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr15_v3.txt > ukb_mfi_chr15_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr16_v3.txt > ukb_mfi_chr16_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr17_v3.txt > ukb_mfi_chr17_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr18_v3.txt > ukb_mfi_chr18_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr19_v3.txt > ukb_mfi_chr19_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr20_v3.txt > ukb_mfi_chr20_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr21_v3.txt > ukb_mfi_chr21_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chr22_v3.txt > ukb_mfi_chr22_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chrX_v3.txt > ukb_mfi_chrX_v3_MAF1_INFO4.txt
awk '$6 >= 0.01 && $8 >= 0.4 {print $2}' ukb_mfi_chrXY_v3.txt > ukb_mfi_chrXY_v3_MAF1_INFO4.txt
for i in {22..1} X XY
do
bgen_tools/bin/bgenix -g ukb_imp_chr${i}_v3.bgen -i ukb_imp_chr${i}_v3.bgen.bgi -incl-rsids ukb_mfi_chr${i}_v3_MAF1_INFO4.txt > ukb_imp_chr${i}_v3_MAF1_INFO4.bgen
bgen_tools/bin/bgenix -g ukb_imp_chr${i}_v3_MAF1_INFO4.bgen -index
done
By Anna Elisabeth Fürtjes
anna.furtjes@kcl.ac.uk