Processing phenotypic data

\[\\[0.5in]\] Magnetic resonance imaging (MRI) data was collected at assessment sites with identical hardware and software in Manchester, Newcastle, and Reading. Brain volumetric phenotypes were pre-processed by an imaging-pipeline developed and executed on behalf of UK Biobank REF. More information on T1 processing can be found in the UKB online documentation REF. Briefly, cortical surfaces were modelled using FreeSurfer, and volumes were extracted based on Desikan-Killiany surface templates REF; subcortical areas were derived using FreeSurfer aeseg tools REF. Volumetric measures have been generated in each participant’s native space. We selected 83 imaging-derived phenotypes (IDPs) of cortical and subcortical grey-matter volumes in regions of interest spanning the whole brain (UKB category 192 & 190).

GWAS phenotypes are cortical and subcortical brain regions (UKB category 192 and 190) measured in participants of European ancestry where T1-weighted images were used in conjunction with T2-weighted FLAIR (item 26500). Extreme outliers outside of 4 standard deviations will be excluded. We will add a handful of nuisance variables as covariates in the GWAS calculation: acquisition site (item 54), head position (X,Y,Z coordinates, items 25756, 25757, 25758), time of year (item 53), age at neuroimaging visit (item 21003) and self-reported sex (item 31), genotyping batch (item 22000), and 40 genetic principal components. We will empirically test whether the neuroimaging-specific covariatessubstantially correlate with the imaging phenotypes (arbitrary cut-off at r ≤ .10) before calculating GWAS and only include them as covariates if they do. The brain phenotypes are also used for comparison between phenoypic and genetic networks later in the study.

\[\\[0.1in]\]

Phenotypic data cleaning

Participants’ IDPs were considered if their T1-weighted images were processed in conjunction with T2-weighted FLAIR (UKB item 26500), which was the case for n = 41,776 participants (already excluding participants who had withdrawn consent). Extreme outliers outside of 4 standard deviations from the mean were excluded, which resulted in between 41,686 to 41,769 available participants depending on the specific IDP. 381 participants were excluded as they reported non-European ancestry. Phenotypic quality control resulted in 39,947 complete cases across the 83 brain volumes and the covariates. Data by the remaining participants was used to perform genetic quality control.

#### Genomic PCA project ####
## here we prepare files to be used in the GWAS calculation with cortical and subcortical phenotypes
# Aims:
# 1. obtain a file with all covariates 
# 2. obtain a list with participant IDs to include in GWAS 


########################################################################
## read in data
########################################################################

# load dependencies
library(data.table)

# read in UKB file
setwd("/scratch/datasets/ukbiobank/ukb18177/phenotypes/")
file<-list.files(pattern="ukb46293.csv")
file<-fread(file,header=T,data.table=F)


########################################################################
## Identify the cortical and subcortical volume columns
########################################################################

setwd("~/usr/anna/PhD/scripts/pheno_preparation/")
ref<-fread("region_codes.txt",data.table=F)
print("Print head references file containing brain area codes")
head(ref)

# name columns according to their name in region_codes.txt
for(i in 1:nrow(ref)){
    number<-paste0(ref$No[i],"-2.0")
    region<-ref$Region[i]
    names(file)[grep(number,colnames(file))]<-region
}
 # save names of volumes in "keep_volumes"
keep_volumes<-names(file)[grepl("Right",names(file))|grepl("Left",names(file))|grepl("Brain_stem",names(file))]

# check that there are 83 volumes
print(length(keep_volumes))


## identify item 26500 which indicates whether participants brain measures were estracted based on T1-weighted AND T2 FLAIR
# we only keep the one with both
names(file)[grep("26500",colnames(file))]<-"T2flair"

print("Table T2FLAIR: 1 = yes, 0 = no")
table(file$T2flair)
# 1 = yes
# 0 = no 
# keep only yes
file[which(file$T2flair ==0),]<-NA
# remove the missing participants
file <- file[rowSums(is.na(file)) != ncol(file), ]

print("After removing T2FLAIR and all participants that have all missing values, what are the dimensions of the file?")
# show dimensions of file
dim(file)

########################################################################
# determine head positioning variables
########################################################################

names(file)[grep("25756-2.0",colnames(file))]<-"x_coordinate"
names(file)[grep("25757-2.0",colnames(file))]<-"y_coordinate"
names(file)[grep("25758-2.0",colnames(file))]<-"z_coordinate"

# save their column names in an object
coordinates <- names(file)[grep("coordinate",colnames(file))]


########################################################################
## only keep columns of interest
########################################################################
# keep id column and volumes
keep<-append("eid",keep_volumes)
# keep coordinate columns
keep<-append(keep,coordinates)
file<-file[,keep]

print("Print dimensions with only relevant columns. We expect 87.")
dim(file)



#########################################################################
## Get rid of people with cortical or subcortical measures above 4 SDs
#########################################################################

print("How many participants before removing outliers?")
colSums(!is.na(file))

# remove outliers for each volume column

for(i in keep_volumes){
    # calculate volume specific mean 
        mean_volume<-mean(file[,i],na.rm = T)
    # calculate volume specific sd
        sd_volume <- sd(file[,i], na.rm=T)
    # remove values that are beyond 4 SDs
        file[,i][which(file[,i] < mean_volume - (sd_volume*4) | file[,i] > mean_volume + (sd_volume*4))]<-NA
}


#mean_volume<-mean(file$Left_bankssts,na.rm = T)
#sd_volume <- sd(file$Left_bankssts, na.rm=T)
#file$Left_bankssts[which(file$Left_bankssts < mean_volume - (sd_volume*4) | file$Left_bankssts > mean_volume + (sd_volume*4))]<-NA


print("How many participants after removing outliers?")
colSums(!is.na(file))


#########################################################################
## Keep participants that have at least one entry in any of the columns
#########################################################################
# ncol(file)-1 because every participant will have at least an id value 

file <- file[rowSums(is.na(file)) != ncol(file)-1, ]

#######################################################################################################
# read in file containing age, sex, acquisition site variable
#######################################################################################################
### for now we only want to see how associated these extra variables are with the brain phenotypes to 
# test if we want to include them as covariates (cut off criterion r > 0.1)

setwd("/scratch/datasets/ukbiobank/ukb18177/phenotypes/")

age_file<-list.files(pattern="ukb37667.csv")
age_file<-fread(age_file,header=T,data.table=F)

print("Dimensions of age_file when loaded into R")
dim(age_file)


#######################################################################################################
# determine self-reported ancestry
#######################################################################################################

# keep European ancestry item 21000 and delete those from other background
names(age_file)[grep("21000.2",colnames(age_file))]<-"ancestry"

age_file$European<-ifelse(age_file$ancestry == 1001 | age_file$ancestry == 1002 | age_file$ancestry == 1003,1,0)

print("How many people of European ancestry?")
sum(age_file$European==1,na.rm=T)
table(age_file$European==1)

age_file<-age_file[-which(age_file$European==0),]

print("How many left when only European?")
dim(age_file)


#######################################################################################################
# determine sex
#######################################################################################################
names(age_file)[grep("31.0",colnames(age_file))]<-"sex"

#######################################################################################################
## identify acquisition site variable
#######################################################################################################
# acquisition site variable is not in the age file 

names(age_file)[grep("54-2",colnames(age_file))]<-"site"

# It would make more sense if site was a factor but if it is, the correlation doesn't run
#age_file$site<-as.factor(age_file$site)
#site_file <- site_file[,c("eid","site")]
#age_file<-merge(age_file,site_file,by="eid", all=T)





####################################################################################################
# AGE VARIABLE
#######################################################################################################
# keep only age relevant rows 
names(age_file)[grep("21003-2.0",colnames(age_file))]<-"age_neuroimaging"
names(age_file)[grep("52-0.0",colnames(age_file))]<-"birth_month"
names(age_file)[grep("53-2",colnames(age_file))]<-"date_imaging"
names(age_file)[grep("21003-0.0",colnames(age_file))]<-"age_assessment"
names(age_file)[grep("53-0.0",colnames(age_file))]<-"date_assessment"


length(age_file$age_neuroimaging)

summary(age_file$age_neuroimaging)

length(age_file$birth_month)

summary(age_file$birth_month)


# only keep columns of interest
age_file<-age_file[,c("eid","birth_month","date_imaging","age_neuroimaging","age_assessment","date_assessment","sex","site")]


# work out month in which participant attended imaging
age_file$attendance_month<-as.numeric(substr(age_file$date_imaging,6,7))

# difference between attendance and birth month
age_file$add_months<-(age_file$attendance_month)-(age_file$birth_month)

age_file$add_months<-ifelse(age_file$add_months<0,(12+age_file$add_months),ifelse(age_file$add_months==0,0,age_file$add_months))

# if any remaining fields are now above 11 or below 0, something must have gone wrong and we delete
age_file$add_months<-ifelse(age_file$add_months>11|age_file$add_months<0,NA,age_file$add_months)

# descriptives months to add
summary(age_file$add_months)


# add extra months to age in months
age_file$age_in_months<-(age_file$age_neuroimaging*12)+age_file$add_months

# whoever didn't indicate birth_month would have missing values, use age at neuoroimaging visit in months
age_file$age_in_months<-ifelse(is.na(age_file$age_in_months),age_file$age_neuroimaging*12,age_file$age_in_months)

# with this definition we have 8000 people without age (whoever is left with NA now had no data in both items age_at_assement and birth_month)
# try to rescue using data from first assessment and imput lag using Simons definition

age_file$date1 <- as.Date(age_file$date_assessment, format="%Y-%m-%d")
age_file$date3 <- as.Date(age_file$date_imaging, format="%Y-%m-%d")
age_file$lag1to3 <- as.vector(age_file$date3 - age_file$date1)
age_file$lag_in_years<-age_file$lag1to3/365.25
#hist(age_file$lag1to3)
# make a variable to add lag to age at assessment
age_file$age_inferred_assessment_months<-(age_file$age_assessment +age_file$lag_in_years)*12

#if there is a missing value: use age at assessment without lag inferred
age_file$age_inferred_assessment_months<-ifelse(is.na(age_file$age_inferred_assessment_months),(age_file$age_assessment*12),age_file$age_inferred_assessment_months)


age_file$age_in_months<-ifelse(is.na(age_file$age_in_months),age_file$age_inferred_assessment_months,age_file$age_in_months)

#descriptives for age_in_months including added months
print("age in months")
summary(age_file$age_in_months)


print("age in years")
summary(age_file$age_in_months/12)


# merge by eid with file
dim(age_file)


dim(file)

#######################################################################################################
## identify time of year
#######################################################################################################

# create time of year variable
age_file$time_of_year<-format(age_file$date3, format="%m")
age_file$time_of_year<-as.numeric(age_file$time_of_year)


age_file<-age_file[,c("eid","age_in_months","sex","time_of_year","site")] #"time_of_year"


##################################################################################################################
## Merge brain volume variables with demographics
##################################################################################################################

# merge by eid with file
print("Dimensions age_file")
dim(age_file)
print("Dimensions file")
dim(file)
# only keep rows that are present in both, because age_file includes the ethnicity exclusions
file<-merge(file,age_file,by="eid",all=F)
print("Dimensions after merging file and age_file")
dim(file)

print("Number of non-missing values")
colSums(!is.na(file))

# report sex of participants
non_missing<-file[which(!is.na(file$Brain_stem)),]
print("males =1 females = 0")
table(non_missing$sex)

##################################################################################################################
## Delete withdrawals
# in this example the withdrawals have already been removed so this will not change the dimensions of the file
# keeping it for completeness only 
##################################################################################################################
setwd("~/usr/anna/PhD/scripts/pheno_preparation/")
withdrawals <- read.csv("w18177_20210201_withdrawals.csv",header=F)
names(withdrawals)[1]<-"withdrawn"

file[file$eid %in% withdrawals$withdrawn,]<-NA

##################################################################################################################
## Correlations between potential covariates and brain volumes
##################################################################################################################

# correlations with age
age_cor<-as.data.frame(sapply(file[,-which(names(file)%in%c("eid","age_in_months","sex","site","time_of_year","x_coordinate","y_coordinate","z_coordinate"))],function(x) cor(x,file$age_in_months,use="pairwise.complete.obs",method="pearson")))
age_cor<-cbind(age_cor,rownames(age_cor))
rownames(age_cor)<-NULL
names(age_cor)<-c("age_cor","Region")

print(age_cor)

print("Mean age corr across all volumes")
mean(age_cor$age_cor)

all_vars_cor <- age_cor

candidate_covar <- c("sex","time_of_year","x_coordinate","y_coordinate","z_coordinate")

for(i in candidate_covar){
    # pick a name for the data.frame to be saved as
    cor_name<-paste0(i,"_cor")
    
    # calculate correlation between the covariate in i and all brain volumes 
    cor_all_volumes <- as.data.frame(sapply(file[,-which(names(file)%in%c("eid","age_in_months","sex","site","time_of_year","x_coordinate","y_coordinate","z_coordinate"))],function(x) cor(x,file[,i],use="pairwise.complete.obs",method="pearson")))
    
    # format the data.frame
    cor_all_volumes<-cbind(cor_all_volumes,rownames(cor_all_volumes))
    rownames(cor_all_volumes)<-NULL
    names(cor_all_volumes)<-c(cor_name,"Region")
    
    # add to age data.frame
    all_vars_cor <- merge(all_vars_cor, cor_all_volumes, by= "Region")
}

print(all_vars_cor)
summary(all_vars_cor)

write.table(all_vars_cor, "~/usr/anna/PhD/output/pheno_preparation/assoc_volumes_covars_13042021.txt", sep = "\t", row.names = FALSE, col.names = TRUE)


###################################################################################################
# as site is a categorical variable we cannot calculate the correlation with brain volumes in the same way
# we therefore calculate a linear model with site as predictor, and brain volume as outcome variable
# we extract the R2 and take the sqrt(R2) to get the correlation

### re-calculate assoc with aquisition site
# Noticed a mistake in the previous script, in which acquision site was treated as continious variable 

library(data.table)

## get acquision site
setwd("/scratch/datasets/ukbiobank/ukb18177/phenotypes/")

age_file<-list.files(pattern="ukb37667.csv")
age_file<-fread(age_file,header=T,data.table=F)

print("Dimensions of age_file when loaded into R")
dim(age_file)
names(age_file)[grep("54-2",colnames(age_file))]<-"site"

age_file<-age_file[,c("eid","site")]

# setwd for target phenotypes
setwd("/mnt/lustre/groups/ukbiobank/Edinburgh_Data/usr/anna/PhD/output/pheno_preparation/")

phenos<-fread("target_phenotypes_83volumes_gwas.txt",header=T)
names(phenos)[1]<-"eid"

data<-merge(phenos,age_file, by= "eid", all.x=T)

data$site<-as.factor(data$site)

traits<-names(phenos)[3:length(phenos)]

# run regressions for every trait individually
traits_models<- lapply(traits, function(x) {
  trait_descrip <- paste0(x, "~ site")
  mod <- lm(trait_descrip, data = data, na.action = na.exclude) 
  return(mod)
})
names(traits_models) <- traits

#build output
regress_output<-do.call(rbind, lapply(traits_models, function(z) summary(z)$r.squared))
regress_output<-as.data.frame(regress_output)
names(regress_output)<-"r2"
regress_output$r<-sqrt(regress_output$r2)
regress_output$Region<-row.names(regress_output)

# print output
fwrite(regress_output,file="/mnt/lustre/groups/ukbiobank/Edinburgh_Data/usr/anna/PhD/output/pheno_preparation/site_corr_volumes_12072021.txt",sep="\t", row.names=F,col.names=T)

Correlations with covariates

As outlined in the amendment of our pre-registration, we only consider covariates that demonstrate a correlation with brain volumes > .10. Here, we display the correlations results between each of the 83 brain volumes, and all our candidate covariates: age, sex, acquisition site, time of year, and x, y, z coordinates.

Average correlations across all 83 volumes:

Mean Age correlation = -0.14; SD = 0.07
Mean Sex correlation = 0.33; SD = 0.09
Mean Time of year correlation = -0.002; SD = 0.004
Mean X coordinate (head positioning) = 0.004; SD = 0.01
Mean Y coordinate correlation = -0.06; SD = 0.02
Mean Z coordinate correlation = -0.02; SD = 0.03
Mean Site correlation = 0.05; SD = 0.02

As stated in the pre-registration, we planned to include covariates with correlations above .10 as an arbitrary cut-off which is why we included age and sex, alongside genetic genotyping batch and 40 genetic principal components. The variables site, time of year and x, y, z scanner coordinates were excluded because with this arbitrary cut-off they explained less than 1% of the variance in the brain volumes.

Format output files

As most covariates in this selection, apart from age and sex, are not correlated with brain volumes (average correlation <.10), we only include age and sex and drop acquisition site, time of year and the x, y and z coordinates from the analyses.

Next, we create the input files for either the following genetic quality control steps or the GWAS analyses.

File containing covariates (age, sex, 40 genetic PCs and genotyping batch) covariates_volume_gwas.txt
File containing IDs of participants who have neuroimaging data and survived phenotypic QC above participants_of_interest_QC.txt
File containing phenotypic variables for the 83 regional volumes target_phenotypes_83volumes_gwas.txt

###############################################################################################################################
## create covariate file
###############################################################################################################################
# as the GWAS pipeline does not tolerate missing values, we delete all cases with missing values:
file<-file[,-which(names(file)%in%c("site","time_of_year","x_coordinate","y_coordinate","z_coordinate"))]
file <- file[complete.cases(file),]

# keep the columns we identified to be associated with the volumes
covariates<-file[,c("eid","age_in_months","sex")]
names(covariates)[which(names(covariates)=="eid")]<-"IID"


#################### 
## add genotyping batch
## read in fam_file and merge with covariates
fam_file<-fread("/mnt/lustre/groups/ukbiobank/ukb18177_glanville/genotyped/ukb18177_glanville_binary_pre_qc.fam",header=F,data.table=F)
fam_file<-fam_file[,c(1,6)]
names(fam_file)<-c("IID","batch")

covariates<-merge(covariates, fam_file,by="IID", all=FALSE)

#################### 
## add 40 genetic PCs
## read in 40 genetic PCs, and merge with covariates

PC_file<-fread("/scratch/groups/ukbiobank/KCL_Data/Genotypes/kylie_application/ukb1817_sqc_v2.txt",header=F,data.table=F)
PC_file[,ncol(PC_file)+1]<-PC_file[,1]
PC_file<-PC_file[,c(1,ncol(PC_file),28:67)]
names(PC_file)<-c("IID","FID",paste0("PC",1:40))

covariates<-merge(covariates, PC_file, by="IID", all=FALSE)

write.table(covariates, "~/usr/anna/PhD/output/pheno_preparation/covariates_volume_gwas.txt", sep = "\t", row.names = FALSE, col.names = TRUE, quote=F)

###############################################################################################################################
## save participant list
###############################################################################################################################
# this list will be fed into the QC pipeline
# the GreedyRelate algorithm uses it to delete all participants that we identify here to not be of interest
# file header: ID Pheno with 0 = exclude, 1 = include 
fam_file$Pheno <- 0
fam_file<-fam_file[,c("eid","Pheno")]

# we take the pariticpants of interest from the main file containing the cleaned volumetric data and covariates
IDs_of_interest<-file$eid
# all participants of interest will get the label 1
fam_file[which(fam_file$eid %in% IDs_of_interest),"Pheno"]<-1

names(fam_file)<-c("ID","Pheno")

write.table(fam_file, "~/usr/anna/PhD/output/pheno_preparation/participants_of_interest_QC.txt",sep = "\t", row.names = FALSE, col.names = TRUE, quote=F)

###############################################################################################################################
## save phenotype file
###############################################################################################################################

file$FID<-file$eid
names(pheno_file)[which(names(pheno_file)=="eid")]<-"IID"
pheno_file<-file[,c("IID","FID",keep_volumes)]


print("Print dimensions of phenotype file containing 83 volumes and complete cases")
dim(pheno_file)

missing_genotype<-fread("~/usr/anna/PhD/output/participants_with_phenotypic_but_no_genetic_data1000.txt",header=T,data.table=F)

print("Now removing participants with missing genotype")
pheno_file[pheno_file$eid %in% missing_genotype$eid,]<-NA
pheno_file <- pheno_file[complete.cases(pheno_file),]


write.table(pheno_file, "~/usr/anna/PhD/output/pheno_preparation/target_phenotypes_83volumes_gwas.txt",sep = "\t", row.names = FALSE, col.names = TRUE, quote=F)

Calculate phenotypic correlations

As this study aims to compare correlations of regional volumes between genetic and phenotypic measures, we calculate phenotypic correlations between all regional brain volumes considered in the genetic analyses. Additionally, we calculate correlations between age and each of the brain volumes. The phenotypes have been residualised for age and sex to match the phenotypes used in the GWAS.

## Phenotypic part of the analysis
# Get the brain volume variables and age variables
# keep only people who have been used in gwas

########################################################################
## read in data
########################################################################

# load dependencies
library(data.table)

# read in UKB file
setwd("/scratch/datasets/ukbiobank/ukb18177/phenotypes/")
file<-list.files(pattern="ukb46293.csv")
file<-fread(file,header=T,data.table=F)


########################################################################
## Identify the cortical and subcortical volume columns
########################################################################

setwd("~/usr/anna/PhD/scripts/pheno_preparation/")
ref<-fread("region_codes.txt",data.table=F)
print("Print head references file containing brain area codes")
head(ref)

# name columns according to their name in region_codes.txt
for(i in 1:nrow(ref)){
    number<-paste0(ref$No[i],"-2.0")
    region<-ref$Region[i]
    names(file)[grep(number,colnames(file))]<-region
}
 # save names of volumes in "keep_volumes"
keep_volumes<-names(file)[grepl("Right",names(file))|grepl("Left",names(file))|grepl("Brain_stem",names(file))]

# check that there are 83 volumes
print(length(keep_volumes))

########################################################################
## only keep columns of interest
########################################################################
# keep id column and volumes
keep<-append("eid",keep_volumes)

file<-file[,keep]

print("Print dimensions with only relevant columns. We expect 84.")
dim(file)

## only keep non-missing rows for less RAM
file <- file[rowSums(is.na(file)) != ncol(file)-1, ]

#######################################################################################################
# read in file containing age, sex
#######################################################################################################
setwd("/scratch/datasets/ukbiobank/ukb18177/phenotypes/")

age_file<-list.files(pattern="ukb37667.csv")
age_file<-fread(age_file,header=T,data.table=F)

print("Dimensions of age_file when loaded into R")
dim(age_file)
#######################################################################################################
# determine sex
#######################################################################################################
names(age_file)[grep("31.0",colnames(age_file))]<-"sex"



####################################################################################################
# AGE VARIABLE
#######################################################################################################
# keep only age relevant rows 
names(age_file)[grep("21003-2.0",colnames(age_file))]<-"age_neuroimaging"
names(age_file)[grep("52-0.0",colnames(age_file))]<-"birth_month"
names(age_file)[grep("53-2",colnames(age_file))]<-"date_imaging"
names(age_file)[grep("21003-0.0",colnames(age_file))]<-"age_assessment"
names(age_file)[grep("53-0.0",colnames(age_file))]<-"date_assessment"


length(age_file$age_neuroimaging)

summary(age_file$age_neuroimaging)

length(age_file$birth_month)

summary(age_file$birth_month)


# only keep columns of interest
age_file<-age_file[,c("eid","birth_month","date_imaging","age_neuroimaging","age_assessment","date_assessment","sex")]


# work out month in which participant attended imaging
age_file$attendance_month<-as.numeric(substr(age_file$date_imaging,6,7))


# difference between attendance and birth month
age_file$add_months<-(age_file$attendance_month)-(age_file$birth_month)

age_file$add_months<-ifelse(age_file$add_months<0,(12+age_file$add_months),ifelse(age_file$add_months==0,0,age_file$add_months))

# if any remaining fields are now above 11 or below 0, something must have gone wrong and we delete
age_file$add_months<-ifelse(age_file$add_months>11|age_file$add_months<0,NA,age_file$add_months)

# descriptives months to add
summary(age_file$add_months)


# add extra months to age in months
age_file$age_in_months<-(age_file$age_neuroimaging*12)+age_file$add_months

# whoever didn't indicate birth_month would have missing values, use age at neuoroimaging visit in months
age_file$age_in_months<-ifelse(is.na(age_file$age_in_months),age_file$age_neuroimaging*12,age_file$age_in_months)

# with this definition we have 8000 people without age (whoever is left with NA now had no data in both items age_at_assement and birth_month)
# try to rescue using data from first assessment and imput lag using Simons definition

age_file$date1 <- as.Date(age_file$date_assessment, format="%Y-%m-%d")
age_file$date3 <- as.Date(age_file$date_imaging, format="%Y-%m-%d")
age_file$lag1to3 <- as.vector(age_file$date3 - age_file$date1)
age_file$lag_in_years<-age_file$lag1to3/365.25
#hist(age_file$lag1to3)
# make a variable to add lag to age at assessment
age_file$age_inferred_assessment_months<-(age_file$age_assessment +age_file$lag_in_years)*12

#if there is a missing value: use age at assessment without lag inferred
age_file$age_inferred_assessment_months<-ifelse(is.na(age_file$age_inferred_assessment_months),(age_file$age_assessment*12),age_file$age_inferred_assessment_months)


age_file$age_in_months<-ifelse(is.na(age_file$age_in_months),age_file$age_inferred_assessment_months,age_file$age_in_months)

#descriptives for age_in_months including added months
print("age in months")
summary(age_file$age_in_months)


print("age in years")
summary(age_file$age_in_months/12)


# only keep age_in_months variable
age_file<-age_file[,c("eid","age_in_months")]

##################################################################################################################
## Merge brain volume variables with demographics
##################################################################################################################

# merge by eid with file
print("Dimensions age_file")
dim(age_file)
print("Dimensions file")
dim(file)
# only keep rows that are present in both, because age_file includes the ethnicity exclusions
file<-merge(file,age_file,by="eid",all.x=T)
print("Dimensions after merging file and age_file")
dim(file)

print("Number of non-missing values")
colSums(!is.na(file))


##################################################################################################################
## Get IDs of pariticpants that were used in the gwas
##################################################################################################################

print("Now reading in fam file")

fam<-fread("~/usr/anna/PhD/output/geno_qc/geneticQC_UKB_15042021__MAF0.01_GENO0.02_MIND0.02_CAUC1_UKBQC1_UNREL0.044_HWE0.00000001_SEX1.fam",header=F, data.table=F)


names(fam)[1]<-"eid"

print("Now merging fam file with trait file")
file<-merge(fam$eid, file, all.x=T)

if(nrow(file)!=36778){print("Wrong number of participants");break}

print("Done merging the files, and the resulting object match the expected number of participants")


##################################################################################################################
# calculate age correlations first
# before residualising volumes for age
##################################################################################################################

age_cor<-as.data.frame(sapply(file[,-which(names(file)%in%c("eid","age_in_months","sex"))],function(x) cor(x,file$age_in_months,use="pairwise.complete.obs",method="pearson")))
age_cor<-cbind(age_cor,rownames(age_cor))
rownames(age_cor)<-NULL
names(age_cor)<-c("age_cor","Region")
print("head age_cor table")
head(age_cor)

# print all correlations with full statistics
print("print age correlations")
print(sapply(file[,-which(names(file)%in%c("eid","age_in_months","sex"))],function(x) cor.test(x,file$age_in_months,method="pearson",na.action=na.omit)))



##################################################################################################################
# Residualise the brain volumes for age and sex
##################################################################################################################
print("Summary of volumetric measures before residualising for age and sex")
summary(file)

for(i in keep_volumes){

    # calculate residualised brain volumes
    adjusted_var <-residuals(lm(file[,i] ~ file$age_in_months + file$sex, data=file, na.action=na.exclude))
    
    # overwrite unresidualised volumes
    file[,i]<-adjusted_var
    }

print("Summary of volumetric measures after residualising for age and sex")
summary(file)

##################################################################################################################
## Calculate correlations among phenotypic measures
##################################################################################################################


cor_matrix<-cor(file[,-which(names(file)%in%c("eid","age_in_months"))],use="pairwise.complete.obs")
print("Dimensions of correlation matrix of 83 regions")
dim(cor_matrix) # should be 83 for all brain_areas

# confirm the brain areas included
print("Dimension names of cormatrix")
dimnames(cor_matrix)[1]

###### calculate standard errors for each estimate in the correlation matrix
se_matrix<-matrix(nrow=dim(cor_matrix)[1],ncol=dim(cor_matrix)[2])

for(i in 1:nrow(se_matrix)){
    for(j in 1:ncol(se_matrix)){
    
    se_matrix[i,j]<-sqrt((1-(cor_matrix[i,j])^2)/(nrow(file)-2))
}}


# 5. caclulate values and vectors
eigenvalues<-eigen(cor_matrix)$values 
eigenvectors<-eigen(cor_matrix)$vectors
    rownames(eigenvectors)<-dimnames(cor_matrix)[[2]]
    colnames(eigenvectors)<-dimnames(cor_matrix)[[2]]

# calculate explained variance
explained_variance<-eigenvalues/sum(eigenvalues)*100

# calculate eigenvalues and format them
stand_loadings<-eigenvectors%*%sqrt(diag(eigenvalues))
stand_loadings<-setDT(as.data.frame(stand_loadings), keep.rownames = TRUE)[,1:2]
names(stand_loadings)<-c("Regions","stand_loadings")
print("print head phenotypic standardised loadings")
head(stand_loadings)

save(list = c("cor_matrix","se_matrix","eigenvalues","eigenvectors","explained_variance","stand_loadings","age_cor"), file = "~/usr/anna/PhD/output/pheno_cor/pheno_decomposition.RData")

The resulting .RData file contains phenotypic correlations, eigenvalues and -vectors, standardised loadings and age correlation for each of the 83 brain volumes. This information will be used in subsequent analysis steps for comparison with genetic correlations.

By Anna Elisabeth Fürtjes

anna.furtjes@kcl.ac.uk