Introduction

Ravages was developed to simulate genetic data and to perform rare variant association tests (burden tests and the variance-component test SKAT) on more than two groups of individuals (Bocher et al., 2019, doi:10.1002/gepi.22210). Ravages relies on the package Gaston developed by Herve Perdry and Claire Dandine-Roulland. Most functions are written in C++ thanks to the packages Rcpp, RcppParallel and RcppEigen.
Functions of Ravages use bed.matrix objects to manipulate genetic data, as in the package Gaston (see the documentation of this package for more details).
In this vignette, we illustrate how to perform rare variant association tests on real data. A second vignette shows how to simulate genetic data and how to use these simulations for power calculations. To learn more about the options of each function, the reader is advised to look at the manual pages.

Example of analysis using LCT data

Below is an example of an association analysis, preceded by the data-filtering steps, using the dataset LCT available with the package Ravages. This dataset contains data from the 1000 Genomes project in the genomic region containing the lactase gene. In this example, we look for an association between rare variants and the European populations of 1000 Genomes. The population of each individual is available in the dataframe LCT.matrix.pop1000G. Details about each function are given right after this example.

#Import data in a bed matrix
x <- as.bed.matrix(x=LCT.matrix.bed, fam=LCT.matrix.fam, bim=LCT.snps)
#Add population
x@ped[,c("pop", "superpop")] <- LCT.matrix.pop1000G[,c("population", "super.population")]

#Select EUR superpopulation
x <- select.inds(x, superpop=="EUR")
x@ped$pop <- droplevels(x@ped$pop)

# Group variants within known genes by extending their positions
# 500bp upstream and downstream
x <- set.genomic.region(x, flank.width=500)
table(x@snps$genomic.region, useNA = "ifany")
## 
## R3HDM1  UBXN4    LCT   MCM6   DARS   <NA> 
##   2047   1207   1454   1149    924   1295
# Filter variants with maf in the entire sample lower than 1%
# And keep only genomic region with at least 10 SNPs
x1 <- filter.rare.variants(x, filter = "whole", maf.threshold = 0.01, min.nb.snps = 10)
table(x1@snps$genomic.region, useNA="ifany")
## 
## R3HDM1  UBXN4    LCT   MCM6   DARS 
##    268    172    208    163    136
# run burden test CAST, using the 1000 Genomes population as "outcome"
# Null model for CAST
x1.H0.burden <- NullObject.parameters(x1@ped$pop, ref.level = "CEU", 
                                      RVAT = "burden", pheno.type = "categorial")
burden(x1, NullObject = x1.H0.burden, burden = "CAST", cores = 1)
## Categorial phenotype
##             p.value is.err
## R3HDM1 1.300274e-04      0
## UBXN4  1.993354e-05      0
## LCT    4.489474e-08      0
## MCM6   3.142808e-08      0
## DARS   2.064661e-03      0
# run SKAT, using the 1000 Genomes population as "outcome"
# Null model for SKAT with only a few permutations
x1.H0.SKAT <- NullObject.parameters(x1@ped$pop, RVAT = "SKAT", pheno.type = "categorial")
SKAT(x1, x1.H0.SKAT, params.sampling=list(perm.target = 10, perm.max = 500))
## Categorial phenotype 
## permutations
##            stat      p.perm       p.chi2      p.value
## R3HDM1 6.445786 0.001996008 1.368024e-07 1.368024e-07
## UBXN4  2.136390 0.009980040 1.140079e-02 1.140079e-02
## LCT    3.623174 0.001996008 1.128958e-03 1.128958e-03
## MCM6   2.923228 0.007984032 5.771602e-03 5.771602e-03
## DARS   2.132967 0.061797753 8.544806e-02 6.179775e-02

Defining genomic regions

For rare variant association tests, the unit of analysis is not a single variant but a genomic region, typically a gene. The first step of the analysis is therefore to group variants into genomic regions. This can be done using the function set.genomic.region() and known gene positions. It works on a bed.matrix (see Gaston) and simply adds a column “genomic.region” to the slot x@snps containing the genomic region (a factor) assigned to each variant. Gene positions should be given to the argument regions as a dataframe containing the following columns: Chr, Start, End, Gene_Name. The dataframe must be sorted in genome order, as must the levels of regions$Gene_Name. By default, any variant falling outside the gene positions is not annotated. Gene positions can be extended to annotate more variants using the argument flank.width, the number of base pairs added upstream and downstream of each gene. If flank.width=Inf, each variant is assigned to the nearest gene. If two regions overlap, variants in the overlapping zone are attributed to both regions, separated by a comma.
The files genes.b37 and genes.b38, available in Ravages, contain gene positions from Ensembl for the GRCh37 (hg19) and GRCh38 (hg38) genome builds respectively, and can be passed to regions to define gene positions.
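For instance, the call at the beginning of this vignette could be written with an explicit regions argument; this is a sketch assuming, as for the LCT data, positions on the GRCh37 build:

#Same as set.genomic.region(x, flank.width=500) when genes.b37 is the default
x <- set.genomic.region(x, regions = genes.b37, flank.width = 500)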

Multiple genomic regions for a variant

If variants are assigned to multiple genomic regions, they can be split into a new bed.matrix using the function bed.matrix.split.genomic.region(). This function takes a bed.matrix as argument (x) and duplicates each variant assigned to multiple regions, with the regions separated by split.pattern. If changeID=TRUE, the id in the bed.matrix is changed to the format chr:pos:A1:A2:genomic.region to distinguish the duplicated variants.
An example is shown below:

#Example bed matrix with 4 variants
x.ex <- as.bed.matrix(x=matrix(0, ncol=4, nrow=10),
                      bim=data.frame(chr=1:4, id=paste("rs", 1:4, sep=""), 
                                     dist = rep(0,4), pos=c(150,150,200,250), 
                                     A1=rep("A", 4), A2=rep("T", 4)))

#Example genes dataframe
genes.ex <- data.frame(Chr=c(1,1,3,4), Start=c(10,110,190,220), End=c(170,180,250,260),
                       Gene_Name=factor(letters[1:4]))

#Attribute genomic regions
x.ex <- set.genomic.region(x.ex, regions = genes.ex)
x.ex@snps$genomic.region

#Split genomic regions
x.ex.split <- bed.matrix.split.genomic.region(x.ex, split.pattern = ",")
x.ex.split@snps$genomic.region

Rare variant definition

To perform rare variant analyses, it is also important to define what a rare variant is, in order to leave out common ones. The function filter.rare.variants() keeps only variants with a MAF (Minor Allele Frequency) below a given threshold, while leaving out monomorphic variants. This function takes and returns a bed.matrix, which can be filtered in three different ways depending on the argument filter: with filter = “whole”, variants are kept if their MAF computed on the entire sample is below maf.threshold; with filter = “controls”, the MAF is computed in the control group only; with filter = “any”, variants are kept if their MAF is below the threshold in at least one of the groups.

It is also possible to specify the minimum number of variants a genomic region must contain to be kept, using the parameter min.nb.snps.

For the filters “controls” and “any”, the group of each individual should be given as a factor to the argument group.
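As a sketch on the LCT data from above, assuming the “any” filter behaves as described (a variant is kept if rare in at least one population):

#Keep variants with a MAF below 1% in at least one population,
#giving the population of each individual as a factor to group
x.any <- filter.rare.variants(x, filter = "any", maf.threshold = 0.01,
                              min.nb.snps = 10, group = x@ped$pop)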

Rare variant association tests

We have implemented extensions of two burden tests (CAST and WSS) and of the variance-component test SKAT to test for association between a gene and more than two groups of individuals.
The general idea of burden tests is to compute a genetic score per individual and per genomic region, and to test whether it differs between the groups of individuals. To extend these tests to more than two groups of individuals, a non-ordinal multinomial regression is used. The independent variable in this regression is the genetic effect of the gene, represented by the genetic score. Covariates can be added to the model. In addition to the genetic scores CAST and WSS directly implemented in the package, the user can supply another genetic score for the regression.
The variance-component test SKAT looks at the dispersion of the genetic effects of rare variants. A geometrical interpretation of the test was used for its extension to more than two groups of individuals. Covariates can also be included in this model.

Genetic score for burden tests

We have implemented two functions to compute the CAST and WSS scores respectively. These functions return a matrix with one row per individual and one column per genomic region. They are called directly by the function burden() if these scores are used to perform the association tests. It is also possible to compute genetic scores in a genomic region based on a vector of per-variant weights using the function burden.weighted.matrix().

CAST

CAST is based on a binary score which takes the value 1 if an individual carries at least one rare variant in the considered genomic region, and 0 otherwise. A MAF threshold for the definition of a rare variant is therefore needed, given by the argument maf.threshold. This score can be computed using the function CAST() as shown here on the LCT data:

#Calculation of the genetic score with a maf threshold of 1%
CAST.score <- CAST(x = x1, genomic.region = x1@snps$genomic.region, maf.threshold = 0.01)
head(CAST.score)
##         R3HDM1 UBXN4 LCT MCM6 DARS
## HG00096      0     0   1    1    1
## HG00097      1     0   0    0    0
## HG00099      1     0   0    0    0
## HG00100      0     1   0    0    0
## HG00101      0     1   0    0    0
## HG00102      1     0   0    0    1

WSS

WSS (Weighted Sum Statistic) is based on a continuous score giving the highest weights to the rarest variants: \[ WSS_j = \sum_{i=1}^{R} I_{ij} \times w_{i}\] with \[w_{i} = \frac{1}{\sqrt{t_{i} \times q_{i} \times (1-q_{i})}}\] and \[q_{i} = \frac{n_{i} + 1}{2 t_{i}+1}\] where \(n_{i}\) is the total number of minor alleles genotyped for variant \(i\), \(t_{i}\) is the total number of alleles genotyped for variant \(i\), and \(I_{ij}\) is the number of minor alleles of variant \(i\) for the individual \(j\). In the original method, each variant is weighted according to its frequency in the control group. In our version of WSS, the weights depend on allele frequencies computed on the entire sample. The function WSS() can be used to compute the WSS score, as shown on the LCT data:

WSS.score <- WSS(x = x1, genomic.region = x1@snps$genomic.region)
head(WSS.score)
##            R3HDM1    UBXN4       LCT    MCM6     DARS
## HG00096 0.0000000 0.000000 0.8185268 1.26932 1.418436
## HG00097 0.8185268 0.000000 0.0000000 0.00000 0.000000
## HG00099 1.0019881 0.000000 0.0000000 0.00000 0.000000
## HG00100 0.0000000 1.001988 0.0000000 0.00000 0.000000
## HG00101 0.0000000 1.001988 0.0000000 0.00000 0.000000
## HG00102 1.0019881 0.000000 0.0000000 0.00000 1.001988
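
To make the weight formula concrete, here is a minimal helper, not the package implementation, where n.minor and n.total are assumed to be vectors of minor allele counts and of total numbers of genotyped alleles per variant:

#Illustrative computation of the WSS weights w_i from the formula above
wss.weights <- function(n.minor, n.total) {
  q <- (n.minor + 1) / (2 * n.total + 1) #smoothed minor allele frequency q_i
  1 / sqrt(n.total * q * (1 - q))        #rarest variants get the highest weights
}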

Other genetic scores

It is also possible to compute other genetic scores based on variant weights using the function burden.weighted.matrix(). The weights should be given to the argument weights as a vector with one entry per variant. The genetic score is then computed as: \[Score_j = \sum_{i=1}^{R} I_{ij} \times w_{i}\] with \(w_i\) the weight of variant \(i\) in weights, and \(I_{ij}\) the number of minor alleles of variant \(i\) for individual \(j\).
Here is an example corresponding to a genetic score with all the weights at 1, i.e. counting the number of minor alleles:

Sum.score <- burden.weighted.matrix(x = x1, weights = rep(1, ncol(x1)))
head(Sum.score)
##         R3HDM1 UBXN4 LCT MCM6 DARS
## HG00096      0     0   1    2    2
## HG00097      1     0   0    0    0
## HG00099      1     0   0    0    0
## HG00100      0     1   0    0    0
## HG00101      0     1   0    0    0
## HG00102      1     0   0    0    1

Regressions

We have extended CAST and WSS using non-ordinal multinomial regression models. Let us consider \(C\) groups of individuals including a group of controls (\(c=1\)) and \(C-1\) groups of cases with different sub-phenotypes of the disease. We can compute \(C-1\) probability ratios, one for each group of cases: \[\ln \frac{P(Y_{j}=c)}{P(Y_{j}=1)} = \beta_{0,c} + \beta_{G,c}X_{G} + \beta_{k1,c}K_{1}+\dots+\beta_{kl,c}K_{l}\] where \(Y_{j}\) corresponds to the phenotype of the individual \(j\) and \(K_{l}\) is the vector of the \(l\)th covariate with the corresponding coefficient \(\beta_{kl,c}\). The genetic effect is represented by \(X_{G}\) and corresponds to the genetic score CAST or WSS, with \(\beta_{G,c}\) the log-odds ratio associated with this burden score.
The p-value associated with the genetic effect is computed using a likelihood ratio test comparing this model to the same model without the genetic effect (the null hypothesis). If only two groups are compared, a classical logistic regression is performed.
This regression can be performed on a bed.matrix using the function burden(), which relies on the package mlogit. The parameters under the null model can be obtained using the function NullObject.parameters(). To generate this null model, the phenotype of each individual should be given as a factor to pheno, and the potential covariates to include in the model should be given as a matrix to the argument data (one row per individual and one column per covariate). If only a subset of the covariates from data are to be included in the model, an R formula with these covariates should be given to formula; otherwise, all the covariates are included. In addition, the reference group should be given to the argument ref.level, i.e. all odds ratios are computed in comparison to this group of individuals in the regression on the score. The choice of the reference group does not affect the p-value. As the parameters needed to run the association tests depend on the type of test performed (burden tests or SKAT) and on the type of phenotype (continuous or categorial), both should be given to NullObject.parameters() (arguments RVAT and pheno.type respectively).
NullObject.parameters() returns a list with the parameters to pass to burden(), including the log-likelihood computed under the null model and the covariate matrix determined from the arguments data and formula.
Once the null model has been computed, the function burden() can be used to perform the burden tests. To do so, the user needs to give the result from NullObject.parameters() to the argument NullObject, and to specify the genomic region associated with each variant (argument genomic.region).
The CAST or WSS genetic scores can be computed directly within the regression (burden=“CAST” or burden=“WSS”). The user can also use another genetic score in the regression, which has to be given to burden as a matrix with one row per individual and one column per genomic region. In this situation, no bed.matrix is needed, and the result from burden.weighted.matrix() can be used directly.
To speed up computations when the phenotype is categorial, calculations can be parallelised using the argument cores, set to 10 by default.
burden() returns the p-value associated with the regression for each genomic region. If there is a convergence problem with the regression, the function returns 1 in the column is.err. The odds ratio associated with each group of cases compared to the reference group (NullObject$ref.level), with its confidence interval at a given alpha threshold (argument alpha), can also be obtained with get.OR.value=TRUE.
An example of the p-value and OR calculation, with its 95% confidence interval, using WSS on the LCT data is shown below, with and without the inclusion of covariates. The outcome here corresponds to the population from 1000 Genomes.

#Null model
x1.H0 <- NullObject.parameters(x1@ped$pop, ref.level = "CEU", 
                               RVAT = "burden", pheno.type = "categorial")
# WSS 
burden(x = x1, NullObject = x1.H0, burden = "WSS", 
       alpha = 0.05, get.OR.value = TRUE, cores = 1)

#Sex + a simulated variable as covariates
sex <- x1@ped$sex
u <- runif(nrow(x1))
covar <- cbind(sex, u)
#Null model with the covariate "sex"
x1.H0.covar <- NullObject.parameters(x1@ped$pop, ref.level = "CEU",
                                     RVAT = "burden", pheno.type = "categorial",
                                     data = covar, formula = ~ sex)

#Regression with the covariate "sex" without OR values
#Using the score matrix WSS computed previously
burden(NullObject = x1.H0.covar, burden=WSS.score, cores = 1)

Finally, using Ravages, it is also possible to perform burden tests with a continuous phenotype by specifying pheno.type = “continuous” and by giving a numeric vector to pheno, as shown below:

#Random continuous phenotype
set.seed(1) ; pheno1 <- rnorm(nrow(x1))
#Null model
x1.H0.continuous <- NullObject.parameters(pheno1, RVAT = "burden", 
                                          pheno.type = "continuous")
#Test CAST 
burden(x1, NullObject = x1.H0.continuous, burden = "CAST")

SKAT

We also extended the variance-component test SKAT using a geometrical interpretation. Unlike the burden tests, no burden is computed in this test: the distribution of the genetic effects in the genomic region is compared to a null distribution. SKAT is based on a linear mixed model where the random effects correspond to the genetic effects.
Before running the SKAT association test, the function NullObject.parameters() first needs to be called, as for the burden tests, by specifying RVAT = “SKAT”. The group of each individual should be given to pheno, as a factor if a categorial phenotype is studied and as a numeric vector if a continuous phenotype is studied. As before, potential covariates should be given as a matrix to data; if only some of them are to be included, they should be given as an R formula to formula. This function computes the parameters needed by the SKAT() function, and its result should be given to SKAT() using the argument NullObject.
To compute the p-values, a chi-square approximation based on the moments of the statistic is used. The moments can either be estimated using a sampling procedure, or be computed analytically using the method from Liu et al. 2008. The chi-square approximation can be based on the first three moments (estimation.pvalue = “skewness”), or on moments 1, 2 and 4 to get a more precise estimation of the tail of the distribution (estimation.pvalue = “kurtosis”). With get.moments = “theoretical”, estimation.pvalue = “skewness” is equivalent to the “liu” method in the SKAT package, and estimation.pvalue = “kurtosis” corresponds to the “liu.mod” method in the SKAT package. If debug = TRUE, the moments of the statistic are returned in addition to the p-values.
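For example, reusing the null object x1.H0.SKAT computed at the beginning of this vignette, the kurtosis-based approximation on theoretical moments could be requested as follows (a sketch using the arguments described above):

#Theoretical moments with the kurtosis-based chi-square approximation
#(corresponding to the "liu.mod" method of the SKAT package)
SKAT(x1, x1.H0.SKAT, get.moments = "theoretical",
     estimation.pvalue = "kurtosis", cores = 1)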
If the sample size is lower than 2000, we recommend using the sampling procedure. If no covariates are present, a simple permutation procedure can be used (get.moments = “permutations”); otherwise, a bootstrap sampling should be used (get.moments = “bootstrap”). In both situations, a sequential procedure is used to compute the p-values: permuted statistics are computed one by one and each is compared to the observed statistic. The sampling procedure stops when either perm.target (the number of times a permuted statistic should be greater than the observed statistic) or perm.max (the maximum number of permutations to perform) is reached. P-values are then computed in two different ways: if perm.target is reached, the p-value is computed as perm.target divided by the number of permutations performed to reach this value; if perm.max is reached before perm.target (that is, for small p-values), p-values are computed using the chi-square approximation based on moments estimated from the permuted statistics. perm.target and perm.max should be given as a list to the argument params.sampling of SKAT().
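The sequential procedure can be sketched as follows. This illustrates the logic described above rather than the package's internal code: rstat() stands for a hypothetical function drawing one permuted statistic, and a simple normal approximation replaces the chi-square approximation to keep the sketch short.

#Sketch of the sequential p-value computation
sequential.p <- function(stat.obs, rstat, perm.target = 100, perm.max = 5e4) {
  hits <- 0
  stats <- numeric(perm.max)
  for (b in seq_len(perm.max)) {
    stats[b] <- rstat()
    if (stats[b] >= stat.obs) hits <- hits + 1
    #target reached: empirical p-value
    if (hits == perm.target) return(perm.target / b)
  }
  #perm.max reached first: approximate the tail from the moments of the
  #permuted statistics (Ravages uses a chi-square approximation here)
  pnorm(stat.obs, mean = mean(stats), sd = sd(stats), lower.tail = FALSE)
}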
If the sample size is larger than 2000, the analytical calculation from Liu et al. can be used to compute the theoretical moments. In this situation, it is possible to parallelise the calculations using the argument cores, set to 10 by default.
It is also possible to explicitly request a specific method to compute the moments using get.moments = “permutations”, “bootstrap” or “theoretical”. By default, get.moments = “size.based”, and the method depends on the sample size.
An example of the SKAT() function specifying the “permutations” or the “theoretical” method is shown below.

#Null model
x1.null <- NullObject.parameters(x1@ped$pop, RVAT = "SKAT", pheno.type = "categorial")
#Permutations because no covariates
SKAT(x1, x1.null, get.moments = "permutations", debug = TRUE,
     params.sampling = list(perm.target = 100, perm.max =5e4))
#Theoretical on 1 core
SKAT(x1, x1.null, get.moments = "theoretical", debug = TRUE, cores = 1)

It is also possible to perform the SKAT test on a continuous phenotype by specifying pheno.type = “continuous” in NullObject.parameters() before calling the SKAT() function:

#Random continuous phenotype
set.seed(1) ; pheno1 <- rnorm(nrow(x1))
#Null Model with covariates
x1.H0.c <- NullObject.parameters(pheno1, RVAT = "SKAT", pheno.type = "continuous",
                                 data = covar)
#Run SKAT
SKAT(x1, x1.H0.c)

Data management

Data in PLINK format or in VCF format can be loaded into R using the functions read.bed.matrix() and read.vcf() respectively, both from the package Gaston.
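For instance, assuming PLINK files mydata.bed/.bim/.fam and a VCF file mydata.vcf.gz (hypothetical paths):

#Load PLINK files (basename given without extension)
x.plink <- read.bed.matrix("mydata")
#Load a VCF file
x.vcf <- read.vcf("mydata.vcf.gz")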
If the data for the controls and the different groups of cases are in different files, they can be loaded separately and then combined using the function rbind() (whose method for bed.matrix objects is provided by Gaston), as long as the same variants are present in the different groups of individuals.
An example is given below where the data have been split according to the group of each individual and then combined into one bed.matrix:

#Selection of each group of individuals
CEU <- select.inds(x1, pop=="CEU")
CEU
## A bed.matrix with 99 individuals and 947 markers.
## snps stats are set
##   There are  748  monomorphic SNPs
## ped stats are set
FIN <- select.inds(x1, pop=="FIN")
FIN
## A bed.matrix with 99 individuals and 947 markers.
## snps stats are set
##   There are  771  monomorphic SNPs
## ped stats are set
GBR <- select.inds(x1, pop=="GBR")
GBR
## A bed.matrix with 91 individuals and 947 markers.
## snps stats are set
##   There are  792  monomorphic SNPs
## ped stats are set
#Combine in one file:
CEU.FIN.GBR <- rbind(CEU, FIN, GBR)
CEU.FIN.GBR
## A bed.matrix with 289 individuals and 947 markers.
## snps stats are set
##   There are  515  monomorphic SNPs
## ped stats are set