Introduction to QuanT

Unmeasured heterogeneity in microbiome data and QuanT

Unmeasured technical and biomedical heterogeneity in microbiome data can arise from differential processing and design. Uncorrected for, this can lead to spurious results.

Currently, no methods are designed to address unmeasured heterogeneity in microbiome studies. Techniques developed for genomic data, such as surrogate variable analysis (SVA) (Leek and Storey 2007) and remove unwanted variation (RUV) (Risso et al. 2014), may be applicable. SVA is a two-step procedure for identifying surrogate variables, a general framework that was adapted by subsequent methods, including ours here. RUV employs factor analysis on control genes (via spike-in experiment or housekeeping gene estimation) to identify hidden heterogeneities.

However, SVA and RUV may not be ideally suited for microbiome data. First, they do not address the inflated zeros and over-dispersion typical in microbiome data. Second, they detect the underlying structure in the mean expression (log-transformed if needed) while ignoring other critical aspects of the complex data. Third, RUV requires users to specify the number of hidden directions: incorrect specification of these parameters may result in false positives or negatives.

We propose the Quantile Thresholding (QuanT) approach, a comprehensive non-parametric hidden variable inference method that accommodates the complex distributions of microbial read counts and relative abundances.

The CRC dataset

We will use the CRC dataset as an example to show how to use the QuanT R package. We will also run SVA and RUV and compare them with QuanT based on the CRC dataset.

The CRC dataset originates from the curatedMetagenomicData package (Pasolli et al. 2017), which offers standardized, curated human microbiome data across various studies. We analyzed two studies (Feng et al. 2015; Thomas et al. 2019) focused on colorectal cancer (CRC), considering distinct patient phenotypes (adenocarcinoma, adenoma, advanced adenoma, carcinoma, and control) as the primary variable of interest, and age and gender as covariates for adjustment. The processed data encompassed 166 genera across 234 samples.

This dataset is included in the package and is lazy-loaded when loading the package. Let’s have a quick look at the data.

library(QuanT)

str(crc)
#> List of 7
#>  $ otu      : num [1:234, 1:166] 5617649 8455896 1180434 457179 580791 ...
#>  $ rela     : num [1:234, 1:166] 0.13736 0.12792 0.01942 0.00909 0.01118 ...
#>  $ groupid  : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ condition: Factor w/ 3 levels "adenoma","control",..: 3 2 2 1 2 2 2 1 3 2 ...
#>  $ age      : Factor w/ 2 levels "adult","senior": 1 2 1 2 2 2 2 2 2 2 ...
#>  $ subtype  : Factor w/ 5 levels "adenocarcinoma",..: 4 5 5 3 5 5 5 3 4 5 ...
#>  $ gender   : Factor w/ 2 levels "female","male": 2 2 1 2 2 2 2 2 2 1 ...

As you can see, the list crc includes the following components.

otu: the absolute abundance(counts) matrix
rela: the relative abundance matrix
groupid: the group(study) indicator
condition: a factor with three levels: adenoma, CRC, control
age: a factor with two levels: adult, senior
subtype: a factor with five levels: adenoma, carcinoma, adenocarcinoma, advancedadenoma, control
gender: a factor with two levels: male, female

Running QuanT

Now let’s run QuanT on this dataset by calling the QuanT function. QuanT has two arguments: The first is the relative abundance matrix. The second is a matrix containing all known variables, which are phenotype, age, and gender here. For now, we act as if the true group IDs are unknown, and check whether QuanT is able to identify this unmeasured heterogeneity.

covariates = model.matrix(~ subtype + age + gender, data=crc)[,-1]
qsv = QuanT(crc$rela, covariates)

Now we have constructed the QSVs and stored them into the qsv object. We can plot the first two QSVs and color the points based on the true group ID.

plot(qsv[,1], qsv[,2], col=crc$groupid, 
     cex.lab=0.5, cex.axis=0.5, cex=0.5)

The two groups are approximately seperated by the first two QSVs! This indicates that QuanT has succesfully identified the unmeasured heterogeneity.

Running SVA and RUV

Installation

We then run SVA and RUV on the CRC dataset and compare them with QuanT. First install SVA, RUV, and their dependencies if they haven’t been installed. We will also need the ggplot2 and corrplot packages for visualization.

if (!require("mgcv", quietly = TRUE)) {
 install.packages("mgcv")
}
if (!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!require("sva", quietly = TRUE)) {
  BiocManager::install("sva")
}
if (!require("RUVSeq", quietly = TRUE)) {
  BiocManager::install("RUVSeq")
}
if (!require("ggplot2", quietly = TRUE)) {
 install.packages("ggplot2")
}
if (!require("corrplot", quietly = TRUE)) {
  install.packages("corrplot")
}

library(mgcv)
library(sva)
library(RUVSeq)
library(ggplot2)
library(corrplot)

Running SVA

After installation, we can run SVA on the CRC dataset. sv are the surrogate variables(SVs) constructed by SVA.

mod = model.matrix(~ subtype + age + gender, data=crc)
mod0 = mod[,1]

svafit = svaseq(t(crc$otu),mod,mod0)
#> Number of significant surrogate variables is:  8 
#> Iteration (out of 5 ):1  2  3  4  5
sv = svafit$sv

There are eight SVs in total. We can plot the first two SVs and color the points based on the true group ID, just like what we have done for QuanT.

plot(sv[,1], sv[,2], col=crc$groupid, 
     cex.lab=0.5, cex.axis=0.5, cex=0.5)

The two groups are not well separated by the first two SVs.

Running RUV

We then run RUV on the CRC dataset. ruv are the factors constructed by RUV.

design <- model.matrix(~ subtype + age + gender, data=crc)
y <- DGEList(counts=t(crc$otu))
y <- calcNormFactors(y, method="TMM")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
empirical = which(p.adjust(lrt$table$PValue,"BH") > 0.05)
ruv <- RUVg(t(crc$otu), empirical, k=11)$W

Similar to what we have done for QuanT and SVA, we can plot the first two factors and color the points based on the true group ID.

plot(ruv[,1], ruv[,2], col=crc$groupid, 
     cex.lab=0.5, cex.axis=0.5, cex=0.5)

RUV also does a good job in separating the two groups based on the first two factors. However, the number of factors needs to be chosen arbitrarily, and an inappropriate number may impair the performance. The original paper chooses 6 factors for a dataset with 128 samples. Based on this choice, we use 11 factors for the CRC data which includes 234 samples.

Differential abundance analysis

We now conduct differential abundance analysis based on linear regression. The goal of this analysis is to identify the taxa associated with the phenotype while adjusting for the covariates and the hidden variables constructed by each method. For benchmarking, we also performed linear regression adjusted for the true group indicators (as the gold standard) and without any adjustment (as a baseline result).

get_pv = function(rela, surrogates) {
  if (is.null(surrogates)) {
    pv = apply(rela, 2,
               function(x) anova(lm(x~covariates),
                                 lm(x~covariates[,-(1:4)]))$'Pr(>F)'[2])
  } else {
    pv = apply(rela, 2,
               function(x)
                 anova(lm(x~covariates+surrogates),
                       lm(x~covariates[,-(1:4)]+surrogates))$'Pr(>F)'[2])
  }
  return(pv)
}

pv_no_correction = get_pv(crc$rela, NULL)
pv_group = get_pv(crc$rela, crc$groupid)
pv_quant = get_pv(crc$rela, qsv)
pv_sva = get_pv(crc$rela, sv)
pv_ruv = get_pv(crc$rela, ruv)

Now we have obtained the p-values from each method. We apply Benjamini-Hochberg procedure to the p-values and use a threshold of 0.05 to determine the significant taxa. Below are the indices of the significant taxa from each method.

which(p.adjust(pv_no_correction,"BH") <= 0.05)
#>  [1]   2   3   4   7   8   9  10  11  12  16  22  26  37  42  48  53  56  57  58
#> [20]  60  61  62  64  66  75  81  93  98 114 161
which(p.adjust(pv_group,"BH") <= 0.05)
#> [1]  60  61  64  81  93  98 103 151
which(p.adjust(pv_quant,"BH") <= 0.05)
#> [1] 60 61 93 98
which(p.adjust(pv_sva,"BH") <= 0.05)
#>  [1]   2   3   4   8   9  10  12  16  22  26  37  38  42  57  58  59  60  61  62
#> [20]  64  66  81  93  95  96  98 114 157 161
which(p.adjust(pv_ruv,"BH") <= 0.05)
#> [1] 61

As can be seen from the list above, using QuanT leads to 4 discoveries which are all true discoveries according to the gold standard. SVA leads to many more false discoveries. And RUV leads to only 1 true discovery. We can visualize the results by running the following code to generate a heatmap showing the number of overlapping discoveries between each pair of methods.

get_overlap = function(pv, method_name, alpha) {
  n_method = ncol(pv)
  discovery = apply(pv, 2, function(x) which(p.adjust(x,"BH")<=alpha))
  concordance = matrix(NA, nrow=n_method,ncol=n_method)
  for (i in 1:n_method) {
    for (j in i:n_method) {
      if (length(discovery) == 0) {
        concordance[i,j] = 0
      } else if ("list" %in% class(discovery)) {
        concordance[i,j] = length(intersect(discovery[[i]], discovery[[j]]))
      } else {
        concordance[i,j] = length(intersect(discovery[,i], discovery[,j]))
      }
      concordance[j,i] = concordance[i,j]
    }
  }
  rownames(concordance) = colnames(concordance) = method_name
  concordance
}
plot_overlap = function(pv, method_name, alpha=0.05) {
  concordance = get_overlap(pv, method_name, alpha)
  maxconcord = max(concordance)
  corrplot(concordance, 
           method = "color", 
           type = "upper",
           col.lim = c(0,maxconcord),
           # col = rev(COL2('RdYlBu',200)),
           col = COL1("YlGn",200),
           addCoef.col = "black",
           is.corr = FALSE,
           tl.cex = 0.4, number.cex = 0.4,
           tl.srt = 0,
           cl.pos = "n")
}
method_name = c("No Correction",
                "True Group",
                "SVA",
                "RUV",
                "QuanT")
plot_overlap(cbind(pv_no_correction, 
                   pv_group, 
                   pv_sva, 
                   pv_ruv, 
                   pv_quant),
             method_name)

Proportion of overlapping discoveries

Other than thresholding Benjamini-Hochberg adjusted p-values at 0.05, we can also sort the p-values and consider the top K significant taxa with a varying K. We can calculate the proportions of overlapping taxa in the top K detected differentially abundant taxa between each analysis and the gold standard analysis in the CRC dataset. Here, the method adjusting for true group or study indicator is considered as gold standard.

truerank = rank(pv_group, ties.method = "min")
get_overlap = function(pv) {
  overlap = rep(NA,20)
  for (k in 1:20) {
    dis = which(rank(pv, ties.method = "min") <= k)
    truetopK = which(truerank <= k)
    overlap[k] = mean(dis%in%truetopK)
  }
  return(overlap)
}
df = data.frame(overlap = c(get_overlap(pv_no_correction), 
                             get_overlap(pv_group), 
                             get_overlap(pv_sva), 
                             get_overlap(pv_ruv), 
                             get_overlap(pv_quant)),
                 method = rep(method_name,each=20),
                 K = rep(1:20,length(method_name)))
df$method = factor(df$method, method_name)
pic_overlap = ggplot(df, aes(x = K,
             y = overlap,
             color = method)) +
  geom_line(lwd = 0.6) +
  theme_bw() +
  scale_color_manual(values = c("#7FC97F","#BEAED4","#FDC086","#386CB0","#F0027F")) +
  labs(color = "Method",
       shape = "Method",
       linetype = "Method",
       x = "Top K Detected Taxa",
       y = "Proportion of overlapping Discoveries") +
    theme(legend.position="bottom",
          text = element_text(size = 6)) +
  guides(color = guide_legend(nrow = 2))
plot(pic_overlap)

As can be seen from the figure, QuanT is most consistent with the gold standard in terms of top K differential abundant taxa.

References

Feng, Qiang, Suisha Liang, Huijue Jia, Andreas Stadlmayr, Longqing Tang, Zhou Lan, Dongya Zhang, et al. 2015. “Gut Microbiome Development Along the Colorectal Adenoma–Carcinoma Sequence.” Nature Communications 6 (1): 6528.

Leek, Jeffrey T, and John D Storey. 2007. “Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis.” PLoS Genetics 3 (9): e161.

Pasolli, Edoardo, Lucas Schiffer, Paolo Manghi, Audrey Renson, Valerie Obenchain, Duy Tin Truong, Francesco Beghini, et al. 2017. “Accessible, Curated Metagenomic Data Through ExperimentHub.” Nature Methods 14 (11): 1023–24.

Risso, Davide, John Ngai, Terence P Speed, and Sandrine Dudoit. 2014. “Normalization of RNA-Seq Data Using Factor Analysis of Control Genes or Samples.” Nature Biotechnology 32 (9): 896–902.

Thomas, Andrew Maltez, Paolo Manghi, Francesco Asnicar, Edoardo Pasolli, Federica Armanini, Moreno Zolfo, Francesco Beghini, et al. 2019. “Metagenomic Analysis of Colorectal Cancer Datasets Identifies Cross-Cohort Microbial Diagnostic Signatures and a Link with Choline Degradation.” Nature Medicine 25 (4): 667–78.