Unmeasured technical and biomedical heterogeneity in microbiome data can arise from differential processing and design. Uncorrected for, this can lead to spurious results.
Currently, no methods are designed to address unmeasured heterogeneity in microbiome studies. Techniques developed for genomic data, such as surrogate variable analysis (SVA) (Leek and Storey 2007) and remove unwanted variation (RUV) (Risso et al. 2014), may be applicable. SVA is a two-step procedure for identifying surrogate variables, a general framework that was adapted by subsequent methods, including ours here. RUV employs factor analysis on control genes (via spike-in experiment or housekeeping gene estimation) to identify hidden heterogeneities.
However, SVA and RUV may not be ideally suited for microbiome data. First, they do not address the inflated zeros and over-dispersion typical in microbiome data. Second, they detect the underlying structure in the mean expression (log-transformed if needed) while ignoring other critical aspects of the complex data. Third, RUV requires users to specify the number of hidden directions: incorrect specification of these parameters may result in false positives or negatives.
We propose the Quantile Thresholding (QuanT) approach, a comprehensive non-parametric hidden variable inference method that accommodates the complex distributions of microbial read counts and relative abundances.
We will use the CRC dataset as an example to show how to use the
QuanT
R package. We will also run SVA and RUV and compare
them with QuanT based on the CRC dataset.
The CRC dataset originates from the curatedMetagenomicData package (Pasolli et al. 2017), which offers standardized, curated human microbiome data across various studies. We analyzed two studies (Feng et al. 2015; Thomas et al. 2019) focused on colorectal cancer (CRC), considering distinct patient phenotypes (adenocarcinoma, adenoma, advanced adenoma, carcinoma, and control) as the primary variable of interest, and age and gender as covariates for adjustment. The processed data encompassed 166 genera across 234 samples.
This dataset is included in the package and is lazy-loaded when loading the package. Let’s have a quick look at the data.
library(QuanT)
str(crc)
#> List of 7
#> $ otu : num [1:234, 1:166] 5617649 8455896 1180434 457179 580791 ...
#> $ rela : num [1:234, 1:166] 0.13736 0.12792 0.01942 0.00909 0.01118 ...
#> $ groupid : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#> $ condition: Factor w/ 3 levels "adenoma","control",..: 3 2 2 1 2 2 2 1 3 2 ...
#> $ age : Factor w/ 2 levels "adult","senior": 1 2 1 2 2 2 2 2 2 2 ...
#> $ subtype : Factor w/ 5 levels "adenocarcinoma",..: 4 5 5 3 5 5 5 3 4 5 ...
#> $ gender : Factor w/ 2 levels "female","male": 2 2 1 2 2 2 2 2 2 1 ...
As you can see, the list crc
includes the following
components.
Now let’s run QuanT on this dataset by calling the QuanT
function. QuanT
has two arguments: The first is the
relative abundance matrix. The second is a matrix containing all known
variables, which are phenotype, age, and gender here. For now, we act as
if the true group IDs are unknown, and check whether QuanT is able to
identify this unmeasured heterogeneity.
covariates = model.matrix(~ subtype + age + gender, data=crc)[,-1]
qsv = QuanT(crc$rela, covariates)
Now we have constructed the QSVs and stored them into the
qsv
object. We can plot the first two QSVs and color the
points based on the true group ID.
The two groups are approximately seperated by the first two QSVs! This indicates that QuanT has succesfully identified the unmeasured heterogeneity.
We then run SVA and RUV on the CRC dataset and compare them with
QuanT. First install SVA, RUV, and their dependencies if they haven’t
been installed. We will also need the ggplot2
and
corrplot
packages for visualization.
if (!require("mgcv", quietly = TRUE)) {
install.packages("mgcv")
}
if (!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
if (!require("sva", quietly = TRUE)) {
BiocManager::install("sva")
}
if (!require("RUVSeq", quietly = TRUE)) {
BiocManager::install("RUVSeq")
}
if (!require("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
if (!require("corrplot", quietly = TRUE)) {
install.packages("corrplot")
}
library(mgcv)
library(sva)
library(RUVSeq)
library(ggplot2)
library(corrplot)
After installation, we can run SVA on the CRC dataset.
sv
are the surrogate variables(SVs) constructed by SVA.
mod = model.matrix(~ subtype + age + gender, data=crc)
mod0 = mod[,1]
svafit = svaseq(t(crc$otu),mod,mod0)
#> Number of significant surrogate variables is: 8
#> Iteration (out of 5 ):1 2 3 4 5
sv = svafit$sv
There are eight SVs in total. We can plot the first two SVs and color the points based on the true group ID, just like what we have done for QuanT.
The two groups are not well separated by the first two SVs.
We then run RUV on the CRC dataset. ruv
are the factors
constructed by RUV.
design <- model.matrix(~ subtype + age + gender, data=crc)
y <- DGEList(counts=t(crc$otu))
y <- calcNormFactors(y, method="TMM")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
empirical = which(p.adjust(lrt$table$PValue,"BH") > 0.05)
ruv <- RUVg(t(crc$otu), empirical, k=11)$W
Similar to what we have done for QuanT and SVA, we can plot the first two factors and color the points based on the true group ID.
RUV also does a good job in separating the two groups based on the first two factors. However, the number of factors needs to be chosen arbitrarily, and an inappropriate number may impair the performance. The original paper chooses 6 factors for a dataset with 128 samples. Based on this choice, we use 11 factors for the CRC data which includes 234 samples.
We now conduct differential abundance analysis based on linear regression. The goal of this analysis is to identify the taxa associated with the phenotype while adjusting for the covariates and the hidden variables constructed by each method. For benchmarking, we also performed linear regression adjusted for the true group indicators (as the gold standard) and without any adjustment (as a baseline result).
get_pv = function(rela, surrogates) {
if (is.null(surrogates)) {
pv = apply(rela, 2,
function(x) anova(lm(x~covariates),
lm(x~covariates[,-(1:4)]))$'Pr(>F)'[2])
} else {
pv = apply(rela, 2,
function(x)
anova(lm(x~covariates+surrogates),
lm(x~covariates[,-(1:4)]+surrogates))$'Pr(>F)'[2])
}
return(pv)
}
pv_no_correction = get_pv(crc$rela, NULL)
pv_group = get_pv(crc$rela, crc$groupid)
pv_quant = get_pv(crc$rela, qsv)
pv_sva = get_pv(crc$rela, sv)
pv_ruv = get_pv(crc$rela, ruv)
Now we have obtained the p-values from each method. We apply Benjamini-Hochberg procedure to the p-values and use a threshold of 0.05 to determine the significant taxa. Below are the indices of the significant taxa from each method.
which(p.adjust(pv_no_correction,"BH") <= 0.05)
#> [1] 2 3 4 7 8 9 10 11 12 16 22 26 37 42 48 53 56 57 58
#> [20] 60 61 62 64 66 75 81 93 98 114 161
which(p.adjust(pv_group,"BH") <= 0.05)
#> [1] 60 61 64 81 93 98 103 151
which(p.adjust(pv_quant,"BH") <= 0.05)
#> [1] 60 61 93 98
which(p.adjust(pv_sva,"BH") <= 0.05)
#> [1] 2 3 4 8 9 10 12 16 22 26 37 38 42 57 58 59 60 61 62
#> [20] 64 66 81 93 95 96 98 114 157 161
which(p.adjust(pv_ruv,"BH") <= 0.05)
#> [1] 61
As can be seen from the list above, using QuanT leads to 4 discoveries which are all true discoveries according to the gold standard. SVA leads to many more false discoveries. And RUV leads to only 1 true discovery. We can visualize the results by running the following code to generate a heatmap showing the number of overlapping discoveries between each pair of methods.
get_overlap = function(pv, method_name, alpha) {
n_method = ncol(pv)
discovery = apply(pv, 2, function(x) which(p.adjust(x,"BH")<=alpha))
concordance = matrix(NA, nrow=n_method,ncol=n_method)
for (i in 1:n_method) {
for (j in i:n_method) {
if (length(discovery) == 0) {
concordance[i,j] = 0
} else if ("list" %in% class(discovery)) {
concordance[i,j] = length(intersect(discovery[[i]], discovery[[j]]))
} else {
concordance[i,j] = length(intersect(discovery[,i], discovery[,j]))
}
concordance[j,i] = concordance[i,j]
}
}
rownames(concordance) = colnames(concordance) = method_name
concordance
}
plot_overlap = function(pv, method_name, alpha=0.05) {
concordance = get_overlap(pv, method_name, alpha)
maxconcord = max(concordance)
corrplot(concordance,
method = "color",
type = "upper",
col.lim = c(0,maxconcord),
# col = rev(COL2('RdYlBu',200)),
col = COL1("YlGn",200),
addCoef.col = "black",
is.corr = FALSE,
tl.cex = 0.4, number.cex = 0.4,
tl.srt = 0,
cl.pos = "n")
}
method_name = c("No Correction",
"True Group",
"SVA",
"RUV",
"QuanT")
plot_overlap(cbind(pv_no_correction,
pv_group,
pv_sva,
pv_ruv,
pv_quant),
method_name)
Other than thresholding Benjamini-Hochberg adjusted p-values at 0.05, we can also sort the p-values and consider the top K significant taxa with a varying K. We can calculate the proportions of overlapping taxa in the top K detected differentially abundant taxa between each analysis and the gold standard analysis in the CRC dataset. Here, the method adjusting for true group or study indicator is considered as gold standard.
truerank = rank(pv_group, ties.method = "min")
get_overlap = function(pv) {
overlap = rep(NA,20)
for (k in 1:20) {
dis = which(rank(pv, ties.method = "min") <= k)
truetopK = which(truerank <= k)
overlap[k] = mean(dis%in%truetopK)
}
return(overlap)
}
df = data.frame(overlap = c(get_overlap(pv_no_correction),
get_overlap(pv_group),
get_overlap(pv_sva),
get_overlap(pv_ruv),
get_overlap(pv_quant)),
method = rep(method_name,each=20),
K = rep(1:20,length(method_name)))
df$method = factor(df$method, method_name)
pic_overlap = ggplot(df, aes(x = K,
y = overlap,
color = method)) +
geom_line(lwd = 0.6) +
theme_bw() +
scale_color_manual(values = c("#7FC97F","#BEAED4","#FDC086","#386CB0","#F0027F")) +
labs(color = "Method",
shape = "Method",
linetype = "Method",
x = "Top K Detected Taxa",
y = "Proportion of overlapping Discoveries") +
theme(legend.position="bottom",
text = element_text(size = 6)) +
guides(color = guide_legend(nrow = 2))
plot(pic_overlap)
As can be seen from the figure, QuanT is most consistent with the gold standard in terms of top K differential abundant taxa.