Revisit linear regression-based deconvolution methods for tumor gene expression data

B Li, JS Liu, XS Liu - Genome biology, 2017 - Springer
We have recently published a statistical deconvolution method to study infiltrating immune cells using tumor RNA-seq data [1]. One of the goals in that work was to understand how the proportions of different cell types covary across different cancer tissues. To this end, we estimated the abundance of six cell types in more than 9000 tumor samples across 23 cancer types, and then assessed the correlations of these estimated proportions across the samples within each cancer type. In particular, we compared our method (TIMER) with CIBERSORT [2], a previously published deconvolution approach, for their ability to assess such correlations. To our surprise, we found many non-biological negative correlations between CIBERSORT estimates, and we believed that this artifact was, to a large extent, due to the incorporation of highly similar features in the linear model, that is, statistical collinearity. Newman et al., the authors of CIBERSORT, have raised concerns that these correlations were due to data normalization rather than collinearity [3]. While we agree with Newman and coauthors that the forced normalization does introduce unwanted negative correlations, we will show in this response that the inclusion of highly similar features contributes as much as normalization, if not more, to the observed artificial negative correlations among the estimates obtained by CIBERSORT.

Highly correlated features (covariates) in linear regression models can lead to many technical difficulties, such as high estimation variance, non-robustness, and nonidentifiability. Furthermore, it is often misleading to interpret their coefficients at face value. For example, it is easy to construct cases in which, when only one of two highly similar features is included in a regression model, its coefficient is highly significant and positive, whereas when both are included, either neither coefficient is significant or one is significantly positive and the other is negative. This issue is a fundamental statistical problem caused by a lack of information, and it is unlikely to be solved simply by the regularization employed by the CIBERSORT method.
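The collinearity effect described above can be illustrated with a minimal sketch. It uses plain ordinary least squares on synthetic data, not CIBERSORT's support vector regression and not the LM22 matrix. The sample size, noise level, number of replicates, and seed are all assumptions chosen for illustration. The feature `x2` is a near copy of `x1`, and only `x1` drives the response.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (no intercept; inputs are zero-mean)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def simulate(n_rep=200, n=100, noise=0.02, seed=0):
    """Refit the single-feature and two-feature models on fresh synthetic data.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    single, joint = [], []
    for _ in range(n_rep):
        x1 = rng.normal(size=n)
        x2 = x1 + noise * rng.normal(size=n)  # highly similar (collinear) feature
        y = x1 + rng.normal(size=n)           # response depends on x1 only
        single.append(ols(x1[:, None], y)[0])
        joint.append(ols(np.column_stack([x1, x2]), y))
    return np.array(single), np.array(joint)

single, joint = simulate()

# Spread of the x1 coefficient with and without the near-duplicate feature.
sd_single = single.std()
sd_joint = joint[:, 0].std()

# Correlation between the two coefficient estimates across replicates.
coef_corr = np.corrcoef(joint[:, 0], joint[:, 1])[0, 1]

print(f"sd of x1 coefficient, x1 only:   {sd_single:.3f}")
print(f"sd of x1 coefficient, x1 and x2: {sd_joint:.3f}")
print(f"correlation of the two estimates: {coef_corr:.3f}")
```

Under these assumptions, the coefficient of `x1` is stable when it is the only feature, becomes far more variable once `x2` is added, and the two estimates are strongly negatively correlated across replicates. This is the same kind of artificial negative correlation discussed in the text, produced here without any normalization step.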
To evaluate how CIBERSORT estimates are affected by the incorporation of similar features, we conducted two in silico experiments. In the first, we selected two unrelated cell types, CD8 T cells and neutrophils, from the CIBERSORT feature set, the LM22 matrix. The Pearson correlation between the expression levels of the two cell types is 0.009. We generated 500 mixtures by randomly apportioning a population consisting of these two cell types only: Y = Yi, i = 1, 2, …, 500, where: