Predictive modeling from high-dimensional genomic data is often preceded by a dimension reduction step, such as principal component analysis (PCA). However, the application of PCA is not straightforward for data that comprise multiple, related high-dimensional sources.

Dimension reduction methods are invaluable for the analysis of genomics data and other high-dimensional ‘omics data. In particular, principal component analysis (PCA) and related methods reduce a large number of variables (eg, genes or genetic markers) to a small number of latent components that explain much of the variation in those variables. Principal component analysis can be used as an exploratory tool, but the principal components can also be used as predictors in a statistical model for an outcome. This latter use of PCA can solve issues of overfitting, identifiability, and collinearity that arise when a predictive model is fit using the original data. The use of principal components in linear regression and other predictive models has a long history in statistics^{1} and has more recently been used in clinical cancer research to predict patient outcomes such as survival or recurrence from high-dimensional molecular predictors.^{2–5} This article addresses the task of predicting an outcome from multiple sources of ‘omics data, representing different but related biological components. This scenario is increasingly encountered in cancer research and other fields. For example, we consider using multisource genomic data for glioblastoma multiforme (GBM) patients from The Cancer Genome Atlas^{6,7} to predict patient survival. We use 3 sources of data, capturing DNA methylation, microRNA (miRNA) expression, and gene (messenger RNA [mRNA]) expression. Long-term survival after GBM diagnosis is rare, with a median survival time of approximately 14 months with treatment.^{8} Current understanding of the molecular contribution to differences in survival outcomes is limited but suggests that miRNA, DNA methylation, and the regulation of gene expression all play a key role.^{9,10}

The application of classical dimension reduction techniques such as PCA for prediction from multisource genomic data is not straightforward. A sensible ad hoc approach is to perform a separate dimension reduction for each source and then combine the source-specific components in a predictive model; this approach is explored in the work by Zhao et al^{11} to predict survival from mRNA, miRNA, methylation, and copy number aberration data. However, components extracted from related sources may be collinear or redundant. At the other extreme, one could concatenate the data sources and use PCA or other dimension reduction approaches on the combined data (see, eg, consensus PCA^{12}) prior to predictive modeling. However, this approach lacks interpretability regarding the contribution of each data source and may not precisely capture relevant signals that are specific to a single data source.

There is a nascent literature in several computational domains on the integrative dimension reduction of multisource data.^{13–16} These methods identify lower-dimensional structure that is shared across multiple sources and structure that is specific to each source. In particular, Joint and Individual Variation Explained (JIVE)^{17,18} is a direct extension of PCA for multisource data. These methods have been used for exploratory analysis but not for predictive modeling. A recent survey of integrative analysis methods for cancer data identified many exploratory methods for multisource integration, and many predictive methods for a single source, but no methods for prediction from multisource data.^{19} In this article, we describe an approach in which JIVE components are used for predictive modeling, evaluating the properties and predictive accuracy of JIVE in a statistical modeling framework. Joint and Individual Variation Explained identifies a small number of joint latent components across all data sources, and individual components specific to each data source, that maximize the overall variation explained. Thus, redundancy and collinearity among the sources are accounted for in the joint components, and the individual components facilitate interpretation of the unique contribution of each data source in a predictive model.

The purpose of this article is 2-fold:

To illustrate the advantages of JIVE and multisource dimension reduction more broadly for predictive modeling of high-dimensional multisource data.

To introduce a method for the prediction of JIVE component scores for new samples that were not included in the original JIVE dimension reduction step, called jive.predict.

We also describe data preprocessing and postprocessing steps to improve predictive performance.

The rest of this article is organized as follows. In section “Methods,” we review the JIVE algorithm and explain its use in statistical modeling. In section “Simulations,” we discuss 2 simulations to illustrate and assess the efficacy of using JIVE in a predictive modeling framework. In section “GBM Data Application,” we describe an application to predict patient survival for the multisource GBM data, using JIVE components in a Cox proportional hazards model; we also discuss preprocessing and postprocessing steps and compare with separate and concatenated PCA approaches. In section “Post hoc prediction,” we introduce jive.predict and explore its theoretical properties and application to the GBM data. Section “Discussion” concludes with a discussion and summary.

Here, we briefly describe the mathematical form and notation for PCA, before describing JIVE as an extension to multisource data in section "Joint and Individual Variation Explained." Let X denote a p × n data matrix, in which p variables are measured on n samples and each row has been centered to have mean 0. A rank-r PCA approximation can be written as

X ≈ US,

where U (p × r) gives the loadings of the p variables on the r principal components and S (r × n) gives the component scores for each of the n samples.

Concatenated PCA describes the application of PCA to multiple sources of data, X_1 (p_1 × n), …, X_k (p_k × n), measured on a common set of n samples.

Each data source is standardized separately to have the same total sum of squares, to resolve scale differences between the sources. The sources are then concatenated together by the rows to form a single (p_1 + ⋯ + p_k) × n matrix X = [X_1; …; X_k], to which PCA is applied.
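As a concrete illustration, the standardization and concatenation steps above can be sketched in a few lines of numpy; the source dimensions and the choice of r = 3 components are arbitrary and purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy sources measured on the same n = 20 samples.
X1 = rng.normal(size=(50, 20))   # e.g., 50 methylation probes
X2 = rng.normal(size=(10, 20))   # e.g., 10 miRNAs

def standardize(X, target_ss=1.0):
    """Center rows, then scale the whole source to a common total sum of squares."""
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc * np.sqrt(target_ss / np.sum(Xc ** 2))

X1s, X2s = standardize(X1), standardize(X2)
# After scaling, each source contributes the same total variation.
assert np.isclose(np.sum(X1s ** 2), np.sum(X2s ** 2))

# Concatenate by rows (variables) and take the top r principal components.
X = np.vstack([X1s, X2s])                 # (50 + 10) x 20
U, d, Vt = np.linalg.svd(X, full_matrices=False)
r = 3
scores = (np.diag(d[:r]) @ Vt[:r]).T      # n x r component scores
print(scores.shape)                        # (20, 3)
```

The equal-sum-of-squares scaling prevents a source with many variables (or larger measurement scale) from dominating the leading components.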

Joint and Individual Variation Explained was originally developed as a data exploration technique for measuring patterns shared among and expressed within multiple data sources. Joint and Individual Variation Explained is an extension of PCA that allows for more than 1 high-dimensional data source. Specifically, JIVE serves as a compromise between concatenated PCA and individual PCA. For multisource data X_1, …, X_k measured on a common set of n samples, JIVE decomposes each source as

X_i = J_i + A_i + E_i, i = 1, …, k,

where J_i is the joint structure for source i (the concatenated joint structure [J_1; …; J_k] has low rank r, with component scores shared across all sources), A_i is the individual structure for source i (of low rank r_i, with scores specific to that source), and E_i is residual error. The joint and individual structures are constrained to be orthogonal, so the variation they explain does not overlap.

For more computational and theoretical details on the JIVE algorithm, and its use in an exploratory context, we refer to Lock et al^{17} and O’Connell and Lock.^{18}
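A toy sketch of the kind of structure JIVE targets may help fix ideas: each source is the sum of a joint part (scores shared across sources), an individual part (scores specific to that source), and noise. The rank-1 structures, the dimensions, and the explicit orthogonalization of the score vectors are illustrative assumptions, not the JIVE fitting algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30

# Shared (joint) score vector and source-specific score vectors,
# made orthogonal to mimic JIVE's non-overlap constraint.
s = rng.normal(size=n)
t1 = rng.normal(size=n); t1 -= s * (s @ t1) / (s @ s)
t2 = rng.normal(size=n); t2 -= s * (s @ t2) / (s @ s)

def make_source(p, s, t, noise=0.1):
    """Build X = J + A + E for one source with rank-1 joint and individual parts."""
    u = rng.normal(size=(p, 1))           # joint loadings
    w = rng.normal(size=(p, 1))           # individual loadings
    J = u @ s[None, :]                    # rank-1 joint structure
    A = w @ t[None, :]                    # rank-1 individual structure
    E = noise * rng.normal(size=(p, n))   # error
    return J, A, J + A + E

J1, A1, X1 = make_source(40, s, t1)
J2, A2, X2 = make_source(15, s, t2)

# Because the score vectors are orthogonal, joint and individual
# variation add exactly, with no overlap between the two structures.
ss = lambda M: np.sum(M ** 2)
assert np.isclose(ss(J1 + A1), ss(J1) + ss(A1))
```

The final assertion is the additivity of variation explained that the orthogonality constraint buys: the joint and individual parts partition the structured signal cleanly.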

Consider predicting an outcome variable

The joint component scores

For further interpretation, it helps to consider the relative contribution of each variable (eg, gene or miRNA) to the fitted model, and these can be obtained via the loadings for each component. For this purpose, we combine the loadings for the joint and individual components for a given data source,

where

In this section, we use JIVE scores for prediction in 2 illustrative simulations: one for modeling a binary outcome and another for a time-to-event outcome. These simulations highlight the strengths of JIVE: estimating structure that is shared across data sources and the signal that each data source provides on its own.

In this section, we consider a simulation design with a binary group structure analogous to that in Section 4.10.3 in the study by Lock.^{20} We will use this simulation to illustrate the use of JIVE scores to distinguish sample groups in an exploratory fashion and their use in a predictive model for group membership. First, we created 2 data sources,

Simulation design for binary outcome. Reproduced with permission from Lock.^{20}

For a concrete example of how JIVE can capture the information even when reducing to ranks of 1, we consider the plot of the joint and individual scores identified by JIVE in

Joint and Individual Variation Explained results for 1 joint rank and 1 individual rank for each source: ◊ indicates jointly distinguished, ∆ is distinguished in source 1 only, and + is distinguished in source 2 only; blue indicates Class −1, whereas red indicates Class +1.

We next use the joint, individual 1, and individual 2 scores as covariates to predict class membership, using a logistic model:

logit P(y_j = 1) = β_0 + β_1 s_joint,j + β_2 s_ind1,j + β_3 s_ind2,j,

where y_j indicates membership in Class +1 for sample j, and s_joint,j, s_ind1,j, and s_ind2,j are that sample's joint, individual 1, and individual 2 JIVE scores.

Logistic regression results using Joint and Individual Variation Explained scores.

| | Joint | Individual 1 | Individual 2 |
|---|---|---|---|
| Estimate | −0.261 | 0.434 | 0.395 |
| Standard error | 0.041 | 0.061 | 0.061 |
| z value | −6.378 | 7.084 | 6.516 |
| P value | <.0001 | <.0001 | <.0001 |

All the estimates were highly significant for discriminating Class +1 and Class −1 (P < .0001 for each component).

Fitted value and residuals plots are shown in

Diagnostics for logistic regression simulation.

We assess the accuracy of the JIVE prediction approach with 100 randomly generated data sets under the above simulation scenario. Within each simulation, we ran an N-fold (leave-one-out) cross-validation and obtained the predicted probability of the held-out sample being associated with Class +1. Then, we computed the mean absolute error (MAE) of these predictions for each simulation. To compare predictive accuracy, we compared the results for JIVE with the correct ranks given (joint rank = 1, individual ranks = (1,1)) to the resulting model fit using the components of a concatenated (joint) or separate (individual) PCA analyses. To capture all of the structure in the data, rank
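The leave-one-out scheme above can be sketched as follows. The 1-dimensional score, the hand-rolled gradient-ascent logistic fit, and the sample sizes are stand-ins for illustration, not the simulation design or software used in the article.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1-dimensional "JIVE score" that separates two classes imperfectly.
n = 40
y = np.repeat([0, 1], n // 2)
score = y + 0.5 * rng.normal(size=n)

def fit_logistic(x, y, iters=2000, lr=0.5):
    """Fit P(y = 1) = sigmoid(a + b*x) by gradient ascent (2 parameters only)."""
    a = b = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a + b * x)))
        a += lr * np.mean(y - p)
        b += lr * np.mean((y - p) * x)
    return a, b

# Leave-one-out cross-validation: refit without sample j,
# then predict the held-out sample's class probability.
preds = np.empty(n)
for j in range(n):
    keep = np.arange(n) != j
    a, b = fit_logistic(score[keep], y[keep])
    preds[j] = 1.0 / (1.0 + np.exp(-(a + b * score[j])))

mae = np.mean(np.abs(y - preds))
print(round(mae, 3))  # mean absolute error of the held-out predictions
```

Because each fold's model never sees the held-out sample, the resulting MAE is an honest estimate of out-of-sample accuracy.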

In this example, we consider using JIVE for predictive modeling of time-to-event data. Using the same data-generating scheme for the 2 sources, we generated a survival outcome and fit a Cox proportional hazards model^{21} using the JIVE scores as predictors:

λ(t) = λ_0(t) exp(β_1 s_joint + β_2 s_ind1 + β_3 s_ind2).

This expression gives the hazard rate λ(t) at time t as a function of the baseline hazard λ_0(t) and the joint, individual 1, and individual 2 scores for a given sample.

For this example, we observe strong and significant relationships between the hazard rate and all 3 components: joint, individual 1, and individual 2. To assess sensitivity of the results, we generated 100 data sets under this scheme using different random seeds.

Results from 100 simulations: Cox proportional hazards regression results using Joint and Individual Variation Explained scores.

| | Joint | Individual 1 | Individual 2 |
|---|---|---|---|
| Mean coefficient | 0.0297 | 0.0488 | 0.0483 |
| Standard error of coefficient | 0.001 | 0.001 | 0.001 |
| Mean z value | 4.295 | 5.346 | 5.358 |
| Standard error of z value | 0.1376 | 0.101 | 0.107 |
| Proportion of P values <.05 | .92 | .98 | 1.00 |

The results were generally robust to the random seed that generated the values for a given simulation. The standardized coefficients were generally similar across simulations, and all 3 components were significant (P < .05) in at least 92% of the simulations.^{21}

Similar to the aforementioned binary outcome simulation, we assessed the prediction accuracy with 100 simulations of the time-to-event outcome scenario. Within each simulation, we ran an N-fold cross-validation and obtained the correlation between the predicted median survival time and the observed survival time. The simulation mean and standard error of the correlation for JIVE, concatenated PCA, and individual PCAs were 0.3395 (0.00651), 0.2797 (0.00689), and 0.3073 (0.00655), respectively. The use of JIVE with the correct ranks for prediction gives optimal performance; however, the 2 methods with misallocated ranks still give reasonable predictive accuracy.

Altogether, the binary and time-to-event simulations illustrate JIVE's improvement over PCA, for both interpretation and predictive accuracy. Rank misspecification (as in the concatenated and individual PCAs for both of the aforementioned scenarios) decreases prediction accuracy but can still give reasonable estimates.

In this section, we analyze gene expression, miRNA, and DNA methylation data in relation to patient survival. We note that DNA methylation of specific genes, notably MGMT, is observed in around 50% of patients with GBM. The methylation status of MGMT is related to the survival of those patients who underwent radiation therapy. Across many studies of GBM, patient age has a recurring relationship with patient survival.^{9} One aim of this article is to assess whether JIVE applied to these molecular data sources can yield stronger predictions than those that rely primarily on age.

Here, we describe preprocessing steps for the 3 data sources. Gene expression data were measured using the Agilent G450 microarray, miRNA was measured using the Agilent 8 × 15 array, and batch effects were adjusted^{22} using a nonparametric empirical Bayesian approach.

Following preprocessing and batch adjustment, the data remained high dimensional, so we applied a supervised filtering step.^{2} Specifically, we ran a univariate Cox survival model for each variable in each data source against the survival outcome. We collected the P values and retained the variables that were significant at a 20% false discovery rate (FDR).^{23} The resulting data set count for the 20% FDR preprocessing step was as follows: 2700 variables for DNA methylation, 40 variables for miRNA, and 2085 variables for gene expression, for 4825 variables in total.
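As one concrete, hypothetical implementation of such a screen, the Benjamini-Hochberg step-up procedure selects variables at a target FDR from a vector of univariate P values; the article's cited method^{23} may differ in detail, so treat this as an illustrative sketch.

```python
import numpy as np

def bh_select(pvals, fdr=0.20):
    """Benjamini-Hochberg step-up: return a boolean mask of variables kept
    at the given false discovery rate."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresh = fdr * np.arange(1, m + 1) / m     # i * q / m for i = 1..m
    below = p[order] <= thresh
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest i with p_(i) <= i*q/m
        keep[order[: k + 1]] = True            # keep all smaller P values too
    return keep

rng = np.random.default_rng(3)
# 900 null P values plus 100 strongly significant ones (simulated).
pvals = np.concatenate([rng.uniform(size=900), rng.uniform(0, 1e-4, size=100)])
kept = bh_select(pvals, fdr=0.20)
print(kept.sum())  # number of variables passing the 20% FDR screen
```

A 20% FDR is deliberately liberal for a preprocessing step: the goal is to discard clearly uninformative variables while letting the downstream dimension reduction and model selection handle the rest.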

Following the preprocessing of the 3 data sources, we applied JIVE using R and the package r.jive.^{18} Joint and Individual Variation Explained estimated the joint rank to be 1 and the 3 individual ranks to be (13, 9, 36), respectively. One may view these ranks as individual columns of

Joint and Individual Variation Explained heatmaps for the preprocessed methylation, microRNA, and gene expression data (ranks: joint = 1, individual = (13, 9, 36)); red indicates positive signal, and blue indicates negative signal.

Joint and Individual Variation Explained scores for the processed data, colored by clinical subtype (blank, classical, G-CIMP, mesenchymal, neural, proneural). MiRNA indicates microRNA.

In addition to the JIVE scores, we also included age and sex into the Cox model as potential predictors. Thus, the full model with all variables included was as follows:

where

We applied a backward/forward selection process^{24} based on the Akaike information criterion (AIC)^{25} to select the most predictive subset of JIVE components and other variables in the final model. Then, we ran the model through a series of permutation tests to assess whether the JIVE results explain any of the variability in the survival times, as well as whether they are informative beyond age alone.
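The logic of such a permutation test can be sketched as follows: permuting the outcome breaks any real link between the component scores and survival, so refitting on permuted outcomes yields a null distribution for the fit statistic. Here an ordinary linear model R² stands in for the Cox model fit used in the article, and all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 100, 5
S = rng.normal(size=(n, r))                   # component scores (e.g., JIVE scores)
y = S[:, 0] + 0.5 * rng.normal(size=n)        # outcome driven by the first component

def r_squared(S, y):
    """R-squared of a least-squares fit of y on the columns of S (with intercept)."""
    X = np.column_stack([np.ones(len(y)), S])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

obs = r_squared(S, y)
# Null distribution: break the score-outcome link by permuting the outcome.
null = np.array([r_squared(S, rng.permutation(y)) for _ in range(500)])
p_perm = (1 + np.sum(null >= obs)) / (1 + len(null))
print(p_perm < 0.05)  # True: the components explain real variation
```

Testing "informative beyond age" follows the same pattern, except age is kept aligned with the outcome while only the component scores (or the outcome residual on age) are permuted.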

We used the 3 objective resampling assessment strategies described above (permutation testing, permutation testing in addition to age, and cross-validated prediction accuracy) on the concatenated PCA, the combined individual PCAs, and the 3 separate individual PCAs, to compare with the predictive model obtained through JIVE. The number of components for the PCA analyses was chosen to be consistent with the estimated joint rank and individual ranks from JIVE. That is, the number of components for concatenated PCA was given by summing all joint and individual ranks. The estimated ranks for the individual PCAs were

Comparing modeling results across dimension-reducing strategies.

| Method | AIC value | Perm. age, P | Corr. (log(·)) | 95% CI |
|---|---|---|---|---|
| JIVE | 1887.394 | <.0001 | 0.637 | 0.55–0.71 |
| Concatenated PCA | 1869.617 | <.0001 | 0.507 | 0.402–0.599 |
| Individual PCA | 1881.127 | <.0001 | 0.289 | 0.163–0.407 |
| Methylation PCA | 1920.857 | <.0001 | 0.468 | 0.358–0.565 |
| MiRNA PCA | 1955.293 | .004 | 0.364 | 0.244–0.474 |
| Gene expression PCA | 1938.709 | .004 | 0.426 | 0.311–0.529 |

Abbreviations: AIC, Akaike information criterion; PCA, principal component analysis.

The permutation

Age-outcome permutation results using Joint and Individual Variation Explained,

N-fold cross-validation results on the

In this section, we consider prediction from a new set of multisource patient data,

This problem frames the JIVE algorithm in a new light: given a new data

where

where

The algorithm is as follows:

Initiate

We then minimize the error by iteratively estimating

Repeat until convergence.

Using this approach, the original JIVE decomposition and the scores for the new samples are both estimated via minimizing the sum of squared residuals. In fact, if the optional orthogonality between individual structures is not enforced in the original JIVE decomposition, the jive.predict scores are exactly the same as the JIVE scores for the same data. That is, if
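A minimal numpy sketch of this score-estimation idea: holding the training loadings fixed, the joint and individual scores for a new sample are updated by alternating least squares until the squared residual stops improving. The dimensions, the orthonormal random loadings, and the 10-iteration cap are illustrative assumptions, not the r.jive implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
p1, p2, r, r1, r2 = 30, 20, 2, 1, 1

# Hypothetical loadings from a JIVE fit on training data (orthonormal columns).
U = np.linalg.qr(rng.normal(size=(p1 + p2, r)))[0]   # joint loadings, stacked
W1 = np.linalg.qr(rng.normal(size=(p1, r1)))[0]      # individual loadings, source 1
W2 = np.linalg.qr(rng.normal(size=(p2, r2)))[0]      # individual loadings, source 2

# A new sample generated from the model: x = U s + [W1 s1; W2 s2] + noise.
s_true = rng.normal(size=r)
s1_true, s2_true = rng.normal(size=r1), rng.normal(size=r2)
x = (U @ s_true
     + np.concatenate([W1 @ s1_true, W2 @ s2_true])
     + 0.01 * rng.normal(size=p1 + p2))

# Alternating least-squares updates of joint and individual scores.
s = np.zeros(r); s1 = np.zeros(r1); s2 = np.zeros(r2)
for _ in range(10):
    resid = x - np.concatenate([W1 @ s1, W2 @ s2])
    s, *_ = np.linalg.lstsq(U, resid, rcond=None)    # joint scores given individual
    joint = U @ s
    s1, *_ = np.linalg.lstsq(W1, x[:p1] - joint[:p1], rcond=None)
    s2, *_ = np.linalg.lstsq(W2, x[p1:] - joint[p1:], rcond=None)

resid_norm = np.linalg.norm(x - (U @ s + np.concatenate([W1 @ s1, W2 @ s2])))
print(round(float(resid_norm), 3))  # small: comparable to the added noise
```

Each update solves a least-squares subproblem, so the overall sum of squared residuals decreases monotonically, which is why so few iterations suffice in practice.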

In this section, we illustrate jive.predict using the GBM survival data from section “GBM data application.”

We replicated the scenario of new patient prediction with a 5-fold cross-validation. For each fold, 20% of the patients were placed in the test data set, whereas the remaining 80% were used as the training data set. Each patient was used in the test data once. The training data set underwent a JIVE decomposition, and the resulting scores were used as covariates in a Cox proportional hazards model with age and sex. The decomposition was followed by a backward/forward variable selection process. The test data were not included in the original JIVE analysis or modeling, but their sample scores were predicted using jive.predict and the JIVE loadings from the training data set. To evaluate whether orthogonality between the individual scores increases model accuracy, we also considered a second case in which jive.predict was applied to the training data itself, using the JIVE results for that training data. Backward/forward variable selection was implemented after the fit of the Cox proportional hazards model using the scores as covariates.

Predicted vs true median survival plots with individual orthogonality enforced are shown in

Fivefold cross-validation: test scores estimated using jive.predict, training model fit using Joint and Individual Variation Explained only.

Fivefold cross-validation: test scores estimated using jive.predict, training model fit using jive.predict.

The correlations did not differ substantially between the 2 orthogonality cases. With orthogonality enforced between the individual scores, the correlation was 0.3966 (95% CI: 0.277-0.504,

An important consideration when using jive.predict is consistency in the centering and scaling approaches used for the test data
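The point about consistent centering and scaling can be made concrete: parameters estimated on the training source are reused verbatim for new samples, never re-estimated on the test data. Dimensions and distributions below are arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 25))   # 100 variables x 25 samples
test = rng.normal(loc=5.0, scale=2.0, size=(100, 5))     # 5 new samples

# Fit centering and scaling parameters on the training source only.
center = train.mean(axis=1, keepdims=True)               # per-variable means
scale = np.sqrt(np.sum((train - center) ** 2))           # total-sum-of-squares scale

train_std = (train - center) / scale
test_std = (test - center) / scale   # reuse training parameters; never refit on test

# The training source is standardized to unit total sum of squares;
# the test data land on the same footing, though not exactly unit scale.
assert np.isclose(np.sum(train_std ** 2), 1.0)
print(test_std.shape)  # (100, 5)
```

Refitting the centering or scale on the test samples would shift them relative to the training loadings, and the jive.predict scores (and any downstream survival predictions) would no longer be comparable.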

Joint and Individual Variation Explained addresses 2 problems: handling multiple high-dimensional data sources and dimension reduction under an assumption of low-rank structure. In this article, we have explored the use of this method for statistical modeling and prediction. For GBM and other cancers with complex genetic factors, JIVE is a strong exploratory tool, as well as a strong predictive tool, for these challenging high-throughput data scenarios.

One key feature of JIVE is that the estimated joint and individual structures are orthogonal to each other, meaning that there is no overlap between the estimated shared patterns and the estimated source-specific patterns. In contrast, concatenated PCA cannot distinguish what is truly joint from what is truly individual across the data sources. Also, using separate individual PCAs for each data source may retain, within each source, redundant patterns that could have been estimated as joint structure. Joint and Individual Variation Explained avoids the weaknesses of both of these strategies and leads to accurate predictive results from multiple high-dimensional data sources.

In this article, we have considered JIVE as an extension of PCA for prediction from multiple sources of high-dimensional data. This general approach may also be extended to other data reduction techniques (eg, autoencoders^{26} and non-negative matrix factorization^{16}). Also, in this article, we have focused on the 2-step process of data reduction followed by predictive modeling. An advantage of such an approach is the flexibility to use a variety of classical statistical techniques (eg, logistic regression, Cox proportional hazards modeling) after the initial dimension reduction step. However, alternative single-step prediction approaches to assess the joint and individual predictive power of multisource high-dimensional data are an interesting direction of future work.

In this article, the JIVE ranks were estimated by permutation tests. Alternative approaches that are directly related to the predictive model, such as selecting the ranks via Bayesian information criterion or to give optimal predictive performance under cross-validation, may also be used. Alternative penalization strategies, such as an

Jive.predict is a method for incorporating new patients into an existing JIVE analysis of multisource data. The speed of the algorithm is a notable advantage: it converges in fewer than 10 iterations. Instead of running a whole new JIVE analysis on the combined data for the new and old subjects, one can bring in the raw data and obtain results in under 10 minutes. An additional advantage of jive.predict is that the original JIVE model need not be refit, which is conceptually appealing: a well-validated prediction model does not need to change as new data arrive. Once the algorithm finishes, the investigator obtains the scores for the new patients and can predict their outcomes, such as median survival time, probability of an event, or odds of being in a high-risk disease group.

Let

Then, it follows that

where

Therefore, these 2 products are linearly independent of each other due to

An analogous argument can be used to show the equivalence of the scores for

The authors would like to thank The Cancer Genome Atlas Research Network for providing data. They would also like to thank Michael O’Connell for his assistance with the R.JIVE package and Dr Xianghua Luo for fruitful discussions on survival modeling with high-dimensional covariates.

Three peer reviewers contributed to the peer review report. Reviewers’ reports totaled 604 words, excluding any confidential comments to the academic editor.

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Institutes of Health (grant ULI RR033183/KL2 RR0333182).

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

EFL conceived and designed JIVE and applications of JIVE for statistical modeling. AK analyzed the data and conceived and designed jive.predict and applications of jive.predict.