Robust evaluation of deep learning-based representation methods for survival and gene essentiality prediction on bulk RNA-seq data

In this section, we describe the different elements of the benchmarking pipeline depicted in Supplementary Fig. S1. Additional details can be found in Supplementary Materials. Although our pipeline can be used to evaluate any RNA-seq-based machine learning model, in this study we focused on benchmarking the capacity of various representation models to capture relevant information in bulk RNA-seq data.

Datasets

This study was based on publicly available data from The Cancer Genome Atlas Program (TCGA) and the Cancer Cell Line Encyclopedia (CCLE) dataset47. Expression data (raw counts, RPKM, TPM) for all TCGA datasets were downloaded from RECOUNT348, a publicly available resource concatenating multiple omics datasets with harmonized preprocessing steps to make comparisons between datasets easier. For the per-cohort OS prediction task, we focused on the cohorts with less than 90% of patients censored and selected the indications with more than 400 patients with associated RNA-seq samples (BRCA, UCEC, KIRC, HNSC, LUAD, LGG, LUSC, SKCM, COAD, STAD and BLCA). For the pan-cancer task, we selected the 33 cohorts of TCGA, for a total of 10,736 samples. For both OS prediction tasks, we used the OS labels defined by the TCGA Pan-Cancer Clinical Data Resource9 (TCGA-CDR).

The gene essentiality prediction experiments were carried out using the CCLE dataset, which contains the expression profiles of hundreds of human cancer cell lines. The dataset was obtained through the DepMap portal, version 22Q447, which provided the gene perturbation experiment results used as labels. Expression data (raw and TPM pseudo-counts) and gene signature files were downloaded directly from the platform, and non-transformed TPM data was obtained by reversing the transformation applied to the pseudo-counts file. We used the same selection process as previous studies3 to define gene dependencies of interest (DepOIs) and filtered out cell lines with missing dependencies. This resulted in 1223 DepOIs and 893 cell lines for our experiments, for a total of 1,092,139 combinations with an associated gene essentiality score. All datasets used for this study are available for download in the GitHub repository.

Tasks description

The different representation architectures were evaluated through the performance obtained when predicting patient or cell line labels using the outputs of baseline or DRL representation models as input features. These downstream tasks were evaluated on several test sets sampled from the datasets following a repeated holdout cross-validation process (see the related section below for more details).

The first task we considered was cancer-specific OS prediction, a common task in cancer biology that can improve the identification of subgroups of patients at risk. In this task, representation and prediction models were trained on single cancer cohorts separately to predict the survival of patients within a particular cancer subtype. Our second task was pan-cancer OS prediction, often considered in the literature as a standard for comparing methods and claiming state-of-the-art performance of newly developed machine learning models2. The models are trained on all the cohorts of TCGA combined into a single dataset, evaluating their ability to predict survival times in a more heterogeneous population spanning different indications. Harrell's C-index49 was used as the final metric for both tasks and is referred to as c-index in the paper.
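As an illustration of this metric, the snippet below is a minimal sketch of computing Harrell's c-index on a held-out split with the lifelines package; the durations, event indicators and risk scores are simulated placeholders rather than outputs of our pipeline.

```python
# Minimal sketch of the evaluation metric: Harrell's c-index on a held-out test set.
# The arrays below are simulated stand-ins for real survival labels and for the
# risk scores produced by a fitted survival prediction head (higher = higher risk).
import numpy as np
from lifelines.utils import concordance_index

rng = np.random.default_rng(0)
durations = rng.exponential(scale=365.0, size=200)   # observed times (days)
events = rng.integers(0, 2, size=200)                # 1 = event observed, 0 = censored
risk_scores = rng.normal(size=200)                   # model-predicted risk

# lifelines expects scores that increase with survival time, so the risk is negated.
c_index = concordance_index(durations, -risk_scores, event_observed=events)
print(f"c-index: {c_index:.3f}")
```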
The different data splits were stratified to ensure similar censorship levels between the train and test sets.

The gene essentiality prediction task described in this paper was inspired by the DeepDEP framework3, with the modifications of predicting gene dependency instead of gene effect, as recommended in previous work3,50, and of focusing only on bulk RNA-seq rather than a multi-modal setting. In this context, gene dependency refers to the extent to which a particular gene is necessary for cell survival and growth, which depends on the cell line considered. Each cell line is represented by its expression data, while DepOIs are represented by vectors called fingerprints. Fingerprints are specific encodings that summarize the relevant biological functions of a given gene by representing its involvement in 3115 gene signatures related to chemical and genetic perturbations defined in MSigDB v6.251. We focused only on the cell line representation based on RNA-seq data in our benchmark, while the fingerprints were fixed and reduced to 500 dimensions using PCA, a choice made to balance computation speed, performance and explained variance (Supplementary Fig. S8). The final dataset for the downstream prediction models is the Cartesian product of the cell line representation dataset and the DepOI fingerprint dataset: for each cell line / gene combination with a gene dependency label, we created the corresponding features by concatenating the cell line representation and the fingerprint representation.

To evaluate model performance, we used the Spearman rank correlation. While Pearson correlation coefficients were computed in the original DeepDEP paper, both methods are suited to evaluate this task52,53. The Spearman correlation was preferred as it is more robust to outliers on non-normally distributed data54 and better fits the framework of ranking genes for target selection. The correlation was computed on the whole test set and is referred to as the overall correlation. Following DeepDEP's evaluation, we also computed a per-DepOI metric: for each gene with available experiments, we calculated the Spearman rank correlation over the different cell lines and averaged the result over all genes to obtain one final metric per repetition. We partitioned the data by cell lines, ensuring that no cell line overlapped between the training and test sets.

Preprocessing datasets and gene selection

The RNA-seq data was preprocessed following a classical bioinformatics pipeline. While different preprocessing options were tested (data and results not shown), as they can impact downstream analyses55, we settled on a fixed standard configuration to keep a reasonable computational budget and compare models all else being equal. For all the experiments without pretraining on external datasets, we selected the 5000 most variable genes on the training sets, applied a logarithmic transformation and normalized the data with mean-standard scaling. The exact same transformation was then applied to the test dataset, preventing any potential leakage that would arise from computing feature statistics on the whole dataset. All models used TPM-normalized expression data except the variational auto-encoder model, which used raw counts to fit a negative binomial prior distribution (cf. Methods, Representation Models).
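The snippet below gives a minimal sketch of this preprocessing for one train/test split, assuming TPM matrices X_train and X_test of shape (samples, genes); the function and variable names are illustrative rather than taken from our codebase.

```python
# Hedged sketch of the standard preprocessing: top-5000 most variable genes
# selected on the training set, log transform, then mean-standard scaling whose
# statistics are fitted on the training set only to avoid leakage.
import numpy as np
from sklearn.preprocessing import StandardScaler

def preprocess(X_train, X_test, n_genes=5000):
    # Rank genes by variance using the training set only and keep the top n_genes.
    top = np.argsort(X_train.var(axis=0))[::-1][:n_genes]
    X_tr, X_te = np.log1p(X_train[:, top]), np.log1p(X_test[:, top])
    # Fit the scaler on the training set and reuse its statistics on the test set.
    scaler = StandardScaler().fit(X_tr)
    return scaler.transform(X_tr), scaler.transform(X_te)
```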
When considering pretraining, we implemented a different preprocessing procedure, closer to recent pretraining strategies for foundation models (FM) trained on single-cell expression data (scRNA-seq)26,27,28. While these models can afford to perform little feature selection during pretraining thanks to the vast amount of available data, pretraining on size-limited bulk RNA-seq datasets still requires gene selection to avoid the curse of dimensionality. However, the most variable genes in the pretraining dataset can differ from those of the datasets used to evaluate the downstream tasks with non-pretrained models. To select genes that remained relevant for the tasks considered in this paper, we built two gene lists based on gene variances in the TCGA pan-cancer dataset used for pretraining:

For the per-cohort OS prediction task, the previous procedure had to be adapted to ensure a fair comparison with the non-pretrained case. We therefore took the union of the top 2,000 most variable genes of each of the 11 cohorts selected for the downstream task, so that genes relevant to each indication were included in the final features, rather than only genes linked to cancer type, which would dominate when ranking variances across the whole TCGA dataset. This resulted in a set of 5046 unique gene identifiers (see the sketch below).

For the gene essentiality task, we took the top 5000 most variable genes in TCGA after intersecting with the genes present in the CCLE dataset, similarly to DeepDEP's procedure of selecting genes with a standard deviation greater than 1 in TCGA.
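The following is a minimal sketch of the per-cohort selection described above, assuming a dictionary cohort_tpm that maps each of the 11 cohort names to a (samples x genes) pandas DataFrame; all names are illustrative.

```python
# Hedged sketch of the per-cohort gene list: the union of the top-2,000 most
# variable genes of each downstream cohort, computed on the pretraining data.
import pandas as pd

def per_cohort_gene_union(cohort_tpm: dict, top_k: int = 2000) -> list:
    selected = set()
    for name, expr in cohort_tpm.items():
        # Rank genes by variance within this cohort and keep the top_k.
        top_genes = expr.var(axis=0).nlargest(top_k).index
        selected.update(top_genes)
    return sorted(selected)  # union across cohorts (5046 genes in our setting)
```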

The TCGA data used for pretraining was normalized within each fold with mean-standard scaling, and the learned statistics were saved for potential use on the downstream datasets (pretraining experiments).

Repeated holdout cross-validation framework

In this study, we aim to compare the performance of different representation learning algorithms on downstream survival and gene essentiality prediction tasks using bulk RNA-seq data. Each representation model is trained and used to transform the input expression data before feeding the learned low-dimensional embeddings to a task-specific prediction model, fitted for each representation model tested. To achieve a comprehensive evaluation, we adopt a validation pipeline that focuses on exploring the learning algorithms' sensitivity to diverse hyperparameter settings. Our validation pipeline involves a repeated holdout cross-validation approach45 in which the dataset is repeatedly split in two to create pairs of training and test sets comprising 80% and 20% of the original data, respectively (Supplementary Fig. S1). For experiments without pretraining, the training sets are used to jointly select the optimal HPs of the representation and prediction models by performing a fivefold cross-validation for each candidate set of HPs. The HP tuning is performed using the Tree-structured Parzen Estimator (TPE sampler) implemented in Optuna56 with a fixed budget of 50 iterations. The set of HPs with the best average performance over the validation folds on the downstream task is then used to train the representation and prediction models on the whole training set before evaluating them on the test set. This procedure is repeated 10 times to generate a distribution of scores over the different test sets, providing a more robust performance assessment than a single test evaluation.

For pretraining experiments, we performed, for each task, HP tuning of the pretrained representation models separately from the prediction models. We used the TPE sampler with 50 trials to find the architecture choices and regularization factors (Supplementary Table S8) that minimized the reconstruction loss of the auto-encoders on TCGA data. For each trial, a fivefold cross-validation was repeated 5 times to estimate the generalization error by averaging the scores obtained per fold. For the gene essentiality prediction task, all 33 cohorts were used during pretraining (for a total of 10,736 samples in the pretraining dataset), while we removed the 11 cohorts of the downstream tasks for the per-cohort OS prediction task (4480 samples in the pretraining dataset).
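The sketch below illustrates the HP search performed within one training split: Optuna's TPE sampler with a 50-trial budget, each trial scored by fivefold cross-validation. The searched HPs shown here and the user-supplied callable fit_and_score(hps, train_idx, valid_idx), which returns the validation metric for one fold, are hypothetical placeholders, not our exact configuration.

```python
# Hedged sketch of the per-split HP tuning with Optuna's TPE sampler.
import numpy as np
import optuna
from sklearn.model_selection import KFold

def tune_hyperparameters(X_train, fit_and_score, n_trials: int = 50) -> dict:
    def objective(trial: optuna.Trial) -> float:
        # Illustrative HP space; the real ranges are given in Supplementary Table S8.
        hps = {
            "latent_dim": trial.suggest_categorical("latent_dim", [32, 64, 128]),
            "lr": trial.suggest_float("lr", 1e-4, 1e-2, log=True),
        }
        folds = KFold(n_splits=5, shuffle=True, random_state=0).split(X_train)
        # Average validation score over the five folds for this HP candidate.
        return float(np.mean([fit_and_score(hps, tr, va) for tr, va in folds]))

    study = optuna.create_study(direction="maximize",
                                sampler=optuna.samplers.TPESampler(seed=0))
    study.optimize(objective, n_trials=n_trials)
    return study.best_params  # refit on the whole training set with these HPs
```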
Acceptance testing and model scoring

The limitations of traditional null-hypothesis significance testing and of the cross-validation framework for deriving statistical conclusions have been critically examined in previous work45. The authors highlight the unsuitability of standard null-hypothesis significance testing, including t-tests, for cross-validation, due to the non-independence of runs and the complexity of deriving confidence intervals. Consequently, they advocate for a repeated holdout framework as an alternative to cross-validation. While it is feasible to derive confidence intervals from the repeated holdout framework, the authors still recommend a different scoring system. They argue that traditional hypothesis testing primarily focuses on the statistical significance of expected improvements of models over an infinite population. This approach, however, is not applicable to studies that concentrate on practical, meaningful improvements on finite-sized test sets. Therefore, following the authors' recommendation, we considered a method superior if it outperformed another 75% of the time.

We proposed a scoring system for all models considered on a given task based on the aforementioned criterion. To evaluate the relative importance of each model, a score was generated by counting the number of significant pairwise comparisons won against the other models. When different cohorts were considered for the same task, an average score was obtained from the cohort-specific scores. This value was then normalized by the maximum value obtained by any model on the task, resulting in a score between 0 and 1, which was converted into a percentage to rank the different models.

Representation models

We selected state-of-the-art methods from various subfields of DRL: linear models as baselines (Identity, PCA), auto-encoder-like models (AE, VAE), self-supervised learning methods (masking auto-encoders), semi-supervised methods (multi-head auto-encoders), data augmentation techniques coupled with an AE, graph-based methods leveraging prior knowledge (GNN) and training frameworks (pretraining). For each of these categories, one base model was implemented, and we treated the different architecture choices as HP sweeps in our pipeline, as described above. Details about the architectures and HP ranges (Supplementary Tables S8, S9) are given in Supplementary Materials.

Baselines

We considered as baseline representations the Identity and Principal Component Analysis (PCA), widely used for all considered tasks on RNA-seq data. Identity refers to not transforming the input TPM counts and passing the 5000 genes directly as features to the prediction models. The number of components of the PCA was considered a hyperparameter and optimized per fold in our pipeline (Supplementary Table S8).

Auto-encoders

An auto-encoder (AE) is a type of neural network that learns to encode input data into a lower-dimensional representation and then decode it back to the original form, with the goal of reconstructing the input accurately57. Given the limited size of our datasets, we focused on small architectures ranging from 0 to 2 hidden layers for the encoder (excluding the representation layer), with 256 to 1024 neurons (more details available in Supplementary Table S8). The decoder was constructed symmetrically to the encoder. We optimized the AE models using a Mean Squared Error (MSE) loss and the Adam algorithm58, and applied rectified linear unit (ReLU) activation functions between linear layers. We did the same for the other representation models derived from this architecture (MAE, MHAE, DA-GN, PreAE).
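A minimal PyTorch sketch of this base auto-encoder is given below; the specific layer sizes shown are one illustrative point within the HP ranges above, not the tuned configuration.

```python
# Hedged sketch of the base auto-encoder: symmetric encoder/decoder with ReLU
# activations, trained with an MSE reconstruction loss and the Adam optimizer.
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, n_genes: int = 5000, hidden: int = 512, latent: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, hidden), nn.ReLU(),
            nn.Linear(hidden, latent),          # representation layer
        )
        # Decoder mirrors the encoder.
        self.decoder = nn.Sequential(
            nn.Linear(latent, hidden), nn.ReLU(),
            nn.Linear(hidden, n_genes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # reconstruction objective
```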
Variational auto-encoders: scVI

A variational auto-encoder (VAE) is a type of auto-encoder that incorporates a probabilistic approach, using encoder and decoder networks to generate latent variables and enable sampling from a learned distribution, allowing for generative modeling and capturing the underlying data distribution59. Specifically, we adapted scVI38, a popular method for embedding scRNA-seq datasets, to bulk RNA-seq data and assessed the learned representations on our benchmark tasks. scVI is based on a hierarchical Bayesian model in which conditional distributions are parametrized by neural networks. As in the VAE, each gene expression profile is encoded through a non-linear transformation into a low-dimensional latent vector. This latent representation is then decoded by another non-linear transformation to generate an estimate of the parameters of a zero-inflated negative binomial distribution. Since bulk RNA-seq data typically does not fit the zero-inflation assumption observed in scRNA-seq, we parameterized scVI's decoder with a negative binomial distribution instead.

Masking auto-encoders

Masking in self-supervised learning refers to the process of randomly hiding or removing portions of the input data, forcing the model to learn to reconstruct the missing parts, which promotes the discovery of meaningful features and representations. VIME, or Value Imputation and Mask Estimation, is a popular masking method for tabular data40. In this self-supervised learning framework, two pretext tasks are introduced: feature vector estimation and mask vector estimation. The encoder function is trained together with two pretext predictive models to reconstruct a corrupted input sample (feature vector estimation) with an MSE loss and to estimate the mask vector used to corrupt the sample (mask vector estimation) with a cross-entropy loss. The resulting learned representation captures correlations across different parts of the data, making it useful for downstream tasks. We focused on one re-implementation from the main paper, similar to the original method, in which we only included one pretext task: feature vector estimation.

Multi-head auto-encoders

A multi-head auto-encoder (MHAE) includes one or more auxiliary heads that perform supervised tasks during the representation model training. These auxiliary heads are fully-connected neural networks that take the compressed representation as input and predict labels such as overall survival or gene essentiality as output. In this study, we considered MH auto-encoders trained with an auxiliary head predicting the label of the downstream task, to assess whether adding supervision would improve the performance of the auto-encoder.

Graph neural networks

We used a Graph Neural Network (GNN) architecture to incorporate protein–protein interaction networks (PPIs) as a source of prior knowledge. The STRING PPI database served as the underlying graph on which the bulk RNA-seq data was laid. Each node of the graph represented a gene, with gene expression as a node feature. In this setting, each patient (or sample) was represented as a single graph, and the graph topology did not vary across samples, only the overlaid signal did. This model is close to a traditional convolutional neural network alternating between convolution and pooling steps, but uses a graph instead of, e.g., pixel coordinates to perform message passing to neighboring features. Previous studies showcased this kind of model for survival prediction36,60, and we propose here a modified version intended for representation learning. More details about the implementation are given in the Supplementary Materials.

Data augmentation

Data augmentation is a set of techniques used to reduce overfitting of machine learning models by generating new training data points from the original ones. It is ubiquitous in computer vision, where new images can be obtained by simple transformations such as rotations or cropping61. In the case of omics data, there are no obvious equivalents of such transformations. However, certain methods have been successfully applied to single-cell RNA-seq data in the context of contrastive learning by adding noise and simulating dropout events24. Specifically, we focused on data augmentation based on Gaussian noise (DA-GN). The method consists in creating additional samples by copying the data and adding Gaussian noise to all features after they have been standardized with mean-standard normalization. We used a centered normal distribution with a standard deviation controlled by a hyperparameter, and fixed the number of copies to 4 corrupted samples for each original sample. The target labels for these new data points are simply copied from the original dataset.
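The following is a minimal sketch of this augmentation, assuming X holds standardized training features and y the corresponding labels; names and the default sigma value are illustrative.

```python
# Hedged sketch of the Gaussian-noise augmentation (DA-GN): four noisy copies of
# each standardized training sample, with labels copied from the originals.
import numpy as np

def gaussian_noise_augment(X, y, sigma=0.1, n_copies=4, seed=0):
    rng = np.random.default_rng(seed)
    # Centered Gaussian noise; sigma is the tuned hyperparameter.
    noisy = [X + rng.normal(0.0, sigma, size=X.shape) for _ in range(n_copies)]
    X_aug = np.concatenate([X, *noisy], axis=0)
    y_aug = np.concatenate([y] * (n_copies + 1), axis=0)
    return X_aug, y_aug
```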
Pretraining experiments

The use of bulk RNA-seq data to train representation algorithms is often hindered by the limited number of patients screened for a specific condition. To address this issue, pretraining models on larger datasets can help represent the data more accurately. In this study, we tried two different pretraining strategies for both the per-cohort survival prediction and the gene essentiality prediction tasks. To better assess the effect of pretraining and decouple it from architecture details, we focused our pretraining experiments on the AE model.

For both downstream tasks, we compared the original AE to two different pretraining strategies, PreAE and PreAE finetuned. In the case of PreAE, the mean-standard scaling was done using solely the statistics of the TCGA pretraining dataset (as described above, all 33 TCGA cohorts were used for pretraining for the gene essentiality prediction task, while we removed the 11 cohorts of the downstream tasks for the per-cohort OS prediction task), and the auto-encoder was used only for inference on the task dataset. For the PreAE finetuned strategy, the scaling was performed using the statistics learned on the training folds of the task dataset, and the auto-encoder was also fine-tuned for each fold. These experiments assess how initializing the weights with pretraining can help on the downstream tasks. The optimization histories for the pretraining are available in Supplementary Figs. S9 and S10.

Downstream prediction models

For survival prediction tasks, each representation model was combined with a multi-layer perceptron (MLP) optimized with a differentiable Cox loss62,63 using the Adam algorithm. Within each training fold, 20% of the samples were used for early stopping to prevent overfitting of the MLP. The choice of an MLP rather than a linear Cox model was motivated by faster computation times thanks to GPU usage with PyTorch and by the need for a fair comparison with DL models that use nonlinear prediction heads. More details about the hyperparameters of the prediction model are available in the Supplementary Materials (Supplementary Table S9).

For gene essentiality, a light gradient-boosting machine (LGBM) regressor was trained to predict the essentiality (dependency score) of each gene in a cell line based on the features described in the task section. To train the LGBM regressors effectively, two HPs were considered: the learning rate and the regularization coefficient alpha. The learning rate was searched in the range of 0.01 to 0.3 and the regularization coefficient within a range of 0 to 100 to explore various regularization strengths. Note that we initially tested both an MLP (the choice in the DeepDEP paper) and LGBM, but kept the best-performing model (Supplementary Table S10).
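The sketch below illustrates this downstream model and its evaluation: an LGBM regressor fitted on concatenated (cell-line embedding, fingerprint-PCA) features and scored with the overall and per-DepOI Spearman correlations. The argument names and default HP values are illustrative placeholders, not our tuned configuration.

```python
# Hedged sketch of the gene essentiality downstream model and its evaluation.
import numpy as np
from lightgbm import LGBMRegressor
from scipy.stats import spearmanr

def fit_and_evaluate(X_train, y_train, X_test, y_test, gene_ids_test,
                     learning_rate=0.1, reg_alpha=1.0):
    # Features are concatenations of a cell-line embedding and a 500-dim fingerprint PCA.
    model = LGBMRegressor(learning_rate=learning_rate, reg_alpha=reg_alpha)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    # Overall correlation over the whole test set.
    overall_rho, _ = spearmanr(y_test, preds)
    # Per-DepOI metric: Spearman correlation across cell lines for each gene,
    # averaged over all genes.
    per_gene = [spearmanr(y_test[gene_ids_test == g], preds[gene_ids_test == g])[0]
                for g in np.unique(gene_ids_test)]
    return overall_rho, float(np.nanmean(per_gene))
```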
