Leveraging the variational Bayes autoencoder for survival analysis

Survival analysis

In a conventional time-to-event or SA setup, N observations are given, forming a dataset of triplets \(D = \{(x_i, t_i, d_i)\}^N_{i=1}\), where \(x_i=(x_i^1,\ldots , x_i^L)\) is an L-dimensional covariate vector with \(l=1,2,\ldots ,L\) indexing the covariates, \(t_i\) is the time-to-event, and \(d_i \in \{0,1\}\) is the censoring indicator. When \(d_i = 0\) (censored), the subject has not experienced an event up to time \(t_i\), while \(d_i = 1\) indicates that the event was observed (ground truth). SA models characterize the time-to-event conditionally on the covariates through the probability density function \(p(t\vert x)\), the hazard rate function \(h(t\vert x)\) (the instantaneous rate of occurrence of the event at a specific time), or the survival function \(S(t\vert x) = P(T>t) = 1-F(t\vert x)\), i.e., the probability of a failure occurring after time t, where \(F(t\vert x)\) is the Cumulative Distribution Function (CDF) of the time. From the standard definitions of these functions, the relation between the three characterizations is as follows:$$\begin{aligned} p(t\vert x) = h(t\vert x)S(t\vert x). \end{aligned}$$
(1)
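As a quick numerical illustration (not part of the original derivation), the following minimal Python sketch verifies the relation of Eq. (1) for a Weibull distribution, which is also the time model adopted later; SciPy availability and the parameter values are our assumptions:

```python
# Numerical check of Eq. (1): p(t|x) = h(t|x) * S(t|x), illustrated
# with a Weibull distribution (illustrative shape and scale values).
import numpy as np
from scipy.stats import weibull_min

alpha, lam = 1.5, 2.0          # shape and scale parameters
t = np.linspace(0.1, 5.0, 50)  # grid of event times

pdf = weibull_min.pdf(t, alpha, scale=lam)   # p(t)
surv = weibull_min.sf(t, alpha, scale=lam)   # S(t) = 1 - F(t)
hazard = pdf / surv                          # h(t) = p(t) / S(t)

assert np.allclose(pdf, hazard * surv)       # Eq. (1) holds on the grid
```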
Vanilla variational autoencoder

The original VAE was proposed in 201316 as a robust approach employing DNNs for Bayesian inference. It addresses the problem of a dataset consisting of N i.i.d. samples \(x_i\), \(i = 1,2,\ldots , N\), of a continuous or discrete variable, generated by the following random process, which is depicted in Fig. 1.

Figure 1: Bayesian VAE vanilla model. The shaded circle refers to the latent variable z, and the white circle refers to the observable x. Probabilities \(p_\theta (x\vert z)\) and \(q_\phi (z\vert x)\) denote, respectively, the generative model and the variational approximation to the posterior, since the true posterior \(p_{\theta }(z\vert x)\) is unknown.

1.

A latent variable \(z_i\) is sampled from a given prior probability distribution p(z). The original research16 assumes a form \(p_\theta (z)\), i.e., the prior depends on some parameters \(\theta\), but its main result drops this dependence. Therefore, a simple prior p(z) is assumed in this paper.

2.

The observed values \(x_i\) are generated from a conditional distribution \(p_\theta (x\vert z)\) with parameters \(\theta\); this is the generative model. It is assumed that the probability density functions (pdfs) p(z) and \(p_\theta (x\vert z)\) are differentiable with respect to \(\theta\) and z.

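To make the two-step process concrete, the following is a minimal, hypothetical PyTorch sketch; the decoder architecture, layer sizes, and fixed noise level are illustrative assumptions rather than the original specification:

```python
# Ancestral sampling from the generative model: z ~ p(z), then x ~ p_theta(x|z).
import torch

latent_dim, data_dim = 2, 5
decoder = torch.nn.Sequential(        # stands in for p_theta(x|z)
    torch.nn.Linear(latent_dim, 16),
    torch.nn.Tanh(),
    torch.nn.Linear(16, data_dim),
)

z = torch.randn(10, latent_dim)       # step 1: z_i sampled from the prior N(0, I)
x_mean = decoder(z)                   # decoder outputs the mean of p_theta(x|z)
x = x_mean + 0.1 * torch.randn_like(x_mean)  # step 2: x_i ~ N(x_mean, 0.1^2 I)
```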
The latent variable z and the parameters \(\theta\) are unknown. Without simplifying assumptions, evaluating the marginal likelihood \(p_\theta (x) = \int p(z)p_\theta (x\vert z)dz\) is infeasible. The true posterior density \(p_\theta (z\vert x)\), which we aim to approximate, is given by Bayes’ theorem as in Eq. (2):$$\begin{aligned} p_\theta (z\vert x) = \frac{p_\theta (x\vert z)p(z)}{p_\theta (x)}. \end{aligned}$$
(2)
However, since the marginal likelihood \(p_\theta (x)\) is often intractable, direct computation of the true posterior \(p_\theta (z\vert x)\) is not practical. Variational methods offer a solution by introducing a variational approximation, \(q_\phi (z\vert x)\), to the true posterior. This approximation involves finding the best parameters within a chosen family of distributions, so the quality of the approximation depends on the expressiveness of this parametric family.

ELBO derivation

Since an optimization problem must be solved, an optimization target needs to be derived. As the \(x_i\) are assumed to be i.i.d., the marginal likelihood of a set of points \(\{x_i\}_{i=1}^{N}\) can be expressed as$$\begin{aligned} \log p_\theta (x_1, x_2,\ldots , x_N) = \sum _{i=1}^{N}\log p_\theta (x_i), \end{aligned}$$
(3)
where$$\begin{aligned} \begin{aligned} p_\theta (x) = \int p_\theta (x,z)dz = \int p_\theta (x,z) \frac{q_\phi (z\vert x)}{q_\phi (z\vert x)}dz = {\mathbb {E}}_{q_\phi (z\vert x)} \left[ \frac{p_\theta (x,z)}{q_\phi (z\vert x)}\right] . \end{aligned} \end{aligned}$$
(4)
Using Jensen’s inequality, we can obtain:$$\begin{aligned} \begin{aligned} \log p_\theta (x) = \log \Bigg [{\mathbb {E}}_{q_\phi (z\vert x)}\left[ \frac{p_\theta (x,z)}{q_\phi (z\vert x)}\right] \Bigg ] \ge {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log \frac{p_\theta (x,z)}{q_\phi (z\vert x)}\right] . \end{aligned} \end{aligned}$$
(5)
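The inequality can be checked numerically: for any positive weight \(w\), standing in for \(p_\theta (x,z)/q_\phi (z\vert x)\), the log of the mean upper-bounds the mean of the log. A minimal sketch with synthetic weights:

```python
# Numerical illustration of Jensen's inequality in Eq. (5): log E[w] >= E[log w].
import numpy as np

rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # synthetic positive weights

lhs = np.log(w.mean())   # log of the expectation (log-evidence side)
rhs = np.log(w).mean()   # expectation of the log (ELBO side)
assert lhs >= rhs        # in the VAE setting, the gap is KL(q(z|x) || p(z|x))
```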
Rearranging Eq. (5), we can express it as follows:$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{q_\phi (z\vert x)}\Bigg [\log \left( \frac{p_\theta (x,z)}{q_\phi (z\vert x)}\right) \Bigg ]\\&\quad = \int q_\phi (z\vert x)\log \frac{p_\theta (x\vert z)p(z)}{q_\phi (z\vert x)}dz\\&\quad = \int q_\phi (z\vert x)\log \frac{p(z)}{q_\phi (z\vert x)}dz + \int q_\phi (z\vert x)\log p_\theta (x\vert z)dz\\&\quad = -\int q_\phi (z\vert x)\log \frac{q_\phi (z\vert x)}{p(z)}dz + \int q_\phi (z\vert x)\log p_\theta (x\vert z)dz\\&\quad = -D_{KL}(q_\phi (z\vert x) \vert \vert p(z)) + {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log p_\theta (x\vert z)\right] \\&\quad = \mathcal {L}(x, \theta , \phi ), \end{aligned} \end{aligned}$$
(6)
where \(D_{KL}(p\vert \vert q)\) is the Kullback-Leibler divergence between distributions p and q, and \(\mathcal {L}(x, \theta , \phi )\) is the Evidence Lower BOund (ELBO), whose name comes from Eq. (5):$$\begin{aligned} \begin{aligned} \log p_\theta (x) \ge -D_{KL}(q_\phi (z\vert x)\vert \vert p(z)) + {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log p_\theta (x\vert z)\right] = \mathcal {L}(x, \theta , \phi ), \end{aligned} \end{aligned}$$
(7)
i.e., the ELBO is a lower bound for the marginal log-likelihood of the set of points. Thus, maximizing the ELBO maximizes the log-likelihood of the data; this is the optimization problem to solve.

Implementation

The ELBO derived in Eq. (7) can be effectively implemented using a DNN-based architecture. However, computing the gradient of the ELBO with respect to \(\phi\) presents challenges due to the presence of \(\phi\) in the expectation term (the second part of the ELBO in Eq. (7)). To address this issue, the original research16 introduced the reparameterization trick. This method modifies the latent space sampling process to make it differentiable, enabling gradient-based optimization techniques. Rather than sampling directly from the latent space distribution, VAEs sample \(\epsilon\) from a simple distribution, often a standard normal distribution. Subsequently, a deterministic transformation \(g_\phi\) is applied to \(\epsilon\), producing \(z = g_\phi (x, \epsilon )\), where \(z \sim q_\phi (z\vert x)\) and \(\epsilon \sim p(\epsilon )\). In this case, the ELBO can be estimated as follows:$$\begin{aligned} \begin{aligned} \mathcal {{\hat{L}}}(x, \theta , \phi ) = \frac{1}{N}\sum _{i=1}^{N} \bigg (-D_{KL}(q_\phi (z\vert x_i)\vert \vert p(z)) + \log p_\theta (x_i\vert g_\phi (x_i, \epsilon _{i})) \bigg ). \end{aligned} \end{aligned}$$
(8)
This modification facilitates the calculation of the ELBO gradient with respect to \(\theta\) and \(\phi\), allowing the application of standard gradient optimization methods. Equation (8) can thus be implemented using DNNs, with functions parameterized by \(\phi\) and \(\theta\); gradients are conveniently computed using the Backpropagation algorithm, which various programming libraries automate. The term VAE derives from the fact that Eq. (8) resembles the architecture of an Autoencoder (AE)17, as illustrated in Fig. 2. The variational distribution \(q_\phi\) can be implemented using a DNN with weights \(\phi\), taking an input sample x and outputting the parameters of the deterministic transformation \(g_\phi\). The VAE’s latent space is given by the distribution of the latent variable z, obtained by applying the deterministic transformation \(g_\phi\) to the encoder DNN output and the random ancillary noise \(\epsilon\). A sampled value \(z_i\) is drawn from the latent distribution and used to generate an output sample: another DNN with weights \(\theta\) acts as a decoder, taking z as input and providing the parameters of the distribution \(p_\theta (x\vert z)\) as output.

Figure 2: VAE vanilla model implementation using DNNs.

Two key observations emerge (a minimal code sketch of this implementation follows the list below).

1.

The ELBO loss in Eq. (7) includes a regularization term, penalizing deviations from the prior in the latent space, and a reconstruction error term, enforcing similarity between the samples generated from the latent space and the inputs.

2.

In contrast to standard AEs, VAEs incorporate intermediate sampling, rendering them non-deterministic. This dual sampling process (in the latent space and at the output) is retained in applications where the distribution of the output variables is of interest, as it facilitates estimating the distribution parameters of the input values.

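The following minimal PyTorch sketch implements this pipeline, i.e., the reparameterization trick and the ELBO estimator of Eq. (8). It assumes a Gaussian encoder \(q_\phi (z\vert x)\) and a Gaussian decoder \(p_\theta (x\vert z)\) with fixed unit variance; all layer sizes are illustrative choices, not the original architecture:

```python
# Vanilla VAE sketch: Gaussian encoder, reparameterization trick, ELBO of Eq. (8).
import torch
import torch.nn as nn

class VanillaVAE(nn.Module):
    def __init__(self, data_dim=10, latent_dim=2, hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)      # mean of q_phi(z|x)
        self.logvar_head = nn.Linear(hidden, latent_dim)  # log-variance of q_phi(z|x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, data_dim))

    def elbo(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        eps = torch.randn_like(mu)              # epsilon ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps  # z = g_phi(x, eps), differentiable
        # Closed-form KL(q_phi(z|x) || N(0, I)); cf. Eq. (16) later in the text.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
        # Gaussian reconstruction log-likelihood, up to an additive constant.
        recon = -0.5 * (x - self.decoder(z)).pow(2).sum(dim=1)
        return (recon - kl).mean()              # Monte Carlo estimate of Eq. (8)

# Maximizing the ELBO amounts to minimizing its negative with any gradient optimizer.
model = VanillaVAE()
loss = -model.elbo(torch.randn(64, 10))
loss.backward()
```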
Our contribution

The interest lies in using VAEs to obtain the predictive distribution of the time-to-event given the covariates. The proposed approach, termed Survival Analysis VAE (SAVAE) and depicted in Fig. 3, extends the Vanilla VAE. SAVAE includes a continuous latent variable z, two observables (the covariate vector x and the time-to-event t), and generative models \(p_{\theta _1}(x\vert z)\) and \(p_{\theta _2}(t\vert z)\), assumed conditionally independent given z, a characteristic inherent to VAEs that underlies their ability to model the joint distribution of variables. This means that, once z is known, the components of the vector x and the time t can be generated independently. A single variational distribution approximates the posterior \(p(z\vert x)\), which defines the predictive distribution based on the covariates. While it is possible to include the effect of time (\(p(z\vert t, x)\)), this approach uses only the covariates to obtain the latent space, since the time t is unknown when predicting survival times for test patients and may be censored. SAVAE combines VAEs and survival analysis, offering a flexible framework for modeling complex event data.

Figure 3: SAVAE Bayesian model. The shaded circle refers to the latent variable, and the white circles refer to the observables. Note that the probabilities \(p_{\theta _1}(x\vert z)\) and \(p_{\theta _2}(t\vert z)\) denote the generative models, and \(q_\phi (z\vert x)\) denotes the variational approximation to the posterior, since the true posterior \(p_{\theta }(z\vert x)\) is unknown.

Goal

To achieve the main objective, which is to obtain the predictive distribution of the time-to-event, variational methods will be used in the following way18:$$\begin{aligned} \begin{aligned} p\left( t^*\vert x^*, \left\{ x_i, t_i\right\} ^N_{i=1}\right) = \int p \left( t^*\vert z, \left\{ x_i, t_i\right\} ^N_{i=1}\right) p \left( z\vert x^*, \left\{ x_i, t_i\right\} ^N_{i=1}\right) dz, \end{aligned} \end{aligned}$$
(9)
where \(x^*\) represents the covariates of a particular patient whose survival time distribution \(p \left( t^*\vert z, \left\{ x_i, t_i\right\} ^N_{i=1}\right)\) needs to be estimated.

ELBO derivation

Given our main objective and the VAE architecture on which our approach is based, the previous ELBO derivation can be extended to our case. SAVAE assumes that the two generative models \(p_{\theta _1}(x\vert z)\) and \(p_{\theta _2}(t\vert z)\) are conditionally independent: once z is known, x and t can be generated independently. Furthermore, due to the VAE architecture, each component of the covariate vector x is also assumed to be conditionally independent given z. Therefore,$$\begin{aligned} p(x, t, z) = p_{\theta _1}(x\vert z)p_{\theta _2}(t\vert z)p(z) = p_\theta (x, t\vert z)p(z). \end{aligned}$$
(10)
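As an aside, Eq. (9) already suggests how predictions are obtained at test time: sample z from the variational posterior and average the decoder’s time density. The sketch below illustrates this Monte Carlo approximation; `encode` and `time_decoder` are hypothetical stand-ins for the trained SAVAE networks, and a Weibull time decoder is assumed:

```python
# Monte Carlo approximation of the predictive density of Eq. (9).
import torch

def predictive_time_density(x_star, t_grid, encode, time_decoder, n_samples=100):
    """x_star: covariates of one patient; t_grid: times at which to evaluate p(t*|x*)."""
    mu, logvar = encode(x_star)                    # parameters of q_phi(z|x*)
    density = torch.zeros_like(t_grid)
    for _ in range(n_samples):
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # z ~ q_phi(z|x*)
        alpha, lam = time_decoder(z)               # Weibull parameters for this z
        density += (alpha / lam) * (t_grid / lam) ** (alpha - 1) \
                   * torch.exp(-((t_grid / lam) ** alpha))
    return density / n_samples                     # p(t*|x*) ~ E_q[p(t*|z)]
```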
SAVAE also assumes that the distribution families of \(p_{\theta _1}(x\vert z)\) and \(p_{\theta _2}(t\vert z)\) are known, but not the parameters \(\theta _1\) and \(\theta _2\). Under these assumptions, the ELBO can be computed in a similar way to the Vanilla VAE. First, the conditional likelihood of a set of points \(\left\{ x_i, t_i\right\} ^N_{i=1}\) can be expressed as follows:$$\begin{aligned} \begin{aligned} \log p_\theta (x_1, x_2,\ldots , x_N, t_1, t_2,\ldots , t_N\vert z) = \sum _{i=1}^{N}\log p_\theta (x_i, t_i\vert z) = \sum _{i=1}^{N}\left( \log p_{\theta _2}(t_i\vert z) + \sum _{l=1}^{L} \log p_{\theta _{1}}(x_{i}^{l}\vert z)\right) , \end{aligned} \end{aligned}$$
(11)
where the expected conditional likelihood can be expressed as:$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_z\left[ p_\theta (x, t\vert z)\right] \\&\quad = \int p_\theta (x, t\vert z)p(z)dz \\&\quad = \int \frac{p_\theta (x, t, z)}{p(z)}p(z)dz \\&\quad = \int p_\theta (x, t, z)dz \\&\quad =p_\theta (x, t) = \int p_\theta (x, t, z)\frac{q_\phi (z\vert x)}{q_\phi (z\vert x)}dz \\&\quad = {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \frac{p_\theta (x,t,z)}{q_\phi (z\vert x)}\right] . \end{aligned} \end{aligned}$$
(12)
As the interest lies in computing the log-likelihood:$$\begin{aligned} \begin{aligned} \log p_\theta (x, t) = \log \left[ {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \frac{p_\theta (x,t,z)}{q_\phi (z\vert x)}\right] \right] \ge {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log \frac{p_\theta (x,t,z)}{q_\phi (z\vert x)}\right] , \end{aligned} \end{aligned}$$
(13)
where the inequality comes from applying Jensen’s inequality. Then, this can be rearranged as:$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log \left( \frac{p_\theta (x,t,z)}{q_\phi (z\vert x)}\right) \right] \\&\quad = \int q_\phi (z\vert x)\log \frac{p_{\theta _1}(x\vert z)p_{\theta _2}(t\vert z)p(z)}{q_\phi (z\vert x)}dz \\&\quad = -\int q_\phi (z\vert x)\log \frac{q_\phi (z\vert x)}{p(z)}dz + \int q_\phi (z\vert x)\left( \log p_{\theta _1}(x\vert z)+\log p_{\theta _2}(t\vert z)\right) dz \\&\quad = -D_{KL}(q_\phi (z\vert x) \vert \vert p(z)) + {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log p_{\theta _1}(x\vert z) + \log p_{\theta _2}(t\vert z)\right] \\&\quad = \mathcal {L}(x, \theta _1, \theta _2, \phi ). \end{aligned} \end{aligned}$$
(14)
This ELBO is similar to the Vanilla VAE’s one (Eq. 7); the only difference lies in the reconstruction term, which is expressed differently to distinguish explicitly between the covariates and the time-to-event. By using Eq. (11) and the reparameterization trick, the ELBO estimator is obtained, explicitly accounting for each dimension of the covariate vector:$$\begin{aligned} \begin{aligned} \mathcal {{\hat{L}}}(x, \theta _1, \theta _2, \phi ) = \frac{1}{N} \sum _{i=1}^{N}\Bigg (-D_{KL}(q_\phi (z\vert x_i)\vert \vert p(z)) + \log p_{\theta _2}(t_i\vert g_\phi (x_i, \epsilon _{i})) + \sum _{l=1}^{L}\log p_{\theta _{1}}(x_{i}^{l}\vert g_\phi (x_i, \epsilon _{i}))\Bigg ). \end{aligned} \end{aligned}$$
(15)
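A compact sketch of how Eq. (15) can be assembled in PyTorch is given below. The encoder and decoder callables and the censored time log-likelihood `time_loglik` (developed below around Eqs. (19)–(20)) are hypothetical stand-ins, and Gaussian covariate decoders are assumed:

```python
# One-sample Monte Carlo estimate of the SAVAE ELBO of Eq. (15) for a batch.
import torch

def savae_elbo(x, t, d, encoder, time_decoder, cov_decoder, time_loglik):
    mu, logvar = encoder(x)                                  # q_phi(z|x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)  # Eq. (16)
    alpha, lam = time_decoder(z)                             # Weibull parameters
    time_term = time_loglik(t, d, alpha, lam)                # censored term, Eq. (19)
    x_mu, x_sigma = cov_decoder(z)                           # Gaussian covariate params
    cov_term = (-torch.log(x_sigma * (2 * torch.pi) ** 0.5)
                - 0.5 * ((x - x_mu) / x_sigma) ** 2).sum(dim=1)  # Eq. (21) per covariate
    return (-kl + time_term + cov_term).mean()               # ELBO estimate, Eq. (15)
```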
Three DNNs are used in the implementation, as specified in Fig. 4. Note that the decoder DNNs output the parameters of each distribution.

Figure 4: SAVAE implementation using DNNs. One of them acts as an encoder, taking the covariate vector as input. The other two act as decoders, one for the covariates and the other for the time.

Divergence computation

SAVAE assumes that \(q_\phi (z\vert x)\) follows a multidimensional Gaussian distribution defined by a mean vector \(\mu\), with elements \(\mu _j\), and a diagonal covariance matrix \({\textbf {C}}\) whose main diagonal consists of the variances \(\sigma ^2_j\). It can be stated that:$$\begin{aligned} -D_{KL}(q_\phi (z\vert x)\vert \vert p(z)) = \frac{1}{2} \sum _{j=1}^{J} (1 + \log (\sigma _j^2) - \mu _j^2 -\sigma _j^2), \end{aligned}$$
(16)
where J is the dimension of the latent space z16. This means that the Kullback-Leibler divergence in the ELBO of Eq. (15) can be calculated analytically.

Time modeling

One significant challenge in handling survival data is censoring, which occurs when a patient has not yet experienced the event of interest. In such cases, the survival time remains unknown, resulting in partial or incomplete observations. Consequently, SA models must employ techniques capable of accommodating both censored and uncensored observations to estimate the relevant parameters reliably. In our case, to account for censoring in survival data, we start from the time reconstruction term of Eq. (15) for a single patient:$$\begin{aligned} \mathcal {\hat{L}}_{time}(x_i, \theta _2, \phi ) = \log p_{\theta _2}(t_i\vert g_\phi (x_i, \epsilon _{i})). \end{aligned}$$
(17)
Taking into account the censoring indicator \(d_i\):$$\begin{aligned} d_i = {\left\{ \begin{array}{ll} 0 \;\;\;\text {if censored}\\ 1 \;\;\;\text {if event experienced} \end{array}\right. }, \end{aligned}$$
(18)
we could use only the information given by uncensored patients. However, this would waste information, since we know that censored patients have not experienced the event up to time \(t_i\). Hence, considering Eq. (1) and following19, we model the time pdf as:$$\begin{aligned} p_{\theta _2}(t_i\vert g_\phi (x_i, \epsilon _{i})) = h(t_i \vert g_\phi (x_i, \epsilon _{i}))^{d_i}S(t_i \vert g_\phi (x_i, \epsilon _{i})). \end{aligned}$$
(19)
Therefore, the hazard function term is only considered when the event has been experienced, that is, when the observation is not censored. This way, SAVAE incorporates information from censored observations, providing consistent parameter estimates. Regarding the distribution chosen for the event time, we follow several publications, such as8, that use the Weibull distribution. This is a two-parameter distribution with positive support, that is, \(p(t) = 0, \forall t < 0\). The two scalar parameters are \(\lambda > 0\), which controls the scale, and \(\alpha > 0\), which controls the shape, as follows:$$\begin{aligned} {\left\{ \begin{array}{ll} p(t; \alpha , \lambda ) = \frac{\alpha }{\lambda } \left( \frac{t}{\lambda } \right) ^{\alpha - 1} \exp { \left( -\left( \frac{t}{\lambda }\right) ^\alpha \right) }\\ S(t; \alpha , \lambda ) = \exp {\left( -\left( \frac{t}{\lambda }\right) ^\alpha \right) }\\ h(t; \alpha , \lambda ) = \frac{p(t; \alpha , \lambda )}{S(t; \alpha , \lambda )} = \frac{\alpha }{\lambda }\left( \frac{t}{\lambda }\right) ^{\alpha -1} \end{array}\right. }. \end{aligned}$$
(20)
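In code, Eqs. (19) and (20) combine into a censored time log-likelihood, \(\log p = d_i \log h(t_i) + \log S(t_i)\). A minimal PyTorch sketch follows; the small `eps` constant for numerical stability is our addition:

```python
# Censored Weibull log-likelihood: the time term of the SAVAE loss.
import torch

def weibull_censored_loglik(t, d, alpha, lam, eps=1e-8):
    """t: observed times; d: censoring indicator (1 = event observed);
    alpha, lam: per-patient Weibull shape and scale from the time decoder."""
    log_hazard = torch.log(alpha / lam + eps) + (alpha - 1) * torch.log(t / lam + eps)
    log_surv = -((t / lam) ** alpha)     # log S(t) for a Weibull
    return d * log_hazard + log_surv     # hazard term enters only if uncensored
```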
Although the Weibull distribution is our primary choice for modeling time-to-event data in SAVAE, it is worth highlighting that other distributions are feasible as long as their hazard functions and CDFs can be computed analytically. This versatility distinguishes SAVAE from other models. For example, the exponential distribution, a particular case of the Weibull with \(\alpha = 1\), yields a constant hazard function. Integrating alternative distributions, such as the exponential, into SAVAE is straightforward and only requires adjusting the terms in Eq. (19). The ability of SAVAE to predict the distribution parameters for each patient facilitates the calculation of various statistics, such as means, medians, and percentiles, providing flexibility beyond models tailored to a single distribution.

Marginal log-likelihood computation

Assigning distribution models to the patient covariates in the reconstruction term is essential in SAVAE. This choice enables control over the resulting output variable distribution, but it also implies that the model approximates the chosen distribution even if the actual distribution differs. The third component of the ELBO in Eq. (15) depends on the log-likelihood of the data, which for some representative distributions is:

Gaussian distribution: Suitable for real-valued variables (\(x_{i}^{l} \in (-\infty , +\infty )\)), this symmetric distribution has parameters \(\mu \in (-\infty , +\infty )\) and \(\sigma \in (0, +\infty )\). Its log-likelihood function is: $$\begin{aligned} \begin{aligned} \log (p(x_{i}^{l};\mu , \sigma )) = -\log (\sigma \sqrt{2\pi }) - \frac{1}{2} \left( \frac{x_{i}^{l}-\mu }{\sigma }\right) ^2. \end{aligned} \end{aligned}$$
(21)

Bernoulli distribution: Applied to binary variables (\(x_{i}^{l} \in \{0,1\}\)), it has a single parameter \(\beta \in [0,1]\), representing the probability of \(x_{i}^{l}=1\). Its log-likelihood function is: $$\begin{aligned} \log (p(x_{i}^{l};\beta )) = x_{i}^{l}\log (\beta ) + (1-x_{i}^{l})\log (1-\beta ). \end{aligned}$$
(22)

Categorical distribution: Models discrete variables with K possible values: \(x_{i}^{l}\) is a categorical scalar random variable taking one of K different values, each assigned a probability \(\theta _k\) (note that \(\sum _{k=1}^K \theta _k = 1\)). The log-likelihood function can be computed from the Probability Mass Function (PMF) as: $$\begin{aligned} \log (p(x_{i}^{l} \vert \theta _1, \theta _2,\ldots , \theta _K)) = \log \left( \prod _{k=1}^K \theta _k^{{\mathbb {I}}(x_{i}^{l}=k)}\right) , \end{aligned}$$
(23)
where the indicator function means: $$\begin{aligned} {\mathbb {I}}(x_{i}^{l}=k) = {\left\{ \begin{array}{ll} 1 \quad x_{i}^{l} = k\\ 0 \quad x_{i}^{l} \ne k \end{array}\right. }. \end{aligned}$$
(24)
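As a hedged illustration (the small `eps` constants are our additions for numerical stability, and the decoder is assumed to output the corresponding parameters), the three log-likelihoods of Eqs. (21)–(23) can be written in PyTorch as:

```python
# Per-covariate reconstruction log-likelihoods of Eqs. (21)-(23).
import torch

def gaussian_loglik(x, mu, sigma):                 # Eq. (21), real-valued covariates
    return -torch.log(sigma * (2 * torch.pi) ** 0.5) - 0.5 * ((x - mu) / sigma) ** 2

def bernoulli_loglik(x, beta, eps=1e-8):           # Eq. (22), binary covariates
    return x * torch.log(beta + eps) + (1 - x) * torch.log(1 - beta + eps)

def categorical_loglik(x, theta, eps=1e-8):        # Eq. (23), x coded in {0, ..., K-1}
    # Selecting theta_k for the observed class implements the indicator of Eq. (24).
    return torch.log(theta.gather(-1, x.long().unsqueeze(-1)).squeeze(-1) + eps)
```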

Recall that any other desired distribution can be implemented in SAVAE, provided its log-likelihood is differentiable.
