AFFECT: an R package for accelerated functional failure time model with error-contaminated survival times and applications to gene expression data | BMC Bioinformatics

Survival data

Let $n$ denote the sample size. For subject $i=1,\ldots,n$, let $\widetilde{T}_{i}$ and $\widetilde{C}_{i}$ be the non-negative failure and censoring times of a specific cancer, respectively. For the purposes of the analysis, we work on the log scale: $T_{i}\triangleq \log (\widetilde{T}_{i})$ and $C_{i}\triangleq \log (\widetilde{C}_{i})$. Based on $T_{i}$ and $C_{i}$, define $Y_i \triangleq \min \{T_i,C_i\}$ as the observed survival time and $\delta _i\triangleq \mathbb {I}(T_i<C_i)$ as the censoring indicator, where $\mathbb {I}(\cdot )$ is the indicator function. Moreover, let ${\textbf {X}}_{i} = (X_{i1},X_{i2},\ldots,X_{ip})^{\top }$ be a $p$-dimensional vector of covariates or gene expressions. We impose the standard assumption that $T_{i}$ and $C_{i}$ are independent, given ${\textbf {X}}_{i}$. Therefore, a typical survival data structure is given by $\left\{ (Y_{i},\delta _{i},{\textbf {X}}_{i}):i=1,2,\ldots,n\right\}$.

The main interest in survival analysis is to characterize the relationship between the failure time and the covariates. In our development, we consider the following accelerated failure time (AFT) model:
$$\begin{aligned} T_{i} &= F({\textbf {X}}_{i}) + \varepsilon _{i} \\ &\triangleq f_{1}(X_{i1})+f_{2}(X_{i2})+\cdots +f_{q}(X_{iq}) + \varepsilon _{i}, \end{aligned}$$
(1)
where $\varepsilon _{i}$ is the noise term with $E(\varepsilon _{i})=0$ and an unknown survivor function $S_\varepsilon (\cdot )$, and each $f_{j}\in \mathcal {F}$ is an unknown function of interest, with $\mathcal {F}$ being a class of continuous smooth functions. Model (1) states that, among all $p$ gene expressions, only $q$ are informative for the failure time, where $q$ is a positive integer smaller than $p$.

Ideally, if $T_{i}$ were fully observed for all $i=1,\ldots,n$, one could consider the following least squares function
$$\begin{aligned} \sum \limits _{i=1}^n\{T_{i}-F({\textbf {X}}_{i})\}^{2}, \end{aligned}$$
(2)
and $F(\cdot )$ could then be estimated by minimizing (2) via nonparametric methods. However, in the presence of right-censoring, $T_{i}$ is not always observed, and the dataset contains $Y_{i}$ instead. Directly using $Y_{i}$ in place of $T_{i}$ in (2) may lead to a biased estimator of $F(\cdot )$. A further challenge is that the dimension $p$ of gene expression data is usually larger than $q$, so that most gene expressions are possibly non-informative for the time-to-event response. As a result, detecting the informative gene expressions is a crucial issue as well.

Measurement error models

In addition to the challenge posed by the complex regression model, measurement error is another challenging and ubiquitous feature of such datasets, usually caused by imprecise measurement or recording errors. While we cannot examine whether variables are contaminated by measurement error, the key idea is to relax the “implicit” assumption that variables in the dataset are precisely measured. Measurement error in covariates has been widely explored. As commented by [25], however, survival times and censoring status are also possibly subject to measurement error. Specifically, let $Y_{i}^{*}$ and $\delta _{i}^{*}$ denote the surrogate versions of the unobserved survival time $Y_{i}$ and censoring status $\delta _{i}$, respectively.

First, to relate the error-prone survival time $Y_{i}^{*}$ to the unobserved survival time $Y_i$, we modify the classical additive measurement error model (e.g., [4, 35]) and follow an idea in [25] to consider the following measurement error model:
$$\begin{aligned} Y^{*}_{i} = Y_{i} + \gamma _0+\varvec{\gamma }^{\top }_{1}{\textbf {X}}_{i}+\eta _{i} \triangleq Y_{i} +\varvec{\omega }_{i}, \end{aligned}$$
(3)
where $\eta _{i}$ is assumed to follow a distribution with $E(\eta _{i})=0$ and $\text {var}(\eta _i) = \sigma _\eta ^2$ and to be independent of ${\textbf {X}}_{i}$, and $\gamma _{0}$ and $\varvec{\gamma }_1$ are parameters.

Next, to characterize the misclassified censoring status, we let $\pi _{ikl} = P(\delta ^{*}_{i} = k|\delta _{i} = l,{\textbf {X}}_{i})$ denote the conditional probability of observing censoring status $k$ given the covariates and the unobserved censoring status $l$, for $k, l \in \{0,1\}$. By the law of total probability, one can express the two probabilities $P(\delta _{i}^{*}=1|{\textbf {X}}_{i})$ and $P(\delta _{i}^{*}=0|{\textbf {X}}_{i})$ as
$$\begin{aligned} \begin{bmatrix} P(\delta ^{*}_{i}=1|{\textbf {X}}_{i}) \\ P(\delta ^{*}_{i}=0|{\textbf {X}}_{i})\\ \end{bmatrix} = \varvec{\Pi }_{i} \begin{bmatrix} P(\delta _{i}=1|{\textbf {X}}_{i}) \\ P(\delta _{i}=0|{\textbf {X}}_{i})\\ \end{bmatrix} \end{aligned}$$
(4)
with $\varvec{\Pi }_{i}=\begin{bmatrix} \pi _{i11} & \pi _{i10}\\ \pi _{i01} & \pi _{i00}\\ \end{bmatrix}$ being a $2 \times 2$ misclassification matrix. Moreover, as commented by [35] (Ch. 8), we impose the non-differential mechanism, under which the misclassification probabilities do not depend on the covariates:
$$\begin{aligned} \pi _{ikl} = P(\delta ^{*}_{i} = k|\delta _{i} = l) \end{aligned}$$
(5)
for $k, l \in \{0,1\}$. Accordingly, we adopt (5) in the inference procedure developed below. From now on, we replace $\pi _{ikl}$ and $\varvec{\Pi }_i$ by $\pi _{kl}$ and $\varvec{\Pi }$, respectively, dropping the subscript $i$ in view of assumption (5) and the independence across subjects.

Note that the parameters $\gamma _{0}$ and $\varvec{\gamma }_1$ in (3), as well as $\varvec{\Pi }$ in (4), are usually unknown in applications. If auxiliary information, such as validation data, is available, then those parameters can be estimated. Otherwise, one may rely on prior knowledge and past experience to specify $\gamma _{0}$, $\varvec{\gamma }_1$, and $\varvec{\Pi }$, or conduct sensitivity analyses. In the latter approach, one specifies various values for the unknown parameters, based on background knowledge or within reasonable ranges, to examine the impact of different magnitudes of measurement error and to see whether the estimation method is robust to changes in the parameter values in (3) and (4).
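
To make the roles of (1), (3), (4), and (5) concrete, the following Python sketch simulates error-contaminated survival data and then inverts the misclassification matrix to recover the proportion of true events. It is only an illustration under stated assumptions and is not the estimation procedure of the AFFECT package: the single covariate, the choice $f_{1}(x)=\sin (\pi x)$, the normal distributions, and the function names `simulate` and `correct_event_probability` are all hypothetical. Under (5), averaging (4) over the covariates gives $P(\delta ^{*}=1)=\pi _{10}+(\pi _{11}-\pi _{10})P(\delta =1)$ with $\pi _{10}=1-\pi _{00}$, which is the relation solved in the code.

```python
import math
import random


def simulate(n, pi11, pi00, gamma0, gamma1, sigma_eta, seed=0):
    """Draw (Y, delta, Y*, delta*, X) for n subjects with a single covariate."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        # Model (1) with q = 1; f_1(x) = sin(pi * x) is a hypothetical choice.
        t = math.sin(math.pi * x) + rng.gauss(0.0, 0.5)
        c = rng.gauss(0.5, 1.0)  # log censoring time, drawn independently of T
        y = min(t, c)
        delta = 1 if t < c else 0
        # Model (3): additive error with a covariate-dependent shift.
        y_star = y + gamma0 + gamma1 * x + rng.gauss(0.0, sigma_eta)
        # Assumption (5): the misclassification probability depends on
        # delta only, not on x.
        p_correct = pi11 if delta == 1 else pi00
        delta_star = delta if rng.random() < p_correct else 1 - delta
        rows.append((y, delta, y_star, delta_star, x))
    return rows


def correct_event_probability(p_star, pi11, pi00):
    """Solve (4) for P(delta = 1), given p_star = P(delta* = 1)."""
    pi10 = 1.0 - pi00          # P(delta* = 1 | delta = 0)
    det = pi11 - pi10          # determinant of Pi, equal to pi11 + pi00 - 1
    if abs(det) < 1e-12:
        raise ValueError("Pi is singular; (4) cannot be inverted")
    return (p_star - pi10) / det


if __name__ == "__main__":
    rows = simulate(5000, pi11=0.9, pi00=0.85,
                    gamma0=0.1, gamma1=0.3, sigma_eta=0.2)
    p_star = sum(r[3] for r in rows) / len(rows)
    print("naive proportion of events:", round(p_star, 3))
    # Sensitivity analysis: try several candidate values for the unknown Pi.
    for pi11, pi00 in [(1.0, 1.0), (0.95, 0.95), (0.9, 0.85), (0.8, 0.8)]:
        corrected = correct_event_probability(p_star, pi11, pi00)
        print("pi11 =", pi11, "pi00 =", pi00, "corrected:", round(corrected, 3))
```

The loop at the end mirrors the sensitivity analysis described above: each candidate pair $(\pi _{11},\pi _{00})$ yields a different corrected proportion, and a candidate that produces a value outside $[0,1]$ is incompatible with the observed data.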
