Random forests for the analysis of matched case–control studies

Conditional logistic regression

Conditional logistic regression [14] is applied when, as for example in a matched case–control study, the observations come in n strata of size \(m_i, i=1,\ldots ,n\). The case–control status defines the binary outcome \(y_{ij} \in \{0,1\}\), where \(y_{ij}=1\) for cases and \(y_{ij}=0\) for controls. By design, a stratum contains one case and one or more controls which are matched to this case. We assume the number of cases per stratum to be restricted to one, i.e. \(\sum _{j=1}^{m_i} y_{ij} = 1\). For these data, an ordinary conditional logistic regression (CLR) model can be denoted as
$$\begin{aligned} \log \left( \dfrac{\text {P}(y_{ij}=1)}{\text {P}(y_{ij}=0)}\right) = \alpha _{i} + \textbf{z}_{ij}^T\varvec{\gamma }, \quad i = 1,\ldots ,n, ~ j = 1,\ldots ,m_i\,. \end{aligned}$$
(1)
The strata-specific intercepts \(\alpha _i\) represent strata effects, which describe the similarity between all observations in one stratum with respect to the matching criteria. All other variables are modeled using the simple linear term \(\textbf{z}_{ij}^T\varvec{\gamma }\) with coefficient vector \(\varvec{\gamma }\) and covariate vector \(\textbf{z}_{ij}\).

In explanatory models, where we are interested in one particular variable (i.e. the exposure variable), we separate this variable in our mathematical notation. For the rest of the manuscript, the exposure variable will be denoted as x (with treatment effect \(\beta\)), while all further covariates are collected in the vector \(\textbf{z}\). Using this notation, CLR is denoted as
$$\begin{aligned} \log \left( \dfrac{\text {P}(y_{ij}=1)}{\text {P}(y_{ij}=0)}\right) = \alpha _{i} + x_{ij}\beta + \textbf{z}_{ij}^T\varvec{\gamma }, \quad i = 1,\ldots ,n, ~ j = 1,\ldots ,m_i\,. \end{aligned}$$
(2)
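In practice, a model of the form (2) can be fitted with standard software for conditional logistic regression. The following minimal sketch uses the clogit() function from the R package survival; the simulated data frame dat and its column names (y, x, z1, z2, stratum) are purely illustrative and not part of the original manuscript.

```r
library(survival)

## Toy 1:1 matched data, for illustration only: 50 strata with one case and one control each
set.seed(1)
n   <- 50
dat <- data.frame(
  stratum = rep(1:n, each = 2),
  y       = rep(c(1, 0), times = n),   # one case and one control per stratum
  x       = rbinom(2 * n, 1, 0.4),     # binary exposure
  z1      = rnorm(2 * n),
  z2      = rnorm(2 * n)
)

## CLR as in Eq. (2): strata() eliminates the stratum-specific intercepts alpha_i
fit <- clogit(y ~ x + z1 + z2 + strata(stratum), data = dat)
summary(fit)   # exp(coef) gives conditional odds ratios, e.g. exp(beta) for the exposure x
```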
For the estimation of such a model, the corresponding conditional likelihood is used, in which the stratum-specific intercepts \(\alpha _{i}\) are eliminated by conditioning on the number of cases per stratum. For further details see Schauberger et al. [9] or Breslow and Day [15].

Conditional logistic regression trees

The method of conditional logistic regression trees [9] (CLogitTree) was introduced as an alternative to CLR. It has the advantage that the assumptions for the functional relationship between the covariates (i.e. the variables contained in \(\textbf{z}\)) and the outcome are much weaker. In particular, no linear relationship is assumed, and interactions are included automatically in a data-driven manner.

Conditional logistic regression trees take advantage of the fitting process of CLR models via the conditional (log-)likelihood. They start with an initial model that only contains the strata-specific intercepts and a separate exposure effect and then gradually evolve by finding optimal partitions of the covariate space. The final model can be denoted as
$$\begin{aligned} \log \left( \dfrac{\text {P}(y_{ij}=1)}{\text {P}(y_{ij}=0)}\right) = \alpha _{i} + x_{ij}\beta + f(\textbf{z}_{ij}), \end{aligned}$$
(3)
where \(f(\textbf{z}_{ij})\) represents the effect of the variables collected in \(\textbf{z}\) and can be displayed as a tree via dendrograms. In an explorative setting, where we are not interested in a particular exposure variable, the separate term \(x_{ij}\beta\) containing a linear exposure effect can be omitted. The tree is embedded into the CLR framework via (products of) indicator functions, which represent the terminal nodes of the tree \(S_1,\ldots , S_t\). Accordingly, the tree \(f(\textbf{z}_{ij})\) can in general be denoted as$$\begin{aligned} f(\textbf{z}_{ij}) = \delta _1 I(\textbf{z}_{ij} \in S_1) + \ldots + \delta _t I(\textbf{z}_{ij} \in S_t) \end{aligned}$$
(4)
where \(\delta _1,\ldots ,\delta _t\) represent the parameter estimates for the single terminal nodes.

Fig. 1 Exemplary illustration of \(f(\textbf{z}_{ij})\) as a tree with four terminal nodes \(S_1,\ldots ,S_4\)

Figure 1 shows an exemplary representation of \(f(\textbf{z}_{ij})\) as a tree with four terminal nodes \(S_1,\ldots ,S_4\), where \(\textbf{z}\) consists of p covariates \(Z_1,\ldots ,Z_p\) but only the first three variables \(Z_1,\ldots ,Z_3\) are selected for splits. In this example, the tree would be represented as
$$\begin{aligned} f(\textbf{z}_{ij}) = \delta _1 I(\textbf{z}_{ij} \in S_1) + \delta _2 I(\textbf{z}_{ij} \in S_2) + \delta _3 I(\textbf{z}_{ij} \in S_3) + \delta _4 I(\textbf{z}_{ij} \in S_4)\,. \end{aligned}$$
(5)
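As a small numerical illustration of Eq. (5), the sketch below evaluates \(f(\textbf{z}_{ij})\) for the tree of Fig. 1, writing each terminal-node indicator as a product of split indicators (as detailed in the next paragraph). Only the split defining \(S_1\) is taken from the text; the split point for \(Z_3\) and the node estimates \(\delta _1,\ldots ,\delta _4\) are hypothetical values chosen purely for illustration.

```r
## Hypothetical node estimates delta_1, ..., delta_4 (illustration only)
delta <- c(-0.8, 0.4, -0.1, 1.2)

## f(z) from Eq. (5) for the tree of Fig. 1; the split z1 <= 2, z2 <= -1 for S1
## follows the text, the split point 0 for z3 is an assumption
f_tree <- function(z1, z2, z3) {
  delta[1] * (z1 <= 2) * (z2 <= -1) +   # S1
  delta[2] * (z1 <= 2) * (z2 >  -1) +   # S2
  delta[3] * (z1 >  2) * (z3 <=  0) +   # S3
  delta[4] * (z1 >  2) * (z3 >   0)     # S4
}

f_tree(z1 = 1, z2 = -2, z3 = 0.5)   # observation falls into S1, so f returns delta_1 = -0.8
```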
Each terminal node can be represented as a product of indicator functions. For example, \(S_1\) is denoted as \(I(\textbf{z}_{ij} \in S_1)=I(z_{ij1} \le 2)\,I(z_{ij2} \le -1)\).

Conditional logistic regression forests

In this section, the proposed conditional logistic regression forests (CLogitForest) are introduced. We start by explaining the estimation process before elaborating on their interpretation via variable importance and the potential use of bootstrap confidence intervals for exposure effects.

Estimation

CLogitForest is an ensemble learning technique with CLogitTree as base learner. The estimation is based upon ntree bootstrap samples drawn from the original training data. A main characteristic of data from matched case–control studies is that they consist of a number of strata which cannot be separated. Therefore, sampling of the different bootstrap samples has to be done on the level of the n strata. Sampling is performed either by regular bootstrap sampling (sampling of n strata with replacement) or by taking a subsample of 63.2% of the n strata without replacement. The number \(63.2\%\) results from the fact that, in the case of regular sampling with replacement, the expected number of unique elements is \((1-e^{-1})n\approx 0.632\,n\).

In each of the samples, a separate CLogitTree is estimated. In order to additionally de-correlate the single trees, the parameter mtry needs to be specified. In each potential split, only a random subset of mtry out of all p covariates is considered. This guarantees that the different trees differ even more than they already do by being fitted to different samples of the data.

In contrast to single trees, trees serving as base learners of ensemble methods are allowed to overfit the data. The overfitting of single base learners (in our case single trees) is compensated by combining them into a joint model (in our case a random forest). Therefore, within CLogitForest the single trees are not pruned via permutation tests or the Bayesian information criterion (BIC), which were the pruning approaches proposed by Schauberger et al. [9]. However, in the accompanying implementation [16] other arguments can be applied to prevent a tree from producing terminal nodes with too few observations (see Sect. 2.4).

A recommended option is to perform the estimation of the conditional logistic regression (underlying each tree from CLogitTree) using an \(L_2\) penalty term. Using an \(L_2\) penalty has the goal of stabilizing the estimates in cases where perfect separability between cases and controls is achieved within the tree. The penalty term is scaled by a regularization parameter \(\lambda\), which is typically set to a small value like \(\lambda = 10^{-20}\). For further details on the exact implementation of these arguments see Schauberger et al. [9].

The final forest model is the aggregation of the single trees. Prediction can simply be done by averaging the predictions of the ntree different trees. In case a separate exposure effect \(\beta\) is estimated, the final estimate for \(\beta\) is the average of all single estimates from the trees.

Variable importance

The main advantage of using forests instead of single trees, in general but also in our particular application to matched case–control studies, is that the resulting models are much more stable than single tree models. Also, the functional relationship between the covariates and the outcome can be much more complicated than in a single tree.
The downside of this increased degree of flexibility is that the interpretation of the functional relationships between covariates and the outcome becomes much harder. Accordingly, random forests are often termed black-box models which do not allow for any insights into the functional relationships, while trees are very intuitive to understand and easy to display. However, methods of interpretable machine learning [17] also exist for black-box models and can help to identify the fitted model's underlying structures.

A popular method to identify the relevance of the various covariates within random forest models is the concept of variable importance [1], which can also be implemented for CLogitForest. In order to measure the variable importance of a particular variable, this variable is permuted within the given training data. Subsequently, the estimated forest (which is based on the non-permuted variables) is used to predict the respective outcome using the permuted variable (while all other variables remain unchanged). In CLogitForest, we use the predictive conditional likelihood on the level of the single strata as the quality measure of the respective prediction. Finally, this predictive conditional likelihood can be compared to the predictive conditional likelihood based on the non-permuted variables. The larger the difference, i.e. the more the predictive accuracy decreases after permutation of a variable, the more important the variable is for the estimated forest model. Two versions of measuring variable importance are implemented in the accompanying R package. The first is the rather classic version, where variable importance is based on all observations for all trees. The second version relies only on out-of-bag observations for each tree. That means that for each tree, only those observations are used which are not part of that tree's training data, i.e. not part of the respective bootstrap sample. The latter version is preferred due to its increased robustness against overfitting to the training data.

If the model incorporates an explicit exposure effect, the variable importance gives us valuable information about which covariates have the highest importance. This can be substantially different from the importance of the single variables we would see in CLR, where variables with an important (but non-linear) influence will be neglected. This information is highly valuable for researchers interested in gaining a deeper understanding of the underlying confounding structure.

Furthermore, variable importance measures can also be immensely helpful in cases where an exploratory analysis is applied to a matched case–control study. In such a case, one is usually interested in finding the most important risk factor(s) for the disease at hand from a set of potential candidates. If the association between the risk factors and the outcome is complicated, potentially non-linear or includes multiple interactions with other variables, this can be detected much better by CLogitForest than by CLR, but also better than by CLogitTree. Additionally, as CLogitForest provides a more stable estimation, it will also lead to a more stable and reliable detection of the most important variables compared to CLogitTree. Therefore, implementing variable importance measures can help to identify important risk (or protective) factors in an exploratory analysis and to distinguish between important and unimportant variables.
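The following sketch illustrates the permutation-based idea described above in generic R code; it is not the package's internal implementation. The helper cond_loglik(), assumed to return the predictive conditional log-likelihood of a fitted forest on a given data set, is hypothetical and stands in for what varimp() computes internally (per stratum, optionally restricted to out-of-bag observations).

```r
## Conceptual sketch of permutation variable importance (not the package code).
## `forest` is a fitted CLogitForest, `data` the training data, `variable` the
## name of one covariate, and `cond_loglik(forest, data)` a hypothetical helper
## returning the predictive conditional log-likelihood on `data`.
perm_importance <- function(forest, data, variable, cond_loglik) {
  ll_original <- cond_loglik(forest, data)                # accuracy with original data
  permuted <- data
  permuted[[variable]] <- sample(permuted[[variable]])    # break the variable's association
  ll_permuted <- cond_loglik(forest, permuted)            # accuracy after permutation
  ll_original - ll_permuted                               # large drop = important variable
}
```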
Confidence intervals

Conventional random forests are not able to provide single parameter estimates and accompanying confidence intervals, as they typically do not contain any global linear parameters. In the method proposed here, the random forest can contain a linear term representing the overall exposure effect. In explanatory analyses of matched case–control studies, this effect is of main interest for researchers. Accordingly, it is important to quantify not only the effect itself but also its uncertainty. We propose to quantify the uncertainty of the exposure effect via nonparametric bootstrap confidence intervals, adapting the concept of confidence intervals for CLogitTree to CLogitForest. The details of the procedure can be found in Schauberger et al. [9]. The main idea is to repeatedly apply the whole procedure of estimating a random forest to a large number of bootstrap samples [4] of the training data. From the resulting estimates of the exposure effect, quantiles can be deduced which represent the corresponding bootstrap confidence interval.

Implementation

The proposed method is implemented in R within the add-on package CLogitTree [16], publicly available from https://github.com/Schaubert/CLogitTree.

Main functions

The package contains both a function to fit conditional logistic regression trees (CLogitTree()) and a function to fit conditional logistic regression forests (CLogitForest()). Both algorithms can be run in parallel on several nodes. The most important parameters to choose in CLogitForest() are ntree (the number of trees) and mtry (the number of randomly selected candidate splitting variables). While the number of trees is less critical as long as it is chosen large enough, the choice of mtry can have a considerable effect on the performance [18]. Therefore, the user is offered an internal tuning procedure for the choice of mtry. Within this procedure, all possible values for mtry between 2 and p are cross-validated using a pre-defined number of trees. The predictive out-of-bag conditional likelihood is used as the optimality criterion.

Furthermore, the user can choose specific options for the trees which are fitted within the forest. In particular, there are arguments for the maximal depth of the trees, the minimal node size required for further splitting, and the minimum number of observations in any terminal node. For a deeper introduction to the different arguments typically used in random forests and further important aspects of training random forests, we refer to Boulesteix et al. [19].

Supporting functions

The most important supporting functions for CLogitForest() are varimp() for calculating and plotting the variable importance, as well as boot.ci() for calculating bootstrap confidence intervals for the exposure effect.

Inclusion of a linear offset

As described above, in case an exposure variable is defined, each tree is initialized with only this exposure effect as a linear term. However, the initial model can be extended. In the software implementation of CLogitForest, an option is offered to include the linear fit of all covariates as an offset before the trees are grown. This offset is the sum of the linear effects of all covariates from CLR but excludes the linear effect of the exposure variable.
By using this design, each tree is built upon the fit of CLR before the first split is performed. All potential further splits then only serve to account for non-linear effects or interactions which have not yet been captured by the linear fit.
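A hedged usage sketch is given below. The function names CLogitForest(), varimp() and boot.ci() and the tuning parameters ntree and mtry are taken from the description above; how the data, exposure, strata and offset are passed is not specified here and may differ from the package's actual interface, so the package documentation should be consulted.

```r
## Hedged usage sketch; all argument names other than ntree and mtry are assumptions.
# remotes::install_github("Schaubert/CLogitTree")
library(CLogitTree)

set.seed(42)
forest <- CLogitForest(
  # ... data, exposure, strata and offset arguments as documented in ?CLogitForest ...
  ntree = 500,   # number of trees; mainly needs to be chosen large enough
  mtry  = 3      # candidate split variables per split; can also be tuned internally
)

vi <- varimp(forest)   # permutation variable importance (out-of-bag version preferred)
plot(vi)               # plotting as supported by the package

ci <- boot.ci(forest)  # nonparametric bootstrap CI for the exposure effect beta
ci
```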
