Integrating CatBoost algorithm with triangulating feature importance to predict survival outcome in recurrent cervical cancer

The background information regarding the research tools and methods is covered in the sections that follow.

Dataset collection

This historical investigation was conducted in a tertiary teaching medical centre. Thorough epidemiological and therapeutic information was acquired from case reports collected between March 2012 and April 2018, and the investigation concluded on 1 March 2020. The dataset was built using an online database [30]. Of the 4913 patients affected with cervical cancer, 260 women had a recurrence. The dataset is broadly classified into 27 features relevant to the treatment and staging of the disease. Individuals receiving recurrence therapy are categorized by recurrence site as recurrence within the pelvic cavity, recurrence beyond the pelvic cavity, or recurrence both within and beyond the pelvic cavity.

Average/most frequent imputation method

As noted above, 260 of the 4913 cervical cancer patients experienced a recurrence, and the records of these 260 patients contain many missing values. Patients whose records have more than 80% missing values are deleted from the dataset; after deletion, the data of 160 patients with 27 features are retained for the research. Among these 160 patients and 27 features, some values are still missing, and they are imputed with the Average/Most frequent imputation method. This method substitutes missing entries with the mean value for numerical attributes and the most frequent value for categorical features. It is a straightforward replacement technique that fills the gaps using the data at hand [31]; however, the imputed values do not necessarily reflect the true underlying distribution. The following procedure imputes missing values of continuous characteristics using the "Average" approach. Suppose a dataset has a continuous feature \(\alpha\) in which specific values are missing (NaN). Compute the average of the values that are present in the feature column \(\alpha\), and replace the missing values in attribute \(\alpha\) with this average:
$$Average\,\alpha = \frac{Sum\ of\ all\ values\ present\ in\ attribute\ \alpha}{Number\ of\ values\ present\ in\ attribute\ \alpha}$$
(1)
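As an illustration, a minimal sketch of the "Average" step of Eq. (1) using pandas is shown below; the column name PR_OS and the toy values are placeholders for any numerical attribute of the dataset, not the study's actual data.

```python
import numpy as np
import pandas as pd

# Toy numerical column with missing entries (placeholder values, not real patient data).
df = pd.DataFrame({"PR_OS": [12.0, np.nan, 30.0, 18.0, np.nan]})

# Eq. (1): mean of the values that are present in the attribute.
mean_value = df["PR_OS"].mean(skipna=True)

# Replace every missing (NaN) entry with the computed mean.
df["PR_OS"] = df["PR_OS"].fillna(mean_value)
print(df)
```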
The "Most Frequent" imputation method fills in values that are absent for categorical characteristics as follows. Suppose the dataset has a categorical attribute \(\beta\) in which some values are not present (NaN). Find the category that occurs most frequently in attribute \(\beta\), denoted \(Mode\,\beta\), and replace the missing values in attribute \(\beta\) with this most common category. If the i-th value of attribute \(\beta\) is missing (NaN), then
$${\beta}_{i}=Mode\,\beta$$
(2)
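Similarly, a minimal sketch of the "Most Frequent" step of Eq. (2) is shown below, here using scikit-learn's SimpleImputer; the column name Therapeutic_effect and its values are illustrative assumptions for a categorical attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical column with missing entries (placeholder values, not real patient data).
df = pd.DataFrame({"Therapeutic_effect": ["CR", "PR", np.nan, "CR", np.nan]})

# Eq. (2): replace every missing value with Mode_beta, the most frequent category.
imputer = SimpleImputer(strategy="most_frequent")
df[["Therapeutic_effect"]] = imputer.fit_transform(df[["Therapeutic_effect"]])
print(df)
```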
In Eq. (2), \({\beta}_{i}\) is the i-th missing value of attribute \(\beta\) and \(Mode\,\beta\) is the most frequently occurring category of that attribute.

Advanced paradigms for feature elicitation: exploring ReliefF algorithm, FCBF, and information gain

Following the pre-processing of the data, feature selection is the most crucial step for the progress of the research. Feature selection is an essential stage in creating a classification system: it reduces the number of features in a given dataset and selects the most essential ones required for an accurate model prediction [32]. In this research, we discuss these feature selection algorithms, examine how they work on the given dataset, and select the approach that attains the highest accuracy for our model predictions.

The Relief technique assigns each attribute of the dataset a weight and updates these weights continuously. Attributes with strong weights should be chosen, while low-weight ones should be ignored. The technique loops through n random training samples (T), drawn without replacement. The step-by-step procedure of the ReliefF feature selection algorithm [33] is as follows; a brief code sketch of this weighting procedure is given after Eq. (3).

1. Load the recurrence cervical cancer dataset into a pandas DataFrame. It comprises 26 features, and the last column is the target variable.
2. Take n random samples as input and initialize the weights (W) of all attributes (X) to zero.
3. Iterate over the n randomly chosen training samples. For a randomly chosen target instance (T), find its closest hit (I) and closest miss (M).
4. For every attribute (A), determine the difference between the attribute values of T and I, and the difference between those of T and M.
5. Update the weight (W) of each attribute (A) by adding the difference between T and M and subtracting the difference between T and I, both divided by m.
6. Repeat the procedure for each of the n training samples.
7. Normalize the attribute weights and rank the attributes by weight; greater weights denote attributes that are more crucial for separating the classes.

The best-ranked 5 features are selected from the given dataset of 26 attributes, and the accuracy value increases gradually.

Information gain is a crucial metric for ranking characteristics. Since entropy is a criterion of impurity in the training dataset S, we can define a metric representing the new information about \(\beta\) provided by \(\alpha\), that is, the amount by which the overall entropy of \(\beta\) decreases. This is known as information gain [34], and the association between \(\alpha\) and \(\beta\) is formulated as
$$Corr\left(\alpha,\beta\right)=\frac{P(\alpha \cup \beta)}{P\left(\alpha\right)P(\beta)}$$
(3)
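As mentioned above, a minimal sketch of the ReliefF weighting procedure is given here. It assumes numeric features scaled to a common range, uses toy arrays rather than the study's actual data, and the function name relief_weights and the choice of Manhattan distance are illustrative assumptions.

```python
import numpy as np

def relief_weights(X, y, n_samples=20, rng=None):
    """Relief-style attribute weights: reward separation from the nearest miss,
    penalize distance to the nearest hit (features assumed numeric and scaled)."""
    rng = np.random.default_rng(rng)
    m, n_features = X.shape
    W = np.zeros(n_features)
    for idx in rng.choice(m, size=min(n_samples, m), replace=False):
        target = X[idx]
        same = np.where(y == y[idx])[0]
        diff = np.where(y != y[idx])[0]
        same = same[same != idx]
        if len(same) == 0 or len(diff) == 0:
            continue
        hit = X[same[np.argmin(np.abs(X[same] - target).sum(axis=1))]]
        miss = X[diff[np.argmin(np.abs(X[diff] - target).sum(axis=1))]]
        # Weight update: + diff(T, M) / n_samples  -  diff(T, I) / n_samples per attribute.
        W += (np.abs(target - miss) - np.abs(target - hit)) / n_samples
    return W

# Toy example: 6 patients, 3 numeric features, binary survival status.
X = np.array([[0.1, 0.9, 0.2], [0.2, 0.8, 0.1], [0.9, 0.1, 0.7],
              [0.8, 0.2, 0.9], [0.15, 0.85, 0.3], [0.85, 0.15, 0.8]])
y = np.array([0, 0, 1, 1, 0, 1])
print(relief_weights(X, y, n_samples=6, rng=0))
```

Attributes would then be ranked by the returned weights, and the top-ranked ones retained.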
Information gain can aid in locating the most pertinent characteristics that contribute considerably to forecasting. The procedure for information gain is as follows. Load the recurrence cervical cancer dataset, which consists of 26 features, and set survival (whether the patient dies or remains alive) as the target variable. Then calculate the entropy of the target feature "STATUS" in the given dataset to understand how uncertain the class distribution is. The important features selected by the ReliefF technique are listed in Table 1.

Table 1 Important features selected by ReliefF.

The mathematical representation of entropy is
$$Entropy\left(\beta\right)=-\sum_{i} P({X}_{i})\,{\log}_{2} P({X}_{i})$$
(4)
In Eq. (4), P(\({X}_{i}\)) is the proportion of cases in the dataset that belong to class \({X}_{i}\). To calculate information gain, determine it for every attribute (PR-OS, date of first recurrence diagnosis, PR-PFS, therapeutic effect, FIGO 2009 staging, etc.) by considering its capacity to forecast the target value. Information gain is calculated as
$$Information\,Gain\left(N,\alpha\right)=Entropy\left(\beta\right)-\sum_{i}\left(\frac{\left|{N}_{i}\right|}{\left|N\right|}\, Entropy(\beta \mid \alpha={W}_{i})\right)$$
(5)
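A minimal sketch of Eqs. (4) and (5), computing the entropy of the target "STATUS" and the information gain of a candidate feature, is shown below; the small example table and its column names are assumptions for illustration only, not the study's data.

```python
import numpy as np
import pandas as pd

def entropy(series):
    """Eq. (4): Entropy(beta) = -sum_i P(x_i) * log2 P(x_i)."""
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(df, feature, target="STATUS"):
    """Eq. (5): Entropy(target) - sum_i (|N_i| / |N|) * Entropy(target | feature = W_i)."""
    total = entropy(df[target])
    weighted = sum(
        (len(group) / len(df)) * entropy(group[target])
        for _, group in df.groupby(feature)
    )
    return total - weighted

# Illustrative toy data (not the study's dataset).
df = pd.DataFrame({
    "Therapeutic_effect": ["CR", "CR", "PD", "PD", "PR", "CR"],
    "STATUS": ["alive", "alive", "dead", "dead", "dead", "alive"],
})
print(entropy(df["STATUS"]))
print(information_gain(df, "Therapeutic_effect"))
```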
In Eq. (5), \(\left|{N}_{i}\right|\) is the number of instances in the dataset for which feature \(\alpha\) takes the value \({W}_{i}\), and \(\left|N\right|\) is the total number of instances in the given recurrent cervical cancer dataset. With the target feature fixed, information gain identifies the most important and most uncertain features, and the features with the highest information gain scores are selected. The entropy of the target variable "Status" is determined, the conditional entropy is computed for the features present in the dataset (PR-OS, date of first recurrence diagnosis, PR-PFS, therapeutic effect, FIGO 2009 staging, etc.), and the information gain of every attribute is then calculated by summing the weighted conditional entropies over the distinct attribute values and subtracting the result from the entropy of the target variable. Attributes with the highest information gain scores are selected [35]. The best 5 features are selected to forecast the prognostic fate of individuals with recurrent cervical carcinoma. The best-ranked 5 features selected from the given dataset of 26 attributes using information gain as the feature selection algorithm are listed in Table 2; these scored features are selected to increase the model's accuracy.

Table 2 Important characteristics chosen by information gain.

Fast correlation-based feature selection (FCBF) [36] assesses a feature's usefulness for classification. An attribute is chosen only if it is considered good and applicable to the field of study and does not duplicate any other applicable characteristic. The relationship between pairs of characteristics is calculated, and the important attributes chosen from the dataset are those strongly associated with the class rather than with each other. Fast correlation-based feature selection can also address dimensionality problems. The FCBF procedure is as follows. Load the recurrence cervical cancer dataset, which consists of 26 features, with survival (the patient's death or life) as the target variable. The target variable is fixed as "status", as we predict the survival outcome after a disease recurrence. For every attribute, determine the symmetrical uncertainty (SU). As there are 26 features in the pre-processed dataset (such as PR-OS, date of first recurrence diagnosis, PR-PFS, therapeutic effect, FIGO 2009 staging, disease progression once again, etc.), the crucial attributes must be selected based on the score value. The symmetrical uncertainty of each of the 26 features with respect to the target variable is calculated, and the best features are selected using
$$SU\left(\alpha,T\right)=2\left(\frac{MI\left(\alpha,T\right)}{H\left(\alpha\right)+H\left(T\right)}\right)$$
(6)
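A minimal sketch of the symmetrical uncertainty in Eq. (6) is shown below. It relies on scikit-learn's mutual_info_score (which returns values in nats, hence the conversion to bits) and uses illustrative toy labels rather than the study's data; the helper names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

def entropy_bits(series):
    """Shannon entropy in bits of a discrete series."""
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def symmetrical_uncertainty(feature, target):
    """Eq. (6): SU(alpha, T) = 2 * MI(alpha, T) / (H(alpha) + H(T))."""
    mi = mutual_info_score(feature, target) / np.log(2)  # convert nats to bits
    denom = entropy_bits(feature) + entropy_bits(target)
    return 2.0 * mi / denom if denom > 0 else 0.0

# Illustrative toy columns (not the study's dataset).
figo_stage = pd.Series(["IB", "IIA", "IIIB", "IIIB", "IB", "IIA"])
status = pd.Series(["alive", "alive", "dead", "dead", "alive", "dead"])
print(symmetrical_uncertainty(figo_stage, status))
```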
In Eq. (6), \(MI\left(\alpha,T\right)\) is the mutual information between feature \(\alpha\) and the target variable T, where \(\alpha\) denotes any of the 26 features of the recurrent cervical cancer dataset and the target variable is "status"; \(H\left(\alpha\right)\) is the entropy of feature \(\alpha\), and \(H\left(T\right)\) is the entropy of the target "status". Next, set the threshold for FCBF: select a threshold value for symmetrical uncertainty. Features whose SU exceeds this cut-off are considered meaningful and are picked for additional investigation. For the relevant features chosen in this thresholding step, the joint mutual information (JMI) is then calculated. For any pertinent characteristic \(\alpha\),
$$JMI\left(\alpha,T\right)=MI\left(\alpha,T\right)-\frac{1}{\left|F\right|-1}\sum_{i} MI(\alpha,{\alpha}_{i})$$
(7)
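To illustrate the redundancy-corrected ranking in Eq. (7), a minimal sketch follows. It assumes the SU thresholding has already produced a small set of relevant categorical features; the column names and the helper jmi_score are placeholders, not the study's implementation.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

def jmi_score(df, feature, target, relevant_features):
    """Eq. (7): MI(alpha, T) - (1 / (|F| - 1)) * sum_i MI(alpha, alpha_i)."""
    relevance = mutual_info_score(df[feature], df[target])
    others = [f for f in relevant_features if f != feature]
    redundancy = sum(mutual_info_score(df[feature], df[o]) for o in others)
    return relevance - redundancy / max(len(relevant_features) - 1, 1)

# Illustrative toy data; columns are placeholders for the SU-selected features.
df = pd.DataFrame({
    "FIGO_stage": ["IB", "IIA", "IIIB", "IIIB", "IB", "IIA"],
    "Therapeutic_effect": ["CR", "CR", "PD", "PD", "PR", "CR"],
    "STATUS": ["alive", "alive", "dead", "dead", "dead", "alive"],
})
relevant = ["FIGO_stage", "Therapeutic_effect"]
ranking = sorted(relevant, key=lambda f: jmi_score(df, f, "STATUS", relevant), reverse=True)
print(ranking)
```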
In Eq. (7), \(MI\left(\alpha,T\right)\) is the mutual information between attribute \(\alpha\) and the target variable "status", \(\left|F\right|\) is the total number of features, \(\sum_{i} MI(\alpha,{\alpha}_{i})\) is the sum of the mutual information between feature \(\alpha\) and every other relevant feature \({\alpha}_{i}\), and \({\alpha}_{i}\) denotes the relevant features other than \(\alpha\). The attribute with the highest JMI is chosen: the features with the highest JMI are the most useful for predicting the survival outcome after recurrence. The 5 ranked features selected from the given dataset of 26 attributes using FCBF as the feature selection algorithm are listed in Table 3; these scored features are selected to increase the model's accuracy.

Table 3 Important features selected by FCBF.

Triangulating feature importance

By analyzing the three algorithms above, ReliefF, information gain, and FCBF, in detail, we selected the essential features based on the available scores. Selecting the essential attributes from the given data increases the accuracy and performance of the model. A total of 8 features are selected from the 26 features of the given dataset by correlating all three algorithms; this is a novel approach to feature selection. The essential features selected from the tables referred to above are listed in Table 4.

Table 4 Triangulating feature importance from ReliefF, IG, and FCBF.
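A minimal sketch of how such a triangulation step could be implemented is shown below; the top-ranked feature lists and the simple vote-counting rule are illustrative assumptions, not the study's exact selection rule or its actual rankings.

```python
# Hypothetical top-5 rankings from the three selectors (placeholder feature names from the text).
relieff_top = ["PR-OS", "PR-PFS", "Therapeutic effect", "FIGO 2009 staging", "Date of first recurrence diagnosis"]
infogain_top = ["PR-OS", "Date of first recurrence diagnosis", "PR-PFS", "Therapeutic effect", "FIGO 2009 staging"]
fcbf_top = ["PR-OS", "Therapeutic effect", "Disease progression once again", "PR-PFS", "FIGO 2009 staging"]

# Triangulation: keep every feature that at least one selector ranks highly,
# ordered by how many of the three selectors agree on it.
votes = {}
for ranking in (relieff_top, infogain_top, fcbf_top):
    for feature in ranking:
        votes[feature] = votes.get(feature, 0) + 1

triangulated = sorted(votes, key=lambda f: votes[f], reverse=True)
print(triangulated)
```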
