A novel embedded kernel CNN-PCFF algorithm for breast cancer pathological image classification

In this section, the theories underlying the proposed algorithm are discussed. Section 3.1 gives a brief introduction to the basic concepts of kernel methods, and Sect. 3.2 presents two typical applications of kernel methods, namely dimensionality reduction and mapping-based classification.

Kernel method

The basic idea of kernel methods is to map samples into a high-dimensional feature space in which a linearly inseparable dataset becomes linearly separable. Kernel functions play a crucial role in this process and are widely used in tasks such as classification, clustering, and dimensionality reduction, where they have achieved significant success30,31. The reason is that an appropriately chosen kernel function scatters dissimilar samples while clustering similar samples together.

Let X be the input space and H be a reproducing kernel Hilbert space. If there exists a mapping \(\phi :X \to H\) such that
$$\kappa (x,z)=<\phi (x),\phi (z)>,\quad \forall x,z \in X,$$
(1)
then \(\kappa (x,z)\) is called a kernel function, where \(< \cdot ,\, \cdot >\) denotes the inner product. The mapping \(\phi (x)\) is generally unknown; to avoid computing inner products in the high-dimensional feature space, kernel functions with known closed forms are introduced. Some commonly used kernel functions are as follows.

(i)

Polynomial kernel: \(\kappa (x,z)={(<x,z>+c)^d},\;d \geqslant 1\)

(ii)

Hyperbolic tangent kernel: \(\kappa (x,z)=\tanh ({c_1}<x,z>+{c_2})\)

(iii)

Gaussian kernel: \(\kappa (x,z)=\exp ( - \gamma {\left\| {x - z} \right\|^2}),\;\gamma >0\)

(iv)

Laplacian kernel: \(\kappa (x,z)=\exp ( - \gamma \left\| {x - z} \right\|),\;\gamma >0\)
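For concreteness, the four kernels listed above can be evaluated directly on a pair of feature vectors. The following is a minimal Python/NumPy sketch; the parameter values (c, d, c1, c2, and gamma) are illustrative assumptions, not values prescribed in this paper.

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=3):
    # (i) Polynomial kernel: (<x, z> + c)^d with d >= 1.
    return (np.dot(x, z) + c) ** d

def tanh_kernel(x, z, c1=0.01, c2=0.0):
    # (ii) Hyperbolic tangent kernel: tanh(c1 * <x, z> + c2).
    return np.tanh(c1 * np.dot(x, z) + c2)

def gaussian_kernel(x, z, gamma=0.5):
    # (iii) Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2), gamma > 0.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def laplacian_kernel(x, z, gamma=0.5):
    # (iv) Laplacian kernel: exp(-gamma * ||x - z||), gamma > 0.
    return np.exp(-gamma * np.linalg.norm(x - z))

# Toy usage: similarity between two random feature vectors.
rng = np.random.default_rng(0)
x, z = rng.normal(size=16), rng.normal(size=16)
print(gaussian_kernel(x, z), polynomial_kernel(x, z))
```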

Although there are various types of kernel functions, they can mainly be categorized into two groups: kernels based on inner products and kernels based on p-norm distances. Kernel functions thus capture the similarity between data points. As the experimental analysis in Sect. 4 shows, different forms of kernel functions play markedly different roles in feature fusion and classification tasks.

Kernel function selection

Different kernel functions have a significant impact on the performance of kernel methods, so choosing an appropriate kernel function for a given scenario is crucial. The following principles can generally be followed. (i) Understand the characteristics of the dataset: examine the data to be processed, including its distribution, dimensionality, and whether it is linearly separable. (ii) Make a preliminary selection: based on these characteristics, choose one or several potentially applicable kernel functions. For example, if the data is approximately linearly separable, a linear kernel can be tried first; a polynomial kernel suits data with nonlinear relationships, and its degree controls model complexity; a Gaussian (RBF) kernel suits complex nonlinear data, and its parameter controls the width of the function. (iii) Tune the parameters: for the selected kernel function, use techniques such as cross-validation and grid search to find the optimal parameters. (iv) Evaluate the model: train the model with different kernel functions and parameter settings, and assess performance with appropriate metrics (such as accuracy and recall).

Computational complexity also needs to be considered; for example, polynomial and Gaussian kernels can be very time-consuming on large datasets. Strictly speaking, there are no fixed rules for kernel selection, and experiments and adjustments are usually needed based on the specific problem, domain knowledge, and dataset characteristics. In practice, multiple kernel functions and parameter combinations may have to be tried to find the best model configuration. A key advantage of kernel methods is that they allow complex nonlinear transformations in high-dimensional space without explicitly computing the mapped features, which is particularly useful in KPCA for dimensionality reduction and feature extraction.

Another commonly used approach is multi-kernel learning, which selects several high-performing kernel functions from a pool of candidates, or combines different kernel functions, to improve interpretability, dimensionality reduction, and prediction performance. The general expression of a multi-kernel function is
$${\kappa _\eta }(x,z)={f_\eta }(\{ {\kappa _i}({x^{i}},{z^{i}})\} _{i=1}^{m}\,|\,\eta ),$$
(2)
where \({f_\eta }:{R^p} \to R\) is a combination function, which can be linear or nonlinear, and the parameter \(\eta\) determines the form of the kernel combination. The most common way to combine kernels is a convex weighted linear combination32:
$$\kappa (x,z)=\sum\limits_{i=1}^{m} {\eta _i}{\kappa _i}(x,z),\quad {\eta _i} \geqslant 0,\;\sum\limits_{i=1}^{m} {\eta _i} =1.$$
(3)
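As an illustration of Eq. (3), the following minimal sketch builds two base kernel matrices with scikit-learn (Gaussian and polynomial, chosen here only as an example) and fuses them with convex weights. The weights and kernel parameters are placeholders; in practice they would be tuned, for instance by the cross-validation discussed above or by the weight-selection approaches mentioned next.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

def multi_kernel_matrix(X, eta=(0.6, 0.4), gamma=0.5, degree=3):
    # Convex combination of base kernel matrices, Eq. (3):
    # K = sum_i eta_i * K_i, with eta_i >= 0 and sum(eta_i) = 1.
    eta = np.asarray(eta, dtype=float)
    eta = eta / eta.sum()                         # enforce the convexity constraint
    K_gauss = rbf_kernel(X, gamma=gamma)          # Gaussian base kernel matrix
    K_poly = polynomial_kernel(X, degree=degree)  # polynomial base kernel matrix
    return eta[0] * K_gauss + eta[1] * K_poly

X = np.random.default_rng(0).normal(size=(100, 20))  # 100 samples, 20 features
K = multi_kernel_matrix(X)
print(K.shape)  # (100, 100) fused kernel matrix
```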
The weights \({\eta _i}\) are mainly selected with heuristic algorithms33,34 or optimization solvers35,36.

Typical applications of kernel methods

The main applications of kernel methods include integrating kernel functions into PCA to achieve dimensionality reduction for nonlinear, non-Gaussian data, and embedding kernel functions into SVM to separate linearly inseparable data effectively. Selecting and constructing suitable kernel functions is therefore the key to applying kernel methods successfully.

Kernel principal component analysis

When the data are non-normally distributed and nonlinearly related, PCA often spreads the contribution rate too thinly across principal components, so no principal component effectively represents the original samples37. Kernel principal component analysis (KPCA) is a nonlinear feature fusion technique based on the kernel method: it maps the samples from the original space to a feature space, where principal component analysis is then conducted. The basic principle of KPCA is as follows.

The original sample data matrix is \(X={({x_{ij}})_{n \times p}},\;i=1,2,\ldots,n;\;j=1,2,\ldots,p\), where \({x_i}=({x_{i1}},{x_{i2}},\ldots,{x_{ip}})^{\prime}\) is the \({i^{th}}\) sample, p is the feature dimension, and n is the sample size. A nonlinear mapping \(z=\phi (x)\) maps a sample from the low-dimensional space to a high-dimensional feature space H, i.e., \({x_i} \in {R^p} \to \phi ({x_i}) \in H\). Let \({z_i}=\phi ({x_i})\) be the image of \({x_i}\) in the high-dimensional feature space; its covariance matrix is
$$\Sigma =\sum\limits_{i=1}^{n} {z_i}z_{i}^{T}.$$
(4)
The KPCA problem can be formulated as
$$\left\{ \begin{gathered} {w_j}=\omega _{j}^{T}\phi (x) \hfill \\ s.t.\;\left( {\sum\limits_{i=1}^{n} {z_i}z_{i}^{T} } \right){\omega _j}={\lambda _j}{\omega _j},\;j=1,2,\ldots,n \hfill \\ \end{gathered} \right.$$
(5)
where \({\omega _j}\) is the eigenvector corresponding to the \({j^{th}}\) largest eigenvalue of \(\Sigma\), and \({w_j}\) is the \({j^{th}}\) coordinate of the original sample x in the feature space, i.e., the \({j^{th}}\) kernel principal component. According to Eq. (5), obtaining the kernel principal components \({w_j}\) seemingly requires computing the vector \({\omega _j}\) and knowing \(\phi (x)\) explicitly. The kernel principal components are derived as follows:
$$\left( {\sum\limits_{i=1}^{n} {z_i}z_{i}^{T} } \right){\omega _j}=\left( {\sum\limits_{i=1}^{n} \phi ({x_i}){\phi ^T}({x_i}) } \right){\omega _j}={\lambda _j}{\omega _j},$$
(6)
$${\omega _j}=\frac{1}{{{\lambda _j}}}\left( {\sum\limits_{{i=1}}^{n} {\phi ({x_i}){\phi ^T}({x_i})} } \right){\omega _j}=\sum\limits_{{i=1}}^{n} {\phi ({x_i})} \alpha _{i}^{j},$$
(7)
where \(\alpha _{i}^{j}=\frac{{{\phi ^T}({x_i}){\omega _j}}}{{{\lambda _j}}}\). By substituting Eq. (7) into Eq. (6), we obtain
$$\left( {\sum\limits_{i=1}^{n} \phi ({x_i}){\phi ^T}({x_i}) } \right)\sum\limits_{i=1}^{n} \phi ({x_i}) \alpha _{i}^{j}={\lambda _j}\sum\limits_{i=1}^{n} \phi ({x_i}) \alpha _{i}^{j}.$$
(8)
Multiplying both sides of Eq. (8) on the left by \({\Phi ^T}(X)={(\phi ({x_1}),\phi ({x_2}),\ldots,\phi ({x_n}))^T}\) gives
$${\Phi ^T}(X)\Phi (X){\Phi ^T}(X)\Phi (X){\alpha ^j}={\lambda _j}{\Phi ^T}(X)\Phi (X){\alpha ^j},$$
where \(\Phi (X)=(\phi ({x_1}),\phi ({x_2}),\ldots,\phi ({x_n}))\) and \({\alpha ^j}={(\alpha _{1}^{j},\ldots,\alpha _{n}^{j})^T}\). Since
$${\Phi ^T}(X)\Phi (X)=\left( {\begin{array}{*{20}{c}} {{\phi ^T}({x_1})\phi ({x_1})}&{{\phi ^T}({x_1})\phi ({x_2})}& \cdots &{{\phi ^T}({x_1})\phi ({x_n})} \\ {{\phi ^T}({x_2})\phi ({x_1})}&{{\phi ^T}({x_2})\phi ({x_2})}& \cdots &{{\phi ^T}({x_2})\phi ({x_n})} \\ \vdots & \vdots &{}& \vdots \\ {{\phi ^T}({x_n})\phi ({x_1})}&{{\phi ^T}({x_n})\phi ({x_2})}& \cdots &{{\phi ^T}({x_n})\phi ({x_n})} \end{array}} \right)={\left( {\kappa ({x_i},{x_j})} \right)_{n \times n}},$$
the above equation can be simplified to
$${\rm K}{\alpha ^j}={\lambda _j}{\alpha ^j},$$
(9)
where \({\rm K}={\left( {\kappa ({x_i},{x_j})} \right)_{n \times n}}\) is the kernel matrix of \(\kappa ({x_i},{x_j})\), and \({\alpha ^j}={(\alpha _{1}^{j},\alpha _{2}^{j},\ldots,\alpha _{n}^{j})^T}\) is the eigenvector corresponding to the \({j^{th}}\) largest eigenvalue \({\lambda _j}\) of \({\rm K}\). Substituting Eq. (7) into the objective function in Eq. (5) gives
$$\begin{gathered} {w_j}=\omega _{j}^{T}\phi (x) \hfill \\ \quad\;=\sum\limits_{i=1}^{n} \alpha _{i}^{j}{\phi ^T}({x_i}) \phi (x) \hfill \\ \quad\;=\sum\limits_{i=1}^{n} \alpha _{i}^{j}\kappa ({x_i},x). \hfill \\ \end{gathered}$$
(10)
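The derivation in Eqs. (9) and (10) can be realized directly: eigendecompose the kernel matrix K, then project a sample through the kernel evaluations \(\kappa ({x_i},x)\). The sketch below is a minimal illustration with a Gaussian kernel; the kernel centering step used by standard KPCA implementations is omitted here to mirror the derivation above, and the scaling of \({\alpha ^j}\) follows the common unit-norm convention rather than anything stated in the text.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kpca_fit(X, n_components=5, gamma=0.5):
    # Eq. (9): eigendecompose the kernel matrix K = (kappa(x_i, x_j))_{n x n}.
    K = rbf_kernel(X, gamma=gamma)
    eigvals, eigvecs = np.linalg.eigh(K)            # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]  # keep the d largest, d << n
    alphas = eigvecs[:, idx]
    # Scale alpha^j so the corresponding omega_j has unit norm (common convention).
    alphas = alphas / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return alphas

def kpca_transform(X_new, X_train, alphas, gamma=0.5):
    # Eq. (10): w_j = sum_i alpha_i^j * kappa(x_i, x) for each new sample x.
    k = rbf_kernel(X_new, X_train, gamma=gamma)
    return k @ alphas

X = np.random.default_rng(0).normal(size=(100, 20))
alphas = kpca_fit(X)
W = kpca_transform(X, X, alphas)    # (100, 5) matrix of kernel principal components
```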
In practical tasks, it is generally sufficient to retain the first d (\(d \ll n\)) kernel principal components. Equation (10) shows that, with a known kernel function, the explicit expression of \(\phi (x)\) is not needed to obtain the fused features \({w_j}\). Based on Eq. (3), Eq. (10) can be generalized to the fused multi-kernel principal components
$$\begin{gathered} {w_j}=\omega _{j}^{T}\phi (x) \hfill \\ \quad\;=\sum\limits_{i=1}^{n} \alpha _{i}^{j}\sum\limits_{k=1}^{m} {\eta _k}{\kappa _k}({x_i},x). \hfill \\ \end{gathered}$$
(11)
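Analogously, Eq. (11) changes only the kernel matrix: the single kernel is replaced by the convex combination of Eq. (3) before the same eigendecomposition and projection are applied. A minimal sketch, assuming a Gaussian-plus-polynomial combination with arbitrary placeholder weights, is as follows.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

X = np.random.default_rng(0).normal(size=(100, 20))

# Fused kernel matrix per Eq. (3); the weights 0.6/0.4 are arbitrary placeholders.
K = 0.6 * rbf_kernel(X, gamma=0.5) + 0.4 * polynomial_kernel(X, degree=3)

# Same eigendecomposition and projection as in the single-kernel sketch (Eqs. 9-10).
eigvals, eigvecs = np.linalg.eigh(K)
idx = np.argsort(eigvals)[::-1][:5]
alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
W_multi = K @ alphas    # fused multi-kernel principal components, Eq. (11)
```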
Multi-kernel SVM

Support vector machine (SVM) is a discriminative algorithm based on the principle of structural risk minimization, designed for binary or multiclass classification. Its basic model is a maximum-margin linear classifier defined in the feature space; by introducing kernel functions, SVM becomes an effective nonlinear classifier. The basic principles of kernel SVM are as follows.

Given a training set \(T = \{ ({{\mathbf{x}}_i},{y_i})\left| {{{\mathbf{x}}_i} \in {R^p},{y_i} \in \{ +1, -1\} } \right.,\;i=1,2,\ldots,n\}\), where n is the sample size and \({y_i}\) is the class label of \({{\mathbf{x}}_i}\), SVM can be formulated as the following convex quadratic programming problem:
$$\begin{gathered} \mathop {\min }\limits_{\omega ,b,\xi } \frac{1}{2}{\left\| \omega \right\|^2}+C\sum\limits_{i=1}^{n} {\xi _i} \hfill \\ s.t.\;{y_i}\left[ {{\omega ^T}\phi ({{\mathbf{x}}_i})+b} \right] \geqslant 1 - {\xi _i},\;{\xi _i} \geqslant 0,\;i=1,2,\ldots,n. \hfill \\ \end{gathered}$$
(12)
where \(\omega\) is the normal vector of the classification hyperplane, determining its direction; b is the bias term, determining the distance between the hyperplane and the origin; C is the penalty coefficient, with larger values imposing a higher cost on misclassifications; and the \({\xi _i}\) are slack variables measuring how far samples violate the margin. The objective of SVM is to maximize the margin. In practice, the dual form of model (12) is generally solved:
$$\begin{gathered} \mathop {\min }\limits_{\alpha } \frac{1}{2}\sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} {\alpha _i}{\alpha _j}{y_i}{y_j} \phi ({{\mathbf{x}}_i}) \cdot \phi ({{\mathbf{x}}_j}) - \sum\limits_{i=1}^{n} {\alpha _i} \hfill \\ =\mathop {\min }\limits_{\alpha } \frac{1}{2}\sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} {\alpha _i}{\alpha _j}{y_i}{y_j} \kappa ({{\mathbf{x}}_i},{{\mathbf{x}}_j}) - \sum\limits_{i=1}^{n} {\alpha _i} \hfill \\ s.t.\;\sum\limits_{i=1}^{n} {\alpha _i}{y_i} =0,\;0 \leqslant {\alpha _i} \leqslant C,\;i=1,2,\ldots,n. \hfill \\ \end{gathered}$$
(13)
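In practice the dual problem (13) is rarely solved by hand; off-the-shelf solvers such as scikit-learn's SVC handle the quadratic programming internally. The brief sketch below fits a Gaussian-kernel SVM on synthetic data; the kernel choice, the labels, and the values of C and gamma are illustrative assumptions, not settings from this paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = np.where(X[:, 0] + X[:, 1] ** 2 > 1.0, 1, -1)   # toy labels in {+1, -1}

# C is the penalty coefficient; gamma is the Gaussian kernel width parameter.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print(clf.n_support_)      # number of support vectors per class
print(clf.intercept_)      # bias term b
```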
Based on the support vectors in the training set, b can be computed as
\(b=\frac{1}{{n^{\prime}}}\sum\limits_{s=1}^{{n^{\prime}}}\left( {{y_s} - \sum\limits_{i=1}^{n} {\alpha _i}{y_i}\kappa ({{\mathbf{x}}_i},{{\mathbf{x}}_s}) } \right),\)
where the \({{\mathbf{x}}_s}\) are the support vectors and \(n^{\prime}\) is their number. Replacing the single kernel function in Eq. (13) with formula (3) yields the multi-kernel SVM
$$\begin{gathered} \mathop {\min }\limits_{\alpha } \frac{1}{2}\sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} {\alpha _i}{\alpha _j}{y_i}{y_j} \left( {\sum\limits_{k=1}^{m} {\eta _k}{\kappa _k}({{\mathbf{x}}_i},{{\mathbf{x}}_j}) } \right) - \sum\limits_{i=1}^{n} {\alpha _i} \hfill \\ s.t.\;\sum\limits_{i=1}^{n} {\alpha _i}{y_i} =0,\;0 \leqslant {\alpha _i} \leqslant C,\;i=1,2,\ldots,n. \hfill \\ \end{gathered}$$
(14)
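The multi-kernel SVM of Eq. (14) can likewise be emulated with a standard solver by passing a precomputed combined kernel matrix built from Eq. (3). The sketch below uses scikit-learn's precomputed-kernel interface with an assumed Gaussian-plus-polynomial combination and arbitrary placeholder weights; the data and labels are synthetic.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel
from sklearn.model_selection import train_test_split

def combined_kernel(A, B, eta=(0.6, 0.4), gamma=0.5, degree=3):
    # Weighted sum of base kernels evaluated between sample sets A and B, Eq. (3).
    return (eta[0] * rbf_kernel(A, B, gamma=gamma)
            + eta[1] * polynomial_kernel(A, B, degree=degree))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)          # toy labels in {+1, -1}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
K_tr = combined_kernel(X_tr, X_tr)                  # (n_train, n_train) for fitting
K_te = combined_kernel(X_te, X_tr)                  # (n_test, n_train) for prediction

clf = SVC(kernel="precomputed", C=1.0).fit(K_tr, y_tr)
print("test accuracy:", clf.score(K_te, y_te))
```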
