Classification and prediction of drought and salinity stress tolerance in barley using GenPhenML

Data preparation
The phenotype and genotype features of barley lines were determined from their agronomic characteristics under saline and drought conditions. For stress score prediction, 1236 data samples were collected from barley lines and divided randomly into training and test datasets containing 70% and 30% of the data, respectively. For stress tolerance classification, 1128 data samples were divided randomly into training (70%) and test (30%) datasets. In the greenhouse of Gonbad Kavous University, 103 lines of F8 families derived from the cross between Badia and Kavir were examined in a completely randomized design with three replications. Planting was done in pots with a 5-kg soil capacity, with seven seedlings per line. The population was developed to provide the plant genetic materials under Gonbad Kavous University's license, and all methods were performed in accordance with the relevant guidelines and regulations. Table 6 shows some physical and chemical features of the soil.
Table 6 Soil physical and chemical properties of the experiment site (0–30 cm depth).
Drought stress was applied during the reproductive stage at a moisture content of 0.8 of field capacity, equal to 20% moisture by weight. Irrigation was performed every other week, and the moisture level was allowed to fall to 9% by weight. The soil moisture level was adjusted by measuring the amount of moisture lost and compensating with water back to 20%. Salinity stress was applied during the reproductive stage by irrigation with a sodium chloride source of 16 dS m−1. Weekly measurements of the salinity of the saturated extract in the pots showed a weekly increase, reaching 10–17 dS m−1. The saturated extract was prepared by pouring 150 g of potting soil into a plastic bucket, adding distilled water, and mixing until the surface of the paste glistened. For the phenotyping measurements, 15 competing plants of each line were measured and their average was used in the analysis. Phenotype scores were measured according to the protocols recommended by Chang and Yoshida14,15; the measurement instructions are provided in Tables 7 and 8.
Table 7 Instructions for drought tolerance.
Table 8 Instructions for salinity stress tolerance.
The genotyping analysis was performed using a crude DNA preparation. A single leaf was collected into a labeled 1.5 ml centrifuge tube and kept briefly on ice. The leaf sample was macerated in 400 μl of extraction buffer (50 mM Tris–HCl, pH 8.0, 2.5 mM EDTA, 300 mM NaCl, and 1% SDS) and ground until the buffer turned green. Another 400 μl of extraction buffer was then added and mixed by pipetting. The contents were centrifuged at 12,000 g for 10 min in a microcentrifuge. About 400 μl of lysate was extracted with 400 μl of chloroform, and the upper supernatant was transferred to another 1.5 ml tube, where the DNA was precipitated with absolute ethanol. The contents were centrifuged for three minutes at full speed and the supernatant was discarded. The pellet was rinsed with 70% ethanol and the DNA was dried before being resuspended in 50 μl of TE buffer (10 mM Tris–HCl, pH 8.0, 1 mM EDTA, pH 8.0). An aliquot of the solution was used for PCR analysis and the remainder was stored at –20 °C.
For the marker analysis, 365 SSR markers evenly distributed over the seven barley chromosomes were screened16. The DNA of each line was then amplified using the SSR primers that exhibited polymorphism.
The PCR was performed in a thermocycler (iCycler, BIO-RAD, USA) with 50 ng of template DNA in a 15 μl reaction mixture containing 0.67 M primers, 10 μl reaction buffer, 2.5 mM MgCl2, 0.2 mM dNTPs and 0.5 U Taq polymerase. Cycling consisted of an initial denaturation at 94 °C for 5 min; 30 cycles of denaturation at 94 °C for 1 min, annealing at 58 °C for 1 min and elongation at 72 °C for 1.5 min; a final extension at 72 °C for 5 min; and storage in a refrigerator at 4 °C. The final products were separated and visualized by 6% polyacrylamide gel electrophoresis with silver staining. ISSR, iPBS, IRAP, SCoT and CAAT markers were employed for the parental investigation. When a band was amplified in the first parent, scores of 1 and 3 were recorded for the presence and absence of the band, respectively; scores of 2 and 4 were used when the band was amplified in the second parent.
Phenotype and genotype features
The phenotype data include 15 phenotype features obtained from each plant by direct measurement. The genotype features consist of 719 molecular markers determined by genetic measurements. These features were used for the prediction of salinity and drought stress. Three FS algorithms (ReliefF, MRMR and F-test) were deployed to determine the important genotype features.
Feature selection
Over the past decades, advances in data collection and storage have confronted many sciences with vast amounts of information. FS algorithms reduce the dimensionality of the data by selecting appropriate subsets of the original features17. This paper used the ReliefF, MRMR, F-test and Chi2 algorithms to select an appropriate number of features for training the ML models.
Kira and Rendell formulated the original Relief algorithm, inspired by instance-based learning18. As an evaluation filter algorithm, ReliefF can detect feature dependencies. It uses the concept of nearest neighbors to obtain feature statistics, while retaining the general advantages of filter algorithms, such as a relatively high convergence speed and independence of the selected features from the induction algorithm. In the weight updates, the diff function of ReliefF calculates the difference in the value of feature A between two samples, I1 and I2, where I1 = Ri (the target) and I2 is H or M. The algorithm identifies the two nearest neighbor instances of the target: one with the same class, called the closest hit (H), and one with the opposite class, called the closest miss (M). For discrete features, the diff function is defined as follows19:
$$diff\left( A, I_{1}, I_{2} \right) = \begin{cases} 0 & \text{if } value\left( A, I_{1} \right) = value\left( A, I_{2} \right) \\ 1 & \text{otherwise} \end{cases}$$
(1)
Furthermore, for continuous features, diff is defined as:
$$diff\left( A, I_{1}, I_{2} \right) = \frac{\left| value\left( A, I_{1} \right) - value\left( A, I_{2} \right) \right|}{\max\left( A \right) - \min\left( A \right)}$$
(2)
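To make the weight update concrete, the following is a minimal NumPy sketch of a Relief-style update using the diff function of Eq. 2 for continuous features and two classes. It is an illustration under stated assumptions, not the exact ReliefF implementation used in this study; X (feature matrix) and y (class labels) are placeholders.

```python
import numpy as np

def relief_weights(X, y, n_iters=100, seed=None):
    """Relief-style feature weights for continuous features and two classes.

    Uses the diff function of Eq. (2), |x1 - x2| / (max - min), and updates each
    weight with -diff(target, nearest hit) + diff(target, nearest miss)."""
    X, y = np.asarray(X, float), np.asarray(y)
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                                # avoid division by zero
    w = np.zeros(n_features)

    for _ in range(n_iters):
        i = rng.integers(n_samples)                      # random target instance R_i
        diffs = np.abs(X - X[i]) / span                  # Eq. (2) against every sample
        dist = diffs.sum(axis=1)
        dist[i] = np.inf                                 # exclude the target itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest hit H
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest miss M
        w += (-diffs[hit] + diffs[miss]) / n_iters

    return w

# Hypothetical usage: rank genotype/phenotype features by descending weight.
# ranking = np.argsort(relief_weights(X, y))[::-1]
```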
The MRMR algorithm is based on mutual information, which increases as the shared information (statistical dependence) between two feature vectors increases. The mutual information between two variables, x and y, is obtained according to Eq. 3 from the probability density functions20:
$$I\left( x, y \right) = \sum_{y \in Y}\sum_{x \in X} p\left( x, y \right)\log\left( \frac{p\left( x, y \right)}{p\left( x \right)p\left( y \right)} \right)$$
(3)
In the maximum-relevance (maximum-correlation) criterion, the selected features are required to have the highest mutual information I with the class c, reflecting the strongest dependence of the features on the class. Maximum relevance is obtained by Eq. 4 as the mean of the mutual information values between the individual features xi and the class c:
$$\max D\left( S, c \right), \quad D = \frac{1}{\left| S \right|}\sum_{x_{i} \in S} I\left( x_{i}; c \right)$$
(4)
According to Eq. 4, the features most dependent on the class are selected; however, the dependency among the selected features themselves can be considerable. Therefore, the mutual information between features is computed as in Eq. 5 to reduce this redundancy:
$$\min R\left( S \right), \quad R = \frac{1}{\left| S \right|^{2}}\sum_{x_{i}, x_{j} \in S} I\left( x_{i}; x_{j} \right)$$
(5)
To obtain the optimal features, the maximum-relevance and minimum-redundancy criteria of Eqs. 4 and 5 are combined into the incremental selection rule of Eq. 6, in which m is the number of selected features, \(S_{m-1}\) is the set of features already selected, and \(x_{j}\) is a candidate feature vector from the remaining set20:
$$\max_{x_{j} \in X - S_{m-1}}\left[ I\left( x_{j}; c \right) - \frac{1}{m-1}\sum_{x_{i} \in S_{m-1}} I\left( x_{j}; x_{i} \right) \right]$$
(6)
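A possible greedy implementation of this incremental criterion is sketched below with scikit-learn's mutual information estimators; it illustrates how Eq. 6 can be applied to rank genotype features. The function name mrmr_select and the parameter n_selected are illustrative and not part of the study's code.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, n_selected=20):
    """Greedy mRMR selection following Eq. (6): at each step pick the feature with
    the largest relevance I(x_j; c) minus its mean redundancy with the already
    selected set S_{m-1}."""
    relevance = mutual_info_classif(X, y)            # I(x_j; c) for every feature
    selected = [int(np.argmax(relevance))]           # start from the most relevant
    redundancy = np.zeros(X.shape[1])                # running sum of I(x_j; x_i)

    while len(selected) < n_selected:
        last = selected[-1]
        # add the mutual information between every feature and the last selected one
        redundancy += mutual_info_regression(X, X[:, last])
        score = relevance - redundancy / len(selected)
        score[selected] = -np.inf                    # never re-select a feature
        selected.append(int(np.argmax(score)))

    return selected

# Hypothetical usage with a marker matrix X (samples x markers) and class labels y:
# top_markers = mrmr_select(X, y, n_selected=30)
```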
The F-test is a statistical test used in one-way analysis of variance (ANOVA) that, for each feature, computes the ratio of the variance between groups, where a group is the set of instances sharing the same target value, to the sum of the variances within the groups. Features are ranked by their F-score; a higher value indicates smaller spread within groups and larger separation between groups. The F-score is given by21:
$$F\text{-}score = \frac{\mathrm{variance\;between\;groups}}{\mathrm{variance\;within\;groups}}$$
(7)
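The sketch below computes this one-way ANOVA F-score for a single feature, mirroring Eq. 7 with the conventional degree-of-freedom normalization; it is an illustration, and the names f_score, X and y are placeholders rather than the study's code.

```python
import numpy as np

def f_score(x, y):
    """One-way ANOVA F-score for a single feature x and class labels y,
    mirroring Eq. (7): between-group variance over pooled within-group variance."""
    x, y = np.asarray(x, float), np.asarray(y)
    groups = [x[y == c] for c in np.unique(y)]
    grand_mean = x.mean()
    between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (len(groups) - 1)
    within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (len(x) - len(groups))
    return between / within

# Ranking all columns of a feature matrix X against labels y:
# scores = np.array([f_score(X[:, j], y) for j in range(X.shape[1])])
# sklearn.feature_selection.f_classif(X, y) yields the same statistic.
```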
The Chi2 FS algorithm was used for stress classification, with individual chi-square tests assessing whether each predictor variable is independent of the response variable. A small p-value indicates that a predictor variable depends on the response variable, making it an important feature22.
ML models
This section presents a brief description of all deployed ML models. The models are introduced conceptually rather than mathematically; mathematical treatments can be found in textbooks23,24.
Gaussian process regression (GPR)
GPR is a nonparametric statistical method for determining the relationship between independent and dependent variables. It uses latent variables, an explicit basis function, and parameters learned from the data. The latent function reflects the statistical nature of the model and is determined by the covariance (kernel) function. GPR models can provide accurate estimates with confidence intervals at any point, capturing the uncertainty of the model predictions. The user can also choose explicit basis functions to specify the form of the model. Building and optimizing GPR models is feasible with today's high-performance computing capabilities25.
Linear discriminant analysis (LDA)
Discriminant analysis (DA) classification, introduced by R. Fisher, is one of the simplest classifiers. There are two types of DA classifiers: linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). In LDA the decision surface is linear, while in QDA the decision boundary is nonlinear26. Discriminant functions create decision boundaries that distinguish between the different classes, so the input space is divided into regions, each bounded by decision boundaries. A classifier is represented by c decision (discriminant) functions, where c is the number of classes; these functions define the decision boundaries between classes and regions. The class label of an unknown pattern is determined by comparing the c discriminant functions and assigning the pattern to the class whose function gives the maximum score; within its own region, that discriminant function has the highest value compared with the others27,28.
Neural network (NN)
Neural networks (NN) are derived from biological neural systems. They model the behavior of brain neurons through defined mathematical functions, while the synaptic connections of natural neurons are modeled artificially by the weights on the connections between nodes. The structure of an NN consists of input, hidden and output layers, connection weights and activation (transfer) functions. The input layer transmits, prepares and introduces the data; the hidden layers consist of processing nodes where the data are processed; and the output layer contains the values predicted by the network29.
Naive Bayes (NB)
NB is a probabilistic classifier based on Bayes' theorem under a complete independence assumption. For classification problems, the NB model is powerful and intuitive. Its predictions rest on Bayes' theorem and the assumption that the predictors are conditionally independent: the presence of one feature in a class is assumed to be independent of the presence of any other feature30.
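To make the classification workflow concrete, the following is a minimal, hypothetical scikit-learn sketch combining Chi2 feature selection with two of the classifiers described above (LDA and naive Bayes). The synthetic X and y stand in for the barley marker scores and tolerance classes and are not the study's data; the 70/30 split and k = 30 selected features are illustrative choices.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-ins: marker scores coded 1-4 and three tolerance classes.
rng = np.random.default_rng(0)
X = rng.integers(1, 5, size=(200, 100)).astype(float)
y = rng.integers(0, 3, size=200)

# 70/30 random split, as described for the classification task.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Chi2 keeps the k predictors least independent of the response (non-negative features required).
selector = SelectKBest(chi2, k=30).fit(X_train, y_train)
X_train_fs, X_test_fs = selector.transform(X_train), selector.transform(X_test)

for name, model in [("LDA", LinearDiscriminantAnalysis()), ("Naive Bayes", GaussianNB())]:
    model.fit(X_train_fs, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test_fs)))
```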
Support vector machine (SVM)
SVM reduces classification error by finding a maximum-margin separator, which can be viewed as separating the convex hulls of the classes, while a loss function penalizes unfavorably placed (misclassified) samples. Kernels, from the linear kernel to the Gaussian kernel, allow nonlinear mappings of the feature vectors into a high-dimensional space. A linear SVM classifier has a linear decision boundary and trains faster, whereas kernel-based SVM models provide more flexible nonlinear decision boundaries31.
Decision tree (DT)
DTs are algorithms that generate decision rules based on the expected reduction in entropy when an element is classified. They tend to overfit the data and perform poorly when applied to new datasets, so for better results they are frequently used in ensemble settings such as RFs32.
Random forest (RF)
An RF is a bag of DTs. Each DT is trained on a new training dataset obtained by random sampling with replacement from the original dataset. In addition, some randomness is introduced into the tree construction: a subset of features is randomly selected for each decision branch of the DT. The RF prediction is given as the mean prediction of the individual DTs33.
K-nearest neighbor (KNN)
KNN is another classifier used in this research. In the training stage, all samples are stored as multidimensional vectors in the input space together with their class labels. The distance of a new sample to all training samples is the criterion used to determine the category of the new, unknown sample; the distance between two samples can be calculated with Euclidean, Manhattan or Chebyshev metrics. To classify a new sample, its distance to all samples stored in memory is calculated, the k samples with the smallest distances are selected, and the majority class label among these k samples is assigned to the unknown sample34.
Hyperparameter optimization
The Bayesian optimization algorithm (BOA) is an effective method for the global optimization of objective functions whose evaluation is costly35. BOA is appropriate when the user has no access to the functional form of the objective and can only obtain noisy estimates of it. In this paper, the hyperparameters of the ML models were tuned with the BOA. The BOA was proposed by Pelikan et al. in 199936. It evolves a population of candidate solutions by building a Bayesian network and then sampling from it. The initial population is usually generated randomly with a uniform distribution over all possible solutions. Each iteration of the BOA consists of four steps. First, promising solutions are selected from the current population using a selection method. Second, a Bayesian network is built to describe the population of promising solutions. Third, new candidate solutions are generated by sampling from the Bayesian network. Fourth, the new candidate solutions are added to the previous ones and replace all or some of them. The steps are repeated until a termination condition is reached.
The termination condition can be convergence to a single member, reaching a sufficiently good solution, or completing a given number of iterations. Each step of the BOA can be performed in different ways: the initial population can be generated randomly or from prior knowledge of the problem, the selection stage can use any standard selection method from evolutionary algorithms, different algorithms can be used to build the Bayesian network, and different criteria can be used to evaluate the quality of the candidate models. The ML model hyperparameters optimized by the BOA are presented in Table 9.
Table 9 Hyperparameters of ML models optimized by the Bayesian optimization algorithm.
Evaluation metrics
ML algorithms have two phases: training and testing. During the training phase, a model is created to predict the state of other samples; in the testing phase, its performance is measured on a test set and evaluated from different aspects. In regression, a pre-labeled set of training data is available, and the goal is to find a function or rule, based on the characteristics of the training data, that predicts the values of data presented to the model in the future. The performance of all regression models was evaluated with the following metrics37, in which \({y}_{i}\) and \(\overline{{y}_{i}}\) are the predicted and actual values, \({y}_{ave}\) is the average of the dataset values, and n is the number of observations:
$$MAPE = \frac{1}{n}\sum_{i = 1}^{n}\left| \frac{y_{i} - \overline{y_{i}}}{y_{i}} \right|$$
(8)
$$MSE = \frac{1}{n}\sum_{i = 1}^{n}\left( y_{i} - \overline{y_{i}} \right)^{2}$$
(9)
$$R^{2} = 1 - \frac{\sum_{i = 1}^{n}\left( y_{i} - \overline{y_{i}} \right)^{2}}{\sum_{i = 1}^{n}\left( y_{i} - y_{ave} \right)^{2}}$$
(10)
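A compact implementation of these three regression metrics, written with the conventional roles of observed and predicted values, might look as follows; it is a sketch for illustration rather than the evaluation code used in the study.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAPE, MSE and R^2 corresponding to Eqs. (8)-(10), with y_true the
    observed values and y_pred the model predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mape = np.mean(np.abs((y_true - y_pred) / y_true))
    mse = np.mean((y_true - y_pred) ** 2)
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAPE": mape, "MSE": mse, "R2": r2}

# Example with hypothetical stress scores:
# regression_metrics([3.0, 5.0, 7.0], [2.8, 5.3, 6.9])
```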
In the case of classification, after training and testing the ML model, the confusion matrix is computed on the training and testing datasets to identify the different types of misclassification (Fig. 5). A confusion matrix contains information about the different types of accuracy and error and shows the successful or unsuccessful performance of a classifier model. Each column of the matrix contains the samples assigned to a predicted class, and each row contains the samples of an actual (correct) class. Confusion matrices make it easy to observe the errors and the confusion between results and are used to estimate the desired performance. The overall performance of a model is calculated by dividing the sum of the elements on the main diagonal by the total number of elements of the matrix38.
Figure 5 Confusion matrix. P: positive, N: negative, TP: true positive, FN: false negative, FP: false positive, TN: true negative.
The performance metrics for a multiclass confusion matrix are presented in Eqs. 11–1539.
$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$
(11)
$$Sensitivity = TP\; Rate \left( {TPR} \right) = \frac{TP}{{TP + FN}}$$
(12)
$$Specificity = TN\; Rate = \frac{TN}{{TN + FP}}$$
(13)
$$Precision = Positive\; Predictive\; Value \left( {PPV} \right) = \frac{TP}{{TP + FP}}$$
(14)
$$F1-Score = 2\frac{PPV \times TPR}{{PPV + TPR}} = \frac{{2n_{TP} }}{{2n_{TP} + n_{FN} + n_{FP} }}$$
(15)
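As an illustration, the per-class metrics of Eqs. 11–15 can be computed directly from a multiclass confusion matrix with rows as actual and columns as predicted classes, as in the following sketch; the example matrix is hypothetical.

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class accuracy, sensitivity, specificity, precision and F1-score
    (Eqs. 11-15) from a confusion matrix with rows = actual, columns = predicted."""
    cm = np.asarray(cm, float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp          # actual class i, predicted as something else
    fp = cm.sum(axis=0) - tp          # predicted class i, actually something else
    tn = cm.sum() - tp - fn - fp
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "f1_score": 2 * tp / (2 * tp + fn + fp),
    }

# Example with a hypothetical 3-class confusion matrix:
# per_class_metrics([[30, 2, 1], [3, 25, 2], [0, 4, 28]])
```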
