Momentum prediction models for tennis matches based on CatBoost regression and random forest algorithms

In this section, we develop a model capable of capturing momentum to describe the probability of scoring in a match. For this purpose, we adopt a stacking strategy and establish the CBRF (CatBoost-Random Forest) prediction model to describe the match process. The flowchart of this model is illustrated in Fig. 4.

Figure 4. CatBoost regression prediction model based on decision trees.

Decision tree regression model

The process of generating a decision tree involves continuously partitioning the training sample set: the branches of the tree grow as the data are segmented further, and the core technique in this growth is the selection of test attributes12. We use the dimensionality-reduced data as the independent variables and point-victor as the dependent variable, and train the decision tree regression prediction model to obtain predictions.

When utilizing machine learning algorithms for prediction, it is common practice to assess accuracy with several statistical metrics: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and the coefficient of determination \(R^{2}\). The formulas for these metrics are as follows:

$$\begin{aligned} MSE= & {} \frac{1}{N}\sum _{i=1}^{N}(y_{i}-{\hat{y}}_{i})^{2},\\ RMSE= & {} \sqrt{\frac{1}{N}\sum _{i=1}^{N}(y_{i}-{\hat{y}}_{i})^{2}},\\ MAE= & {} \frac{1}{N}\sum _{i=1}^{N}\left| y_{i}-{\hat{y}}_{i}\right| ,\\ MAPE= & {} \frac{100\%}{N}\sum _{i=1}^{N}\left| \frac{y_{i}-{\hat{y}}_{i}}{y_{i}}\right| ,\\ R^{2}= & {} 1-\frac{\sum _{i=1}^{N}(y_{i}-{\hat{y}}_{i})^{2}}{\sum _{i=1}^{N}(y_{i}-{\bar{y}})^{2}}. \end{aligned}$$

Here, \(y_{i}\) is the actual value of the \(i{\text {th}}\) sample, \({\hat{y}}_{i}\) is the predicted value of the \(i{\text {th}}\) sample, N is the total number of samples, and \({\bar{y}}\) is the average of all actual values.

Using the formulas above, the results are calculated and presented in Table 4.

Table 4. Decision tree evaluation results.

Upon analyzing the evaluation results in Table 4, we found that the coefficient of determination \(R^{2}\) was too small. We therefore optimized the decision tree regression prediction model by adopting the CatBoost regression prediction model for re-prediction.

CatBoost regression prediction model based on the decision tree regression model

CatBoost is a framework that relies on symmetric decision trees as base learners, characterized by few parameters and native support for categorical variables. Its primary advantage lies in efficiently and reasonably addressing prediction bias, thus minimizing overfitting and enhancing the model's accuracy and generalizability13. It encodes categorical features with the ordered target statistic:

$$\begin{aligned} x_{i,k} =\frac{ \sum _{j=1}^{p-1} \left[ x_{\sigma _{j},k}=x_{\sigma _{p},k}\right] \cdot Y_{\sigma _{j}}+ a\cdot P }{\sum _{j=1}^{p-1} \left[ x_{\sigma _{j},k}=x_{\sigma _{p},k}\right] +a}. \end{aligned}$$

In this formula, \(\sigma\) is a random permutation of the training samples and \(\sigma _{j}\) is its \(j{\text {th}}\) element; \(x_{i,k}\) denotes the categorical feature in the \(k{\text {th}}\) column of the \(i{\text {th}}\) row of the training set, which occupies position \(p\) in the permutation; \(Y_{\sigma _{j}}\) is the label of sample \(\sigma _{j}\); \([\cdot ]\) is the indicator bracket; \(a\) is the weight of the prior; and \(P\) is the prior value.
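To make the ordered target statistic concrete, the following is a minimal NumPy sketch, not CatBoost's internal implementation: each sample is encoded using only the targets of samples that precede it in a random permutation \(\sigma\), which is how leakage of a sample's own label is avoided. The function name, toy data, and default prior are illustrative assumptions.

```python
import numpy as np

def ordered_target_statistic(cat_values, targets, a=1.0, prior=None, seed=0):
    """Encode one categorical column with the ordered target statistic.

    Each sample sees only the targets of samples preceding it in a
    random permutation, limiting target leakage (a sketch of the idea,
    not CatBoost's exact internals).
    """
    cat_values = np.asarray(cat_values)
    targets = np.asarray(targets, dtype=float)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(cat_values))      # the permutation sigma
    prior = targets.mean() if prior is None else prior
    sums, counts = {}, {}                        # running per-category statistics
    encoded = np.empty(len(cat_values))
    for idx in perm:                             # walk through sigma in order
        c = cat_values[idx]
        encoded[idx] = (sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a)
        sums[c] = sums.get(c, 0.0) + targets[idx]   # update only after encoding
        counts[c] = counts.get(c, 0) + 1
    return encoded

# Toy example: one categorical feature with two levels and binary targets.
print(ordered_target_statistic(["A", "B", "A", "A", "B"], [1, 0, 1, 0, 1]))
```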
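For reference, the five evaluation metrics defined in the "Decision tree regression model" subsection can be computed as in the following sketch, applied here to a decision tree regressor. The synthetic arrays standing in for the dimensionality-reduced features and the point-victor labels, as well as the hyperparameters, are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, MAPE (%) and R^2 as defined in the text."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "MAPE": np.mean(np.abs(err / y_true)) * 100,  # requires nonzero y_true
        "R2": 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2),
    }

# Synthetic stand-ins for the dimensionality-reduced features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                   # reduced feature matrix
y = rng.integers(1, 3, size=500).astype(float)  # 'point-victor' coded as 1 or 2
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)
print(regression_metrics(y_te, tree.predict(X_te)))
```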
The predicted model results are evaluated, with the evaluation results presented in Table 5; the formulas for the evaluation parameters are given in section "Decision tree regression model".

Table 5. CatBoost regression prediction model evaluation results.

Although this model fits the data significantly better than the decision tree regression model, its predictive performance is still not ideal. While the model performs well on the training data, its performance degrades significantly on the test data, exhibiting increased prediction errors and reduced accuracy. Furthermore, the coefficient of determination is very high on the training set but decreases markedly on the test set, indicating overfitting.

Building the CBRF prediction model based on the CatBoost regression model and the random forest regression model

To make the model more accurate, and considering that in tennis matches the serving side often has a higher probability of scoring22, we incorporate the serving side's weight \(S_{i}\) as an additional variable in the CatBoost regression model. The model's prediction function then depends on both the original data \(X_{i}\) and the serving side's weight \(S_{i}\):

$$\begin{aligned} {\hat{y}}_{i}^{(CB)}=f_{CB}(X_{i},S_{i}). \end{aligned}$$

Here, \({\hat{y}}_{i}^{(CB)}\) represents the predicted probability of winning, and \(f_{CB}\) is the prediction function constructed with the CatBoost regression model.

Random forest, in turn, is a supervised machine learning method built from an ensemble of decision trees as base learners. It introduces randomness into the decision tree training process, endowing it with excellent resistance to overfitting and noise16.

To make the predictions more accurate, we stack the random forest regression model on top of the CatBoost regression model to construct the CBRF prediction model. In this model, the predictions of the CatBoost regression model are used as input data for the random forest regression model:

$$\begin{aligned} {\hat{y}}_{i}^{(CBRF)} =f_{CBRF}\left( {\hat{y}}_{i}^{(CB)},S_{i},X_{i}' \right) . \end{aligned}$$

In this expression, \({\hat{y}}_{i}^{(CBRF)}\) is the predicted value of the CBRF prediction model, and \(f_{CBRF}\) is its prediction function, which also incorporates the variable \(S_{i}\).

Additionally, to enhance training accuracy, we employ three-fold cross-validation to assess the model's stability during training; the training and test sets are split in a 4:1 ratio.

After training, we obtain the predicted results for 'point-victor'. We compare these predictions with the original values of 'point-victor'; given the large volume of data, we select the predictions of a single match for visualization, shown in Fig. 5.

Figure 5. Predicted versus actual values of 'point-victor' for a single match.

From the figure, it is evident that the predicted values are very close to the actual values.

Predictive evaluation analysis

The predicted model results are evaluated, with the evaluation results presented in Table 6; the formulas for the evaluation parameters are given in section "Decision tree regression model".

Table 6. Evaluation of CBRF model predictions.

The data in the table indicate that the values of MSE, RMSE, MAE, and MAPE are all low and that \(R^{2}\) is close to 1.
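As a reproducibility aid, the following sketch outlines one way to implement the stacking scheme described above with the catboost and scikit-learn packages: a CatBoost regressor is trained on \((X_{i},S_{i})\), and its predictions are fed, together with \(S_{i}\) and the features, into a random forest. Using the three folds to produce out-of-fold level-0 predictions, the hyperparameters, and the synthetic data are our assumptions, not necessarily the authors' exact procedure.

```python
import numpy as np
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-ins for the features X_i, serving-side weight S_i, and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
serve = rng.integers(0, 2, size=600).astype(float)
y = rng.integers(1, 3, size=600).astype(float)       # 'point-victor' (1 or 2)

# 4:1 split between training and test data, as in the text.
X_tr, X_te, s_tr, s_te, y_tr, y_te = train_test_split(
    X, serve, y, test_size=0.2, random_state=0)

# Level 0: CatBoost regression on (X_i, S_i); three-fold out-of-fold
# predictions keep the level-1 inputs free of target leakage.
cb_oof = np.zeros(len(y_tr))
for tr_idx, va_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X_tr):
    cb = CatBoostRegressor(iterations=200, depth=4, verbose=0)
    cb.fit(np.column_stack([X_tr[tr_idx], s_tr[tr_idx]]), y_tr[tr_idx])
    cb_oof[va_idx] = cb.predict(np.column_stack([X_tr[va_idx], s_tr[va_idx]]))

# Refit CatBoost on all training data to produce test-time level-0 predictions.
cb_full = CatBoostRegressor(iterations=200, depth=4, verbose=0)
cb_full.fit(np.column_stack([X_tr, s_tr]), y_tr)
cb_te = cb_full.predict(np.column_stack([X_te, s_te]))

# Level 1: random forest on (y_hat_CB, S_i, X_i').
rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(np.column_stack([cb_oof, s_tr, X_tr]), y_tr)
y_hat_cbrf = rf.predict(np.column_stack([cb_te, s_te, X_te]))
```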
These results indicate that the CBRF prediction model exhibits very good predictive performance on the dataset of men's singles matches from the second round of the 2023 Wimbledon Championships. The model can accurately predict the direction of matches based on the momentum of the athletes, demonstrating the effectiveness of the CBRF prediction model for this task.

Figure 6. Prediction results on the training and test sets.

We use Python for visualization, and Fig. 6 shows that the prediction results for the training and test sets are similar. This indicates that the CBRF prediction model has good generalization ability and exhibits neither overfitting nor underfitting.

Score probability visualization

Through the CBRF prediction model, we can determine who is more likely to score as a result of momentum shifts. However, we are not yet able to represent visually which player performs better at a specific moment in the match, and to what extent. We therefore construct the following model to describe this:

$$\begin{aligned} \theta _{1}= & {} \left( 2-{\hat{y}}^{(CBRF)} \right) \times 100\%,\\ \theta _{2}= & {} \left( {\hat{y}}^{(CBRF)}-1 \right) \times 100\%, \end{aligned}$$

where \(\theta _{1}\) is the probability of Player 1 winning and \(\theta _{2}\) is the probability of Player 2 winning. Because the predicted value \({\hat{y}}^{(CBRF)}\) lies between 1 (Player 1 wins the point) and 2 (Player 2 wins the point), the two probabilities sum to 100%.

We visualize the probabilities \(\theta _{1}\) and \(\theta _{2}\) with Python, as shown in Fig. 7.

Figure 7. Visualization of scoring probability.

From the chart, one can observe the probability of each player scoring at a specific time in the match. For instance, for Player 1, the probability of winning at 1301 s is 3.72%, while at 4806 s it jumps to 95.74%.
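A minimal sketch of the probability transformation above, assuming the CBRF predictions lie between 1 and 2 in line with the 'point-victor' coding; the clipping step is an added safeguard, not part of the original model.

```python
import numpy as np

def score_probabilities(y_hat_cbrf):
    """Turn CBRF predictions into per-player winning probabilities (%)."""
    y_hat = np.clip(np.asarray(y_hat_cbrf, dtype=float), 1.0, 2.0)  # safeguard
    theta1 = (2.0 - y_hat) * 100.0   # probability that Player 1 wins the point
    theta2 = (y_hat - 1.0) * 100.0   # probability that Player 2 wins the point
    return theta1, theta2

# A prediction of 1.9628 gives 3.72% for Player 1, and 1.0426 gives 95.74%,
# matching the two moments cited in the text.
print(score_probabilities([1.9628, 1.0426]))
```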
