Accurate and 30-plus-day reliable cuffless blood pressure measurement with 9 minutes of personal photoplethysmography data and mixed deduction learning

In this section, the prediction performance of four deep learning models is compared: two purely personalized models (PIL and PDL) and two models trained with mixed subjects (MIL and MDL). All models were developed for each of the 15 target subjects who had at least 10 rounds of data (see Supplementary Data Table 2), since nine rounds of data are required for model training in this study. The experiments were carried out on an ASUS ESC8000 GPU server equipped with dual Intel Xeon Silver 4114 CPUs, 256 GB of host memory, and six Nvidia RTX 2080 Ti GPU cards. The system ran Ubuntu 18.04.6 LTS, with a model development environment of CUDA 10.1, Python 3.6.9, Keras 2.2.2, TensorFlow 1.11.0, and Jupyter Notebook 5.7.0. For the training parameters, the batch size was set to 3000 for optimal GPU memory utilization, with a learning rate of 0.001, the Adam optimizer, and the ReLU activation function for all models. All reference BP values were measured with a cuff-based Omron HEM-7320 automatic sphygmomanometer.

The performance of our models is compared in the correlation plots of Fig. 3(a), (b) and Supplementary Data Fig. 5, which show the ground-truth SBP/DBP measured by the cuff-based sphygmomanometer versus the corresponding model predictions. Following the BHS protocol, a BPM device receives grade “A” in performance validation if at least 60%, 85%, and 95% of its BP measurements deviate by no more than ±5 mmHg, ±10 mmHg, and ±15 mmHg, respectively (see Table 1 for details). These deviation bands are illustrated by the solid lines surrounding the central diagonal (dotted) line in each figure, and the percentage of data points with prediction deviation within ±15 mmHg is also presented. As the figures show, the percentages of data points within ±15 mmHg given by MDL are 92.0% for SBP and 95.5% for DBP, both significantly higher than those of PDL, PIL, and MIL.
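The BHS grading scheme described above can be sketched in code. The following is a minimal illustration (not the authors' implementation): `bhs_grade` is a hypothetical helper that computes the cumulative percentages of absolute errors within 5, 10, and 15 mmHg and assigns the standard BHS grade; the grade B and C thresholds (50/75/90% and 40/65/85%) are the published BHS values.

```python
import numpy as np

def bhs_grade(bp_ref, bp_pred):
    """Grade a BP estimator by the BHS protocol.

    Grade A requires >= 60%, 85%, and 95% of absolute errors within
    5, 10, and 15 mmHg, respectively; grades B and C relax these.
    """
    err = np.abs(np.asarray(bp_ref, dtype=float) - np.asarray(bp_pred, dtype=float))
    # cumulative percentage of samples within each deviation threshold
    pct = [np.mean(err <= t) * 100 for t in (5, 10, 15)]
    limits = {"A": (60, 85, 95), "B": (50, 75, 90), "C": (40, 65, 85)}
    for grade, req in limits.items():   # checked from best grade down
        if all(p >= r for p, r in zip(pct, req)):
            return grade, pct
    return "D", pct
```

For example, a model whose errors fall within ±5 mmHg for only 75% of samples cannot reach grade A (which demands 85% within ±10 mmHg as well) and would be graded B at best, mirroring how our MDL SBP result lands at grade “B”.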
We notice that subjects with SBP higher than 150 mmHg exhibit significant errors. This is due to an insufficient number of data samples: only about 13% of the total samples collected from our recruited subjects had SBP ≥ 150 mmHg.

Fig. 5 The performance trends of the estimated BP, according to the BP variation thresholds of the BHS standard, for systolic (a1) ≤ ±5 mmHg, (a2) ≤ ±10 mmHg, and (a3) ≤ ±15 mmHg, and diastolic (b1) ≤ ±5 mmHg, (b2) ≤ ±10 mmHg, and (b3) ≤ ±15 mmHg, respectively. The red/green data points represent the models with one-channel/two-channel inputs, respectively. Examples of intermediate feature maps extracted from the first CNN layer of our models are presented in: (c1) PDL trained with nine consecutive rounds of PPG data of the target subject, (c2) MDL trained with the PDL training set mixed with data from 199 randomly selected subjects, (c3) MDL trained with the PDL training set mixed with data from all 364 recruited subjects, (d1) PIL trained with nine unpaired rounds of PPG data of the target subject, (d2) MIL trained with the PIL training data mixed with data from 199 randomly selected subjects, and (d3) MIL trained with the PIL training data mixed with data from all 364 recruited subjects.

The commonly used global-view scatter plot of multi-subject reference versus prediction is insufficient to objectively evaluate the performance of a personalized device that requires calibration data, as highlighted by Mukkamala et al.55. It is important to show the model’s response to the actual BP change of each subject, to avoid the illusion of a seemingly good global coefficient of determination when the model is in fact not responding to an individual’s BP change relative to the preceding cuff BP measurement (reference) obtained for calibration (hereafter referred to as the calibration BP).
To address this, Mukkamala proposed a plot that uses each subject’s calibration BP to normalize both the prediction and the reference value, as shown in Fig. 3(c), (d), which depicts the true BP change from calibration versus the predicted BP change from calibration. The calibration BP in our work is defined as the average of the reference BP over the rounds of measurements used in training the model for each subject:

$$\mathrm{Calibration\;BP}\,(\mathrm{Subject}\;i)=\frac{\sum_{r=1}^{k} BP_{i,r}}{k}$$

where \(i\) denotes the individual subject, \(k\) is the number of rounds of data used in training, \(r\) indexes the rounds, and \(BP\) refers to the cuff-measured reference BP. Fig. 3(c), (d) shows MDL’s superior performance, with a significantly better linear correlation along the diagonal line compared to MIL, PIL, and PDL. MDL outperforms the other three from both the global and the individual perspective of modeling performance. While most of the data fall within the ±20 mmHg range, the model also delivers accurate predictions for changes of up to ±50 mmHg.

To examine whether each model fulfills the AAMI standard, which states that a valid BPM device should attain a measurement accuracy of mean error (ME) ≤ 5 mmHg and SD of ME ≤ 8 mmHg with at least 85 testing subjects (see Table 1 for details), plots of the average of the BP reference and model prediction, \((BP_{ref}+BP_{pred})/2\), versus their difference, \((BP_{ref}-BP_{pred})\), for both SBP and DBP, together with the SD of ME, are presented in Supplementary Data Fig. 5. As shown in the figures, the SD of ME predicted by MDL for SBP/DBP is 8.74/5.90 mmHg, respectively, smaller than that of all the other models.
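The calibration-BP definition and the Mukkamala-style change-from-calibration plot can be expressed compactly. The sketch below is illustrative only; `calibration_bp` and `bp_change_from_calibration` are hypothetical helper names, and the numbers in the usage note are invented.

```python
import numpy as np

def calibration_bp(ref_bp_rounds):
    """Calibration BP for one subject: the mean cuff reference BP
    over the k rounds of data used in training (the equation above)."""
    return float(np.mean(ref_bp_rounds))

def bp_change_from_calibration(cal_bp, ref_test, pred_test):
    """True and predicted BP change relative to the calibration BP,
    the quantities plotted against each other in Fig. 3(c), (d)."""
    ref_test = np.asarray(ref_test, dtype=float)
    pred_test = np.asarray(pred_test, dtype=float)
    return ref_test - cal_bp, pred_test - cal_bp
```

For a subject whose nine training-round references average to 120 mmHg, a test reference of 130 mmHg is a true change of +10 mmHg; a well-calibrated model's prediction should track that change, which is exactly what the diagonal in Fig. 3(c), (d) visualizes.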
It is also notable that, following the AAMI criteria, the percentages of predicted data points falling within ME ± 8 mmHg are 64.8% for SBP and 84.1% for DBP, respectively, which are considerably larger than those of the other models. It is therefore evident that MDL also outperforms the other models under the AAMI standard.

To ensure that predictions remain reliable and consistent over a prolonged period, we computed the BP percent error (PE) between the BP reference and model prediction for each sample (see Fig. 6) via \(\mathrm{PE}=(BP_{ref,i}-BP_{pred,i})/BP_{ref,i}\times 100\%\), where \(i\), \(BP_{ref,i}\), and \(BP_{pred,i}\) denote the index of the testing sample (\(i=1,2,\dots,88\)), the ground-truth BP, and the predicted BP, respectively. The PE results were plotted versus the test period (in days) between training and testing, which ranged from 15 to 49 days. In these figures, the percentage of PE data points falling within ±15% for MDL (97.7% for SBP and 100.0% for DBP) is significantly increased, surpassing PDL, PIL, and MIL. Even for test periods ≥ 30 days, the PE data points of both SBP and DBP given by MDL remained close to 0% without significant outliers. This indicates that our MDL model can maintain prediction quality over a long period without requiring recalibration, which is crucial for clinically relevant applications.

Fig. 6 Percent error versus the test period (in days) between training and testing, for MIL, PIL, PDL, and MDL, respectively. Rows (a) and (b) are the plots for SBP and DBP.
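The per-sample percent-error computation behind Fig. 6 is simple to state in code. This is a minimal sketch of the PE formula given above, with a hypothetical `within_band` helper for the ±15% region statistic reported in the figure:

```python
import numpy as np

def percent_error(bp_ref, bp_pred):
    """Per-sample percent error: PE = (BP_ref - BP_pred) / BP_ref * 100%."""
    bp_ref = np.asarray(bp_ref, dtype=float)
    bp_pred = np.asarray(bp_pred, dtype=float)
    return (bp_ref - bp_pred) / bp_ref * 100.0

def within_band(pe, band=15.0):
    """Percentage of samples whose PE lies within +/- band percent,
    the quantity displayed on each panel of Fig. 6."""
    return float(np.mean(np.abs(pe) <= band) * 100.0)
```

Plotting `percent_error(...)` against the number of days elapsed between training and testing gives exactly the scatter shown in Fig. 6, with `within_band` supplying the annotated percentage.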
The red dashed horizontal lines mark the ±15% percent-error region, and the percentage of data points falling within this region is displayed.

Table 1 summarizes the quantitative performance comparison of MIL, PIL, PDL, and MDL against the AAMI, IEEE STD 1708–2014 28, and BHS standards. The quantities ME, MAE, SD, and the percentage of samples with prediction difference within a threshold (CD (%)) were computed using Eqs. 2 to 5 (see Methods section), respectively. According to the BHS protocol, our MDL results are graded “B” and “A” for SBP and DBP, respectively, indicating that MDL is compliant with this standard. For the IEEE STD 1708–2014 standard, our MDL also achieves grades “B” and “A” for SBP and DBP, respectively, in the static test. The test of induced BP changes will be covered in future work. Moreover, the MDL results also met the AAMI requirement for DBP prediction, and nearly met the AAMI requirement for SBP prediction. Overall, all the models performed better at predicting DBP than SBP. We additionally report prediction accuracy in terms of the commonly used performance metrics in Table 2: pA (percentage accuracy), RMSEP (root-mean-square error of prediction), and R_p (Pearson correlation coefficient) (see Eqs. 6 to 8). Our results show that MDL performed much better in R_p (0.910 for DBP and 0.914 for SBP) and RMSEP (6.4 for DBP and 8.9 for SBP) than the other models. We therefore conclude that MDL excels over the MIL, PIL, and PDL models. This demonstrates an encouraging breakthrough: MDL could be an ideal candidate for cuffless NIBP estimation, providing continuous and reliable BP monitoring for individuals.

Table 1 Performance comparison of the proposed models based on the BHS, IEEE STD 1708-2014, and AAMI standards for BP measurement.
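The metrics collected in Tables 1 and 2 can all be derived from the paired reference/prediction arrays. The sketch below assumes the conventional definitions of ME, SD, MAE, CD(%), RMSEP, and Pearson R_p; the exact forms of Eqs. 2 to 8 (in particular pA, which is omitted here) are given in the Methods section, so this `bp_metrics` helper is illustrative rather than the authors' code.

```python
import numpy as np

def bp_metrics(bp_ref, bp_pred, cd_threshold=15.0):
    """Common BP-estimation metrics (cf. the standards in Table 1):
    ME and its SD (AAMI), MAE (IEEE 1708), CD% within a threshold (BHS),
    plus RMSEP and Pearson R_p (Table 2)."""
    ref = np.asarray(bp_ref, dtype=float)
    pred = np.asarray(bp_pred, dtype=float)
    diff = ref - pred
    me = diff.mean()                                   # mean error
    sd = diff.std(ddof=1)                              # SD of ME
    mae = np.abs(diff).mean()                          # mean absolute error
    cd = np.mean(np.abs(diff) <= cd_threshold) * 100   # % within threshold
    rmsep = np.sqrt(np.mean(diff ** 2))                # RMS error of prediction
    r_p = np.corrcoef(ref, pred)[0, 1]                 # Pearson correlation
    return {"ME": me, "SD": sd, "MAE": mae, "CD%": cd,
            "RMSEP": rmsep, "R_p": r_p}
```

Note that SD here is the standard deviation of the signed errors, not of their magnitudes; this is the quantity the AAMI criterion bounds at 8 mmHg.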
The BHS criterion enumerates the percentage of samples with prediction difference within each threshold region. The IEEE STD 1708-2014 criterion is based on ranges of MAE scores of the prediction. The AAMI standard sets criteria on the ME and the SD of ME of predictions, with at least 85 samples. The overall scoring of our models (the cells enclosed in bold borders) is based on the worst grade across the three BHS thresholds, ≤ 5 mmHg, ≤ 10 mmHg, and ≤ 15 mmHg.

Table 2 Prediction performance for DBP and SBP in terms of percentage accuracy (pA), Pearson correlation coefficient (R_p), and root-mean-square error of prediction (RMSEP).

The performance trends of models trained with one-channel and two-channel inputs were investigated with respect to the size of the mixed training set. In this test, we took the PIL and PDL training data and mixed them with data from 0, 199 randomly selected, or all 364 recruited subjects to train our models; the first case is equivalent to PIL and PDL, while the remaining two correspond to MIL and MDL. Figure 5 illustrates a comparison of the performance trends of our models according to the BP variation thresholds of the BHS standard. Adding data from more subjects to the training set of the conventional one-channel-input model did not improve performance when transitioning from PIL to the more inclusive MIL. This may be because including a larger set of subjects’ data introduced greater variation, which overwhelmed the weak footprint of the correlation between PPG and BP. In contrast, training performance was significantly enhanced in the two-channel-input models as data from more subjects were mixed into the MDL training set. This is the best performance we have achieved so far.
It is expected that if more subjects could be recruited to enlarge the training data set, the number of personal data samples required from the target subject for training MDL could be further reduced. To better understand what the models with one-channel and two-channel inputs had learned, we examined the feature maps produced by the first CNN layer (with 256 filters) of the models, as shown in Fig. 5 and Supplementary Data Figs. 4–9. A detailed discussion is presented in the Discussion section.

To objectively assess the differences between MDL and the other models, statistical significance tests were performed on the percentage accuracy (pA, computed using Eq. 6; see Table 2), and the resulting p-values are presented. To choose a suitable significance test, the distribution of the pA data was first checked. Supplementary Data Fig. 10a and b present histogram plots of probability density versus pA for SBP and DBP, respectively, for MDL compared to the other models. Since the probability distributions of all the models were positively skewed away from normal, a nonparametric paired Friedman test29,30 on pA was computed for MDL, PDL, PIL, and MIL together. The Friedman test p-values were 0.00029 for SBP and 0.00336 for DBP, both < 0.05, revealing that at least one method significantly outperformed the others. Next, post hoc tests on pA for MDL versus each of the other models were performed using the nonparametric Wilcoxon paired test and the Mann-Whitney U two-tailed test. As seen in Supplementary Data Table 4, both tests yielded p-values < 0.05 for MDL versus PDL, PIL, and MIL for SBP, demonstrating its superiority over the other methods. For DBP, both tests gave p-values < 0.05 for MDL versus MIL and PIL, but yielded 0.08725 and 0.08420 (both near 0.05) for MDL versus PDL, respectively (see Supplementary Data Table 4).
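The omnibus-then-post-hoc testing procedure described above maps directly onto `scipy.stats`. The sketch below uses synthetic per-subject pA scores (the real values come from Eq. 6 over the 15 target subjects, and are not reproduced here); the variable names and the random data are purely illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the per-subject pA scores of the four models
# (15 target subjects each); the real study uses the Eq. 6 values.
rng = np.random.default_rng(0)
pa_mdl = rng.uniform(90, 99, 15)
pa_pdl = rng.uniform(80, 95, 15)
pa_pil = rng.uniform(75, 92, 15)
pa_mil = rng.uniform(70, 90, 15)

# Omnibus nonparametric test across all four paired samples:
# a small p-value means at least one model differs significantly.
_, p_friedman = stats.friedmanchisquare(pa_mdl, pa_pdl, pa_pil, pa_mil)

# Post hoc pairwise tests of MDL against each competitor, as in the text.
for name, other in [("PDL", pa_pdl), ("PIL", pa_pil), ("MIL", pa_mil)]:
    _, p_wilcoxon = stats.wilcoxon(pa_mdl, other)              # paired
    _, p_mwu = stats.mannwhitneyu(pa_mdl, other,
                                  alternative="two-sided")     # unpaired
    print(f"MDL vs {name}: Wilcoxon p={p_wilcoxon:.5f}, MWU p={p_mwu:.5f}")
```

Running the Friedman test first guards against inflating the false-positive rate from multiple pairwise comparisons; the pairwise tests are only interpreted once the omnibus test rejects.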
From these findings, we conclude that MDL significantly outperformed PDL, PIL, and MIL in cuffless BP prediction using PPG signals.
