Innovative approaches in imaging photoplethysmography for remote blood oxygen monitoring

Datasets

In current research on contactless SpO2 estimation, most researchers use self-built datasets rather than public ones, and most of these cover only a few subjects and scenes. For easier comparison and more convincing results, the experiments in this paper are conducted on the public VIPL-HR dataset32. The VIPL-HR database contains 9 scenarios for 107 subjects recorded by 4 different devices, and records the HR, SpO2, and Blood Volume Pulse (BVP) signal of each video at the same time. To verify how the algorithm runs on devices with limited performance, this paper selects SOURCE1, in which the video is recorded by a Logitech C310 web camera at 25 frames per second and a resolution of \(960 \times 720\). 80% of the data was used as the training set, 10% as the validation set, and 10% as the test set. SpO2 recording starts at the same time as the video, and one SpO2 value is recorded every second. The histogram of the SpO2 distribution of the dataset is shown in Fig. 1; the SpO2 range in VIPL-HR is 84 to 99.

Figure 1 The histogram of the label (SpO2 level) distribution of the VIPL-HR dataset.

Traditional method

According to the literature33, SpO2 is the percentage of oxygenated hemoglobin in peripheral capillaries, which can be expressed as Eq. (1).

$$ {\text{SpO}}_{2} = \frac{C_{HbO_{2}}}{C_{HbO_{2}} + C_{Hb}} \times 100\% $$
(1)
where \(C_{Hb}\) is the concentration of deoxygenated hemoglobin and \(C_{HbO_{2}}\) is the concentration of oxygenated hemoglobin. The Beer–Lambert law states that the light absorption of a substance in solution is proportional to its concentration34. Let the light intensity incident on the tissue be \(I_{0}\). After passing through human tissue, the intensity of transmitted light \(I\) can be expressed as Eq. (2).

$$ I = I_{0} e^{ - \varepsilon_{0} C_{0} l - (\varepsilon_{HbO_{2}} C_{HbO_{2}} + \varepsilon_{Hb} C_{Hb}) L} $$
(2)
where \(\varepsilon_{0}\), \(C_{0}\), and \(l\) represent the total extinction coefficient of the non-pulsatile components in tissue and venous blood, the concentration of light-absorbing substances, and the distance travelled by light in tissue, respectively. \(\varepsilon_{Hb}\) and \(\varepsilon_{HbO_{2}}\) represent the extinction coefficients of deoxygenated and oxygenated hemoglobin, and \(L\) represents the distance travelled by light in arterial blood. Then, the absorbance of human tissue \(Ab\) can be defined as Eq. (3).

$$ Ab = - \log \frac{I}{I_{0}} = \varepsilon_{0} C_{0} l + \left( \varepsilon_{HbO_{2}} C_{HbO_{2}} + \varepsilon_{Hb} C_{Hb} \right) L $$
(3)
As the human heart beats, the intensity of the light penetrating the arteries fluctuates between a minimum and a maximum, and the absorbance of human tissue fluctuates accordingly. The term \(\varepsilon_{0} C_{0} l\) cancels between the maximum \(Ab_{\max}\) and the minimum \(Ab_{\min}\), which means the influence of non-pulsatile tissue can be eliminated through the change in light absorption caused by blood volume fluctuations, as shown in Eq. (4).

$$ \begin{aligned} \Delta Ab & = Ab_{\max} - Ab_{\min} = \left( - \log \frac{I_{DC}}{I_{0}} \right) - \left( - \log \frac{I_{AC}}{I_{0}} \right) = \log \frac{I_{AC}}{I_{DC}} \\ & \propto \frac{I_{AC}}{I_{DC}} = \left( \varepsilon_{HbO_{2}} C_{HbO_{2}} + \varepsilon_{Hb} C_{Hb} \right) \Delta L \end{aligned} $$
(4)
where \(I_{DC}\) and \(I_{AC}\) represent the DC and AC components, respectively, and \(\Delta L\) represents the change in the distance travelled by light. For the same subject, \(\Delta L\) is the same at different wavelengths. Therefore, with light of two different wavelengths, its influence can be eliminated, as shown in Eq. (5).

$$ \frac{D_{\lambda_{1}}}{D_{\lambda_{2}}} = \frac{I_{AC}^{\lambda_{1}} / I_{DC}^{\lambda_{1}}}{I_{AC}^{\lambda_{2}} / I_{DC}^{\lambda_{2}}} = \frac{\varepsilon_{HbO_{2}}^{\lambda_{1}} C_{HbO_{2}} + \varepsilon_{Hb}^{\lambda_{1}} C_{Hb}}{\varepsilon_{HbO_{2}}^{\lambda_{2}} C_{HbO_{2}} + \varepsilon_{Hb}^{\lambda_{2}} C_{Hb}} $$
(5)
where \(\lambda_{1}\) and \(\lambda_{2}\) represent the two different wavelengths, and \(D_{\lambda}\) represents the ratio of the AC component to the DC component. Combining Eqs. (1) and (5), the calculation model can be rewritten as Eq. (6).

$$ {\text{SpO}}_{2} = \frac{\varepsilon_{Hb}^{\lambda_{2}} \cdot \frac{D_{\lambda_{1}}}{D_{\lambda_{2}}} - \varepsilon_{Hb}^{\lambda_{1}}}{\left( \varepsilon_{HbO_{2}}^{\lambda_{1}} - \varepsilon_{Hb}^{\lambda_{1}} \right) - \left( \varepsilon_{HbO_{2}}^{\lambda_{2}} - \varepsilon_{Hb}^{\lambda_{2}} \right) \cdot \frac{D_{\lambda_{1}}}{D_{\lambda_{2}}}} $$
(6)
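As a sanity check on the derivation, the inversion from Eq. (5) to Eq. (6) can be verified numerically. This is a minimal sketch, not from the paper; the extinction-coefficient values below are made up for illustration only.

```python
# Hedged numeric check: Eq. (6) should exactly invert Eq. (5) when the
# concentrations are normalized so that C_HbO2 = SpO2 and C_Hb = 1 - SpO2.
# eo*/eh* are hypothetical extinction values, not physiological constants.
def ratio_of_ratios(S, eo1, eh1, eo2, eh2):
    """Eq. (5): R = D_l1 / D_l2 as a function of SpO2 = S."""
    return (eo1 * S + eh1 * (1 - S)) / (eo2 * S + eh2 * (1 - S))

def spo2_from_ratio(R, eo1, eh1, eo2, eh2):
    """Eq. (6): recover SpO2 from the ratio of ratios R."""
    return (eh2 * R - eh1) / ((eo1 - eh1) - (eo2 - eh2) * R)

coeffs = dict(eo1=0.15, eh1=3.2, eo2=0.8, eh2=1.5)  # hypothetical values
for S in (0.84, 0.90, 0.99):  # span the SpO2 range seen in VIPL-HR
    R = ratio_of_ratios(S, **coeffs)
    assert abs(spo2_from_ratio(R, **coeffs) - S) < 1e-9
```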
According to the optical absorption spectra of HbO2 and Hb shown in Fig. 2, when the absorption of Hb and HbO2 at wavelength \(\lambda_{2}\) is equal, Eq. (6) can be written as Eq. (7).

$$ {\text{SpO}}_{2} = \frac{\varepsilon_{Hb}^{\lambda_{1}}}{\varepsilon_{Hb}^{\lambda_{1}} - \varepsilon_{HbO_{2}}^{\lambda_{1}}} - \frac{\varepsilon_{Hb}^{\lambda_{2}}}{\varepsilon_{Hb}^{\lambda_{1}} - \varepsilon_{HbO_{2}}^{\lambda_{1}}} \cdot \frac{D_{\lambda_{1}}}{D_{\lambda_{2}}} $$
(7)
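Equation (7) is linear in the ratio of ratios \(D_{\lambda_{1}}/D_{\lambda_{2}}\), so in practice the two extinction-coefficient terms are absorbed into empirical constants and fitted against reference readings. A minimal sketch of that fit, with synthetic signals; `ac_dc_ratio`, `fit_spo2_line`, and all numeric values are illustrative assumptions, not the paper's calibration.

```python
import numpy as np

def ac_dc_ratio(signal):
    """D = AC/DC: peak-to-peak amplitude over the mean (DC) level."""
    return np.ptp(signal) / np.mean(signal)

def fit_spo2_line(R, spo2_ref):
    """Least-squares fit of the linear model SpO2 = A - B * R."""
    design = np.stack([np.ones_like(R), -R], axis=1)
    (A, B), *_ = np.linalg.lstsq(design, spo2_ref, rcond=None)
    return A, B

# Toy pulsatile trace standing in for one color channel of the BVP signal.
t = np.linspace(0.0, 2.0 * np.pi, 101)
red = 0.5 + 0.05 * np.sin(t)        # AC amplitude 0.1, DC level 0.5
D_red = ac_dc_ratio(red)            # ratio is 0.1 / 0.5 = 0.2

# Calibration against reference SpO2 values from three video windows.
R_vals = np.array([0.5, 0.6, 0.7])
spo2_ref = 110.0 - 20.0 * R_vals    # toy reference line with A=110, B=20
A, B = fit_spo2_line(R_vals, spo2_ref)
```

With noise-free data the fit recovers the assumed line exactly; real calibration would use many windows and a contact oximeter as the reference.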
Figure 2 Absorption spectra of HbO2 and Hb10.

Therefore, red light with a wavelength of 660 nm is usually used as \(\lambda_{1}\). In the visible range, light with a wavelength of 440 nm or 520 nm is usually used as \(\lambda_{2}\). For each subject, this paper obtains the BVP signal from video, calculates the average peak-to-peak value and the mean value of the BVP signals of the different color channels as the AC and DC components, respectively, and finally fits the SpO2 curve. Another traditional method, PBV, is not discussed in this paper because it can only roughly estimate SpO2.

Deep learning method ITSCAN

The proposed SpO2 estimation model based on deep learning is shown in Fig. 3; it consists of a motion branch and an appearance branch. The identifiable image in Fig. 3 is taken from the VIPL-HR dataset. The appearance branch captures the detailed facial features in the video with a Residual Network and introduces them into the motion branch through an attention mechanism. The motion branch uses the Temporal Shift Module (TSM)35 instead of a 3D Convolutional Neural Network (3DCNN) to extract spatial-temporal features, which helps to reduce computation time. Different from TSCAN's frame-by-frame estimation of HR, ITSCAN's motion branch introduces a SpO2 Estimator at the end to map the features learned by the model into a single SpO2 value, realizing second-by-second SpO2 prediction. By doing so, ITSCAN not only significantly reduces computation time by calculating the attention mask only once, but also captures most of the pixels that contain skin and reduces camera quantization error. ITSCAN is derived from MTTS-CAN, but the two differ in task type, application scenario, and model architecture.
MTTS-CAN is used to measure HR and respiration by introducing self-attention masks twice in a regression task, while ITSCAN measures SpO2 by introducing coordinate attention, a residual network, and a SpO2 estimator in a classification task.

Figure 3 Overview of the ITSCAN model.

TSM network

This section introduces a TSM network to extract features carrying physiological information from face videos and uses them to calculate SpO2. The TSM network is shown in Fig. 4.

Figure 4 The TSM network module uses frame dislocation and 2D CNN to achieve synchronous extraction of spatiotemporal features.

2D CNNs are widely used to extract image features, but IPPG needs to extract more complex spatiotemporal features. Therefore, 3D CNNs, which better account for spatiotemporal features, are used for video processing. However, a 3D CNN must process temporal and spatial features at the same time, resulting in a large computational overhead, while a 2D CNN guarantees computational efficiency but cannot effectively extract temporal features. TSM is a natural compromise between the two; its schematic diagram is shown in Fig. 5.

Figure 5 The schematic diagram of specific TSM operations.

In this paper, the channels are divided into three parts: 1/3 of the channels shift forward along the temporal axis, another 1/3 shift backward, and the remaining 1/3 stay unchanged. When a channel shift exceeds the clip length, the excess frames are truncated and the vacated part is filled with zeros. TSM realizes the exchange of information between different frames so that the subsequent 2D CNN operation can extract spatial features that include other time steps.
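The shifting scheme just described can be sketched in a few lines. Below is a minimal numpy version; the \([T, C, H, W]\) feature layout is an illustrative assumption, not the paper's exact implementation.

```python
import numpy as np

def temporal_shift(x):
    """Temporal shift over x with layout [T, C, H, W] (assumed layout).

    1/3 of the channels shift forward in time, 1/3 shift backward, and the
    rest stay put; frames shifted past the clip boundary are truncated and
    the vacated slots are zero-filled.
    """
    T, C, H, W = x.shape
    fold = C // 3
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                  # forward: frame t sees t-1
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]  # backward: frame t sees t+1
    out[:, 2 * fold:] = x[:, 2 * fold:]             # remaining channels unchanged
    return out
```

A subsequent 2D convolution over the shifted tensor then mixes information from adjacent frames at zero extra cost, which is the point of TSM.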
Moreover, the truncation and zero-padding operations ensure the invariance of the original time sequence.

The coordinate attention

Large-amplitude head movements and lighting conditions affect the extraction of facial features, resulting in distorted SpO2 estimates36. Meanwhile, to minimize the negative effects introduced by shifting tensors, an attention module is inserted in ITSCAN. The coordinate attention mechanism performs convolution, normalization, pooling, and activation operations on the horizontal and vertical feature information of the image, integrating spatial coordinate information and constructing a coordinate-aware attention map. In this paper, the coordinate attention module is embedded into the appearance branch; its network structure is shown in Fig. 6.

Figure 6 The coordinate attention module first extracts the height and width features of the input independently and then learns them jointly.

The input can be expressed as \(F = [f_{1}, f_{2}, \ldots, f_{n}]\). Through the average pooling layer, the input \(F\) is encoded separately in the horizontal and vertical directions, allowing the attention module to capture the precise location of facial features. Thus, the output of the c-th channel at width \(w\) can be written as Eq. (8), and the output of the c-th channel at height \(h\) can be written as Eq. (9).

$$ z_{c}^{w}(w) = \frac{1}{H}\sum\limits_{0 \le i \le H} f_{c}(i, w) $$
(8)
$$ z_{c}^{h} (h) = \frac{1}{W}\sum\limits_{0 \le j \le W} {f_{c} (h,j)} $$
(9)
\(H\) and \(W\) represent the height and width of the image, respectively. The size of the horizontal feature map \(Z^{h}\) is \([H, 1]\) and the size of the vertical feature map \(Z^{w}\) is \([1, W]\). The two are concatenated into a new feature map, which is passed through a \(1 \times 1\) convolution layer, a normalization layer, and a \({\text{ReLU}}\) activation layer. Then, the features are evenly split into two parts, \(F_{1}\) and \(F_{2}\), along the second dimension, and each is sent through a \(1 \times 1\) convolution layer so that its dimensions again match the input features of the coordinate attention module. Finally, the output is given by Eq. (10).

$$ {\text{Output}} = F \times F_{1} \times F_{2} $$
(10)
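The flow of Eqs. (8) to (10) can be sketched as follows. For brevity, this toy version collapses the \(1 \times 1\) convolutions, normalization, and ReLU into plain sigmoid gates, so it only illustrates the directional pooling and the final broadcasted product, not the trainable module.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(F):
    """Toy coordinate attention over F with layout [C, H, W].

    Directional average pooling (Eqs. (8)-(9)) followed by a broadcasted
    elementwise product (Eq. (10)); the learned transforms of the real
    module are replaced by sigmoid gates for illustration.
    """
    C, H, W = F.shape
    z_h = F.mean(axis=2, keepdims=True)  # Eq. (9): [C, H, 1], pooled over width
    z_w = F.mean(axis=1, keepdims=True)  # Eq. (8): [C, 1, W], pooled over height
    F1 = sigmoid(z_h)                    # stand-in for conv + norm + activation
    F2 = sigmoid(z_w)
    return F * F1 * F2                   # Eq. (10), broadcast back to [C, H, W]
```

The broadcasting of the \([C, H, 1]\) and \([C, 1, W]\) gates is what lets the mask encode a precise (h, w) position per channel.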
Residual network

This paper refers to the literature25 to construct a residual network for feature extraction, as shown in Fig. 7. In the residual network, the introduction of skip connections greatly alleviates the vanishing gradient problem. At the same time, directly connecting the input features to the output via a bypass reduces information loss and further improves network performance. Experimental testing shows that the residual network works best with 2 residual blocks.

Figure 7 Specific architecture of the Residual Network module, which includes two Residual Blocks.

SpO2 estimator

The SpO2 Estimator is shown in Fig. 8. It maps the high-dimensional features to the SpO2 value at the corresponding time. Since SpO2 takes equally spaced discrete values from 1 to 100%, i.e., the sequential vector \(S = [1, 2, \ldots, 100]\), the channels of the input are compressed to 100 by a \(1 \times 1\) convolution layer, consistent with \(S\). Subsequently, the feature map with these 100 channels is passed through a normalization layer and a Softmax activation layer.

Figure 8 Specific architecture of the SpO2 estimator, which maps the input features to a SpO2 value through the softmax function.

The loss function

In the SpO2 estimation task, the MAE and RMSE between the predicted and real results are the targets that researchers care about. According to NBR ISO 80601-2-61:2015, to validate equipment for SpO2 estimation, the RMSE value must be less than 2% of the declared range11; the smaller the MAE, the better as well. At the same time, if the trend of the estimated SpO2 is basically consistent with the actual values, the estimate can reflect the real situation, so the Pearson coefficient is also used to evaluate the quality of the prediction results. The formulas are defined in Eqs. (11) to (13).

$$ {\text{MAE}} = \frac{1}{T}\sum\limits_{i = 1}^{T} \left| X_{i} - Y_{i} \right| $$
(11)
$$ {\text{RMSE}} = \sqrt{\frac{1}{T}\sum\limits_{i = 1}^{T} (X_{i} - Y_{i})^{2}} $$
(12)
$$ {\text{Pearson}} = \frac{T\sum\nolimits_{i = 1}^{T} X_{i} Y_{i} - \sum\nolimits_{i = 1}^{T} X_{i} \sum\nolimits_{i = 1}^{T} Y_{i}}{\sqrt{\left( T\sum\nolimits_{i = 1}^{T} X_{i}^{2} - \left( \sum\nolimits_{i = 1}^{T} X_{i} \right)^{2} \right)\left( T\sum\nolimits_{i = 1}^{T} Y_{i}^{2} - \left( \sum\nolimits_{i = 1}^{T} Y_{i} \right)^{2} \right)}} $$
(13)
where T is the length of the SpO2 array, X is the gold-standard SpO2 series, and Y is the estimated SpO2 series. In SpO2 prediction with deep neural networks, 5 loss functions are commonly used: MSE22, RMSE21, SmoothL125, NegPearson22, and cross entropy25. SmoothL1, NegPearson, and cross entropy are defined in Eqs. (14) to (16). This paper compares the effectiveness of new loss functions formed by linear combinations of these 5 loss functions in predicting SpO2 and concludes that the combination of SmoothL1 and NegPearson performs best. Specific experimental results are presented in Experiments.

$$ {\text{SmoothL1}} = \frac{1}{T}\sum\limits_{i = 1}^{T} \begin{cases} 0.5 \times (Y_{i} - X_{i})^{2}, & \left| Y_{i} - X_{i} \right| < 1 \\ \left| Y_{i} - X_{i} \right| - 0.5, & \text{otherwise} \end{cases} $$
(14)
$$ {\text{NegPearson}} = 1 - \left| {\text{Pearson}} \right| $$
(15)
$$ {\text{CrossEntropy}} = - \sum\limits_{i = 1}^{T} \sum\limits_{j = 1}^{N} P_{X_{ij}} \log P_{Y_{ij}} $$
(16)
where T is the length of the SpO2 array, X is the gold-standard SpO2, and Y is the estimated SpO2. \(P_{X_{ij}}\) represents the probability that the gold-standard SpO2 at time i equals j, and \(P_{Y_{ij}}\) represents the probability that the SpO2 at time i is estimated to be j.
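The metrics of Eqs. (11) to (13) and the SmoothL1 + NegPearson combination can be sketched as follows. This is a hedged numpy version, not the authors' code; the weighting `alpha` is a hypothetical hyperparameter introduced only for illustration.

```python
import numpy as np

def mae(x, y):
    """Eq. (11): mean absolute error between gold standard x and estimate y."""
    return np.mean(np.abs(x - y))

def rmse(x, y):
    """Eq. (12): root mean square error."""
    return np.sqrt(np.mean((x - y) ** 2))

def pearson(x, y):
    """Eq. (13): Pearson correlation coefficient."""
    return np.corrcoef(x, y)[0, 1]

def smooth_l1(x, y):
    """Eq. (14): quadratic near zero, linear for errors of 1 or more."""
    d = np.abs(y - x)
    return np.mean(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5))

def neg_pearson(x, y):
    """Eq. (15): penalizes weak trend agreement."""
    return 1.0 - np.abs(pearson(x, y))

def combined_loss(x, y, alpha=1.0):
    """SmoothL1 + NegPearson; alpha is a hypothetical mixing weight."""
    return smooth_l1(x, y) + alpha * neg_pearson(x, y)
```

A small offset between two otherwise identical series keeps NegPearson near zero while SmoothL1 stays positive, which is why the combination penalizes both level errors and trend errors.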
