Enhancing fairness in AI-enabled medical systems with the attribute neutral framework

Datasets

In this study, we include three large-scale public chest X-ray datasets, namely ChestX-ray14 [15], MIMIC-CXR [16], and CheXpert [17]. The ChestX-ray14 dataset comprises 112,120 frontal-view chest X-ray images from 30,805 unique patients, collected from 1992 to 2015 (Supplementary Table S1). The dataset includes 14 findings that are extracted from the associated radiological reports using natural language processing (Supplementary Table S2). The original size of the X-ray images is 1024 × 1024 pixels. The metadata includes information on the age and sex of each patient.

The MIMIC-CXR dataset contains 356,120 chest X-ray images collected from 62,115 patients at the Beth Israel Deaconess Medical Center in Boston, MA. The X-ray images in this dataset are acquired in one of three views: posteroanterior, anteroposterior, or lateral. To ensure dataset homogeneity, only posteroanterior and anteroposterior X-ray images are included, leaving 239,716 X-ray images from 61,941 patients (Supplementary Table S1). Each X-ray image in the MIMIC-CXR dataset is annotated with 13 findings extracted from the semi-structured radiology reports using a natural language processing tool (Supplementary Table S2). The metadata includes information on the age, sex, race, and insurance type of each patient.

The CheXpert dataset consists of 224,316 chest X-ray images from 65,240 patients who underwent radiographic examinations at Stanford Health Care, in both inpatient and outpatient centers, between October 2002 and July 2017. Only frontal-view X-ray images are retained, as lateral-view images are removed to ensure dataset homogeneity, leaving 191,229 frontal-view X-ray images from 64,734 patients (Supplementary Table S1). Each X-ray image in the CheXpert dataset is annotated for the presence of 13 findings (Supplementary Table S2). The age and sex of each patient are available in the metadata.

In all three datasets, the X-ray images are grayscale, in either “.jpg” or “.png” format. To facilitate learning by the deep learning model, all X-ray images are resized to 256 × 256 pixels and normalized to the range [−1, 1] using min-max scaling. In the MIMIC-CXR and CheXpert datasets, each finding can take one of four values: “positive”, “negative”, “not mentioned”, or “uncertain”. For simplicity, the last three are combined into the negative label. An X-ray image in any of the three datasets can be annotated with one or more findings; if no finding is detected, the image is annotated as “No finding”.

Regarding the patient attributes, the age groups are categorized as “<60 years” or “≥60 years” [30]. The sex attribute includes two groups: “male” and “female”. In the MIMIC-CXR dataset, the “Unknown” category for race is removed, and patients are grouped as “White”, “Hispanic”, “Black”, “Asian”, “American Indian”, or “Other”. Similarly, the “Unknown” category for insurance type is removed, and patients are grouped as “Medicaid”, “Medicare”, or “Other”. The number and proportion of X-ray images under each attribute and cross-attribute for the three datasets are shown in Supplementary Tables S1 and S3–S5.

All three datasets are divided into training, validation, and test sets in an 8:1:1 ratio (Supplementary Table S6). To prevent label leakage, X-ray images from the same patient are not assigned to different subsets.
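As an illustration of the preprocessing and the patient-level split described above, the following is a minimal Python sketch. The file and column names are hypothetical, and scikit-learn’s GroupShuffleSplit is one convenient way to keep each patient in a single subset; the paper does not specify its tooling.

```python
# Sketch of the preprocessing and patient-level 8:1:1 split (hypothetical file and
# column names; not the authors' exact code).
import numpy as np
import pandas as pd
from PIL import Image
from sklearn.model_selection import GroupShuffleSplit

def preprocess(path: str) -> np.ndarray:
    """Resize a grayscale X-ray to 256 x 256 and min-max scale it to [-1, 1]."""
    img = np.asarray(Image.open(path).convert("L").resize((256, 256)), dtype=np.float32)
    return 2.0 * (img - img.min()) / (img.max() - img.min()) - 1.0

meta = pd.read_csv("metadata.csv")  # assumed columns: image_path, patient_id, ...

# Grouping by patient ID prevents label leakage across subsets; the image-level
# ratio is then approximately 8:1:1.
gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=0)
train_idx, rest_idx = next(gss.split(meta, groups=meta["patient_id"]))
train, rest = meta.iloc[train_idx], meta.iloc[rest_idx]

gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=0)
val_idx, test_idx = next(gss2.split(rest, groups=rest["patient_id"]))
val, test = rest.iloc[val_idx], rest.iloc[test_idx]

assert set(train["patient_id"]).isdisjoint(val["patient_id"])
assert set(train["patient_id"]).isdisjoint(test["patient_id"])
```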
Attribute neutralizer

The AttrNzr is structured based on AttGAN [31], allowing continuous adjustment of attribute intensity while preserving other image information. It consists of two main components: the generator and the discriminator. The generator employs a U-Net structure to encode the original X-ray image into a latent representation and decodes the concatenation of the latent representation and the attribute vector into the modified X-ray image. The discriminator serves as a multi-task image classifier, distinguishing between original and modified X-ray images while identifying the X-ray attribute. The AttrNzr’s parameters are optimized through a loss function that combines attribute classification constraints, reconstruction loss, and adversarial loss (Supplementary Fig. S1).

\({G}_{{enc}}\) and \({G}_{{dec}}\) denote the encoder and the decoder of the generator, and \(C\) and \(D\) denote the attribute classifier and the discriminator. Let \(a\) be the original attribute vector, \(b\) the modified attribute vector, \(\hat{b}\) the attribute vector identified by \(C\), \(Z\) the latent representation, \({x}^{a}\) the original X-ray image with attribute \(a\), \({x}^{\hat{a}}\) the modified X-ray image generated with attribute \(a\), and \({x}^{\hat{b}}\) the modified X-ray image generated with attribute \(b\). \(a\), \(b\), and \(\hat{b}\) each contain \(n\) binary attributes and can be expressed as \(a=\left({a}_{1},\cdots,{a}_{n}\right)\), \(b=\left({b}_{1},\cdots,{b}_{n}\right)\), and \(\hat{b}=\left({\hat{b}}_{1},\cdots,{\hat{b}}_{n}\right)\), respectively.

In the AttrNzr, the image generated by the generator (encoder and decoder) should meet three objectives: 1) \({x}^{\hat{a}}\) is the same as \({x}^{a}\); 2) the attribute of \({x}^{\hat{b}}\) is identified by \(C\) as \(b\); and 3) \({x}^{\hat{b}}\) is identified by \(D\) as a real X-ray image. Therefore, the loss function of the generator \({L}_{{gen}}\) is formulated as follows:
$${L}_{{gen}}={\lambda }_{1}{L}_{{rec}}+{\lambda }_{2}{L}_{{{cls}}_{g}}+{L}_{{{adv}}_{g}},$$
(1)
where \({L}_{{rec}}\), \({L}_{{{cls}}_{g}}\), and \({L}_{{{adv}}_{g}}\) indicate the reconstruction loss, the attribute classification constraint, and the adversarial loss, respectively, and \({\lambda }_{1}\) and \({\lambda }_{2}\) are hyperparameters for balancing the different losses. \({L}_{{rec}}\) is measured by the sum of the absolute differences between \({x}^{a}\) and \({x}^{\hat{a}}\), and is formulated as follows:
$${L}_{{rec}}=\left\Vert {x}^{a}-{x}^{\hat{a}}\right\Vert _{1}.$$
(2)
\({L}_{{{cls}}_{g}}\) is measured by the cross-entropy between \(b\) and \(\hat{b}\), and is formulated as follows:
$${L}_{{{cls}}_{g}}=\sum_{i=1}^{n}-{b}_{i}\log {C}_{i}\left({x}^{\hat{b}}\right)-\left(1-{b}_{i}\right)\log \left(1-{C}_{i}\left({x}^{\hat{b}}\right)\right),$$
(3)
where \({C}_{i}\left({x}^{\hat{b}}\right)\) indicates the prediction of the \({i}^{{th}}\) attribute. \({L}_{{{adv}}_{g}}\) is formulated as follows:
$${L}_{{{adv}}_{g}}=-D\left({x}^{\hat{b}}\right).$$
(4)
In the AttrNzr, the discriminator/attribute-classifier should meet three objectives: 1) identify the attributes of \({x}^{a}\) as \(a\); 2) identify \({x}^{a}\) as a real X-ray image; and 3) identify \({x}^{\hat{b}}\) as a fake X-ray image. Therefore, the loss function of the discriminator/attribute-classifier \({L}_{{dis}/{cls}}\) is formulated as follows:
$${L}_{{dis}/{cls}}={\lambda }_{3}{L}_{{{cls}}_{c}}+{L}_{{{adv}}_{d}},$$
(5)
where \({L}_{{{cls}}_{c}}\) and \({L}_{{{adv}}_{d}}\) indicate the attribute classification constraint and the adversarial loss, respectively, and \({\lambda }_{3}\) is the hyperparameter for balancing the different losses. \({L}_{{{cls}}_{c}}\) is measured by the cross-entropy between \(a\) and the attribute vector produced by \(C\), and is formulated as follows:
$${L}_{{{cls}}_{c}}=\sum_{i=1}^{n}-{a}_{i}\log {C}_{i}\left({x}^{a}\right)-\left(1-{a}_{i}\right)\log \left(1-{C}_{i}\left({x}^{a}\right)\right).$$
(6)
\({L}_{{{adv}}_{d}}\) is formulated as follows:
$${L}_{{{adv}}_{d}}=-D\left({x}^{a}\right)+D\left({x}^{\hat{b}}\right).$$
(7)
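To make Eqs. (1)–(7) concrete, here is a minimal PyTorch sketch of the two objectives. G_enc, G_dec, D, and C are placeholder modules, C is assumed to output logits, and the λ values follow the loss weights reported later in this section (reconstruction 100, attribute classification 10, adversarial 1); this is an illustration, not the authors’ implementation.

```python
# Sketch of the AttrNzr objectives in Eqs. (1)-(7); placeholder networks, not the
# authors' code. The WGAN-GP gradient penalty mentioned later is omitted for brevity.
import torch
import torch.nn.functional as F

def generator_loss(G_enc, G_dec, D, C, x_a, a, b, lam1=100.0, lam2=10.0):
    z = G_enc(x_a)
    x_hat_a = G_dec(z, a)                     # reconstruction with original attributes
    x_hat_b = G_dec(z, b)                     # modification with target attributes
    # Eq. (2) uses an L1 sum; the mean is used here for numerical convenience.
    L_rec = F.l1_loss(x_hat_a, x_a, reduction="mean")
    L_cls_g = F.binary_cross_entropy_with_logits(C(x_hat_b), b.float())  # Eq. (3)
    L_adv_g = -D(x_hat_b).mean()                                         # Eq. (4)
    return lam1 * L_rec + lam2 * L_cls_g + L_adv_g                       # Eq. (1)

def discriminator_loss(G_enc, G_dec, D, C, x_a, a, b, lam3=10.0):
    with torch.no_grad():
        x_hat_b = G_dec(G_enc(x_a), b)        # fake image, detached from the generator
    L_cls_c = F.binary_cross_entropy_with_logits(C(x_a), a.float())      # Eq. (6)
    L_adv_d = -D(x_a).mean() + D(x_hat_b).mean()                         # Eq. (7)
    return lam3 * L_cls_c + L_adv_d                                      # Eq. (5)
```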
The attribute vector comprises binary representations of attributes. For age and sex, “<60 years”/“≥60 years” and “female”/“male” are represented by 0/1. Multiclass attributes such as race and insurance type are encoded using one-hot encoding (Supplementary Fig. S2a). For example, “White” is encoded as \(({\mathrm{1,0,0,0,0,0}})\) and “Hispanic” as \(({\mathrm{0,1,0,0,0,0}})\). In the AttrNzr, an X-ray attribute is adjusted by modifying the attribute vector. The modification intensity α controls the degree of attribute modification; α ranges from 0 to 1, with 0 indicating no modification, 1 indicating negation of the attribute, and 0.5 indicating a neutral attribute (Supplementary Fig. S2b).
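One natural reading of the modification intensity is linear interpolation between each attribute bit and its negation. The following NumPy sketch illustrates that reading; it is an assumption for illustration, not the authors’ exact formula.

```python
# Illustrative attribute-vector modification (an interpretation of alpha, not the
# authors' exact code): interpolate between the original bits and their negation.
import numpy as np

def modify_attribute(a: np.ndarray, alpha: float) -> np.ndarray:
    """alpha=0 -> unchanged; alpha=1 -> negated; alpha=0.5 -> neutral (all 0.5)."""
    a = a.astype(np.float32)
    return (1.0 - alpha) * a + alpha * (1.0 - a)

sex = np.array([1])                  # e.g., "male" encoded as 1
print(modify_attribute(sex, 0.0))    # [1.0]  original
print(modify_attribute(sex, 0.5))    # [0.5]  neutral
print(modify_attribute(sex, 1.0))    # [0.0]  negated ("female")

race = np.array([1, 0, 0, 0, 0, 0])  # one-hot "White"
print(modify_attribute(race, 0.5))   # all entries 0.5: attribute-neutral
```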
The high scalability of the attribute vector allows the AttrNzr to modify not only a single attribute but also multiple attributes simultaneously. For each of the three chest X-ray datasets, single-attribute AttrNzrs and multi-attribute AttrNzrs are trained (Supplementary Table S7).

To enhance the training stability of the AttrNzr, several measures are implemented: 1) Gaussian noise with a mean of 0.1 is added to the X-ray image before it is input to the discriminator; 2) 5% of the fake/real labels are flipped during discriminator training; 3) label smoothing is applied to the attribute vector; 4) random horizontal flips are used to augment the X-ray image dataset; 5) a relatively large convolution kernel of size 6 × 6 is utilized; and 6) the loss weights for the attribute classification constraint, reconstruction loss, adversarial loss, and gradient penalty are set to 10, 100, 1, and 10, respectively. Other training hyperparameters include a learning rate of 0.0001, a batch size of 64, and 300 training epochs. The AttrNzr is trained on a Tesla V100 32 GB GPU.

AI judge for attribute recognition

In this study, judges identify the attributes of the X-ray images generated by our AttrNzr. The first judge is an AI model fully trained on original X-ray images to classify attribute types. The AI judge is used to identify the attributes of X-ray images modified with different intensities; the modification intensity α is set to 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0. To facilitate the evaluation of the AI judge’s performance, only AI judges for the binary attributes (age and sex) are trained in this study.

After considering the performance of various deep learning models in disease diagnosis (see the Disease diagnosis model section), ConvNet [32] is selected to build the AI judge. The AI judge has 2 output nodes, corresponding to “<60 years”/“≥60 years” or “female”/“male”. All parameters of the AI judge are initialized from ConvNet’s pre-training on the ImageNet dataset. Data augmentation techniques [33], including random horizontal flip, random rotation, Gaussian blur, and random affine transformation, are applied to expand the dataset. Other hyperparameters include a learning rate of 0.0005, a batch size of 120, and 100 training epochs. The AI judge is also trained on a Tesla V100 32 GB GPU. After the AI judge is fully trained, Gradient-weighted Class Activation Mapping is used to locate the activated regions of the modified X-ray images.

Human judge for attribute recognition

The second attribute recognition task involves human judges identifying the attributes of X-ray images generated by our AttrNzr. Five junior physicians from the Thoracic Surgery Department of Guangdong Provincial People’s Hospital are invited to act as human judges. Because race and insurance type differ between the regions where the large-scale public chest X-ray datasets were acquired and the regions where the 5 human judges work, the human attribute recognition focuses only on age and sex.

For each attribute, 5 groups of X-ray images are randomly selected from the ChestX-ray14 dataset. Each group contains 40 X-ray images that are modified by the AttrNzr with different modification intensities. To reduce the workload of the human judges, the modification intensity α is limited to five values: 0.0, 0.3, 0.5, 0.7, and 1.0.

Even under different modification intensities, the images within a group remain relatively similar. To prevent the judges’ decisions from being influenced by modified X-ray images of the same group at different modification intensities, no judge assesses the same group more than once, regardless of the modification intensity. The assignment schedule for the five human judges is presented in Supplementary Fig. S3.

Disease diagnosis model

After comparing the disease diagnosis performance of various deep learning networks on the three large-scale public chest X-ray datasets (Supplementary Table S8), ConvNet is selected as the DDM for this study. In these datasets, the “No finding” label and the other finding labels are mutually exclusive, but the other finding labels themselves are not. To simplify the task, disease diagnosis is treated as a multi-label recognition task. In the DDM, the number of output nodes equals the number of finding labels, including the “No finding” label. The activation function of the last layer is the sigmoid, and the loss function is the binary cross-entropy between the targets and the output probabilities. To account for the imbalance of findings in the dataset, we weight the loss of each finding based on the number of X-ray images associated with that finding. The initialization, data augmentation, and hyperparameter settings remain consistent with those of the AI judge.
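As a sketch of such a weighted multi-label head: the inverse-frequency positive weighting and the torchvision ConvNeXt backbone below are stand-ins for illustration; the paper states only that finding losses are weighted by per-finding image counts.

```python
# Sketch of a weighted multi-label BCE head for the DDM (PyTorch); the weighting
# scheme and backbone are assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn
import torchvision

n_findings = 15                                  # 14 findings + "No finding" (ChestX-ray14)
n_images = 100_000                               # stand-in training-set size
counts = torch.randint(500, 50_000, (n_findings,)).float()  # stand-in per-finding counts

# Rarer findings get larger positive weights (one plausible weighting scheme).
pos_weight = (n_images - counts) / counts
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)     # sigmoid + weighted BCE

model = torchvision.models.convnext_tiny(weights="IMAGENET1K_V1")  # stand-in backbone
model.classifier[2] = nn.Linear(model.classifier[2].in_features, n_findings)

x = torch.randn(4, 3, 256, 256)                  # dummy mini-batch
y = torch.randint(0, 2, (4, n_findings)).float() # multi-label targets
loss = criterion(model(x), y)                    # one weighted binary loss per finding
```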
The instability of deep learning introduces uncertainty into the evaluation of DDMs. To ensure reliable evaluation results, we conduct an additional 20 epochs of training after the DDM has converged on the validation dataset, saving the output of the DDM at the end of each of these epochs. The DDM is then evaluated based on the outputs obtained from these 20 epochs.

Alternative unfairness mitigation algorithms

Three alternative algorithms for mitigating unfairness in AI-enabled medical systems are considered in this study: Fairmixup [12], Fairgrad [11], and balanced sampling [18]. The first two require integration into the DDM, while the third is applied solely to the dataset.

In Fairmixup, mixup is employed to generate interpolated samples between different groups [12]. These interpolated samples introduce a smoothness regularization constraint that is incorporated into the loss function of AI models to mitigate unfairness. Mixup can be implemented at both the image and feature levels, referred to as Fairmixup and Fairmixup manifold, respectively. Because interpolated samples are derived from blending two samples, Fairmixup applies to unfairness associated with binary attributes such as age and sex. The implementation of Fairmixup is based on the official algorithm source code (https://github.com/chingyaoc/fair-mixup), with the regularization constraint weight in the loss function set to 0.05.

Fairgrad ensures fairness by assigning lower weights to examples from advantaged groups than to those from disadvantaged groups [11]. This method is applicable only to binary classification tasks; consequently, the multi-label recognition task is decomposed into multiple binary classification tasks (15, 14, and 14 binary classification tasks in the ChestX-ray14, MIMIC-CXR, and CheXpert datasets, respectively). The Fairgrad implementation is based on the official PyPI package (https://pypi.org/project/fairgrad/), and unfairness in the loss function is assessed using equalized odds.

Balanced sampling combats unfairness by constructing group-balanced data: the majority groups are randomly down-sampled to the sample size of the minority group while preserving the proportional distribution of the findings. Details regarding the sample size of the minority group are available in Supplementary Table S1.

For each alternative unfairness mitigation algorithm, the model framework, data augmentation, learning rate, number of training epochs, and other configurations remain consistent with the baseline DDM.

Performance evaluation metrics

The SSIM [34] is utilized to evaluate the similarity between two X-ray images. SSIM is calculated over various windows of an image. The measure between two windows \(x\) and \(y\) of size \(N\times N\) is given by:
$${SSIM}\left(x,y\right)=\frac{\left(2{\mu }_{x}{\mu }_{y}+{c}_{1}\right)\left(2{\sigma }_{{xy}}+{c}_{2}\right)}{\left({\mu }_{x}^{2}+{\mu }_{y}^{2}+{c}_{1}\right)\left({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{c}_{2}\right)}.$$
(8)
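For concreteness, Eq. (8) over 100 × 100 windows can be sketched as follows (the symbols are defined just below). The stabilizer values \(c_1=(0.01L)^2\) and \(c_2=(0.03L)^2\) and the non-overlapping tiling are conventional choices assumed here, not settings stated in the paper.

```python
# Windowed SSIM sketch per Eq. (8); c1/c2 use the conventional (0.01 L)^2 and
# (0.03 L)^2 stabilizers, which are assumptions (the paper does not state them).
import numpy as np

def ssim_window(x: np.ndarray, y: np.ndarray, L: float = 2.0) -> float:
    """SSIM between two windows; L is the dynamic range ([-1, 1] -> 2.0)."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))

def mean_ssim(a: np.ndarray, b: np.ndarray, win: int = 100) -> float:
    """Average SSIM over non-overlapping win x win windows (tiling is an assumption)."""
    scores = [ssim_window(a[i:i + win, j:j + win], b[i:i + win, j:j + win])
              for i in range(0, a.shape[0] - win + 1, win)
              for j in range(0, a.shape[1] - win + 1, win)]
    return float(np.mean(scores))
```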
In Eq. (8), \({\mu }_{x}\) and \({\mu }_{y}\) represent the mean pixel values of \(x\) and \(y\), respectively; \({\sigma }_{x}^{2}\) and \({\sigma }_{y}^{2}\) denote the variances of \(x\) and \(y\), and \({\sigma }_{{xy}}\) represents the covariance between \(x\) and \(y\). The constants \({c}_{1}\) and \({c}_{2}\) are included to stabilize the division when the denominator is weak. The window size is set to \(100\times 100\) in our study.

In attribute recognition, the performance of the AI judge in identifying the original attributes of the modified X-ray images is evaluated using accuracy, sensitivity, specificity, and F1 score. Additionally, the area under the receiver operating characteristic curve (ROC-AUC) is calculated to provide a further evaluation of the AI judge. For the human judges, only accuracy is used to assess performance in identifying the original attributes of the modified X-ray images.

To address the instability of the DDM, the outputs obtained from the 20 epochs after convergence are averaged to obtain a stable output. In assessing the performance of the DDM for each finding, ROC curves and precision-recall (PR) curves are generated, and the corresponding AUC values are computed. Additionally, accuracy, sensitivity, specificity, precision, and F1 score are calculated. These metrics are macro-averaged across all findings to assess the overall performance of the DDM.

Unfairness evaluation metrics

Unfairness is assessed by examining the performance of the various subgroups [20]. In our study, ROC-AUC serves as the primary metric for evaluating model performance. To assess unfairness related to non-binary attributes, we employ two evaluation metrics: (1) Group Fairness, which measures the gap in ROC-AUC between the subgroups with the highest and lowest AUC values (the ROC-AUC gap) [20], and (2) Max-Min Fairness, which evaluates the ROC-AUC of the subgroup with the poorest performance (the worst-case ROC-AUC) [20, 21]. Furthermore, we report values for other performance metrics such as accuracy, sensitivity, and specificity.

Neither the worst-case ROC-AUC nor the ROC-AUC gap reflects the performance differences among all subgroups. The standard deviation (SD) can measure the mutual differences among multiple values, so we introduce the SD of performance [13, 22] as a third unfairness evaluation metric. The performance SD, denoted \({UI}\), is computed as follows:
$${UI}=\sqrt{\frac{\mathop{\sum }_{i=1}^{M}{\left({{met}}_{i}-\overline{{met}}\right)}^{2}}{M}}.$$
(9)
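Eq. (9) transcribes directly into code (the subgroup AUC values in the example are illustrative only):

```python
# Direct transcription of Eq. (9): population standard deviation of a performance
# metric (e.g., ROC-AUC) across the M subgroups of an attribute.
import numpy as np

def performance_sd(metric_per_group) -> float:
    met = np.asarray(metric_per_group, dtype=float)
    return float(np.sqrt(np.mean((met - met.mean()) ** 2)))  # equals np.std(met)

# Example: ROC-AUC of four subgroups (illustrative numbers only).
print(performance_sd([0.86, 0.83, 0.79, 0.84]))  # higher SD -> greater unfairness
```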
In Eq. (9), \(M\) represents the number of groups in the attribute, \({{met}}_{i}\) denotes the performance metric value of the \({i}^{{th}}\) group, and \(\overline{{met}}\) represents the average performance across all groups.

The performance SD quantifies the variation or dispersion of performance among the different groups within an attribute. A low performance SD indicates that the performance of each group is close to the average across all groups, indicating less unfairness. Conversely, a high performance SD indicates that group performance is spread over a wider range, indicating greater unfairness. In our study, the performance SD is also calculated from the stable output of the DDM.

Statistical analysis

SSIM is utilized to assess the similarity between the modified and original X-ray images, and the Pearson correlation coefficient is employed to measure the correlation between this similarity and the modification intensity. The Pearson correlation coefficient is also used to evaluate the correlation between the judges’ identification performance and the modification intensity. The evaluation metrics of the DDM are reported with 95% confidence intervals computed using non-parametric bootstrapping with 1000 iterations. DeLong’s test is employed to test the statistical significance of the difference between two ROC curves. The confidence interval for the difference between two areas under the PR curve is computed using the bias-corrected and accelerated bootstrap method; if the 95% confidence interval does not encompass 0, the difference between the two areas is considered significant (P < 0.05). The comparisons of ROC and PR curves are performed using MedCalc.

For a comprehensive comparison of the relative performance and unfairness mitigation of the different algorithms, we employ the Friedman test [35] followed by the Nemenyi post-hoc test [20]. Relative ranks are first computed for each algorithm within each dataset and attribute independently. If the Friedman test reveals statistical significance, the average ranks are then used for the Nemenyi test. A significance threshold of P < 0.05 is adopted. The outcomes of these tests are presented as critical difference (CD) diagrams [36]. In these diagrams, methods connected by a horizontal line belong to the same group, indicating nonsignificant differences, while methods in distinct groups (not connected by the same line) exhibit statistically significant differences. Because the Fairmixup and Fairmixup manifold techniques are unsuitable for non-binary attributes, the Friedman test and the Nemenyi post-hoc test are performed across 6 {dataset, attribute} combinations: {ChestX-ray14, age}, {ChestX-ray14, sex}, {MIMIC-CXR, age}, {MIMIC-CXR, sex}, {CheXpert, age}, and {CheXpert, sex}.
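As a sketch of the non-parametric bootstrap used for the 95% confidence intervals, the following shows a plain percentile interval for ROC-AUC; the bias-corrected and accelerated variant used for PR-AUC differences is omitted for brevity.

```python
# Non-parametric bootstrap (1000 iterations) for a 95% CI on ROC-AUC. The plain
# percentile interval shown here is a simplification of the paper's procedure.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:              # AUC needs both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [2.5, 97.5])
    return float(lo), float(hi)
```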
Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.