Exploring de-anonymization risks in PET imaging: Insights from a comprehensive analysis of 853 patient scans

Data description

We used a whole-body FDG-PET/CT dataset10 subject to a TCIA licence agreement5. In keeping with the Data Usage Agreement, the dataset cannot be shared here. It was initially provided for deep learning-based PET/CT data analysis aimed at finding tumour lesions in the context of the AutoPET challenge11. It includes images from 1,014 whole-body standard FDG-PET/CT scans of 900 patients. The scans had been acquired over four to eight PET bed positions, most of them covering the skull base to the mid-thigh. For all scans, only the voxel size in the metadata was used. The PET imaging data were originally reconstructed using a 3D ordered-subset expectation maximization algorithm (two iterations, 21 subsets, Gaussian filter 2.0 mm, matrix size 400 × 400, slice thickness 3.0 mm, voxel size 2.04 × 2.04 × 3 mm³), while the CT data were originally reconstructed with the following parameters: reference dose intensity of 200 mAs, tube voltage of 120 kV, and iterative reconstruction with a slice thickness of 2-3 mm. For the purposes of this study, all CT images were resampled to the PET image space (CTres.nii.gz).

Proposed approach

The overall process is provided in Fig. 2. Briefly, we used a semi-supervised approach based on the following three consecutive modules to perform facial recognition from PET imaging data.

Morphology reconstruction module

We first binarized the PET images by thresholding the standardized uptake values (SUVs) at the 95th percentile to remove structural noise while keeping relevant signal to distinguish the skin, on which we would expect to see nonzero voxel values. As Fig. 5a shows, percentile-based thresholding is necessary to remove very low SUV values that represent noise while accounting for the specificities of the SUV count distributions, which make mean-based thresholding (Otsu) inadequate. Indeed, outlier voxels with very large SUV counts bias any threshold derived from mean calculations and lead to skin voxels being treated as noise. For the CT scans, we identified two peaks in the histograms (Hounsfield units, HU): the first, at very low HU, represents air and water, and the second, at very high HU, represents denser tissues such as bone. To exclude low-density tissues, we applied an Otsu threshold to the voxel histograms. In contrast to PET imaging data, CT intensities follow a bimodal distribution, which makes the Otsu threshold better suited to binarization. An example is shown in Fig. 5b. It is also important to note that, for both the PET and CT modalities, removing only zero- or negative-valued voxels (0 SUV and –1024 HU) does not adequately remove structural noise, which warrants the use of the different thresholds. For each patient, we then selected the largest connected component to remove spatially isolated regions and used the marching cubes algorithm12 implemented in scikit-image13 to form iso-surfaces and deduce 3D meshes. As we are interested in 2D images for demonstration purposes only, we projected our 3D meshes to 2D space with raycasting14 using the Open3D15 library. To isolate the upper part of the body while capturing the face at the best angle, we placed the camera centre at 1/8 from the top of the 3D mesh. We then cropped the centred 2D projection to keep only the top 1/4 of the total image, a rule of thumb for keeping the reconstructed facial morphology.
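As an illustration of this module, the following Python sketch shows how the binarization, largest-connected-component selection, mesh extraction, and raycast projection steps could be implemented with NumPy, scikit-image, and Open3D. The function names, camera geometry, field of view, and image size are assumptions for demonstration purposes and are not the exact parameters used in the study.

```python
import numpy as np
import open3d as o3d
from skimage import filters, measure


def binarize_pet(pet_suv):
    """Binarize a PET SUV volume at the 95th percentile of its nonzero voxels."""
    threshold = np.percentile(pet_suv[pet_suv > 0], 95)
    return pet_suv >= threshold


def binarize_ct(ct_hu):
    """Binarize a CT volume with an Otsu threshold on the HU histogram."""
    return ct_hu >= filters.threshold_otsu(ct_hu)


def largest_component(mask):
    """Keep only the largest connected component of a binary volume."""
    labels = measure.label(mask)
    counts = np.bincount(labels.ravel())
    counts[0] = 0  # ignore the background label
    return labels == counts.argmax()


def mask_to_mesh(mask, spacing=(2.04, 2.04, 3.0)):
    """Extract an iso-surface mesh from a binary volume with marching cubes."""
    verts, faces, _, _ = measure.marching_cubes(mask.astype(np.float32), level=0.5, spacing=spacing)
    mesh = o3d.geometry.TriangleMesh(
        o3d.utility.Vector3dVector(verts),
        o3d.utility.Vector3iVector(faces.astype(np.int32)),
    )
    mesh.compute_vertex_normals()
    return mesh


def project_to_2d(mesh, width_px=400, height_px=400):
    """Render a 2D depth projection of the mesh with Open3D raycasting."""
    scene = o3d.t.geometry.RaycastingScene()
    scene.add_triangles(o3d.t.geometry.TriangleMesh.from_legacy(mesh))
    bbox = mesh.get_axis_aligned_bounding_box()
    centre = bbox.get_center()
    extent = bbox.max_bound - bbox.min_bound
    # Camera centre placed at 1/8 from the top of the mesh; the choice of the
    # vertical (z) and anterior (y) axes is an assumption about patient orientation.
    target = [centre[0], centre[1], bbox.max_bound[2] - extent[2] / 8.0]
    eye = [centre[0], centre[1] - 2.0 * extent[1], target[2]]
    rays = o3d.t.geometry.RaycastingScene.create_rays_pinhole(
        fov_deg=60, center=target, eye=eye, up=[0, 0, 1],
        width_px=width_px, height_px=height_px,
    )
    depth = scene.cast_rays(rays)["t_hit"].numpy()  # inf where rays miss the mesh
    return depth[: height_px // 4]  # keep only the top 1/4 of the projection
```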
Finally, we manually excluded patients whose arms were in front of their faces relative to the raycasting viewpoint, as well as patients for whom no full-body PET/CT scan was available, leading to 853 effective patients for the downstream analysis.

Fig. 5 (a) SUV count (PET, left panel) and HU (CT, right panel) mean value distributions before and after thresholding to remove noisy voxels. Mean SUV counts were calculated for all voxels after removing voxels that were exactly zero and after removing all voxels falling under a threshold. Mean HU values were calculated for all voxels after removing voxels at –1024 (referred to as zero thresholding in the figure), after thresholding with the Otsu method, and after applying the thresholding mask calculated from the PET modality of the same patient. (b) An example of CT binarization with Otsu thresholding. (c) An example of PET binarization with 95th percentile thresholding (same patient).

Denoising module

Three different configurations were assessed:

A basic configuration without any denoising, called “original”;

An unsupervised denoising configuration based on standard non-local means16 and wavelet transformations using the scikit-image Python functions, which we call “non-deep” (see the sketch after this list);

A deep learning-based method named “deep”, for which we trained a U-Net17 architecture with 4 levels of 16, 32, 64, and 128 filters. The input image is the reconstructed 2D morphology from the PET scan, and we aim to output a “denoised” face image; the ground truth is the 2D morphology obtained from the corresponding CT scan. For regularization purposes, we added batch normalization18 layers during training and removed them at inference time. Our loss was a modified version of the structural similarity index (SSIM)19. Contrary to traditional losses such as the mean squared error (MSE) or the L1 norm, SSIM better accounts for structural information by aggregating three main features of an image: luminance \(l\), contrast \(c\), and structure \(s\) (a sketch of such a loss is given after this list). We performed a 5-fold cross-validation to test the denoising and downstream landmark detection analysis. For each fold, we split the training data into training and validation subsets to check for overfitting. We trained our model for 30 epochs with a batch size of 32 samples. We used the Adam optimizer20 with a starting learning rate of 0.01. We saved model checkpoints and used the set of model weights that achieved the lowest validation loss during training as the final model for the given fold.
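For the “non-deep” configuration, a minimal sketch of what such an unsupervised pass could look like with scikit-image is given below; the parameter values and the noise estimate are illustrative defaults, not those used in the study.

```python
import numpy as np
from skimage.restoration import denoise_nl_means, denoise_wavelet, estimate_sigma


def non_deep_denoise(image):
    """Illustrative "non-deep" pass: non-local means followed by wavelet denoising.

    `image` is assumed to be a 2D float array in [0, 1] (a reconstructed PET
    facial morphology); the parameters are illustrative, not the study's.
    """
    sigma = float(np.mean(estimate_sigma(image)))  # rough noise estimate
    nlm = denoise_nl_means(image, h=1.15 * sigma, sigma=sigma,
                           patch_size=5, patch_distance=6, fast_mode=True)
    return denoise_wavelet(nlm, rescale_sigma=True)
```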

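For the “deep” configuration, the text does not specify the deep learning framework or the exact modification of the structural similarity index. The sketch below therefore assumes a TensorFlow/Keras setup with a plain 1 − SSIM loss; the model object `unet` and the data arrays in the usage comment are hypothetical placeholders.

```python
import tensorflow as tf


def ssim_loss(y_true, y_pred):
    """Minimal SSIM-based loss: 1 - mean SSIM between target and prediction.

    Assumes image tensors of shape (batch, H, W, 1) scaled to [0, 1]; the study
    uses a modified SSIM whose exact form is not reproduced here.
    """
    return 1.0 - tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))


# Hypothetical usage with a U-Net style Keras model named `unet`:
# unet.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss=ssim_loss)
# unet.fit(pet_images, ct_targets, validation_data=val_data, epochs=30, batch_size=32)
```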
Landmark detection module

The placement of landmarks was performed using a partially accessible end-to-end neural network model, which takes images as input and outputs a dense mesh of 468 vertices placed on the image (MediaPipe library, Python API21). The machine learning pipeline includes two deep neural network models: the first operates on the full image and computes face locations, whereas the second, based on a residual neural network architecture, predicts the 3D landmarks via regression.
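As a minimal illustration of this module, the sketch below places the 468-vertex face mesh on a single 2D reconstruction with the MediaPipe Python API. The input path is hypothetical, and the 0.5 confidence threshold mirrors the cut-off described under Objective 1 below.

```python
import cv2
import mediapipe as mp
import numpy as np

# `face.png` is a hypothetical path to a reconstructed facial morphology image.
image = cv2.cvtColor(cv2.imread("face.png"), cv2.COLOR_BGR2RGB)

with mp.solutions.face_mesh.FaceMesh(
        static_image_mode=True,        # process independent images, not a video stream
        max_num_faces=1,
        min_detection_confidence=0.5,  # outputs below 0.5 are disregarded
) as face_mesh:
    results = face_mesh.process(image)

if results.multi_face_landmarks:
    h, w = image.shape[:2]
    # Convert normalized landmark coordinates to pixel positions.
    landmarks = np.array([(lm.x * w, lm.y * h, lm.z * w)
                          for lm in results.multi_face_landmarks[0].landmark])
    print(landmarks.shape)  # (468, 3)
else:
    print("No face detected with sufficient confidence.")
```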
Statistical analysis

To determine whether we can accurately ‘recognize’ a patient’s face directly from the PET data, we defined three evaluation objectives:
Objective 1: The capacity of our semi-supervised approach to confidently place a set of landmarks on the corresponding face, with and without denoising. The MediaPipe model outputs a confidence score along with the corresponding landmarks. Any model output with a confidence score under 0.5 was considered low quality, following the recommendations of the MediaPipe library, and was therefore disregarded. The MediaPipe model is trained on photographic face images, so its ability to confidently place landmarks on faces derived from PET reconstructions hints at the generalization capability of the facial recognition approach detailed here.

Objective 2: Whether the positions of these landmarks were similar to those placed on the corresponding CT scan of the same patient, as CT scans have higher resolution and notoriously good face recognition potential. We used the mean absolute distance, normalized by the interocular distance21, between the landmarks placed on the 2D PET facial reconstructions and the corresponding landmarks on the CT (see the sketch after this list). We considered these landmarks to be well distributed on the patient’s face if this normalized distance was under 15%. This threshold was chosen qualitatively to reflect the human perception of the similarity between point clouds. The same analysis was conducted with a threshold of 10%, and similar conclusions were drawn.

Objective 3: Whether our model retrieves the correct patient when comparing the landmarks placed on a given PET image to the landmarks placed on all CT morphological reconstructions from the entire dataset. We computed the mean Euclidean distance between each pair of CT/PET landmark point clouds, with and without realignment between the two modalities using the iterative closest point (ICP) algorithm22 (N = 10 iterations); a sketch of this realignment and retrieval step is given after this list. Realignment is necessary to ensure that the correct CT is not matched to a PET solely because of a similar patient position in the scans, but because the facial landmarks closely resemble one another. Results without realignment provide an upper bound on the retrieval performance: they serve as the “best case scenario” in which only the corresponding PET and CT pairs share the same reference frame, owing to the position of the patient during signal acquisition. To perform the realignment, we relied on feature-based registration using the facial landmarks and ICP, as opposed to intensity-based alignment algorithms. This choice was motivated by the need for the alignment to remain applicable when comparing PET-derived images with photographic face images captured in a different setting: real photographs display radically different intensity profiles from images derived from the PET modality and are therefore much harder to realign into a common reference frame with intensity-based methods. Using facial landmarks as the alignment features further ensures that the method generalizes to photographic images compared against our PET facial reconstructions, since the features are by construction tied to facial structures. Relying on descriptors from a method such as SIFT23 would yield unreliable keypoints potentially unrelated to facial features. We considered the top-n accuracies in the comparison task (n in [1, 15]), i.e., the proportion of PET images for which the corresponding CT has the lowest distance (top-1) or is among the n CT scans with the lowest distances. We further computed these top-n accuracies after mixing the CT target landmarks with landmarks obtained from a larger database of 10,168 adult face pictures24, again with realignment. The choice of the Euclidean distance on facial landmarks as our comparison metric also ensures that the proposed pipeline generalizes to comparisons with photographic face images with different intensity profiles.
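The normalized distance used for Objective 2 could be computed as in the following sketch; the eye landmark indices (taken from MediaPipe’s mesh topology) and the array shapes are assumptions for illustration.

```python
import numpy as np


def normalized_landmark_distance(pet_landmarks, ct_landmarks, left_eye_idx=263, right_eye_idx=33):
    """Mean absolute landmark distance normalized by the interocular distance.

    `pet_landmarks` and `ct_landmarks` are (N, 2) or (N, 3) arrays of matched
    landmarks; the eye corner indices (33 and 263 in MediaPipe's face mesh)
    are illustrative and not necessarily those used in the study.
    """
    distances = np.linalg.norm(pet_landmarks - ct_landmarks, axis=1)
    interocular = np.linalg.norm(ct_landmarks[left_eye_idx] - ct_landmarks[right_eye_idx])
    return distances.mean() / interocular


# Landmarks are considered well placed when the normalized distance is under 0.15 (15%).
```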

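For Objective 3, the sketch below illustrates ICP realignment of a PET landmark cloud onto each candidate CT cloud with Open3D, followed by ranking of the candidates by mean Euclidean distance. The correspondence distance, variable names, and ranking helper are illustrative assumptions rather than the study’s implementation.

```python
import numpy as np
import open3d as o3d


def icp_align(source_pts, target_pts, max_iterations=10, max_corr_dist=20.0):
    """Realign a PET landmark cloud onto a CT landmark cloud with point-to-point ICP.

    `max_corr_dist` is an illustrative value; it depends on the landmark units.
    """
    source = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(source_pts))
    target = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(target_pts))
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_corr_dist,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(),
        criteria=o3d.pipelines.registration.ICPConvergenceCriteria(max_iteration=max_iterations),
    )
    source.transform(result.transformation)
    return np.asarray(source.points)


def retrieval_rank(pet_landmarks, ct_landmark_sets, true_index):
    """Rank of the correct CT when candidates are sorted by mean Euclidean landmark distance."""
    distances = []
    for ct in ct_landmark_sets:
        aligned = icp_align(pet_landmarks, ct)
        distances.append(np.linalg.norm(aligned - ct, axis=1).mean())
    order = np.argsort(distances)
    return int(np.where(order == true_index)[0][0]) + 1  # rank 1 means top-1 retrieval
```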