Generalizable and automated classification of TNM stage from pathology reports with external validation

The research performed complies with all relevant ethical regulations; the study protocol was approved by the Columbia Institutional Review Board (IRB), protocol number AAAL0601. The IRB waived informed consent for the study due to its retrospective and anonymized nature, minimal risk, and lack of patient contact.

TCGA pathology report dataset construction with TNM annotation

Pathology reports and associated TNM clinical metadata were downloaded from the TCGA Genomic Data Commons (GDC) data portal (https://portal.gdc.cancer.gov). Reports were initially stored in PDF format; in previous work, we converted the TCGA pathology report corpus to machine-readable plain text using OCR, performed extensive curation, and fully characterized the final TCGA report set. The final dataset spanned 9,523 reports, with a 1:1 patient:report ratio14.

TNM staging annotation was contained within the clinical metadata provided by TCGA. The TNM staging attribute used in this study is pathological stage, i.e., stage based on pathologist assessment of patient tumor slide(s) combined with previous clinical results. This value is distinct from the clinical stage provided by TCGA; we chose pathological rather than clinical staging for ground truth because (1) it is considered the diagnostic gold standard during the course of patient care and (2) information concerning pathologic staging is contained within the report text. Staging was determined in a systematic manner by TCGA across all patients17. All data used for ground-truth labeling were derived from the TCGA metadata as provided by the TCGA data portal.

TNM values were abstracted to numerical values, without additional letter suffixes; for example, N1B was converted to N1. Data availability, or TNM coverage, varied: a given report may have had no associated TNM data, full associated TNM data, or some combination of associated TNM values. Due to this difference in coverage, we separated the data by TNM availability for the individual classification tasks. Each target dataset had a non-uniform target value distribution, to varying degrees (Fig. 1C).

Finally, TCGA annotation of M01 was found to be inconsistent. We examined a random sample of 10 pathology reports, with 5 reports annotated as M0 and 5 reports annotated as M1 in the TCGA metadata. We found that 5/5 reports annotated as M0 were labeled consistently with the AJCC definition of M0. However, 2/5 reports annotated as M1 were not labeled consistently with the AJCC definition of M1 (distant metastasis), but rather contained characteristics similar to the reports labeled M0. From this, we observe that the ground-truth labels for the M01 target may not be uniformly accurate, as they were at times inconsistent with the AJCC definition of distant metastasis and inconsistently applied among reports.

Comparison of clinically pre-trained BERT-based models

For each target, we performed fine-tuning experiments using two model types, CB18 and CBB16. Both models had been pre-trained on a set of clinical notes (MIMIC-III20). CB has consistently performed at a high level across a variety of clinical natural language processing tasks24,25,26. CB contains 108.3 M parameters and is based on the classic BERT architecture15. CB is, however, strongly limited by a maximum input document length of 512 tokens. As a result, reports longer than 512 tokens are truncated during training, and text beyond 512 tokens is not used for model learning.
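To make this truncation behavior concrete, the following minimal sketch encodes a report with a 512-token cap. It assumes the publicly available Bio_ClinicalBERT checkpoint (emilyalsentzer/Bio_ClinicalBERT) stands in for CB; the checkpoint name and input file are illustrative and not necessarily those used in this study.

```python
from transformers import AutoTokenizer

# Illustrative only: the public Bio_ClinicalBERT checkpoint stands in for CB,
# and the report file path is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

with open("example_report.txt") as f:  # hypothetical OCR-converted report
    report_text = f.read()

# BERT-style models accept at most 512 tokens; anything past that point is
# discarded and never seen by the model, during training or at inference.
encoded = tokenizer(
    report_text,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, <=512])
```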
The same limitation applies at inference: when applying the model to an external dataset, reports must again be truncated to 512 tokens, so any information contained in text beyond 512 tokens cannot contribute to model prediction. As many real-world reports are longer than 512 tokens, this is a serious limitation.

A more recent model, CBB, has 128.1 M parameters and adopts the computationally optimized BigBird architecture27. BigBird is based on the BERT architecture but differs in the specification of the attention mechanism. Briefly, a sparse attention mechanism makes longer inputs computationally tractable, providing run-time linear in the number of input tokens (compared with the quadratic run-time of BERT) and better performance on benchmark tasks27. As a result, CBB has a vastly increased document length capacity (4096 tokens), which allows the use of entire reports in both training and application. In the TCGA pathology report dataset, over 66% of reports contain >512 tokens (Table S1), while 12.9% are longer than 2048 tokens, and only 0.7% are longer than 4096 tokens.

Multi-class classification tasks utilizing the TCGA pathology report dataset

We separated reports into reports with M01 annotation, reports with N03 annotation, and reports with T14 annotation. M01 annotation had the least coverage in the TCGA dataset overall. Each report set was divided into training (70%), validation (15%), and held-out test (15%) sets. As each patient corresponded to a single report, no patient spanned more than one train/validation/test (TVT) subset. In addition, when separating the reports into TVT subsets, we balanced on TNM value composition so that the same distribution of values was maintained across TVT subsets. This allowed fair comparison of performance across TVT subsets, with no TVT subset having a greater class imbalance than the dataset overall.

Independent models were trained and hyperparameter-optimized for each of the M01, N03, and T14 classification targets separately, as specified below. We evaluated model performance based on macro AU-ROC and per-class AU-ROC (in a one-vs-all capacity). Each target was evaluated separately.

Hyperparameter optimization, model fine-tuning, and model selection

ClinicalBERT and Clinical-BigBird

For hyperparameter optimization, we performed an iterative grid search across two learning rates, three batch sizes, and three random seeds (used for the train/validation split). Due to memory limitations, the maximum number of input tokens per document that we were able to implement was 2048. We used 512 input tokens for CB (the maximum allowed by the CB model), but for CBB we experimented with 512, 1024, and 2048 (the maximum allowed by our hardware). We fine-tuned each model for 30 epochs. Run-time of CBB experiments was substantially longer than that of CB experiments, with 2048-input-token CBB (CBB-2048) instantiations taking almost 24 h of training run-time per parameter combination. We evaluated model performance based on TCGA validation-set AU-ROC, selecting the best final model per target on this metric. We found that CBB-2048 was the best model type for the T14 and N03 targets, whereas CBB-1024 was best for the M01 target (Table S2).
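The selection procedure amounts to a grid-search loop scored on the validation set. The sketch below is a minimal illustration under assumed grid values (the specific learning rates and batch sizes are not listed in this section), with a placeholder fine_tune_and_validate helper standing in for the actual 30-epoch fine-tuning run, and macro AU-ROC assumed as the selection metric.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Assumed grid values for illustration; the study used two learning rates,
# three batch sizes, and three random seeds, but the exact values are not
# stated in this section.
LEARNING_RATES = [2e-5, 5e-5]
BATCH_SIZES = [4, 8, 16]
SEEDS = [0, 1, 2]
MAX_INPUT_TOKENS = [512, 1024, 2048]  # CBB input lengths explored (CB is fixed at 512)

def fine_tune_and_validate(lr, batch_size, seed, max_tokens):
    """Placeholder: fine-tune CBB for 30 epochs with these hyperparameters and
    return validation-set labels and predicted class probabilities."""
    raise NotImplementedError

best_score, best_config = -1.0, None
for lr, bs, seed, n_tok in product(LEARNING_RATES, BATCH_SIZES, SEEDS, MAX_INPUT_TOKENS):
    y_true, y_prob = fine_tune_and_validate(lr, bs, seed, n_tok)
    # Model selection is driven by macro (one-vs-rest) AU-ROC on the TCGA
    # validation set, computed separately for each target.
    score = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
    if score > best_score:
        best_score = score
        best_config = {"lr": lr, "batch_size": bs, "seed": seed, "max_tokens": n_tok}

print(best_score, best_config)
```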
The final TNM models are made publicly available through Huggingface (https://huggingface.co), a widely used platform and Python library for publishing and downloading LLMs.

Llama 3

Llama 3, developed by Meta AI, is a large-scale language model with 8 billion parameters, designed to capture a wide range of general knowledge and demonstrate state-of-the-art performance on various natural language understanding benchmarks. To adapt Llama 3 to our clinical classification tasks, we employed the Low-Rank Adaptation (LoRA) methodology, which allows efficient fine-tuning of large pre-trained models. LoRA introduces low-rank matrices into the model's attention and feed-forward layers, enabling us to update only a small subset of the model's parameters while keeping the rest frozen. This approach significantly reduces the computational resources required for fine-tuning and allows rapid adaptation to new tasks. For the fine-tuning process, we initialized Llama 3 with its pre-trained weights and introduced LoRA adaptors with rank r = 16 and scaling factor alpha = 16. We fine-tuned the model on the TCGA pathology report dataset, targeting the classification layers for the M01, N03, and T14 staging annotations. Fine-tuning was conducted over 3 epochs with a batch size of 16 and a learning rate of 3e-4. We tested the fine-tuned model as well as the base model.

Evaluation of training time

To compare the training times of the different models, we set up an experiment with identical conditions for the three models, using a single NVIDIA A100 GPU (80 GB of memory, 2 TB/s memory bandwidth). The results, shown in Table S6, demonstrate the direct correlation between parameter count and training time.

Characterization of CUIMC pathology report dataset

We retrieved all reports from the CUIMC pathology report database between 2010 and 2019. We removed empty reports and outside consultation reports. We selected reports with the surgical pathology label, as this label indicates histopathology slide analysis, in contrast to other report types generated by the pathology department. Report text remained intact and was not pre-processed. TNM stage annotations were located in a separate metadata table derived from the tumor registry. We selected patients with non-empty TNM values.

We employed three attributes to match report text to patient TNM annotation: patient ID, report date (matched to TNM diagnosis date), and TNM primary site (Fig. S4). Patient ID was matched exactly across the two databases. For date matching, we allowed up to 90 days between report date and diagnosis date, as there is a lead time between pathologist documentation and official tumor registry stage extraction. We observed that the number of reports overall, as well as the number of reports per patient, increased as the time window was expanded from 0 to 90 days. Additionally, a single patient could have multiple pathology reports potentially associated with a given TNM annotation within the same time window. We therefore imposed an additional matching requirement to ensure report-annotation relevancy, selecting the most relevant report as the one with the greatest number of report string matches to the TNM-associated primary site value. At this stage, the vast majority of patients were associated with a single TNM-report match.
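A minimal pandas sketch of this matching logic is shown below. All file names and column names (patient_id, report_date, report_text, dx_date, primary_site) are assumptions rather than the actual CUIMC schema, and tie-handling (concatenation of equally relevant reports, described next) is noted in a comment but not implemented.

```python
import pandas as pd

# Hypothetical inputs: one table of pathology reports, one tumor-registry table
# of TNM annotations. Column and file names are illustrative.
reports = pd.read_csv("cuimc_reports.csv", parse_dates=["report_date"])
tnm = pd.read_csv("tumor_registry_tnm.csv", parse_dates=["dx_date"])

# 1) Exact match on patient ID across the two tables.
candidates = reports.merge(tnm, on="patient_id", how="inner")

# 2) Keep reports dated within 90 days of the registry diagnosis date.
day_gap = (candidates["report_date"] - candidates["dx_date"]).abs().dt.days
candidates = candidates[day_gap <= 90]

# 3) Rank candidate reports by how often the TNM-associated primary site string
#    appears in the report text, keeping the most relevant report per annotation
#    (in the actual pipeline, equally relevant reports are concatenated).
candidates["site_hits"] = [
    str(text).lower().count(str(site).lower())
    for text, site in zip(candidates["report_text"], candidates["primary_site"])
]
matched = (
    candidates.sort_values("site_hits", ascending=False)
    .groupby(["patient_id", "dx_date"], as_index=False)
    .head(1)
)
```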
In the event that multiple reports were equally relevant, we concatenated the reports to ensure that all relevant TNM information was captured.

In the final CUIMC dataset, most reports had associated T14 annotation, and the fewest reports had M annotation, similar to the TCGA dataset (Table S3A). We tabulated the class imbalance for each target (Table S3D). We found that T4 and N3 were the least-prevalent classes per target, as was the case for the TCGA report set (Fig. 1C). We also found that the proportion of M1 was higher in the CUIMC dataset (20.1%) than in the TCGA dataset (6.7%). The range of diseases is larger for the CUIMC reports than for the TCGA reports: the TCGA dataset spanned 21–23 cancer types, whereas the CUIMC dataset spans 40–42 primary sites (although these terms are not directly comparable). We plotted the primary site distribution for each target report set (Fig. S4), finding that the distributions are similar across the three targets. As in the TCGA dataset, breast and lung are two of the most prevalent cancer sites across all three targets. Finally, using the CBB tokenizer, we computed token statistics for each target dataset (Table S3C). Overall, we found that CUIMC pathology reports were longer than TCGA pathology reports, both on average and at maximum report length. Additionally, we explored the demographics of our dataset and report them in Table S3B.

Application of TCGA-trained models to CUIMC dataset

TNM models were applied directly to the entire CUIMC report set, without any additional fine-tuning. As before, we calculated AU-ROC to evaluate model performance. We found that, as for the TCGA validation and held-out test sets, M01 was the least well-performing model (compared with the T14 and N03 models).

We compared the CUIMC performance of our TNM models to those of Abedian et al.5,10, which was the most comparable work to ours in terms of the use of pathology report text as sole input, the predicted TNM target value ranges (T14, N03, and M01), and the inclusion of multiple cancer types in both train and test sets. Abedian et al. reported F1 rather than AU-ROC. We therefore computed F1 for our models and compared our results to the pan-cancer test set results in ref. 5 (Table S3E). We found that our T14 model performed on par with ref. 5, our N03 model performed somewhat better, and our M01 model performed substantially better than the equivalent model in ref. 5.

We performed three additional experiments to probe our external validation results. First, although we found that the CBB model type achieved the best performance on the TCGA report set, we were interested in whether this result would hold for CUIMC reports. To test this, we applied the best-performing TCGA-trained CB model to the CUIMC report set to predict the N03 target. There was a large difference in performance between CB and CBB across all evaluation metrics, including overall macro and per-class AU-ROC (Table S4). CBB likely performed better than CB due to its increased complexity as well as its increased input token capacity (Table S3C).

Second, we tested whether our primary parameter for report-diagnosis matching, the number of days between diagnosis and report date, had any impact on CUIMC performance. We ran the TCGA-trained models on CUIMC data for each target separately with 0-, 10-, and 30-day windows and compared the results to the performance achieved with 90-day report matching (Fig. S5). In this sensitivity analysis, we found that AU-ROC remained stable as the number of days was varied. For the multi-class targets, T14 and N03, we plotted per-class changes across time windows, finding a slight increase in per-class AU-ROC as the number of days increases. The magnitude of the AU-ROC increase varies by class: the least-prevalent classes (e.g., T4 and N3) show the largest gain in AU-ROC as the number of days increases, likely due to the increased likelihood of report relevance with a wider window.
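The per-class computation underlying these sensitivity curves can be sketched as follows, using the N03 target as an example. The match_reports_and_predict helper is a hypothetical placeholder for re-running the matching step at a given day window and applying the TCGA-trained model; only the one-vs-rest AU-ROC calculation is shown concretely.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

CLASSES = [0, 1, 2, 3]  # N stage values for the N03 target

def match_reports_and_predict(day_window):
    """Placeholder: re-match CUIMC reports with the given day window and apply
    the TCGA-trained N03 model, returning registry labels and class probabilities."""
    raise NotImplementedError

def per_class_auroc(y_true, y_prob):
    """One-vs-rest AU-ROC for each class, as plotted in the per-class curves."""
    y_bin = label_binarize(np.asarray(y_true), classes=CLASSES)
    return {
        f"N{c}": roc_auc_score(y_bin[:, i], np.asarray(y_prob)[:, i])
        for i, c in enumerate(CLASSES)
    }

for window in (0, 10, 30, 90):
    y_true, y_prob = match_reports_and_predict(day_window=window)
    print(window, per_class_auroc(y_true, y_prob))
```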
Finally, we tested the removal of PHI (e.g., medical record number, date of birth) from the preamble of each report for the T14 target. In the CUIMC dataset, most of the patient-identifying text was located in the first few lines of each report, whereas diagnosis information was not typically contained in this preamble section. Our hypothesis was that the model might perform better without extraneous patient details, particularly as these types of details had not been seen by the model when trained on the de-identified TCGA report set. However, we observed only a 0.0001 AU-ROC gain when PHI was removed (Table S5). We determined that PHI removal was not necessary for external validation, as the increased pre-processing effort would yield only a very small performance gain.

Software requirements

For training and testing of our models, we utilized the following Python (version 3.12) packages: numpy (version 1.26.4) for numerical computations, pandas (version 2.2.2) for data manipulation and analysis, scikit-learn (version 1.4.2) for machine learning algorithms, scipy (version 1.13.0) for scientific computing, seaborn (version 0.11.2) for data visualization, transformers (version 4.40.2) for natural language processing, and torch (version 2.3.0) for deep learning. Specifically, for the Llama 3 model, we employed accelerate (version 0.30.0) for optimizing training speed, bitsandbytes (version 0.43.1) for efficient computation, evaluate (version 0.4.2) for performance assessment, huggingface-hub (version 0.23.0) for model sharing, and peft (version 0.10.0) for parameter-efficient fine-tuning.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
