Data source and study population

This study utilized the BraTS-MEN dataset, a large multi-site, heterogeneous, multi-parametric MRI (mpMRI) meningioma dataset17 contributed by academic medical centers across the United States. The BraTS-MEN dataset was originally created as part of a segmentation challenge aimed at developing and evaluating automated algorithms for delineating meningioma tumor boundaries on MRI scans. Cases were identified based on histopathologic assessment following resection or biopsy, or based on a clinical and radiographic diagnosis of meningioma.

For this study, we used the released training split of the BraTS-MEN dataset, as the segmentation masks for the validation split were not publicly available at the time this study was conducted, and developing an automatic segmentation model for meningiomas was beyond the scope of the current study. The initial dataset contained 1000 scans. For patients with multiple MRI scans at different time points, we used only the first available scan, reducing the dataset to 933 scans of 933 patients. We further excluded patients without grading information, resulting in a final dataset of 698 patients [524 (75.1%) with grade 1, 156 (22.3%) with grade 2, and 18 (2.6%) with grade 3 meningiomas].

Each case included pre-operative mpMRI consisting of pre-contrast T1-weighted (T1W), post-contrast T1-weighted (T1C), T2-weighted (T2W), and T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR) sequences. The dataset provided pre-processed images, including co-registered and skull-stripped volumes, as well as segmentation masks for three distinct tumor sub-regions: enhancing tumor, non-enhancing tumor core, and surrounding non-enhancing T2/FLAIR hyperintensity (SNFH). The details of the segmentation process are provided elsewhere17.

Outcome of interest and study design

Our primary objective was to develop a non-invasive method for predicting meningioma grade using only pre-operative mpMRI scans.
We approached this as a binary classification problem, aiming to distinguish low-grade (grade 1) from high-grade (grade 2–3) meningiomas based solely on radiomic features extracted from the pre-operative mpMRI data. This approach could aid pre-operative planning and patient management by providing an early indication of tumor grade without requiring invasive procedures.

Data splitting

We split the data into training (60%), validation (20%), and test (20%) sets. The training set was used for dimensionality reduction and model training, and the validation set for hyperparameter optimization and model calibration. The test set remained completely independent throughout model development and was used solely for final evaluation to obtain unbiased estimates of model performance.

Feature extraction and dimensionality reduction

For feature extraction, we used a single binary segmentation mask obtained by merging the non-enhancing tumor core and enhancing tumor regions, representing the whole tumor without the SNFH. We extracted radiomic features from each patient’s mpMRI sequences (T1W, T1C, T2W, and FLAIR) using the PyRadiomics package18, yielding a total of 4872 radiomic features per patient.

The extraction process yielded eleven distinct categories of features, including original and wavelet-transformed features. The wavelet transformations included low-low-low (LLL), high-high-high (HHH), and various combinations (LLH, LHH, LHL, HLL, HHL, HLH), where ‘L’ denotes a low-pass filter and ‘H’ a high-pass filter. Each feature category encompassed five separate feature classes: first-order statistics (18 features), Gray Level Co-occurrence Matrix (GLCM, 24 features), Gray Level Size Zone Matrix (GLSZM, 16 features), Gray Level Run Length Matrix (GLRLM, 16 features), and Gray Level Dependence Matrix (GLDM, 14 features).
An additional set of 14 shape features was included within the original feature category. Together, these features capture a diverse range of information from the image data, including intensity distributions, textures, shapes, and wavelet-domain characteristics, enhancing the comprehensiveness of the data analyzed by our ML models.

To reduce the dimensionality of the feature space and mitigate the risk of overfitting, we employed LASSO (Least Absolute Shrinkage and Selection Operator) regression. LASSO performs feature selection through its regularization parameter alpha: as alpha increases, more of the model’s coefficients are shrunk to exactly zero. We set alpha to 0.001, which reduced most coefficients to zero, retaining only the most predictive features. This step reduced the number of features from 4872 to 176, which were then sorted in descending order of importance for subsequent use in model development. Feature selection was performed using only the training set; the validation and test sets played no role in this step.

All analyses in this study were performed using Google Colab with a CPU runtime. This cloud-based platform was sufficient for feature extraction, dimensionality reduction, and model training. Because we used traditional machine learning algorithms rather than more computationally intensive deep learning models, no specialized hardware or GPU acceleration was required, demonstrating the computational efficiency of our methodology.

Model development and evaluation

Our study employed five supervised ML algorithms to construct predictive models: TabPFN, XGBoost, LightGBM, CatBoost, and Random Forest.
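As a sketch, this model lineup can be instantiated as follows (hypothetical default hyperparameters; the study tuned each model with Optuna). TabPFN, XGBoost, LightGBM, and CatBoost each ship as separate Python packages, so the sketch adds them only when available:

```python
# Sketch of the five classifier families compared in the study.
# Hyperparameters shown are library defaults, not the tuned values.
from sklearn.ensemble import RandomForestClassifier

models = {"Random Forest": RandomForestClassifier(random_state=42)}

# The other four classifiers live in separate packages; add each one
# only if its package is installed in the current environment.
for name, module, cls in [
    ("XGBoost", "xgboost", "XGBClassifier"),
    ("LightGBM", "lightgbm", "LGBMClassifier"),
    ("CatBoost", "catboost", "CatBoostClassifier"),
    ("TabPFN", "tabpfn", "TabPFNClassifier"),
]:
    try:
        models[name] = getattr(__import__(module), cls)()
    except ImportError:
        pass
```

All five expose the same scikit-learn-style `fit`/`predict_proba` interface, which is what makes a uniform tuning and calibration pipeline across models practical.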
Each algorithm was used to build a separate model for meningioma grade prediction. To address the class imbalance in the training set, we applied the Synthetic Minority Over-sampling Technique (SMOTE), which creates synthetic examples of the minority class (here, high-grade meningiomas) by interpolating between existing minority-class instances, thereby balancing the dataset19.

Feature scaling was performed using min-max scaling, which rescales all numeric variables to a fixed range between 0 and 1. This standardizes the range of the independent variables and can improve the performance and training stability of many ML algorithms. The scaler was fit on the training data and then applied to the validation and test sets.

We used Optuna, a hyperparameter optimization framework20, to tune the hyperparameters of each model, maximizing the area under the receiver operating characteristic curve (AUROC) as the optimization metric. The optimization process involved 100 trials per model, determining both the optimal model-specific parameters and the number of features to be used.

During this optimization, feature selection was performed dynamically: for each trial, the number of top features to use was treated as a hyperparameter, allowing each model to select an optimal subset from the previously sorted feature list and thus to find its own balance between including informative features and avoiding overfitting.

In summary, for each model we ran the Optuna search, selected the top features according to the optimal number determined during optimization, fit the model on the training data, and calibrated its probability outputs using sigmoid calibration21.
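A minimal sketch of this per-model workflow, assuming synthetic data in place of the radiomic features and a Random Forest standing in for any of the five models (the SMOTE and Optuna steps are marked but omitted so the sketch needs only scikit-learn):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the radiomic feature matrix (~75% majority
# class, mirroring the low-grade/high-grade imbalance in the text).
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.75], random_state=0)

# 60/20/20 train/validation/test split, as in the study.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

# SMOTE oversampling of (X_tr, y_tr) would go here, e.g. via
# imbalanced-learn's SMOTE; omitted to keep this scikit-learn-only.

# Min-max scaling: fit on the training data only, then applied to all.
scaler = MinMaxScaler().fit(X_tr)
X_tr, X_val, X_te = map(scaler.transform, (X_tr, X_val, X_te))

# Fit the model on the training set (Optuna tuning omitted).
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Sigmoid (Platt) calibration written out by hand: a logistic
# regression fit on the model's validation-set probabilities, then
# used to map raw test-set probabilities to calibrated ones.
cal = LogisticRegression().fit(clf.predict_proba(X_val)[:, [1]], y_val)
p_test = cal.predict_proba(clf.predict_proba(X_te)[:, [1]])[:, 1]
```

The key discipline the code illustrates is leakage control: the scaler, the model, and the calibrator are each fit only on data partitions that precede the test set in the workflow.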
Following model development, we evaluated final performance exclusively on the held-out test set, which had not been used in any way during model development or tuning.

We evaluated model performance using several metrics, with the AUROC as the primary metric. Additional metrics included precision, recall, F1 score, accuracy, Matthews Correlation Coefficient (MCC), the area under the precision-recall curve (AUPRC), and the Brier score. We calculated 95% confidence intervals (CIs) for these metrics using bootstrap resampling with 1000 iterations. For the binary classification metrics (precision, recall, F1 score, accuracy, and MCC), we used the Youden index to determine the optimal classification threshold22,23, identified as the point on the receiver operating characteristic (ROC) curve where the Youden index (J = sensitivity + specificity − 1), a common metric in diagnostic and prognostic test evaluation, is maximal.

To aid model interpretation and visualization, we generated ROC curves, precision-recall curves (PRCs), calibration curves, and confusion matrices. We also computed SHAP (SHapley Additive exPlanations) values to provide insights into feature importance and model decision-making24.
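The thresholding and bootstrap procedures can be sketched as follows, with hypothetical labels and scores standing in for the real test-set predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted scores (stand-ins for the real
# test-set outputs; clipped to [0, 1] like calibrated probabilities).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, 200), 0.0, 1.0)

# Youden index J = sensitivity + specificity - 1 = TPR - FPR; the
# optimal classification threshold is the ROC point maximizing J.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
j = tpr - fpr
best_threshold = thresholds[np.argmax(j)]

# 95% CI for the AUROC via bootstrap resampling, 1000 iterations.
aucs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), len(y_true))
    if len(np.unique(y_true[idx])) == 2:  # AUROC needs both classes
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
ci_low, ci_high = np.percentile(aucs, [2.5, 97.5])
```

The same resampling loop yields CIs for any of the other metrics by swapping in the corresponding scorer, with the binary metrics computed after applying `best_threshold` to the scores.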