Machine learning algorithm for predicting seizure control after temporal lobe resection using peri-ictal electroencephalography

Study population

Retrospective data were captured from patients who underwent temporal lobe resection for drug-resistant temporal lobe epilepsy at the Cleveland Clinic (Cleveland, OH) from 2011 to 2021. To be included in the study, patients had to have undergone a preoperative scalp EEG evaluation during an inpatient stay in the Cleveland Clinic Epilepsy Monitoring Unit (EMU), with at least one seizure recorded during that stay. Postsurgical seizure outcome at the last follow-up was used as the basis of outcome classification. Patients who were seizure free postoperatively (equivalent to Engel Class IA–D or International League Against Epilepsy [ILAE] Classes 1 and 2) were considered ‘surgical success’ cases, while all others (Engel II–IV or ILAE 3–6) were considered ‘surgical failure’ cases33. To ascertain differences in the baseline characteristics of patients in the surgical success and surgical failure groups, we applied inferential tests: two-sided t tests for comparisons of means, Fisher exact tests for comparisons of proportions with two categories, and Chi-square tests for comparisons with more than two categories.

The study was conducted under an outcomes registry protocol of the Cleveland Clinic Foundation Institutional Review Board, and all methods were reviewed and approved by that board (IRB reference number 16-1539). The requirement for informed consent was waived with the approval of the Cleveland Clinic Foundation Institutional Review Board.

EEG data acquisition

EEG studies at the Cleveland Clinic EMU were recorded using Nihon Kohden JE-921A, JE-120A, and JE-208A headboxes (Nihon Kohden Corporation) with an extended 10–20 electrode placement scheme at a sampling rate of 200 Hz. The standard electrode placement at our center consists of 23 electrodes: Fz, Cz, Pz, Fp1, F3, C3, P3, O1, F7, T7, TP9, FT9, P7, Fp2, F4, C4, P4, O2, F8, T8, P8, FT10, TP10 (Supplementary Fig. 1). For certain analyses, we separately considered ‘temporal’ and ‘extra-temporal’ electrodes; for those analyses, the ‘temporal’ electrodes were TP10, FT10, P8, FT9, P7, T8, F8, TP9, T7, F7. The standard reference electrodes for our headboxes are C3 and C4. No re-referencing was performed.

During inpatient EMU stays, it is typical for patients to have multiple seizures over several days of observation, so a standardized strategy is required to decide which seizure should be selected for each patient. It is theoretically possible that seizures captured later in the course of a patient’s inpatient monitoring stay have more or less value to an outcome prediction model than those captured earlier in the stay. Additionally, some patients have multiple seizure types as classified by scalp EEG. In a prior analysis of 386 temporal lobe epilepsy patients from multiple surgical centers (including Cleveland Clinic)15, we found that while the vast majority (72%) had a single type of ictal EEG pattern on preoperative evaluation, a minority had multiple ictal patterns; the presence of multiple ictal patterns was not an independent predictor of postoperative surgical outcome. We employed a randomization strategy to address these sources of data variability at the individual patient level: for each patient in the cohort, we selected a single seizure at random and captured the EEG file in European Data Format (EDF) for subsequent analysis.
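The per-patient random seizure selection can be sketched as follows; this is an illustrative reconstruction, not the registry code, and the dictionary layout, seed, and file names are hypothetical:

```python
import random

random.seed(42)  # hypothetical fixed seed, for a reproducible selection

def select_one_seizure(seizures_by_patient):
    """Pick one seizure EDF at random for each patient.

    seizures_by_patient: dict mapping patient ID -> list of seizure EDF paths.
    Returns a dict mapping patient ID -> the single selected EDF path.
    """
    return {pid: random.choice(edfs) for pid, edfs in seizures_by_patient.items()}

# Hypothetical usage:
# chosen = select_one_seizure({"pt001": ["sz1.edf", "sz2.edf", "sz3.edf"]})
```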
As a post-hoc analysis, we reviewed 50 seizures from each outcome group and found that the randomization had been efficacious (i.e., there was no statistically significant difference in the proportions of seizure types between outcome groups; see Supplementary Table 1).

At our center, EEG files are annotated with ‘on’ and ‘off’ labels around the time of a seizure as part of the standard clinical workflow for patients undergoing epilepsy surgery evaluation. The ‘on’ label marks the time of seizure onset (i.e., the time at which a trained EEG technologist can see the beginning of organizing seizure activity on the EEG), while the ‘off’ label marks the post-ictal time point at which no further organized seizure activity is ascertainable. The use of technologist-annotated data is reasonable, as large prospective studies have shown that interrater agreement for seizure detection between trained EEG technologists and clinical neurophysiologists is almost perfect (i.e., > 95% in the context of epilepsy patients)34. We captured 2 min of data before the ‘on’ label (i.e., 2 min of pre-ictal data) and 3 min of data after the ‘off’ label (post-ictal data).

EEG preprocessing

Pre-processing of EEG data, including removal of additional electrode channels and harmonization of electrode labels, was performed using the MNE library in Python. Subsequent pre-processing (including automated artifact annotation, generation of power spectra, and feature extraction) was performed in MATLAB (MathWorks).

Artifact detection and annotation were performed using an automated pipeline that we have previously reported35. Briefly, we use a trained and validated SVM classifier that samples the raw EEG time series and annotates each second as either artifact-free or artifactual; subsequent analyses can then be performed exclusively on artifact-free data. When applied to the preictal data (2 min of raw data per patient), the artifact detector identified on average 77 s of artifact-free data per patient in the surgical success group (95% CI 73–80) and 74 s in the surgical failure group (95% CI 71–79). When applied to the postictal data (3 min of raw data per patient), it identified on average 88 s of artifact-free data per patient in the surgical success group (95% CI 81–94) and 82 s in the surgical failure group (95% CI 76–87). A highpass filter (1 Hz) and a notch filter (60 Hz) were applied (filtfilt function in MATLAB).

EEG features for machine learning

We used artifact-free EEG data to generate power spectral density (PSD) information for each patient from both the preictal and postictal epochs. Specifically, a periodogram was computed from the artifact-free segments of each channel (periodogram function in MATLAB), and these were averaged to produce, for each channel, the power spectral density in each frequency bin for a given patient; the first 40 frequency bins (1–40 Hz) were used for model building. This resulted in one [23 channel × 40 frequency bin] matrix of preictal data and another [23 channel × 40 frequency bin] matrix of postictal data per patient. A Z-score normalization was applied on a per-electrode and per-frequency basis for machine learning applications.
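The PSD features were generated in MATLAB; the following minimal Python sketch, using scipy.signal.periodogram as a stand-in (an assumption, not the authors' code), illustrates the equivalent computation under the simplifying assumption of 1-s artifact-free segments (which, at 200 Hz, yield a 1-Hz frequency resolution):

```python
import numpy as np
from scipy.signal import periodogram

FS = 200  # sampling rate in Hz, per the acquisition section

def psd_features(segments):
    """segments: list of [23 channels x 200 samples] artifact-free 1-s segments.
    Returns a [23 x 40] matrix of mean power in the 1-40 Hz frequency bins."""
    spectra = []
    for seg in segments:
        freqs, pxx = periodogram(seg, fs=FS, axis=1)          # pxx: [23 x n_freqs]
        spectra.append(pxx[:, (freqs >= 1) & (freqs <= 40)])  # keep 1-40 Hz bins
    return np.mean(spectra, axis=0)                           # average over segments

def zscore_features(X):
    """X: [n_patients x 23 x 40]; Z-score each (electrode, frequency) feature
    across patients, as described above."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```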
Clinical features

We included the nine clinical features from our previously published nomograms: preoperative monthly seizure frequency; occurrence of a generalized convulsion at any time before surgery (yes/no); cause of seizures (mesial temporal sclerosis/malformation of cortical development/stroke/tumor/other); years of epilepsy duration at the time of surgery; gender (male/female); MRI findings (normal/abnormal); EEG seizure localization (always localizable/sometimes not localizable); and interictal epileptiform discharges (> 80% unilateral/bilateral/no interictal epileptiform discharges). In addition, we included side of surgery (left/right), age at the time of surgery in years, and the follow-up period (at which seizure outcome was assigned) in years. In total, 12 clinical features were included.

Final data structure for ML applications

The preictal and postictal EEG data (two separate [23 × 40] arrays) were flattened and horizontally concatenated into a single one-dimensional array [1 × 1840]. The 12 clinical variables were then horizontally concatenated to this array to produce an “EEG plus clinical variables” array for each patient [1 × 1852]. The final data matrix for all 294 patients was thus [294 patients × 1852 features].

Machine learning

ML was conducted in a Python environment. Consistent with recent trends in ML applications, we applied an AutoML workflow to model building, using the Oracle Data Science platform36,37. Briefly, the AutoML utility is an automated method that fits multiple candidate classifier models to a dataset and uses a grid-search strategy to select an optimal set of hyperparameters and feature subsets based on a cross-validation strategy. The result of the AutoML implementation is a model that can be explored, validated, and optimized by the investigator for the use case at hand.

We took a two-pronged approach to model validation. First, within the AutoML pipeline, we implemented a stratified k-fold cross-validation (k = 4) strategy; accuracy was calculated as the number of out-of-fold predicted labels matching the true labels divided by the total number of samples, and we also calculated the area under the receiver operating characteristic curve (AUC-ROC), precision, and recall. Second, we implemented an ‘out-of-group’ testing strategy wherein we built the model on a stratified training set containing 75% of the total dataset and retained the remaining 25% as an ‘out-of-group’ testing set. This strategy is commonly used to detect overfitting of machine learning models: a significant discrepancy between the cross-validation estimate of accuracy and the out-of-group testing accuracy raises concern for an overfitted model. Train/test splitting was automated using the train_test_split function of the sklearn package. All of the optimized machine learning classifiers reported in the results section can be replicated using publicly available Python packages; the necessary packages and optimized hyperparameters are detailed in Supplementary Table 2.
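A minimal scikit-learn sketch of this validation scheme is shown below; the RandomForestClassifier is a placeholder for illustration only, not one of the AutoML-optimized models of Supplementary Table 2:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # placeholder classifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     train_test_split)

def validate(X, y):
    """X: [n_patients x 1852] feature matrix; y: binary outcome labels."""
    clf = RandomForestClassifier(random_state=0)

    # (1) Stratified 4-fold CV: accuracy from out-of-fold predictions.
    cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    oof_pred = cross_val_predict(clf, X, y, cv=cv)
    cv_accuracy = accuracy_score(y, oof_pred)

    # (2) Out-of-group test: stratified 75/25 train/test split.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)
    clf.fit(X_tr, y_tr)
    test_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

    return cv_accuracy, test_auc  # compare the two to screen for overfitting
```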
Decision curve analysis (DCA)

We implemented DCA as described by Vickers et al.38. DCA moves beyond summary statistical model parameters (e.g., AUC) by incorporating individual preferences and procedure outcomes to quantify the clinical usefulness of a model. This is accomplished by calculating a clinical “net benefit” (NB) for one or more prediction models in comparison to the default strategies of treating all patients, treating none, or treating on the basis of other tests:

$$\text{Net Benefit}\left(p_t\right)=\frac{TP}{N}-\left(\frac{FP}{N}\times\frac{p_t}{1-p_t}\right)$$

where TP = true positives, FP = false positives, N = total number of patients, and $p_t$ = probability threshold.

In effect, the net benefit is a single number that incorporates the true positive rate while penalizing for the harms of false positives. The probability threshold is the probability at which a clinical decision maker would be indifferent between two possible actions. In DCA, the net benefit of a clinical prediction model is plotted over a range of clinically relevant threshold probabilities; curves that lie higher on the plot represent more clinically useful prediction strategies.

To plot a DCA, the range of $p_t$ should be defined clinically a priori as the range within which guidance on risk could be helpful: below the minimum $p_t$, a temporal lobe resection is typically not recommended (the chance of success is too low); above the maximum $p_t$, temporal lobe resection is usually offered; in between lies the gray area where a model could inform the decision. If a model has the highest net benefit across the entire pre-defined range of $p_t$, it should be used when clinically feasible.

DCA also allows estimation of the reduction in unnecessary interventions (i.e., likely unsuccessful surgeries that could have been avoided by using a given prediction model):

$$\text{Reduction in likely unsuccessful (i.e., avoidable) surgeries}=\frac{NB_{\text{prediction model}}-NB_{\text{'treat all' strategy}}}{p_t/\left(1-p_t\right)}$$

DCA was implemented in Python using the dcurves 1.1.0 package.
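For illustration, the two quantities above can be computed directly; this NumPy sketch is not the dcurves implementation, whose API is not reproduced here:

```python
import numpy as np

def net_benefit(y_true, y_prob, pt):
    """Net benefit of a prediction model at probability threshold pt.
    y_true: array of 0/1 outcomes; y_prob: array of predicted probabilities."""
    y_true = np.asarray(y_true)
    pred_pos = np.asarray(y_prob) >= pt
    n = len(y_true)
    tp = np.sum(pred_pos & (y_true == 1))  # true positives
    fp = np.sum(pred_pos & (y_true == 0))  # false positives
    return tp / n - (fp / n) * (pt / (1 - pt))

def net_benefit_treat_all(y_true, pt):
    """Net benefit of the default 'treat all' strategy (all predicted positive)."""
    prev = np.mean(y_true)
    return prev - (1 - prev) * (pt / (1 - pt))

def avoided_surgeries_per_patient(y_true, y_prob, pt):
    """Reduction in likely unsuccessful surgeries, per the second formula above."""
    return (net_benefit(y_true, y_prob, pt)
            - net_benefit_treat_all(y_true, pt)) / (pt / (1 - pt))
```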
