Interpretable surface-based detection of focal cortical dysplasias: a Multi-centre Epilepsy Lesion Detection study

Abstract One outstanding challenge for machine learning in diagnostic biomedical imaging is algorithm interpretability. A key application is the identification of subtle epileptogenic focal cortical dysplasias (FCDs) from structural MRI. FCDs are difficult to visualize on structural MRI but are often amenable to surgical resection. We aimed to develop an open-source, interpretable, surface-based machine-learning algorithm to automatically identify FCDs on heterogeneous structural MRI data from epilepsy surgery centres worldwide. The Multi-centre Epilepsy Lesion Detection (MELD) Project collated and harmonized a retrospective MRI cohort of 1015 participants, 618 patients with focal FCD-related epilepsy and 397 controls, from 22 epilepsy centres worldwide. We created a neural network for FCD detection based on 33 surface-based features. The network was trained and cross-validated on 50% of the total cohort and tested on the remaining 50% as well as on 2 independent test sites. Multidimensional feature analysis and integrated gradient saliencies were used to interrogate network performance. Our pipeline outputs individual patient reports, which identify the location of predicted lesions, alongside their imaging features and relative saliency to the classifier. On a restricted ‘gold-standard’ subcohort of seizure-free patients with FCD type IIB who had T1 and fluid-attenuated inversion recovery MRI data, the MELD FCD surface-based algorithm had a sensitivity of 85%. Across the entire withheld test cohort the sensitivity was 59% and specificity was 54%. After including a border zone around lesions, to account for uncertainty around the borders of manually delineated lesion masks, the sensitivity was 67%. 
This multicentre, multinational study with open access protocols and code has developed a robust and interpretable machine-learning algorithm for automated detection of focal cortical dysplasias, giving physicians greater confidence in the identification of subtle MRI lesions in individuals with epilepsy.


Introduction
The application of machine learning algorithms for diagnostics in biomedical imaging forms a spectrum from automating high-throughput imaging analysis to assisting diagnosis in rarer, clinically challenging pathologies. One barrier to clinical translation is the limited interpretability of these algorithms, leading to a common perception of them as impenetrable 'black boxes'. Identifying focal epileptogenic abnormalities on MRI is an outstanding clinical challenge in patients undergoing presurgical evaluation for drug-resistant focal epilepsy (DRFE). In DRFE, 16-43% of individuals are 'MRI-negative', i.e. no relevant abnormality is visually identified on their MRI scans. [1][2][3] A leading cause of DRFE and the most common histopathology in operated 'MRI-negative' cohorts is a malformation of cortical development, called focal cortical dysplasia (FCD). 4 As post-surgical seizure freedom is affected by whether the FCD can be identified on preoperative structural MRI, 1,5 there has been considerable effort placed in improving the detection of these lesions. However, machine-learning approaches provide little insight into factors determining classification. In clinically ambiguous images, where the need for algorithms is greatest, such insight would enable physicians to determine whether features identified by the classifiers are likely to be lesional in origin.
Despite extensive retrospective work to improve FCD detection, few automated methods have been used prospectively in the presurgical evaluation of patients with epilepsy. Alongside lack of interpretability, there are many additional reasons for this. Initially, many of the frameworks were developed at single epilepsy centres, resulting in small sample sizes and homogeneous datasets, where all patients have been scanned on the same MRI scanner with the same protocol, which reduces the likelihood of robustness of the results and the ability of the method to generalize. Many of these frameworks are not openly available and therefore difficult to reproduce. Although there has been some important research replicating previous methods, 15,18,19 there was a need to develop and validate automated FCD detection tools on multicentre data. Recently, the field has progressed with two large multicentre studies, 11,12 which successfully trained neural networks on voxel-based MRI data from 13 and 11 MRI scanners, respectively, to detect FCDs. However, neither of these studies included any patients with FCD type I lesions, which are particularly difficult to diagnose and represent some of the complex, challenging patients who present to epilepsy surgery centres.
Here, as part of the Multi-centre Epilepsy Lesion Detection (MELD) Project, 20 we aimed to collate a heterogeneous cohort of patients from multiple epilepsy surgery centres, across multiple MRI scanners including both 1.5 and 3 T field strengths; create protocols for decentralized MRI post-processing; and develop an open-access, robust and interpretable surface-based classifier to detect FCD.

MELD project consortium
The MELD project (https://meldproject.github.io/) involves 22 research centres across 5 continents. Each centre received approval from their local institutional review board (IRB) or ethics committee (EC). IRB/EC waived the need for individual patient consent as this was a retrospective study using fully anonymized, routinely available data only.

Participants
Patients were included if they were over age 3, had a 3D preoperative T1-weighted MRI brain scan (1.5 or 3 T) and a radiological diagnosis of FCD or were MRI-negative with histopathological confirmation of FCD. Participants were excluded if they had previous surgery, large structural abnormalities in addition to the FCD or T1 scans with gadolinium enhancement. Controls were included if they were over age 3, did not have epilepsy or another neurological condition and had a T1-weighted MRI brain scan (1.5 or 3 T). Patients scanned for headache could be included as controls if they had no other neurological conditions and the MRI was normal. The patients and controls included were a retrospective convenience sample. Centres, patients and controls were given pseudo-anonymized ID codes. Fig. 1 is an overview of the MELD FCD processing pipeline, which is explained in more detail in the sections below.

Site-level data collection and post-processing
Each site followed the protocols for site-level data collection and post-processing that are available at https://www.protocols.io/researchers/meld-project and detailed in the following sections 'Participant demographics', 'MRI data collection and cortical surface reconstruction', 'FCD lesion masking' and 'Morphological/intensity features'. Structural MRI post-processing protocols were adapted from openly available ENIGMA-epilepsy protocols. 21

Participant demographics
The following data were collected for all patients: age at preoperative scan, sex, age of epilepsy onset, duration of epilepsy (time from age of epilepsy onset to age at preoperative scan), ever reported MRI-negative and histopathological diagnosis (ILAE three-tiered classification system), 22 seizure freedom (Engel class I or other) and follow-up time in operated patients.

MRI data collection and cortical surface reconstruction
3D T1-weighted and FLAIR (where available) MRI scans were collected at the 22 participating centres for all participants. We included MRI data acquired on Siemens, GE and Philips MRI scanners at either 1.5 or 3 T field strengths. Cortical surfaces were reconstructed using FreeSurfer. 23 Sites could process their data using either Linux or Mac operating systems and use either FreeSurfer v5.3 or v6.

FCD lesion masking
FCD lesions were delineated on the T1-weighted MRI scans at each site according to our lesion masking protocol. 24 For patients with a radiological diagnosis of FCD, a volumetric lesion mask was created using the preoperative T1 scan and 3D FLAIR (where available). For MRI-negative patients with histopathological confirmation of FCD, the postoperative scan was used to identify the location of the FCD on the preoperative T1 or FLAIR. A volumetric lesion mask was then created on the preoperative MRI data. In both cases, masks were created by a neuroradiologist, neurologist or experienced epilepsy researcher at each site. Volumetric lesion masks were mapped to cortical reconstructions and small defects were filled in using five iterations of a dilation-erosion algorithm. Patients' lesions were registered to fsaverage_sym.
Interrater reliability in lesion masking was assessed by three expert neuroradiologists independently masking 10 randomly chosen FCD lesions from one site.

Morphological/intensity features
The following measures were calculated in native space per vertex across the cortical surface in all participants: (i) cortical thickness; (ii) grey-white contrast; (iii) mean curvature; (iv) sulcal depth; and (v) intrinsic curvature. Thickness was calculated as the mean minimum distance (in millimetres) between each vertex on the pial and white matter surfaces. 25 Grey-white contrast was calculated as the ratio of the T1 grey matter signal intensity (at 30% of the cortical thickness) to the white matter signal intensity (1 mm below the grey-white matter boundary). 26 Mean curvature was calculated at the grey-white matter boundary as 1/r, where r is the radius of curvature, so that the value equals the mean of the principal curvatures k1 and k2. 27 Sulcal depth was calculated from the dot product of the movement vectors of the cortical surface during inflation. Intrinsic curvature was calculated as the product of the principal curvatures k1 and k2. 28 In participants with FLAIR data, FLAIR signal intensity was sampled at 25%, 50% and 75% of the cortical thickness (GM FLAIR 25%, 50%, 75%), as well as at the grey-white matter boundary and 0.5 and 1 mm subcortically (WM FLAIR 0.5 mm, 1 mm).
To increase the stability of per-vertex measures, features were smoothed with Gaussian kernels: 5 mm for mean curvature and sulcal depth, and 10 mm for cortical thickness, grey-white contrast and FLAIR intensities at all cortical and subcortical depths. Intrinsic curvature was smoothed with a 20 mm Gaussian kernel to provide a measure of folding pattern abnormalities that is stable across adjacent gyri and sulci. All features were registered to the bilaterally symmetrical template space, fsaverage_sym. Only anonymized participant demographic details and data matrices of anonymized features and lesion masks were shared with the MELD Project coordinators for multicentre analysis.

Centralized quality control and post-processing
Quality control and data harmonization of surface-based data
Automated quality control was performed on the surface-based features to identify subjects with extreme structural and intensity values across multiple features and cortical areas, likely caused by imaging artefacts such as signal biases or FreeSurfer segmentation errors. A feature was considered an outlier if, in more than 10 non-lesional regions (from the Desikan-Killiany atlas), its value lay more than 2.7 standard deviations above or below the mean of all participants' values. 21 Participants were considered outliers if they had multiple extreme features: two for participants with T1-weighted scans only and three for participants with FLAIR MRI scans available. Participants identified as outliers were excluded from all subsequent analyses. For further details see Supplementary Fig. 1.
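The outlier rule above can be illustrated with a minimal sketch. The data layout (per-region mean values for each participant and feature), the function names, the use of the population standard deviation and the reading of 'two' and 'three' as 'at least two' and 'at least three' are all assumptions made for illustration; this is not the MELD implementation.

```python
from statistics import mean, pstdev

Z_LIMIT = 2.7      # threshold in standard deviations (from the text)
MAX_REGIONS = 10   # a feature is an outlier if MORE than 10 regions are extreme


def outlier_features(values, z_limit=Z_LIMIT, max_regions=MAX_REGIONS):
    """values[participant][feature] is a list of per-region means (hypothetical layout).

    Returns {participant: set of features flagged as outliers}.
    """
    participants = list(values)
    features = list(values[participants[0]])
    flagged = {p: set() for p in participants}
    for f in features:
        n_regions = len(values[participants[0]][f])
        extreme = {p: 0 for p in participants}
        for r in range(n_regions):
            column = [values[p][f][r] for p in participants]
            mu, sd = mean(column), pstdev(column)
            if sd == 0:
                continue  # no variation in this region, nothing can be extreme
            for p in participants:
                if abs(values[p][f][r] - mu) > z_limit * sd:
                    extreme[p] += 1
        for p in participants:
            if extreme[p] > max_regions:
                flagged[p].add(f)
    return flagged


def is_outlier_participant(n_flagged, has_flair):
    """Assumed reading: >= 2 flagged features (T1 only) or >= 3 (FLAIR available)."""
    return n_flagged >= (3 if has_flair else 2)
```

A participant would then be excluded when `is_outlier_participant(len(flagged[p]), has_flair)` is true.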
Due to heterogeneity in MRI scanner hardware, scanner field strength, operating systems and FreeSurfer versions, which can all affect morphological and intensity feature values, 29 features were harmonized using ComBat 30 to control for non-biological variance while retaining biological covariates (age, sex and disease status; Supplementary Fig. 2). Independent test sites were harmonized to the main cohort ( Supplementary Fig. 2B). The harmonized data set features are henceforth referred to as 'ComBat' features.

Three-stage normalization of features
Surface-based MRI features underwent three normalization procedures to highlight feature abnormalities.
Step 1: To account for interindividual shifts in feature distributions, such as age and sex-related changes, features were normalized using intrasubject z-scoring. For example, the cortex is thicker in a 3-year-old than in a 60-year-old (Supplementary Fig.  2A). After intrasubject z-scoring, thickness metrics for both participants will all have a mean of 0 and a standard deviation of 1.
To account for interregional variability in features, two further normalization steps were carried out: interhemispheric asymmetry and per-vertex normalization by controls.
Step 2: Interhemispheric asymmetry maps of features were created by subtracting right hemisphere vertex values from left hemisphere values and vice versa. This procedure leverages the normal symmetry of cortical morphometric features and quantifies a key heuristic used to detect FCDs on radiological review, highlighting vertices that are significantly different from the contralateral side.
Step 3: The outputs from steps 1 and 2 were z-scored by the mean and standard deviation of features at each vertex from healthy controls to adjust for normal interregional variability. For example, the cortex in frontal regions is normally thicker than in the occipital cortex. By normalizing by the control values at each vertex, we can account for this normal variability to accentuate features that are abnormal for their position in the cortex.
The output of these normalization steps is a set of intrasubject and intersubject normalized features (henceforth 'normalized' features) and a set of intrasubject, asymmetry and intersubject normalized features (henceforth 'asymmetry' features).
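The three normalization steps can be sketched as follows for a single feature, assuming the feature is stored as a plain list of per-vertex values on a symmetric template so that left and right vertices correspond one to one. The function names and the use of the population standard deviation are illustrative assumptions rather than details taken from the MELD code.

```python
from statistics import mean, pstdev


def zscore(values):
    """Step 1: intrasubject z-scoring across the vertices of one feature."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def asymmetry(left, right):
    """Step 2: left-minus-right and right-minus-left maps at matched vertices."""
    lh = [l - r for l, r in zip(left, right)]
    rh = [r - l for l, r in zip(left, right)]
    return lh, rh


def normalize_by_controls(subject, controls):
    """Step 3: z-score each vertex by the controls' mean and SD at that vertex.

    `controls` holds one per-vertex list per control, processed the same way
    as `subject` (step 1 only, or steps 1 and 2).
    """
    out = []
    for i, v in enumerate(subject):
        ctrl = [c[i] for c in controls]
        out.append((v - mean(ctrl)) / pstdev(ctrl))
    return out
```

Under these assumptions, 'normalized' features correspond to `normalize_by_controls(zscore(x), ...)`, and 'asymmetry' features to applying `asymmetry` between the two steps.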

Characterization of focal cortical dysplasia features on MRI
Surface-based morphological features were calculated within the lesion masks of all patients. For controls, data were sampled from similarly sized regions for comparison. T 1 -derived features, available in all subjects, underwent Uniform Manifold Approximation and Projection (UMAP) embedding, 31 a non-linear dimensionality reduction where similar examples are plotted closer together. Lesions were clustered into groups according to their UMAP locations using a Gaussian mixture model.

Border zones
Lesion masks were drawn conservatively, to maximize the proportion of lesional vertices within the mask. There is inherent uncertainty in the precise borders of manually delineated lesion masks. Feature abnormalities extended approximately 40 mm beyond the lesion (Supplementary Fig. 3). To account for this uncertainty, border zones were created around each lesion mask extending 20 and 40 mm across the cortical surface. Vertices between 0 and 40 mm from the lesion mask were excluded from training to reduce training on mislabelled data. Predicted lesion clusters within 20 mm of the lesion masks were classified as detected for the sensitivity+ metric (see the 'Evaluation metrics' section).

Network training, testing and interpretation
Cohort splitting
An artificial neural network was trained on per-vertex postprocessed MRI features (ComBat, Asymmetry and Normalized), after border zones had been removed (33 total input features). The full cohort (excluding two independent test sites) of patients and controls were randomly assigned to either the train cohort (278 patients, 180 controls) or the test cohort (260 patients, 193 controls) ( Table 1). All experiments to determine the optimal data processing and network parameters were carried out through 10-fold cross-validation on the train cohort. The 10 folds were determined by a random partition of subjects in the train cohort. Hyperparameters were selected according to the aggregated performance metrics of each of the 10 cross-validation models on their respective validation set.

Network hyperparameters and training
The neural network architecture had two hidden layers (with 40 and 10 nodes, respectively) and one output node and used a dropout of 0.4 on the input layer for learning more robust representations. To adjust for the class imbalance between healthy and lesional examples, for each patient 2000 random lesional and non-lesional vertices were sampled per epoch. If a patient had fewer than 2000 lesional vertices, existing lesional vertices were randomly drawn multiple times. A focal loss 32 was used to concentrate network training on difficult examples. After training, the network predictions were thresholded using an optimal threshold determined based on the Dice (F1) score on the train cohort. For the full list of optimized parameters see Supplementary Table 1.
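As an illustration of the class-balanced sampling and the focal loss, the sketch below uses plain Python. It assumes that 2000 vertices are drawn from each class, that oversampling is done with replacement, and that the focusing parameter is gamma = 2.0; the actual values are in Supplementary Table 1 and are not reproduced here, and the function names are hypothetical.

```python
import math
import random


def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss for one vertex; gamma=2.0 is an assumed value.

    p is the predicted lesion probability, y the label (1 lesional, 0 healthy).
    With gamma=0 this reduces to the usual cross-entropy.
    """
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def sample_vertices(labels, n_per_class=2000, rng=None):
    """Draw n_per_class lesional and n_per_class non-lesional vertex indices."""
    rng = rng or random.Random()
    picked = []
    for cls in (1, 0):
        pool = [i for i, lab in enumerate(labels) if lab == cls]
        if not pool:
            continue  # e.g. a scan with no lesional vertices
        if len(pool) >= n_per_class:
            picked += rng.sample(pool, n_per_class)
        else:
            # too few vertices of this class: redraw existing ones
            picked += rng.choices(pool, k=n_per_class)
    return picked
```

The factor `(1 - p_t) ** gamma` shrinks the loss of vertices that are already classified confidently, so training concentrates on the difficult ones.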
The following experiments were conducted to evaluate the impact of smoothing kernel size and feature normalization on classifier performance: (i) morphological and intensity features were smoothed with Gaussian kernels ranging from 3 to 25 mm and models were retrained using these smoothed features; and (ii) three models were retrained using (a) ComBat, (b) ComBat and normalized and (c) ComBat, normalized and asymmetry features. For these experiments, analyses were restricted to the train cohort. On each of 10 folds, a classifier was trained 10 times with random initializations and an ensemble of the 10 models was evaluated on the fold's validation cohort. Results were aggregated across the 10 folds.
For the final training and testing of the model after data and hyperparameter optimization, a classifier was trained five times with different random initializations on each of 10 training folds. The resulting 50 models were combined into one final ensemble model 33,34 by averaging the individual models' predictions. For every input, the final model will therefore run each of the 50 individual models and output the average lesional probability predicted by these models to increase predictive performance and stability. This final model was evaluated on the test cohort. To calculate individual performance statistics for subjects in the train cohort, a second ensemble network was trained in a similar manner on the test cohort and evaluated on the train cohort.
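Ensemble averaging and the Dice-based choice of threshold can be sketched as below. Each model is represented by a hypothetical callable that maps a feature matrix to per-vertex lesion probabilities, and the candidate thresholds are supplied by the caller; both are assumptions for illustration.

```python
def ensemble_predict(models, features):
    """Average the per-vertex lesion probabilities of all models."""
    per_model = [model(features) for model in models]
    return [sum(vals) / len(per_model) for vals in zip(*per_model)]


def dice(pred, truth):
    """Dice (F1) score between two binary per-vertex label lists."""
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    return 2.0 * overlap / total if total else 0.0


def best_threshold(probs, truth, candidates):
    """Return the candidate threshold with the highest Dice score."""
    def score(thr):
        return dice([int(p >= thr) for p in probs], truth)
    return max(candidates, key=score)
```

In the setting described above, `models` would contain the 50 trained networks and `best_threshold` would be evaluated on the train cohort only.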

Evaluation metrics
Per-vertex lesion predictions for each individual were grouped into spatially connected clusters on the surface mesh. Clusters smaller than 100 vertices (approximately 0.5 cm²) were filtered out as these are disproportionately false positives (Supplementary Fig. 4). The following outcome measures were calculated: (i) sensitivity, defined as the proportion of patients where a predicted lesion cluster overlapped the manual lesion mask; (ii) sensitivity+, defined as the proportion of patients where a predicted lesion cluster overlapped the manual lesion mask or the border zone; (iii) specificity, defined as the proportion of controls with zero clusters; (iv) average number of clusters per patient; and (v) average number of clusters per control.
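The clustering and the per-subject metrics can be sketched with a breadth-first search over the mesh. The mesh is assumed to be given as a list of neighbour lists (one per vertex), and lesion masks as sets of vertex indices; these representations and the function names are illustrative assumptions.

```python
from collections import deque


def clusters(pred, neighbours, min_size=100):
    """Group predicted vertices into connected clusters; drop small ones."""
    seen, out = set(), []
    for start, flag in enumerate(pred):
        if not flag or start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            v = queue.popleft()
            comp.append(v)
            for w in neighbours[v]:
                if pred[w] and w not in seen:
                    seen.add(w)
                    queue.append(w)
        if len(comp) >= min_size:
            out.append(comp)
    return out


def detected(cluster_list, region):
    """True if any cluster shares at least one vertex with the region."""
    region = set(region)
    return any(region.intersection(c) for c in cluster_list)


def sensitivity(patient_clusters, patient_regions):
    """Proportion of patients with a cluster overlapping their region."""
    hits = sum(detected(c, r) for c, r in zip(patient_clusters, patient_regions))
    return hits / len(patient_regions)


def specificity(control_clusters):
    """Proportion of controls with zero predicted clusters."""
    return sum(len(c) == 0 for c in control_clusters) / len(control_clusters)
```

Passing the lesion mask as the region gives sensitivity; passing the mask extended by the border zone gives sensitivity+.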

Network performance evaluation
Three complementary methods to understand and interrogate classifier performance and behaviour were used.
To determine how demographic and clinical factors influenced whether lesions were successfully detected by the classifier, two logistic regression models were used. The first included presurgically available variables: sex, scanner field strength, lesion hemisphere and FLAIR availability. The second included post-surgical variables (histopathological diagnosis and seizure freedom) and was applied to the cohort of patients who had undergone surgery. Statistical significance was determined through repeating regression analysis on randomly permuted cohorts (1000 permutations). Correction for multiple comparisons used the Benjamini-Hochberg procedure. 35
To understand classifier predictions, MRI features from predicted clusters were transformed into the UMAP embedded space described above.
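The permutation test and the Benjamini-Hochberg correction used for the regression models can be sketched as below. The null statistics are assumed to be regression coefficients refitted on permuted cohorts, the p-value uses the common add-one convention, and the false discovery rate q = 0.05 is an assumed value; none of these details are stated in the text.

```python
def permutation_p(observed, null_stats):
    """Two-sided permutation p-value (add-one convention, an assumption)."""
    extreme = sum(abs(s) >= abs(observed) for s in null_stats)
    return (extreme + 1) / (len(null_stats) + 1)


def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank k with p_(k) <= (k / m) * q and
    reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject
```

Because the rule is step-up, a p-value that misses its own rank threshold can still be rejected when a larger p-value meets a later threshold.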
To understand which specific features drove network predictions, integrated gradients saliency was computed. 36 This method computes which features are important to the network by looking at the integral (Riemann approximation) of the gradients computed from a baseline input (0 for each feature) to the actual feature values for each vertex.
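A minimal sketch of integrated gradients with a zero baseline is given below. A single logistic unit with an analytic gradient stands in for the trained network, and the midpoint Riemann rule with 50 steps is an assumed choice; weights, bias and function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, bias):
    """Stand-in classifier: a single logistic unit (not the MELD network)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)


def gradient(x, w, bias):
    """Analytic gradient of the stand-in model with respect to its inputs."""
    p = model(x, w, bias)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, w, bias, steps=50):
    """Riemann (midpoint) approximation of integrated gradients, baseline 0."""
    baseline = [0.0] * len(x)
    total = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = (k - 0.5) / steps
        point = [b0 + alpha * (xi - b0) for b0, xi in zip(baseline, x)]
        grad = gradient(point, w, bias)
        total = [t + g for t, g in zip(total, grad)]
    # scale the averaged gradient by the distance from the baseline
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, total)]
```

A useful sanity check is completeness: the attributions should sum approximately to the difference between the model output at the input and at the baseline.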

Data availability
All data analysis was performed in Python. All protocols and code are available to download from https://www.protocols.io/researchers/meld-project and www.github.com/MELDProject/meld_classifier. Requests for access to the MELD dataset can be made through the project website https://meldproject.github.io/.

Participant demographics
After excluding patients with missing lesion labels (n = 37) and outliers (n = 14), a total of 571 FCD patients were included (Table 1). Each epilepsy surgery centre contributed 6-87 patients. Four hundred and nineteen patients underwent surgical intervention (73%) and histopathological diagnosis was available in 384 patients (92% of operated patients). Post-surgical outcome data were available in 361 patients (86% of operated patients); 68% were seizure-free (Engel class I) at last follow-up (median follow-up = 2 years).

Interrater agreement in lesion masking
A set of three expert-defined lesion masks were created for 10 randomly selected subjects from one site ( Supplementary Fig. 5). The mean fraction mask overlap between rater-rater pairs was 42%, indicating that lesion annotations are likely to be heterogeneous. However, adding a border zone of 20 and 40 mm to the first rater's mask led to the overlap increasing to 82% and 94%, respectively. In a binary test of whether masks overlapped, with a border zone of 20 mm, there was at least one vertex overlap between all pairs of masks.

Focal cortical dysplasia lesion characterization
UMAP embedding of surface-based features from manual lesion masks and equivalent healthy cortex in the full cohort is shown in Fig. 2A. Compared to healthy control cortex, many lesions exhibited a distinct set of MRI features. There was heterogeneity in the set of abnormal features, with three distinct groups emerging (Fig. 2B). Group 1 was predominantly composed of FCD type IIA, IIB and unoperated lesions. These lesions were generally located at the bottom of a sulcus and characterized by increased intrinsic curvature, increased cortical thickness, decreased grey-white matter contrast and increased FLAIR in the white matter. Group 2 lesions were characterized by increased intrinsic curvature, decreased grey-white matter contrast and decreased intracortical FLAIR. Group 3 lesions, in which the lesional features overlapped with healthy cortex, were more heterogeneous and had less extreme feature values.

Impact of feature preprocessing on classifier performance
Performance of the classifier on the test cohort, full cohort and two independent sites is listed in Table 2. For the 278 patients in the train cohort, we assessed the impact of feature normalization procedures and smoothing kernels on classifier performance to establish the optimal input data for the classifier. There was an improvement in sensitivity+ (from 54% to 65%), sensitivity (from 44% to 59%) and specificity (from 17% to 44%) following the three-stage normalization of the data (Supplementary Table 2). As the Gaussian smoothing kernel size increased (Supplementary Fig. 6), classifier sensitivity decreased. However, the number of detected clusters in patients and controls also decreased (Supplementary Fig. 6). Based on these experiments we decided that using a 5 mm Gaussian kernel for sulcal depth and mean curvature, 10 mm for cortical thickness, grey-white contrast and FLAIR intensities at all cortical and subcortical depths and 20 mm for intrinsic curvature represents an acceptable trade-off between falling sensitivity and rising specificity.

Detection in the test cohort
For the 260 patients in the test cohort, the classifier predicted a median of 2 (interquartile range: 1-3) clusters (Table 2). These clusters overlapped with the manual lesion mask in 154 patients (sensitivity = 59%) and overlapped with the extended lesion mask (including border zones) in 174 patients (sensitivity+ = 67%). For the 193 controls in the test cohort, the classifier predicted a median of 0 (interquartile range: 0-1) clusters. No cluster was predicted in 105/193 controls (54% specificity). Examples of individual predictions for detected and undetected lesions are presented in Fig. 3.

Detection in the full cohort
In the full cohort (538 patients, 373 controls), i.e. including predictions from training the network on the test dataset and testing on the train dataset, results were similar to those on the test cohort only. Sensitivity was 58%, sensitivity+ was 65% and specificity was 52% (Table 2). The classifier predicted a median of two clusters in patients and zero clusters in controls. Out of the 178 patients who were 'ever reported MRI-negative', clusters overlapped with the extended lesion mask (including border zones) in 112 patients (sensitivity+ = 62.9%, Table 3). On a restricted cohort of patients with T1 and FLAIR data, who had histopathologically confirmed FCD type IIB and were seizure-free, sensitivity was 85% (Table 3). Classifier performance according to histopathology is presented in Table 3.
One hundred and thirty-five of 364 histopathologically confirmed FCDs were 'ever reported MRI-negative', indicating a 'human false negative' rate of 37%. The classifier was able to detect 69% of these challenging cases.

Detection on independent test sites
When testing the classifier on the two independent sites (Table  2), sensitivity was 88% for site 1 (sensitivity+ 94%) and 56% for site 2 (sensitivity+ 62%). Specificity for site 1 was 17%, lower than expected compared to the full cohort. Performance variability is likely due to small sample sizes, which lead to large uncertainty in estimations of predictive performance. 37 Nevertheless, these data suggest that, after data harmonization, the algorithm  can generalize to detect FCDs on data from new, previously unseen sites.

Evaluating network performance across the full cohort
Demographic and clinical factors affecting network sensitivity
The first logistic regression model (Supplementary Fig. 7A), based on presurgical factors, showed that lesions were more likely to be detected in patients who were operated (β = 0.43, P = 0.04) and those that had FLAIR data available if they were scanned on a 1.5 T MRI scanner (β = 1.10, P = 0.01). Lesions were less likely to be detected in patients scanned on 1.5 T scanners (β = −0.60, P = 0.02) and when located in the left hemisphere (β = −0.41, P = 0.02). However, these did not survive correction for the number of factors in the logistic regression model. There was no association with age, i.e. there was no significant difference in detection rates between paediatric and adult patients. Among post-surgical factors (Supplementary Fig. 7B), detection rates differed across histopathological subtypes, with 76.8% of FCD type IIB lesions detected, 64.6% of FCD type IIA, 72.7% of FCD type III and only 50.0% of FCD type I. FCD type I was significantly less likely (β = −0.53, P = 0.01) and FCD type IIB more likely (β = 0.57, P = 0.02) to be detected than other histologies. Detection rates were non-significantly positively associated with post-surgical seizure freedom (β = 0.51, P = 0.04).
Patients who are not seizure-free may have more subtle lesions, which may contribute to both incomplete resections and the classifier not being able to detect them. Alternatively, the lesions in patients who are not seizure-free may have been incorrectly masked.

MRI features of predicted lesion clusters
The MRI features within the manually defined lesion masks clustered into three distinct groups (Fig. 4A). Groups 1 and 2 were associated with high detection rates (96.0% and 82.8%, respectively), whereas group 3, which largely overlapped healthy cortex, had much lower rates of detection (56.3%). A lower percentage of operated patients in group 3 were seizure-free (59.0% compared to 78% in groups 1 and 2). Predicted lesion clusters superimposed on this UMAP embedding entirely overlapped groups 1 and 2 (Fig. 4B) and no predicted lesion clusters were similar to group 3, which was indistinguishable from healthy cortex. For those manual lesion masks in group 3 that were correctly detected, the predicted lesion clusters exhibited features closer to those in groups 1 or 2 (Fig. 4C). This indicates that while the manual lesion masks for lesions in group 3 did not capture areas of cortical surface that exhibited characteristically abnormal MRI features, the neural network learned to identify an overlapping set of vertices that did exhibit abnormal feature characteristics.

Characterizing features salient to the network in segmenting focal cortical dysplasia lesions
In all patients, mean feature values and network saliencies were calculated for each feature within the predicted cluster. This enables the creation of a patient-specific report containing the predicted lesion location, which features are abnormal within that predicted cluster and how much weight those features had in driving the classifier prediction, which we illustrate in Fig. 5 with two examples. Patient 1's predicted lesion has decreased FLAIR in the grey matter, blurring at the grey-white matter boundary on T1 and moderately increased intrinsic curvature (Fig. 5B). From these features, the computed saliency scores indicate that the neural network considers the decreased grey matter FLAIR and grey-white contrast most important for its prediction of lesional vertices. Patient 2 is an example of an FCD type IIB lesion without FLAIR features (Fig. 5). The predicted lesion has high intrinsic curvature, high cortical thickness and low grey-white matter boundary contrast. These are also the three features with positive saliency scores, i.e. feature values driving the classifier's 'lesion' prediction.

Discussion
We present an interpretable, fully automated pipeline for surface-based detection of FCDs, which has been validated on a large withheld test cohort, incorporating data from 20 sites, and two independent sites. The sensitivity to detect lesions in the test cohort was 67%, with sensitivities of 94% and 62% in the independent sites, 85% in the subcohort with T1 and FLAIR data who were seizure-free with confirmed FCD type IIB and 69% in patients with histologically confirmed FCD who had at some point been reported 'MRI-negative'. Logistic regression analyses indicated that FCD type IIB lesions had higher detection rates, whereas FCD type I lesions had lower detection rates. Multidimensional analysis of lesional cortex revealed groups of lesions characterized by different MRI features, histologies, post-surgical outcomes and detection rates. Individual patient reports provide a map of the predicted lesion locations alongside the quantified lesional features and how salient they were considered by the classifier. This study extends previous work on FCD detection in the largest MRI cohort of FCDs to date. Previous surface-based work has identified features that differentiate lesional cortex and developed machine-learning frameworks for the incorporation of these features. 14,15,17,19,38 However, being limited by small numbers of patients and data acquired from only one or two MRI scanners can lead to large error bars on estimates of sensitivity and specificity 37 and limited generalizability due to lack of diversity in training data.
Progress is also being made on automated volumetric MRI methods. [11][12][13] Both Gill et al. 12 and David et al. 11 report high sensitivities of 83% and 81%, respectively, on their independent test data. Within the MELD dataset, on a comparable 'gold-standard' subcohort of seizure-free patients with FCD type IIB who had T1 and FLAIR MRI data, the algorithm had a competitive sensitivity of 85%. In addition, this work differs from these studies in the following key aspects. First, this study has more representative, heterogeneous inclusion criteria. We aimed to develop an algorithm capable of detecting all FCD histopathological subtypes including some of the more challenging FCD type I cases. Second, our classifier predicts on average two clusters per patient in our independent test sites, compared to on average six clusters per patient reported by Gill et al. 12 Third, in comparison to David et al., 11 we make the code and trained model openly available, therefore fostering collaboration and clinical uptake of the work. In addition, our training dataset included lesions masked by different radiologists/researchers at different institutes. This heterogeneity in lesion masking reduced overfitting of the network to one individual neuroradiologist's opinion. This large multisite, multiscanner cohort, including paediatric and adult data and all FCD histopathological subtypes, provided reliable and reproducible estimates of classifier performance that generalized well to two independent cohorts.
Our data-driven clustering of FCD lesions revealed three distinct groups of lesions. Group 1 had 'classical' radiological features of FCD type II: increased cortical thickness, blurring of the grey-white matter boundary, abnormal folding and FLAIR hyperintensity in the white matter. These lesions were often located at the bottom of sulci. They were associated with high detection rates by the neural network (96%) and had good seizure freedom rates (78%). Group 2 had more subtle features: blurring of the grey-white matter boundary, FLAIR hypointensity in the grey matter and some folding changes. However, our classifier was still able to detect 82.8% of these lesions and the patients in this group who had been operated on still had good seizure freedom rates (78%). In contrast, lesions in group 3 were difficult to differentiate from healthy cortex: they did not demonstrate characteristic FCD 'fingerprints' and only 59% of these patients were seizure-free after surgery. For group 3 lesions that were detected by the classifier (56.3%), the classifier identified a subset of vertices that exhibited MRI features more consistent with groups 1 and 2 (Fig. 4C). This suggests that these lesions are more subtle, more difficult to delineate or structurally heterogeneous 39 on MRI.
One challenge in incorporating machine-learning algorithms in clinical practice is their perception as being 'black boxes', with limited feedback on what data have informed a prediction. Saliency aims to interrogate which specific input features drive neural network predictions. Our individual patient reports provide information on which features are abnormal within the predicted clusters, accompanied by their impact on classifier prediction (Fig. 5). A neuroradiologist or multidisciplinary team could use this tool to confirm their hypotheses in 'MRI-visible' lesions, to re-review the scans of 'MRI-negative' patients or to motivate more detailed investigations, such as 7 T MRI, PET or stereo EEG. 19 They will obtain putative lesion locations identified by the classifier, equipped with an understanding of what features were considered suspicious and how they were abnormal, thus opening the 'black box'. In addition, by 'flagging' suspicious areas, this artificial intelligence radiological assistant may reduce the time taken for a neuroradiologist to review MRI scans or increase confidence in the radiological diagnosis of patients with suspected FCDs.
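As a hedged illustration of how feature saliency can be computed, the sketch below applies integrated gradients to a toy logistic-regression classifier. It is not the MELD network or its implementation: the model, the weights and the three feature names are hypothetical stand-ins, and the path integral is approximated with a simple midpoint Riemann sum from an all-zero baseline.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def model(x, w, b):
    """Toy stand-in classifier: logistic regression on per-vertex features."""
    return sigmoid(x @ w + b)


def model_grad(x, w, b):
    """Analytic gradient of the toy model output with respect to the input."""
    p = model(x, w, b)
    return p * (1.0 - p) * w


def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    # Points on the straight line from the baseline to the input.
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.array([model_grad(point, w, b) for point in path])
    return (x - baseline) * grads.mean(axis=0)


# Hypothetical feature names and values, for illustration only.
features = ["cortical_thickness", "grey_white_contrast", "flair_intensity"]
w = np.array([1.5, -0.5, 0.0])   # assumed weights of the toy model
b = -0.2
x = np.array([2.0, 1.0, 3.0])    # z-scored features of one vertex (made up)
baseline = np.zeros_like(x)      # "normal" cortex taken as all-zero z-scores

attributions = integrated_gradients(x, baseline, w, b)
for name, value in zip(features, attributions):
    print(f"{name}: {value:+.3f}")
```

Under these assumptions the attributions sum, up to numerical error, to the difference between the model output at the input and at the baseline, and a feature with zero weight receives zero saliency. This is the property that lets a report state how much each abnormal feature contributed to a prediction.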

Limitations and future work
This study used multisite real-world data, which, while facilitating algorithm generalizability to new data and the utility of the developed tool, are heterogeneous. This heterogeneity arises from intersite differences in MRI scanners, sequences and field strengths, as well as from variable post-processing operating systems and software versions, and may have affected morphological and intensity feature values. These effects were partially mitigated through harmonization procedures but may still have affected algorithm sensitivity and specificity. Participating MELD sites manually masked FCD lesions and only surface-based data were shared with the project coordinators. While preserving a greater level of anonymity and facilitating data sharing, this preprocessing prevented comparison of predicted lesions with patients' volumetric MRIs. As with other FCD detection algorithms, false positives were common in both patients and controls. This neural network classifies individual cortical vertices; future work incorporating neighbourhood information and integration with volumetric approaches may help to reduce the false positives. Furthermore, volumetric approaches would extend the detection of focal epilepsy pathology beyond the neocortex, to areas such as the hippocampus. This would enable the detection of hippocampal sclerosis in FCD type IIIa. Additionally, integrating electrophysiology might help to identify which structural abnormalities are epileptogenic. One challenge in all FCD detection work is deciding which patients are considered 'MRI-negative'. The measure 'ever reported MRI-negative' will vary based on the level of neuroradiological expertise at the individual site as well as the MRI scanner and sequences acquired. However, it should provide a measure of the more challenging lesions to detect. Lastly, drug-resistant focal epilepsy is caused by multiple pathologies, of which FCDs are a significant subset. Invaluable future studies would extend the inclusion criteria to a wider spectrum of focal epilepsies.

Conclusions
We demonstrate how, through open-science practices and decentralized MRI post-processing, one can create a dataset and train and validate a machine-learning framework to assist in the diagnosis of a rare, clinically challenging pathology. The MELD FCD classifier is a fully automated, open-access surface-based tool that can be run on any patient over the age of 3 years with a suspected FCD who has a 1.5 or 3 T T1 scan, with or without FLAIR data. The classifier is available on GitHub as a user-friendly Python package and can output a patient-specific report detailing suspected structural abnormalities, which features are abnormal within these clusters and their impact on classifier prediction.