Interpretable machine learning for dementia: A systematic review

Abstract

INTRODUCTION: Machine learning research into automated dementia diagnosis is becoming increasingly popular but so far has had limited clinical impact. A key challenge is building robust and generalizable models that generate decisions that can be reliably explained. Some models are designed to be inherently "interpretable," whereas post hoc "explainability" methods can be used for other models.

METHODS: Here we sought to summarize the state-of-the-art of interpretable machine learning for dementia.

RESULTS: We identified 92 studies using PubMed, Web of Science, and Scopus. Studies demonstrate promising classification performance but vary in their validation procedures and reporting standards and rely heavily on popular data sets.

DISCUSSION: Future work should incorporate clinicians to validate explanation methods and make conclusive inferences about dementia-related disease pathology. Critically analyzing model explanations also requires an understanding of the interpretability methods themselves. Patient-specific explanations are also required to demonstrate the benefit of interpretable machine learning in clinical practice.

Despite this growing research interest, machine learning models for dementia diagnosis have yet to be widely adopted in the clinic. A major factor in this is the black-box nature of predictive models, which makes them difficult to interpret and, ultimately, to trust. 3,4 Interpretable machine learning (IML), often used synonymously with explainable artificial intelligence (XAI), can be used to explain the output of predictive models by (1) describing the mechanism by which the model generates its decision, (2) highlighting which of the input features are most influential on the decision, or (3) producing examples that maximize its confidence for a specific outcome. As shown in Figure 1, an interpretation stage can be introduced into the machine learning pipeline that can confirm a clinician's diagnosis or provide patient-specific evidence of the disease. Although arguments for explainability often focus on trust, 5-12 other goals include fairness, accessibility, interactivity, and exploration; the process of model interpretation may uncover new knowledge about the model, data, or underlying disease. 13 There is also growing pressure from a legal standpoint to provide explanations, both to the clinician and the patient. This became evident when the European General Data Protection Regulation (GDPR) was introduced in 2018, calling for more transparency and giving individuals a "right to explanation." 14 Moreover, a report from the National Health Service (NHS) Artificial Intelligence (AI) Lab and Health Education England published in May 2022 noted that "adopting AI technologies [is] at a critical juncture," with calls for "appropriate confidence" in AI for both health care workers and the public. The phrase "appropriate confidence" shifts focus away from trust (a subjective and qualitative measure) to reflect how users must be able to "make context-dependent value judgments and continuously ascertain the appropriate level of confidence in AI-derived information." 15 This distinction mirrors the difference between the use of AI for lone decision-making versus as a decision-support tool, with the latter being the focus of translational research.
The field of IML has grown rapidly over the last 20 years, 3,4,13 particularly in tasks involving natural language processing or computer vision. This rapid growth has led to inconsistencies in the terminology used to describe such methods, making it difficult to identify relevant studies. Although many reviews on IML introduce taxonomies that bring clarity to the different methods, 16 explanation methods are still incorporated inconsistently across research papers. In dementia studies specifically, this inconsistency, coupled with the variety of data available for differential diagnosis and prognosis, has led to a complex landscape of methods that makes it hard to identify best practice. There is also variability across machine learning studies in the reporting of implementation details, which can further inhibit translation to clinical practice. This systematic review aims to summarize current progress and highlight areas for improvement to allow dementia researchers to better navigate this emerging field.

BACKGROUND
The landscape of interpretable machine learning has grown rapidly, as documented in resources such as Christoph Molnar's guide. 17 Recent reviews of interpretable machine learning have introduced frameworks (taxonomies) that summarize the properties of these methods, provide a visual aid, and promote consistency across future work. 13,16,18

Properties of interpretable methods
Here we introduce some of the key properties of model interpretation methods. Understanding these properties can help researchers to critically analyze the resulting explanations and identify which methods are most appropriate for a given clinical scenario or question. These properties include whether methods are intrinsic or post hoc, model-agnostic or model-specific, and whether they produce model-, individual-, or group-level explanations. However, the categorization of these methods varies across the literature, and some methods can fall into more than one group. Therefore, these properties and the methods associated with them are best considered within the context of the predictive task.

Intrinsically interpretable models versus post hoc interpretation methods
FIGURE 1  We propose a diagnostic pipeline that starts with data acquisition through to clinical interpretation. Data can be categorized into imaging and non-imaging groups. Data items can be used individually or combined to make a prediction. A model can be trained to predict an individual's likelihood of having or developing dementia using these data. A clinician using this model may wish to interpret the result, to understand "why" this person has been classified as having dementia, which could influence the most appropriate treatment response or help to confirm their own diagnosis. The interpretation method depends on the model and data types involved. Most methods either produce heatmaps, which visualize influential regions, or use techniques to rank the most important features.

Machine learning methods such as linear regression, k-nearest neighbors, decision trees, and their extensions can be classified as intrinsically interpretable because, for a given set of inputs and outputs, the end-user can easily trace how the inputs have been used to arrive at the final probability, value, or prediction, often via a formula or rule-based framework. Decision trees, for example, can be interpreted because their final probabilistic outputs or values are derived via a rule-based framework, allowing users to trace the decision boundaries from input to output.
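As a minimal illustration of this traceability (a sketch on synthetic data; the dementia-flavored feature names are purely hypothetical and not drawn from any reviewed study), a shallow decision tree can be trained and its complete rule set printed for inspection:

```python
# Minimal sketch: an intrinsically interpretable model whose decision
# process can be read directly. Data and feature names are synthetic.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["hippocampal_volume", "MMSE_score", "age", "ventricle_volume"]  # illustrative only

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text prints the full rule hierarchy, so any individual prediction
# can be traced from input thresholds to the final class label.
print(export_text(tree, feature_names=feature_names))
```

Because the entire decision process is enumerable, no additional post hoc probing is required to explain a prediction.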
Post hoc interpretation methods involve an additional step of exploration after training, in which the trained model is probed or manipulated to generate information on how input features influence the output. Such methods include perturbation methods, backpropagation, feature relevance ranking, or example-based explanations.
Perturbation methods, sometimes referred to as sensitivity analysis, involve systematically changing the input data (e.g., removing features) and observing its effect on the output. This allows users to determine whether the model is more sensitive to specific features or regions. Backpropagation is often used for "black-box" models such as deep neural networks, where the underlying predictive process is complex due to non-linear operations and high-dimensional input data. These methods utilize the weights learned during training to propagate the output probability back into the input space, resulting in heatmaps that highlight the importance of pixels, regions, or features. By probing the model after training, post hoc approaches have the advantage of deriving explanations without compromising accuracy for instances where deep models outperform less-complex linear approaches. Figure 2 contains schematic representations of two post hoc methods: class activation mapping (CAM) and occlusion, their properties, and questions an end-user could use to determine which method is most appropriate.
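As an illustration of the perturbation approach described above, the following sketch implements a simple occlusion-sensitivity map. It is framework-agnostic: `predict_proba` stands in for any trained classifier that maps a 2D image to class probabilities, and the patch size and zero baseline are arbitrary choices rather than settings from any reviewed study.

```python
import numpy as np

def occlusion_sensitivity(predict_proba, image, target_class, patch=8, baseline=0.0):
    """Slide an occluding patch over a 2D image and record the drop in the
    target-class probability; larger drops mark regions the model relies on."""
    h, w = image.shape
    base_score = predict_proba(image)[target_class]
    heatmap = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = baseline  # "remove" this patch
            score = predict_proba(occluded)[target_class]
            heatmap[i // patch, j // patch] = base_score - score
    return heatmap
```

The resulting heatmap is directly comparable to the backpropagation-based maps discussed above, although it requires one forward pass per occluded patch rather than a single backward pass.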
Creating intrinsically interpretable models is more challenging for neural networks due to their complex architectures. However, examples include ProtoPNet, 19 a neural network for which the final classification is generated by chaining learned "prototypes" (or parts of the image) through a transparent algorithm. Transformer networks can also be considered interpretable deep learning models because the self-attention mechanism that generates their output can also be used to highlight important regions or features. 20,21 Although transformers were initially designed for natural language processing tasks, the evolution of vision transformers has led to a rise in use across the medical imaging domain. 22

Interpretation methods also differ in the level of explanation they provide. Model-level explanations describe the most important features across all classes. Methods such as LIME can be used to produce individual-level explanations, which describe the important features for a specific case. This is likely to be more useful in clinical settings, as patient-specific explanations can be used to inform future treatment or confirm a diagnosis. In many cases, a single IML method can be used to produce explanations across several levels.
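As a sketch of an individual-level explanation, the widely used `lime` package can be applied to tabular data. Here `clf`, `X_train`, `X_test`, and `feature_names` are placeholders for a fitted scikit-learn-style classifier and its data, not objects from any reviewed study:

```python
from lime.lime_tabular import LimeTabularExplainer

# Build an explainer around the training distribution.
explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["control", "dementia"],
    mode="classification",
)

# Explain a single case: which features pushed this prediction toward "dementia"?
explanation = explainer.explain_instance(X_test[0], clf.predict_proba, num_features=5)
for feature_rule, weight in explanation.as_list():
    print(feature_rule, weight)
```

Each (rule, weight) pair describes how one feature influenced this specific prediction, which is the patient-level view that clinical use cases call for.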

Study motivation
Although there are several reviews that summarize the IML literature across medical imaging and computer vision, 2,3,27 few focus on its application to dementia research. Borchert and colleagues recently reviewed neuroimaging-based machine learning for dementia prediction, with recommendations on how to increase impact in memory clinic settings. 28 Similarly, Thibeau-Sutre and colleagues performed a review on interpretable methods in neuroimaging, where they highlighted various methods and assessed their reliability. 29 However, to our knowledge this systematic review is the first to consider both imaging- and non-imaging-based machine learning methods for dementia diagnosis, where model interpretability is a specific inclusion criterion. Our review is also not limited to Alzheimer's disease but considers approaches that include a range of dementia-causing neurodegenerative diseases. This review aims to (1) summarize the different approaches to interpretable or explainable dementia prediction, (2) report and highlight the variability in study design and how this impacts clinical interpretability, and (3) offer recommendations for dementia researchers who wish to incorporate interpretable methods in future work.

MATERIALS AND METHODS
We conducted a systematic review of studies that used machine learning or deep learning for diagnostic classification of dementia and interpreted the results either using post hoc analysis or by inferring from an interpretable model. A protocol for this systematic review was registered on PROSPERO (ID: CRD42021291992). 30 PROSPERO is an international prospective register of systematic reviews that helps to avoid duplication and reduce reporting bias. A database search was used to identify reports published before March 1, 2022, across PubMed, Scopus, and Web of Science. We constructed our search query by linking four key concepts together: dementia, classification, machine learning, and interpretability. The search query run on each database is given below (adapted for each database) and all terms were searched across titles, abstracts, and keywords (if available):

("dementia" OR "alzheimer*") AND ("predict*" OR "classif*" OR "diagnosis") AND ("deep learning" OR "machine learning" OR "neural network*") AND ("explain*" OR "interpret*" OR "saliency" OR "Grad-CAM" OR "Layer?wise relevance propagation" OR "occlusion" OR "visuali*" OR "transformer")

This returned 219 records on PubMed, for which the MeSH terms "dementia" and "diagnosis, computer assisted" were also used. On Scopus the query returned 531 records and on Web of Science the query returned 308 records. A total of 530 records were removed with EndNote's automated de-duplication tool and manual assessment before screening.

Screening process
All records were screened in a two-stage process by two independent reviewers, based on: (1) title and abstract only and (2) full text.
The inclusion and exclusion criteria used to filter studies are summarized below:

Article type

• Inclusion: Any published original research paper (or pre-print) in peer-reviewed academic journals or conferences.
• Exclusion: Conference proceedings, corrections, errata, reviews, and meta-analyses.

Task
• Inclusion: Application of machine learning to do one or both of the following: (i) classify dementia patients from healthy controls or mild cognitive impairment patients, (ii) classify individuals that convert from stable/early mild cognitive impairment to progressive/late mild cognitive impairment or dementia.
• Exclusion: Unsupervised algorithms (e.g., clustering methods, generative adversarial networks) or applications of supervised machine learning to non-diagnostic tasks (e.g., segmentation, brain atrophy, brain parcellation, brain-age prediction, prediction of cognitive assessment scores, genome-wide analysis, survival analysis).

Application to dementia
• Inclusion: Studies with patient groups based on a clinical diagnosis of dementia, Alzheimer's disease, or phenotypic syndrome (e.g., frontotemporal lobar degeneration).
• Exclusion: Studies with patient groups based on other neurodegenerative diseases (e.g., Huntington's or Parkinson's disease) without an accompanying dementia diagnosis.
• Exclusion: Classification of other forms of neurodegeneration (e.g., multiple sclerosis, traumatic brain injury, stroke, or mild cognitive impairment only).

Model interpretability
• Inclusion: Studies must refer to the interpretability of the classification model in the abstract or provide example model explanations in the main text.
• Exclusion: Classical data-driven feature selection or dimensionality reduction studies (e.g., principal component analysis).

Data
• Inclusion: Studies must report experimental details including the type of prediction model used, and at least one of the following performance metrics: accuracy, area under the curve, precision, recall, sensitivity, or specificity within the text or figures.
Not human

• Exclusion: Non-human studies, for example, mouse models.
A PRISMA flowchart 31 describing the study selection process is shown in Figure 3. For title and abstract screening, any papers that were on the borderline for inclusion were assessed blindly by a second reviewer. Any studies without consensus automatically progressed onto the second screening stage. This led to 144 papers requiring a full-text screening for inclusion. We removed 48 reports upon reading the full text for failing the eligibility criteria. We contacted the authors for any full-text papers we could not retrieve online. We applied the same blind review process for borderline full-text reports to obtain a final set of included studies.

FIGURE 3  PRISMA flowchart outlining the screening strategy used to identify relevant studies. 31

Models that derive data-driven disease subtypes or progression patterns have also shown promise. 32 However, they are not included in our review, as these models typically rely on unsupervised clustering methods. As such, there is a risk of bias, as we focus only on machine learning studies that explicitly mention interpretability and include example inferences.

RESULTS
We reviewed and extracted data from all included studies using Excel to highlight trends in the study design, performance, and IML methods used. The key findings are summarized in Figure 4 and extracted data items can be found in the Supplementary Material.

Study details
The key study details across all studies can be found in Table S1. Here we summarize the trends seen in the data sets used, variability in sample size, and use of neuroimaging data. We observed a shift toward 3D whole image-based studies over time, likely due to hardware advancements and the performance benefits of deep learning over machine learning methods based on tabular data.

Implementation details
Technical details regarding the choice of predictive model, reported performance, and interpretation across all included studies can be found in Table S2. Here we comment on the observed model accuracies and on how studies validated their explanations, for example, by contrasting regions associated with Alzheimer's disease with those associated with frontotemporal dementia. 44 Liu and colleagues conducted causal analysis using genetic information alongside imaging-driven important regions. 55 Three studies made use of simulated data sets, where they had control over the group-separating features to perform preliminary tests of the explanation method. 56-58 Studies that used multiple interpretability methods (n = 9) were also able to comment on whether these highlighted the same regions. Four studies validated their findings by reporting a second classification accuracy using the identified regions of interest as input features 57,59,60 or incorporating them as anatomic landmarks. 61 Despite these efforts, most studies were unable to assess the utility of other regions that were highlighted by the model but were without known pathological relevance. Although Qiu and colleagues correlated their findings for 11 subjects with post-mortem neuropathology, they lacked the statistical power to draw any significant insights. 38 This challenge also prevailed for non-imaging studies, although models based on demographic information utilized known risk factors, 62 and speech-based models were able to contextualize their findings with phrases and indicators associated with Alzheimer's disease. 63 Moreover, many of the diagnostic labels in publicly available data sets are based on clinicians' ratings, which have been shown to be subjective and can be confirmed only through post-mortem analysis. Therefore, some studies may include dementia patients with mixed pathology, including vascular dementia, which should be considered when assessing the potential diagnostic specificity of model predictions.

DISCUSSION
Our results highlight the growth in this cross-disciplinary research area, particularly through the combination of neuroimaging and neural networks, which can match and outperform clinical predictions across a range of dementia-related tasks. The range of accuracies indicates that interpretable models do not necessarily require a loss in performance, previously seen as a limitation of IML, as studies have still been able to demonstrate ways to probe the "black box," identify important features, or provide rule-based explanations. 64,65 To maximize the impact of machine learning in clinical practice, we provide recommendations to aid clinicians when interpreting results, encourage more homogeneous reporting standards, and highlight several challenges that remain.

Recommendations for interpreting interpretability studies
Here we provide recommendations for comprehending studies on IML to help researchers interpret the results accurately:

Scrutinize the interpretability method details: Currently, all interpretability methods have limitations and drawbacks. Techniques such as occlusion are strongly linked to sample size: the more samples seen during training, the more robust the model will be to changes in non-disease-relevant patches. Sample size is also important for heterogeneous disease pathologies. Data augmentation methods help to build models that generalize well to new cases; however, heterogeneity can make it difficult to distinguish between patient-specific, disease-relevant pathology and spurious artifacts of the interpretation technique. There is also a strong dependence on model performance. This should be considered when being presented with interpretability findings, as explanations from models with poor predictive power may be inaccurate, and group-level findings are likely to be affected by falsely classified samples.

Identify whether the method is model-, group-, or individual-level:
The results can differ greatly depending on whether the output is group-level or individual-level, and the pathways to clinical impact will vary as a result. Occlusion techniques are often not suitable for making individual-level explanations for a given prediction. The results obtained by occluding patches across a single example case are still a representation of the overall model susceptibility to a given patch. In contrast, methods such as LRP and CAM allow for individual heatmaps that reflect the regional relevance associated with a single case.
Relevance and importance do not guarantee biological significance: Although interpretable methods present exciting opportunities to improve our understanding of model predictions, the results are not necessarily related to biological or pathological features. Many of these methods are model-agnostic or have been developed primarily outside the medical imaging context. Therefore, they lack the considerations of causation needed to correlate their outputs with biological relevance. The values and scores derived from methods such as LRP are better interpreted as "where the model sees evidence" 50 or, in the case of class activation maps, "which features the model has learned as relevant to this class." However, they are not sufficient for identifying potential interactions between voxels or features, or high-level concepts such as atrophy. Similarly, some identified features may result from noise, artifacts, or group differences in the underlying data, which can be misleading.

Recommendations for study design and report writing
When carrying out studies that incorporate interpretable methods, we highlight the following recommendations for designing experiments and reporting findings:

Design the entire study with the end-user in mind: The choice of interpretability method depends on the needs of the end-user. Therefore, it can be beneficial to conceptualize the type of questions to be asked, whether that may be "which features are most important to the model?" or "for this individual, how have the input features been used to arrive at the final prediction?" Addressing the interpretability of the study early on will allow researchers to better design their study, such as determining whether ground-truth annotations may be desired to validate their interpretability methods or whether simulated preliminary results could benefit them as previously seen. 56,58,66 Research aiming to perform classification between disease groups may be better suited to group-level explanation methods.

Consistently adhere to reporting standards: Adherence to reporting standards will play a crucial role in the development of this field, as researchers will be able to quantitatively compare performance across studies (e.g., meta-analyses) and better contextualize results. Although several checklists and guidelines such as CLAIM (Checklist for Artificial Intelligence in Medical Imaging) 67 and STARD (Standards for Reporting of Diagnostic Accuracy Studies) 68 exist for AI applications in health care, here we emphasize areas in which we observed large variability across the included studies. For example, when reporting model performance results, we suggest that researchers provide confusion matrices, as they give concise access to several measures of performance such as balanced accuracy, sensitivity, and specificity (illustrated in the sketch below). Single measures of accuracy may not be sufficient, particularly in dementia studies where unbalanced data sets are common, and sensitivity to true positive cases may be more desirable than robustness to false positives. We also re-emphasize the importance of clearly specifying the sample size across prediction tasks and data sets and providing confidence intervals where available. This level of detail varied among the studies included in our review but is important, particularly when reporting results from multiple prediction tasks. Data sets also differ in their labeling procedures, so studies must be careful when training models across cohorts and clearly highlight any discrepancies. Many dementia-causing diseases can only truly be diagnosed post-mortem, and definitions of categories such as mild cognitive impairment are still debated. 69 Furthermore, in imaging studies where multiple scans are available per participant (i.e., from several time points), researchers should ensure that their methods are robust to data leakage by splitting their data sets at the subject level and clearly stating if multiple scans have been used during training or testing. Models should be tested on hold-out test sets (and external data sets where possible) rather than relying on cross-validation, for a more reliable estimate of performance on new data.
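As a minimal sketch of the two practices above (all arrays are synthetic and purely illustrative), the snippet below derives balanced accuracy, sensitivity, and specificity from a confusion matrix and performs a subject-level split so that no participant's scans appear in both training and test sets:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GroupShuffleSplit

# Deriving the recommended metrics from a binary confusion matrix.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])   # synthetic labels
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])   # synthetic predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (sensitivity + specificity) / 2

# Subject-level splitting when several scans exist per participant:
# grouping by subject ID keeps all of a participant's scans on one
# side of the split, preventing leakage between training and testing.
subject_ids = np.array([1, 1, 2, 3, 3, 4, 5, 5])  # synthetic IDs
X = np.random.rand(8, 10)                          # synthetic features
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y_true, groups=subject_ids))
```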

Remaining challenges
A key challenge that remains is that IML methods have yet to be thoroughly tested to ensure that they are robust and reliable. Some research efforts in computer vision have attempted to address this.
For example, Adebayo and colleagues defined and tested several post hoc explanation methods against pre-defined sanity checks to see if explanations were robust to small perturbations in the data and different architectures. 70,71 Several methods failed these tests and were deemed to be unreliable. Moreover, Tian and colleagues evaluated the test-retest reliability of feature importance for models trained to predict cognition, and they elucidated a trade-off between feature weight reliability and model performance. 72 Our review identified one study that assessed the robustness of two explanation methods by defining a continuity and a selectivity metric. In that study, the authors tested whether the heatmaps produced via perturbation and occlusion techniques are consistent across similar images (continuity) and whether relevant occluded regions correlated with the change in class probability (selectivity). 73 They also quantitatively compared the heatmaps and their robustness characteristics across different model architectures.
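A selectivity-style check of the kind described in that study can be sketched as follows; this is an assumed re-implementation for illustration, not the cited authors' code, and it assumes the heatmap has the same shape as the image:

```python
import numpy as np

def selectivity_curve(predict_proba, image, heatmap, target_class, patch=8, steps=10):
    """Occlude patches in decreasing order of heatmap relevance and record the
    target-class probability after each removal. For a faithful explanation,
    the probability should fall fastest when the most relevant patches go first."""
    h, w = image.shape
    # Rank patch coordinates by mean heatmap relevance, highest first.
    coords = [(i, j)
              for i in range(0, h - patch + 1, patch)
              for j in range(0, w - patch + 1, patch)]
    coords.sort(key=lambda c: -heatmap[c[0]:c[0] + patch, c[1]:c[1] + patch].mean())
    occluded = image.copy()
    scores = [predict_proba(occluded)[target_class]]
    for i, j in coords[:steps]:
        occluded[i:i + patch, j:j + patch] = 0.0
        scores.append(predict_proba(occluded)[target_class])
    return np.array(scores)
```

A steeply decreasing curve indicates that the heatmap's relevance ranking aligns with what the model actually uses.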
A similar test was carried out by Thibeau-Sutre and colleagues, who compared heatmaps produced across multiple cross-validation folds as well as different hyperparameter values. 74 However, none of these measures consider characteristics that are application specific, such as robustness to scanner artifacts or non-disease-related variability in brain structure that may arise in more diverse clinical data sets. Moreover, a particular explanation method may be insufficient according to the tests defined in computer vision-based studies but may still be sufficient for decision support. Context-specific quality criteria are needed to ensure that the outputs are clinically useful, while affording some flexibility against strict tests as the field of IML continues to develop.
There was also limited involvement of neuroradiologists and clinicians throughout these studies. Such involvement is essential for designing informed experiments that address the relevant questions and for ensuring that work in this field has an impact on translation.

Future directions
Interpretable machine learning has the potential to enhance the dementia prediction pipeline and open avenues for new insights into disease mechanisms. Group-or patient-level explanations could be useful for identifying features that are relevant to specific phenotypes or stages and aiding the development of preventative therapies.
Identifying which regions the model focuses on could also be used to influence other stages in the imaging protocol. For instance, acquisition sequences could be optimized for imaging specific regions of interest, even in real time. 78 More generally, being able to differentiate between biologically relevant features specific to groups with similar clinical profiles helps to demonstrate the benefit of computer-assistive technologies. Individualized, patient-specific explanations would be a major step toward personalized medicine, with clinicians able to identify the key drivers of a patient's diagnosis. Looking ahead, interpretable models could help to advance scientific discovery by identifying novel biomarkers such as disease-specific genes. 79 Although machine learning is not currently used in clinical trial recruitment, model explanations also provide opportunities to enhance patient stratification or explore treatment response through predictors associated with specific brain regions.

CONCLUSION
Interpretability is key for the clinical application of machine learning in decision-making tools for dementia prediction. The need for model explanations has been identified in both the legal sector and health services as the use of machine learning-based solutions continues to rise.
In this systematic review, three databases were searched to identify 92 studies that have applied interpretable methods to machine learning models designed for the prediction of dementia. We found a large bias toward open-source data sets such as ADNI, which may have limited the generalizability of findings. A key emerging theme was the challenge of validating interpretation methods. Although this challenge also exists outside of dementia research, we highlight that domain-specific quality criteria may also require critical assessment of clinical utility. Dementia prediction tasks are made ever more difficult by the high dimensionality of the data and interactions between factors such as age, sex, genetic history, and lifestyle. Building models that make use of this multi-modal landscape of information but can still disentangle their influences on the output would help bring the power of machine learning models one step closer to large-scale clinical adoption.