DIVE: A spatiotemporal progression model of brain pathology in neurodegenerative disorders

&NA; Current models of progression in neurodegenerative diseases use neuroimaging measures that are averaged across pre‐defined regions of interest (ROIs). Such models are unable to recover fine details of atrophy patterns; they tend to impose an assumption of strong spatial correlation within each ROI and no correlation among ROIs. Such assumptions may be violated by the influence of underlying brain network connectivity on pathology propagation – a strong hypothesis e.g. in Alzheimer's Disease. Here we present DIVE: Data‐driven Inference of Vertexwise Evolution. DIVE is an image‐based disease progression model with single‐vertex resolution, designed to reconstruct long‐term patterns of brain pathology from short‐term longitudinal data sets. DIVE clusters vertex‐wise (i.e. point‐wise) biomarker measurements on the cortical surface that have similar temporal dynamics across a patient population, and concurrently estimates an average trajectory of vertex measurements in each cluster. DIVE uniquely outputs a parcellation of the cortex into areas with common progression patterns, leading to a new signature for individual diseases. DIVE further estimates the disease stage and progression speed for every visit of every subject, potentially enhancing stratification for clinical trials or management. On simulated data, DIVE can recover ground truth clusters and their underlying trajectory, provided the average trajectories are sufficiently different between clusters. We demonstrate DIVE on data from two cohorts: the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Dementia Research Centre (DRC), UK. The DRC cohort contains patients with Posterior Cortical Atrophy (PCA) as well as typical Alzheimer's disease (tAD). DIVE finds similar spatial patterns of atrophy for tAD subjects in the two independent datasets (ADNI and DRC), and further reveals distinct patterns of pathology in different diseases (tAD vs PCA) and for distinct types of biomarker data – cortical thickness from Magnetic Resonance Imaging (MRI) vs amyloid load from Positron Emission Tomography (PET). We demonstrate that DIVE stages have potential clinical relevance, despite being based only on imaging data, by showing that the stages correlate with cognitive test scores. Finally, DIVE can be used to estimate a fine‐grained spatial distribution of pathology in the brain using any kind of voxelwise or vertexwise measures including Jacobian compression maps, fractional anisotropy (FA) maps from diffusion tensor imaging (DTI) or other PET measures.


Introduction
Many biomarkers exist that can be used to track the severity of neurodegenerative diseases such as Alzheimer's disease (AD). Clinical function can be measured using cognitive assessments performed by an expert clinician and brain atrophy can be measured using Magnetic Resonance Imaging (MRI). Other measures include molecular markers such as aggregation of misfolded amyloid-beta or tau measured using Positron Emission Tomography (PET), and measures of white-matter degradation such as fractional anisotropy (FA) from Diffusion Tensor Imaging (DTI). The evolution of these biomarkers across the disease time-course creates a unique signature of the disease that can be used to stage patients, which is helpful for stratification in clinical trials.
A hypothetical model of disease progression has been proposed by (Jack Jr et al., 2010), describing the order of abnormality of key biomarkers along the progression of AD. The model suggests that amyloid-beta and tau biomarkers become abnormal long before symptoms appear, followed by brain atrophy measures and cognitive decline. Motivated by this hypothetical model, several data-driven disease progression models have been proposed in recent years, which aggregate information from multiple biomarkers into a single time frame representing disease progression. One such model is the Event-Based Model (EBM) (Fonteijn et al., 2012;Young et al., 2014), which models the progression of disease as a sequence of discrete events, representing underlying biomarkers switching from a normal to abnormal state. Other types of disease progression models (Donohue et al., 2014;Jedynak et al., 2012;Li et al., 2017;Lorenzi et al., 2017;Schiratti et al., 2015) have been developed, that build continuous trajectories by "stitching" together short-term follow-up data from individual subjects. In contrast to the discrete disease stages that are estimated by the EBM, these models also compute a continuous measure of disease stage for every individual by estimating individual time shifts and progression speeds.
Current image-based disease progression models estimate the evolution of the disease using a small set of biomarkers corresponding to pre-defined regions-of-interest (ROI). This ROI parcellation is usually coarse, doesn't allow one to find spatially dispersed patterns of atrophy. While spatiotemporal longitudinal models have already been demonstrated (Derado et al., 2010;Hyun et al., 2016;Lorenzi et al., 2015), these models regress against pre-defined sets of covariates such as age or clinical markers. This is problematic because, in the case of age, this assumes all subjects have the same age of disease onset. Similarly, clinical markers are noisy, biased, suffer from floor/ceiling effects and are not sensitive in pre-symptomatic phases. Recently, some spatiotemporal models that estimate subject-specific time-shifts have been developed (Bilgel et al., 2016;Koval et al., 2017). However, these models generally cannot recover dispersed and disconnected pathological patterns, because they assume correlation of voxels based on spatial distance, either through a distance function or distance from control points. However, spatially dispersed pathological patterns have been observed in AD and related dementias and are hypothesized to appear due to interactions of pathology and brain networks (Seeley et al., 2009). Therefore, a spatiotemporal disease progression model that allows recovery of dispersed atrophy patterns present in AD, is not currently available.
In this work, we present DIVE: Data-driven Inference of Vertexwise Evolution. DIVE is a novel disease progression model with single vertex resolution that makes only weak assumptions on spatial correlation. In contrast to approaches which model temporal trajectories for a small set of biomarker measures based on a priori defined ROIs, DIVE models temporal trajectories for each vertex on the cortical surface. DIVE combines unsupervised learning and disease progression modelling to identify clusters of vertices on the cortical surface that show a similar trajectory of brain pathology over a particular patient cohort. This formulation enables us to estimate a fine-grained spatial distribution of pathology and also provides a novel parcellation of the brain based on temporal change.
We first test DIVE on synthetic data and show that the model can recover known biomarker trajectories and disease progression scores. We then demonstrate the model on data from two cohorts: the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Dementia Research Centre (DRC), UK. We use the model to reveal spatiotemporal patterns of pathology to a much finer resolution than previous models and demonstrate the ability to assign subjects to stages that predict progression. Finally, we validate DIVE in terms of how robust are the estimated pathology patterns and how well the disease progression scores correlate with cognitive tests.
In this section we describe the mathematical formulation of DIVE (section 2.1), then we show how to fit the model using Expectation Maximisation (section 2.2) and we describe further implementation details of the algorithm (section 2.3). Afterwards, we outline the synthetic data-generation process (section 2.4) for testing the model in the presence of ground truth, as well as the pipeline for pre-processing the ADNI and DRC datasets (section 2.5). Figure 1. Diagram of the proposed DIVE model. DIVE assumes that biomarkers of pathology (e.g. cortical thinning) can be measured at many vertices (i.e. locations) on the cortical surface (A), where each vertex has a distinct trajectory of change during disease progression (B). The model thus assigns to every cortical vertex one of a small set of temporal trajectories describing the change in some image-based measurement (e.g. cortical thickness, amyloid PET, DTI fractional anisotropy measures) from beginning to end of the disease progression. The estimation process simultaneously estimates the set of clusters, the trajectory defining each cluster, and the position of each subject along the trajectories, which are defined on a common time line. The process iterates assignment of each vertex to clusters (red, green and blue in this diagram) (C), estimation of the trajectory in each cluster (D) and estimation of the disease progression score (location along trajectory) for each subject (E), all within an expectation-maximisation framework, until convergence. Figure 1 illustrates the DIVE aims and implementation. DIVE input measures are vertexwise or voxelwise biomarker measures in the brain (Fig 1A), such as cortical thickness or amyloid load. A vertex is a location on the cortical surface at which a biomarker of pathology is quantifiable (e.g. cortical thickness). For each vertex on the cortical surface (or voxel in the 3D brain volume), we estimate a unique trajectory along the disease progression timeline (Fig 1B), while also estimating subject/visitspecific disease progression scores (i.e. disease stages). We do that by grouping vertices with similar biomarker trajectories into clusters (Fig 1C), and we estimate a representative trajectory for every cluster (Fig 1D). Each trajectory is a function of subject-/visit-specific disease progression scores (DPS) (Fig 1E). The DPS depends linearly on the time since baseline visit, but with subject-specific slope and intercept.

Modelling Subject-specific Time Shifts
The disease progression score s ij for subject i at visit j is defined as a linear transformation of time since baseline measurement t ij (in years): where α i and β i represent the speed of progression and time shift (i.e. disease onset) of subject i.

Modelling Biomarker Trajectory for a Single Vertex
DIVE assumes that the biomarker measure at each vertex on the cortical surface follows a sigmoidal trajectory f( . ; θ) over the disease progression score s and with parameters θ. We choose a parametric sigmoid function because it is a parsimonious parametric model that offers better fit compared to linear models, is monotonic, and can account for floor and ceiling effects (Caroli and Frisoni, 2010;Sabuncu et al., 2011). We also assume that vertices are grouped into K clusters and we model a unique trajectory for each cluster k ∈ {1, ... , K}, which will be referred to as cluster trajectories. The sigmoidal function f(s; θ k ) for cluster k is defined as: where s is the disease progression score from Eq. 1 and θ k = [a k , b k , c k , d k ] are parameters controlling the shape of the trajectoryd k and d k +a k represent the lower and upper limits of the sigmoidal function, c k represents the inflection point and a k b k /4 represents the slope at the inflection point. For a given subject i at visit j, the value V l ij of its biomarker measurement at vertex l is a random variable that has an associated discrete latent variable Z l ∈ [1, ... , K] denoting the cluster it was generated from. The value of V l ij given that it was generated from cluster Z l can be modelled as: where N(V l ij | f(α i t ij + β i | θ Zl ), σ Zl ) represents the probability density function (pdf) of the normal distribution that models the measurement noise along the sigmoidal trajectory of cluster Z l , having variance σ Zl . Next, we assume the measurements from different subjects are independent, while the measurements from the same subject i at different visits j are linked using the disease progression score from Eq. 1. Moreover, we also assume a uniform prior on cluster membership Z l . This gives the following model: where I = {(i, j)} represents the set of all the subjects i and their corresponding visits is the 1D array of all the values for vertex l across every visit of every subject. Vectors α = [α 1 , … , α S ] and β = [β 1 , … , β S ], where S is the number of subjects, denote the stacked parameters for the subject shifts. If a subject i has multiple visits, these visits share the same parameters α i and β i . Vectors θ = [θ 1 , … , θ K ] and σ = [σ 1 , … , σ K ], with K being the number of clusters, represent the stacked parameters for the sigmoidal trajectories and measurement noise specific to each cluster.

Modelling Biomarker Trajectories for all Vertices
So far we have a model for only one vertex on the brain surface. We therefore extend the formulation to all the vertices by assuming all these vertex measurements are spatially independent, giving the complete data likelihood: ... , Z l ], L being the total number of vertices on the cortical surface. The formulation assumes spatial independence between measurements in different vertices, but in section 2.1.4 the model is extended to capture spatial correlations. We get the final model log likelihood for incomplete data by marginalising over the latent variables Z: Throughout the article, we will use the shorthand z lk = p(Z l = k).

Modelling Spatial Correlation
The version of the model presented so far does not make any assumptions on spatial correlation. However, the regional organisation of the cortex suggests we would expect spatial correlation of the vertex measurements. More precisely, measures of cortical thickness or other modalities are often similar in neighbouring vertices on the cortical surface and likely belong to the same cluster. DIVE can be easily extended to include mild spatial constraints on the correlation of vertex measurements via a Markov Random Field (MRF), which encourages neighbouring vertices to have the same corresponding cluster. We hypothesise that incorporating such constraints should reduce the effects of noise and produce a more stable clustering. However, this does not model correlation between the actual vertex values, but only between the latent variables Z l , i.e. the cluster membership of each vertex. The MRF thus has the advantage of not requiring the use of huge covariance matrices, which are otherwise needed if we want to model correlation of vertex values directly. With the MRF, the full-data likelihood function of the model now becomes: where Ψ(Z l , Z l2 ) is a clique term representing the likelihood of a neighbouring vertex l 2 to have similar label with vertex l. The formula for the clique term is: where λ is a parameter controlling how much to penalise neighbouring vertices that belong to distinct clusters, and g and h are positive, monotonic functions over the λ>0 range. We choose g(λ)=λ and h(λ)=λ 2 , which results in a concave objective function for λ, ensuring that it can later be optimised (see M-step). Therefore, the model parameters that need to be estimated are M = [α, β, θ, σ, λ] where α and β are the subject specific shifting parameters, θ and σ are the cluster specific trajectory and noise parameters and λ is the clique parameter denoting the penalisation of spatially non-smooth assignments of latent variables Z.

Fitting the Model using Generalised Expectation-Maximisation
We choose to fit our model using Expectation-Maximisation (EM), because it offers a fast convergence given the large number of parameters that need to be estimated and the huge dimensionality of relevant datasets (e.g. 1973 subjects x 163,842 vertices in ADNI). In the next two sections we outline the E-step and M-step. While both of these steps have no closed-form solution, we will solve them using numerical optimisation, which only results in an increase in the objective function at each iteration. However, the EM algorithm is still guaranteed to converge, and this approach is called Generalised EM (Bishop, 2006).
Algorithm 1 shows the model fitting procedure using the EM algorithm. The procedure first initialises (line 1) some parameters required to start the EM algorithm: the subject time shifts α and β and the latent parameters z lk which represent the assignment of vertices to clusters. In the M-step, the method updates the trajectories of each cluster (lines 4-6), the subjects-specific time shifts (line 9) and the clique penalty term λ (line 17). In the E-step, the method computes z lk (line 18) using previously defined functions that compute z lk given a fixed λ (line 14).
Algorithm 1. The optimisation procedure for fitting DIVE using the Expectation-Maximisation algorithm.

E-step
In the Expectation step, at iteration u we seek an estimate of p(Z | V, M (u-1) ), given the current estimates of the parameters M (u-1) = [ θ k (u-1) , σ k (u-1) , α i (u-1) , β i (u-1) , λ i (u-1) ]. While, this does not directly factorise over the vertices l due to the MRF terms, we condition the clique terms on the values of Z from the previous iterations. This approximation gives the following factorizable likelihood: The factorised form allows for tractable computation and memory storage of p(Z). Let z lk (u) = p(Z l = k | V l , M (u-1) ,Z (u-1) ). After simplifications we reach the following update rule: The full derivation is given in the Supplementary material. In order to enable optimisation over λ, a final modification of this step is performed, by considering z lk to be functions ζ lk (λ) over λ. This results in the update equation from Alg. 1, line 18 which is based on pre-defined terms on lines 13-14.

M-step
In the Maximisation step we try to estimate the model parameters M = (α, β, θ, σ, λ) that maximise E Z|V,M (u-1) [log p(V,Z|M)]. We cannot simultaneously optimise all 5 sets of parameters, so we optimise them independently. In order to get the update rule for the trajectory parameters θ k corresponding to cluster k we need to maximise the expected log likelihood with respect to θ k . The key observation here is that if we assume fixed α, β and Z, then the trajectory parameters θ k for every cluster k are conditionally independent, i.e. θ k ⊥ θ m | (Z, α, β, σ) ∀ (k, m), k ≠ m. This allows us to maximise every θ k independently using the following equation: A similar observation of conditional independence can also be observed for the latent variables Z. This allows us to decompose the joint distribution over Z, and after expanding the noise model we reach the optimisation problem from Alg. A similar equation, yet in closed form, is also obtained for σ k (Alg. 1, line 6). After estimating θ and σ for every cluster, we use the new values to estimate the subject specific parameters α and β. For every subject i, we maximise the expected log likelihood with respect to α i , β i independently, and after simplifications we obtain the update rule from Alg. 1, line 9, which is again solved using numerical optimisation. For the numerical optimisation of θ we used the Nelder-Mead method for its robustness, while for α and β we used the second-order Broyden-Fletcher-Goldfarb-Shanno algorithm due to fast convergence. Finally, we achieved a significant speedup in the evaluation of objective functions by computing a z lk -weighted average of vertex measurements within each cluster (see Supplementary section 4). This resulted in a convergence time of around 6 hours for the larger datasets (ADNI).
For optimising λ, we again try to optimise in the M-step the expected full data likelihood under the Z estimates from the previous iteration: We simplify the above equation by expanding the likelihood model and approximating the joint over Z with the product of the marginals z lk over all vertices l. This results in the update equation from Alg. 1 line 17see supplementary material for full derivation. In this final equation we also replaced z lk with a function ζ lk (λ) over λ, which updates z lk based on the current value of λ being evaluated. This is done to increase convergence, as latent variables z lk are highly coupled with the value of λ being evaluated.

Parameter initialisation and priors
Before starting the fitting process, we need to initialise α, β and the clustering probabilities z lk (Alg. 1, line 1). We set α i and β i to be 1 and 0 respectively for each subject, which sets the initial disease progression score to the age of the subject at the clinical visit. We initialise z lk using k-means clustering of the vectors V l . The DPS scores have two extra degrees of freedom (scale and shift), which we account for by setting informative gamma and Gaussian priors on parameters α i and β i respectively , which work well in practice as they result in realistic ranges for α i and β i of around [0.3, 3] and [-15,15] respectively. Such informative priors on α i and β i also help deal with singularities in the objective functions of α i and β i when the biomarker trajectories are flat. As already explained in (Jedynak et al., 2012), the sigmoid parameters θ k are not identifiable, so we need to apply the following transformation on line 5 of Alg u) . This ensures model identifiability and is performed at every iteration.

Estimating the Optimal Number of Clusters
The EM procedure needs to specify a-priori the number of clusters to fit on the data. We optimise the number of clusters K using Akaike Information Criterion (AIC), which we found to better agree with ground truth in simulations than other information criteria such as the Bayesian Information Criteria (BIC). The number of parameters of the fitted model is 5K+2S+1, where S is the number of subjects. Note that z lk are not included as parameters of the model because they are latent variables that are marginalised (see Eq. 6). We repeat the fitting procedure for each K from 2 to 100 clusters and select the K that minimises the AIC.

Motivation
Initial assessment of DIVE performance uses synthetic data, where we know the ground truth. The aim is to explore how accurately we can recover ground truth parameters as the problem becomes harder in three different scenarios: Scenario 1: as the number of clusters increases, evaluate how well DIVE can estimate the correct number of clusters using AIC and BIC Scenario 2: as the trajectories become more similar, test how well we can recover the assignment of vertices to clusters and the DIVE parameters Scenario 3: same as Scenario 2, but for decreasing number of subjects

Synthetic Data Generation
We first designed a basic simulation, which the model should be able to fit well since the trajectories were designed to be well separated and enough subject data was generated along the disease time course. The data in the basic simulation was generated as follows: 1. sample age a i1 and time shift parameters α i , β i for 300 subjects with 4 timepoints (each timepoint 1 year apart), with a i1 ~ U(40,80), α i ~ Γ(6.25, 6.25), β i ~ N(0, 10). Time since baseline has been obtained for every visit j of subject i as follows: t ij = a ij a i1 2. generate three sigmoids with different ( 4. sample a set of L perturbed trajectories θ l from each of the original trajectories, one for each vertex ( Fig. 2A, gray lines) using covariance matrix C θ = diag ([0, 2b k /15, 11.6, 0]). 5. sample subject data for every vertex l from its corresponding perturbed trajectory θ l with noise standard deviation σ l = 1.
From the basic simulation, we generated synthetic data for each of the three scenarios by varying one parameter at a time and kept the other parameters constant, having the same values as in the basic simulation. We varied the following parameters: Scenario 1: number of clusters -2, 3, 5, 10, 15, 20, 30 and 40. The cluster centres were spread evenly across a fixed total DPS range where data was available. Scenario 2: distance between trajectory centres (as proportion of total DPS range sampled) -0.33, 0.30, 0.23, 0.17, 0.10, 0.07, 0.03 and 0.02 Scenario 3: number of subjects -300, 200, 100, 50, 35, 20, 10 and 5

Model Fitting and Evaluation
Since there was no spatial information in the data generation procedure, we used DIVE without the MRF extension. For Scenario 1, we estimated using AIC and BIC the optimal number of clusters. For Scenarios 2 and 3, after fitting the parameters of DIVE, we calculated the agreement between the final clustering probabilities p(Z l ) and the true clustering assignments a [l]. This agreement, which we will call the clustering agreement, is defined as , where τ is any permutation of cluster labels. We also computed the error in the DPS estimation (sum of squared differences, SSD) and trajectory estimation (SSD between predicted trajectory and true trajectory at DPS points of every subject visit).

Data Acquisition and Pre-processing
Data used in this work were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) and from the Dementia Research Centre, UK. For ADNI, we downloaded all T1 MR images that have undergone gradient warping, intensity correction, and scaling for gradient drift. We included subjects that had at least 3 scans, to ensure we get a robust estimate of the subject specific parameters. This resulted in 138 healthy controls, 235 subjects with mild cognitive impairment (MCI) and 81 subjects with Alzheimer's disease.
We also downloaded all AV45 PET images from ADNI that were fully preprocessed, having the tag 'Co-reg, Avg, Std Img and Vox Siz, Uniform Resolution'. This meant that the images were co-registered, averaged across the 6 five-minute frames, standardised with respect to the orientation and voxel size and smoothed to produce a uniform resolution of 8mm full-width/half-max (FWHM).
The DRC dataset consisted of T1 MRI scans from 31 healthy controls, 32 PCA and 23 typical AD subjects with at least 3 scans each and an average of 5.26 scans per subject. All PCA patients fulfilled the (Tang-Wai et al., 2004) criteria and (Mendez et al., 2002) criteria based on clinical review. The typical AD patients all met the criteria for probable Alzheimer's disease (Dubois et al., 2010;Dubois et al., 2007).
Given that the ADNI and DRC datasets contained subjects with different modalities or diseases, we ran DIVE independently on the following four cohorts (see Table 1 for demographics): 1) ADNI MRI: controls, MCI and tAD subjects from ADNI (cortical thickness data) 2) DRC tAD: tAD subjects and controls from the DRC dataset (cortical thickness data) 3) DRC PCA: PCA subjects and controls from the DRC dataset (cortical thickness data) 4) ADNI PET: AV45 scans from ADNI containing subjects with following diagnoses: healthy controls, subjective memory complaints, early MCI, late MCI and Alzheimer's disease. On both datasets, in order to extract reliable cortical thickness measures, we ran the Freesurfer longitudinal pipeline (Reuter et al., 2012), which first registers the MR scans to an unbiased within-subject template space using inverse-consistent registration. The longitudinally registered images were then registered to the average Freesurfer template. No further smoothing was performed on these images (FWHM level of zero mm). From these template-registered volumetric images, cortical thickness measurements were computed at each vertex (i.e. point) on an average 2D cortical surface manifold. For each vertex we averaged the thickness levels from both hemispheres. Finally, we standardised the data from each vertex with respect to the values of that vertex in the control population. Each of the final images had a resolution of 163,842 vertices on the cortical surface.

PET AV45 Images
We computed amyloid SUVR levels using the PetSurfer pipeline (Greve et al., 2016;Greve et al., 2014), which is available with Freesurfer version 6. The PetSurfer pipeline first registers the PET image with the corresponding MRI scan, then applies Partial Volume Correction, and finally resamples the voxelwise SUVR values onto the cortical surface. While the final images also had a resolution of 163,842 vertices, the PET data we obtained from ADNI was inherently more smooth than the MRI cortical thickness data (8mm FWHM).

The MRF Neighbourhood Graph
We estimated the MRF neighbourhood graph based on a Freesurfer triangular mesh for the fsaverage template. Each vertex was a triangle on the brain surface estimated with Freesurfer, and we connected the vertices if the corresponding triangles had a shared edge. For the MRF neighbourhood graph, we used a 3 rd degree neighbourhood structure, meaning that two vertices were considered neighbours if the shortest path between them was not higher than 3.

Results
We first present results on synthetic data (section 3.1), then on ADNI and DRC datasets (section 3.2), followed by model evaluation (section 3.3) using crossvalidation and correlation with cognitive markers.

Results on synthetic data
In the basic simulation, we obtained a clustering agreement ‫א‬ of 0.97, which suggests that almost all vertices were assigned to the correct cluster. Figure 2A shows the original trajectories and the recovered trajectories using our model, plotted against the disease progression score on the x-axis and the vertex value on the y-axis. In  On the x-axis we show the variable that was changing within the scenario (e.g. number of clusters), while on the y-axis we show the agreement measure ‫,א‬ representing the percentage of vertices that were assigned to the correct cluster.
The results show that, in a simple experiment where the trajectories are well separated, DIVE can very accurately estimate which clusters generated each vertex. Moreover, the recovered trajectories and DPS scores are close to the true values. The results of Scenario 1 also suggest that both AIC and BIC are effective at estimating the correct number of known clusters, with AIC having slightly better performance than BIC for larger numbers of clusters. On the other hand, the results of the stress test scenarios 2 and 3 show that performance measure ‫א‬ drops when the trajectories become very similar with each other or when the number of subjects decreases. This happens because small differences in trajectories are hard to detect in the presence of measurement noise, while a small number of subjects doesn't provide enough data to accurately estimate the parameters. Similar decreases in performance for scenarios 2 and 3 are observed also for other measures, such as the error in recovered trajectories or DPS scores ( Supplementary Fig 1).

Initial Hypotheses
Using ADNI and DRC datasets, we aim to recover the spatial distribution of cortical atrophy and amyloid pathology, as well as the rate and timing of these pathological processes. In particular, we hypothesize that these spatial patterns of pathology and their evolution will be: 1. similar on two independent typical AD datasets: ADNI and DRC 2. different on distinct diseases: tAD vs PCA 3. different in distinct modalities: cortical thickness from MRI vs amyloid load from AV45 PET.

Results
The optimal number of clusters, as estimated with AIC, was three for the ADNI MRI dataset, three for the DRC tAD dataset, five for the DRC PCA dataset and eighteen for the ADNI PET dataset. Fig. 3A-left shows the results from the ADNI MRI dataset, where in the left image we coloured the vertices on the cortical surface according to the cluster they most likely belong to. We assigned a colour for each cluster (both the brain figures on the left and the trajectory figures on the right) according to the extent of pathology of its corresponding trajectory at a DPS score of 1. The cluster colours range from red (severe pathology) to blue (moderate pathology). In Fig. 3A-right, we show the resulting cluster trajectories with samples from the posterior distribution of each θ k . Similar results are shown for the other three datasets: the DRC tAD dataset (Fig. 3B), DRC PCA dataset (Fig. 3C) and the ADNI PET dataset (Fig. 3D). We notice that in tAD subjects using the ADNI datasets (Fig. 3A), there is more severe cortical thinning mainly in the inferior temporal lobe (red cluster), with disperse atrophy also in parietal and frontal regions (green cluster), with relative sparing of the inferior frontal and occipital lobes. In tAD subjects from the DRC dataset, we see a relatively similar pattern, however with more pronounced atrophy in the supramarginal cortex (red cluster) compared to ADNI. The spatial distribution of cortical thinning found with DIVE resembles results from previous longitudinal studies such as (Dickerson et al., 2008;Singh et al., 2006). However, in contrast to these approaches, our model gives insight into the timing and rate of atrophy and is also able to stage subjects across the disease time course. We also find that the cluster trajectories in the DRC tAD dataset have similar dynamics to the ADNI MRI dataset, although they show a clearer separation between each other.
In the PCA subjects (Fig. 3C), we find that atrophy is mainly focused on the posterior part of the brain, mostly the posterior parietal and supramarginal areas, with limited spread in the motor cortex, anterior temporal and frontal areas. There is a clear separation of atrophy extent between the anterior and posterior part of the brain, with the boundary running along the motor cortex. This posterior-focused pattern of atrophy is different from the one found in the tAD datasets, and agrees with previous findings in the literature (Crutch et al., 2012;Lehmann et al., 2011).
In ADNI PET (Fig. 3D) we see that the regions with the highest amyloid uptake are more spatially continuous, comprising the precuneus and anterior frontal areas. On the other hand, the anterior-superior temporal gyrus shows the least uptake of amyloid. This result closely matches the result by (Bilgel et al., 2016), which used a completely different dataset and modelling technique. The "layers of clusters" starting from the precuneus and frontal lobes, which range from severe to less severe atrophy, suggest a continuum of variation in vertex trajectories in the case of the PET dataset (Fig 3D-right). These trajectories all start with a low amyloid SUVR, between 0 and 0.25, but in late stages the trajectories for some clusters such as cluster 0 can reach an SUVR of 1.5. The reason for seeing this continuum might be because the PET images have a much lower resolution than MR images and were smoothed by ADNI during the pre-processing steps.

Motivation
We further tested the robustness and validity of the model as follows:

Evaluation Procedure
For all scenarios, we ran 10-fold cross-validation (CV) on the ADNI MRI dataset. At each fold we fit the model using 3 clusters, since this was the optimal number of clusters found previously on the entire dataset. The trained model was then used to estimate the DPS of the test subjects.

Evaluation Results
Fig . 4 shows the brain clusters and corresponding trajectories, estimated for all the cross-validation folds after fitting the model on the training data. The clusters have been coloured using a similar colour scheme as in Fig. 3. In Fig 5 we   The results in Fig. 4 demonstrate that DIVE is robust in cross-validation, as the estimated clusters and trajectory parameters are all similar across folds. The average Dice score overlap across the 10-folds range were 0.77, 0.76 and 0.90 for clusters 0, 1 and 2 respectively. The DIVE-derived DPS scores, which were estimated purely based on MRI data, are also clinically relevant as they correlate with cognitive tests (Fig 5).
The performance of DIVE in terms of subject staging and biomarker prediction also compares favourably with simpler no-staging and ROI-based models (Table 2). Results show that DIVE has comparable performance to the ROI-based model, both in terms of subject staging and cortical thickness prediction. The fact that DIVE has similar performance to a simpler model which has less parameters is evidence that the estimated patterns are meaningful. Moreover, DIVE offers qualitative insight into the fine-grained spatial patterns of pathology and their temporal progression. Furthermore, the No-staging model performs significantly worse than DIVE, both in terms of subject staging and for biomarker prediction. This suggests that, when modelling progression of AD, it is important to account for the fact that patients are at different stages along the disease time-course.

Summary and Key Findings
We presented DIVE, a spatiotemporal model of disease progression that clusters vertex-or voxel-wise measures of pathology in the brain based on similar temporal dynamics. The model highlights, for the first time, groups of cortical vertices that exhibit a similar temporal trajectory over the population. The model also estimates the temporal shift and progression speed for every subject. We applied the model on cortical thickness vertex-wise data from three MRI datasets (ADNI, DRC tAD and DRC PCA), as well as an amyloid PET dataset (ADNI). Our model found qualitatively similar patterns of cortical thinning in tAD subjects using the two independent datasets (ADNI and DRC). Moreover, it also found different patterns of pathology dynamics on two distinct diseases (tAD and PCA) and on different types of data (PET and MRI-derived cortical thickness). Finally, DIVE also provides a new way to parcellate the brain that is specific to the temporal trajectory of a particular disease.

Limitations and future work
DIVE has some limitations that can be addressed. First, we assumed that cluster trajectories follow sigmoidal shapes, which is not the case for many types of biomarkers in ADNI which do not plateau in later stages. The assumption of sigmoidal trajectories can be avoided using non-parametric curves such as Gaussian Processes (Lorenzi et al., 2017), which would be straightforward to incorporate into the DIVE framework. Another limitation of the model is that it assumes all subjects follow the same disease progression pattern, which might not be the case in heterogeneous datasets such as ADNI or DRC. This can be a concern, as there might be a pattern of pathology that occurs in a small set of subjects. However, DIVE can be extended to account for heterogeneity in the datasets by modelling different progression dynamics for distinct subgroups, using unsupervised learning methods like the SuStaIn model by (Young et al., 2018). While SuStaIn, just like DIVE, estimates clusters and trajectories within the dataset, the clusters in SuStaIn are made of subjects with similar disease progression, while the clusters in DIVE are made of vertices with similar progression. Future work could combine clustering along both subjects and vertices simultaneously to estimate disease subtypes with distinct spatiotemporal dynamics at the vertexwise level.
There are several potential future applications of DIVE. One of the advantages of DIVE is that it can be used to study the link between disconnected patterns of brain pathology and connectomes extracted from diffusion tractography or functional MRI (fMRI). Such an analysis would enable further understanding of the exact underlying mechanisms by which the brain is affected by the disease. Our model, which can estimate fine-grained spatial patterns of pathology, is more suitable than standard ROI-based methods for studying the link between pathology and these structural or functional connectomes, because white matter or functional connections have a finegrained and spatially-varying distribution of endpoints on the cortex.
Apart from studying the link with brain connectomes, there are other potential applications for DIVE. The model can be applied to study other types of vertexwise or voxelwise data, such as FDG PET, tau PET, DTI or Jacobian compression maps from MRI. Moreover, the model can also be extended to cluster points on the brain surface according to a more complex disease signature, that can be made of two or more biomarkers. For example, using our cortical thickness and amyloid PET datasets from ADNI, we could have clustered points on the brain based on both modalities simultaneously. Such complex disease signatures can offer important insights into the relationships between different modalities and underlying disease mechanisms.
DIVE is a spatiotemporal model that can be used for accurately predicting and staging patients across the progression timeline of neurodegenerative diseases. The spatial patterns of pathology can also be used to test mechanistic hypotheses which consider AD as a network vulnerability disorder. All these avenues can help towards disease understanding, patient prognosis, as well as clinical-trials for assessing efficacy of a putative treatment for slowing down cognitive decline. (C-D) The same error scores for Scenario 3. We notice that as the problem becomes more difficult, the errors in the DIVE estimated parameters increase. Errors were measured as sum of squared differences (SSD) between the true parameters and estimated parameters. For the trajectories, the SSD was calculated only based on the sigmoid centres, due to different scaling of the other sigmoidal parameters.

Acknowledgements
2 Comparison between DIVE and other models

Motivation
We were also interested to compare the performance of DIVE with other disease progression models. In particular, we were interested to test whether: • Modelling dynamic clusters on the brain surface improves subject staging and biomarker prediction • Modelling subject-specific stages with a linear transformation (the α i and β i terms) improves biomarker prediction

Experiment design
We compared the performance of our model to two simplified models: • ROI-based model: groups vertices according to an a-priori defined ROI atlas. This model is equivalent to the model by Jedynak et al., Neuroimage, 2012 and is a special case of our model, where the latent variables z lk are fixed instead of being marginalised as in equation 6.
• No-staging model: This is a model that doesn't perform any time-shift of patients along the disease progression timeline. It fixes α i = 1, β i = 0 for every subject, which means that the disease progression score of every subject is age.
We performed this comparison using 10-fold cross-validation. For each subject in the test set, we computed their DPS score and correlated all the DPS values with the same four cognitive tests used previously. We also tested how well the models can predict the future vertex-wise measurements as follows: for every subject i in the test set, we used their first two scans to estimate α i = 1, β i = 0 and then used the rest of the scans to compute the prediction error. For one vertex location on the cortical surface, the prediction error was computed as the root mean squared error (RMSE) between its predicted measure and the actual measure. This was then averaged across all subjects and visits. Table 1 shows the results of the model comparison, on ADNI MRI dataset. Each row represents one model tested, while each column represents a different performance measure: correlations with four different cognitive tests and accuracy in the prediction of future vertexwise measurements. In each entry, we give the mean and standard deviation of the correlation coefficients or RMSE across the 10 cross-validation folds.  Table 1: Comparison of our model with two more simplistic models on the ADNI MRI dataset. For each of the three models, we show the correlation of the disease progression scores (DPS) with respect to several cognitive tests: CDRSOB, ADAS13, MMSE and RAVLT. The correlation numbers represent the mean correlation across the 10 cross-validation folds.

Derivation of the Generalised EM algorithm
We seek to calculate M (u) = arg max M E Z|V,M (u−1) [log p(V, Z|M )]+log p(M ) where M (u) = (α (u) , β (u) , θ (u) , σ (u) , λ (u) ) are the set of model parameters at iteration u of the EM algorithm, and Z = (Z 1 , . . . , Z L ) is the set of discrete latent variables, where Z l represents the cluster that voxel l was assigned to, so Z l ∈ 1, . . . , K. While Z l (with capital letter) is a random variable, we will also use the notation z l (small letter) to represent the value that the random variable Z l was instantiated to. Finally, p(M (u) ) is a prior on these parameters that is chosen by the user. Expanding the expected value, we get: The E-step involves computing p Z = (z 1 , ..., z L )|V, M (u−1) , while the M-step comprises of solving the above equation.

E-step
In this step we need to estimate p(Z|V, M (u−1) ). For notational simplificy we will drop the (u − 1) superscript from M where N l is the set of neighbours of vertex l. However, this doesn't directly factorise over the vertices l due to the MRF terms Ψ(Z l , Z l 2 ). It is however necessary to find a form that factorises over the vertices, otherwise we won't be able to represent in memory the joint distribution over all Z variables. If we make the approximation p(Z|V, M ) ≈ L l p(V l |Z l , M ) then we loose out all the MRF terms and the model won't account for spatial correlation. We instead do a first-degree approximation by conditioning on the values of Z (u−1) N l , the labels of nearby vertices from the previous iteration. The approximation is thus: This form allows us to factorise over all the vertices to get p(Z l |V l , M ): where C is a normalistion constant that can be dropped. We can now further factorise p(Z l |Z m) ) and apply a similar factorisation to the prior p(Z (u−1) N l |V l , M ), resulting in: Factorising the summation over z N l 's we get: Replacing z l 2 with k 2 we get: We shall also denote z lk = p(Z l |V l , M ). Further simplifications result in: We further define the data-fit term D lk as follows: This results in: Finally, we simplify the sum over k 2 to get the update equation for z lk : In practice, we cannot naively compute the exponential term z lk = exp(log(z lk )) due to precision loss. However, we go around this by recomputing the exponentiation and normalisation of z lk simultaneously. Denoting x(k) = log z lk , for k ∈ [1 . . . K], we get:

M-step
The M-step itself does not have a closed-form analytical solution. We choose to solve it by successive refinements of the cluster trajectory parameters and the subject time shifts.

Optimising trajectory parameters
Trajectory shapeθ Taking equation 1 and fixing the subject time-shifts α, β and measurement noise σ, we can find its maximum with respect to θ only. More precisely, we want: We observe that for each cluster the individual θ k 's are conditionally independent, i.e. θ k ⊥ ⊥ θ m |{Z, α, β, σ} ∀k, m. We also assume that the prior factorizes for each θ k : log p(θ) = K k log p(θ k ). This allows us to optimise each θ k independently: Replacing the full data log-likelihood, we get: Note that we didn't include the MRF clique terms, since they are not a function of θ k . We propagate the logarithm inside the products: (17) We next assume that Z l , the hidden cluster assignment for vertex l, is conditionally independent of the other vertex assignments Z m , ∀m = l (See E-step approximation from Eq. 3). This independence assumption induces the following factorization: p(Z = (z 1 , ..., z L )|V, M (u−1) ) = L l p(Z l = z l |V, M (u−1) ). Propagating this product inside the sum over the vertices, we get: The terms which don't contain θ k dissapear: We further expand the gaussian noise model: Constants dissapear due to the arg max and we get the final update equation for θ k : Measurement noiseσ We first assume a uniform prior on the σ parameters to simplify derivations. Using a similar approach as with θ, after propagating the product inside the logarithm and removing the terms which don't contain σ k , we get: Note that, just as for θ above, the MRF clique terms were not included because they are not a function of σ k . Expanding the noise model we get: The maximum of a function l(σ k ) can be computed by taking the derivative of the function l and setting it to zero. This is under the assumption that l is differentiable, which it is but we won't prove it here. This gives: Propagating the differential operator further inside the sums we get: We next perform several small manipulations to reach a more suitable form of the derivative and then set it to be equal to zero: Finally, we solve for σ k and get its update equation:

Estimating subject time shiftsα, β
For estimating α, β, we adopt a similar strategy as in the case of θ, up to Eq. 18. This gives us the following problem: The terms α i , β i for other subjects i = i dissappear: Expanding the gaussian noise model we get: After removing constant terms we end up with the final update equation for α i , β i :

Estimating MRF clique termλ
We optimise λ using the following formula: Note that p(Z|V, M (u−1) , λ, Z (u−1) ) is a function of λ, so for each lambda we estimate z lk through approximate inference. We do this because otherwise the optimisation of λ will only take into account the clique terms and completely exclude the data terms. We further simplify the objective function for lambda as follows: We take the logarithm: Let us denote z lk = p(Z l = k|V, M (u−1) , λ, Z (u−1) ). Assuming independence between the latent variables Z l we get: However, we now want to make z lk a function of λ as previously mentioned, so z lk = ζ lk (λ), for some function ζ lk . More precisely, using the E-step update from Eq. 12 we define for each vertex l and cluster k a function ζ lk (λ) as follows: where D lk is as defined in Eq 10. We then replace z lk with ζ lk (λ) and introduce the chosen MRF clique model to get: We separate the cliques that have matching clusters to the ones that don't: We also factorise the clique terms: Finally, we simplify to get the objective function for λ.
For implementation speed-up, data-fit terms D lk can be pre-computed.

Fast DIVE Implementation -Proof of Equivalence
Fitting DIVE can be computationally prohibitive, especially given that the number of vertices/voxels can be very high, e.g. more than 160,000 in our datasets. We derived a fast implementtion of DIVE, which is based on the idea that for each subject we compute a weighted mean of the vertices within a particular cluster, and then compare that mean with the corresponding trajectory value. This is in contrast with comparing the value at each vertex with the corresponding trajectory of its cluster. In the next few sections, we will present the mathematical formulation of the fast implementation for parameters [θ, α, β]. Parameter σ already has a closed-form update, while parameter λ has a more complex update procedure for which this fast implementation doesn't work. For each parameter, we will also provide proofs of equivalence.

Fast implementation
The fast implementation for θ implies that, instead of optimising Eq. 29 we optimise the following problem: where < V ij >Ẑ k is the mean value of the vertices belonging to cluster k. Mathematically, we definê Z k = [z 1k γ k , z 2k γ k , . . . , z Lk γ k ] where γ k = ( L l=1 z lk ) −1 is the normalisation constant. Moreover, we have that < V ij >Ẑ k = L l=1 z lk γ k V ij . We take the derivative of the likelihood function l f ast of the fast implementation (Eq. 36) with respect to θ k and perform several simplifications: using the fact that L l=1 γ k z lk = 1 we get: By setting the derivative to zero, the optimal θ is thus a solution of the following equation:

Slow implementation
We will prove that if theta is a solution of the slow implementation, it is also a solution of Eq. 41, which will prove that the fast implementation is equivalent. The slow implementation is finding θ from the following equation: Taking the derivative of the function above (l slow ) with respect to θ k we get: After swapping terms around and using distributivity we get: This is the same optimisation problem as in Eq. 41, which proves that the two formulations are equivalent with respect to θ.

Noise parameterσ
The noise parameter σ can actually be computed in a closed-form solution for the original slow model implementation, so there is no benefit in implementing the fast update for σ. Moreover, the σ in the fast implementation computed the standard deviation in the mean value of the vertices within a certain cluster, and not the deviation withing the actual value of the vertices.

Fast implementation
The equivalent fast formulation for the subject-specific time shifts is similar to the one for the trajectory parameters. It should be noted however that we need to weight the sums corresponding to each cluster by γ −1 k . This gives the following equation for the fast formulation: In order to prove that this is equivalent to the slow version, we need to take the derivative of the likelihood function (l f ast ) from the above equation with respect to α i , β i and set it to zero: We expand the average across the vertices and slide the derivative operator inside the sums: Since L l=1 γ k z lk = 1 we get: Removing the factor 2 and sliding γ k : Further sliding L l=1 z lk to the left we get the final optimisation problem:

Slow implementation
In a similar way to the trajectory parameters, we want to prove that solving the problem from Eq. 50 (fast implementation) is the same as solving the original slow implementation problem, which is defined as: Taking the derivative of the function above with respect to α i , β i , we get: This is the same problem as the fast implementation one from Eq. 50, thus the fast model is equivalent to the slow model with respect to α, β.