Manifold Markov chain Monte Carlo methods for Bayesian inference in diffusion models

Bayesian inference for nonlinear diffusions, observed at discrete times, is a challenging task that has prompted the development of a number of algorithms, mainly within the computational statistics community. We propose a new direction, and accompanying methodology—borrowing ideas from statistical physics and computational chemistry—for inferring the posterior distribution of latent diffusion paths and model parameters, given observations of the process. Joint configurations of the underlying process noise and of parameters, mapping onto diffusion paths consistent with observations, form an implicitly defined manifold. Then, by making use of a constrained Hamiltonian Monte Carlo algorithm on the embedded manifold, we are able to perform computationally efficient inference for a class of discretely observed diffusion models. Critically, in contrast with other approaches proposed in the literature, our methodology is highly automated, requiring minimal user intervention and applying alike in a range of settings, including: elliptic or hypo‐elliptic systems; observations with or without noise; linear or non‐linear observation operators. Exploiting Markovianity, we propose a variant of the method with complexity that scales linearly in the resolution of path discretisation and the number of observation times. Python code reproducing the results is available at http://doi.org/10.5281/zenodo.5796148.


Introduction
A large number of stochastic dynamical systems are modelled via the use of diffusion processes; see e.g. Kloeden and Platen (1992); Oksendal (2013) and the references therein. An enormous amount of research has been dedicated to both the theoretical foundations of such processes and, as with this work, their statistical calibration. Our work lies in the context of processes observed discretely in time, under a low-frequency regime, so that approximations of the typically analytically intractable transition densities are assumed to be inaccurate. In this setting, data augmentation approaches within a Bayesian framework have delivered the prevailing methodologies, see e.g. Sørensen (2009); Papaspiliopoulos et al. (2013), as they provide various model-specific algorithms for treating a number of different specifications of the structure of the diffusion process and of the observation regime. The performance of the developed algorithms can be improved via a combination of model transforms, often motivated by the Roberts-Stramer critique (Roberts and Stramer, 2001), namely that the posterior distribution of the diffusivity parameters given a time discretisation of the process degenerates at finer resolutions, and more efficient Markov chain Monte Carlo (mcmc) kernels.
The work herein provides a natural approach for Bayesian inference over diffusion processes. Observations are treated as constraints placed on latent paths and parameters. This gives rise to the viewpoint that the posterior can be expressed as the prior distribution restricted to a manifold. We apply existing mcmc methods for sampling from distributions supported on submanifolds based on the simulation of constrained Hamiltonian dynamics (see, e.g., Hartmann and Schütte (2005); Rousset et al. (2010); Brubaker et al. (2012); Lelièvre et al. (2019)) to efficiently explore this manifold-supported posterior distribution. This class of methods relies on symplectic integrators for constrained Hamiltonian systems (Andersen, 1983; Leimkuhler and Skeel, 1994; Reich, 1996; Leimkuhler and Matthews, 2016). Critically, we leverage the Markovian structure of the diffusion process and Gaussianity of the driving noise to design a scalable inferential procedure. The main contributions of the proposed methodology can be summarised as follows: (i) We provide a new viewpoint and accompanying algorithmic methodology for calibrating stochastic differential equation (sde) models. The posterior is expressed as a distribution supported on a manifold embedded in a non-centred parametrisation of the latent path and parameter space. We then make use of a constrained Hamiltonian Monte Carlo (hmc) scheme to explore this manifold, jointly updating both the parameters and latent path. (ii) Unlike other algorithms that are often limited to specific model families, our approach is highly automated and remains unchanged irrespective of the choice of diffusion and observation models, including: elliptic or hypo-elliptic systems; data observed with or without noise; linear or non-linear observation operators. (iii) We propose a novel constrained integrator that exploits the Gaussianity of the prior distribution on the pathspace.
This leads to an improved scaling in sampling efficiency as the resolution of the latent path discretisation is refined. (iv) We propose a scheme to exploit the Markovian structure of sde models to ensure that the computational cost of the integrator for the Hamiltonian dynamics scales linearly both with the resolution of path discretisation and the number of observation times. To the best of our knowledge, the developed approach for leveraging the Markovian structure of the model is new. (v) Our method extends the family of sdes for which statistical calibration is now attainable. Consider the class of R^X-valued sdes directly observed (without noise) via a non-linear function h : R^X → R^Y at a finite set of times, for X ≥ Y ≥ 1. In such a scenario, standard data augmentation schemes fail (for non-trivial choices of h) as the posterior of the latent variables given the observations does not have a density with respect to the Lebesgue measure. In contrast, our method remains applicable and unchanged.
Remark 1.1 (Criteria). To clarify the position of the framework put forward in this work within the wide field of statistical calibration for sdes, we list a number of criteria met by our algorithm: (i) It carries out full Bayesian inference for the model at hand.
(ii) It respects the Roberts-Stramer critique: the mixing times remain stable as the resolution of the path-discretisation is refined. (iii) It is applicable in scenarios where data is observed with or without noise; it is stable in the setting of diminishing noise. (iv) It is applicable in the case of both full and partial observations. For partial observations, it accommodates both linear and non-linear observation operators. (v) It attains the above via a unified and, in principle, automated methodology.
We have chosen the applications in Section 7 to highlight these properties. To our knowledge, the proposed method is unique in satisfying all of criteria (i)-(v).
The rest of the paper is organised as follows. Section 2 presents a generic class of sde models relevant to our work. Section 3 recasts the inferential problem as one of exploring a posterior distribution supported on a manifold. Section 4 describes the constrained hmc method for sampling such distributions on implicitly defined manifolds. Section 5 shows how the Markovian structure of sde models can be exploited to design a scalable implementation of the methodology. Section 6 discusses related works. Section 7 illustrates the approach on several numerical examples, with comments on algorithmic performance and comparisons to alternative mcmc methods. Section 8 concludes with a brief summary and directions for future research.
Notation. Sans-serif symbols are used to distinguish random variables from their realisations (respectively, x and x). The set of integers from A ∈ Z to B ∈ Z inclusive, B ≥ A, is A:B. Floor and ceiling operations are denoted ⌊x⌋ and ⌈x⌉ respectively. A symbol subscripted by a set indicates an indexed tuple, e.g. x_{A:B} = (x_s)_{s∈A:B}. The set of linear maps from a vector space X to a vector space Y is L(X, Y). For f : R^M → R^N, the Jacobian of f is ∂f : R^M → R^{N×M} and for f : R^M → R, its gradient and Hessian are ∇f : R^M → R^M and ∇²f : R^M → R^{M×M}. For a multiple-argument function g the Jacobian with respect to the ith argument is denoted ∂_i g and ∂g = (∂_1 g, ∂_2 g, ...). The concatenation of vectors x and y is denoted [x; y] and the concatenation of a tuple of vectors x_{1:N} is [x_{1:N}] = [x_1; ...; x_N], with the operation acting recursively, e.g. [x_{1:N}; y] = [[x_{1:N}]; y]. The determinant of a square matrix M is |M|. The N × N identity matrix is I_N. The block diagonal matrix with M_{1:N} left-to-right along its diagonal is diag M_{1:N}. The N-dimensional Lebesgue measure is λ^N. The set of Borel probability measures on a space X is P(X).

Diffusion model
We consider the task of inferring the parameters of Itô-type sdes of the form

dx(τ) = a(x(τ), z) dτ + B(x(τ), z) db(τ)   (1)

defined on a time interval T ⊆ R_{≥0}, where z is a Z ⊆ R^Z-valued vector of model parameters, x an X ≡ R^X-valued random process, b a B ≡ R^B-valued standard Wiener process, a : X × Z → X the drift function and B : X × Z → L(B, X) the diffusion coefficient function. This time-homogeneous sde system can be characterised by a family of Markov kernels κ_τ : X × Z → P(X), with κ_{τ′−τ}(x, z)(dx′) the probability of x(τ′) ∈ dx′ given (x(τ) = x, z = z), for (τ, τ′, x, z) ∈ T × T × X × Z. The parameter z is assigned a prior distribution μ ∈ P(Z) and, given z, the initial state x_0 is given a prior ν : Z → P(X). We assume the system is observed at T times with a fixed inter-observation interval ∆ > 0 and T = [0, T∆]. The Y ⊆ R^Y-valued observed vectors y_{1:T} are then defined for each t ∈ 1:T as y_t = h(x(t∆), z, w_t), with h : X × Z × W → Y the observation function, and w_t ∼ η the observation noise vector at time index t, with distribution η ∈ P(W) and W ⊆ R^W.
Both noiseless and noisy observation regimes are covered; in the noiseless case the observation noise vectors w_{1:T} can be omitted from the model. Our methodology readily extends to irregular observation times and time-varying model specifications, for both the sde and the observation parts; however, for brevity of exposition, we only describe the equispaced and time-independent case here.
In general, it is neither possible to exactly sample from the Markov kernels κ_τ nor to evaluate their densities with respect to the Lebesgue measure on X. We thus adopt a data-augmentation approach (Elerian et al., 2001; Roberts and Stramer, 2001) and consider a discrete-time model formed by numerically integrating the original sde; although this will introduce discretisation error, the error can be controlled by using a fine time-resolution. We split each inter-observation interval into S smaller time steps of size δ = ∆/S. Given a time discretisation, a variety of numerical schemes for integrating sde systems are available, with varying levels of complexity and convergence properties (Kloeden and Platen, 1992). The schemes of interest in this article can be expressed as a forward operator f_δ : Z × X × R^V → X defined such that, given parameters z ∈ Z, a current state x ∈ X and a random vector v ∼ N(0, I_V), f_δ(z, x, v) is approximately distributed according to κ_δ(x, z) for small time steps δ > 0. The simplest and most commonly used scheme is the Euler-Maruyama method, where V = B and f_δ(z, x, v) = x + δ a(x, z) + δ^{1/2} B(x, z) v. Importantly, the methodology developed in this article straightforwardly accommodates higher-order methods, such as the Milstein scheme (Mil'shtejn, 1975).
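The Euler-Maruyama forward operator above can be sketched in a few lines; the Ornstein-Uhlenbeck drift and parameter values below are illustrative examples, not part of the paper's models.

```python
import numpy as np

def forward_operator_em(z, x, v, dt, drift, diff_coeff):
    """Euler-Maruyama forward operator f_dt(z, x, v): given parameters z,
    current state x and a standard normal vector v, returns an approximate
    sample from the transition kernel kappa_dt(x, z)."""
    return x + dt * drift(x, z) + np.sqrt(dt) * diff_coeff(x, z) @ v

# Illustrative scalar Ornstein-Uhlenbeck process dx = -z[0] x dt + z[1] db
drift = lambda x, z: -z[0] * x
diff_coeff = lambda x, z: np.array([[z[1]]])

rng = np.random.default_rng(0)
x_next = forward_operator_em(
    z=np.array([1.0, 0.5]), x=np.array([2.0]), v=rng.standard_normal(1),
    dt=0.01, drift=drift, diff_coeff=diff_coeff)
```

With v = 0 the operator reduces to a deterministic Euler step, which makes the discretisation error straightforward to reason about.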
For a particular choice of numerical scheme, given the parameters z ∼ μ and initial position x_0 ∼ ν(z), the states at all subsequent time steps x_{1:ST} are iteratively generated for s ∈ 1:ST via the forward operator f_δ, with x_s denoting the discrete-time approximation to the continuous-time state x(sδ). The observations y_{1:T} are computed from the discrete-time state sequence x_{1:ST} via the observation function h and observation noise vectors w_{1:T}. The overall generative model is summarised in Model 1.

[Model 1: Time-discretised diffusion generative model.]
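The generative rollout summarised in Model 1 can be sketched as follows; the forward operator, observation function and all constants here are illustrative stand-ins.

```python
import numpy as np

def generate(z, x0, vs, ws, dt, forward, obs_fn, S):
    """Roll out the time-discretised model: iteratively apply the forward
    operator to produce states x_{1:ST}, emitting an observation every S steps.
    vs: (S*T, V) latent noise increments; ws: (T, W) observation noise vectors."""
    x, xs, ys = x0, [], []
    for s, v in enumerate(vs, start=1):
        x = forward(z, x, v, dt)
        xs.append(x)
        if s % S == 0:  # an observation time has been reached
            ys.append(obs_fn(x, z, ws[s // S - 1]))
    return np.array(xs), np.array(ys)

# Illustrative Euler-Maruyama forward operator and noisy linear observation
forward = lambda z, x, v, dt: x + dt * (-z[0] * x) + np.sqrt(dt) * z[1] * v
obs_fn = lambda x, z, w: x + 0.1 * w

S, T, dt = 5, 3, 0.02
rng = np.random.default_rng(1)
xs, ys = generate(np.array([1.0, 0.5]), np.array([0.3]),
                  rng.standard_normal((S * T, 1)), rng.standard_normal((T, 1)),
                  dt, forward, obs_fn, S)
```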

Inferential objective on a manifold
We are interested in computing expectations with respect to the joint posterior of z, x_0, x_{1:ST}, given observations y_{1:T} = y_{1:T}. However, the states at nearby time steps will be highly dependent under the prior on x_{1:ST} for small δ. Such strong dependencies are characteristic of centred parametrisations of hierarchical models, and have a deleterious effect on the performance of many approximate inference algorithms (Papaspiliopoulos et al., 2003, 2007; Betancourt and Girolami, 2015).

Non-centred parametrisation
One can instead choose to parametrise the inference problem in terms of the latent vectors v 1:ST used to numerically integrate the sde. Given values for z, x 0 and v 1:ST , the state sequence x 1:ST can be deterministically computed. Such a reparametrisation has the property that, under the prior, all components of the latent vectors v 1:ST are independent standard normal variables. We further assume the following.
Assumption 3.1. There exist functions g_z : R^U → Z and g_{x_0} : Z × R^{V_0} → X and corresponding distributions μ̃ ∈ P(R^U), ν̃ ∈ P(R^{V_0}) with strictly positive smooth density functions with respect to the Lebesgue measures λ^U and λ^{V_0} respectively, such that g_z(u) ∼ μ and g_{x_0}(z, v_0) ∼ ν(z) for all z ∈ Z if u ∼ μ̃ and v_0 ∼ ν̃.
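As a minimal illustration of Assumption 3.1 (the prior and all numerical values here are invented for the example), a log-normal prior on a scalar parameter and a scale-mixture prior on the initial state can both be realised by pushing standard normal variables through deterministic maps:

```python
import numpy as np

# Hypothetical reparametrisations satisfying Assumption 3.1:
# u ~ N(0, 1) maps to a log-normal prior on the (scalar) parameter z,
# v0 ~ N(0, 1) maps to x0 | z ~ N(0, z^2).
g_z = lambda u: np.exp(-1.0 + 0.5 * u)   # z = exp(mu + sigma * u), mu=-1, sigma=0.5
g_x0 = lambda z, v0: z * v0              # x0 = z * v0

# Empirical sanity check: log g_z(u) should be N(-1, 0.5^2) under u ~ N(0, 1)
rng = np.random.default_rng(2)
us = rng.standard_normal(100_000)
zs = g_z(us)
x0s = g_x0(zs, rng.standard_normal(100_000))
```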
Under such a parametrisation in terms of q := [u; v_0; v_{1:ST}; w_{1:T}], all of (u, v_0, v_{1:ST}, w_{1:T}) are a-priori independent and the resulting prior distribution ρ ∈ P(R^Q), with Q = U + V_0 + STV + TW, has a density with respect to the Lebesgue measure λ^Q. Model 2 gives the generative model under this non-centred parametrisation and defines a function g_y which generates observations given values for the latent variables.

[Model 2: Non-centred parametrisation of generative model.]

The observations can be thought of as imposing a series of constraints on the possible values of the latent variables q; under additional assumptions on the regularity of the mapping g_y from latent variables to observations, the set of q values satisfying the constraints will form a differentiable manifold embedded in R^Q. The posterior distribution on q given y_{1:T} = y_{1:T} will not have a density with respect to the Lebesgue measure λ^Q, as the manifold on which it is supported is a λ^Q-null set. In the following section we show, however, that by using a different reference measure we can compute a tractable density function for the posterior.

Target posterior on manifold
We define a constraint function c : R^Q → R^C by c(q) := g_y(q) − [y_{1:T}], with the set of values on the manifold M := {q ∈ R^Q : c(q) = 0} corresponding to all inputs of g_y consistent with the observations. We make the following assumption.
Assumption 3.2. The constraint function c is continuously differentiable and has Jacobian ∂c which is full row-rank ρ-almost surely.
The differentiability requirement will be met if f_δ, g_z, g_{x_0} and h are all themselves continuously differentiable with respect to each of their arguments. The rank condition on the Jacobian requires that the observed variables do not give redundant information about the latent variables, i.e. no observed variable can be expressed as a deterministic function of a subset of the other observed variables. In the case of observations subject to Gaussian additive noise this will always be satisfied if the noise covariance is full-rank. In the noiseless observation case the condition will be met if no component of the state at an observation time x_{St} is fully determined by the state at the previous observation time x_{S(t−1)} and parameters z, and the observation function, viewed as a map h̄ : X → Y of the state alone, has a Jacobian with full row-rank everywhere.
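A constraint function of the form c(q) = g_y(q) − y, together with a numerical check of the full row-rank condition in Assumption 3.2, can be sketched as follows; the two-observation map g_y and the chosen point are toy examples, and the finite-difference Jacobian is only a diagnostic, not part of the methodology.

```python
import numpy as np

def constr(q, g_y, y_obs):
    """Constraint c(q) = g_y(q) - y_obs; zero exactly on the manifold M."""
    return g_y(q) - y_obs

def jacobian_fd(f, q, eps=1e-6):
    """Forward-difference approximation to the Jacobian of f at q."""
    f0 = f(q)
    J = np.zeros((f0.size, q.size))
    for i in range(q.size):
        dq = np.zeros_like(q)
        dq[i] = eps
        J[:, i] = (f(q + dq) - f0) / eps
    return J

# Toy g_y: two noiseless non-linear observations of a 3-dimensional latent vector
g_y = lambda q: np.array([q[0] ** 2 + q[1], np.sin(q[2])])
y_obs = np.array([1.2, 0.3])
q = np.array([0.5, 0.95, np.arcsin(0.3)])  # a point chosen to lie on M
c = constr(q, g_y, y_obs)
J = jacobian_fd(lambda q_: constr(q_, g_y, y_obs), q)
```

At a point where the Jacobian loses row rank, Assumption 3.2 fails and the constraint set is not guaranteed to be a differentiable manifold.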
Under these assumptions M will be a D = Q − C dimensional differentiable manifold embedded into the Q dimensional ambient space. The posterior distribution π ∈ P(R Q ) on q given c(q) = 0 (and so y 1:T = y 1:T ) is supported only on M. Note that M has zero Lebesgue measure, so π does not have a density with respect to λ Q . To define an appropriate reference measure we further assume the following.
Assumption 3.3. The ambient latent space R Q is equipped with a metric tensor with a fixed positive definite matrix representation M .
A possible reference measure is then the D-dimensional Hausdorff measure η^M_D on the ambient space, which has the required property that π is absolutely continuous with respect to η^M_D. On measurable subsets of M, η^M_D coincides with the Riemannian measure σ^M_M on the manifold M with the metric induced from the ambient metric (see Lemma C.2 in Appendix C). As later results will be more naturally stated in terms of the Riemannian measure, we will use σ^M_M as the reference measure here.
Proposition 3.1. Under Assumptions 3.1 to 3.3 the posterior π has a density with respect to the Riemannian measure σ^M_M of the form

dπ/dσ^M_M(q) ∝ dρ/dλ^Q(q) |∂c(q) M^{−1} ∂c(q)^T|^{−1/2}.

A proof is given in Appendix C. See also Rousset et al. (2010), Diaconis et al. (2013) and Graham and Storkey (2017). The negative log posterior density thus reads

ℓ(q) = −log dρ/dλ^Q(q) + ½ log |G_M(q)|,  with G_M(q) := ∂c(q) M^{−1} ∂c(q)^T.   (4)
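Evaluating a negative log density of this Gram-determinant form reduces to a log-determinant of the C × C matrix ∂c(q) M^{−1} ∂c(q)^T; the sketch below uses a standard normal ambient prior and a made-up constant constraint Jacobian purely for illustration.

```python
import numpy as np

def neg_log_posterior_dens(q, neg_log_prior_dens, jacob_constr, metric_inv):
    """l(q) = -log(drho/dlambda)(q) + 0.5 * log|dc(q) M^-1 dc(q)^T|,
    the negative log density with respect to the Riemannian measure on M."""
    J = jacob_constr(q)
    gram = J @ metric_inv @ J.T            # C x C Gram matrix of the constraint
    return neg_log_prior_dens(q) + 0.5 * np.linalg.slogdet(gram)[1]

# Illustrative inputs: standard normal prior, constant Jacobian [1, 1, 0]
q = np.array([1.0, 2.0, 3.0])
neg_log_prior = lambda q: 0.5 * q @ q      # up to an additive constant
jacob_constr = lambda q: np.array([[1.0, 1.0, 0.0]])
val = neg_log_posterior_dens(q, neg_log_prior, jacob_constr, np.eye(3))
```

Using `slogdet` rather than `det` keeps the evaluation numerically stable when the Gram matrix is badly scaled.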

MCMC on implicitly defined manifolds
In this section, we review mcmc methods for sampling from a distribution supported on an implicitly defined manifold. We stress at this point that we do not design a fundamentally new such mcmc method but instead rely on modifying and combining existing methodologies. In particular, we adopt a symplectic integrator for constrained Hamiltonian systems (Andersen, 1983; Leimkuhler and Skeel, 1994; Leimkuhler and Matthews, 2016) to simulate Hamiltonian dynamics trajectories on the manifold, and use this as a proposal generating mechanism within a Hamiltonian Monte Carlo (hmc) scheme (Duane et al., 1987; Neal, 2011; Betancourt, 2017). The use of constrained Hamiltonian dynamics within an hmc context has been previously proposed multiple times; see for example Hartmann and Schütte (2005), Rousset et al. (2010, Ch. 3), Brubaker et al. (2012) and Lelièvre et al. (2019). We refer the interested reader to Arnol'd (2013); Holm et al. (2009) for general background on constrained mechanics and to Barp (2020, Ch. 3) for a comprehensive review of hmc on manifolds.

Constrained Hamiltonian dynamics
To define the constrained Hamiltonian system, we first augment the latent vector q, henceforth the position, with a momentum p. Formally the momentum is a co-vector, i.e. a linear form in L(R^Q, R), and the metric on the position space induces a co-metric on the momentum space with matrix representation M^{−1}. As a common abuse of notation, we will not distinguish between vectors and co-vectors and simply consider p as a vector in R^Q equipped with a metric with matrix representation M^{−1}. The Hamiltonian is then

h(q, p) = ℓ(q) + ½ p^T M^{−1} p.   (5)

Thus far we have an unconstrained Hamiltonian system. To restrict q to M, we introduce a Lagrange multiplier function λ : R^Q × R^Q → R^C implicitly defined so that the constraint c(q) = 0 is enforced, at all times, in the following dynamics. The constrained Hamiltonian dynamics associated with the Hamiltonian in (5) are then described by the system of differential algebraic equations (daes)

q̇ = M^{−1} p,  ṗ = −∇ℓ(q) − ∂c(q)^T λ(q, p),  c(q) = 0.   (6)

The condition that the primary constraints, c(q) = 0, are preserved in time implies a set of secondary constraints of the form ∂c(q) M^{−1} p = 0.
Definition 4.1. The set of momenta satisfying the secondary constraints at a position q coincides with the co-tangent space of the manifold M at q, denoted T*_q M := {p ∈ R^Q : ∂c(q) M^{−1} p = 0}.

Definition 4.2. The set of positions and momenta in the manifold and corresponding co-tangent spaces respectively are termed the co-tangent bundle, denoted T*M := {(q, p) : q ∈ M, p ∈ T*_q M}.

The flow map Φ^hc_t of the constrained dynamics conserves the Hamiltonian and preserves the symplectic structure on the co-tangent bundle (Propositions 4.1 and 4.2); see for example Leimkuhler and Reich (2004, Chapter 7). Proofs are also given in Appendices E and F. Together these properties mean the flow map Φ^hc_t has an invariant measure on T*M.
Corollary 4.1. The conservation properties in Propositions 4.1 and 4.2 imply that the measure ζ(dq, dp) ∝ exp(−h(q, p)) σ_{T*M}(dq, dp) is invariant under the flow map Φ^hc_t corresponding to the constrained dynamics in Eq. (6).
Using the definitions in Eq. (5) and Eq. (7), it readily follows that the target posterior π(dq) ∝ exp(−ℓ(q)) σ^M_M(dq) is the marginal distribution on the position under the invariant measure ζ. Thus, the flow map Φ^hc_t can be used to construct a family of Markov kernels which marginally leave π invariant.

Momentum resampling
As the dynamics remain confined to a level-set of the Hamiltonian in Eq. (5), a Markov chain constructed by iterating Φ^hc_t will not be ergodic. By resampling the momentum between applications of Φ^hc_t we can however move between Hamiltonian level-sets. To orthogonally (with respect to the co-metric) project a momentum onto T*_q M, the co-tangent space at q, we apply the projector P_M(q), defined as

P_M(q) := I_Q − ∂c(q)^T (∂c(q) M^{−1} ∂c(q)^T)^{−1} ∂c(q) M^{−1}.

Using P_M we can independently sample a momentum from its conditional distribution given the position under the measure ζ by projecting a sample from N(0, M).
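The momentum resampling step can be sketched as follows, writing the projector as I − ∂c^T (∂c M^{−1} ∂c^T)^{−1} ∂c M^{−1} so that the projected momentum satisfies the secondary constraint ∂c(q) M^{−1} p = 0 by construction; the constraint Jacobian and metric below are illustrative choices.

```python
import numpy as np

def project_momentum(q, p, jacob_constr, metric_inv):
    """Orthogonally (w.r.t. the co-metric M^-1) project p onto the co-tangent
    space T*_q M = {p : dc(q) M^-1 p = 0}."""
    J = jacob_constr(q)
    gram = J @ metric_inv @ J.T
    return p - J.T @ np.linalg.solve(gram, J @ metric_inv @ p)

# Resample a momentum given the position: project a draw from N(0, M)
rng = np.random.default_rng(3)
metric = np.eye(3)                                   # illustrative metric M
jacob_constr = lambda q: np.array([[2 * q[0], 1.0, 0.0]])  # toy constraint Jacobian
q = np.array([0.5, 1.0, -0.2])
p = project_momentum(q, rng.multivariate_normal(np.zeros(3), metric),
                     jacob_constr, np.linalg.inv(metric))
```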

Numerical discretisation
In general the system of daes in Eq. (6) will not have an analytic solution, and we are required to use a time discretisation to approximate the exact flow map Φ hc t . We will first introduce a class of symplectic integrators for the unconstrained Hamiltonian system before showing how they can be used to construct a symplectic integrator for the constrained Hamiltonian system.

Unconstrained integrator: Störmer-Verlet and Gaussian splittings
A standard approach for defining symplectic integrators for Hamiltonian systems is to split the Hamiltonian into a sum of components for which the exact corresponding flow maps can be computed, with a splitting of the form h(q, p) = h_1(q) + h_2(q, p) particularly common. If Φ^{h_1}_t and Φ^{h_2}_t denote the flow maps associated with the dynamics for Hamiltonians h_1 and h_2 respectively, then the symmetric composition

Ψ_t = Φ^{h_1}_{t/2} ∘ Φ^{h_2}_t ∘ Φ^{h_1}_{t/2}

is a symplectic and second-order accurate integrator for the Hamiltonian system (Leimkuhler and Reich, 2004). Furthermore, as both Φ^{h_1}_t and Φ^{h_2}_t are time-reversible, Ψ_t is also time-reversible.
Various choices can be made for splitting the Hamiltonian of interest in Eq. (5) between h_1 and h_2, subject to the requirement that the flow map Φ^{h_2}_t can be computed, with Φ^{h_1}_t(q, p) = (q, p − t∇h_1(q)) always trivial to compute. An obvious splitting is h_1(q) = ℓ(q) and h_2(q, p) = ½ p^T M^{−1} p; in this case Φ^{h_2}_t(q, p) = (q + tM^{−1}p, p). The composition then corresponds to the Störmer-Verlet integrator (Verlet, 1967).
In our setting, the log prior density on the ambient space, log dρ/dλ^Q, is quadratic in the components of the position q corresponding to v_{1:ST}, due to their standard normal prior distribution. It will typically also be possible to choose an appropriate parametrisation such that the prior densities dμ̃/dλ^Z, dν̃/dλ^X and dη/dλ^W are equal to or well approximated by standard normal densities. An alternative splitting, which can be useful in this setting, is then h_1(q) = ℓ(q) − ½ q^T q, h_2(q, p) = ½ q^T q + ½ p^T M^{−1} p, with the simplification h_1(q) = ½ log |G_M(q)| when log dρ/dλ^Q(q) = −½ q^T q. The quadratic form of h_2 and corresponding linear derivatives mean the corresponding flow map is still exactly computable. If we let R be an orthonormal matrix with the normalised eigenvectors of M^{−1} as columns and Ω a diagonal matrix of the square-roots of the eigenvalues, such that M^{−1} = RΩ²R^T, then we have that

Φ^{h_2}_t(q, p) = (R(cos(tΩ) R^T q + Ω sin(tΩ) R^T p), R(−Ω^{−1} sin(tΩ) R^T q + cos(tΩ) R^T p)).

This splitting and corresponding integrator has been used previously in various settings (Beskos et al., 2011; Neal, 2011; Beskos et al., 2013a; Shahbaba et al., 2014). Importantly, as the flow map Φ^{h_2}_t exactly preserves Gaussian prior measures, under certain assumptions the change in Hamiltonian over a trajectory generated using the integrator does not grow as the dimension Q of the space is increased, so the probability of accepting a proposed move from the trajectory remains independent of dimension for a fixed step size. This is in contrast to the Störmer-Verlet integrator, for which, for a fixed step size, the accept probability will tend to zero as the dimension becomes large (Beskos et al., 2011).
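The exact h_2 flow map for the Gaussian splitting can be sketched via an eigendecomposition of M^{−1}; the co-metric and initial condition below are arbitrary test values, and the key property checked is that the map conserves h_2 exactly.

```python
import numpy as np

def flow_h2_gaussian(q, p, t, metric_inv):
    """Exact flow map of h2(q, p) = q.q/2 + p.(M^-1 p)/2, computed in the
    eigenbasis of M^-1 = R Omega^2 R^T (rotation by angles t * Omega)."""
    eigval, R = np.linalg.eigh(metric_inv)
    omega = np.sqrt(eigval)
    a, b = R.T @ q, R.T @ p
    cos_t, sin_t = np.cos(t * omega), np.sin(t * omega)
    q_new = R @ (cos_t * a + omega * sin_t * b)
    p_new = R @ (-sin_t / omega * a + cos_t * b)
    return q_new, p_new

# Arbitrary positive definite co-metric and initial phase-space point
rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3))
metric_inv = A @ A.T + 3.0 * np.eye(3)
q0, p0 = rng.standard_normal(3), rng.standard_normal(3)
q1, p1 = flow_h2_gaussian(q0, p0, 0.7, metric_inv)
```

Because the flow is exact, running it with step −t returns the original point, which also illustrates its time-reversibility.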
In the context here of inference in partially-observed diffusion models, as the time step δ of the discretisation of the diffusion is decreased (or, equivalently, S increased), the dimension of the set of latent noise vectors v_{1:ST}, and so of q, will increase, with the prior distribution on q tending in the limit δ → 0 to a distribution with a density with respect to an infinite-dimensional Gaussian measure. As here the target posterior distribution has support only on a submanifold of the ambient space, the results of Beskos et al. (2011) do not directly carry over; however, empirically we have found that a constrained integrator based on this Gaussian splitting gives an improved scaling in sampling efficiency with S compared to the Störmer-Verlet splitting, as we illustrate in our numerical experiments in Section 7.

Constrained integrator
We now show how a constraint-preserving symplectic integrator can be formed from the unconstrained integrator Ψ_t. In Reich (1996, Section 3.1) it is observed that the map defined by Π_λ(q, p) = (q, p − ∂c(q)^T λ(q, p)), with q ∈ M, is symplectic for any function λ that is sufficiently regular (e.g. continuously differentiable). Reich (1996) then shows that if a second-order accurate symplectic integrator for an unconstrained system with Hamiltonian as in Eq. (5) is defined by the map Ψ_t, then the integrator with step defined by the composition (q′, p′) = Π_{λ′} ∘ Ψ_t ∘ Π_λ(q, p), with λ implicitly defined by solving for the primary constraints, c(q′) = 0, and λ′ by solving for the secondary constraints, ∂c(q′) M^{−1} p′ = 0, is a second-order accurate symplectic integrator for the corresponding constrained system.
Rather than composing instances of Π λ with the overall map Ψ t as proposed by Reich (1996), we can instead consider composing Π λ with the component maps which make up Ψ t to enforce the constraints within each 'sub-step'. This was proposed for the specific case of Ψ t corresponding to a Störmer-Verlet integrator in the geodesic integration algorithm of Leimkuhler and Matthews (2016).
For a general quadratic h_2 (covering both the Störmer-Verlet and Gaussian splittings introduced above) the associated component flow maps Φ^{h_1}_t and Φ^{h_2}_t can be expressed, for suitable choices of matrices (Γ^{q,q}_t, Γ^{q,p}_t, Γ^{p,q}_t, Γ^{p,p}_t), as

Φ^{h_1}_t(q, p) = (q, p − t∇h_1(q)),  Φ^{h_2}_t(q, p) = (Γ^{q,q}_t q + Γ^{q,p}_t p, Γ^{p,q}_t q + Γ^{p,p}_t p).   (11)

As the sub-step for h_1 leaves the position unchanged, the corresponding constrained sub-step can be solved explicitly for λ, giving Ξ^{h_1}_t(q, p) = (q, P_M(q)(p − t∇h_1(q))). For the sub-step for h_2 we instead set Ξ^{h_2}_t(q, p) = (q̄, P_M(q̄)p̄), with (q̄, p̄) = Φ^{h_2}_t ∘ Π_λ(q, p) and λ implicitly defined by the condition

c(q̄) = 0,   (12)

which in general must be solved iteratively, e.g. with Newton's method. For sufficiently small t and sufficiently smooth constraint functions it can be shown that there exists a locally unique solution to (12) (Brubaker et al., 2012; Lelièvre et al., 2019). In general, though, there may be multiple or no solutions, and even if there is a unique solution the iterative solver may fail to converge. This lack of a guarantee of converging to a unique solution presents a challenge in terms of maintaining the time-reversibility of the Ξ^{h_2}_t step and so of the overall integrator. To enforce reversibility on Ξ^{h_2}_t we apply a reversibility-check transform R, defined so that R(Ξ^{h_2}_t)(q, p) = Ξ^{h_2}_t(q, p) whenever applying the time-reversed map to the output recovers (q, p), with (q, p) values for which the condition is not met causing evaluation of R(Ξ^{h_2}_t)(q, p) to raise an error. Similarly, if the iterative solves in the evaluation of either the forward Ξ^{h_2}_t or time-reversed Ξ^{h_2}_{−t} maps fail to converge, an error is also raised. The map R(Ξ^{h_2}_t) is then by construction reversible unless an error is raised, which can be suitably handled downstream by the hmc implementation. The approach of using an explicit reversibility check in mcmc methods using an iterative solver was first proposed by Zappa et al. (2018), with subsequent application within the context of constrained hmc in Graham and Storkey (2017) and Lelièvre et al. (2019).
With the reversibility check, the map R(Ξ^{h_2}_t) is guaranteed to be time-reversible if it does not raise an error. As Ξ^{h_1}_t is also time-reversible and both maps are symplectic, the composition Ξ^{h_1}_{t/2} ∘ R(Ξ^{h_2}_t) ∘ Ξ^{h_1}_{t/2} defines a time-reversible symplectic map on T*M whenever an error is not raised. In practice the equality conditions, indicating whether the iterative solver has converged and whether the reversibility check is satisfied, are both relaxed to an error norm being less than tolerances θ_c and θ_q respectively. Further details of the implementation of the integrator and its relation to previous work are given in Appendix H.
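A Newton-solved constrained sub-step with an explicit reversibility check can be sketched as follows for the Störmer-Verlet splitting (h_2 = ½ p^T M^{−1} p); the sphere constraint, tolerances and solver details below are illustrative and are not the paper's Algorithm 1.

```python
import numpy as np

def constrained_h2_step(q, p, t, constr, jacob_constr, metric_inv,
                        n_iter=50, tol=1e-10, rev_tol=1e-8):
    """One reversibility-checked constrained h2 sub-step: move along the
    unconstrained h2 flow q -> q + t M^-1 p, Newton-solving for Lagrange
    multipliers so the end position satisfies c(q) = 0, then project the
    momentum onto the co-tangent space; raise an error if the solver fails
    or the step is not reversible to within tolerance."""
    def inner(q, p):
        J_q = jacob_constr(q)
        lam = np.zeros(J_q.shape[0])
        for _ in range(n_iter):
            p_tilde = p - J_q.T @ lam
            q_bar = q + t * metric_inv @ p_tilde
            residual = constr(q_bar)
            if np.max(np.abs(residual)) < tol:
                J_b = jacob_constr(q_bar)
                gram = J_b @ metric_inv @ J_b.T
                p_bar = p_tilde - J_b.T @ np.linalg.solve(
                    gram, J_b @ metric_inv @ p_tilde)  # project onto T*_{q_bar} M
                return q_bar, p_bar
            lam = lam + np.linalg.solve(
                t * jacob_constr(q_bar) @ metric_inv @ J_q.T, residual)
        raise RuntimeError('Newton solver failed to converge')
    q_bar, p_bar = inner(q, p)
    # explicit reversibility check: the reversed step must recover (q, p)
    q_rev, p_rev = inner(q_bar, -p_bar)
    if max(np.max(np.abs(q_rev - q)), np.max(np.abs(-p_rev - p))) > rev_tol:
        raise RuntimeError('Reversibility check failed')
    return q_bar, p_bar

# Toy example: unit sphere constraint, identity metric, tangent momentum
constr = lambda q: np.array([q @ q - 1.0])
jacob = lambda q: 2.0 * q[None, :]
q0 = np.array([1.0, 0.0, 0.0])
p0 = np.array([0.0, 0.4, -0.3])  # lies in T*_{q0} M
q1, p1 = constrained_h2_step(q0, p0, 0.1, constr, jacob, np.eye(3))
```

If either the forward or the reversed solve fails, the error would be handled downstream by treating the proposal as rejected, mirroring the error handling described above.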

Choice of metric matrix representation M
We recommend choosing M = cov(q)^{−1} for q ∼ ρ, i.e. the precision matrix under the prior ρ; this requires that ρ has finite second-order central moments. While there is no fundamental requirement for M to match the prior precision matrix, and so a different choice of M could be used when cov(q)^{−1} is not defined, heuristically we find that the performance of the proposed methodology is improved when ρ is exactly or 'close to' Gaussian in all components, and this can usually be arranged by transforms of u, v_0 and w_{1:T} and corresponding reparametrisations of (μ̃, g_z), (ν̃, g_{x_0}) and (η, h). The constrained Hamiltonian dynamics in (6) with M = cov(q)^{−1} are equivalent to the dynamics under a linear transform q′ = L^T q with LL^T = M, for which cov(q′) = I_Q; normalising for the prior scales and correlations in this manner appears to improve the robustness and efficiency of the algorithm.

Overall algorithm
Pseudo-code corresponding to applying the reversible, constraint-preserving and symplectic integrator with step Ξ^{h_1}_{t/2} ∘ R(Ξ^{h_2}_t) ∘ Ξ^{h_1}_{t/2} within an hmc algorithm is summarised in Algorithm 1. Any errors raised when integrating the trajectory by iteratively applying the ConstrStep function are handled by terminating the trajectory and the hmc transition returning the initial state, i.e. a 'rejection'. Although for simplicity we have described in Algorithm 1 the use of a constraint-preserving integrator within a Metropolis-adjusted hmc algorithm with a static integration time It per chain iteration, in practice we use an hmc algorithm which dynamically adapts the integration time, in particular the dynamic multinomial hmc algorithm described in the appendix of Betancourt (2017), an extension of the No-U-Turn Sampler algorithm (Hoffman and Gelman, 2014). We also use the dual-averaging scheme of Hoffman and Gelman (2014) to adaptively tune the integrator step size t in a warm-up sampling phase, targeting an acceptance statistic of 0.8. A general-purpose implementation of the full algorithm is provided in the Python package Mici (Graham, 2019), which we use in the numerical experiments in Section 7.
We have found the suggested default values for the various algorithmic parameters work well in practice for a range of different models. This therefore results in an automated methodology, with a practitioner only needing to specify functions to evaluate the log prior density log dρ/dλ^Q and the constraint function c for the diffusion model in question; the required derivatives of these functions can be constructed algorithmically via automatic differentiation (Griewank and Walther, 2008).

Computational cost
We can apply Algorithm 1 to perform inference in partially-observed diffusion models by targeting the manifold-supported posterior distribution in the non-centred parametrisation of the time-discretised model described in Section 3.2. While this approach allows significant generality in the choice of the elements of the diffusion and observation model, it can be computationally expensive to run. To analyse the cost of Algorithm 1 in this setting we make the following simplifying assumption.
Assumption 5.1. The Newton iteration to solve (12) converges within J iterations, for some J > 0 that does not depend on S and T, for fixed t, θ_c and θ_q.
This assumption appears to hold in practice, and we provide numerical evidence to this effect in the numerical experiments in Section 7. We then have the following result, whose proof is given in Appendix A. The cost of Algorithm 1, when applied directly to the posterior distribution with density in (4), therefore scales linearly with the number of discrete time steps per observation S but cubically with the number of observation times T.

Exploiting Markovianity for scalability
While we have so far considered only sampling from the posterior distribution on latent variables (u, v 0 , v 1:ST , w 1:T ) given observations y 1:T , the constrained hmc approach we have described can be applied to sampling from any conditional distribution of the joint distribution on observations and latent variables under the generative model, of which the target posterior distribution is just one example.
One way to improve the scaling of the computational cost with respect to the number of observation times is therefore to restrict the information flow through the state sequence x 1:ST by conditioning on a set of intermediate states in the sequence. Due to the Markovian nature of the state dynamics, the state sequences x 0:s−1 and x s+1:ST are conditionally independent given the state x s and the parameters z for any s ∈ 1:ST.
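This decoupling can be illustrated numerically with a toy scalar state sequence (the forward map below is hypothetical, not one of the paper's models): regenerating the tail of the sequence from an intermediate state and the later noise variables alone reproduces the jointly generated path exactly.

```python
import numpy as np

# Markov property demo: given x_s, the states after step s depend on the
# earlier noise only through x_s, so x_{s+1:ST} can be regenerated from x_s
# and the later noise variables alone. step() is a toy forward operator.

def step(x, v):
    return 0.9 * x + 0.1 * np.sin(x) + 0.5 * v

def simulate(x0, noise):
    xs, x = [], x0
    for v in noise:
        x = step(x, v)
        xs.append(x)
    return np.array(xs)

rng = np.random.default_rng(0)
noise = rng.standard_normal(10)
full = simulate(0.3, noise)
s = 4
# regenerate the tail from the intermediate state and the later noise only
tail = simulate(full[s - 1], noise[s:])
# tail reproduces full[s:] exactly: conditioning on x_s decouples the segments
```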

Model 3 Generative model conditioning on intermediate states (x_{St_b})_{b=1:B−1}
As a consequence, under the non-centred parametrisation of the generative model in Model 2, if we condition on the intermediate state x_{St} at the t-th observation time then the observation sequences y_{1:t} and y_{t+1:T} can be generated independently, from (u, v_{0:St}, w_{1:t}) and (u, v_{St+1:ST}, w_{t+1:T}) respectively.
We can extend this idea to conditioning on multiple intermediate states in the sequence. If the full state is conditioned on at B − 1 observation time indices t_{1:B−1} ⊆ 1:T, i.e. at the states x_{St_b} for b ∈ 1:B−1, then the observation subsequence y_{t_{b−1}+1:t_b} and conditioned state x_{St_b} depend only on the latent variables (u, v_{St_{b−1}+1:St_b}, w_{t_{b−1}+1:t_b}) for each b ∈ 1:B−1 (with t_0 = 0), and the final observation subsequence y_{t_{B−1}+1:T} depends only on the latent variables (u, v_{St_{B−1}+1:ST}, w_{t_{B−1}+1:T}). Due to these conditional independencies introduced when conditioning on the values of (x_{St_b})_{b=1:B−1}, we can 'split' the generation of the state sequence into B independent calls to a function which, given a conditioned state x_{St_{b−1}}, generates the subsequence of states for step indices St_{b−1}+1:St_b and outputs the observations y_{t_{b−1}+1:t_b} and the final state x_{St_b} of the subsequence (or just the observations y_{t_{B−1}+1:T} for the final subsequence). For noiseless observations, y_{t_b} is completely determined by x_{St_b}, and so only y_{t_{b−1}+1:t_b−1} and x_{St_b} need be returned for the non-final subsequences. The resulting conditioned generative model is summarised in Model 3.
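The B-way split described above can be sketched as follows, with a toy scalar forward operator and an identity observation map standing in for the model's forward operator and observation function; the block structure (each block consumes its local noise, emits its observations and hands its final state to the next block) is the point, not the toy dynamics.

```python
import numpy as np

# Sketch of the B-way split: each block generates its own observation
# subsequence from its local noise variables and the conditioned start state.

def step(x, v):
    return x + 0.1 * (1.0 - x) + 0.2 * v  # toy forward operator

def generate_block(x_start, vs, S):
    # advance S steps per observation time, emit a (noiseless) observation
    x, ys = x_start, []
    for r, v in enumerate(vs):
        x = step(x, v)
        if (r + 1) % S == 0:
            ys.append(x)  # h(x) = x for this toy model
    return np.array(ys), x

S, T = 3, 6
rng = np.random.default_rng(1)
vs = rng.standard_normal(S * T)
ys_full, _ = generate_block(0.0, vs, S)
# split into B = 2 blocks at observation time index t_1 = 3
ys_1, x_mid = generate_block(0.0, vs[: S * 3], S)
ys_2, _ = generate_block(x_mid, vs[S * 3 :], S)
# concatenating the block outputs reproduces the unsplit generation exactly
```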
Using gȳ from Model 3 we can then define partial constraint functions for each of the B subsequences, and from these respectively partitioned and full constraint functions. Each partial constraint function depends on the global latent variables u but only on the subsequence-local components of v and w, so the Jacobian of the full constraint function has a block structure: a dense column-block corresponding to u alongside a block-diagonal remainder. The Gram matrix can then be decomposed as a rank U correction of a block-diagonal matrix D([u; v; w]).
Using the matrix determinant lemma, the log-determinant of the Gram matrix can then be computed from the determinant of the block-diagonal matrix D([u; v; w]) and that of a small U × U capacitance matrix. Similarly, the Woodbury matrix identity allows linear systems in the Gram matrix to be solved, for a vector r ∈ R^C and q = [u; v; w], at a cost dominated by solves against D(q). By applying a sequence of constrained hmc Markov kernels, each conditioning on intermediate states (x_{St_b})_{b=1:B−1} at a different set of observation time indices t_{1:B−1} as well as on the observations y_{1:T}, we can construct an mcmc method which asymptotically samples from the target posterior distribution at a substantially reduced computational cost compared to the case of conditioning only on the observations y_{1:T}. To analyse the computational cost of applying the constrained hmc implementation in Algorithm 1 to the conditioned generative model, we assume the following.
Assumption 5.2. T = BR and t_b = bR for all b ∈ 1:B−1, i.e. the observations are split into B equally sized subsequences of R observation times.
Assumption 5.3. The Newton iteration to solve (12) converges within J iterations for some J > 0 that does not depend on R, S and T, for fixed t, θ_c and θ_q.
In practice we will need to alternate with conditioning on a different set of observation times to allow the Markov chain to be ergodic, e.g. t_b = (2b − 1)R/2 for all b ∈ 1:B, in which case the observation times are split into B − 1 subsequences of R observation times and two subsequences of R/2 observation times. This alternative splitting results in only a minor difference in operation cost compared to the equispaced partition, hence we consider only the former case in our analysis. Assumption 5.3 is motivated by our observation that the average number of Newton iterations needed for convergence appears to be independent of R, S and T. We then have the following result, a proof of which is given in Appendix B: under the conditioned generative model the cost of each constrained integrator step is O(R^2 ST). If the number of observations per subsequence R is kept fixed, the computational cost of each constrained integrator step therefore scales linearly in both the number of time steps per observation S and the number of observation times T.
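The determinant-lemma and Woodbury computations underlying this cost reduction can be sketched numerically as follows, with D taken to be simply diagonal (rather than block-diagonal) for brevity; the key point is that both the log-determinant and linear solves for G = D + V Vᵀ cost only O(C U²) rather than O(C³).

```python
import numpy as np

# Sketch: log-determinant and linear solves for a Gram matrix of the form
# G = D + V V^T, with D diagonal (standing in for block-diagonal) and V of
# low rank U, via the matrix determinant lemma and the Woodbury identity.

rng = np.random.default_rng(2)
C, U = 50, 3
d = 1.0 + rng.random(C)          # diagonal of D (positive)
V = rng.standard_normal((C, U))  # rank-U correction factor
G = np.diag(d) + V @ V.T

# matrix determinant lemma: log|G| = log|D| + log|I_U + V^T D^{-1} V|
capacitance = np.eye(U) + V.T @ (V / d[:, None])
logdet_G = np.sum(np.log(d)) + np.linalg.slogdet(capacitance)[1]

# Woodbury identity:
# G^{-1} r = D^{-1} r - D^{-1} V (I_U + V^T D^{-1} V)^{-1} V^T D^{-1} r
r = rng.standard_normal(C)
Dinv_r = r / d
solve = Dinv_r - (V / d[:, None]) @ np.linalg.solve(capacitance, V.T @ Dinv_r)
```

Only the U × U capacitance matrix is ever factorised, so the per-step linear algebra never touches a dense C × C system.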

Related work
Our approach follows the general framework described in Graham and Storkey (2017) for performing inference in generative models where the simulated observations can be computed as a differentiable function of a random vector with a known prior distribution. As in this work, Graham and Storkey (2017) suggest using a constrained hmc algorithm to target the manifold-supported posterior distribution arising in such a setting, and consider a diffusion model with high-frequency noiseless observations of the full state as an example. In this setting, with S = 1, the constraint Jacobian was observed to have a structure allowing an O(T^2) cost implementation of the operations required for each constrained integrator step.
Here we make several important extensions to the framework of Graham and Storkey (2017): the scheme proposed in Section 5.1 allows efficient O(ST) cost constrained integrator steps irrespective of the observation regime (high- or low-frequency, partial or full, with or without noise), and the constrained integrator based on a Gaussian splitting proposed in Section 4.3 gives improved mixing performance as the time-discretisation is refined (S increased). Further, by integrating the constrained integrator into an adaptive hmc algorithm (Hoffman and Gelman, 2014), we eliminate the need to tune the integrator step size and the number of integrator steps per trajectory, giving a more automated inference procedure.
The non-centred parametrisation of the diffusion generative model described in Section 3.1 has similarities to the innovation scheme of Chib et al. (2004), and its later extension in Golightly and Wilkinson (2008), which recognises, for the specific case of an Euler-Maruyama discretisation, that the state sequence x_{1:ST} can be computed as a function of the model parameters z, the initial state x_0 and the increments of the driving Brownian motion process. This relationship can be inverted to compute the increments given x_{0:ST} and z. By performing a Metropolis-within-Gibbs update of z conditioning on the increments and observations, the degeneracy in the conditional distribution of the drift coefficient parameters that arises when instead conditioning on x_{1:ST} as S → ∞ is avoided, thus producing an algorithm respecting the Roberts-Stramer critique. Our approach generalises this idea beyond the Euler-Maruyama case by allowing a generic forward operator f_δ, and jointly updates all latent variables under this reparametrisation rather than using it only to update the parameters.
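The forward map and its inversion can be illustrated with a toy scalar Euler-Maruyama discretisation (the drift and diffusion coefficient below are illustrative only, not those of any model in the paper): the path is a deterministic function of (z, x_0, increments), and the increments can be recovered exactly from the path.

```python
import numpy as np

# Sketch of the innovation-scheme relationship for a scalar Euler-Maruyama
# discretisation with toy mean-reverting drift and state-dependent diffusion.

def drift(x, z):
    return z[0] * (z[1] - x)

def diff_coef(x, z):
    return z[2] * np.sqrt(1 + x ** 2)

def forward(x0, z, vs, delta):
    # path as a deterministic function of parameters, x0 and noise increments
    xs, x = [], x0
    for v in vs:
        x = x + drift(x, z) * delta + diff_coef(x, z) * np.sqrt(delta) * v
        xs.append(x)
    return np.array(xs)

def invert(x0, z, xs, delta):
    # recover the standard-normal increments given the path and parameters
    prev = np.concatenate([[x0], xs[:-1]])
    return (xs - prev - drift(prev, z) * delta) / (
        diff_coef(prev, z) * np.sqrt(delta))

z = np.array([1.5, 0.0, 0.3])
rng = np.random.default_rng(3)
vs = rng.standard_normal(20)
xs = forward(0.1, z, vs, 0.05)
vs_rec = invert(0.1, z, xs, 0.05)  # recovers vs exactly
```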
The conditioning scheme proposed in Section 5.1 is similar in spirit to blocking schemes proposed previously in mcmc methods for inference in partially-observed time series models, see e.g. Shephard and Pitt (1997); Golightly and Wilkinson (2008); Mider et al. (2020); however, the implementation and motivation of the approach here both differ. In the blocking schemes, conditioning on intermediate states introduces conditional independencies allowing updates to blocks of the latent path to be proposed given fixed parameters in a Metropolis-within-Gibbs type update, with a separate update to the parameters. Here we jointly update the parameters and latent path, and use the conditioning to induce structure in the constraint Jacobian which can be exploited to reduce the cost of the constrained integrator.
Hypoelliptic diffusions have a rank-deficient diffusion matrix B(x, z)B(x, z)^T, but still have transition kernels κ_τ with smooth densities with respect to the Lebesgue measure, due to the propagation of noise to all state components via the drift function. Prior work on the calibration of such models has often adopted a maximum likelihood approach in the setting of high-frequency observations, see e.g. Ditlevsen and Samson (2019) and the references therein. The singularity of the Wiener noise increment covariance matrix when discretising using an Euler-Maruyama scheme can be avoided via the use of a higher-order discretisation scheme: Ditlevsen and Samson (2019) use a strong order 1.5 Taylor scheme to obtain consistency in the estimation of parameters in both the drift and diffusion coefficient functions.
In terms of our criteria in Remark 1.1, to our knowledge there is currently no alternative algorithm that satisfies them all for noiselessly observed hypoelliptic systems. The guided proposals framework (van der Meulen and Schauer, 2018; Bierkens et al., 2020; van der Meulen and Schauer, 2020) comes close, as it allows for Bayesian inference in both elliptic and hypoelliptic systems, fully or partially observed, with noise or with noiseless observations and a linear observation function h(x) = Lx, and respects the Roberts-Stramer critique. The approach does not however allow for non-linear noiseless observations, and the methodology requires choosing a tractable auxiliary process used to construct the proposed updates to the latent path given observations and parameters, with the original and auxiliary processes needing to satisfy matching conditions on their drift and diffusion coefficients; this can be non-trivial (for example, when the diffusion coefficient is state dependent), hindering the automation of the methodology. In contrast, our method can be applied without change to both hypoelliptic and elliptic diffusions.
A long line of previous work has considered mcmc methods for performing inference in distributions on non-Euclidean spaces, particularly prominent being the influential paper of Girolami and Calderhead (2011), where the latent space is equipped with a user-defined Riemannian metric which facilitates local rescaling of the posterior distribution across different directions. Related algorithms have also been proposed for finite-dimensional projections of distributions with densities with respect to Gaussian measures on Hilbert spaces (Beskos, 2014; Beskos et al., 2017).
In our case the manifold structure arises directly from the observational constraints placed on the latent space of a generative model, and the manifold is extrinsically defined by its embedding in the ambient latent space. Rather than the non-trivial task of selecting a positive-definite matrix-valued function to define a Riemannian metric on the latent space, our method only requires the user to choose a matrix representing the fixed metric on the ambient space. As discussed in Section 4.4, we find the prior precision matrix to be a good default in practice.

Numerical examples
To demonstrate the flexibility and efficiency of our proposed approach we now present the results of numerical experiments in a range of different settings: hypoelliptic and elliptic systems, simulated and real data, noiseless and noisy observations. In all cases we use the same methodology, as described in the preceding sections, for performing inference, and where possible we compare to alternative approaches.

FitzHugh-Nagumo model with noiseless observations
As a first example we consider a stochastic variant of the FitzHugh-Nagumo model (FitzHugh, 1961; Nagumo et al., 1962), a simplified description of the dynamics of action potential generation within a neuronal axon. Following Ditlevsen and Samson (2019) we formulate the model as an X = 2 dimensional hypoelliptic diffusion process, where the components of the Z = 4 dimensional parameter vector are z = [σ; ε; γ; β] and the Wiener process b has dimension B = 1. We initially assume the Y = 1 dimensional observations y_{1:T} correspond to noiseless observation of the first state component, i.e. h(x) = x_1, with an inter-observation time interval ∆ = 1/5. Further details of the discretisation and priors used are given in Appendix J.
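A common formulation of the hypoelliptic stochastic FitzHugh-Nagumo system, following Ditlevsen and Samson (2019) up to notational conventions (the paper's exact parameterisation may differ), can be sketched as below; the rank deficiency of B Bᵀ, with Wiener noise entering only the second state component, is what makes the model hypoelliptic.

```python
import numpy as np

# Hedged sketch of drift and diffusion coefficient functions for a stochastic
# FitzHugh-Nagumo model with z = [sigma, epsilon, gamma, beta]; X = 2 state
# dimensions, a B = 1 dimensional Wiener process.

def fhn_drift(x, z):
    sigma, eps, gamma, beta = z
    return np.array([
        (x[0] - x[0] ** 3 - x[1]) / eps,  # fast voltage-like component
        gamma * x[0] - x[1] + beta,       # slow recovery component
    ])

def fhn_diffusion(x, z):
    sigma = z[0]
    # noise drives only the second component: B B^T is rank deficient,
    # so the diffusion is hypoelliptic
    return np.array([[0.0], [sigma]])

z = np.array([0.3, 0.1, 1.5, 0.8])
B = fhn_diffusion(np.zeros(2), z)
BBt = B @ B.T  # rank-1 diffusion matrix
```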
To measure sampling efficiency in the experiments we use two complementary metrics: the average computation time per constrained integrator step τ_step, and the estimated computation time per effective sample τ_eff, i.e. the total chain computation time divided by the estimated effective sample size (ess) of the chain for each parameter. Proposition 5.2 relates closely to τ_step, and so by estimating how this quantity varies with R, S and T we can empirically test whether the proposed scaling holds in practice. While our analysis only considers the computational cost of the constrained integrator, ultimately we are interested in the overall sampling efficiency, which also depends on the mixing performance of the chains; by measuring τ_eff we therefore also gain insight into how our approach performs on this metric. To empirically verify that Assumption 5.3 holds in practice, we additionally record the average number of Newton iterations per constrained integrator step (averaged over both forward and time-reversed Ξ_{h_2} calls), which we denote n̄.
The ess estimates were computed using the Python package ArviZ (Kumar et al., 2019), with the rank-normalisation approach proposed by Vehtari et al. (2019). The chain computation times were measured by counting the calls to the key expensive operations in Algorithm 1 and separately timing the execution of these operations; details are given in Appendix M. The Python package JAX (Bradbury et al., 2018) was used for the automatic computation of the derivatives of the model functions, and all plots were produced using the Python package Matplotlib (Hunter, 2007).
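The efficiency metric can be sketched as below; a simple Geyer-style truncated-autocorrelation ess estimator stands in for the rank-normalised ArviZ estimator used in the experiments, and the function names are illustrative rather than the paper's.

```python
import numpy as np

# Sketch of the efficiency metric: estimated time per effective sample,
# tau_eff = total chain time / ESS. A simple initial-positive-sequence ESS
# estimator stands in for ArviZ's rank-normalised version.

def ess(x):
    n = x.size
    xc = x - x.mean()
    # empirical autocorrelations at lags 0..n-1 (acf[0] == 1)
    acf = np.correlate(xc, xc, mode="full")[n - 1:] / (
        np.arange(n, 0, -1) * xc.var())
    # accumulate autocorrelations until they first go non-positive
    tau = 1.0
    for r in acf[1:]:
        if r <= 0:
            break
        tau += 2 * r
    return n / tau

def time_per_effective_sample(total_time, samples):
    return total_time / ess(samples)

rng = np.random.default_rng(4)
iid = rng.standard_normal(5000)  # independent draws: ESS close to n
```

For strongly autocorrelated chains the estimated ess drops well below the number of iterations, inflating the time per effective sample accordingly.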
For all experiments we use chains which alternate between Markov transitions conditioning on the states at observation time indices {t_b = bR : b ∈ 1:B−1} and {t_b = (2b − 1)R/2 : b ∈ 1:B}. For all R, S and T values and splittings tested we ran three sets of four chains of 1250 iterations each, with independent initialisations (details of the initialisations are given in Appendix L) and with the first 250 iterations of each set of four chains an adaptive warm-up phase used to tune the integrator step size t; the samples from these warm-up iterations were omitted from the ess estimates but included in the computation time estimates. For all sets of chains the split-R̂ convergence diagnostic values, computed from the (non-warm-up iterations of the) four chains for all parameters using rank-normalisation and folding, were less than 1.01, as recommended in Vehtari et al. (2019). A dynamic hmc implementation (Betancourt, 2017) was used to set the number of integrator steps per trajectory in each transition. A summary of all the algorithmic parameter values used in the numerical experiments is given in Appendix N.
The top panels in Fig. 1 show how the number of Newton iterations required to solve (12) in each of the forward and reverse Ξ_{h_2} steps, averaged across the chains, varies with R, S and T for the two different Hamiltonian splittings. We see that for a given splitting the number of Newton iterations is close to constant in all cases, providing empirical support for Assumptions 5.1 and 5.3. The bottom panels in Fig. 1 show how the average time per integrator step τ_step varies with R, S and T for both splittings. The log-domain least-squares fits show a very close to linear scaling of τ_step with both S and T, verifying these aspects of the O(R^2 ST) scaling claimed in Proposition 5.2. The growth of τ_step with R over the range considered here is sub-quadratic (the dotted line shows a quadratic trend for reference), although there is visible acceleration in the growth. Inspection of a breakdown of the total computation time spent on individual operations revealed that for smaller R the O(RST) computation of the constraint Jacobian dominates, with the O(R^2 ST) linear algebra operations only becoming significant for larger R. Fig. 2 shows how the estimated computation time per effective sample τ_eff varies with each of R, S and T, for each of the four model parameters and for each of the two Hamiltonian splittings. Considering first the results for varying numbers of observations per subsequence R, we see that the efficiency is maximised (τ_eff minimised) for both splittings at an intermediate value R ≈ 5, with a small drop-off in efficiency for R = 2 and a larger decrease in efficiency as R is increased beyond 5. This reflects the competing effects of the reduced cost of each constrained integrator step as R is made smaller versus the reduced chain mixing in each transition for smaller R, due to the extra states being (artificially) conditioned on.
Importantly, we see however that the latter effect is less significant (in this model at least), meaning that performance is still close to optimal for R = 2; this suggests performance will not be too adversely affected if too small an R value is chosen. Turning now to the plots of τ_eff versus the number of discrete time steps per inter-observation interval S, there is a clear difference in the scaling of τ_eff with S between the two Hamiltonian splittings: the Gaussian splitting gives an only slightly above-linear scaling across all four parameters, with exponents in the range 1.06-1.10, compared to 1.14-1.44 for the Störmer-Verlet splitting. Inspection of the integrator step sizes t (not shown), which were adaptively tuned in a warm-up phase to hold the average acceptance statistic for the chains fixed, reveals that for the Gaussian splitting the step size lies in the range t = 0.29 ± 0.01 for all S, while for the Störmer-Verlet splitting it decreases from t = 0.20 for S = 25 to t = 0.10 for S = 400. This is consistent with results suggesting that the step size of the Störmer-Verlet integrator needs to scale with Q^{-1/4} to maintain a constant acceptance probability for a static-integration-time hmc algorithm in the unconstrained setting (Neal, 2011; Beskos et al., 2013b), compared to a dimension-free dependence of the acceptance probability on t for integrators using the Gaussian splitting in appropriate targets (Beskos et al., 2011). While we have emphasised here the superior performance of the Gaussian splitting, we note that the growth of τ_eff with S for both methods is very favourable, and shows our approach is able to remain efficient for fine time-discretisations of the continuous time model.
Finally we consider the bottom row of plots in Fig. 2, showing how τ_eff varies with the number of observation times T for each model parameter. In this case both splittings give very similar scalings, with a close to linear growth in τ_eff with T for all four parameters. The (infinite-dimensional) target posterior being approximated differs here for each T value (unlike the case for varying R and S), in particular becoming more concentrated as T increases. The increase in τ_eff with T seems to be largely attributable to the increase in the computational cost of each constrained integrator step with T, with the mixing performance of the chains appearing largely independent of T. This suggests that the constrained hmc algorithm is able to efficiently explore posterior distributions with varying geometries. While here the concentration of the posterior is due to an increasing number of observations, in the following section we will see that our approach is also robust to varying informativeness of the individual observations.

FitzHugh-Nagumo model with additive observation noise
As a second example we consider the same hypoelliptic diffusion model as in the preceding section, but now with observations subject to additive Gaussian noise of standard deviation σ_y, i.e. h(x, z, w) = x_1 + σ_y w and η = N(0, 1). The presence of additive observation noise means that the posterior on (u, v_{0:ST}) given the observations y_{1:T} has a tractable Lebesgue density. We therefore compare our constrained hmc approach to running a standard (unconstrained) hmc algorithm targeting the posterior on (u, v_{0:ST}), with details of the posterior density and hmc algorithm used given in Appendix I.
As a further baseline we also compare to the approach of van der Meulen and Schauer (2020), which uses a Metropolis-within-Gibbs scheme alternating guided proposal Metropolis-Hastings updates to the latent path x 0:ST given parameters z, with random-walk Metropolis (rwm) updates to the parameters z given the path x 0:ST . An application of this approach to the FitzHugh-Nagumo model considered here is described in van der Meulen et al. (2020), and we use the Julia code accompanying that article to run the experiments.
We use the same priors and time-discretisation as in the previous section, and fix S = 40 and T = 100. Simulated observation sequences y_{1:T} were generated for each of the observation noise variances σ_y^2 ∈ {10^-4, 10^-3, 10^-2, 10^-1}. In all cases y_{1:T} was generated using the same parameters and initial state as in the previous section, sharing the same values for v_{1:ST} and w_{1:T} (sampled from their standard normal priors). Chains targeting the resulting posteriors were run for each σ_y^2 value and for each of the three mcmc methods being considered.
For our constrained hmc algorithm we used R = 5 and ran chains using a constrained integrator based on the Störmer-Verlet splitting; results using the Gaussian splitting showed a similar pattern of performance and are hence omitted to avoid duplication. For the standard hmc algorithm a diagonal metric matrix representation M was adaptively tuned in the warm-up iterations, with this found to uniformly outperform a fixed identity matrix for all σ_y values tested here. For both the standard and constrained hmc algorithms we ran four chains of 3000 iterations, with the first 500 iterations an adaptive warm-up phase used to tune the integrator step size t (and M in the standard hmc case). For the guided proposals / rwm case we ran four chains of 3 × 10^5 iterations, with the first 5 × 10^4 iterations an adaptive warm-up phase in which the persistence parameter of the guided proposals update to x_{1:ST} | z and the step sizes of the random-walk proposals for the update to z | x_{1:ST} were adapted as described in van der Meulen and Schauer (2020).
Estimated computation time per effective sample τ_eff values were calculated for the chains of each of the three mcmc methods and each of the observation noise variance values σ_y^2. The ess estimates were calculated as described in the preceding section; however, the true total wall-clock run times were used for the chain computation times here, due to the difficulty in ensuring a consistent treatment of different mcmc algorithms in the approach used in the previous section. To ensure as fair a comparison as possible, all chains were run on the same computer and limited to a single processor core, to avoid differences due to varying exploitation of parallel computation in the implementations. Fig. 3 summarises the results. We first note that despite the large number of iterations used for the guided proposals / rwm chains (3 × 10^5), the per-parameter split-R̂ diagnostics (Vehtari et al., 2019) indicated non-convergence in nearly all cases, with only the chains for σ_y = 10^-2 appearing to be approaching convergence, with R̂ values in the range [1.01, 1.05] compared to R̂ values in the range [1.39, 1.87] for σ_y = 10^-0.5. The poor convergence here seems to be at least in part due to the difficulty in finding globally appropriate values of the rwm step sizes, with the step sizes still changing significantly in the final iterations of the warm-up phase and the final adapted values differing significantly across chains for the same σ_y.
Given the poor convergence, the estimated ess values must be treated with some caution; however, even for the σ_y = 10^-2 case, where the chains appeared to be closest to convergence, the estimated τ_eff values for the guided proposals / rwm chains are between 30 and 80 times larger than the corresponding values for the constrained hmc chains. As Julia implementations of numerical algorithms generally significantly outperform Python equivalents (Bezanson et al., 2017), the superior sampling performance of the (Python) constrained hmc implementation compared to the (Julia) guided proposals / rwm implementation seems unlikely to be due solely to differences in the efficiency of the implementations, but rather reflects the significantly improved mixing of the joint gradient-informed updates to the latent variables in the constrained hmc algorithm, compared to the non-gradient-informed Metropolis-within-Gibbs updates of the guided proposals / rwm algorithm.
Comparing now the results for the standard and constrained hmc algorithms, we see that while both algorithms perform similarly for larger σ_y values (i.e. less informative observations), the constrained hmc algorithm provides significantly better sampling efficiency for smaller σ_y values. Inspecting the integrator step size t set at the end of the adaptive warm-up phase for each of the σ_y values reveals that, while for constrained hmc all step sizes fall in the range 0.17-0.18 and so seem invariant to σ_y, for the standard hmc chains t ranges from 5.1 × 10^-4 for σ_y = 10^-2 to 9.1 × 10^-3 for σ_y = 10^-1, resulting in a need to take more integrator steps per transition to make moves of the same distance in the latent space, and hence a decreasing sampling efficiency as σ_y becomes smaller.
The results for σ_y = 10^-0.5 break the trend of increasing σ_y leading to increased efficiency for the standard hmc chains, with a significant increase in τ_eff compared to σ_y = 10^-1. This seems to be due to a roughly halved integrator step size set in the adaptive warm-up phase, t = 5.2 × 10^-3 for σ_y = 10^-0.5, which, combined with the more diffuse posterior for the larger σ_y value, led to a significant increase in the average number of integrator steps per transition and so in the computational cost per effective sample. A potential explanation for the decrease in the adapted step size is that the more diffuse posterior extends to regions where the posterior density has higher curvature, necessitating a smaller step size to control the Hamiltonian error. In contrast, for the constrained hmc chains the Hamiltonian error is controlled with a close to constant step size for all σ_y; there is however a drop in efficiency as σ_y becomes larger, which seems to be due to the more diffuse posterior requiring a greater number of integrator steps to explore, and so a higher computational cost per effective sample on average.

Susceptible-infected-recovered model with additive observation noise
As a final example we perform inference in an epidemiological compartmental model given real observations of the time course of the number of infected patients in an influenza outbreak at a boarding school (Anonymous, 1978). Specifically, we consider a diffusion approximation of a susceptible-infected-recovered (sir) model (see, e.g., the derivation in Fuchs (2013, §5.1.3)), with a time-varying contact rate parameter itself modelled as a diffusion process as proposed in Ryder et al. (2018). This results in a three-dimensional elliptic sde system in which τ is the time in days, s and i are the numbers of susceptible and infected individuals respectively, c is the contact rate, N the population size and γ the recovery rate parameter. The sde for c arises from log c following an Ornstein-Uhlenbeck process with reversion rate α, long-term mean β and instantaneous volatility σ. As each of s, i and c represents a positive-valued quantity, the diffusion state is defined to be x = [log s; log i; log c] ∈ R^3, with the drift a and diffusion coefficient B functions derived from the original sdes via Itô's lemma. By computing the time-discretisation in this log-transformed space, the positivity of s, i and c is enforced, and the numerical issues arising when evaluating the square-root terms in the diffusion coefficient for negative s, i or c are avoided. The observed data y_{1:T} correspond to measurements of the number of infected individuals i = exp(x_2) at daily intervals, i.e. ∆ = 1, over a period of T = 14 days, with the observations assumed to be subject to additive noise of unknown standard deviation σ_y, i.e. y_t = exp(x_2(t)) + σ_y w_t. The Z = 5 dimensional parameter vector is then z = [γ; α; β; σ; σ_y]. Details of the priors and discretisation used are given in Appendix K.
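The log-transformation step via Itô's lemma can be sketched generically as below; the helper and its toy inputs are illustrative, not the paper's sir coefficients, and the geometric-Brownian-motion check at the end verifies the transformation against the familiar closed form d log u = (μ − σ²/2) dτ + σ db.

```python
import numpy as np

# Hedged sketch of the log-transformation step: given the drift a and
# diffusion coefficient B for a positive-valued state u, Ito's lemma gives
# the coefficients of x = log(u) componentwise:
#   dx_i = (a_i/u_i - (B B^T)_{ii} / (2 u_i^2)) dt + (B_{ij}/u_i) db_j

def log_transformed_coefficients(u, a_u, B_u):
    BBt_diag = np.sum(B_u ** 2, axis=1)  # diagonal of B B^T
    a_x = a_u / u - BBt_diag / (2 * u ** 2)
    B_x = B_u / u[:, None]
    return a_x, B_x

# geometric Brownian motion check: du = mu*u dt + sigma*u db implies
# d log(u) = (mu - sigma^2/2) dt + sigma db
mu, sigma, u0 = 0.2, 0.5, 3.0
a_x, B_x = log_transformed_coefficients(
    np.array([u0]), np.array([mu * u0]), np.array([[sigma * u0]]))
```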
We compare the performance of our proposed constrained hmc approach to a standard hmc algorithm, the noise in the observations meaning that the posterior on u and v_{0:ST} admits a Lebesgue density. For each algorithm we ran four chains of 3000 iterations, with the first 500 iterations an adaptive warm-up phase. For our constrained hmc algorithm, due to the small number of, and high correlations between, the observations, we do not introduce any artificial conditioning on intermediate states, i.e. R = T = 14.
Chains using constrained integrators based on the Störmer-Verlet and Gaussian splittings show very similar performance here, so we report only the Störmer-Verlet case to avoid duplication. For the standard hmc algorithm a diagonal metric matrix representation M was adaptively tuned in the warm-up iterations. The estimated computation time per effective sample τ_eff values were calculated using the total wall-clock run times for the chains.
The results are summarised in Fig. 4. The left plot shows the estimated per-parameter τ_eff values for each of the two mcmc methods: we see that the constrained hmc algorithm gives significantly improved sampling efficiency over standard hmc here. More importantly, however, it appears that the standard hmc algorithm is in fact failing to explore the full posterior. The grid of six plots on the right of Fig. 4 shows the estimated posterior marginals for each parameter computed from either the constrained or standard hmc chain samples. There is a clear discrepancy in the estimated posterior marginal of σ_y between the two methods, with the standard hmc chains having many fewer samples at smaller σ_y values than the constrained hmc chains. Fig. 6 shows the corresponding estimated pairwise marginals with σ_y on a log-scale, making the poorer coverage of small σ_y values by the standard hmc chains more apparent.
While the per-parameter R̂ diagnostics for the standard hmc chains are all below 1.01, some hint of the underlying issue is given by the high number of iterations in which the integration of the Hamiltonian dynamics diverged for the standard hmc chains: roughly 4% of the non-warm-up iterations for each chain. Such divergences are indicative of regions of high curvature in the posterior distribution that cause the numerical simulation of the Hamiltonian trajectories to become unstable, and in some cases may be ameliorated by use of a smaller integrator step size t (Betancourt, 2017).
Here, specifically, the adaptive tuning of the step size t in the warm-up phase has led to a step size which is too large for exploring the regions of the posterior in which σ_y is small. Although by setting a higher target acceptance statistic for the step size adaptation algorithm, or by hand-tuning t to a smaller value, we could potentially fix this issue, this would come at the cost of an associated decrease in sampling efficiency, leading to even poorer performance relative to the constrained hmc chains. As seen in the results for the FitzHugh-Nagumo model in Section 7.2, if σ_y is fixed, the integrator step size t for the standard hmc algorithm needs to be decreased as σ_y is decreased in order to control the acceptance rate, resulting in a higher computational cost per effective sample; Fig. 5 illustrates this directly for the sir model. When σ_y is instead unknown, as here, standard hmc needs to use a step size appropriate for the smallest σ_y value 'typical' under the posterior, which, if σ_y is poorly informed by the data (as is the case for this model), can require very small integrator step sizes t. In contrast, as the constrained hmc algorithm is able to use an integrator step size t which is independent of σ_y, the sampling efficiency of the chains is not limited by the need to use a small step size to explore regions of the posterior in which σ_y is small.

Conclusions & further directions
We have introduced a methodology for calibrating a wide class of diffusion-driven models. Our approach is based on recasting the inferential problem as one of exploring a posterior supported on a manifold, the structure of the latter determined by the observational constraints on the generative model. Once this viewpoint is adopted, available techniques from the literature on constrained hmc can be called upon to allow for effective traversing of the high-dimensional latent space. We have further shown that the Markovian structure of the model can be exploited to design a methodology with computational complexity that scales linearly with both the resolution of the time-discretisation and the number of observation times.
A critical argument put forward via the methodology developed in this work is that practitioners working with sde models are now provided with the option of a single, highly automated algorithmic framework for Bayesian calibration of their models. This algorithmic framework employs efficient Hamiltonian dynamics and satisfies all of the criteria listed in Remark 1.1.
When exploring distributions with rapidly varying curvature, standard hmc methods with a fixed step size can either require an excessively small step size (as in the FitzHugh-Nagumo model with noise in Section 7.2), or become unstable and diverge if the step size is not small enough in regions of high posterior curvature on the latent space (as in our sir example in Section 7.3, where variations in the scale parameter σ_y have a strong effect on the curvature). In both cases, particularly strong effects can render standard hmc non-operational (as in the sir case). Although the methodology of Girolami and Calderhead (2011) can in principle be helpful in such contexts, this class of algorithms is intrinsically constructed to perform well in the centre of the target distribution, as it involves an expectation over the data, rather than the given data, in the specification of the employed Riemannian metric. Constrained hmc dynamics can provide a more appropriate approach for dealing with rapidly varying curvature across the whole of the support of the target distribution. When combined with efficient discretisations of the dynamics, as for the class of diffusion models we have studied in this work, they can provide statistically efficient methods.
The viewpoint adopted in this paper is potentially relevant to a larger class of stochastic models for time series (e.g. random ordinary differential equations), as well as other Markovian model classes (e.g. Markov networks). Some of the authors are currently involved in applying such mcmc methods to Bayesian inverse problems; manifold structures naturally appear in the low noise regime (Beskos et al., 2018). In general, we believe that the approach presented in this paper warrants further investigation, with a corresponding study of critical algorithmic aspects, e.g., computational complexity and mixing properties.

Acknowledgements
MMG acknowledges support from the Lloyd's Register Foundation programme on data-centric engineering at the Alan Turing Institute. AHT and MMG acknowledge support from the Singapore Ministry of Education Tier 2 (MOE2016-T2-2-135) and a Young Investigator Award Grant (NUSYIA FY16 P16; R-155-000-180-133). AB acknowledges support from the Leverhulme Trust Prize. This research made use of the Rocket High Performance Computing service at Newcastle University, UK.

A Proof of Proposition 5.1
We will use the following standard results from algorithmic differentiation (ad). For more details see for example Griewank (1993) and Griewank and Walther (2008, Chapter 4).

Lemma A.1. For a continuously differentiable function f : R^N → R^M, the Jacobian-vector product ∂f(x)v for v ∈ R^N and the vector-Jacobian product u^T ∂f(x) for u ∈ R^M can each be evaluated at O(1) times the operation cost of evaluating f(x), using respectively forward- and reverse-mode ad.

Corollary A.1. The full Jacobian ∂f(x) can be evaluated at O(min(M, N)) times the operation cost of evaluating f(x), by evaluating either M vector-Jacobian products or N Jacobian-vector products.

For the purposes of analysing the cost of a single constrained integrator step in Algorithm 1 we ignore the operations associated with the reversibility check: as the additional operations are a repeat of a subset of the 'forward' step, they alter the complexity only by a constant factor. As our interest is in the scaling of the cost with the number of observation times T and the number of time steps per inter-observation interval S, we express the complexity of the operations making up a step in terms of these two dimensions alone using big-O notation O(·), with the other problem dimensions assumed fixed. Algorithm 3 shows a detailed breakdown of the operations involved in a single constrained integrator step, annotated with the computational complexity of each line in terms of S and T and the maximum number of times each line may be called (relevant for lines within loops). Below we justify the stated complexities.
The overall dimensionality of the latent space is Q = U + V_0 + STV + TW = O(ST) and the number of constraints is C = TY = O(T). Vector operations with cost linear in the dimension, such as addition and scalar multiplication (lines 1 and 19) or evaluating an ∞-norm (line 11), are therefore of complexity O(ST).
The metric matrix representation M is assumed to be equal to the prior precision matrix, as recommended in Section 4.4. As (u, v_0, v_{1:ST}, w_{1:T}) are independent under the prior, M thus has a block-diagonal structure M = diag(M_1, M_2, M_3, M_4), with M_1 a U × U matrix, M_2 a V_0 × V_0 matrix, M_3 a STV × STV diagonal matrix and M_4 a TW × TW block-diagonal matrix with block size W. Multiplication of a Q-dimensional vector by M^{-1} (lines 2 and 20) will therefore have O(ST) cost.
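As an illustrative sketch of why the block-diagonal structure gives the linear cost claimed above (the block sizes and the function name `solve_block_diag` are ours, for illustration only), each block can be solved against independently, so the overall cost grows linearly in the number of blocks:

```python
import numpy as np

def solve_block_diag(blocks, v):
    """Solve M x = v for block-diagonal M = diag(blocks); each block is
    solved independently, so the total cost is linear in the number of
    blocks (hence O(ST) when the individual blocks have fixed size)."""
    x, i = [], 0
    for B in blocks:
        d = B.shape[0]
        x.append(np.linalg.solve(B, v[i:i + d]))
        i += d
    return np.concatenate(x)

rng = np.random.default_rng(0)
# Illustrative block sizes standing in for M_1 (U x U), M_2 (V_0 x V_0), etc.
blocks = []
for d in (2, 3, 4):
    A = rng.standard_normal((d, d))
    blocks.append(A @ A.T + np.eye(d))   # symmetric positive definite block
v = rng.standard_normal(9)
x = solve_block_diag(blocks, v)

# Dense matrix assembled only to check the per-block solve is correct.
M = np.zeros((9, 9))
i = 0
for B in blocks:
    d = B.shape[0]
    M[i:i + d, i:i + d] = B
    i += d
```

The per-block solves agree with a dense solve against the assembled matrix, while avoiding its cubic cost in the full dimension.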
Each evaluation of the constraint function in Eq. (3) requires ST evaluations of f_δ, T evaluations of h and a single evaluation of each of g_z and g_{x_0}; evaluating c(q) therefore has O(ST) cost (line 9). By Lemma A.1, evaluating a Jacobian-vector product ∂c(q)v for a Q-dimensional vector v therefore also has O(ST) cost (lines 3 and 21), as does evaluating a vector-Jacobian product v^T ∂c(q) for a C-dimensional vector v (lines 6 and 23). By Corollary A.1, evaluating the full Jacobian of the constraint function has O(ST²) cost (line 10).
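The constant-factor cost of a Jacobian-vector product underlying the argument above can be illustrated with a minimal forward-mode ad sketch (a hand-rolled dual-number class of our own, not part of the paper's implementation): each primitive operation propagates one extra tangent value, so evaluating f and its directional derivative together costs only a constant factor more than evaluating f alone.

```python
import math

class Dual:
    """Minimal forward-mode AD value: tracks (value, directional derivative)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def sin(x):
    # Each primitive propagates its own derivative alongside the value.
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x):
    # Example scalar function; the JVP costs a constant factor more than f.
    return sin(x * x) + 3.0 * x

out = f(Dual(0.5, 1.0))   # evaluates f(0.5) and f'(0.5) in a single pass
```

Reverse-mode ad gives the analogous constant-factor guarantee for vector-Jacobian products, at the cost of recording the computation.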
For the Störmer-Verlet splitting ∇h_1(q) = ∇ℓ(q), while for the Gaussian splitting ∇h_1(q) = ∇ℓ(q) − q. By Corollary A.1, evaluating ∇ℓ(q) has the same complexity as evaluating ℓ(q) = −log dρ/dλ^Q(q) − (1/2) log|G_M(q)|. Evaluating the log prior density log dρ/dλ^Q(q) has O(ST) cost.

B Proof of Proposition 5.2
Under the blocking scheme of Eq. (17), all of the operations with quadratic or cubic complexity in T in the previous analysis in Appendix A can be evaluated at only linear cost in T (and S). As in Appendix A, we ignore the operations associated with the reversibility check, as these introduce only a constant factor overhead. Under Assumption 5.2, we have that T = BR for some block size R, and we also include the scaling of operations with R in the analysis here. Algorithm 4 summarises the operations involved in each constrained integrator step when conditioning on intermediate states, together with the associated complexities. In the following we describe how the operations with quadratic or cubic complexity in T in Algorithm 3 are reduced to linear cost in T.
Under Assumption 5.2, each of the B partial constraint functions c_{1:B} outputs a vector of dimension RY + X = O(R) and requires RS evaluations of f_δ and R evaluations of h, and so has complexity O(RS). By Corollary A.1, the partial Jacobians can therefore be evaluated for each b ∈ 1:B at O(R²S) complexity, resulting in an overall O(RST) complexity to evaluate all B tuples of partial Jacobians. These partial Jacobians correspond to the non-zero blocks of the full constraint Jacobian ∂c(q) = [U V W], with U a C × U matrix, V a C × (V_0 + STV) block-diagonal matrix and W a C × TW block-diagonal matrix. Evaluating line 14 therefore has complexity O(RST).

Algorithm 4 (annotated constrained integrator step with blocking / conditioning) takes as inputs the current state and cached quantities (position q = [u; v; w], momentum p and g = ∇h_1(q)) and outputs the next state and cached quantities (q̄ = [ū; v̄; w̄], p̄ and ḡ = ∇h_1(q̄)).

As (u, v_{0:ST}, w_{1:T}) are independent under the prior ρ, under the recommendation in Section 4.4 the metric matrix representation M is block-diagonal. Assuming the gradient of h_1 is computed using ad, by Corollary A.1 its cost is again the same as that of evaluating h_1(q). For both the Gaussian and Störmer-Verlet splittings, the dominant cost in evaluating h_1(q) is in computing the Gram matrix log-determinant log|G_M(q)|; using the identity in (16) this can be reduced to computing the log-determinants of the matrices C and D. Evaluating C and D has complexities O(R²T) and O(R²ST) respectively, as described previously. The overall cost of each constrained integrator step is therefore linear in both S and T.

C Proof of Proposition 3.1
To prove that the posterior distribution has a density of the form given in Proposition 3.1 we use the co-area formula (Federer, 1969), an extension of Fubini's theorem.
Lemma C.1 (Co-area formula). For any measurable function f : R^Q → R and continuously differentiable function c : R^Q → R^C,

∫_{R^Q} f(q) |∂c(q) M^{-1} ∂c(q)^T|^{1/2} |M|^{1/2} λ^Q(dq) = ∫_{R^C} ∫_{c^{-1}[y]} f(q) η_M^{Q-C}(dq) λ^C(dy),

where η_M^D is the D-dimensional Hausdorff measure on R^Q equipped with the metric with positive definite matrix representation M ∈ R^{Q×Q}, and c^{-1}[y] = {q ∈ R^Q : c(q) = y} is the preimage of y under c.
Proof. See for example Theorem 1 in §3.4.2 of Evans and Gariepy (1992). Compared to the standard statement, the result here includes a change of variables q' = M^{1/2} q, with the corresponding transformation of the Euclidean metric (q')^T q' = q^T M q.
Although Lemma C.1 states the co-area formula in terms of a Hausdorff measure on the ambient space, an equivalent result can be stated in terms of a Riemannian measure on the submanifold using the correspondence in Lemma C.2 below. We are now in a position to prove Proposition 3.1 in the main text which we restate in a self-contained form below for ease of reference.
Proposition C.1. Let q ∼ ρ for ρ ∈ P(R^Q) absolutely continuous with respect to λ^Q, and let c : R^Q → R^C be of class C^1 with ∂c of full row-rank ρ-almost surely. Then the law ϖ(y) of q given c(q) = y has density with respect to the Riemannian measure σ^M_{c^{-1}[y]} proportional to dρ/dλ^Q(q) |∂c(q) M^{-1} ∂c(q)^T|^{-1/2}.

Proof. Let f(q) = h(q) dρ/dλ^Q(q) |∂c(q) M^{-1} ∂c(q)^T|^{-1/2} |M|^{-1/2} for a measurable function h. Then from Lemma C.1 and the equivalence in Lemma C.2 we have

E[h(q)] = ∫_{R^C} ∫_{c^{-1}[y]} h(q) dρ/dλ^Q(q) |∂c(q) M^{-1} ∂c(q)^T|^{-1/2} |M|^{-1/2} σ^M_{c^{-1}[y]}(dq) λ^C(dy).

As this applies for arbitrary h, defining ω ∈ P(R^C) as the marginal law of c(q) and ϖ : R^C → P(R^Q) as the law of q given c(q), comparing with the law of total expectation we recognise ω(dy) = w(y) λ^C(dy), with w(y) the corresponding normalising constant, and the conditional law ϖ(y) with density with respect to σ^M_{c^{-1}[y]} proportional to dρ/dλ^Q(q) |∂c(q) M^{-1} ∂c(q)^T|^{-1/2}. The density given for π = ϖ(0) in the proposition then follows directly.

D Non-degeneracy of the symplectic form
For a general set of constraints φ_i : R^Q × R^Q → R, i ∈ 1:M, on both the positions and momenta, there is no guarantee that the restriction of the symplectic form is non-degenerate. In general, the restricted symplectic form will be non-degenerate if and only if all the constraints are second class (Dirac, 2001; Bojowald and Strobl, 2003), which requires each constraint to have a non-zero Poisson bracket with at least one other constraint; that is, for each i ∈ 1:M there exists j ≠ i such that {φ_i, φ_j} ≠ 0. For arbitrary constraints, particularly those involving both position and momentum, this may not be the case. Here, however, we consider only the special case of holonomic primary constraints φ_i(q, p) = c_i(q) with c_i : R^Q → R, i ∈ 1:C, together with the secondary constraints φ_{i+C}(q, p) = ∂c_i(q)^T M^{-1} p, i ∈ 1:C, which arise from the requirement that the time derivative of each primary constraint is zero. We then have that

{φ_i(q, p), φ_{j+C}(q, p)} = ∂c_i(q)^T M^{-1} ∂c_j(q) for i, j ∈ 1:C;

that is, the Poisson bracket between each primary constraint and the corresponding secondary constraint (j = i) is non-zero almost everywhere under Assumption 3.2 (that ∂c is of full row-rank ρ-almost everywhere) and Assumption 3.3 (that M is positive definite). Hence all the constraints are second class and the restriction of the symplectic form is non-degenerate.
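The second-class property can be checked numerically in a toy special case. The sketch below (our own illustration, with hypothetical linear constraints c(q) = Aq for which the bracket formula is exact) confirms that each primary constraint has a strictly positive bracket with its own secondary constraint when the constraint Jacobian is full row-rank and M is positive definite:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, C = 4, 2
M = np.diag(rng.uniform(1.0, 2.0, size=Q))   # positive definite metric
Minv = np.linalg.inv(M)
A = rng.standard_normal((C, Q))              # toy linear constraints c(q) = A q

# For phi_i(q, p) = c_i(q) and phi_{i+C}(q, p) = grad c_i(q)^T M^{-1} p,
# the Poisson bracket {phi_i, phi_{j+C}} equals grad c_i^T M^{-1} grad c_j,
# which for linear constraints is the (i, j) entry of A M^{-1} A^T.
bracket = A @ Minv @ A.T
```

The diagonal entries are quadratic forms of the non-zero constraint gradients in the positive definite matrix M^{-1}, hence strictly positive, mirroring the almost-everywhere statement in the text.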

E Proof of Proposition 4.1
It is sufficient to show that the time derivative of the Hamiltonian is zero under the dynamics described by the flow map Φ^{h_c}_t. Writing the constrained dynamics as dq/dt = ∇_2 h(q, p) and dp/dt = −∇_1 h(q, p) − ∂c(q)^T λ, with the multipliers λ chosen such that ∂c(q) dq/dt = 0, we have that

dh/dt = ∇_1 h^T (dq/dt) + ∇_2 h^T (dp/dt) = ∇_1 h^T ∇_2 h − ∇_2 h^T ∇_1 h − ∇_2 h^T ∂c(q)^T λ = −(∂c(q) dq/dt)^T λ = 0.

F Proof of Proposition 4.2
We will make use of the following definitions and lemma.
Definition F.1. The bilinear and skew-symmetric wedge product between the differentials dq_i : R^Q → R and dp_j : R^Q → R is (dq_i ∧ dp_j)(u, v) := dp_j(u) dq_i(v) − dq_i(u) dp_j(v), and between the vectors of differentials dq = [(dq_i)_{i=1}^Q] and dp = [(dp_i)_{i=1}^Q] is dq ∧ dp := Σ_{i=1}^Q dq_i ∧ dp_i, with dq ∧ dp equal to the symplectic form on R^Q × R^Q.
Definition F.2. A map Φ : T * M → T * M is symplectic if the symplectic form on T * M, that is the restriction of dq ∧ dp to T * M, is preserved by the map.
Lemma F.1. The vector of differential 1-forms dq on the manifold M satisfies dq ∧ d(∂c(q)^T λ(q, p)) = 0 for arbitrary λ : M × R^Q → R^C.
Proof. Omitting the arguments of c and λ for compactness, we have that

dq ∧ d(∂c^T λ) = Σ_{i∈1:C} λ_i (dq ∧ ∇²c_i dq) + Σ_{i∈1:C} (∂c_i dq) ∧ dλ_i.

For q restricted to M, and so satisfying the constraint equation c(q) = 0, the vector of differential 1-forms dq satisfies ∂c dq = 0, so the terms in the second sum vanish. Further, the Hessians ∇²c_{1:C} are all symmetric, therefore dq ∧ ∇²c_i dq = 0 for all i ∈ 1:C due to the skew-symmetry of the wedge product. Therefore dq ∧ d(∂c^T λ) = 0 as required.
To prove Proposition 4.2 it is sufficient to show that the time derivative of dq ∧ dp is identically zero under the dynamics described by the flow map Φ^{h_c}_t, as from this the preservation of the symplectic form directly follows. We have that

d/dt (dq ∧ dp) = d(dq/dt) ∧ dp + dq ∧ d(dp/dt) = (M^{-1} dp) ∧ dp − dq ∧ (∇²h_1 dq) − dq ∧ d(∂c^T λ).

As both M^{-1} and ∇²h_1 are symmetric matrices, from the skew-symmetry of the wedge product the first two terms in the last line are zero, as is the third term by Lemma F.1. Therefore d/dt (dq ∧ dp) = 0.

G Proof of Proposition 4.3
For p̄ ∼ N(0, M) and p = P_M(q) p̄ we have, for any measurable A ⊆ R^Q, that P(p ∈ A | q = q) is obtained by integrating the Gaussian density of p̄ over the set mapped into A by P_M(q). As M is positive definite it has a non-singular symmetric square root M^{1/2}. Further, as ∂c(q) is of full row-rank ρ-almost surely, we can find a decomposition M^{-1/2} ∂c(q)^T = Q_⊥ R, where Q_⊥ and Q_∥ are respectively Q × C and Q × (Q − C) matrices with orthonormal columns (i.e. Q_⊥^T Q_⊥ = I_C, Q_∥^T Q_∥ = I_{Q−C} and Q_⊥^T Q_∥ = 0) and R is a non-singular C × C upper-triangular matrix. From the definition of P_M in Eq. (8) we then have that P_M(q)(M^{1/2} Q_⊥ n + M^{1/2} Q_∥ m) = M^{1/2} Q_∥ m. Defining the linear change of variables p̄ = M^{1/2} Q_⊥ n + M^{1/2} Q_∥ m, integrating out the density on n to a constant, defining φ(m) := M^{1/2} Q_∥ m and using φ(m)^T M^{-1} φ(m) = m^T m and ∂φ(m)^T M^{-1} ∂φ(m) = I_{Q−C}, we have that the conditional density of p in the coordinates m is proportional to exp(−m^T m / 2). Recognising that φ defines a (global) parametrisation of T*_q M and comparing with the definition of the Riemannian measure σ^{M^{-1}}_{T*_q M} (see Definition C.1), we then have that p | q = q is conditionally distributed according to the measure on T*_q M with density proportional to exp(−p^T M^{-1} p / 2) with respect to σ^{M^{-1}}_{T*_q M}. As this holds for any measurable A, the conditional of p given q under ζ(dq, dp) is as claimed.

H Details of integrator implementation and relation to previous work
In this section we give some additional details on the implementation of the constrained integrator described in the main text in Section 4.3. The iterative solver in Eq. (13) is terminated once the norm of the left-hand side of the constraint equation in (12) falls below a tolerance θ_c > 0 and the norm of the change in position between iterations is less than a tolerance θ_q > 0. A maximum of J iterations is performed, with an error being raised if the solver has not converged by the end of these. The reversibility check is similarly relaxed to requiring that the norm of the difference between the initial position and the position computed by applying forward and then reversed steps is less than 2θ_q, based on the intuition that we control the error norm in the positions in forward and reversed steps to around θ_q, so we expect the composition to have error norm of approximately 2θ_q or less. In our double-precision floating-point arithmetic implementation, the ∞-norm was used in all cases, with J = 50, θ_q = 10^{-8} and θ_c = 10^{-9}. These values were found to give a good trade-off between accuracy and robustness. While smaller tolerances closer to machine precision (≈ 10^{-16}) can often be used successfully, in some settings we found that accumulated floating-point error, particularly in cases where the matrices involved in the linear algebra operations in each constrained integrator step were ill-conditioned, could result in the Newton solver being unable to converge within the prescribed tolerance at some points in the latent space.
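The tolerance-based termination rule above can be sketched as follows (a minimal illustration with a toy constraint, using a least-norm Newton update rather than the paper's exact quasi-Newton scheme; the function name and toy example are ours):

```python
import numpy as np

def project_onto_manifold(q, c, jac_c, theta_c=1e-9, theta_q=1e-8, max_iters=50):
    """Newton-type projection of q onto {q : c(q) = 0}, terminating when both
    the constraint infinity-norm and the change in position fall below their
    tolerances, mirroring the termination rule described in the text."""
    for _ in range(max_iters):
        J = jac_c(q)
        dq = J.T @ np.linalg.solve(J @ J.T, c(q))   # least-norm Newton update
        q = q - dq
        if np.max(np.abs(c(q))) < theta_c and np.max(np.abs(dq)) < theta_q:
            return q
    raise RuntimeError("projection solver did not converge")

# Toy constraint: the unit circle, c(q) = |q|^2 - 1.
c = lambda q: np.array([q @ q - 1.0])
jac_c = lambda q: 2.0 * q[None, :]
q_proj = project_onto_manifold(np.array([2.0, 1.0]), c, jac_c)
```

With the default tolerances the solver converges in a handful of iterations here; raising an error on non-convergence allows the enclosing mcmc step to reject the proposal, as in Algorithm 1.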
Without the reversibility check, the integrator Ξ^{h_1}_t is reversible only if the iterative solves converge exactly; in practice, however, convergence of the iterative solver is not guaranteed. Empirically we observe that starting both the forward and reverse Ξ^{h_2}_t steps from position-momentum pairs in the co-tangent bundle, rather than from a pair in which the momentum is not necessarily in the co-tangent space (which is the case for the composition proposed by Reich (1996)), leads to fewer cases of rejections due to the iterative solves failing to converge or non-reversible steps being flagged for a given integrator step size, thus improving sampling efficiency. By performing the reversibility check on the 'inner' Ξ^{h_2}_t sub-step, rather than on the whole step as considered by Lelièvre et al. (2019), there is also the computational saving of avoiding additional h_1 gradient computations in the time-reversed step.

I HMC in models with additive observation noise
In models where the observations are subject to additive noise (as defined in Remark 2.1), with W = Y, L : Z → R^{Y×Y} non-singular µ-almost surely and η absolutely continuous with respect to λ^Y, the posterior distribution (under the non-centred parametrisation described in Section 3.1) on u, v_0 and v_{1:ST} given observations y_{1:T} = y_{1:T} has a tractable density with respect to the Lebesgue measure.
Specifically, if we define the concatenated latent state q̄ = [u; v_0; v_{1:ST}] ∈ R^{Q̄} with Q̄ = U + V_0 + STV, then a priori we have that q̄ ∼ ρ̄ with a Lebesgue density. Defining ȳ_{1:T} = g_{ȳ:}(u, v_0, v_{1:ST}), with g_{ȳ:} mapping the latent state to the noiseless observations, the posterior π̄ on q̄ given y_{1:T} = y_{1:T} then has a density with respect to ρ̄ given by the observation density of y_{1:T} given ȳ_{1:T}, up to a normalising constant. In the common special case of additive isotropic Gaussian observation noise, i.e. η = N(0, I_Y) and L(z) = σ_y(z) I_Y with σ_y : Z → R_{>0}, this reduces to a product of Gaussian likelihood terms. As this posterior π̄ has a density with respect to the Lebesgue measure which we can evaluate pointwise up to an unknown normalising constant, we can generate approximate samples from π̄ using any of the large range of mcmc methods applicable to target distributions of this form. Typically the density function will be differentiable, and so hmc methods offer a particularly convenient and efficient approach to inference. For the numerical experiments in Section 7 in which hmc is used as a baseline, we use a dynamic integration time hmc implementation (Betancourt, 2017) with a dual-averaging algorithm to adapt the integrator step size (Hoffman and Gelman, 2014). We consider two different options for the metric matrix representation M (mass matrix): the identity matrix, and a diagonal matrix with diagonal elements adaptively set during the warm-up chain iterations to online estimates of the precisions of q̄ under the posterior, using the windowed adaptation algorithm used for adapting M in Stan (Carpenter et al., 2017). As the matrix-vector multiplies required when using a dense M would incur a O(S²T²) cost in each integrator step, we do not consider adapting M based on an estimate of the full posterior precision matrix. All parameters for the algorithms used are summarised in Appendix N.
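The pointwise-evaluable log density described above can be sketched as follows for the isotropic Gaussian noise case (a minimal illustration with a toy scalar random-walk 'diffusion' standing in for the discretised sde; the function and argument names are ours, not the paper's api):

```python
import numpy as np

def log_posterior_density(q, generate_states, obs_mean, sigma_y, y_obs, S):
    """Unnormalised log posterior density for additive isotropic Gaussian
    observation noise, y_t ~ N(obs_mean(x_{S t}), sigma_y^2), under a
    standard normal prior on the concatenated latent state q."""
    x = generate_states(q)                          # discretised path x_{1:ST}
    y_pred = np.array([obs_mean(x[S * t - 1])
                       for t in range(1, len(y_obs) + 1)])
    log_prior = -0.5 * q @ q
    resid = (y_obs - y_pred) / sigma_y
    log_lik = -0.5 * np.sum(resid ** 2) - y_obs.size * np.log(sigma_y)
    return log_prior + log_lik

# Toy 'diffusion': a scalar random walk x_s = x_{s-1} + q_s, observed directly.
rng = np.random.default_rng(2)
S, T = 3, 2
q = rng.standard_normal(S * T)
x = np.cumsum(q)
y = np.array([x[S * 1 - 1], x[S * 2 - 1]])          # noiseless observations
lp_match = log_posterior_density(q, np.cumsum, lambda s: s, 0.5, y, S)
lp_off = log_posterior_density(q, np.cumsum, lambda s: s, 0.5, y + 1.0, S)
```

Latent states whose generated path matches the observations receive higher log density than ones with large residuals, which is all a gradient-based sampler such as hmc requires.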

J FitzHugh-Nagumo model details
The stochastic FitzHugh-Nagumo model we use is defined by a two-dimensional system of sdes. As in Ditlevsen and Samson (2019), we use a strong-order 1.5 Taylor discretisation scheme, corresponding for this model to a forward operator with a V = 2 dimensional standard normal input vector v. Note that, unlike the approach of Ditlevsen and Samson (2019), our approach remains well-defined even when using an Euler-Maruyama discretisation of a hypoelliptic diffusion. We nevertheless use the more accurate order 1.5 discretisation scheme here, as there is negligible additional computational cost or complexity compared to an Euler-Maruyama discretisation. We use priors log σ ∼ N(0, 1), log ε ∼ N(0, 1), log γ ∼ N(0, 1) and β ∼ N(0, 1). This formulation corresponds to the sde used in van der Meulen and Schauer. For the comparison with their approach in the noisy observation case in Section 7.2, we use a time discretisation of the sde in (24) to run the experiments using the Julia code accompanying the article, with priors on the parameters log σ̄ ∼ N(0, 1), log ε̄ ∼ N(0, 1), log γ̄ ∼ N(0, 1), s̄ ∼ N(0, 1) and on the initial state x̄_0 ∼ N(0, I_2), giving an equivalent generative model in the transformed space to the model in the original space.
K Susceptible-infected-recovered model details

The system of sdes for the sir model in terms of the log-transformed state x = [log s; log i; log c] can be derived from the sde system in the original space in (18) using Itô's lemma. An Euler-Maruyama scheme was used for the time discretisation with S = 20 steps per inter-observation interval. The population size is fixed to the known value N = 763, and the initial values s(0) and i(0) to 762 and 1 respectively. The parameters are given priors log γ ∼ N(0, 1), log α ∼ N(0, 1), β ∼ N(0, 1), log σ ∼ N(−3, 1) and log σ_y ∼ N(0, 1), and the initial value x_3(0) = log c(0) ∼ N(0, 1). The observed data y were taken from Anonymous (1978), with the number of cases confined to bed manually digitised using the WebPlotDigitizer tool (Rohatgi, 2021).

L Chain initialisation
To initialise a constrained hmc chain targeting the posterior distribution with density in Eq. (4), supported on the manifold M, we need to find an initial q ∈ M, which in practice we relax to the condition ‖c(q)‖_∞ < θ_c. In our experiments, to find one or more initial points satisfying this condition, we use one of the following heuristics, depending on whether or not observation noise is present in the model.

L.1 Additive observation noise
In the case of additive observation noise with h(x, z, w) = h̄(x) + L(z)w, if we further assume that W = Y and L is non-singular µ-almost surely, then we can almost surely find w_{1:T} as a function of u, v_0 and v_{1:ST} such that q = [u; v_0; v_{1:ST}; w_{1:T}] ∈ M, by computing x_{1:ST} = g_{x:}(u, v_0, v_{1:ST}) and setting w_t = L(z)^{-1}(y_t − h̄(x_{St})) for each t ∈ 1:T. We can therefore, for example, sample u ∼ µ̄, v_0 ∼ ν̄ and v_s ∼ N(0, I_V) ∀s ∈ 1:ST, and compute w_{1:T} as just described to find a random initial point q = [u; v_0; v_{1:ST}; w_{1:T}] ∈ M.
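The noise-solving step above can be sketched as follows (a toy illustration with an assumed linear observation function and fixed noise scale matrix; the function name is ours):

```python
import numpy as np

def observation_noise_on_manifold(y_obs, x_at_obs, L, h_bar):
    """Solve w_t = L^{-1}(y_t - h_bar(x_{S t})) for each observation time, so
    that the full latent vector [u; v_0; v_{1:ST}; w_{1:T}] lies exactly on
    the observation-constraint manifold."""
    return np.stack([np.linalg.solve(L, y_t - h_bar(x_t))
                     for y_t, x_t in zip(y_obs, x_at_obs)])

rng = np.random.default_rng(3)
L = np.array([[0.5, 0.0], [0.1, 0.4]])     # non-singular observation noise scale
h_bar = lambda x: x[:2]                    # observe the first two state components
x_at_obs = rng.standard_normal((4, 3))     # states x_{S t} at T = 4 obs times
y_obs = rng.standard_normal((4, 2))
w = observation_noise_on_manifold(y_obs, x_at_obs, L, h_bar)
# Regenerating the observations from the solved noise reproduces y exactly.
y_regen = np.stack([h_bar(x_t) + L @ w_t for x_t, w_t in zip(x_at_obs, w)])
```

By construction the regenerated observations match y_{1:T} to floating-point precision, so the assembled latent point satisfies the constraint.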
While this procedure is guaranteed to find a point on the manifold M, sampling u, v_0 and v_{1:ST} from their priors will tend to produce points that are atypical under the posterior distribution. Often the geometry of the manifold at such points will be more challenging for constrained hmc chains to navigate, with, for example, regions of high curvature which require a small integrator step size t to control the proportion of trajectories terminated due to the Newton iteration failing to converge. This can lead to excessively long computation times for the early chain iterations in the adaptive warm-up phase, where the step size is tuned to control an acceptance rate statistic: the small step size, combined with the dynamic hmc implementation which expands the simulated trajectory until a termination condition is satisfied, results in very long trajectories of many steps.
Rather than directly initialising the chains at a point computed from a prior sample, we found that first running a number of steps of a gradient-descent optimisation algorithm on the negative logarithm of the Lebesgue density of the posterior distribution defined in Eq. (20), starting from the prior sample, and using the optimised q̄ = [u; v_0; v_{1:ST}] to compute w_{1:T}, and so q ∈ M, as described above, led to more robust performance with lower variance in the times taken to complete the adaptive warm-up phase. As there is no need to perform iterative Newton solves during such an optimisation, the computational run time overhead of this optimisation was negligible compared to the run times of the warm-up and main chain phases. In our experiments we used the adaptive moments 'Adam' algorithm (Kingma and Ba, 2015) to perform the optimisation, continuing the optimisation until the condition T^{-1} Σ_{t=1}^T ‖L(z)^{-1}(y_t − h̄(x_{St}))‖² < 1 is met. The same procedure was also used to generate the initial states for the standard hmc chains.

L.2 Noiseless observations
In the case of noiseless observations we need an alternative approach to find an initial point q on the manifold M. We first find a sequence of T states x̃_{1:T} consistent with the observations; that is, such that ‖h̄(x̃_t) − y_t‖_∞ < θ_c for all t ∈ 1:T. For linear observation functions h̄(x) = Hx, given an arbitrary initial sequence χ̃_{1:T} ∈ X^T, we can compute x̃_{1:T} via the T linear solves in (25); each x̃_t then satisfies h̄(x̃_t) = y_t (modulo floating-point error). A simple mechanism for generating the initial sequence χ̃_{1:T} is to use a prior sample, i.e. generate u ∼ µ̄, v_0 ∼ ν̄ and v_s ∼ N(0, I_V) ∀s ∈ 1:ST, compute x_{1:ST} = g_{x:}(u, v_0, v_{1:ST}) and then set χ̃_t = x_{St} ∀t ∈ 1:T. For non-linear observation functions h̄, we can follow an analogous procedure but replace the T linear solves in (25) with an iterative method for solving T independent systems of non-linear equations, for example a Gauss-Newton-like iteration. As the number of equations Y and variables X in each system will be relatively small for most diffusion models, solving the systems will not usually be overly burdensome.
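A Gauss-Newton-like solver of the kind mentioned above can be sketched as follows (our own toy illustration with an assumed non-linear observation function; this is not the paper's exact iteration):

```python
import numpy as np

def gauss_newton_solve(h_bar, jac_h, x0, y, tol=1e-9, max_iters=100):
    """Solve h_bar(x) = y by Gauss-Newton-like updates through the observation
    Jacobian (Y equations in X >= Y unknowns), using the least-norm step."""
    x = x0
    for _ in range(max_iters):
        r = h_bar(x) - y
        if np.max(np.abs(r)) < tol:
            return x
        J = jac_h(x)
        x = x - J.T @ np.linalg.solve(J @ J.T, r)
    raise RuntimeError("did not converge")

# Toy non-linear observation of a 2D state: h_bar(x) = exp(x_1) + exp(x_2).
h_bar = lambda x: np.array([np.exp(x[0]) + np.exp(x[1])])
jac_h = lambda x: np.array([[np.exp(x[0]), np.exp(x[1])]])
x_sol = gauss_newton_solve(h_bar, jac_h, np.zeros(2), np.array([3.0]))
```

Each of the T systems is small (Y equations in X unknowns), so iterations of this kind converge cheaply in practice.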
Once we have a state sequencex 1:T consistent with the observations, we can then use either a direct approach to solve for a corresponding constraint satisfying state q if the forward operator f δ meets a linearity condition, or in the more general case we can use an iterative approach. Both methods are described below.
In many settings the forward operator f δ is linear as a function of its third argument (state noise vector v) with the two other arguments fixed, and has a Jacobian with respect to this argument, ∂ 3 f δ , which is full row-rank. This applies for example to the common case of using an Euler-Maruyama scheme to discretise an elliptic diffusion. It also applies to the strong-order 1.5 Taylor discretisation of the hypoelliptic FitzHugh-Nagumo model described in Appendix J.
In these cases, the forward operator f_δ can be decomposed as f_δ(z, x, v) = m_δ(z, x) + S_δ(z, x) v, with m_δ : Z × X → X and S_δ : Z × X → {A ∈ R^{X×V} : rank(A) = X}. If this property is satisfied, then given values for the unconstrained vectors u and v_0 (for example sampled from their priors) and a state sequence x̃_{1:T}, we can solve for a sequence of noise vectors v_{1:ST} such that the full state sequence x_{1:ST} = g_{x:}(u, v_0, v_{1:ST}) linearly interpolates x̃_{1:T}, that is x_{S(t−1)+s} = x̃_{t−1} + (s/S)(x̃_t − x̃_{t−1}) for t ∈ 1:T, s ∈ 1:S, by setting v_{S(t−1)+s} = S_δ(z, x_{S(t−1)+s−1}) \ (x_{S(t−1)+s} − m_δ(z, x_{S(t−1)+s−1})), where A\b returns an x such that Ax = b. If the state sequence x̃_{1:T} is chosen such that y_t = h̄(x̃_t) for all t ∈ 1:T as detailed above, then by construction the interpolated state sequence x_{1:ST} will also be consistent with the observations, and q = [u; v_0; v_{1:ST}] will be on the manifold M.
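The per-transition solve can be sketched as follows (a toy Euler-Maruyama forward operator of our own choosing; the drift and diffusion functions are illustrative assumptions, not the paper's models):

```python
import numpy as np

def noise_for_transition(x_curr, x_next, m_delta, S_delta):
    """Solve for the noise vector v such that f_delta = m_delta + S_delta v
    hits a prescribed next state: v = S_delta \\ (x_next - m_delta).
    lstsq returns the minimum-norm solution when S_delta has more columns
    than rows (full row-rank case)."""
    b = x_next - m_delta(x_curr)
    v, *_ = np.linalg.lstsq(S_delta(x_curr), b, rcond=None)
    return v

# Toy Euler-Maruyama operator for dx = -x dt + dB with step delta = 0.1.
delta = 0.1
m_delta = lambda x: x - delta * x                       # drift part
S_delta = lambda x: np.sqrt(delta) * np.eye(len(x))     # diffusion part
x0, x1 = np.array([1.0, -2.0]), np.array([0.8, -1.5])
v = noise_for_transition(x0, x1, m_delta, S_delta)
x1_regen = m_delta(x0) + S_delta(x0) @ v
```

Applying the forward operator with the solved noise vector reproduces the prescribed next state, so chaining these solves along the interpolated sequence lands the full latent point on the manifold.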
In more general settings, we can independently sample a point q from the prior ρ on the ambient space (or reuse the existing prior sample [u; v_0; v_{1:ST}] used to generate χ̃_{1:T}) and run an adaptive gradient-descent algorithm, Adam (Kingma and Ba, 2015), to minimise an objective function γ measuring the discrepancy between the states x_{St} generated from its first argument and the target states in its second argument, initialised at q and with the second argument fixed at the computed x̃_{1:T} value. The optimisation is continued until γ(q, x̃_{1:T}) < θ_γ, with the optimisation restarted from a new q ∼ ρ if this is not satisfied within M_g iterations (we used θ_γ = 10^{-6} and M_g = 1000). We then run the Newton iteration in (13) (with p = 0) to project q onto the manifold, to within the tolerance ‖c(q)‖_∞ < θ_c.
We found this combination of gradient descent to find a point close to the manifold then Newton iteration to project to within the specified tolerance θ c was more effective than either solely using gradient descent until within the constraint tolerance (with the gradient-descent iteration tending to converge slowly once close to the manifold) or using the iteration in (13) directly on q sampled from the prior, as for points far from the manifold the Newton iteration generally fails to converge.
It is also possible to use gradient-descent optimisation directly on the norm of the (non-conditioned) constraint function, i.e. γ(q) := (1/C)‖c(q)‖₂², which sidesteps the requirement to find a constraint-satisfying state sequence x̃_{1:T}. However, we found that as the number of observation times T becomes large this approach begins to suffer from the optimisation getting stuck in local minima; the conditional independencies introduced by instead fixing the values of the states at the observation times appear to make the optimisation problem simpler to solve.

M Measuring chain computation times
To compute the chain computation times in the numerical experiments in Section 7, we recorded the number of evaluations in each chain of the key expensive operations in Algorithm 1 and multiplied these by estimated compute times for each operation, calculated by separately timing the execution of a compiled loop iterating the operation a large number of times in a single-threaded process. Compared to directly using the wall-clock run times of the chains, this approach eliminates the effect of the Python interpreter overhead in the implementation from the computation time estimates, removes the variability in run time estimates due to other background processes, and allowed experiments to be run on multiple machines with differing hardware while remaining comparable. The operations monitored were: evaluations of the constraint function; evaluations of the constraint Jacobian; matrix decompositions to solve linear systems in the Gram matrix and evaluation of the log-determinant of the Gram matrix; and evaluation of the gradient of the log-determinant of the Gram matrix.
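The counting scheme above can be sketched with a simple wrapper (an illustrative helper of our own, with a stand-in operation and a made-up unit time, not the paper's instrumentation):

```python
from collections import Counter

class OpCounter:
    """Counts invocations of wrapped operations so that chain compute time
    can be estimated as sum over operations of count * benchmarked unit time."""
    def __init__(self):
        self.counts = Counter()
    def wrap(self, name, fn):
        def counted(*args, **kwargs):
            self.counts[name] += 1
            return fn(*args, **kwargs)
        return counted

counter = OpCounter()
constraint = counter.wrap("constraint", lambda q: sum(q))   # stand-in operation
for _ in range(5):
    constraint([1.0, 2.0])

# Per-operation unit times would come from separately timing a compiled loop.
unit_times = {"constraint": 2.5e-6}                          # seconds, illustrative
est_time = sum(counter.counts[k] * unit_times[k] for k in unit_times)
```

Because the counts are exact and the unit times are benchmarked in isolation, the resulting estimates are insensitive to interpreter overhead and background load.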

N Summary of algorithmic parameters
In this section we summarize the values of all algorithmic parameters required for reproducing the numerical experiments. Most of these have been provided elsewhere in the text (or in cited works), but we collate them together here for ease of reference.

Chain initialisation
For the FitzHugh-Nagumo model with noiseless observations, the linear interpolation scheme described in Appendix L.2 was used to initialise the chain states for each experiment. The initial state sequences x̃_{1:T} consistent with the observations were generated as x̃_t = [y_t, 0.5 r_t], r_t ∼ N(0, 1), for all t ∈ 1:T. For the FitzHugh-Nagumo model with noisy observations, the same scheme was used to generate the initial states, with the observation noise vector components w_{1:T} all set to zero. The same scheme was used for initialising both the constrained and standard hmc chains, with the standard hmc chain states simply not including the (zeroed) observation noise vector components w_{1:T}. For the sir model, the scheme described in Appendix L.1 was used to initialise the chain states for each experiment. The adaptive moments (Adam (Kingma and Ba, 2015)) optimiser used to minimise the negative log posterior density on u, v_0 and v_{1:ST} had the following parameters (following the notation of Algorithm 1 in the original paper):

Step size α: 0.1
Exponential decay rate for first moment estimate β_1: 0.8
Exponential decay rate for second moment estimate β_2: 0.999
Additive term for numerical stability: 1 × 10^{-8}

The optimisation was terminated when (1/(T σ_y²)) Σ_{t=1}^T |y_t − h̄(x_{St})|² < 1, where σ_y and x_{1:ST} correspond to the values generated from the current u, v_0 and v_{1:ST} values. If this criterion was not met within 5000 iterations, the optimisation was restarted from a new point sampled from the prior. For the constrained hmc chains the observation noise vector components of the initial state were set to w_t = (1/σ_y)(y_t − h̄(x_{St})) for all t ∈ 1:T, from the σ_y and x_{1:ST} values corresponding to the optimised values of u, v_0 and v_{1:ST}, giving a constraint-satisfying initial state q = [u; v_0; v_{1:ST}; w_{1:T}]. For the standard hmc chains the same scheme was used, other than the w_{1:T} components of the state not being included, with initial states q̄ = [u; v_0; v_{1:ST}].

Reversible constrained leapfrog integrator
For the constrained hmc chains, the constrained leapfrog integrator described in Algorithm 1 in the main paper was used to simulate the constrained Hamiltonian dynamics trajectories. The following parameters were used in all cases:

  Projection solver constraint tolerance θ_c: 10⁻⁹
  Projection solver position tolerance θ_q: 10⁻⁸
  Projection solver maximum iterations J: 50
  Reversibility check tolerance: 2 × 10⁻⁸
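To illustrate the roles of the tolerances θ_c and θ_q and the iteration cap J, the following is a minimal sketch of a projection step of the kind used within constrained leapfrog integrators, here a Newton iteration with the constraint Jacobian held fixed at the previous on-manifold point; the function name and the quadratic example constraint are illustrative, not the paper's implementation:

```python
import numpy as np

def project_onto_manifold(q, constr, jacob_prev, theta_c=1e-9,
                          theta_q=1e-8, max_iters=50):
    """Project a position q onto the manifold {q : constr(q) = 0} by a
    quasi-Newton iteration with fixed Jacobian J = jacob_prev:
    q <- q - J^T (J J^T)^{-1} constr(q), terminating when both the
    constraint residual (tolerance theta_c) and the position update
    (tolerance theta_q) are small."""
    J = np.atleast_2d(jacob_prev)
    gram = J @ J.T
    for _ in range(max_iters):
        dq = J.T @ np.linalg.solve(gram, constr(q))
        q = q - dq
        if (np.max(np.abs(constr(q))) < theta_c
                and np.max(np.abs(dq)) < theta_q):
            return q
    raise RuntimeError("Projection solver failed to converge")

# Example: project a perturbed point back onto the unit circle, using the
# constraint Jacobian evaluated at the previous on-manifold point [1, 0].
constr = lambda q: np.array([q @ q - 1.0])
q_proj = project_onto_manifold(np.array([1.1, 0.05]), constr,
                               jacob_prev=np.array([2.0, 0.0]))
```

Holding the Jacobian fixed at the previous point is what makes a step of this form suitable for the reversibility check: the same fixed matrix is reused when verifying that the reversed step recovers the starting position.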

Integrator step size adaptation
Both constrained and standard hmc chains used the dual-averaging algorithm described in Section 3.2 of Hoffman and Gelman (2014) to tune the integrator step size during the adaptive warm-up phase of the chains. The initial step size ε₀ was set using the heuristic described in Algorithm 4 in Hoffman and Gelman (2014). The dual-averaging algorithm was used within a dynamic hmc algorithm which automatically selected the number of integrator steps per iteration (see next subsection for more details), with the following algorithmic parameters (corresponding to the notation in Algorithm 6 in Hoffman and Gelman (2014)):

  Target acceptance rate δ: 0.8
  Regularisation scale γ: 0.1
  Relaxation exponent κ: 0.75
  Iteration offset t₀: 10
  Regularization target µ: log(10 ε₀)

The values for the relaxation exponent κ, iteration offset t₀ and regularization target µ follow the defaults recommended in Hoffman and Gelman (2014). The target acceptance rate δ = 0.8 matches the default value used in the current Stan implementation of the dual-averaging step size tuning algorithm (Carpenter et al., 2017). A larger regularisation scale γ = 0.1 was used compared to the default of γ = 0.05 recommended in Hoffman and Gelman (2014) and used as the default in Stan; larger values of this parameter give stronger regularisation of the integrator step size towards the (usually relatively large) value exp(µ). A common problem for both constrained and standard hmc chains in this setting was a tendency for a large proportion of rejections in the first few iterations of the warm-up phase. This tended to cause the step size to be adapted to very small values initially, which in turn led to many integrator steps being taken per iteration, and so very long computation times for these initial iterations. Increasing γ so that the regularization has a stronger effect in these initial iterations reduced this behaviour, while still leading to final adapted integrator step sizes that gave acceptance rates close to the target value.
In all cases the acceptance rate statistic used to tune the step size was the mean of the Metropolis acceptance probabilities of moves from the initial state of the transition to each of the states on the generated trajectory, with the statistic set to zero if the trajectory was terminated due to a convergence error in the integrator or if a non-reversible step was detected.
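The dual-averaging recursion of Hoffman and Gelman (2014, Section 3.2) with the parameters above can be sketched as follows; this is a generic transcription of the published algorithm, not code from the paper's implementation:

```python
import numpy as np

def dual_averaging_step_size(accept_stats, eps0, delta=0.8, gamma=0.1,
                             kappa=0.75, t0=10):
    """Adapt the integrator step size so the per-iteration acceptance
    rate statistic approaches the target delta, regularising log step
    sizes towards mu = log(10 * eps0). Returns the averaged step size."""
    mu = np.log(10 * eps0)
    h_bar = 0.0        # running estimate of (delta - acceptance rate)
    log_eps_bar = 0.0  # averaged log step size
    for m, alpha in enumerate(accept_stats, start=1):
        h_bar += (delta - alpha - h_bar) / (m + t0)
        log_eps = mu - np.sqrt(m) * h_bar / gamma
        eta = m ** (-kappa)
        log_eps_bar = eta * log_eps + (1 - eta) * log_eps_bar
    return float(np.exp(log_eps_bar))
```

Persistently low acceptance statistics drive the adapted step size down, while persistently high ones drive it up; the γ = 0.1 value above slows how far early rejections can pull log ε away from µ.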

Metric (mass matrix) adaptation
For the standard hmc chains, in addition to the integrator step size, a diagonal metric matrix representation (mass matrix) was also adaptively tuned during the warm-up phase of the chains; the constrained hmc chains used a fixed identity metric in all cases. Directly following the implementation used in Stan (Carpenter et al., 2017), the 500 iterations in the warm-up phases of the chains were split into a number of sub-intervals. In an initial interval of 75 iterations only the step-size adaptation was active, with the metric initialised to the identity. A sequence of four growing intervals of 25, 50, 100 and 200 iterations was then used to tune the diagonal metric (with step size adaptation also active): the marginal empirical posterior variances were computed from the chain samples in each interval, and the reciprocals of these values were used to set the diagonal metric at the end of each interval (with the estimates starting afresh in each interval). In a final interval of 50 iterations only the step size was adapted, with the metric fixed.
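The Stan-style warm-up schedule and variance-based metric update described above amount to the following sketch; the window sizes are taken from the text, while the names are illustrative:

```python
import numpy as np

# Warm-up schedule for the 500 warm-up iterations, following the Stan
# implementation: step size only, then four growing metric-adaptation
# windows, then step size only again with the metric fixed.
WARMUP_WINDOWS = [75, 25, 50, 100, 200, 50]
assert sum(WARMUP_WINDOWS) == 500

def diagonal_metric_from_window(window_samples):
    """At the end of each metric-adaptation window, set the diagonal
    metric (mass matrix) entries to the reciprocals of the marginal
    empirical posterior variances of the samples from that window."""
    variances = np.var(np.asarray(window_samples), axis=0, ddof=1)
    return 1.0 / variances
```

Restarting the variance estimates in each window means later, longer windows dominate, so the final metric reflects samples drawn with an already partially tuned step size.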

Dynamic Hamiltonian Monte Carlo algorithm
The constrained and standard hmc chains both used a dynamic hmc implementation that iteratively expands a binary tree corresponding to the current simulated trajectory, integrating forward and backward in time until a termination criterion is satisfied on the outermost states of the current sub-trees, or the overall tree depth reaches some predetermined maximum. The algorithm used is the same as that underlying the default hmc implementation in Stan as of version 2.21. This is based on the No-U-Turn Sampler (nuts) algorithm proposed by Hoffman and Gelman (2014), with updates to use a more efficient multinomial scheme to sample states from the generated trajectory (Betancourt, 2017), a generalized termination criterion (Betancourt, 2013), and checks of the termination criterion on additional sub-trees for improved robustness.
A maximum binary tree depth of 10 (giving a maximum of 2¹⁰ − 1 = 1023 integrator steps per trajectory) was used for the constrained hmc chains, and a maximum binary tree depth of 20 (giving a maximum of 2²⁰ − 1 = 1 048 575 steps per trajectory) was used for the standard hmc chains. The larger value used for the tree depth for the standard hmc chains was motivated by the greater range of integrator step sizes required for standard hmc, with smaller step sizes requiring a greater maximum tree depth to avoid the trajectory repeatedly being terminated early because the maximum tree depth was reached. For the constrained hmc chains a maximum tree depth of 10 was already sufficiently large for this saturation to not occur in any of the experiments.
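The step counts quoted above follow from the binary tree structure: a trajectory tree of depth d contains 2^d states, reached by at most 2^d − 1 integrator steps from the initial state:

```python
def max_integrator_steps(max_tree_depth):
    """Maximum number of leapfrog integrator steps in a dynamic hmc
    trajectory whose binary tree is expanded to the given depth."""
    return 2 ** max_tree_depth - 1

# Tree depth limits used for the constrained and standard hmc chains.
assert max_integrator_steps(10) == 1023
assert max_integrator_steps(20) == 1_048_575
```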
As in the original nuts algorithm (Hoffman and Gelman, 2014), if at any point during the trajectory expansion the Hamiltonian at the current state minus the Hamiltonian at the initial state exceeded a threshold ∆_max = 1000, indicating that the numerical integrator had become unstable and diverged, the trajectory was terminated and the chain state set to the initial state at the beginning of the transition (that is, a rejection occurred). Such early terminations and rejections were also triggered in the case of the constrained hmc chains if a projection step failed to converge or a reversibility check indicated a step was non-reversible.
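The rejection rule above can be written as a small predicate; the function name and the flag arguments for the constrained-hmc failure conditions are illustrative, as in practice those conditions would be raised by the integrator itself:

```python
def should_reject_trajectory(h_current, h_initial, delta_max=1000.0,
                             projection_failed=False, non_reversible=False):
    """Return True if the trajectory should be terminated and the
    transition rejected: either the Hamiltonian has increased by more
    than delta_max relative to the initial state (a divergence), or,
    for constrained hmc, a projection step failed to converge or a
    step was detected to be non-reversible."""
    diverged = (h_current - h_initial) > delta_max
    return diverged or projection_failed or non_reversible
```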