Bayesian Nonparametric Vector Autoregressive Models

Vector autoregressive (VAR) models are the main workhorse for macroeconomic forecasting, and provide a framework for the analysis of the complex dynamics present between macroeconomic variables. Whether a classical or a Bayesian approach is adopted, most VAR models are linear with Gaussian innovations, which can limit a model's ability to explain the relationships in macroeconomic series. We propose a nonparametric VAR model that allows for nonlinearity in the conditional mean, heteroscedasticity in the conditional variance, and non-Gaussian innovations. Our approach differs from that of previous studies in modelling the stationary and transition densities using Bayesian nonparametric methods. Our Bayesian nonparametric VAR (BayesNP-VAR) model is applied to US and UK macroeconomic time series, and compared to other Bayesian VAR models. We show that BayesNP-VAR is a flexible model that is able to account for nonlinear relationships as well as heteroscedasticity in the data. In terms of short-run out-of-sample forecasts, we show that BayesNP-VAR predictively outperforms competing models.


Introduction
Introduced by Sims (1980), vector autoregressive (VAR) models provide a systematic way of capturing the dynamics and interactions of multiple time series. In its basic form, the $L$-lag VAR model represents a $p$-dimensional vector of variables measured at time $t$, $y_t = (y_{t,1}, \dots, y_{t,p})'$, as a linear combination of past realisations,
$$
y_t = \sum_{l=1}^{L} B_l\, y_{t-l} + e_t, \qquad (1)
$$
where $\{B_l\}_{l=1}^{L}$ are $(p \times p)$-dimensional matrices of unknown coefficients, and $e_t = (e_{1,t}, \dots, e_{p,t})'$ is a $(p \times 1)$-dimensional innovation vector with distribution $\mathrm{N}(0, \Sigma)$. VAR models have emerged as a benchmark for the analysis of dynamic macroeconomic problems. The linear representation of the variables' joint dynamic behaviour facilitates the study of the effects of shocks (such as monetary and fiscal policy shocks) through the computation of impulse response functions and forecast error variance decompositions (see Lucas, 1980; Pagan, 1997; Stock and Watson, 1999; Diebold and Rudebusch, 2001).
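To fix ideas, the linear Gaussian VAR in equation (1) can be simulated directly. The sketch below uses illustrative coefficient and covariance matrices (assumptions for the example, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_var(B, Sigma, T, burn=100):
    """Simulate y_t = sum_l B[l] @ y_{t-l} + e_t with e_t ~ N(0, Sigma)."""
    L = len(B)
    p = B[0].shape[0]
    chol = np.linalg.cholesky(Sigma)
    y = np.zeros((T + burn, p))
    for t in range(L, T + burn):
        e = chol @ rng.standard_normal(p)
        y[t] = sum(B[l] @ y[t - 1 - l] for l in range(L)) + e
    return y[burn:]          # discard the burn-in so the start-up values wash out

# Illustrative stationary VAR(1) with p = 2 (coefficients are assumptions)
B = [np.array([[0.5, 0.1],
               [0.0, 0.4]])]
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
y = simulate_var(B, Sigma, T=500)
```

The burn-in period lets the simulated path forget the arbitrary zero initial conditions before the sample is recorded.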
Despite their popularity, there have been criticisms of the use of VAR models in macroeconomic analysis. When p is large there is a risk of overfitting the data, which leads to imprecise inference and erratic model forecasts. In addition, the linearity, stationarity, Gaussian innovations and constant conditional mean and variance of these models can be considered unrealistic. For example, empirical evidence suggests that macroeconomic variables may have nonlinear relationships (see Granger and Terasvirta (1994)), the nature of shocks may not be Gaussian (Weise (1999)), and the effects of these shocks may not be linear (see Ravn and Sola (2004) and Matthes and Barnichon (2015)). In the last two decades these criticisms have been addressed by: using adaptations of the parametric model given in equation (1), such as regime-switching and threshold-crossing behaviour; introducing time-varying coefficient models with or without stochastic volatility; and, more recently, using nonparametric methods and considering non-Gaussian innovations. Both regime-switching and threshold models are motivated by empirical evidence that many macroeconomic time series behave differently during different time periods (for example, in economic downturns and in expansions), which are often called regimes. Both approaches assume that there is a small number of regimes, each of which can be accurately modelled by a VAR.
The mechanism for the change between regimes is the key difference between the two approaches. Hamilton (1989) popularised Markov-switching regression, where the change in regimes is driven by latent (unobservable) stochastic variables, usually with a Markov structure. Literature on these models has subsequently grown, see e.g. Hansen (1992), Chib (1996), Chauvet (1998), Kim and Nelson (1999), Kim et al. (2005), and Sims and Zha (2006). Beaudry and Koop (1993), Teräsvirta (1994), Potter (1995), and Pesaran and Potter (1997) popularised vector threshold autoregressive (VTAR) and vector smooth transition autoregressive (VSTAR) models. Unlike Markov-switching regressions, regime changes in a VTAR model occur if some function (often, a linear function) of the observable macroeconomic variables crosses a threshold, whereas VSTAR models use a weighted average of VAR's in which the weighting depends on a continuous, nonlinear function of the previous lags. For a comprehensive survey see Hubrich and Teräsvirta (2013). In contrast, time-varying vector autoregressions are a class of models in which the system's conditional mean and/or variance are allowed to vary over time. This is achieved by modelling the VAR parameters (coefficients and innovation covariance matrix) with a linear time series model, often a random walk or an AR(1) process.
Notable work in this area is presented in Stock and Watson (1996, 2001, 2002, Sargent (2001, 2005a), and Primiceri (2005). See Koop and Korobilis (2010) for a recent review 3 of these methods. An alternative approach to modelling the joint dynamic behaviour are nonparametric methods. Härdle et al. (1998)  Geweke and Keane (2007) state that answering interesting questions in economics, from macroeconomic policy to the evaluation of economic welfare, often requires the entire conditional distribution p(y|x). In this paper, we introduce a novel stationary model for multivariate time series where the stationary and transition densities are directly modelled using Bayesian nonparametric methods, which place a prior on an infinite dimensional parameter space and adapt their complexity to the data (see Hjort et al., 2010, for a book length review of Bayesian nonparametric methods). The Bayesian nonparametric approach to density estimation requires a prior on distributions with random smooth densities. We use a Dirichlet process mixture (DPM), the most popular of these priors, which is an infinite mixture model, with 4 the Dirichlet process as the mixing measure. There are several advantages to using Bayesian nonparametric methods. Unlike classical nonparametric methods, there is no need to tune any smoothing parameters. Uncertainty in the unknown density can be expressed through the posterior. The out-of-sample predictive performance of models where the conditional density is estimated using the Bayesian nonparametric approach is superior to other competitive models, see Norets and Pati (2017). Adopting the DPM prior, allows us to construct a mixture of VAR's which can be viewed as a multivariate mixture-of-experts. Introduced by Jacobs et al. (1991) and Jordan and Jacobs (1994), mixture-of-experts models focus on estimating the conditional predictive density p(y|x) for all x where y is univariate (discrete or continuous) and x a high dimensional set of covariates. 
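A DPM prior can be visualised by drawing a random density from it. The sketch below draws a truncated stick-breaking approximation to a DPM of normals; the mass parameter, base measure and kernel scale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def dpm_draw(M=1.0, K=50, base_mean=0.0, base_sd=3.0, comp_sd=0.5):
    """Draw a (truncated) random mixture density from a DPM of normals.

    Stick-breaking: w_j = V_j * prod_{m<j}(1 - V_m), V_j ~ Beta(1, M);
    component locations mu_j are iid draws from the base measure H.
    Truncation at K components means the weights sum to slightly less than 1.
    """
    V = rng.beta(1.0, M, size=K)
    w = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    mu = rng.normal(base_mean, base_sd, size=K)

    def density(x):
        x = np.atleast_1d(x)[:, None]
        return (w * np.exp(-0.5 * ((x - mu) / comp_sd) ** 2)
                / (comp_sd * np.sqrt(2 * np.pi))).sum(axis=1)

    return w, mu, density

w, mu, f = dpm_draw()
grid = np.linspace(-10, 10, 401)
vals = f(grid)
```

Small values of M concentrate the weights on a few components; large M spreads mass over many components, which is how the prior's complexity adapts to the data.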
They are extensions of mixture regression models that allow for covariates in the mixture weights; Geweke and Keane (2007) and Villani et al. (2009) develop such models in econometric applications. The paper is organised as follows: Section 2 introduces the Bayesian non-parametric VAR (BayesNP-VAR) model, describes its construction and considers some of its properties. Section 3 provides an overview of the required Markov chain Monte Carlo (MCMC) method for fitting this model (the full steps of the MCMC sampler are described in Appendix A). Section 4 illustrates the ability of the BayesNP-VAR model to identify regimes and changes in regimes using simulated data, provides an empirical illustration using US macroeconomic time series, and compares the out-of-sample predictive performance of the BayesNP-VAR to the parametric BVAR and TVP-VAR-SV using both US and UK macroeconomic series. Section 5 summarises our findings and conclusions.

The Bayesian non-parametric vector autoregressive (BayesNP-VAR) model
We construct a multivariate time series model in which the stationary and transition densities are infinite mixtures. Antoniano-Villalobos and Walker (2016) define such a model for a univariate stationary time series. Their prior has full support for the transition density and stationary density (i.e. any transition density and stationary density can be represented arbitrarily well by the prior). We extend their work to multivariate stationary time series and we call our model Bayesian nonparametric VAR (BayesNP-VAR).
The transition densities of the BayesNP-VAR model are derived from a joint distribution for $y_t$ and its $L$ lags $y_t^L = (y_{t-1}, \dots, y_{t-L})$, which is expressed as an infinite mixture. This ensures that the stationary distribution is known and also has the form of an infinite mixture. Specifically, the joint density of $y_t$ and its $L$ lags $y_t^L$ is
$$
p(y_t, y_t^L) = \sum_{j=1}^{\infty} w_j\, k(y_t, y_t^L \mid \theta_j), \qquad (2)
$$
where $k(y_t, y_t^L \mid \theta_j)$ is a $((L+1)p)$-dimensional probability density function which does not depend on $t$, and the $\theta_j$ are the locations of the mixture components with $\theta_j \overset{iid}{\sim} H$. We assume that $k(y_{t-i}, \dots, y_{t-i-\kappa} \mid \theta_j)$ for $i = 0, \dots, L-\kappa$ and $\kappa = 0, \dots, L-1$ depends on $\kappa$ only (which can be achieved by assuming that $k(y_t, y_t^L \mid \theta_j)$ is the joint distribution of a stationary process), to ensure that the overall process is stationary. The mixture weights $w_j$ are defined through the stick-breaking construction
$$
w_j = V_j \prod_{m<j} (1 - V_m), \qquad V_j \overset{iid}{\sim} \mathrm{Be}(1, M).
$$
Assuming that the locations $\theta_j$ are independent of the weights $w_j$, the model in (2) defines a Dirichlet process mixture (Sethuraman, 1994). The distribution of the locations, $H$, is often referred to as the "base measure", the choice of which determines the likely location of the components. The parameter $M$ controls the distribution of the weights and hence the number of components with non-negligible weight. The joint density in (2) leads to a transition density that is also an infinite mixture with the following form:
$$
p(y_t \mid y_t^L) = \sum_{j=1}^{\infty} \omega_j(y_t^L)\, k(y_t \mid y_t^L, \theta_j), \qquad (3)
$$
where $k(y_t \mid y_t^L, \theta_j)$ is the transition density of the $j$-th component and
$$
\omega_j(y_t^L) = \frac{w_j\, k(y_t^L \mid \theta_j)}{\sum_{m=1}^{\infty} w_m\, k(y_t^L \mid \theta_m)}
$$
is the weight of the $j$-th component, which depends on the previous lags, the key feature of our model. We can therefore refer to the transition density as a multivariate mixture of experts.
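The lag-dependent weights renormalise the prior weights by how well each component's marginal density explains the observed lags. A minimal sketch, with a scalar lag, assumed Gaussian lag densities and made-up stick-breaking weights:

```python
import numpy as np

def expert_weights(w, lag_dens):
    """omega_j(y^L) = w_j * k(y^L | theta_j) / sum_m w_m * k(y^L | theta_m)."""
    num = np.asarray(w) * np.asarray(lag_dens)
    return num / num.sum()

# Toy example: 3 components, Gaussian lag densities with illustrative means
w = np.array([0.5, 0.3, 0.2])          # prior stick-breaking weights
mu_lag = np.array([-2.0, 0.0, 2.0])    # assumed component means for the lag
y_lag = 1.8
k_lag = np.exp(-0.5 * (y_lag - mu_lag) ** 2) / np.sqrt(2 * np.pi)
omega = expert_weights(w, k_lag)
# The component whose lag density best matches y_lag dominates
```

Even though the third component has the smallest prior weight, it receives almost all of the posterior weight here because the observed lag sits close to its mean: this is the mixture-of-experts behaviour that lets different components govern different regions of the lag space.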
Mixtures of experts are extensions of smooth regression models and are popular within the machine learning community. They are used in regression to estimate the conditional density $p(y|x)$ of a univariate $y$ for all values of an (often high-dimensional) covariate $x$, using mixtures in which the component weights depend on $x$; see Jacobs et al. (1991), Jordan and Jacobs (1994), Geweke and Keane (2007) and Villani et al. (2012). The weights of the transition density in equation (3) play the same role, with the lags $y_t^L$ acting as the covariates. To complete the BayesNP-VAR we need to choose $k(y_t, y_t^L \mid \theta_j)$. We propose a model which has a structure similar to a factor model and divides the variation of the data into a part which describes the dependence between variables and a part which is idiosyncratic to each variable: a matrix $B$ describes the dependence between variables over time and $Q$ describes the idiosyncratic variation, with $-1 < \rho_k < 1$ and $\xi_k^{-1} \sim \mathrm{Ga}(\nu/2, \nu/2)$ for $k = 1, \dots, p$. The component-specific parameters in the mixture model are $\theta = (\mu, S, \Lambda, \rho, \tilde{\rho}, \phi, \delta)$, where $\rho = (\rho_1, \dots, \rho_q)$, $\tilde{\rho} = (\tilde{\rho}_1, \dots, \tilde{\rho}_p)$ and $\delta = (\delta_1, \dots, \delta_q)$. We assume that $\mu$, $S$, $\Lambda$, $\rho$, $\tilde{\rho}$, $\phi$ and $\delta$ are a priori independent under $H$. The parameters $\mu$ and $S$ are given informative prior densities to avoid the mixture model placing mass in areas which are not plausible. We choose $h_\mu(\mu) = \mathrm{N}(\mu \mid \mu_0, \Sigma_0)$. Both hyperparameters can be chosen using prior information, but in the empirical examples in this paper we use the data-dependent choices $\mu_0 = \bar{y}$ and $\Sigma_0 = 1.5^2\, \hat{\Sigma}$, where $\bar{y}$ and $\hat{\Sigma}$ are the sample mean and covariance matrix of $y_t$. This choice leads to a prior for $\mu$ which is slightly overdispersed relative to the distribution of the data, so that the components are located in regions within or close to the data.
For the $(p \times p)$-dimensional scaling matrix, $S = \mathrm{diag}(s_1, s_2, \dots, s_p)$, we choose a hierarchical prior in which the hyperparameter $\zeta_i$ is shared by all components and is an estimate of the overall scale of the $i$-th variable. This hierarchical structure allows different components to have similar scales, the $s_i$'s, for each variable.
We use the multiplicative gamma process shrinkage prior of Bhattacharya and Dunson (2011) for $\Lambda$. This allows the complexity of $B$ to adapt to the data. Under this prior, $\Lambda_{i,z} \sim \mathrm{N}(0, \phi_{i,z}^{-1} \tau_z^{-1})$ for $i = 1, \dots, p$ and $z = 1, \dots, q$, where $\phi_{i,z} \sim \mathrm{Ga}(\nu/2, \nu/2)$ and $\tau_z = \prod_{i=1}^{z} \delta_i$, with $\delta_1 \sim \mathrm{Ga}(1, 1)$ and $\delta_z \sim \mathrm{Ga}(3, 1)$ for $z \geq 2$. The $\delta_z$'s are independent; the $\tau_z$'s are viewed as the global shrinkage parameters of the columns, while the $\phi_{i,z}$'s are the local shrinkage parameters for the $z$-th column. As the value of $\delta_z$ increases, so does the value of $\tau_z$, favouring smaller values of $\Lambda_{i,z}$.
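A draw of the loadings under the multiplicative gamma process prior can be sketched as follows (the dimensions p, q and the degrees of freedom are illustrative; note that NumPy's gamma is parameterised by shape and scale, so a Ga(a, rate b) draw uses scale 1/b):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_mgp_loadings(p=8, q=5, nu=3.0):
    """Draw a loadings matrix Lambda under the multiplicative gamma process
    shrinkage prior of Bhattacharya and Dunson (2011):
      Lambda[i,z] ~ N(0, 1/(phi[i,z] * tau[z])),
      phi[i,z] ~ Ga(nu/2, rate nu/2),
      tau[z] = prod_{i<=z} delta[i], delta[1] ~ Ga(1,1), delta[z>=2] ~ Ga(3,1).
    """
    phi = rng.gamma(nu / 2, 2 / nu, size=(p, q))       # local shrinkage
    delta = np.concatenate(([rng.gamma(1.0, 1.0)],
                            rng.gamma(3.0, 1.0, size=q - 1)))
    tau = np.cumprod(delta)                            # global column shrinkage
    Lam = rng.normal(0.0, 1.0 / np.sqrt(phi * tau), size=(p, q))
    return Lam, tau

Lam, tau = sample_mgp_loadings()
```

Because each $\delta_z$ for $z \geq 2$ has prior mean 3, the cumulative products $\tau_z$ tend to grow with the column index, so later columns of the loadings matrix are shrunk increasingly hard towards zero.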
The parameters $\rho$ and $\tilde{\rho}$ control dependence across time in both $B$ and $Q$. We choose independent uniform priors on the range that implies stationarity and positive autocorrelation, so that each autoregressive parameter has a uniform density between 0 and 1. For the prior of $M$, the parameter controlling the number of components, we choose the standard exponential distribution. We find that this choice strikes a balance between having too many and too few components.
In practice, the infinite mixture is truncated to $K$ components with weights $w_j = V_j \prod_{m<j}(1 - V_m)$, where $V_j \overset{iid}{\sim} \mathrm{Be}(1, M)$ and $\theta_j \overset{iid}{\sim} H$. This finite mixture model defines a sequence of posteriors of the form $\pi_K(\theta_{1:K}, \eta_{1:K} \mid y)$, where $\eta_{1:K} = (V_{1:K}, \zeta, M)$. The adaptive truncation method of Griffin (2016) uses an MCMC algorithm to sample from the posterior $\pi_{K_0}(\theta_{1:K_0}, \eta_{1:K_0} \mid y)$ for a user-defined starting value $K_0$, and then uses a sequential Monte Carlo method to sample from the sequence of posteriors with increasing truncation levels up to $K_0 + D$, where $D$ is chosen to avoid large truncation errors. The adaptive truncation scheme follows the algorithm below.
Full details of the MCMC algorithm are provided in Appendix A. The MCMC algorithm uses two types of adaptive Metropolis-Hastings algorithms, which are briefly reviewed here. The first is the adaptive random walk Metropolis-Hastings algorithm (Atchadé and Rosenthal, 2005) with a normal proposal whose variance is updated during the running of the chain.
Suppose that $\sigma_t^2$ is the proposal variance used at iteration $t$; the proposal variance at the next iteration is then adapted using the acceptance probability $\alpha_t$ of the Metropolis-Hastings algorithm at the $t$-th iteration, so that the average acceptance rate converges to a target value. Atchadé and Rosenthal (2005) show that this algorithm is ergodic. The second algorithm is the Adaptive Scaling Within Adaptive Metropolis (ASWAM) algorithm (Andrieu and Moulines, 2006; Atchadé and Fort, 2010). This is suitable for updating multiple parameters jointly. Suppose we wish to sample a parameter $\lambda$; the proposed value $\lambda'$ at the $t$-th iteration is drawn from $\mathrm{N}(\lambda, s_t \Sigma_t)$, where $s_t$ is a scalar and $\Sigma_t$ is the sample covariance matrix of the first $t-1$ sampled values of $\lambda$. The scale $s_t$ is updated using the recursion $s_{t+1} = s_t + t^{-0.6}(\alpha_t - 0.234)$ where, again, $\alpha_t$ is the acceptance probability of the Metropolis-Hastings algorithm at the $t$-th iteration.
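An adaptive random-walk Metropolis-Hastings step of this kind can be sketched as follows. The block adapts the log proposal scale with the same Robbins-Monro recursion (step size $t^{-0.6}$, target acceptance 0.234) on a toy standard-normal target; the target density and tuning constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def adaptive_rwmh(logpost, x0, n_iter=5000, target=0.234):
    """Adaptive random-walk Metropolis-Hastings sketch.

    The log proposal scale follows l_{t+1} = l_t + t^{-0.6} * (alpha_t - target),
    so the realised acceptance rate drifts towards the target.
    """
    x, lp = x0, logpost(x0)
    log_s, draws, acc = 0.0, np.empty(n_iter), 0
    for t in range(1, n_iter + 1):
        prop = x + np.exp(log_s) * rng.standard_normal()
        lp_prop = logpost(prop)
        alpha = min(1.0, np.exp(lp_prop - lp))
        if rng.random() < alpha:
            x, lp, acc = prop, lp_prop, acc + 1
        log_s += t ** -0.6 * (alpha - target)   # Robbins-Monro adaptation
        draws[t - 1] = x
    return draws, acc / n_iter

# Sample from a standard normal as a toy target
draws, rate = adaptive_rwmh(lambda z: -0.5 * z * z, x0=0.0)
```

The diminishing step size $t^{-0.6}$ satisfies the diminishing-adaptation condition, which is what makes adaptation of this kind compatible with ergodicity of the chain.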

Illustrations
In this section we apply the BayesNP-VAR model to both simulated and empirical data. Our aim is to demonstrate that our model identifies economic regimes where shocks are transmitted in different ways, that it clearly indicates changes between regimes, and to provide evidence of the model's good out-of-sample predictive performance. The posterior mode is useful for presenting the inference from our model, but it ignores posterior uncertainty. To provide IRF's which include posterior uncertainty, we also produce the IRF's suggested by Koop et al. (1996), which we will refer to as Generalised IRF's (GIRF's).
These are used to study the effect of a shock of size $\upsilon$ at time $t$ on the state of the system at time $t+n$, given that there are no other shocks to the system after time $t$. They are defined as
$$
GI_y(n, \upsilon, Y_{t-1}) = \mathrm{E}\left[y_{t+n} \mid e_t = \upsilon, e_{t+1} = \dots = e_{t+n} = 0, Y_{t-1}\right] - \mathrm{E}\left[y_{t+n} \mid e_t = 0, e_{t+1} = \dots = e_{t+n} = 0, Y_{t-1}\right]
$$
for $n = 1, 2, 3, \dots$, where $e_t, e_{t+1}, \dots, e_{t+n}$ represent the shocks at times $t, t+1, \dots, t+n$ and the expectations are taken conditional on the parameter values. This allows us to look at posterior distributions of $GI_y(n, \upsilon, Y_{t-1})$ to assess uncertainty in their estimation.
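A GIRF can be approximated by Monte Carlo simulation. The sketch below follows the variant of Koop et al. (1996) that averages over future shocks rather than setting them to zero (a modelling choice made here for illustration); the `step` interface and the linear toy model are assumptions, chosen because in the linear case the GIRF reduces to the ordinary IRF $B^n \upsilon$:

```python
import numpy as np

rng = np.random.default_rng(4)

def girf(step, y_hist, shock, horizon, n_sims=2000):
    """Monte Carlo GIRF: the average difference between a shocked and a
    baseline path that share the same future shock draws.
    `step(y_prev, e)` advances the model one period."""
    p = len(y_hist)
    out = np.zeros((horizon + 1, p))
    for _ in range(n_sims):
        e0 = rng.standard_normal(p)
        ys, yb = step(y_hist, e0 + shock), step(y_hist, e0)  # shocked / baseline
        out[0] += ys - yb
        for n in range(1, horizon + 1):
            e = rng.standard_normal(p)        # common random numbers cancel noise
            ys, yb = step(ys, e), step(yb, e)
            out[n] += ys - yb
    return out / n_sims

# Linear VAR(1) toy model, where the GIRF reduces to B^n @ shock
B = np.array([[0.5, 0.2], [0.0, 0.4]])
step = lambda y, e: B @ y + e
g = girf(step, y_hist=np.zeros(2), shock=np.array([1.0, 0.0]), horizon=5)
```

Using common random numbers for the shocked and baseline paths means the simulation noise cancels exactly in the linear case, and is greatly reduced in nonlinear models such as the BayesNP-VAR.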
For the out-of-sample predictive performance illustration we calculate the log-predictive score (LPS) for all the variables jointly (using their joint predictive distribution) and for each variable separately (using their marginal predictive distributions). We also calculate the root mean squared error (RMSE) for each variable. The joint LPS is given by
$$
\mathrm{LPS} = -\sum_{i=s}^{T-h} \log p(y_{i+h} \mid y_1, \dots, y_i), \qquad (6)
$$
where $T$ is the size of the time series, $s$ is the time from which the prediction starts, and $h$ is the predictive horizon. Similarly, the marginal LPS for the $j$-th variable is given by
$$
\mathrm{LPS}_j = -\sum_{i=s}^{T-h} \log p(y_{i+h,j} \mid y_1, \dots, y_i). \qquad (7)
$$
The RMSE of the $j$-th variable is given by
$$
\mathrm{RMSE}_j = \sqrt{\frac{1}{T-h-s+1} \sum_{i=s}^{T-h} \left(y_{i+h,j} - \hat{y}_{i+h,j}\right)^2}, \qquad (8)
$$
where $\hat{y}_{i+h,j}$ is the point forecast of $y_{i+h,j}$. For both measures, smaller scores indicate better predictive performance. We looked at $h = 1$, 2 and 4 months. We compare the out-of-sample predictive performance of our BayesNP-VAR model with other Bayesian VAR specifications: the stationary BVAR model with the independent Normal-Wishart prior (with either one, two, three or four lags), and the non-stationary TVP-SV-VAR model.
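Both metrics are straightforward to compute once the predictive densities and point forecasts are available. A sketch, assuming the smaller-is-better negative log-score convention used above:

```python
import numpy as np

def log_predictive_score(log_pred_dens):
    """Joint LPS: minus the sum of h-step-ahead log predictive densities
    (smaller is better). log_pred_dens[i] is log p(y_{i+h} | y_1..y_i)."""
    return -np.sum(log_pred_dens)

def rmse(actual, forecast):
    """Root mean squared error of the h-step-ahead point forecasts."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.sqrt(np.mean((actual - forecast) ** 2))

# Toy check with made-up numbers
lps = log_predictive_score([-1.0, -2.5])
err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

In a real forecasting exercise the log predictive densities would be estimated from the MCMC output, e.g. by averaging the component transition densities over posterior draws.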

Simulated data
The following simulated example illustrates the ability of the BayesNP-VAR to correctly identify regimes and the timing of regime switches. We generated data from a threshold VAR(2) model with p = 3 variables and 500 time points. The data had two regimes, following the VAR of Regime 1 if $y_{t-1,1} > 0$ and of Regime 2 otherwise; the threshold, indicated by a dashed line in the figure, was set at $y_{t-1,1} = 0$. We can clearly see that the estimated regimes change correctly as the value of $y_{t-1,1}$ changes. The second row of plots displays the weight of the regime (which is the same for all variables), and shows that the regimes are correctly identified with probability close to 1.
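Data of this kind are simple to generate. The sketch below simulates a two-regime threshold VAR(2) with the threshold rule described above; the coefficient matrices are illustrative assumptions, not the specification used in the paper:

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_tvar2(B1, B2, T=500, p=3, burn=100):
    """Simulate a two-regime threshold VAR(2): regime 1 applies when
    y_{t-1,1} > 0, regime 2 otherwise. B1 and B2 each hold two (p x p)
    lag matrices."""
    y = np.zeros((T + burn, p))
    for t in range(2, T + burn):
        B = B1 if y[t - 1, 0] > 0 else B2       # threshold on the first variable
        y[t] = B[0] @ y[t - 1] + B[1] @ y[t - 2] + rng.standard_normal(p)
    return y[burn:]

I3 = np.eye(3)
B1 = [0.5 * I3, -0.2 * I3]   # assumed regime-1 coefficients (stationary)
B2 = [-0.4 * I3, 0.1 * I3]   # assumed regime-2 coefficients (stationary)
y = simulate_tvar2(B1, B2)
```

Because the regime depends on the sign of the first variable's previous value, the process switches endogenously, which is exactly the behaviour the mixture weights $\omega_j(y_t^L)$ need to track.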
We also simulated data from a VAR(2) model with p = 3 variables and 500 time points, with the specification given in Appendix B. We do not display the results here, but our model correctly identifies that there is only one regime. Both the simulated threshold VAR(2) and VAR(2) data are used in Section 4.3 as part of our out-of-sample forecasting exercise.

Data sets
We constructed two macroeconomic data sets, one for the US and one for the UK, based on the series described and the transformations carried out in Carriero et al. (2015). Both data sets include seasonally adjusted monthly time series; the series included are listed in Table 1 for the US and Table 2 for the UK respectively. We include the series in both growth rates and levels.
For the illustrations in Sections 4.2.2 and 4.2.3 we used the US data, in growth rates. For the out-of-sample predictive exercise we have used both the US and UK data in growth rates and levels.

Component/Regime identification
Applying our BayesNP-VAR model to the US data in growth rates, we identify eight mixture components, and here we present the six which had the largest non-negligible weights. A key difference between two of these regimes is that the latter was characterised by high inflation.

Impulse Response Functions
Each component of our BayesNP-VAR model follows a VAR, and so we can produce component/regime-dependent IRF's as polynomial functions of the estimated VAR parameters. All IRF's look 60 months ahead. Figures 5 and 6 display the IRF's. Figure 6 displays the IRF's of the federal funds rate to a 1% increase in inflation (panel (a)) and to a 1% increase in the unemployment rate (panel (b)). During the Volcker disinflation regime, the rate reacts quickly to both an inflationary and an unemployment shock, though the length of time before stabilising is longer for the latter (20 rather than 10 months). A similar pattern can be seen in the "Golden Era" regime; however, the response to the unemployment shock is more prolonged, with a sharp decline over the first 5 months, while the rate response to the inflationary shock is not as extreme. In the other four regimes the inflationary shock has little impact, but the unemployment shock leads to a steady decline in the federal funds rate in the period of stable inflation and output growth, and to a marginal decline over the first 5 months before levelling off in the 2007 US housing crisis regime.
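For a single VAR component, the IRFs can be computed from the companion form of the model: the response at horizon n is the top-left block of the companion matrix raised to the n-th power. A sketch with illustrative VAR(2) coefficients (assumptions, not the paper's estimates):

```python
import numpy as np

def irf(B_list, horizons):
    """Impulse responses of a VAR(L) via its companion form.

    The companion matrix stacks the lag matrices in its first block row and
    an identity below, so its n-th power's top-left (p x p) block gives the
    horizon-n response to a unit shock.
    """
    L = len(B_list)
    p = B_list[0].shape[0]
    C = np.zeros((L * p, L * p))
    C[:p] = np.hstack(B_list)
    if L > 1:
        C[p:, :-p] = np.eye((L - 1) * p)   # identity block shifts the lags down
    out, Cn = [], np.eye(L * p)
    for _ in range(horizons + 1):
        out.append(Cn[:p, :p].copy())
        Cn = Cn @ C
    return out

# Illustrative VAR(2) coefficients
B = [np.array([[0.5, 0.1], [0.0, 0.4]]),
     np.array([[0.1, 0.0], [0.0, 0.1]])]
Psi = irf(B, horizons=12)
# Psi[0] is the identity; Psi[1] equals the first lag matrix
```

Regime-dependent IRF's are obtained by applying this calculation to each component's posterior coefficient draws, which also yields posterior uncertainty bands.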

Out-of-sample predictive performance
We now compare the models' out-of-sample predictive performance. Our comparison metrics are the log-predictive scores described in equations (6) and (7) and the root mean squared error in equation (8). We used two simulated data sets (from the threshold VAR(2) and VAR(2) models) and four real data sets (US and UK data in growth rates and levels). We consider three predictive horizons: 1 month, 2 months and 4 months, and look at 48 months out-of-sample. This means that for the US data the prediction starts on 1st September 2012, and for the UK data on 1st March 2011.
Tables 3 and 4 display the log-predictive scores and the RMSE's for the simulated VAR(2) data and simulated threshold VAR(2) data respectively. The BayesNP-VAR(1) outperforms the TVP-SV-VAR(1) at all predictive horizons, both for all variables jointly and for each variable marginally, in both the threshold VAR(2) and VAR(2) simulated data. In the VAR(2) data, the overall log-predictive scores and the RMSE's for the BVAR(1) and the BayesNP-VAR(1) are comparable at all horizons. In the threshold VAR(2) data, the BayesNP-VAR(1) has the lowest overall log-predictive scores; it also outperforms the BVAR(1) at horizons 2 and 4 for all variables, and at horizon 1 for variable 1.
Tables 5, 6, 7 and 8 display the log-predictive scores and the RMSE's for the growth rates of the US and UK data respectively; the BayesNP-VAR(1) again performs well relative to the competing models. Tables 9, 10, 11 and 12 display the log-predictive scores and RMSE's for the levels of the US and UK data respectively. Once again, the BayesNP-VAR(1) model produces better overall and marginal log-predictive scores and RMSE's at all time horizons when compared to the TVP-SV-VAR(1), for both the US and UK data. When compared to the BVAR models, it performs better on the UK data. In the US data, the BayesNP-VAR(1) model outperforms the BVAR(1) overall at horizons 1 and 2 but not at horizon 4. On log-predictive scores, the BVAR(1) and BayesNP-VAR(1) models each produce better predictions for some variables, with no clearly superior model across all variables, but the BayesNP-VAR(1) outperforms the BVAR(1) for all variables on RMSE.
To summarise, the BayesNP-VAR(1) outperforms the TVP-SV-VAR(1) at all predictive horizons, in both the simulated and empirical data sets. The non-stationary TVP-SV-VAR accounts for nonlinearity in the conditional mean and heteroskedasticity in the conditional variance, and therefore one would expect it to be better suited to capturing the nonlinear dynamic relationships between variables, leading to better predictive performance. However, for all data sets considered, it is often outperformed by BVAR models. This may be due to the fact that the TVP-SV-VAR does not separate the nonlinearity in the conditional mean and the heteroskedasticity from non-stationarity, which can be a problem when $y_t$ is a stationary process. The transition density of the TVP-SV-VAR depends on the mean $\mu_t$ and the covariance $\Sigma_t$, which are modelled by non-stationary processes. Therefore, when this transition density is not appropriately chosen, the TVP-SV-VAR will not do a good job of approximating the true transition density. We will investigate such models in future work.

Updating V
The full conditional density of $V_j$ is proportional to $(1 - V_j)^{M-1} \prod_{t=L+1}^{T} p_K\left(y_t \mid y_{(t-L):(t-1)}\right)$.
The parameter can be updated using an adaptive random walk Metropolis-Hastings sampler on the logit scale, where a normal proposal is used whose variance is tuned to achieve an acceptance rate of 0.234.

Updating M
The full conditional distribution of $M$ is $\mathrm{Ga}\left(1 + K,\ 1 - \sum_{i=1}^{K} \log(1 - V_i)\right)$.
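Because this full conditional is a gamma distribution (shape and rate parameterisation), the update for $M$ is a one-line Gibbs draw. A sketch, using NumPy's shape/scale parameterisation:

```python
import numpy as np

rng = np.random.default_rng(6)

def update_M(V):
    """Gibbs update for the Dirichlet process mass parameter M under a
    standard exponential prior: with K stick-breaking variables V_j ~ Be(1, M),
    the full conditional is Ga(1 + K, 1 - sum_j log(1 - V_j)), shape and rate.
    NumPy's gamma takes a scale, so we pass 1/rate."""
    V = np.asarray(V)
    K = V.size
    rate = 1.0 - np.sum(np.log1p(-V))     # log1p(-V) = log(1 - V), stable
    return rng.gamma(shape=1.0 + K, scale=1.0 / rate)

M_new = update_M([0.3, 0.5, 0.2])
```

Since each $V_j \in (0,1)$, the terms $-\log(1 - V_j)$ are positive, so the rate is always greater than 1 and the draw is well defined.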