Estimation of a Semiparametric Recursive Bivariate Probit Model with Nonparametric Mixing

We consider an extension of the recursive bivariate probit model for estimating the effect of a binary variable on a binary outcome in the presence of unobserved confounders, nonlinear covariate effects and overdispersion. Specifically, the model consists of a system of two binary outcomes with a binary endogenous regressor which includes smooth functions of covariates, hence allowing for flexible functional dependence of the responses on the continuous regressors, and arbitrary random intercepts to deal with overdispersion arising from correlated observations on clusters or from the omission of non‐confounding covariates. We fit the model by maximizing a penalized likelihood using an Expectation‐Maximisation algorithm. The issues of automatic multiple smoothing parameter selection and inference are also addressed. The empirical properties of the proposed algorithm are examined in a simulation study. The method is then illustrated using data from a survey on health, aging and wealth.


Introduction
Quantifying the effect of a predictor of interest (also referred to as treatment) on a particular response variable is a challenging task in observational studies. This is because it is often the case that confounders which are associated with both treatment and response are either unknown or not readily quantifiable (this problem is known in econometrics as endogeneity of the variable of interest). Moreover, covariate-response relationships can exhibit nonlinear patterns and observations may be overdispersed. In such a context, the use of standard estimators neglecting the aforementioned issues yields inconsistent estimates. In this article, we consider the case in which the researcher is interested in estimating the effect of a binary endogenous variable on a binary outcome in the presence of unobserved confounders, nonlinear covariate-response relationships and overdispersion resulting from either correlations among observations on the same clusters or from the omission of non-confounding covariates.
Instrumental variable techniques are widely used for isolating the effect of a given predictor in the presence of unobserved confounding (e.g. Wooldridge 2010; Marra & Radice 2011b and references therein), and are increasingly used in epidemiological and medical studies (e.g. Goldman et al. 2001 and references therein). In the context of binary responses, it is well known, from both theoretical and empirical results, that bivariate likelihood estimation methods are superior to conventional two-stage instrumental variable procedures (e.g. Bhattacharya et al. 2006;Wooldridge 2010). First introduced by Heckman (1978), the recursive bivariate probit model represents an effective way to estimate the effect a binary regressor has on a binary outcome in the presence of unobservables. The semiparametric version of Heckman's model is an important extension since undetected nonlinearity can have severe consequences on the estimation of covariate effects (e.g. Marra & Radice 2011a). Chib & Greenberg (2007) proposed two Bayesian fitting procedures for the class of instrumental variable models including the semiparametric recursive bivariate probit model. However, as the authors point out, very large sample sizes are required to obtain reasonable estimates of the binary treatment effect, hence undermining the utility of the method for practical modeling. Marra & Radice (2011a) considered the same model and introduced a penalized likelihood based procedure which permits reliable estimation of the model coefficients at reasonably small sample sizes.
The neglect of the possible presence of overdispersion may have a detrimental impact on the estimation of the effect of an endogenous variable. This issue is dealt with by generalising the method of Marra & Radice (2011a) to include random effects, which are generated by unknown densities. The usual parametric approach, which assumes that random effects are generated by a bivariate normal density (Greene 2012), is avoided here as restrictive. Consequences of parametric assumptions have been studied extensively within the class of generalised linear mixed models (GLMMs). Several authors have shown that misspecification of the random effects distribution can affect negatively the estimation of regression parameters; see for instance Neuhau et al. (1992), Heagerty & Kurland (2001), Chen et al. (2002), and Agresti et al. (2004). In addition, the assumed distribution is a very important factor for the prediction of the random effects themselves. In fact, the shape of the distribution of the empirical Bayes estimates tends to have features that are similar to the assumed random effects distribution, even if in reality assumed and true distributions are not close together (Verbeke & Lesaffre 1996;Papageorgiou & Hinde 2012). With a nonparametric approach such pitfalls are avoided. The results of Laird (1978) and Lindsay (1983) have shown that the nonparametric maximum likelihood estimate of a mixing distribution is a discrete distribution. General fitting algorithms have been provided by Laird (1978), Lindsay (1983), Follmann & Lambert (1989) and Lesperance & Kalbfleisch (1992).
The proposed model is fitted by maximizing a penalised likelihood using an Expectation-Maximisation algorithm, where the issues of automatic multiple smoothing parameter selection and inference are also addressed. The empirical properties of the proposed algorithm are examined in a simulation study. The method is then illustrated using data from a survey on health, aging and wealth. Specifically, the aim is to estimate the effect of private health insurance on private medical care utilization. In such data, endogeneity is likely to arise because insurance coverage is not randomly assigned but rather is the result of supply and demand. Moreover, estimation of the effect of private health insurance on private medical care utilization may be adversely affected by overdispersion resulting from the heterogeneity present in the observations due to unobserved covariates related to either the response or the treatment variable. Buchmueller et al. (2005) provide an excellent review of these issues, which, if neglected, can lead to a biased estimate of the relationship of interest.

Model specification
The recursive bivariate probit model consists of a reduced form or treatment equation for the potentially endogenous binary variable and a second structural form or outcome equation for the binary response variable. The mixed effects semiparametric version of this model takes the form where m denotes the number of clusters, n i is the number of observations within the ith cluster, and y Ã vij is a latent continuous variable which determines its observable counterpart y vij through the rule 1ðy Ã vij [ 0Þ, for v ¼ 1; 2, where 1ðÁÞ is the indicator function; # is the coefficient of the endogenous binary variable y 1ij ; vector x 1ij contains P 1 parametric model components (such as dummy and categorical observed confounders, but not intercepts as we do not impose a zero mean on the random effects), with corresponding parameter vector h 1 . The s 1k 1 are unknown smooth functions of the K 1 continuous observed confounders z 1k 1 ij . Varying coefficients models can be obtained by multiplying a smooth term by some predictor (Hastie & Tibshirani 1993). Smooth functions of two covariates such as s 11;12 ðz 11ij ; z 12ij Þ can also be implemented (e.g. Wood 2006, pp. 154-167). Similarly, x 2ij is a vector of dimension P 2 with associated parameter vector h 2 , the s 2k 2 are unknown smooth terms of the K 2 continuous observed confounders z 2k 2 ij . For identification purposes, the smooth functions are subject to the centering constraint P ij s k ðz kij Þ ¼ 0 for all terms (Wood 2006 pp. 167-168). The pair of random effects ðu 1i ; u 2i Þ is cluster specific, hence it induces correlation among multiple observations on the same cluster or can be used to handle overdispersion in case of independent observations, i.e. n i ¼ 1 for all i. For instance, a large value of u 1i will tend to make y Ã 1ij large for all n i observations within the ith cluster. Similar comments hold for u 2i . As in Chib & Greenberg (2007) and Marra & Radice (2011a), we make the assumption that unobserved confounders have a linear impact on the response. That is, the error terms ðe 1ij ; e 2ij Þ are assumed to follow the bivariate distribution where q e is a correlation coefficient and the error variances are set to 1 as the model parameters can only be identified up to a scale coefficient (e.g. Greene 2012). Parameter q e accounts for the correlation between the responses not accounted for by the pair ðu 1i ; u 2i Þ. As in Greene (2012), u 1i and u 2i can be correlated. Further, it is assumed that the error terms and random effects are independent. The recursive structure of (1) follows from the condition of logical consistency. It states that only one observed endogenous variable is allowed on the right-hand side of system (1). This is because the probabilities for the different possible value combinations of the two binary variables have to sum to one (e.g. Maddala 1983, p. 118). To identify the model parameters, it is typically assumed that the exclusion restriction on the exogenous variables holds (e.g. Maddala 1983, p. 122). That is, the exogenous covariates in the first equation of (1) should contain at least one regressor not included in the second equation. Such covariates are regarded as instrumental variables which induce variation in the treatment, do not directly affect the outcome, and are independent of the error terms given the covariates (e.g. Chib & Greenberg 2007). However, under correct model specification, this restriction may not be strictly necessary as pointed out by Wilde (2000) and Marra & Radice (2011a).
The smooth functions are represented using regression splines. The key idea is to approximate a generic function s k ðz kij Þ by a linear combination of known spline basis functions, b kq ðz kij Þ, and regression parameters, b kq , where Q k is the number of bases (hence regression coefficients) used to represent s k , B k ðz kij Þ is a vector containing Q k basis functions evaluated at observation z kij , i.e. B k ðz kij Þ ¼ fb k1 ðz kij Þ; b k2 ðz kij Þ; . . .; b kQ k ðz kij Þg > , and b k is the corresponding parameter vector. Basis functions should be chosen to have convenient mathematical properties and good numerical stability. Many choices are possible within the framework adopted in this article. These include B-splines, cubic and thin plate regression splines (see, e.g. Ruppert et al. 2003;Wood 2006 for a more detailed introduction); we opt for the latter. Based on the above regression spline representation, model (1) is written as vK v Þ, for v ¼ 1; 2, and the linear predictors, g vij , have the obvious definitions.
In the current context, the effect of y 1ij is of primary interest. This is typically calculated using the average treatment effect (ATE). Given estimates for the random effects, parametric and smooth function components, the ATE can be estimated as follows ; Àĝ 1ij ; Àq e 1 À Uðĝ 1ij Þ ; where U and U 2 are the distribution functions of a standardized univariate normal and a standardized bivariate normal with correlation q e , andĝ ðy 1ij ¼rÞ 2ij indicates the linear predictor evaluated at r equal to 1 or 0. Coefficient q e is also of interest as it can be used to ascertain the presence of unobserved confounding (endogeneity). It can be interpreted as the correlation between the unobserved confounders in the two equations (e.g. Monfardini & Radice 2008). If q e ¼ 0 then e 1ij and e 2ij are uncorrelated and hence there is not a problem of endogeneity. Because model (1) can capture, and hence separate, two different sources of variability (represented by e vij and u vi ), estimation of q e will be done more reliably by model (1) than by a model which does not account for overdispersion (e.g. Greene 2012).

Estimation approach
Recall that the error terms ðe 1ij ; e 2ij Þ are assumed to follow a bivariate normal distribution. Define the parameter vector d ¼ ðh > 1 ; h > 2 ; b > 1 ; b > 2 ; #; q e ; m > Þ > , and pairs of random effects u i ¼ ðu 1i ; u 2i Þ > . Vector m contains the parameters pertaining to the random effects distribution (see next section). In the current context, the data identify four possible events, ðy 1ij ¼ e 1 ; y 2ij ¼ e 2 Þ with e v 2 f0; 1g for v ¼ 1; 2, with the following conditional probabilities The penalised log-likelihood function of the observed data y ¼ fy i ; i ¼ 1; . . .; mg, where y i ¼ fy ij ¼ ðy 1ij ; y 2ij Þ > : j ¼ 1; . . .; n i g, is where and the S vk v are positive semi-definite known square matrices measuring the (second-order, here) roughness of the smooth terms in the model, that is b > P 2 v¼1 The k vk v are smoothing parameters controlling the trade-off between fit and smoothness.
Note that because of the presence of smooth components in the model, unpenalised estimation would yield exceedingly wiggly curve estimates which can have a detrimental impact on the estimation of the ATE (Marra & Radice 2011a). This is why the log-likelihood is augmented by a penalty term. In addition, because q e is bounded in ½À1; 1, we use the common transform for correlation q þ e ¼ tanh À1 ðq e Þ ¼ 0:5 logfð1 þ q e Þ=ð1 À q e Þg, so that ½À1; 1 is mapped to the real line.

EM penalised log-likelihood maximisation
We make no assumptions about the form of the density that gives rise to the model's random effects u i . The nonparametric maximum likelihood estimate of a mixing distribution is discrete (Laird 1978;Lindsay 1983) and thus the density of u i can be represented by F bivariate mass points, m 1 ¼ ðm 11 ; m 12 Þ; . . .; m F ¼ ðm F1 ; m F2 Þ, with corresponding probabilities, p 1 ; . . .; p F , where P F l¼1 p l ¼ 1. Hence the parameter vector m, first introduced in Section 3.1, consists of m ¼ ðm > 1 ; . . .; m > F ; p 1 ; . . .; p FÀ1 Þ > . We will treat F as a tuning constant.
An EM algorithm (Dempster et al. 1977) is employed for maximising (6). We consider ðy; uÞ ¼ fðy i ; u i Þ : i ¼ 1; . . .; mg to be the complete data and indirectly maximise ' p ðdjyÞ by iteratively maximising the expectation of the penalized log-likelihood of the complete data, where the expectation is taken with respect to the conditional distribution of the missing given the observed data where d ðaÞ is the current value of the parameter vector. Let d ðaþ1Þ ¼ /ðd ðaÞ Þ be the parameter vector that maximises Q p ðdjd ðaÞ Þ. Under regularity conditions, at convergencê d ¼ /ðdÞ maximises both the complete and the observed data log-likelihoods. Conditionally on the data and current parameter estimates, the distribution of u i is discrete with points m ðaÞ l ; l ¼ 1; . . .; F, and probability masses given by Given the w ðaÞ il , we have the following expression for the penalised complete data log-likelihood where the PrðÁjÁÞ are given in (2)-(5). Note that in (7) the parameter vector d separates into two independent subvectors, namely the vector . . .; m > F Þ > that appears in the triple sum and penalty term and the vector d 2 ðp 1 ; . . .; p FÀ1 Þ > that appears in the double sum. Consequently, maximisation of Q p is achieved in two steps. Firstly, the triple sum has summands that, ignoring fixed w ðaÞ il ; are exactly the same as those that would have been obtained by assuming a model without random effects. It follows that the triple sum with penalty term can be maximized using the algorithm presented in Marra & Radice (2011a). Secondly, the double sum is used to update the masses of the random effects distribution resulting in closed form formulas p

Smoothing parameter selection
If the model has more than two or three smooth terms, then it becomes crucial to estimate the smoothing parameters using an automatic, quick and reliable procedure. There are several techniques for automatic multiple smoothing parameter selection for univariate models (see Ruppert et al. 2003;Wood 2006 for a detailed overview). These include the performance-oriented iteration method first introduced by Gu (1992) which consists of applying the generalized cross validation or unbiased risk estimator (UBRE, Craven & Wahba 1979) to each working linear model of the penalized iteratively re-weighted least squares scheme used to fit the model. In what follows, we employ an adaptation of Gu's approach. Also, we suppress the superscript ðaÞ to avoid clutter.
Given values for k vk v and d 2 , an estimate for d 1 can be obtained by minimisation of where S Ã k is an overall blockdiagonal penalty matrix made up of the k vk v S vk v and zero vectors corresponding to the model parameters which are not penalised, and w l is a vector containing the masses as defined in the previous section. Assuming, for simplicity and without loss of generality, that n i ¼ 1 and Q vk v ¼ Q for each vk v so that the total number of observations is m, d il is a 3-dimensional vector given by f@Q il =@g 1il ; @Q il =@g 2il ; @Q il =@g 3il g > , where each of the vectors c 1il and c 2il contain F zero elements but the lth which is set to 1, and the definitions of the linear predictors in (9) follow from the definition of X il . The square root and inverse of W l are obtained via eigendecomposition.
The smoothing parameter vector k is selected so that the estimated smooth terms are as close as possible to the true functions (Craven & Wahba 1979). Given an estimate for d 1 , multiple smoothing parameter estimation for problem (8) can be achieved by minimization of the approximate UBRE score where the working linear model quantities are calculated using the parameter estimates from the optimisation step mentioned in Section 3.1.1, À1 X Ã is the hat matrix, and trðA k Þ the estimated degrees of freedom of the penalised model. For each working linear model of iteration (8), V w u ðkÞ is minimized with respect to k. In practice, this is implemented employing the approach by Wood (2004), which is based on the Newton-Raphson method. In evaluating score (10) and their derivatives, efficiency and stability are achieved using a combination of pivoted QR and singular value decompositions (see Wood 2004 for full details). Note that because each of the W l is a non-diagonal matrix of dimension n Ã Â n Ã , computation can quickly become prohibitive, hence its sparse structure is exploited in implementation.

Inference
Inference in penalised models is complicated by the presence of smoothing penalties which undermines the usefulness of classic frequentist results for practical modelling. Solutions to this problem have been introduced in the literature (see, e.g., Gu 2002;Wood 2006 for an overview). Here we show how to construct pointwise confidence intervals for the terms of a mixed effects semi-parametric bivariate model by adapting the well known Bayesian confidence intervals, originally proposed by Wahba (1983) and Silverman (1985). An appealing feature of these intervals is that they have close to nominal 'across-the-function' frequentist coverage probabilities (Marra & Wood 2012). This is because the Wahba/Silverman type intervals include both a bias and variance component. Moreover, their empirical performance has little sensitivity to the neglect of smoothing parameter uncertainty. For a generic term s k ðz ki Þ intervals can be constructed by seeking constants C ki and A, such that where 'ACP' denotes 'Average Coverage Probability', a is a constant between 0 and 1, and q a=2 is the a=2 critical point from a standard normal distribution. Defining b k ðz k Þ ¼ Efŝ k ðz k Þg À s k ðz k Þ and v k ðz k Þ ¼ŝ k ðz k Þ À Efŝ k ðz k Þg, so thatŝ k À s k ¼ b k þ v k , and I to be a random variable uniformly distributed on f1; 2; . . .; ng, we have It is then necessary to find the distribution of B k þ V k and values for C ki and A so that requirement (11) is met. As shown in Marra & Wood (2012), in the context of non-Gaussian response models involving several smooth components, such a requirement is approximately met when confidence intervals for theŝ k ðz ki Þ are constructed using the distribution where, in our context, y refers to the binary response vectors,d is an estimate of d, and where V d k and B k ðz ki Þ are the submatrix of V d and the basis functions corresponding to the regression spline parameters associated with s k ðz ki Þ. In addition, intervals for nonlinear functions of the model parameters, such as the ATE, can be conveniently obtained by simulation from (12). In practice, I can be replaced by its observed version J . In the present context, however, J cannot be obtained as a byproduct of the estimation procedure and of the second order derivatives of Q, in (7), used therein. Second derivatives of the log-likelihood of the 'complete' data ðy; uÞ would overestimate the information about the model parameters in the sample. Ultimately, this is attributed to treating the weights w il that appear in Q as fixed. We therefore find the observed information using the method of Louis (1982), by which the observed information matrix is expressed as which makes clear how the second derivative of the complete data likelihood Q is adjusted for the unobserved data u. Details about the approach, including the definition of w i , are provided in Appendix A. It is important to stress that there is no contradiction in fitting the model using the method of Section 3.1 and then constructing intervals following a Bayesian result, and such an approach has been employed many times in the literature (e.g. Gu 2002;Wood 2006 and references therein).

Algorithm
As indicated previously, we treat F, the number of mass points of the nonparametric mixing distribution, as a tuning parameter. It is common practice to find its value as the one that minimizes Akaike's information criterion. This, for a given value of F, takes the following form: AICðFÞ ¼ À2'ðdjyÞ þ 2dimðdjFÞ, where dimðdjFÞ denotes the effective dimension ofd for a fixed value of F, and the log-likelihood is obtained from Equation (6), in which we express f ðy i jdÞ ¼ Having fixed a value of F, we need to choose starting values, d 0 , for the model parameters. Starting values for ðh > 1 ; h > 2 ; b > 1 ; b > 2 ; #; q þ e Þ > are chosen as the maximum likelihood estimates of the corresponding fixed effects model, i.e. the model assuming F ¼ 1, fitted using the method of Marra & Radice (2011a). Starting values for the masses p k ; k ¼ 1; . . .; F, are all set to 1=F, while starting values for the mass points, ðm > 1 ; . . .; m > F Þ > , as set to a multiple (here, square root of two) of the Gauss-Hermite quadrature nodes.
Given F and d 0 , parameter estimates are found using an iterative algorithm. Iteration ða þ 1Þ consists of finding the maximizer of Q p ðdjd ðaÞ Þ using the algorithm described after (7). For a given estimate of d, smoothing parameter selection is achieved by minimisation of (10), as described in Section 3.1.2. The two main steps, one for d the other for k, are iterated until convergence. The rule that we follow for stopping the iterative algorithm is that the maximum absolute change in the parameter estimates from successive iterations is less than ¼ 10 À6 .
At convergence, we calculate log-likelihood and AICðFÞ to guide model choice, standard errors of the estimates by inverting the observed information obtained as described in the previous section, and random effects predictions using (14) as these are needed for estimating the ATE.

Simulation study
To gain insight into the empirical effectiveness of the proposed method, a Monte Carlo simulation study was conducted. All computations were performed in the

Design and model fitting details
The sampling experiments were based on the model where the binary outcomes y 1ij and y 2ij were determined as described in Section 2. The test functions used were s 1 ðz 1ij Þ ¼ 0:6fz 3 1ij þ sinðpz 3 1ij Þg, s 2 ðz 2ij Þ ¼ À1:25z 2ij and s 3 ðz 1ij Þ ¼ 0:5 cosð2pz 1ij Þ (see Fig. 1). Covariates x 1ij , z 1ij and z 2ij were generated as binary, uniform and normal correlated predictors, respectively. This was achieved by drawing standardised multivariate normal random variables with correlation 0.5 (using rmvnorm () in the package mvtnorm) and then transforming the first two of them with round() and pnorm() (e.g. Marra & Radice 2011a). Bivariate normal errors with zero means, standard deviations equal to one, and correlations q e ¼ AEð0:1; 0:5; 0:9Þ were considered. Sample sizes were set to 2000 and 6000 in the following two ways. In the first case, n i was set to a randomly chosen number between 9 and 11 and m was set to 200 and 600. In the second case, n i ¼ 1 and m was set to 2000 and 6000. The pairs of random effects ðu 1i ; u 2i Þ were generated according to three scenarios: bivariate normal variates with mean vector (0,2), standard deviations r 1 ¼ r 2 ¼ 0:5 and correlation q u ¼ 0:5; mixture of two equally weighted bivariate normals with mean vectors ðÀ2; À2Þ and ð2; 2Þ, and with the remaining parameters as above; bivariate gamma variates with shape and scale parameters equal to 0.5. This last was achieved via a normal copula with q u ¼ 0:5 using mvdc() in the package copula. Each scenario was replicated 250 times and the quantities of interest, estimated ATE and q e (see final paragraph of Section 2), recorded. The smooth components of continuous covariates in the model were represented using penalized thin plate regression splines with basis dimensions equal to 10 and penalties based on second-order derivatives (Wood 2006, pp. 154-160). The spline basis representation used here is a low rank eigen-approximation version of the full rank version introduced by Duchon (1977). It represents a general solution to the problem of estimating efficiently, and without having to choose knot locations, a smooth function of multiple predictors from noisy observations of the function. Smoothing parameters were chosen by approximate UBRE as described in Section 3.1.2. The tuning constant F was identified to be 3; further increasing the value of this parameter did not change the results reported in the next section.
True values for the ATEs, under the scenarios detailed above, were obtained via simulation. Specifically, 10000 replicate datasets were generated according to model (13) and ATEs calculated based on the true linear predictors. The simulated average true ATEs for the normal, mixture of normals and gamma cases are À0.43, À0.15 and À0.45, respectively.
Estimates of the ATE were obtained using the proposed mixed model and, for the sake of comparison, the semiparametric bivariate model of Marra & Radice (2011a) which neglects the presence of random effects (henceforth, these two models will be referred to as mixed SRBP and SRBP, respectively). The calculation of the ATE for mixed SRBP requires an estimate of the random effects distribution. This was obtained, using empirical Bayes, as weighted averages of the estimated mass points,m 1 ; . . . ;m F , with respective weights Prðm 1 jy i Þ; . . . ; Prðm F jy i Þ. That is, for each i,

Results
Tables 1 and 2 display the percentage biases and the root mean squared errors (RMSEs) of the estimated ATEs and q e 's obtained using SRBP and mixed SRBP, when n i is a randomly chosen number between 9 and 11 and m is set to 200 and 600, and random effects are generated using bivariate normal (N), mixture of normals (MN) and gamma (G) distributions. Tables 4 and 5, reported in Appendix B, provide the same information but for the case in which n i is set 1 and m set to 2000 and 6000.
The main results can be summarized as follows: • Table 1 shows that, under the N and G scenarios, mixed SRBP is only slightly better than SRBP, in terms of accuracy and precision of the estimated ATEs. This suggests that, under the N and G cases, the model neglecting cluster specific random effects can still yield good estimates of the average treatment effect. A likely explanation is that the parameter that links the two equations of the bivariate model (i.e. q e ) captures correlations due to both unobserved confounders and cluster or 'litter' effect. However, this is not true when the bivariate random effects distribution is not unimodal, the case in which mixed SRBP considerably outperforms SRBP. These conclusions are in agreement with previously reported findings on the impact of misspecification of the random effects distribution on parameter estimation within the class of GLMMs; see Heagerty (1999), Chen et al. (2002) and Agresti et al. (2004).   • Table 2 shows that, under all random effects distribution scenarios, mixed SRBP performs considerably better than SRBP, in terms of accuracy and precision of the estimated q e s. The unsatisfactory performance of SRBP can be attributed to the fact that a model neglecting the presence of overdispersion will not be able to disentangle different sources of variability (in this case, one due to endogeneity and the other due to overdispersion). This finding is important because the parameter linking the two model equations is useful to ascertaining the presence of endogeneity, and the estimates produced using SRBP can clearly lead to erroneous conclusions.
• The findings for the more computationally challenging scenarios, in which n i ¼ 1 and m ¼ 2000; 6000, are essentially the same as those reported above, except that, as expected, the estimates obtained using mixed SRBP are more variable. These results can be found in Tables 4 and 5 given in Appendix B. Figure 1 provides an example of estimated smooths with corresponding 95% Bayesian pointwise confidence intervals obtained using the mixed SRBP model. The function estimates recover the true functions reasonably well. This is a good result given the complexity of the model.

Empirical illustration
The modeling framework described in this article is illustrated using data from an Italian population based survey. The aim of this study is to estimate the causal effect of private health insurance on private medical care utilization in the presence of unobserved confounding and overdispersion. The problem of unobserved confounding arises in such data because insurance coverage is not randomly assigned as in a controlled trial but rather is the result of supply and demand, including individual preferences and health status. As a consequence, differences in outcomes for insured and uninsured individuals might be due not only to the effect of health insurance but also to the effect of unobserved characteristics that are associated with insurance coverage and medical care utilization. If we do not account for the endogeneity of coverage insurance then the estimated effect will be biased, hence leading to distorted assessments of health policy implications. Overdispersion, which in this study can result from unobserved predictors of either private health insurance or private medical care utilization, can also bias the effect of interest. Buchmueller et al. (2005) provide an excellent review of these issues. The direction of the bias due to unobserved confounding is unclear a priori. Specifically, standard economic models of insurance markets point to the problem of adverse selection: individuals with a greater demand for medical care, because of poor health for instance, are expected to have a greater demand for insurance. In this case, adverse selection would impart a positive bias on the estimate of the insurance effect on medical care utilization. On the other hand, there could be a problem of moral hazard; once insured, individuals consume more care than optimal. Here, moral hazard would contribute to bias in the opposite direction.

Data
We used data from the Survey on Health, Aging and Wealth (SHAW; Brugiavini et al. 2002) which was conducted by the leading Italian polling agency DOXA in 2001. The SHAW sample consists of 1068 households whose head is over 50 years old and mainly provides information about individual health status, utilization of health services, types of insurance coverage, as well as socio-economic features. The response is utilization of private health care (util): an indicator variable that takes value 1 if the subject has private examinations and 0 otherwise. The treatment variable is private health insurance (ins): a dummy variable with value 1 if the respondent has private insurance coverage and 0 otherwise. The observed confounders are the continuous covariates age (age), income (inc), body mass index (bmi), the binary variables indicating whether the individual is a male (male), is unmarried or widower (single), is unemployed (unemp), suffers from chronic conditions (cond), has a condition that limits activities of daily life (lim), suffers from hearing and/or eyesight troubles (heey), has ever smoked (smoke), and a factor indicating self-reported health status (poor, good and excellent self-perceived health: poor, good and exc, respectively).

Health care modeling
The methodology presented here is suitable to tackle both endogeneity and overdispersion; the bivariate model allows us to account for unobserved confounding and for the source of variation due to the heterogeneity in the households. Following previous work on the subject (e.g. Holly et al. 1998;Fabbri & Monfardini 2003;Marra & Radice 2011b), we specified a mixed SRBP model with main terms only. Specifically, the equations for ins and util are: The parameters in the model have the obvious definitions and thin plate regression splines of the continuous covariates with the same settings as those used for the simulation study were employed. The optimal value for tuning parameter F was identified to be 2, that is the random effects distribution is represented by a two point discrete distribution. The non-linear specification for age, inc and bmi arises from the fact that these covariates embody productivity and life-cycle effects that are likely to affect ins and util non-linearly. In fact, Holly et al. (1998) and Fabbri & Monfardini (2003) considered a model for health care utilization that contains linear and quadratic terms in age, inc and bmi, whereas Marra & Radice (2011b) specified a model containing smooth functions of them. For comparison purposes, we also employ the SRBP model and a classic univariate probit model using the same functional form specification. Mixed SRBP can account simultaneously for unobserved confounding and overdispersion, SRBP accounts for unobserved confounding only whereas the probit model cannot account for either of these issues.
Results are displayed in Table 3. Bayesian confidence intervals for the ATE and correlation coefficient were obtained using 1000 coefficient vectors simulated from the posterior distribution of the estimated model parameters (see Section 3.2).
For the mixed SRBP model, the estimated bivariate mass points are m 1 ¼ ð1:50; À0:24Þ and m 2 ¼ ðÀ8:27; 0:20Þ, with probabilities 0:71 and 0:29, suggesting the absence of relevant predictors of private health insurance. The estimates of q are both negative and statistically significant, suggesting the presence of endogeneity. Specifically, the point estimate obtained with mixed SRBP is larger than that of SRBP, although their intervals overlap. This confirms the finding by Holly et al. (1998) which is consistent with the interpretation that unobserved confounders are present and have an opposite significant effect on ins and util.
Moving on to the estimated ATE, the result obtained with the univariate probit model suggests that the effect of private health care insurance is not significant. However, this estimate may be biased due to the unmodelled effects. If we look at the results obtained with the SRBP models, that is models which account for unobserved confounding, private health care insurance has a significant positive impact on the probability of using private health care services. Specifically, the mixed SRBP estimate suggests that the probability of using private medical services increases by 0.38 points for an individual with private health coverage as compared to an individual without private insurance. The point estimate obtained with mixed SRBP is larger than that obtained using SRBP, although their intervals overlap. Results for the other parametric coefficients (not reported here) are in agreement with those found in the literature. The change in the correlation coefficient and ATE of mixed SRBP suggests that decomposing the disturbance in the model into a part attributed to endogeneity and another attributed to overdispersion might have led to a more accurate estimate of the effect of interest. Figure 2 shows the impacts of age, inc and bmi for the treatment and outcome equations obtained using the mixed SRBP model. These results support the presence of nonlinear effects in the outcome equation.
In summary, if we employ a univariate probit model to estimate the effect of private health insurance, the impact appears not to be statistically significant. However, this result is likely to be biased by the presence of unobserved confounding and overdispersion. The estimates obtained with the SRBP models, which account for unobserved confounding, are likely to be more realistic, with mixed SRBP also accounting for overdispersion.

Discussion
In this paper, we introduce an algorithm for the simultaneous estimation of the equations of a semiparametric recursive bivariate probit model with nonparametric mixing. Estimation is carried out by maximising a penalised likelihood function using an Expectation-Maximisation algorithm. We also address the issues of automatic multiple smoothing parameter selection and inference. Results from our simulation study suggest that the approach is effective for estimating the effect of an endogenous binary predictor on a binary outcome. Interestingly, the model neglecting overdispersion yields average   TABLE 3 Estimates of the ATE and q e in the health care study obtained using the univariate probit model, and semiparametric recursive bivariate probit without and with nonparametric random effects (SRBP and mixed SRBP, respectively). treatment effect estimates exhibiting substantial bias only for the case of bimodal random effects densities. However, this is not true for the estimation of the parameter linking the two model equations (which is important for ascertaining the presence of endogeneity), where substantial bias is observed in all simulation settings. The methodology was illustrated using data from a survey on private medical care utilization. For this application, differences in the point estimates of the average treatment effects were found between the models with and without random effects, and a classic univariate probit model. Maximum likelihood estimators are typically sensitive to model error misspecifications. This creates a need for considering different joint distributions of the model errors. A copula approach can be used to that end (e.g. Nelsen 2006). As for the nonparametric approach to the estimation of the random effects distribution, although it yields reasonably efficient parameter estimates, it has several drawbacks. For instance, the resulting discrete estimate of the distribution is not satisfactory as it is more likely to be continuous than discrete. A more relevant drawback is the amount of information required to obtain an accurate estimate of the nonparametric mixing distribution (Carroll & Hall 1988), which can ultimately affect the precision of the effect of interest. We plan on extending the approach presented here in order to include random effects generated by flexible densities that avoid the restrictive assumption of normality but also allow for smooth estimates of the random effects densities. Such densities can, for instance, be represented by mixtures of Gaussians.

Appendix A: Observed information matrix
We briefly describe the method we used to obtain the observed information matrix. First, the log-likelihood, 'ðdjyÞ ¼ P m i¼1 log f ðy i jdÞ, of the hierarchical model is written in terms of both the observed data and random effects, as P m i¼1 logf P F l¼1 f ðy i jd; u i ¼ m l Þ Prðu i ¼ m l jdÞg, which for the sake of notational convenience is expressed as 'ðdÞ ¼ P m i¼1 logf P F l¼1 f ðy i ; u l jdÞg. From this, we obtain the score function, Wðy; dÞ @'ðdÞ=@d; as where w il and w il have the obvious definitions. Note that the above score function and the one obtained by differentiating (7) are exactly the same (except for the penalty term). Now, the observed information matrix J À@ 2 'ðdÞ=@d@d > ¼ À@W=@d > , is obtained as w il @ log f ðy i ; u l jdÞ @d @ log f ðy i ; u l jdÞ @d > þ X m i¼1 X F l¼1 w il @ log f ðy i ; u l jdÞ @d ( ) X F l¼1 w il @ log f ðy i ; u l jdÞ @d Varðw i Þ; with expressions involving model parameters being evaluated at parameter estimates obtained at convergence of the fitting algorithm.

TABLE 4
Percentage biases and root mean squared errors of the estimated average treatment effects (ATEs) obtained using the semiparametric recursive bivariate probit model without and with random effects (SRBP and mixed SRBP, respectively).    Table 4 for further details.