Identification region of the potential outcome distributions under instrument independence

This paper examines identication power of the instrument exogeneity assumption in the treatment e¤ect model. We derive the identication region: The set of potential outcome distributions that are compatible with data and the model restriction. The model restrictions whose identifying power is investigated are (i) instrument independence of each of the potential outcome (marginal independence), (ii) instrument joint independence of the potential outcomes and the selection heterogeneity, and (iii) instrument monotonicity in addition to (ii) (the LATE restriction of Imbens and Angrist (1994)), where these restrictions become stronger in the order of listing. By comparing the size of the identication region under each restriction, we show that the joint independence restriction can provide further identifying information for the potential outcome distributions than marginal independence, but the LATE restriction never does since it solely constrains the distribution of data. We also derive the tightest possible bounds for the average treatment e¤ects under each restriction. Our analysis covers both the discrete and continuous outcome case, and extends the treatment e¤ect bounds of Balke and Pearl (1997) that are available only for the binary outcome case to a wider range of settings including the continuous outcome case. Keywords: Partial Identication, Program Evaluation, Treatment E¤ects, Instrumental Variables JEL Classication: C14, C21. Email: t.kitagawa@ucl.ac.uk Webpage: http://www.homepages.ucl.ac.uk/~uctptk0/ yAn earlier version of this paper appears in a chapter of my Ph.D. dissertation at Brown University. I thank Guido Imbens, Frank Kleibergen, and the seminar and conference participants at Harvard Econometrics Lunch and the 2009 SETA/CeMMAP conference in Kyoto for valuable comments. Financial Support from CeMMAP and the Brown University Merit Dissertation Fellowship are gratefully acknowledged.


Introduction
This paper studies identification of the potential outcome distributions using an instrumental variable in settings where data exhibits imperfect compliance and selection is an issue. A motivating example is a randomized control trial with two treatment arms in which trial subjects are observed following a different treatment arm from the one to which they are allocated by the experimental design (Imbens and Angrist, 1994; Angrist et al., 1996). Of interest throughout this paper is identification of the effect of treatment on the potential outcomes, and the models that we study feature an instrumental variable that facilitates identification. The potential outcomes can be dichotomous, discrete or continuous, and the effect of treatment can be heterogeneous in the population. The models that we study differ in the statistical independence conditions and monotonicity restrictions that they embed, and so in the relationships between the potential outcomes and the instrumental variable with which they are compatible.
We make three contributions to the existing literature on partial identification of treatment effects by models featuring selection on unobservables. Firstly, for each model that we study, we derive the identification region of the potential outcome distributions, which we emphasize are counterfactual distributions that describe the outcomes that would be realized if each treatment arm were applied uniformly to the population. Since the statistical independence conditions and monotonicity restrictions that we consider are successively stronger, these identification regions are nested, and we provide closed-form expressions for them. Secondly, we similarly derive sharp bounds for the Average Treatment Effect and provide closed-form expressions for these bounds. The expressions that we provide include a novel and non-trivial extension of existing results, with our extension of Balke and Pearl (1997) and its Average Treatment Effect bounds a leading example. Thirdly, we show that each model is falsifiable by deriving a sharp testable implication of each set of assumptions that we consider; violation of each implication is equivalent to emptiness of the identification region of the potential outcome distributions.
The statistical independence conditions and monotonicity restrictions that we consider in this paper are as follows. Firstly, that the instrumental variable is statistically independent of each potential outcome (marginal statistical independence). Secondly, that the instrumental variable is statistically independent of the potential outcomes and selection heterogeneity jointly (joint statistical independence). Thirdly, that the instrumental variable is statistically independent of the potential outcomes and selection heterogeneity jointly, and that each unit in the population exhibits a weakly monotonic selection response to the instrumental variable (instrument monotonicity). We note that the third assumption is the so-called LATE restriction (Imbens and Angrist, 1994).
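Using notation that is defined formally in Section 2 (with $T$ denoting the selection type), the three restrictions can be summarized compactly as:

```latex
\begin{align*}
\text{(i) marginal independence:} \quad & Z \perp Y_1 \quad \text{and} \quad Z \perp Y_0, \\
\text{(ii) joint independence:} \quad & Z \perp (Y_1, Y_0, T), \\
\text{(iii) LATE restriction:} \quad & Z \perp (Y_1, Y_0, T), \ \text{and} \
  \Pr(T = d) = 0 \ \text{or} \ \Pr(T = c) = 0.
\end{align*}
```

Each restriction implies the one listed before it, so the corresponding classes of populations are nested.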
The remainder of the paper is organized as follows. In the remainder of this section, we present a brief review of the existing literature. In Section 2, we introduce the notation that we use and provide a formal definition of the identification region. In Section 3, we derive the identification region of the potential outcome distributions for each model. In Section 4, we compare the size of the obtained identification regions, and present sharp bounds for the Average Treatment Effect. In Section 5, we conclude. Proofs and further discussion are included as appendices.

Related literature
We identify several papers in the econometrics literature that are related to this paper, which we collect into three broad categories according to their content and their relation to this paper.
Firstly, we recognize the contribution of papers that consider identification of treatment effects under a similar set of assumptions to those that we consider here, and that propose bounds on these treatment effects. Chief amongst these is Manski (1990), which reports sharp bounds on mean outcomes and the Average Treatment Effect when mean independence is assumed to hold. Balke and Pearl (1997) similarly considers identification of treatment effects in settings where data exhibits imperfect compliance and selection is an issue, but restricts attention to the case where the potential outcomes are dichotomous. When outcomes are dichotomous, the mean independence condition that is present in the analysis of Manski (1990) coincides with marginal statistical independence. Balke and Pearl (1997) strengthen marginal statistical independence to full statistical independence, and show that the Manski bounds are not sharp in this case. Balke and Pearl (1997) provide closed-form expressions for the sharp bounds on mean outcomes and the Average Treatment Effect by solving a linear program, which is of finite dimension when there are a finite number of treatment arms and both the instrumental variable and outcomes are discrete with finite supports. We extend Balke and Pearl (1997) by allowing for non-scalar and continuous outcomes, providing closed-form expressions for the identified sets of the potential outcome distributions and the Average Treatment Effect. These closed-form expressions complement the general characterizations that are reported in Beresteanu et al. (2012). Gunsilius (2020a) also extends Balke and Pearl (1997) but goes further than this paper in allowing for an infinite number of treatment arms and a continuous instrumental variable (in addition to continuous outcomes). To facilitate this extension, Gunsilius (2020a) notes that it is necessary to regularize heterogeneity concerning individual responses to an infinite number of treatment arms. 
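For the dichotomous case, the linear program behind Balke and Pearl (1997) is straightforward to reproduce numerically. The sketch below is our illustration, not the authors' code: the joint distribution over types is made up, and the LP bounds E[Y 1 − Y 0 ] over all joint distributions of (Y 1 , Y 0 , T ) that reproduce the observed probabilities Pr(Y = y, D = d | Z = z):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

TYPES = ["c", "n", "a", "d"]  # complier, never-taker, always-taker, defier
CELLS = list(itertools.product([0, 1], [0, 1], TYPES))  # (y1, y0, t): 16 unknowns

def treatment(z, t):
    """Counterfactual treatment status for instrument value z and selection type t."""
    return {"c": z, "n": 0, "a": 1, "d": 1 - z}[t]

def observed_probs(q):
    """Map a joint pmf over (Y1, Y0, T) to the observable Pr(Y = y, D = d | Z = z)."""
    return {
        (z, y, d): sum(q[i] for i, (y1, y0, t) in enumerate(CELLS)
                       if treatment(z, t) == d and (y1 if d == 1 else y0) == y)
        for z, y, d in itertools.product([0, 1], [0, 1], [0, 1])}

# A made-up population: within each type, Y1 and Y0 are independent Bernoullis.
type_probs = {"c": 0.4, "n": 0.3, "a": 0.3, "d": 0.0}
success = {"c": (0.8, 0.3), "n": (0.5, 0.5), "a": (0.6, 0.2), "d": (0.5, 0.5)}
q_true = np.array([type_probs[t]
                   * (s1 if y1 else 1 - s1) * (s0 if y0 else 1 - s0)
                   for (y1, y0, t) in CELLS
                   for (s1, s0) in [success[t]]])
data = observed_probs(q_true)

# Equality constraints: match the eight observed probabilities and sum to one.
A_eq = [[1.0 if treatment(z, t) == d and (y1 if d == 1 else y0) == y else 0.0
         for (y1, y0, t) in CELLS] for (z, y, d) in data]
b_eq = list(data.values())
A_eq.append([1.0] * len(CELLS))
b_eq.append(1.0)

ate = np.array([y1 - y0 for (y1, y0, t) in CELLS], dtype=float)
lo = linprog(ate, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs").fun
hi = -linprog(-ate, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs").fun
print(f"ATE bounds: [{lo:.3f}, {hi:.3f}]")  # true ATE of the made-up population: 0.32
```

With a binary instrument, minimizing and maximizing the same objective over the same feasible set is exactly the finite-dimensional program whose solution Balke and Pearl express in closed form.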
No such regularization is required if the number of treatment arms is finite and, like Balke and Pearl (1997), we allow for unrestricted heterogeneity and rich behavior. Additionally imposing instrument monotonicity, Heckman and Vytlacil (2001) (and Heckman and Vytlacil, 1999, 2005) consider identification of the Average Treatment Effect when outcomes are continuous, and show that the obtained bounds coincide with the Manski bounds (Manski, 1990, 1994, 2003) under mean independence. If data is not compatible with instrument monotonicity, though, then the bounds that are derived in Heckman and Vytlacil (2001) can be wider than the sharp bounds that are derived under marginal or joint statistical independence, which we provide in this paper. Heckman and Vytlacil (2005, 2007) and Mogstad et al. (2018) extend the analysis of Heckman and Vytlacil (2001) to consider identification of the Marginal Treatment Effect and other policy-relevant parameters, while Huber et al. (2017) and Huber and Mellace (2015a) focus on partial identification of treatment effects for sub-populations including that of the compliers (Imbens and Angrist, 1994). Chen and Flores (2015) and Cheng and Small (2006) also consider identification of the Average Treatment Effect under instrument monotonicity, but allow for sample selection and three (rather than two) treatment arms, respectively. Bhattacharya et al. (2008), Mourifié (2015), Shaikh and Vytlacil (2011) and Vytlacil and Yildiz (2007) each study a special case where instrumental variables are statistically independent of the potential outcomes, which are dichotomous and monotonic in treatment. Chiburis (2010) also studies the special case of dichotomous outcomes, considering identification of treatment effects under a variety of semiparametric restrictions.
Lafférs (2019) adopts a linear programming approach to identification of treatment effects that is similar to the approach taken in Balke and Pearl (1997), adding constraints and restrictions that are not present in that analysis. A comprehensive review of (partial) identification of the Average Treatment Effect is found in Swanson et al. (2018).
Secondly, we recognize the contribution of papers that consider the failure and testing of identifying assumptions. Pearl (1995a) derives a testable implication of instrument independence, the so-called (Pearl) Instrument Inequality, whose violation is sufficient for emptiness of the identification region or, equivalently, for falsification of instrument independence. We show in this paper that violation of this implication is, in fact, both necessary and sufficient for emptiness of the identification region when the instrumental variable is dichotomous (the case that Pearl, 1995a, and Balke and Pearl, 1997, consider). As such, there does not exist a stronger testable implication than the Instrument Inequality unless further restrictions are maintained. For instance, Kédagni and Mourifié (2020) show that the Pearl Inequality can be strengthened if the instrumental variable takes more than two values. Gunsilius (2020b) shows that the testability of instrument independence relies on there being a finite number of treatment arms, and that this assumption is untestable when there are instead an infinite number. Provided that there are a finite number of treatment arms, Heckman and Vytlacil (2005) and Balke and Pearl (1997) provide testable implications for instrument independence and instrument monotonicity jointly. Kitagawa (2015) and Mourifié and Wan (2017) build upon these implications to propose formal tests of these restrictions, which underpin identification of the complier outcome distribution (Imbens and Rubin, 1997) and of the Local Average Treatment Effect (Imbens and Angrist, 1994). Huber and Mellace (2015b) propose a complementary testing procedure for the weaker condition of mean independence.
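The Instrument Inequality itself is simple to check from observed conditional probabilities. A minimal sketch with made-up numbers: for each treatment arm d, the sum over outcomes of the largest value of Pr(Y = y, D = d | Z = z) across instrument values must not exceed one.

```python
def instrument_inequality_holds(p):
    """p[(y, d, z)] = Pr(Y = y, D = d | Z = z) for binary Y, D, Z.

    Pearl's Instrument Inequality: for each d,
        sum over y of max over z of Pr(Y = y, D = d | Z = z) <= 1.
    """
    return all(
        sum(max(p[(y, d, 0)], p[(y, d, 1)]) for y in (0, 1)) <= 1.0
        for d in (0, 1))

# Hypothetical data satisfying the inequality ...
ok = {(0, 0, 0): 0.3, (1, 0, 0): 0.2, (0, 1, 0): 0.3, (1, 1, 0): 0.2,
      (0, 0, 1): 0.1, (1, 0, 1): 0.1, (0, 1, 1): 0.4, (1, 1, 1): 0.4}
# ... and hypothetical data violating it (d = 1 row sums to 1.1 > 1),
# which refutes instrument independence.
bad = {**ok, (0, 0, 0): 0.1, (1, 0, 0): 0.2, (0, 1, 0): 0.7, (1, 1, 0): 0.0}
```

A violation refutes the model; passing the check does not confirm it, which is why sharpness of the implication (necessity as well as sufficiency for emptiness) matters.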
Complementary to this work on the falsifiability of a model are de Chaisemartin (2017), Flores and Flores-Lagunes (2013) and Kédagni (2021), which consider identification in instances where various common model restrictions are inappropriate. For instance, de Chaisemartin (2017) considers identification in the presence of instrument non-monotonicity, Flores and Flores-Lagunes (2013) consider identification in the absence of exclusion, and Kédagni (2021) considers identification in the absence of instrument independence. Machado et al. (2019) propose testable implications when outcomes are dichotomous and maintained assumptions can reveal the sign of the Average Treatment Effect.
Thirdly, we recognize the contribution of papers that study the identification of treatment effects by incomplete structural models that do not feature a selection equation. Particular examples include Beresteanu et al. (2012), Chernozhukov and Hansen (2005) and Chesher and Rosen (2017). Chesher and Rosen (2017) is notable since it provides a sharp characterization of the identification region of treatment effects for a broad class of models using tools from random set theory. Work in this vein also illustrates what such incomplete models can deliver in practice by means of simple applications (we refer to Clarke and Windmeijer, 2012 for further evidence of what partially identifying models can deliver in practice in comparison to conventional models). We also recognize the contribution of papers studying complete structural models that impose additional restrictions on their constituent structural equations, including Chesher (2003, 2005, 2010), Imbens and Newey (2009) and Vuong and Xu (2017) to list but a few. These additional restrictions constrain the association of the potential outcomes and are a source of additional identifying power in the model.

Data generating process and the population
Consider identification of the causal effect of a binary treatment on some outcome of interest. We use D ∈ {1, 0} as an indicator for treatment, where D = 1 indicates a treated individual and where D = 0 indicates an untreated individual.
Following the Neyman-Rubin potential outcome framework, let Y 1 denote the outcome that would be observed if the individual receives treatment and let Y 0 denote the outcome that would be observed if the individual does not receive treatment. The observed outcome in data is then Y ≡ DY 1 + (1 − D)Y 0 , which need not be scalar. We let the support of Y 1 and Y 0 be a subset of Y, which we take to be an arbitrary space equipped with the Borel σ-algebra B(Y) and a measure µ. We focus on a situation where treatment status is not randomized and selection is an issue of concern (i.e., treatment status can depend upon the underlying potential outcomes). We suppose that a non-degenerate binary variable Z ∈ {1, 0} is available in data, and that Z qualifies as an instrumental variable (Imbens and Angrist, 1994; Angrist et al., 1996). In particular, we suppose that Z satisfies an exclusion restriction prohibiting it from being a (direct) cause of Y , and our notation reflects this. For example, initial assignment to treatment is often used as an instrumental variable in experimental settings with non-compliance.
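The observational structure just described can be mimicked in a few lines. The following sketch (purely illustrative; all distributions and type shares are made up) draws latent (Y 1 , Y 0 , T , Z ) and reveals only (Y , D, Z ) with Y = DY 1 + (1 − D)Y 0 :

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Latent quantities: potential outcomes and a selection type (no defiers here).
typ = rng.choice(["c", "n", "a"], size=n, p=[0.5, 0.3, 0.2])
y1 = rng.normal(loc=1.0, scale=1.0, size=n)   # outcome under treatment
y0 = rng.normal(loc=0.0, scale=1.0, size=n)   # outcome without treatment
z = rng.integers(0, 2, size=n)                # binary instrument

# Selection: compliers follow the instrument; never/always-takers ignore it.
d = np.where(typ == "c", z, np.where(typ == "a", 1, 0))

# The analyst observes only (Y, D, Z); the counterfactual outcome stays hidden.
y = d * y1 + (1 - d) * y0
```

Each unit's unobserved outcome is never revealed, which is exactly why only set (rather than point) identification of the marginal distributions is available without further assumptions.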
We denote a conditional distribution of (Y , D) given Z by

$$\begin{aligned}
P_{Y_1}(B) &\equiv \Pr(\{Y \in B\} \cap \{D = 1\} \mid Z = 1), & P_{Y_0}(B) &\equiv \Pr(\{Y \in B\} \cap \{D = 0\} \mid Z = 1), \\
Q_{Y_1}(B) &\equiv \Pr(\{Y \in B\} \cap \{D = 1\} \mid Z = 0), & Q_{Y_0}(B) &\equiv \Pr(\{Y \in B\} \cap \{D = 0\} \mid Z = 0),
\end{aligned} \qquad (1)$$

where B is an arbitrary subset of Y. Except for the marginal distribution of Z , P = (P Y 1 (·), P Y 0 (·)) and Q = (Q Y 1 (·), Q Y 0 (·)) uniquely characterize the distribution of data. We represent the data generating process by (P, Q ) ∈ P, where P is the class of data generating processes. Throughout our analysis, we do not restrict the class of data generating processes P other than to assume the existence of probability density functions with respect to the dominating measure µ, which the researcher has knowledge of. We denote the probability sub-density functions of P Y j (·) and Q Y j (·) with respect to µ by p Y j (·) and q Y j (·), j = 1, 0. That is, for every subset B, we have

$$P_{Y_j}(B) = \int_B p_{Y_j}(y)\, d\mu(y), \qquad Q_{Y_j}(B) = \int_B q_{Y_j}(y)\, d\mu(y), \qquad j = 1, 0.$$

It is important to keep in mind that integrating the sub-density functions p Y j (·) and q Y j (·) over Y yields the conditional probabilities Pr(D = j|Z = 1) and Pr(D = j|Z = 0), which can be less than one. Sub-distribution functions (the integral of a sub-density function over subsets) are common in competing risks analysis, where they are often alternatively referred to as cumulative incidence functions.
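For a discrete outcome, sample analogues of the sub-distributions are simple conditional cell frequencies. A self-contained sketch with simulated binary data (the data generating process is made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
z = rng.integers(0, 2, size=n)                     # binary instrument
d = (rng.random(n) < 0.3 + 0.5 * z).astype(int)    # compliance depends on Z
y = (rng.random(n) < 0.2 + 0.4 * d).astype(int)    # binary observed outcome

def sub_prob(y_val, d_val, z_val):
    """Empirical analogue of Pr(Y = y, D = d | Z = z)."""
    sel = z == z_val
    return np.mean((y[sel] == y_val) & (d[sel] == d_val))

# P corresponds to Z = 1 and Q to Z = 0; summing a sub-density over the
# outcome support yields Pr(D = j | Z = z), which can be less than one.
p_y1 = [sub_prob(y_val, 1, 1) for y_val in (0, 1)]
q_y1 = [sub_prob(y_val, 1, 0) for y_val in (0, 1)]
```

The identification analysis in the paper treats these sub-densities as known; in practice they would be estimated as above.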
Our identification framework features a selection equation with unobserved selection heterogeneity V ,

$$D = 1\{u(Z, V) \ge 0\}.$$

Here, u(Z , V ) is latent utility that rationalizes the individual's choice of treatment status, and V is unobserved heterogeneity that affects the individual's choice and is possibly dependent on the potential outcomes. We interpret this equation as structural in the sense that, with V fixed, u(z, V ) yields a counterfactual treatment status for each z = 1, 0. Provided that D and Z are binary, there are at most four distinct selection behaviors, which we refer to as types. The role of the unobserved heterogeneity V is to randomly categorize individuals into one of these four types. A random category variable T is used to indicate type (Angrist et al., 1996), with

$$T = \begin{cases} c \ (\text{complier}) & \text{if } u(1, V) \ge 0 \text{ and } u(0, V) < 0, \\ n \ (\text{never-taker}) & \text{if } u(1, V) < 0 \text{ and } u(0, V) < 0, \\ a \ (\text{always-taker}) & \text{if } u(1, V) \ge 0 \text{ and } u(0, V) \ge 0, \\ d \ (\text{defier}) & \text{if } u(1, V) < 0 \text{ and } u(0, V) \ge 0. \end{cases}$$

If we do not impose any restriction on the distribution of T , then we are also free of any assumption on the functional form of the latent utility and on the dimensionality of the unobserved heterogeneity V (Pearl, 1995b).
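The mapping from counterfactual treatment decisions to the four types can be written directly. A minimal sketch, where the latent utility functions passed in are hypothetical:

```python
def selection_type(u):
    """Classify a unit by its counterfactual treatment pair (D at z=1, D at z=0).

    `u(z)` is the unit's latent utility of treatment at instrument value z;
    the unit takes treatment when the utility is non-negative.
    """
    d1, d0 = int(u(1) >= 0), int(u(0) >= 0)
    return {(1, 0): "c",   # complier
            (0, 0): "n",   # never-taker
            (1, 1): "a",   # always-taker
            (0, 1): "d"}[(d1, d0)]

# With a hypothetical utility u(z, v) = z + v: units with v >= 0 always take
# treatment, units with -1 <= v < 0 comply, and units with v < -1 never do.
```

The classification depends on V only through the pair of counterfactual decisions, which is why no restriction on the functional form of u or the dimension of V is needed.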
Every individual in the population of interest possesses a non-random value of (Y 1 , Y 0 , T , Z ) and the parameter of interest is defined on the distribution of (Y 1 , Y 0 , T , Z ). We define the population as a joint probability distribution of (Y 1 , Y 0 , T , Z ) ∈ Y×Y×{c, n, a, d}×{1, 0}. Hereafter, f denotes the probability density or sub-density function of population variables, distinguished by subscripts such as f Y 1 , f Y 1 ,T |Z , etc. We use F to denote the class of populations. In the following analysis, equalities or inequalities for density or sub-density functions are interpreted as almost everywhere with respect to the measure µ.

Defining the identification region
Model restrictions take the form of statistical relationships for the population random variables (Y 1 , Y 0 , T , Z ). Let M be the model restriction(s) and let F M ⊂ F be the sub-class of populations satisfying the imposed restriction M.
For each data generating process (P, Q ) ∈ P, the class of observationally equivalent populations F o (P, Q ) ⊂ F is defined as the collection of distributions of (Y 1 , Y 0 , T , Z ) that generate (P, Q ). Given a particular data generating process (P, Q ), the identification region under restriction M, which we denote by IR(P, Q |M), is defined as the set of populations that are compatible with (P, Q ) and restriction M. That is, IR(P, Q |M) is formulated as the intersection of F M and F o (P, Q ),

$$IR(P, Q \mid M) = F_M \cap F_o(P, Q).$$

When IR(P, Q |M) is empty, restriction M is not compatible with observed data and is refutable (Manski, 2003). 1 If interest instead lies in θ : F → Θ, a feature or parameter of the population, then the identification region of θ under restriction M, which we denote by IR θ (P, Q |M), is defined as the range of θ(·) for the domain IR(P, Q |M). When IR(P, Q |M) is empty, we also define IR θ (P, Q |M) as empty so as to reflect the refutability property of the identification region. As such, the identification region of θ under restriction M is defined as

$$IR_\theta(P, Q \mid M) = \{\theta(F) : F \in IR(P, Q \mid M)\}. \qquad (2)$$

In words, IR θ (P, Q |M) is defined as the set of θ such that we can construct a population F that is compatible with (P, Q ) and the imposed restriction M.
Here, our construction of the identification region does not assume that the true population satisfies the imposed restriction M, which matters when M is observationally restrictive (Koopmans and Reiersøl, 1950). If we assume that the true population satisfies restriction M and M is observationally restrictive, we a priori exclude the possibility of IR(P, Q |M) being empty, even if data provides evidence to refute M. If we then derive sharp bounds on θ under the assumption that the true population satisfies restriction M, the bound formula and its sharpness break down if IR(P, Q |M) is empty. Moreover, the bound formula does not correspond to an empty set, despite the fact that IR(P, Q |M) is empty. This breakdown gives rise to a misspecification of the sharp bounds for θ. As we discuss further in Section 4, the bounds on the Average Treatment Effect under instrument independence provide an example of this type of misspecification problem. In order to avoid such a misspecification problem, we do not vary the class of data generating processes P, and we construct the bounds for each restriction that we impose by explicitly applying definition (2).

1 Since this rule for refuting restriction M is based on emptiness of IR(P, Q |M), no other testable implication is more powerful in detecting violations of M.

Instrumental variable restrictions
We consider the following three model restrictions in turn.

Restriction MSI:
Marginal Statistical Independence Restriction: Z is marginally independent of each of Y 1 and Y 0 .

Restriction RA:
Random Assignment Restriction: Z is jointly independent of (Y 1 , Y 0 , T ).

Restriction LATE:
LATE Restriction (Imbens and Angrist, 1994): Z is jointly independent of (Y 1 , Y 0 , T ), and the selection response is weakly monotonic in the instrument: either u(1, V ) ≥ u(0, V ) with probability one or u(1, V ) ≤ u(0, V ) with probability one (equivalently, either no defiers or no compliers exist in the population).
The notion of instrument exogeneity is represented in all three model restrictions by statistical independence of the potential outcomes and the instrument. The restrictions are nested and are listed in terms of their strength, from weak to strong. The first restriction, MSI, imposes marginal independence between the instrument and each of the potential outcomes. Since selection heterogeneity T is unaffected by the model restriction, the analysis corresponding to this case is robust to dependence between the instrument and selection heterogeneity. 2 The second restriction, RA, embodies a stronger version of instrument exogeneity such that the instrument is jointly independent of both outcome heterogeneity and selection heterogeneity. RA is justified if the researcher believes that the instrument is generated through some randomization mechanism as in the (quasi-) experimental setting. The final restriction, LATE, is due to Imbens and Angrist (1994) and Angrist et al. (1996), and is crucial to identifying the potential outcome distributions for the sub-population of compliers.
We assert that, although MSI is theoretically interesting, it is of limited practical use. 3 It is difficult to think of instances where MSI can be justified but RA cannot. Nonetheless, we study MSI here for its simplicity, as a stepping-stone to analysis under RA and LATE.
Our primary interest lies in identifying f Y 1 and f Y 0 , the marginal distributions of Y 1 and Y 0 . The marginal distributions are of interest if the goal of analysis is to assess the effect of intervention by comparing various features of the marginal distributions of the potential outcomes. For example, the Average Treatment Effect is defined as the difference between the mean of f Y 1 and that of f Y 0 . As a further example, we may be interested in the τ -th quantile difference, defined as the difference between the τ -th quantiles of the two potential outcome distributions. As a final example, we may be interested in the effect of intervention on the inequality of outcomes, and so in the variances of f Y 1 and f Y 0 or in some other measure of inequality of outcome such as the Gini index. In all three examples, the parameters of interest are defined in terms of the marginal distributions of Y 1 and Y 0 . We focus on constructing the sharp identification region of f Y 1 and f Y 0 , which we denote by IR (f Y 1 ,f Y 0 ) (P, Q |·), instead of the identification region for the (full) population distribution. We note that our focus on the marginal distributions is less useful if interest instead lies in a parameter defined on the distribution of the individual causal effect Y 1 − Y 0 , as that distribution is sensitive not only to the marginals of Y 1 and Y 0 but also to the dependence between Y 1 and Y 0 . Identification of the distribution of Y 1 − Y 0 is beyond the scope of this paper. 4
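These three functionals can be computed from the two marginal distributions alone. A minimal sketch in Python, using hypothetical draws from f Y 1 and f Y 0 (as if each treatment arm had been applied to the whole population; all parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
y1_draws = rng.normal(1.0, 1.5, size=100_000)   # hypothetical draws from f_{Y1}
y0_draws = rng.normal(0.0, 1.0, size=100_000)   # hypothetical draws from f_{Y0}

# Average Treatment Effect: difference of means.
ate = y1_draws.mean() - y0_draws.mean()
# Median (tau = 0.5) quantile difference.
qte = np.quantile(y1_draws, 0.5) - np.quantile(y0_draws, 0.5)
# Effect of intervention on the spread of outcomes.
var_effect = y1_draws.var() - y0_draws.var()
```

None of these quantities uses the joint distribution of (Y 1 , Y 0 ); a parameter such as Pr(Y 1 > Y 0 ) would, and so falls outside this analysis.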

Construction of the identification region
For the construction of IR (f Y 1 ,f Y 0 ) (P, Q |·), our first step is to formulate the conditions for F ∈ F o (P, Q ) (i.e., for compatibility of a distribution F of (Y 1 , Y 0 , T , Z ) with observed data (P, Q )). These conditions are obtained by rewriting the right-hand side of the identities (1) in terms of the distribution of (Y 1 , Y 0 , T , Z ).
The law of total probability implies that

$$\begin{aligned}
p_{Y_1}(y_1) &= f_{Y_1,T|Z}(y_1, c \mid 1) + f_{Y_1,T|Z}(y_1, a \mid 1), \\
p_{Y_0}(y_0) &= f_{Y_0,T|Z}(y_0, n \mid 1) + f_{Y_0,T|Z}(y_0, d \mid 1), \\
q_{Y_1}(y_1) &= f_{Y_1,T|Z}(y_1, a \mid 0) + f_{Y_1,T|Z}(y_1, d \mid 0), \\
q_{Y_0}(y_0) &= f_{Y_0,T|Z}(y_0, c \mid 0) + f_{Y_0,T|Z}(y_0, n \mid 0),
\end{aligned} \qquad (3)$$

and that

$$f_{Y_j|Z}(y_j \mid z) = \sum_{t \in \{c,n,a,d\}} f_{Y_j,T|Z}(y_j, t \mid z), \qquad j = 1, 0, \; z = 1, 0. \qquad (4)$$

We use these identities to relate the distribution f Y j |Z to the distribution f Y j ,T |Z .

Identification region under marginal independence (MSI)
Under MSI, the conditional density of each potential outcome given the instrument coincides with its unconditional counterpart: f Y j |Z (· | z) = f Y j (·) for j = 1, 0 and z = 1, 0. Therefore, we substitute f Y 1 and f Y 0 (the unconditional densities) for f Y 1 |Z and f Y 0 |Z in the left-hand side of (4). We have

$$\begin{aligned}
f_{Y_1}(y_1) - p_{Y_1}(y_1) &= f_{Y_1,T|Z}(y_1, n \mid 1) + f_{Y_1,T|Z}(y_1, d \mid 1), \\
f_{Y_1}(y_1) - q_{Y_1}(y_1) &= f_{Y_1,T|Z}(y_1, c \mid 0) + f_{Y_1,T|Z}(y_1, n \mid 0), \\
f_{Y_0}(y_0) - p_{Y_0}(y_0) &= f_{Y_0,T|Z}(y_0, c \mid 1) + f_{Y_0,T|Z}(y_0, a \mid 1), \\
f_{Y_0}(y_0) - q_{Y_0}(y_0) &= f_{Y_0,T|Z}(y_0, a \mid 0) + f_{Y_0,T|Z}(y_0, d \mid 0).
\end{aligned} \qquad (5)$$

Given (P, Q ) ∈ P, any population contained in IR(P, Q |MSI) satisfies (3) and (5). That is, by noting that the right-hand side of every equation of (5) is non-negative, we find the necessary conditions f Y 1 ≥ max{p Y 1 , q Y 1 } and f Y 0 ≥ max{p Y 0 , q Y 0 }. The next proposition shows that these conditions are also sufficient: any f Y 1 and f Y 0 that lie above the density envelopes constitute the identification region of (f Y 1 , f Y 0 ). This result can be viewed as a straightforward extension to the treatment effect model of Corollary 2.2.1 of Manski (2003) for the missing data model.

Proposition 3.1 (Identification Region Under Marginal Independence). Denote the density envelopes by

$$\underline{f}_{Y_1}(y_1) \equiv \max\{p_{Y_1}(y_1), q_{Y_1}(y_1)\}, \qquad \underline{f}_{Y_0}(y_0) \equiv \max\{p_{Y_0}(y_0), q_{Y_0}(y_0)\},$$

and their integrals by

$$\delta_{Y_1} \equiv \int_{\mathcal{Y}} \underline{f}_{Y_1}\, d\mu, \qquad \delta_{Y_0} \equiv \int_{\mathcal{Y}} \underline{f}_{Y_0}\, d\mu.$$

Define the sets of probability density functions that cover the envelopes of Y 1 and Y 0 , respectively, by

$$F^{env}_{f_{Y_1}}(P, Q) \equiv \Big\{ f_{Y_1} : f_{Y_1} \ge \underline{f}_{Y_1} \text{ and } \int_{\mathcal{Y}} f_{Y_1}\, d\mu = 1 \Big\}, \qquad F^{env}_{f_{Y_0}}(P, Q) \equiv \Big\{ f_{Y_0} : f_{Y_0} \ge \underline{f}_{Y_0} \text{ and } \int_{\mathcal{Y}} f_{Y_0}\, d\mu = 1 \Big\}.$$

The identification region under MSI is non-empty if and only if δ Y 1 ≤ 1 and δ Y 0 ≤ 1, and is given by

$$IR_{(f_{Y_1}, f_{Y_0})}(P, Q \mid \text{MSI}) = F^{env}_{f_{Y_1}}(P, Q) \times F^{env}_{f_{Y_0}}(P, Q).$$

Proof. See Appendix A. ■
The density envelope f Y 1 provides the maximal identifying information for the Y 1 -distribution. Under MSI, each of the observed sub-densities p Y 1 and q Y 1 must be a part of the common underlying density of the treated outcome f Y 1 . An interpretation is that the density envelope then fills f Y 1 as much as is possible with the identified sub-densities p Y 1 and q Y 1 (and similarly for the untreated outcome). That implies that marginal independence does not provide a channel through which p Y 1 and q Y 1 contribute to identifying f Y 0 or through which p Y 0 and q Y 0 contribute to identifying f Y 1 . As such, we can, without loss of identifying information, separate identification analysis of f Y 1 from identification analysis of f Y 0 .
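For a discrete outcome, the envelope and its integral are elementary to compute. A sketch with hypothetical sub-densities on a three-point support:

```python
import numpy as np

# Hypothetical sub-densities of (Y, D = 1) given Z = 1 and Z = 0 on supp(Y) = {0, 1, 2}.
p_y1 = np.array([0.10, 0.25, 0.15])   # sums to Pr(D = 1 | Z = 1) = 0.50
q_y1 = np.array([0.05, 0.30, 0.10])   # sums to Pr(D = 1 | Z = 0) = 0.45

envelope = np.maximum(p_y1, q_y1)     # pointwise maximum: the density envelope
delta_y1 = envelope.sum()             # integrated envelope: 0.10 + 0.30 + 0.15 = 0.55

# MSI is refuted only if the integrated envelope exceeds one (here it does not);
# any density lying everywhere above `envelope` is a candidate for f_{Y1}.
```

The analogous computation for the untreated outcome gives the envelope and δ for Y 0 , and under MSI the two computations never interact.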
The refutability condition for marginal independence when both the outcome and treatment are binary coincides with the testability result for the instrument exclusion restriction analyzed in Bonet (2001) and Pearl (1995a). Manski (2003) obtained an analogous refutability condition in the context of missing data. 5

Identification region under random assignment (RA)
If we strengthen MSI to RA, we replace the conditional distributions that appear on the right-hand side of (3) and (5) with their unconditional equivalents. With this in mind, we claim 6 that a pair of marginal distributions (f Y 1 , f Y 0 ) belongs to IR (f Y 1 ,f Y 0 ) (P, Q |RA) if and only if there exist sub-density functions {f Y 1 ,T (·, t), f Y 0 ,T (·, t) : t = c, n, a, d} satisfying the marginal constraints

$$\sum_{t \in \{c,n,a,d\}} f_{Y_1,T}(y_1, t) = f_{Y_1}(y_1), \qquad \sum_{t \in \{c,n,a,d\}} f_{Y_0,T}(y_0, t) = f_{Y_0}(y_0), \qquad (6)$$

and the compatibility constraints

$$\begin{aligned}
f_{Y_1,T}(y_1, c) + f_{Y_1,T}(y_1, a) &= p_{Y_1}(y_1), \\
f_{Y_0,T}(y_0, n) + f_{Y_0,T}(y_0, d) &= p_{Y_0}(y_0), \\
f_{Y_1,T}(y_1, a) + f_{Y_1,T}(y_1, d) &= q_{Y_1}(y_1), \\
f_{Y_0,T}(y_0, c) + f_{Y_0,T}(y_0, n) &= q_{Y_0}(y_0), \\
\int f_{Y_1,T}(y_1, t)\, d\mu(y_1) &= \int f_{Y_0,T}(y_0, t)\, d\mu(y_0) \quad \text{for each } t.
\end{aligned} \qquad (7)$$

Subject to (6) and (7), we propose a compatible population 7 : for t = c, n, a, d and z = 1, 0,

$$f_{Y_1,Y_0,T|Z}(y_1, y_0, t \mid z) = \frac{f_{Y_1,T}(y_1, t)\, f_{Y_0,T}(y_0, t)}{f_T(t)}. \qquad (8)$$

By construction, the proposed population satisfies RA, and is compatible with the data generating process as it satisfies (3). The next proposition provides a closed-form expression of IR (f Y 1 ,f Y 0 ) (P, Q |RA).

5 Kitagawa (2010) considers estimation and inference for the integrated envelope parameter, so as to develop a specification test for instrument independence.

6 See Lemma A.1 in the Appendix for a formal justification of this claim.

Proof. See Appendix A. ■
The proof of this proposition, which is provided in Appendix A, proceeds by the method of ''guess and verify,'' and so the reader might think that the origins of the inequalities that appear in the definitions of F * f Y 1 (P, Q ) and F * f Y 0 (P, Q ) are rather obscure. In Appendix B, with the intent of providing intuition for this result, we present a geometric illustration of the additional identification gain of RA relative to MSI.
The above proposition makes clear that the identification region under RA can be strictly smaller than the identification region under MSI. In particular, such an identification gain can arise only if the data reveals λ Y 1 ≠ 1 − δ Y 0 , where λ Y 1 ≡ ∫ min{p Y 1 (y), q Y 1 (y)} dµ(y): in this case, F * f Y 1 (P, Q ) and F * f Y 0 (P, Q ) can be strictly smaller than F env f Y 1 (P, Q ) and F env f Y 0 (P, Q ) respectively, due to the inequality constraints appearing in their definitions. 8 For the case of 1 − δ Y 0 < λ Y 1 , the fact that the inequality in the definition of F * f Y 1 (P, Q ) involves δ Y 0 implies that p Y 0 and q Y 0 can contribute to identifying f Y 1 despite RA not explicitly constraining the association between Y 1 and Y 0 . 9 Symmetrically, for the case of 1 − δ Y 0 > λ Y 1 , p Y 1 and q Y 1 can contribute to identifying f Y 0 through the parameter λ Y 1 .
7 There are many ways to combine the densities of (Y 1 , T ) and (Y 0 , T ) to obtain the joint density of (Y 1 , Y 0 , T ). The one employed here is called the conditional independence coupling: the association of Y 1 and Y 0 satisfies Y 1 ⊥ Y 0 |T .

8 This condition is necessary but not sufficient: if λ Y 1 ≠ 1 − δ Y 0 then the identification region under RA can be strictly smaller than the identification region under MSI, but such an identification gain is not guaranteed. We discuss this further in Section 4 and in a supplementary appendix.

9 To be clear, RA constrains the association between Y 1 and Y 0 via the statistical independence condition. The use of "explicitly" here is intended to reflect the absence of any overt mechanism, such as a structural equation or a specified family of distributions, by which knowledge of one marginal distribution implies knowledge of the other.

10 We also draw the marginal distributions of the potential outcomes. There, the subgraphs of f Y 1 and f Y 0 are partitioned into (c(1), n(1), a(1), d(1)) and (c(0), n(0), a(0), d(0)) respectively. If the identification region of f Y 1 were F env f Y 1 (P, Q ), then the area of n(1), which equals 1 − δ Y 1 , would have to coincide with the fraction of never-takers; otherwise F env f Y 1 (P, Q ) could not be spanned by the Y 1 -distribution of never-takers f Y 1 ,T (·, n), the shape of which is not constrained by data. However, this would violate the third and fourth equations of (7), since the fraction of never-takers cannot be greater than the area of n(0), which is smaller than 1 − δ Y 1 for the drawn data generating process. Hence, the identification region for f Y 1 must be strictly smaller than F env f Y 1 (P, Q ). To summarize, the source of the identification gain of RA relative to MSI is that RA allows us to learn the feasible type distributions from the observed sub-densities of Y 0 , and these further constrain the feasible marginal distribution of Y 1 . 11
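The condition governing this identification gain is easy to evaluate in the discrete case. A sketch with hypothetical sub-densities, assuming λ Y 1 denotes the integral of the pointwise minimum of p Y 1 and q Y 1 , and δ Y 0 the integrated envelope of p Y 0 and q Y 0 :

```python
import numpy as np

# Hypothetical sub-densities on a common three-point support; within each
# instrument arm the treated and untreated sub-densities sum to one.
p_y1 = np.array([0.10, 0.25, 0.15]); p_y0 = np.array([0.20, 0.20, 0.10])
q_y1 = np.array([0.05, 0.30, 0.10]); q_y0 = np.array([0.25, 0.20, 0.10])

lam_y1 = np.minimum(p_y1, q_y1).sum()    # integral of the pointwise minimum
delta_y0 = np.maximum(p_y0, q_y0).sum()  # integrated envelope for Y0

# RA can shrink the region relative to MSI only when the two quantities differ.
gain_possible = not np.isclose(lam_y1, 1.0 - delta_y0)
```

Here the sub-densities are not nested, λ Y 1 = 0.40 while 1 − δ Y 0 = 0.45, so an identification gain from RA over MSI is possible (though, as footnote 8 notes, not guaranteed).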

Identification region under the LATE restriction
Proposition 3.2 makes clear that if the observed data satisfies 1 − δ Y 0 = λ Y 1 , then the difference between MSI and RA does not matter for identification of f Y 1 and f Y 0 . This condition is satisfied when the data generating process reveals nested sub-densities p Y 1 (y 1 ) ≥ q Y 1 (y 1 ) and q Y 0 (y 0 ) ≥ p Y 0 (y 0 ), or p Y 1 (y 1 ) ≤ q Y 1 (y 1 ) and q Y 0 (y 0 ) ≤ p Y 0 (y 0 ).
Nested sub-densities come into play once we consider imposing the LATE restriction.
The LATE restriction further constrains the population by eliminating one of the selection types from the population.
Specifically, in the case of Pr(D = 1|Z = 1) ≥ Pr(D = 1|Z = 0), the LATE restriction implies the no-defiers condition f T (T = d) = 0. Since analysis of the no-compliers case and the no-defiers case is symmetric, we consider the case of Pr(D = 1|Z = 1) ≥ Pr(D = 1|Z = 0) without loss of generality.
Under the LATE restriction (equivalent to RA plus the no-defiers condition), the constraints (7) simplify to

$$\begin{aligned}
f_{Y_1,T}(y_1, c) + f_{Y_1,T}(y_1, a) &= p_{Y_1}(y_1), \\
f_{Y_0,T}(y_0, n) &= p_{Y_0}(y_0), \\
f_{Y_1,T}(y_1, a) &= q_{Y_1}(y_1), \\
f_{Y_0,T}(y_0, c) + f_{Y_0,T}(y_0, n) &= q_{Y_0}(y_0), \\
\int f_{Y_1,T}(y_1, t)\, d\mu(y_1) &= \int f_{Y_0,T}(y_0, t)\, d\mu(y_0) \quad \text{for } t = c, n, a.
\end{aligned}$$

The first four of the above constraints imply that, when the population satisfies the LATE restriction, the data generating process must reveal nested sub-densities, since p Y 1 (y 1 ) − q Y 1 (y 1 ) = f Y 1 ,T (y 1 , c) ≥ 0 and q Y 0 (y 0 ) − p Y 0 (y 0 ) = f Y 0 ,T (y 0 , c) ≥ 0. This is equivalent to saying that observing non-nested sub-densities must yield an empty identification region under the LATE restriction. On the other hand, when data reveals nested sub-densities then, for every (f Y 1 , f Y 0 ) ∈ F env f Y 1 (P, Q ) × F env f Y 0 (P, Q ), we can uniquely solve the above constraints to obtain the (non-negative) sub-density functions of (Y 1 , T ) and (Y 0 , T ), and these can be combined to obtain the probability density function of (Y 1 , Y 0 , T ) independent of Z . Accordingly, any (f Y 1 , f Y 0 ) ∈ F env f Y 1 (P, Q ) × F env f Y 0 (P, Q ) belongs to the identification region under the LATE restriction. We summarize this discussion in the following proposition.

Proposition 3.3 (Identification Region Under the LATE Restriction). The identification region under the LATE restriction is non-empty if and only if the data generating process reveals nested sub-densities, p Y 1 ≥ q Y 1 and q Y 0 ≥ p Y 0 ; in that case,

$$IR_{(f_{Y_1}, f_{Y_0})}(P, Q \mid \text{LATE}) = F^{env}_{f_{Y_1}}(P, Q) \times F^{env}_{f_{Y_0}}(P, Q).$$

Proof.
A proof is given in the preceding paragraphs of this section. ■

If the data generating process reveals nested sub-densities, then the identification region under the LATE restriction coincides with the identification region under MSI. Moreover, since nested sub-densities satisfy 1 − δ Y 0 = λ Y 1 , the identification region under the LATE restriction also coincides with the identification region under RA. If nested sub-densities are not observed, then the LATE restriction is refuted, while the identification region under RA or MSI can still be non-empty. In other words, as far as the distributions of the potential outcomes are concerned, adding instrument monotonicity 12 to the instrument independence restriction only constrains the data generating process without helping us learn more about (f Y 1 , f Y 0 ) than under MSI or RA. In this sense, we can safely drop instrument monotonicity from the LATE restriction and still acquire the maximal identifying information for the potential outcome distributions. The refutability of the LATE restriction is not new in the literature: Heckman and Vytlacil (2005) demonstrate a testable implication of the LATE restriction, which is equivalent to the nested sub-density condition given here.

Bounding causal parameters
Since the analysis of the previous section does not rely on the choice of dominating measure µ, the constructed identification regions are applicable to discrete, continuous, unbounded, or multi-dimensional outcomes. Moreover, for a parameter (vector) θ that maps (f Y 1 , f Y 0 ) to Θ, we can compare the size of the sharp bounds of θ under the different model restrictions without explicitly computing them.
Theorem 1. Let θ be a parameter (vector) that maps (f Y 1 , f Y 0 ) to Θ. Then, for each layer of the data generating process (see Fig. 2), the sharp bounds of θ under MSI, RA, and the LATE restriction have the following properties. (A) If δ Y 1 > 1 or δ Y 0 > 1, then IR θ (P, Q |·) = ∅ for all of MSI, RA, and the LATE restriction.
Proof. By the definition of IR θ (P, Q |·) given in (2), Propositions 3.1, 3.2, and 3.3 directly imply the results. ■

Provided that the outcome is scalar with compact support Y = [y l , y u ], this theorem applies in particular to the sharp bounds of the Average Treatment Effect, ATE = E[Y 1 ] − E[Y 0 ]. In order to present a closed-form expression of the sharp ATE bounds, we define the α-th left- or right-trimming of a non-negative integrable function g : Y → R + , for α ∈ [0, ∫ Y g dµ]. Let y l (α) = inf{t ∈ Y : ∫ [y l ,t] g dµ ≥ α} and y r (α) = sup{t ∈ Y : ∫ [t,y u ] g dµ ≥ α}. We define the α-th left-trimming of g as

g l,α (y) = g(y) 1{y > y l (α)} + (∫ [y l ,y l (α)] g dµ − α) 1{y = y l (α)}/µ({y l (α)}),

and the α-th right-trimming of g as

g r,α (y) = g(y) 1{y < y r (α)} + (∫ [y r (α),y u ] g dµ − α) 1{y = y r (α)}/µ({y r (α)}),

where each second term is set to zero whenever the corresponding atom has µ-measure zero. The α-th (right-) left-trimming is obtained by trimming the (right-) left-tail part of the function g so that the trimmed mass is exactly equal to α. Note that if the underlying measure is atomic then the second terms on the right-hand sides of the above definitions can be non-zero, and these adjustment terms are needed to make the trimmed area exactly equal to α.
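For a discrete outcome, where the density is a vector of masses on a sorted grid, the trimming operation (including the partial trimming of the atom at the trimming point) can be sketched as follows; the function names are ours.

```python
def left_trim(masses, alpha, tol=1e-12):
    """alpha-th left-trimming of a discrete density: remove exactly mass alpha
    from the left tail, partially trimming the atom at the trimming point."""
    out = list(masses)
    remaining = alpha
    for i, m in enumerate(out):
        take = min(m, remaining)   # trim this atom fully, or partially at the end
        out[i] = m - take
        remaining -= take
        if remaining <= 0:
            break
    if remaining > tol:
        raise ValueError("alpha exceeds the total mass of the function")
    return out

def right_trim(masses, alpha, tol=1e-12):
    """alpha-th right-trimming: left-trim the reversed grid, then reverse back."""
    return left_trim(masses[::-1], alpha, tol)[::-1]
```

The partial removal at the trimming-point atom is the discrete counterpart of the adjustment terms above: it is what makes the trimmed mass exactly α when the measure is atomic.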
Proposition 4.1 (The Sharp ATE Bounds). Assume that Y 1 and Y 0 have compact support Y = [y l , y u ] and that their marginal distributions are absolutely continuous with respect to the measure µ that allows point mass at y l and y u . Further assume that the data generating process satisfies δ Y 1 ≤ 1 and δ Y 0 ≤ 1 so as to exclude Case (A) of Theorem 1.

(i) The sharp ATE bounds under MSI are
and, for 1 − δ Y 0 > λ Y 1 ,

(iii) The sharp ATE bounds under the LATE restriction are, for nested sub-densities,

and ∅ otherwise.

Proof. See Appendix A. ■
The identification region for (f Y 1 , f Y 0 ) under MSI or RA collapses to a singleton if and only if δ Y 1 = 1 and δ Y 0 = 1, and this condition determines whether the ATE is non-parametrically (point-) identified or not. We emphasize that this condition is weaker than the well-known argument of identification at infinity (Chamberlain, 1986; Heckman, 1990). Whereas identification at infinity requires that the propensity score be zero or one at some instrument values, the above condition on the integrated envelopes can be satisfied even when the propensity score is bounded away from zero and one at every instrument value. However, when (P, Q ) reveals nested sub-densities, the integrated envelopes equal the maximum propensity score (or one minus it), and identification is then attained only at infinity.
When the data generating process reveals 1 − δ Y 0 ≠ λ Y 1 , the ATE bounds under RA can be strictly narrower than the bounds under MSI. Absolute continuity of the sub-density functions p Y j (·) and q Y j (·) with respect to the Lebesgue measure over Y is a sufficient condition for the identification region under RA to be strictly smaller than the identification region under MSI, and for the ATE bounds under RA to be strictly narrower. If, instead, these functions have point mass, then such an identification gain is not guaranteed. The conditions under which strengthening MSI to RA delivers strictly narrower bounds, and the source of this identification gain, are discussed in a supplementary appendix. When (P, Q ) reveals nested sub-densities, the sharp ATE bounds are given by (9), irrespective of the imposed restrictions, as claimed in Theorem 1. Moreover, with nested sub-densities, (9) reduces to the expression in (iii) of Proposition 4.1. This expression is identical to the ATE bounds of Manski (1994) under the mean independence restriction. This observation supports the result of Heckman and Vytlacil (1999, 2001), which says that the sharp ATE bounds under the LATE restriction coincide with Manski's mean independence bounds. However, this statement is no longer valid if the data reveal non-nested sub-densities. Furthermore, a naïve implementation of the expression of the ATE bounds under the LATE restriction does not necessarily yield the empty set even if IR (f Y 1 ,f Y 0 ) (P, Q |LATE) is empty. Given this misspecification problem, there is a clear advantage to stating the model and its associated identification region explicitly, rather than working solely with the expression for the ATE bounds.
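For reference, Manski's mean-independence bounds mentioned above take the following standard form for a scalar outcome with support [y l , y u ]; this is a sketch in our own notation, with E[YD | Z = z] abbreviating E[Y · 1{D = 1} | Z = z].

```latex
\sup_{z}\Big\{\mathbb{E}[YD \mid Z=z] + y_l \Pr(D=0 \mid Z=z)\Big\}
- \inf_{z}\Big\{\mathbb{E}[Y(1-D) \mid Z=z] + y_u \Pr(D=1 \mid Z=z)\Big\}
\;\le\; \mathrm{ATE} \;\le\;
\inf_{z}\Big\{\mathbb{E}[YD \mid Z=z] + y_u \Pr(D=0 \mid Z=z)\Big\}
- \sup_{z}\Big\{\mathbb{E}[Y(1-D) \mid Z=z] + y_l \Pr(D=1 \mid Z=z)\Big\}
```

Each bracketed term bounds E[Y 1 ] or E[Y 0 ] by imputing the worst-case support endpoint to the unobserved potential outcome, and mean independence allows taking the sup or inf over instrument values.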
In the special case where the outcome variables are binary, the sharp ATE bounds under RA that are presented above coincide with the treatment effect bounds of Balke and Pearl (1997) (a proof of this claim is provided in a supplementary appendix). Since the analysis of Balke and Pearl (1997) relies on a linear optimization procedure with a finite number of choice variables, such an approach cannot be directly applied to the case in which the outcome variable has continuous variation. Thus, the bound formula obtained here can be seen as a non-trivial generalization of the Balke and Pearl bounds to a more general case (see Gunsilius, 2020a for more recent advances).
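The linear-optimization construction for the binary-outcome case can be sketched numerically. The following is a minimal illustration in the spirit of Balke and Pearl (1997), not their procedure verbatim: all names are ours, and it optimizes the ATE over the 16 joint probabilities of (selection type, Y 1 , Y 0 ) subject to matching the observed distribution of (Y, D) given Z.

```python
import numpy as np
from scipy.optimize import linprog

# D as a function of Z for each selection type: complier, never-, always-taker, defier.
SELECTION = {"c": lambda z: z, "n": lambda z: 0, "a": lambda z: 1, "d": lambda z: 1 - z}

def ate_bounds(p_obs):
    """Sharp ATE bounds for binary Y, D, Z under joint instrument independence,
    via linear programming over response-type probabilities pi(t, y1, y0).
    p_obs[z][(y, d)] = P(Y = y, D = d | Z = z). Illustrative sketch only."""
    types = [(t, y1, y0) for t in "cnad" for y1 in (0, 1) for y0 in (0, 1)]
    A_eq, b_eq = [], []
    for z in (0, 1):
        for y in (0, 1):
            for d in (0, 1):
                # model-implied P(Y=y, D=d | Z=z): sum pi over consistent types
                row = [1.0 if SELECTION[t](z) == d and (y1 if d else y0) == y else 0.0
                       for (t, y1, y0) in types]
                A_eq.append(row)
                b_eq.append(p_obs[z][(y, d)])
    c = np.array([y1 - y0 for (_, y1, y0) in types], dtype=float)  # ATE objective
    lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    hi = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    if not (lo.success and hi.success):
        return None  # infeasible: the data refute joint independence
    return lo.fun, -hi.fun
```

An infeasible program corresponds to an empty identification region, mirroring the refutability discussion above; the finite number of choice variables is exactly why this approach does not extend directly to continuous outcomes.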
As is discussed elsewhere, the potential outcomes have a structural equation analog (see Athey and Imbens, 2006; Chernozhukov and Hansen, 2005; Pearl, 2009), and requiring that this equation be monotonic can lead to substantial identification gains. Like Balke and Pearl (1997), we do not rely on any assumption about the functional form of the structural equation analog, nor on the dimension or distribution of the unobserved heterogeneity that it features. In contrast, the analyses of Chesher (2003, 2005) and Chernozhukov and Hansen (2005) impose what is referred to as outcome monotonicity in unobservables or rank invariance, which necessarily restricts the structural equation and the unobserved heterogeneity. 13 In the special case where the outcome variables are binary, Chesher (2010) obtains bounds on the Average Treatment Effect that are substantially narrower than the ones presented in this paper (Hahn, 2010). Moreover, in the continuous outcome case, Chernozhukov and Hansen (2005) show that rank invariance and random assignment lead to (point-) identification of the potential outcome distributions. In each case, the imposed assumption limits individual behavior through the association of the potential outcomes and requires justification that it is appropriate for the studied economic environment.

Concluding remarks
With partial identification in mind, this paper clarifies the identifying power of instrument independence assumptions in the heterogeneous treatment effect model. We derive the identification regions of the marginal distributions of the potential outcomes under each restriction that we consider, and compare their size. For some data generating processes, strengthening instrument independence from marginal independence to joint independence results in a tightening of the identification region. We clarify which data generating processes exhibit this property and which do not. We find that instrument monotonicity is redundant for identification of the potential outcome distributions when assumed in conjunction with instrument independence, since monotonicity constrains the data generating process without further identifying the potential outcome distributions (see also Heckman and Vytlacil, 1999, 2001). We also present sharp bounds for the Average Treatment Effect under each restriction that we consider. Our analysis covers binary, discrete, and continuous support of the outcome of interest, and our bounds under joint independence amount to a generalization of the bounds of Balke and Pearl (1997) from the binary outcome case to the continuous outcome case.

Acknowledgments
An earlier version of this paper appears in Kitagawa (2009).

Appendix A. Proofs
Proofs for constructing IR (f Y 1 ,f Y 0 ) (P, Q |·) proceed in the manner of ''guess and verify''. We first propose IR guess (f Y 1 ,f Y 0 ) (P, Q |·) as a guess for IR (f Y 1 ,f Y 0 ) (P, Q |·). In order to verify that the guess is correct, we need to show two things. Firstly, for an arbitrary element of IR guess (f Y 1 ,f Y 0 ) (P, Q |·), we show that there exists a distribution of (Y 1 , Y 0 , T , Z ) that is compatible with (P, Q ) and the imposed model restrictions. This first step proves IR guess (f Y 1 ,f Y 0 ) (P, Q |·) ⊆ IR (f Y 1 ,f Y 0 ) (P, Q |·) (e.g., the proof of Proposition 3.1). Secondly, we show the reverse inclusion; for instance, we may demonstrate that any element outside IR guess (f Y 1 ,f Y 0 ) (P, Q |·) delivers a contradiction of some of the imposed restrictions (e.g., the proof of Proposition 3.2). In either fashion, we can conclude IR (f Y 1 ,f Y 0 ) (P, Q |·) ⊆ IR guess (f Y 1 ,f Y 0 ) (P, Q |·). By combining the two inclusions, we conclude that the guess is correct, IR (f Y 1 ,f Y 0 ) (P, Q |·) = IR guess (f Y 1 ,f Y 0 ) (P, Q |·). Throughout the proofs, we do not explicitly state µ-a.e., but any equalities or inequalities between density functions should be interpreted in the almost-everywhere sense with respect to the measure µ.
Proof of Proposition 3.1. Fix (P, Q ) ∈ P, and guess the identification region under MSI to be IR guess (f Y 1 ,f Y 0 ) (P, Q |MSI). This set is non-empty if and only if δ Y 1 ≤ 1 and δ Y 0 ≤ 1, as otherwise no probability density functions can cover the entire density envelopes. Pick an arbitrary element (f Y 1 , f Y 0 ) of this set, and consider the following distribution of (Y 1 , Y 0 , T ) given Z .
By integrating out y 1 or y 0 from these densities, we can see that the constructed population meets the constraints (3).
Furthermore, by plugging the constructed population densities into the identities f Y 1 |Z = ∑ t∈{c,n,a,d} f Y 1 ,T |Z (·, t|·), we can verify that the constructed population satisfies MSI, since the right-hand side of (5) is always non-negative. Hence, (f Y 1 , f Y 0 ) ∈ IR (f Y 1 ,f Y 0 ) (P, Q |MSI), and this completes the proof. ■

The following lemmata are used for the proofs of Propositions 3.2 and 4.1.
Lemma A.1. Let the data generating process (P, Q ) ∈ P be given. A pair of marginal probability density functions (f * Y 1 , f * Y 0 ) belongs to IR (f Y 1 ,f Y 0 ) (P, Q |RA) if and only if there exist non-negative functions {(h Y 1 ,t , h Y 0 ,t ), t = c, n, a, d} that satisfy the following constraints,

p Y 1 (y 1 ) = h Y 1 ,c (y 1 ) + h Y 1 ,a (y 1 ),

Proof of Lemma A.1. The ''only if'' part is implied by (7) in the main text, by substituting h for f . So, we focus on proving the ''if'' part of the lemma. Given the non-negative functions {(h Y 1 ,t , h Y 0 ,t ), t = c, n, a, d} satisfying the above constraints, consider the conditional densities of (Y 1 , Y 0 , T ) given Z constructed as follows. By construction, the proposed population satisfies RA. Also, the constraint (A.1) and the construction of the population imply that the population reproduces the observed sub-density p Y 1 ; a similar result holds for p Y 0 , q Y 1 , and q Y 0 . Hence, the proposed population is compatible with the data generating process. Lastly, this way of constructing the population distribution yields the provided f * Y 1 as the population marginal distribution of Y 1 , since ∑ t=c,n,a,d h Y 1 ,t = f * Y 1 , as implied by the constraints (A.1) and (A.5). This is also the case for f * Y 0 and the population marginal distribution of Y 0 , as implied by the constraints (A.3) and (A.7). Thus, the given (f * Y 1 , f * Y 0 ) belongs to IR (f Y 1 ,f Y 0 ) (P, Q |RA). ■

Proof of Lemma A.2.

Proof of Proposition 3.2. As shown in Proposition 3.1, if the data generating process reveals δ Y 1 > 1 or δ Y 0 > 1, then no population is compatible with MSI, and this clearly implies that IR (f Y 1 ,f Y 0 ) (P, Q |RA) is empty. We therefore preclude this trivial case from the proof and focus on data generating processes with δ Y 1 ≤ 1 and δ Y 0 ≤ 1.
Firstly, consider a data generating process with 1 − δ Y 0 < λ Y 1 , and guess the identification region to be IR guess (f Y 1 ,f Y 0 ) (P, Q |RA). Pick an arbitrary f Y 1 from F * f Y 1 (P, Q ) and an arbitrary f Y 0 from F env f Y 0 (P, Q ). Define a non-negative function as in (A.13), and consider the choice of {(h Y 1 ,t , h Y 0 ,t ), t = c, n, a, d} given in (A.14). Since the first multiplicative term on the right-hand side of (A.13) is less than or equal to one, the functions {h Y 1 ,t (y 1 ), t = c, n, a, d} constructed above are all non-negative. It can be seen that the constraints (A.1) through (A.8) are all satisfied. Also, by utilizing Lemma A.2, we can confirm that the area constraints (A.9) through (A.12) are satisfied. By Lemma A.1, we conclude that the proposed (f Y 1 , f Y 0 ) belongs to IR (f Y 1 ,f Y 0 ) (P, Q |RA), and hence IR guess (f Y 1 ,f Y 0 ) (P, Q |RA) ⊆ IR (f Y 1 ,f Y 0 ) (P, Q |RA).
Next, consider f Y 1 that does not belong to F * f Y 1 (P, Q ). In order to find a contradiction of RA, suppose that non-negative functions {(h Y 1 ,t , h Y 0 ,t ), t = c, n, a, d} satisfying the constraints (A.1) through (A.8) exist. Then, the constraints (A.7) and (A.8) imply a restriction that, since f Y 1 ∉ F * f Y 1 (P, Q ), cannot hold.

By applying the decomposition (A.14) proposed in the proof of Proposition 3.2, we can decompose f lower Y 1 into non-negative functions {h lower Y 1 ,t , t = c, n, a, d}; in particular, we obtain h lower Y 1 ,a and h lower Y 1 ,n for t = a and t = n. Note that by using h lower Y 1 ,t , t = c, n, a, d, we can express f lower Y 1 accordingly, where in the second line we use the constraints (A.1) and (A.2). Let f̃ Y 1 be an arbitrary element of F * f Y 1 (P, Q ). By Lemma A.1 and Proposition 3.2, there exist non-negative functions {h̃ Y 1 ,t , t = c, n, a, d} such that f̃ Y 1 can be represented as in (A.18), and, again by applying the decomposition (A.14) of the proof of Proposition 3.2, h̃ Y 1 ,a and h̃ Y 1 ,n can be expressed in closed form. Regarding the second term of (A.19), as f̃ Y 1 − f Y 1 − g Y 1 ≥ 0, it can be bounded from above. Regarding the third term of (A.19), if t is strictly less than the (1 − δ Y 0 )-th right-trimming point of q Y 1 , the integral is non-negative; on the other hand, if t is greater than or equal to this trimming point, a similar argument applies.

Step 1 - Imputation of h Y 1 ,a and h Y 0 ,a .