Classification and Reconstruction of High-Dimensional Signals from Low-Dimensional Noisy Features in the Presence of Side Information

This paper offers a characterization of fundamental limits in the classification and reconstruction of high-dimensional signals from low-dimensional features, in the presence of side information. In particular, we consider a scenario where a decoder has access both to noisy linear features of the signal of interest and to noisy linear features of the side information signal; while the side information may be in a compressed form, the objective is recovery or classification of the primary signal, not the side information. We assume the signal of interest and the side information signal are drawn from a correlated mixture of distributions/components, where each component associated with a specific class label follows a Gaussian mixture model (GMM). By considering bounds to the misclassification probability associated with the recovery of the underlying class label of the signal of interest, and bounds to the reconstruction error associated with the recovery of the signal of interest itself, we then provide sharp sufficient and/or necessary conditions for the phase transition of these quantities in the low-noise regime. These conditions, which are reminiscent of the well-known Slepian-Wolf and Wyner-Ziv conditions, are a function of the number of linear features extracted from the signal of interest, the number of linear features extracted from the side information signal, and the geometry of these signals and their interplay. Our framework, which offers a principled mechanism to integrate side information in high-dimensional data problems, is also tested in the context of imaging applications. In particular, we report state-of-the-art results in compressive hyperspectral imaging applications, where the accompanying side information is a conventional digital photograph.

Manifold models, for example, support perfect recovery with a number of projections that grows linearly with the dimension s of the manifold and logarithmically with the product of the signal size n and parameters that characterize the volume and the regularity of the manifold [37].
However, it is often the case that the encoder, the decoder, or both are also presented with additional information, known as side information, that goes beyond signal structure and takes the form of another signal exhibiting some correlation with the signal of interest. The key question concerns how to leverage side information to enhance the classification and reconstruction of high-dimensional signals from low-dimensional features. This paper studies this question by using models that capture key attributes of high-dimensional signals, namely the fact that such signals often live on a union of low-dimensional subspaces or affine spaces, or on a union of approximately low-dimensional spaces. The high-dimensional signal to be measured and the side information are assumed to have distinct low-dimensional representations of this type, with shared or correlated latent structure.

A. Related Work
Our problem connects to source coding with side information and distributed source coding, as the number of features extracted from high-dimensional signals can be related to the compression rate, whereas performance metrics for classification and reconstruction can be related to distortion. The foundations of distributed source coding theory were laid by Slepian and Wolf [40], whereas those of source coding with side information were laid by Ahlswede and Körner [41] and by Wyner and Ziv [42]. Namely, [40] characterized the rates at which two discrete sources can be compressed independently while guaranteeing lossless reconstruction at the decoder. Perhaps surprisingly, the rates associated with independent compression at the two sources are shown to be identical to those associated with joint compression at the encoders. On the other hand, [41] determined the rate at which a discrete source can be compressed losslessly in the presence of coded side information. In the lossy compression case, Wyner and Ziv [42] proposed an encoding scheme that achieves the optimal tradeoff between compression rate and distortion when side information is available at the decoder. In contrast with the result in [40], they proved that lossy compression without side information at the encoder suffers in general a rate loss compared to lossy compression with side information at both the encoder and the decoder [43]. However, this loss vanishes for memoryless Gaussian sources under a squared-error distortion metric [42].
Our problem also relates to the problems of compressive sensing with side information/prior information [44]-[49], distributed compressive sensing [50]-[56] and multi-task compressive sensing [57]. The problem of compressive sensing with side information or prior information entails the reconstruction of a sparse signal in the presence of partial information about the desired signal, using reconstruction algorithms akin to those from CS. For example, [44], [45] consider the reconstruction of a signal by leveraging partial information about the support of the signal at the decoder; [46] considers the reconstruction of the signal by using an additional noisy version of the signal at the decoder. [47] takes the side information to be associated with previous scans of a given subject in dynamic tomographic imaging. In this case, ℓ1-norm based minimization is used for recovery, with an additional term that accounts for the distance between the recovered image and the side information snapshot.
A similar approach has been adopted recently in [48], which is shown to require fewer measurements than traditional CS in recovering magnetic resonance images. A theoretical analysis of the number of measurements sufficient for reliable recovery with high probability in the presence of side information, for both ℓ1/ℓ1 and mixed ℓ1/ℓ2 reconstruction strategies, is provided in [49].
The problem of distributed compressive sensing, which has been considered by [50]-[56], involves the joint reconstruction of multiple correlated sparse signals. In [50], [51], necessary and sufficient conditions on the minimum number of measurements needed for perfect recovery (via ℓ0-norm minimization) are derived. Multiple signals are described there via joint sparsity models that involve a common component shared by all signals and innovation components specific to each signal. [53] also provides conditions on the number of measurements for approximately zero-distortion recovery using an inversion procedure based on a generalized, multi-terminal approximate message passing (AMP) algorithm. Reconstruction via AMP methods for distributed CS was also considered in [54], where the minimum number of measurements needed for successful signal recovery was derived assuming that measurements extracted from different signals are spatially coupled. Reconstruction via ℓ1-norm minimization is considered in [55], where restricted isometry property (RIP) conditions for block-diagonal, random linear projection matrices are discussed. Namely, such matrices are shown to satisfy the RIP if the total number of rows scales linearly with the signal sparsity s and poly-logarithmically with the signal ambient dimension n. [56] considers the problem of distributed recovery of two signals that are related through a sparse time-domain filtering operation, and derives sufficient conditions on the number of samples needed for reliable recovery, as well as a computationally efficient reconstruction algorithm.
Multi-task compressive sensing [57] involves the description of multiple signals through a hierarchical Bayesian framework, where a prior is imposed on the wavelet coefficients for the different signals. Such a prior is inferred statistically from features extracted from the data and then used in the recovery process, thus demonstrating reconstruction reliability and robustness with various types of experimental data.

B. Contributions
This paper studies the impact of side information on the classification and reconstruction of a high-dimensional signal from low-dimensional, noisy, linear and random features, by assuming that both the signal of interest and the side information are drawn from a joint Gaussian mixture model (GMM). Unlike distributed and multi-task CS, here we are generally only interested in recovering or classifying the primary signal, and not necessarily interested in recovering the underlying side information that is represented compressively.
There are multiple reasons for adopting a GMM representation, which is often used in conjunction with the Bayesian CS formalism [58]:
• A GMM model represents the Bayesian counterpart of well-known high-dimensional signal models in the literature [32]-[35], [38]. In particular, signals drawn from a GMM can be seen to lie in a union of (linear or affine) subspaces, where each subspace is associated with the translation of the image of the (possibly low-rank) covariance matrix of each Gaussian component within the GMM. Moreover, low-rank GMM priors have been shown to approximate signals in compact manifolds [38]. Also, a GMM can represent complex distributions subject to mild regularity conditions [59].
• A GMM model has also been shown to provide state-of-the-art results in practical problems in image processing [60]- [62], dictionary learning [38], image classification [6] and video compression [63].
• Optimal inversion of GMM sources from noisy, linear features can be performed via a closed-form classifier or estimator, which has computational complexity proportional to the number of Gaussian classes within the GMM. Moreover, moderate numbers of classes have been shown to model reliably real-world data as, for example, patches extracted from natural images or video frames [5], [63], [64].
Of particular relevance, the adoption of GMM priors also offers an opportunity to analyze phase transitions in the classification or reconstruction error: in particular, and in line with the contributions in [64]- [67], it is possible to adopt wireless communications-inspired metrics, akin to the diversity gain or the measurement gain [68], [69], in order to characterize performance more finely in certain asymptotic regimes.
Our main contributions, which generalize the analysis carried out in [64], [67] to the scenario where the decoder has access to side information, include:
• The definition of a joint GMM model for both the signal of interest and the side information, which generalizes the joint sparsity models in [50], [51].
• Sufficient conditions for perfect signal classification in the asymptotic limit of low-noise that are a function of the geometry of the signal of interest, the geometry of the side information, their interaction, and the number of features.
• Sufficient and necessary conditions for perfect signal reconstruction in the asymptotic limit of low-noise that are also a function of the geometries of the signal of interest, the side information, as well as the number of features.
• A range of results that illustrate not only how theory aligns with practice, but also how to use the ideas in real-world applications, such as high-resolution image reconstruction and compressive hyperspectral imaging in the presence of side information (here a traditional photograph constitutes the side information).
These contributions differ from other contributions in the literature in various aspects. Unlike previous works on the characterization of the minimum number of measurements needed for reliable reconstruction in distributed compressive sensing [50], [51], our Bayesian framework allows consideration of signals of different sizes that are sparse over different bases; our model also allows characterization of phase transitions in the classification error and in the reconstruction error. In addition, and unlike previous studies in the literature associated with ℓ1-norm minimization or AMP algorithms for reconstruction, the analysis carried out in this work is also valid in the finite signal length regime, providing a sharp characterization of signal processing performance as a function of the number of features extracted from both the input and the side information. To the best of our knowledge, this work represents the first contribution in the context of structured or model-based CS to consider both classification and reconstruction of signals in the presence of side information.

C. Organization
The remainder of the paper is organized as follows: Section II defines the signal and the system model used throughout the article. Section III provides results for classification with side information, containing an analysis of an upper bound to the misclassification probability, which also leads to a characterization of sufficient conditions for the phase transition of the misclassification probability. Section IV provides results for reconstruction with side information, most notably sufficient and necessary conditions for the phase transition of the reconstruction error in the asymptotic limit of low noise; the sufficient and necessary conditions differ by a single measurement.
Section V highlights the relation between classification/reconstruction with side information and distributed classification/reconstruction. Numerical examples both with synthetic and real data are presented in Section VI. Finally, conclusions are drawn in Section VII. The Appendices contain the proofs of the main theorems.

D. Notation
In the remainder of the paper, we adopt the following notation: boldface upper-case letters denote matrices (X) and boldface lower-case letters denote column vectors (x); the context defines whether the quantities are deterministic or random. The symbols I_n and 0_{m×n} represent the identity matrix of dimension n × n and the all-zero matrix of dimension m × n, respectively (subscripts will be dropped when the dimensions are clear from the context). (·)ᵀ, tr(·) and rank(·) represent the transpose, trace and rank operators, respectively. (·)† represents the Moore-Penrose pseudoinverse of a matrix. Im(·) and Null(·) denote the (column) image and null space of a matrix, respectively, and dim(·) denotes the dimension of a linear subspace. E[·] represents the expectation operator. The Gaussian distribution with mean µ and covariance matrix Σ is denoted by N(µ, Σ). The symbol Cov(·) denotes the covariance matrix of a given random vector.

II. MODEL
We consider both the classification and reconstruction of a high-dimensional signal from noisy linear features in the presence of side information, as shown in Fig. 1. In particular, we assume that the decoder has access to a set of noisy linear features y₁ ∈ R^{m1} associated with the desired signal x₁ ∈ R^{n1}, given by:

y₁ = Φ₁x₁ + w₁, (1)

where Φ₁ ∈ R^{m1×n1} is the projection kernel (in the remainder of the paper, we use the terms projection/measurement/sensing kernel or matrix interchangeably) and w₁ ∼ N(0, I·σ²) is additive Gaussian noise that models possible distortion introduced by the feature extraction system (or model mismatch). We also assume that the decoder has access to another set of features y₂ ∈ R^{m2}, called side information, associated with another signal x₂ ∈ R^{n2} and given by:

y₂ = Φ₂x₂ + w₂, (2)

where Φ₂ ∈ R^{m2×n2} is the projection kernel associated with the side information and w₂ ∼ N(0, I·σ²) is additive Gaussian noise, which is assumed to have the same covariance as the noise w₁ for simplicity. For the sake of compact notation, we often re-write the models in (1) and (2) as:

y = Φx + w, (3)

where y = [y₁ᵀ, y₂ᵀ]ᵀ, x = [x₁ᵀ, x₂ᵀ]ᵀ, w = [w₁ᵀ, w₂ᵀ]ᵀ, and Φ is the block-diagonal matrix Φ = diag(Φ₁, Φ₂). We focus on random projection kernels, where both matrices Φ₁ and Φ₂ are assumed to be drawn from left rotation-invariant distributions.
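As a minimal illustration of the observation model, the numpy sketch below draws the features in (1)-(3); the dimensions, the noise level, and the Gaussian draw of the kernels are assumptions made for the example (Gaussian matrices are one instance of a left rotation-invariant distribution).

    import numpy as np

    rng = np.random.default_rng(0)
    n1, n2 = 20, 12          # signal and side-information dimensions (illustrative)
    m1, m2 = 8, 4            # numbers of linear features
    sigma2 = 1e-4            # noise variance sigma^2

    Phi1 = rng.standard_normal((m1, n1))   # projection kernel for x1, cf. (1)
    Phi2 = rng.standard_normal((m2, n2))   # projection kernel for x2, cf. (2)

    # Stacked form (3): y = Phi x + w, with block-diagonal Phi.
    Phi = np.block([[Phi1, np.zeros((m1, n2))],
                    [np.zeros((m2, n1)), Phi2]])

    def observe(x1, x2):
        """Draw (y1, y2) according to the models (1) and (2)."""
        y1 = Phi1 @ x1 + np.sqrt(sigma2) * rng.standard_normal(m1)
        y2 = Phi2 @ x2 + np.sqrt(sigma2) * rng.standard_normal(m2)
        return y1, y2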
We consider underlying class labels C₁ ∈ {1, ..., K₁} and C₂ ∈ {1, ..., K₂}, where C₁ is associated with the signal of interest x₁ and C₂ is associated with the side information signal x₂. We assume that x₁ and x₂, conditioned on the underlying class labels C₁ = i and C₂ = k, are drawn from a joint distribution p(x₁, x₂|C₁ = i, C₂ = k), with the class labels drawn from the joint pmf p_{C1,C2}(i, k). We assume that the decoder, for both classification and reconstruction purposes, knows perfectly the joint probability mass function (pmf) p_{C1,C2}(i, k) of the discrete random variables corresponding to the class labels of x₁ and x₂, and the conditional distributions p(x₁, x₂|C₁ = i, C₂ = k). For the problem of classification with side information, the objective is to estimate the value of the index C₁ that identifies the distribution/component from which x₁ was drawn, on the basis of the observation of both vectors y₁ and y₂. The minimum average error probability in classifying C₁ from y₁ and y₂ is achieved by the maximum a posteriori (MAP) classifier [1], given by

Ĉ₁ = arg max_{i ∈ {1,...,K₁}} p(C₁ = i|y₁, y₂), (7)

where p(C₁ = i|y₁, y₂) is the a posteriori probability of class C₁ = i conditioned on y₁ and y₂.
For the problem of reconstruction with side information, the objective of the decoder is to estimate the signal x₁ from the observation of y₁ and y₂. In particular, we consider reconstruction obtained via the conditional mean estimator

x̂₁(y₁, y₂) = E[x₁|y₁, y₂] = ∫ x₁ · p(x₁|y₁, y₂) dx₁, (8)

where p(x₁|y₁, y₂) is the posterior pdf of x₁ given the observations y₁ and y₂; this estimator minimizes the reconstruction error.
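For concreteness, a minimal sketch of how the MAP rule (7) can be evaluated under a joint Gaussian-mixture prior is given below; the per-class moments of y follow from (1)-(3), while the container types (a (K₁, K₂) array for the pmf and nested containers indexed by (i, k) for the moments) are illustrative assumptions.

    import numpy as np
    from scipy.stats import multivariate_normal

    def map_classify_c1(y, Phi, p_c, mu, Sigma, sigma2):
        """MAP estimate of C1 from the stacked features y, cf. (7).

        p_c   : (K1, K2) array with the joint pmf p_{C1,C2}(i, k)
        mu    : mu[i][k] is the stacked class mean (length n1 + n2)
        Sigma : Sigma[i][k] is the stacked class covariance Sigma_x^(ik)
        """
        m = Phi.shape[0]
        post = np.zeros(p_c.shape[0])
        for (i, k), p in np.ndenumerate(p_c):
            if p == 0:
                continue
            # Conditioned on (C1, C2) = (i, k), y is Gaussian with these moments.
            post[i] += p * multivariate_normal.pdf(
                y, Phi @ mu[i][k], Phi @ Sigma[i][k] @ Phi.T + sigma2 * np.eye(m))
        return int(np.argmax(post))  # the unnormalized posterior suffices for MAP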
We emphasize the key distinction from the previously studied problems of distributed [50], [51] and multi-task compressive sensing [57]: our goal is to recover x₁ or its label C₁, based upon the compressive measurements y₁ and y₂, while previous work considered joint recovery of x₁ and x₂ (or joint estimation of C₁ and C₂). Note that our theory allows the special case in which Φ₂ is the identity matrix, so that y₂ = x₂ and the side information is not measured compressively. In addition, and more closely connected to previous work, we will also consider the objectives of jointly determining the pair of index values (C₁, C₂) that identify the distributions/components from which x₁ and x₂ were drawn, in the case of classification, and of jointly estimating x₁ and x₂, in the case of reconstruction.
The solution to these latter problems follows immediately from the solution to the side information problems.
Moreover, as is evident from (3), the distributed classification and reconstruction problem can be mapped into a standard classification or reconstruction problem of a single signal, in which we are forcing the sensing matrix to obey a block diagonal structure.

A. Signal, Side Information and Correlation Models
The key aspect now relates to the definition of the signal, side information, and the respective correlation models.
In particular, we adopt a multivariate Gaussian model for the distribution of x₁ and x₂, conditioned on (C₁, C₂) = (i, k), i.e.,

p(x₁, x₂|C₁ = i, C₂ = k) = N(µ_x^(ik), Σ_x^(ik)), (9)

where

µ_x^(ik) = [(µ_x1^(ik))ᵀ, (µ_x2^(ik))ᵀ]ᵀ and Σ_x^(ik) = [Σ_x1^(ik), Σ_x12^(ik); (Σ_x12^(ik))ᵀ, Σ_x2^(ik)], (10)

so that µ_x1^(ik) and Σ_x1^(ik) are the mean and covariance matrix of x₁ conditioned on the pair of classes (i, k), respectively, µ_x2^(ik) and Σ_x2^(ik) are the mean and covariance matrix of x₂ conditioned on the pair of classes (i, k), respectively, and Σ_x12^(ik) is the cross-covariance matrix between x₁ and x₂ conditioned on the pair of classes (i, k).
The motivation for this choice lies in the fact that this apparently simple model can accommodate a wide range of signal distributions. In fact, note that the joint pdf of x₁ and x₂ follows a GMM:

p(x₁, x₂) = Σ_{i=1}^{K1} Σ_{k=1}^{K2} p_{C1,C2}(i, k) · N(µ_x^(ik), Σ_x^(ik)), (11)

so that we can in principle approximate very complex distributions by incorporating additional terms in the decomposition [59]. Note also that the conditional marginal pdfs of x₁ and x₂ also follow GMM models:

p(x₁|C₁ = i) = Σ_{k=1}^{K2} p_{C2|C1}(k|i) · N(µ_x1^(ik), Σ_x1^(ik)) (13)

and

p(x₂|C₂ = k) = Σ_{i=1}^{K1} p_{C1|C2}(i|k) · N(µ_x2^(ik), Σ_x2^(ik)), (15)

where p_{C2|C1}(k|i) = p_{C1,C2}(i, k)/p_{C1}(i) and p_{C1|C2}(i|k) = p_{C1,C2}(i, k)/p_{C2}(k) are the conditional pmfs of C₂ and C₁, respectively. Therefore, our model naturally subsumes the standard GMM models used in the literature to deliver state-of-the-art results in reconstruction and classification problems, hyperspectral imaging and digit recognition applications [6].
We also adopt a framework that allows common and innovation components in the representation of x₁ and x₂ conditioned on (C₁, C₂) = (i, k), generalizing the one in [50], [51]. In particular, note that (9) is equivalent to expressing x₁ and x₂ conditioned on the pair of classes (i, k) as

x₁ = µ_x1^(ik) + P_c1^(ik) z_c + P_1^(ik) z_1 (17)

and

x₂ = µ_x2^(ik) + P_c2^(ik) z_c + P_2^(ik) z_2, (18)

for an appropriate choice of the matrices P_c1^(ik), P_c2^(ik), P_1^(ik) and P_2^(ik), where the standard Gaussian vectors z_c, z_1 and z_2 are independent. Note that (17) and (18) express x₁ and x₂ as translated linear combinations of the columns of these matrices, so the model may be viewed from the perspective of generalizing previous union-of-subspaces models [32]-[35].
In our scenario, the covariance matrix of x₁ and x₂ conditioned on the pair of classes (i, k) can also be written as

Σ_x^(ik) = P^(ik) (P^(ik))ᵀ, (19)

where P^(ik) collects the matrices in (17) and (18) as

P^(ik) = [P_c1^(ik), P_1^(ik), 0; P_c2^(ik), 0, P_2^(ik)]. (20)

We refer to the vectors x_c1 ∼ N(0, P_c1^(ik)(P_c1^(ik))ᵀ) and x_c2 ∼ N(0, P_c2^(ik)(P_c2^(ik))ᵀ) as common components: they are obtained as linear combinations of the columns of P_c1^(ik) (P_c2^(ik), respectively) with the same weights, contained in the vector z_c, and can therefore be seen to model underlying phenomena common to both x₁ and x₂ (conditioned on the classes). On the other hand, we refer to the vectors x₁' ∼ N(0, P_1^(ik)(P_1^(ik))ᵀ) and x₂' ∼ N(0, P_2^(ik)(P_2^(ik))ᵀ) as innovation components: these components are statistically independent and can thus be seen to model phenomena specific to x₁ and x₂ (conditioned on the classes).

Two remarks are in order. First, the common and innovation component representation proposed here is redundant, i.e., there are various choices of matrices P^(ik) that satisfy (20); the results obtained in the following analysis hold irrespective of the particular choice. Although this representation is not required to prove the results contained in this work, we leverage it in order to give a clear interpretation of the interaction between x₁ and x₂ and to underline the connection of our work with previous results in the literature. Second, the representation in (17) and (18) is reminiscent of the joint sparsity models JSM-1 and JSM-3 in [50], where signals sensed by multiple sensors were also described in terms of the sum of a common component plus innovation components. However, fundamental differences characterize our formulation with respect to such models: i) we consider a Bayesian framework in which the input signal and side information signal are picked from a mixture of components, where each component is described by a Gaussian distribution, whereas in JSM-1 and JSM-3 all the components are deterministic; ii) in our model, the common components are correlated, but they are not exactly the same for x₁ and x₂, as is instead the case for signals in JSM-1 and JSM-3; iii) in our case, the common and innovation components can be sparse over four different bases, corresponding to the ranges of the matrices P_c1^(ik), P_c2^(ik), P_1^(ik) and P_2^(ik); on the other hand, all signals in JSM-1 and JSM-3 are assumed to be sparse over the same basis.

Therefore, we can now express the ranks of the matrices appearing in the model in (10) as functions of the ranks of the matrices appearing in the models in (17) and (18) as follows:

r_x1^(ik) = rank(Σ_x1^(ik)) = rank([P_c1^(ik), P_1^(ik)]), (21)

which represents the dimension of the subspace spanned by input signals x₁ drawn from the Gaussian distribution corresponding to the indices C₁ = i, C₂ = k;

r_x2^(ik) = rank(Σ_x2^(ik)) = rank([P_c2^(ik), P_2^(ik)]), (22)

which represents the dimension of the subspace spanned by side information signals x₂ drawn from the Gaussian distribution corresponding to the indices C₁ = i, C₂ = k;

r_x1^(ik,jℓ) = rank(Σ_x1^(ik) + Σ_x1^(jℓ)), (23)

which represents the dimension of the sum of the subspaces spanned by input signals drawn from the Gaussian distribution corresponding to the indices C₁ = i, C₂ = k and from that corresponding to the indices C₁ = j, C₂ = ℓ;

r_x2^(ik,jℓ) = rank(Σ_x2^(ik) + Σ_x2^(jℓ)), (24)

which represents the dimension of the sum of the subspaces spanned by side information signals drawn from the Gaussian distribution corresponding to the indices C₁ = i, C₂ = k and from that corresponding to the indices C₁ = j, C₂ = ℓ; and, finally, the corresponding dimensions spanned collectively by input and side information signals, given by

r_x^(ik) = rank(Σ_x^(ik)) and r_x^(ik,jℓ) = rank(Σ_x^(ik) + Σ_x^(jℓ)), (25), (26)

where we have introduced the compact notation P_c1^(ik,jℓ) = [P_c1^(ik), P_c1^(jℓ)] (and analogously for the remaining matrices).
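To make the rank bookkeeping above concrete, the following sketch builds one class-conditional covariance from randomly drawn common/innovation factors (hypothetical sizes, the same ones later used for the synthetic experiment in Section VI-B) and evaluates the dimensions in (21), (22) and (25) numerically.

    import numpy as np

    rng = np.random.default_rng(1)
    n1, n2 = 5, 4

    # Hypothetical common/innovation factors for one class pair (i, k), cf. (17)-(18).
    Pc1 = rng.standard_normal((n1, 2))   # common-component factor for x1
    Pc2 = rng.standard_normal((n2, 2))   # common-component factor for x2 (same weights z_c)
    P1 = rng.standard_normal((n1, 1))    # innovation factor for x1
    P2 = rng.standard_normal((n2, 1))    # innovation factor for x2

    # Stacked factor P^(ik) such that Sigma_x^(ik) = P P^T, cf. (19)-(20).
    P = np.block([[Pc1, P1, np.zeros((n1, 1))],
                  [Pc2, np.zeros((n2, 1)), P2]])
    Sigma_x = P @ P.T
    Sigma_x1, Sigma_x2 = Sigma_x[:n1, :n1], Sigma_x[n1:, n1:]

    # Ranks (21), (22), (25): dimensions of the subspaces spanned by x1, x2 and x.
    r_x1 = np.linalg.matrix_rank(Sigma_x1)   # -> 3 (2 common + 1 innovation)
    r_x2 = np.linalg.matrix_rank(Sigma_x2)   # -> 3
    r_x = np.linalg.matrix_rank(Sigma_x)     # -> 4 = dim(z_c) + dim(z_1) + dim(z_2)
    print(r_x1, r_x2, r_x)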
We also define the ranks

r_Φ^(ik) = rank(Φ Σ_x^(ik) Φᵀ), (27)

which represents the dimension of the subspace in R^{m1+m2} spanned collectively by the projections of input signals and the projections of side information signals drawn from the Gaussian distribution identified by the component indices C₁ = i, C₂ = k, and

r_Φ^(ik,jℓ) = rank(Φ (Σ_x^(ik) + Σ_x^(jℓ)) Φᵀ), (28)

which represents the dimension of the subspace obtained by summing the subspace in R^{m1+m2} spanned collectively by the projections of input signals and the projections of side information signals drawn from the Gaussian distribution identified by the component indices C₁ = i, C₂ = k with the corresponding subspace associated with the component indices C₁ = j, C₂ = ℓ. The quantities in (21)-(28), which provide a concise description of the geometry of the input source, the geometry of the side information source, and the geometry of the interaction of such sources with the projection kernels, will be fundamental in determining the performance of the classification and reconstruction of high-dimensional signals from low-dimensional features in the presence of side information.

III. CLASSIFICATION WITH SIDE INFORMATION
We first consider signal classification in the presence of side information. The basis of the analysis is an asymptotic characterization, in the limit σ² → 0, of the behavior of an upper bound to the misclassification probability associated with the optimal MAP classifier (rather than of the exact misclassification probability, which is not tractable).
In particular, for a two-class problem (the number of classes corresponding to the side information signal, K₂, can be arbitrary), i.e., when K₁ = 2, the Bhattacharyya bound [1] yields the upper bound

P_err ≤ √(p_{C1}(1) p_{C1}(2)) ∫ √(p(y₁, y₂|C₁ = 1) · p(y₁, y₂|C₁ = 2)) dy₁ dy₂. (29)

For a multiple class problem, the Bhattacharyya bound in conjunction with the union bound yields

P_err ≤ Σ_{i=1}^{K1} Σ_{j>i} √(p_{C1}(i) p_{C1}(j)) ∫ √(p(y₁, y₂|C₁ = i) · p(y₁, y₂|C₁ = j)) dy₁ dy₂ ≜ P̄_err. (31)

The asymptotic characterization that we discuss below, akin to that in [67], is based on two key metrics. The first metric determines whether the upper bound to the misclassification probability approaches zero or instead exhibits an error floor in the low-noise regime, i.e., the behavior of

lim_{σ²→0} P̄_err(σ²). (32)

The second metric offers a more refined description of the behavior of the upper bound to the misclassification probability by considering the slope at which log P̄_err decays (in a log σ² scale) in the low-noise regime. This value is named the diversity-order and is given by

d = lim_{σ²→0} log P̄_err(σ²) / log σ². (33)

Note also that the diversity-order associated with the upper bound to the error probability represents a lower bound on (the absolute value of) the slope of the true error probability in the low-noise regime.
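These two metrics can also be probed numerically. The sketch below evaluates the two-class Bhattacharyya bound (29) for zero-mean Gaussian classes observed through a common random kernel, and estimates the slope of log P̄_err against log σ²; the class covariances and the kernel are randomly drawn assumptions made for illustration, and the resulting slope approaches the rank-based value obtained from the analysis of Section III-A.

    import numpy as np

    def bhattacharyya_bound(S1, S2, p1=0.5, p2=0.5):
        """Two-class bound sqrt(p1*p2)*exp(-K) for zero-mean Gaussian
        observations with covariances S1 and S2, cf. (29)."""
        (_, ld1), (_, ld2), (_, ldM) = map(np.linalg.slogdet,
                                           (S1, S2, (S1 + S2) / 2.0))
        K = 0.5 * ldM - 0.25 * (ld1 + ld2)   # Bhattacharyya distance, zero means
        return np.sqrt(p1 * p2) * np.exp(-K)

    rng = np.random.default_rng(2)
    n, m = 10, 6
    Phi = rng.standard_normal((m, n))
    U1, U2 = rng.standard_normal((n, 3)), rng.standard_normal((n, 3))
    S1, S2 = U1 @ U1.T, U2 @ U2.T        # two rank-3 classes with distinct images

    sig2 = np.logspace(-2, -7, 6)
    pb = [bhattacharyya_bound(Phi @ S1 @ Phi.T + s * np.eye(m),
                              Phi @ S2 @ Phi.T + s * np.eye(m)) for s in sig2]
    # Diversity-order (33) as the slope of log(bound) vs log(sigma^2); here the
    # projected ranks are 3, 3 and 6, so the slope tends to (2*6 - 3 - 3)/4 = 1.5.
    print(np.diff(np.log(pb)) / np.diff(np.log(sig2)))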
We next characterize these quantities as a function of the number of features/measurements m₁ and m₂ and as a function of the underlying geometry of the signal and the side information, both for zero-mean classes (signal living in a union of linear subspaces) and for nonzero-mean ones (signal living in a union of affine spaces). We also characterize the quantities in (32) and (33) in terms of the diversity-order associated with the binary classification of Gaussian signals corresponding to the pairs of component indices (i, k) and (j, ℓ) from the observation of the noisy linear features y in (3). Moreover, all the pairs of indices (i, k) such that p_{C1,C2}(i, k) = 0 clearly do not affect the diversity-order associated with classification with side information. Therefore, we can define the set of index pairs of interest as

S = {(i, k) : p_{C1,C2}(i, k) > 0}, (34)

and we also define the set of index quadruples

S_SIC = {(i, k, j, ℓ) : (i, k) ∈ S, (j, ℓ) ∈ S, i ≠ j}, (35)

which plays a key role in the computation of the diversity-order associated with classification with side information.

A. Zero-Mean Classes
We now provide a low-noise expansion of the upper bound to the misclassification probability associated with the system with side information in (1) and (2), when assuming that the signals involved are all zero-mean, i.e., µ_x^(ik) = 0, ∀(i, k) ∈ S.

Theorem 1: Consider the model in (1) and (2), where the input signal x₁ is drawn according to the class-conditioned distribution (13), the side information x₂ is drawn according to the class-conditioned distribution (15), and the class-conditioned joint distribution of x₁ and x₂ is given by (9) with µ_x^(ik) = 0, ∀(i, k) ∈ S. Then, with probability 1, in the low-noise regime, i.e., when σ² → 0, the upper bound to the misclassification probability (31) can be expanded as

P̄_err(σ²) = A · (σ²)^d + o((σ²)^d), (37)

for a fixed constant A > 0, where

d = min_{(i,k,j,ℓ) ∈ S_SIC} d(ik, jℓ) (38)

and

d(ik, jℓ) = (2 r^(ik,jℓ) − r^(ik) − r^(jℓ)) / 4, (39)

where r^(ik,jℓ), given in (41), and r^(ik), given in (43), express with probability 1 the ranks in (28) and (27) as functions of the numbers of features m₁ and m₂ and of the geometry of the sources, and r^(jℓ) is obtained as r^(ik) upon replacing the indices (i, k) by (j, ℓ).
Proof: See Appendix A.
Theorem 1 provides a complete characterization of the slope of the upper bound to the misclassification probability for the case of zero-mean classes, in terms of the number of features and the geometrical description of the sources.
In particular, observe that:
• The diversity-order d associated with the estimation of the component index C₁ from noisy linear features with side information is given by the worst-case diversity-order term d(ik, jℓ) associated with the pair-wise classification problems for which the indices corresponding to C₁ are not the same (i ≠ j).
• The diversity-order in (38), which depends on the pairwise diversity-order in (39), can also be seen to depend on the difference between the dimension of the sum of the linear spaces collectively spanned by signals Φ₁x₁ and Φ₂x₂ drawn from the Gaussian distributions with indices (i, k) and (j, ℓ) and the dimensions of those spaces taken individually. This dependence in the presence of side information is akin to that in the absence of side information: the additional information, however, provides subspaces with increased dimensions over which it is possible to discriminate among signals belonging to different classes.
• The effect of the correlation between x₁ and x₂ is embodied in the rank expressions (41) and (43). In particular, we note that, in case x₁ and x₂ are conditionally independent given any pair of classes (C₁, C₂), i.e., p(x₁, x₂|C₁ = i, C₂ = k) = p(x₁|C₁ = i, C₂ = k) · p(x₂|C₁ = i, C₂ = k), so that Σ_x12^(ik) = 0, then the diversity-order is given by the sum of the diversity-order corresponding to the classification of x₁ from y₁ and that corresponding to the classification of x₂ from y₂. From a geometrical point of view, when x₁ and x₂ are conditionally independent, the linear spaces spanned by the side information offer new dimensions over which the decoder can discriminate among classes, which are completely decoupled from the dimensions corresponding to the linear spaces spanned by the realizations of x₁. Otherwise, when x₁ and x₂ are not conditionally independent, the diversity-order can in general be larger than, smaller than, or equal to the sum of the diversity-orders corresponding to the classification of x₁ from y₁ and of x₂ from y₂.

A direct consequence of the asymptotic characterization of the upper bound to the misclassification probability in (31) is access to conditions on the numbers of features m₁ and m₂ that are both necessary and sufficient to drive the upper bound to the misclassification probability to zero when σ² → 0, that is, to achieve the phase transition of the upper bound to the misclassification probability, and hence to a condition on the numbers of features m₁ and m₂ that is sufficient to drive the true misclassification probability to zero when σ² → 0.
Corollary 1: Consider the model in (1) and (2), where the input signal x₁ is drawn according to the class-conditioned distribution (13), the side information x₂ is drawn according to the class-conditioned distribution (15), and the class-conditioned joint distribution of x₁ and x₂ is given by (9) with µ_x^(ik) = 0, ∀(i, k) ∈ S. If there exists an index quadruple (i, k, j, ℓ) ∈ S_SIC such that r_x^(ik,jℓ) = r_x^(ik) = r_x^(jℓ), then d = 0 and the upper bound to the misclassification probability (31) exhibits an error floor in the low-noise regime. Otherwise, if r_x^(ik,jℓ) > max{r_x^(ik), r_x^(jℓ)}, ∀(i, k, j, ℓ) ∈ S_SIC, then, with probability 1, the upper bound to the misclassification probability (31) approaches zero when σ² → 0 if and only if the conditions (44)-(47), associated with the four cases discussed below, hold ∀(i, k, j, ℓ) ∈ S_SIC. Proof: See Appendix B.
The characterization of the numbers of features m₁ and m₂ that are both necessary and sufficient to achieve the phase transition of the upper bound to the misclassification probability is divided into 4 cases, depending on whether or not the range spaces Im(Σ_x1^(ik)) and Im(Σ_x1^(jℓ)), and the range spaces Im(Σ_x2^(ik)) and Im(Σ_x2^(jℓ)), are completely overlapping (Fig. 2 depicts, as shaded regions, the values of m₁ and m₂ that satisfy the conditions (44)-(47)). Values of m₁ and m₂ that guarantee the phase transition lie in the intersection of the regions corresponding to all index quadruples (i, k, j, ℓ) ∈ S_SIC.
In case 1), the range spaces associated with the input covariance matrices are all distinct, and by observing (44) we can clearly determine the beneficial effect of the correlation between x₁ and x₂ in guaranteeing the phase transition of the upper bound to the misclassification probability. Namely, we note that the phase transition is achieved either when error-free classification is possible from the observation of y₁ alone (m₁ > min{r_x1^(ik), r_x1^(jℓ)}), or from the observation of y₂ alone, or when m₁ and m₂ jointly satisfy the remaining condition in (44), which shows the benefit of side information in obtaining the phase transition with a lower number of features. In fact, when r_x^(ik,jℓ) < r_x1^(ik,jℓ) + r_x2^(ik,jℓ), joint classification based on y₁ and y₂ leads to a clear advantage in the number of features needed to achieve the phase transition with respect to the case in which classification is carried out independently from y₁ and from y₂, despite the fact that linear features are extracted independently from x₁ and x₂.
In case 2), the range spaces associated with the input covariance matrices are such that Im(Σ_x1^(ik)) = Im(Σ_x1^(jℓ)) and Im(Σ_x2^(ik)) = Im(Σ_x2^(jℓ)), so that classification based on the observation of y₁ or y₂ alone yields an error floor in the upper bound to the misclassification probability [67]. In other terms, input signals and side information signals from classes (i, k) and (j, ℓ) are never perfectly distinguishable when observed individually. In this case, the impact of the correlation between the input signal and the side information signal is clear when observing (45). In fact, when combining features extracted independently from the vectors x₁ and x₂, it is possible to drive the misclassification probability to zero in the low-noise regime, provided that the numbers of extracted features m₁ and m₂ verify the conditions in (45).
Finally, cases 3) and 4) represent intermediate scenarios in which the range spaces associated with x₁ are distinct but those associated with x₂ are completely overlapping, and vice versa. We note then how the necessary and sufficient conditions for the phase transition in (46) and (47) are given by combinations of the conditions in (44) and (45).
We further note in passing that the conditions in (45) are reminiscent of the conditions on compression rates for lossless joint source coding in [40].

B. Nonzero-Mean Classes
We now provide a low-noise expansion of the upper bound to the misclassification probability associated with the feature extraction system with side information in (1) and (2), for the case of nonzero-mean classes, i.e., µ_x^(ik) ≠ 0 for some (i, k) ∈ S. The presence of nonzero-mean classes, as already noted in [67, Theorem 3] for compressive classification without side information, offers a unique characteristic: the misclassification probability can decay exponentially with 1/σ² (i.e., the diversity-order tends to infinity) under certain conditions on the number of linear features extracted and on the geometrical description of the source.
Theorem 2: Consider the model in (1) and (2), where the input signal x₁ is drawn according to the class-conditioned distribution (13), the side information x₂ is drawn according to the class-conditioned distribution (15), and the class-conditioned joint distribution of x₁ and x₂ is given by (9).

If µ_x^(ik) − µ_x^(jℓ) ∉ Im(Σ_x^(ik) + Σ_x^(jℓ)), ∀(i, k, j, ℓ) ∈ S_SIC, then, with probability 1, in the low-noise regime, i.e., when σ² → 0, the upper bound to the misclassification probability for classification with side information (31) can be expanded as

P̄_err(σ²) = B · e^(−C/σ²) · (1 + o(1)), (48)

for fixed constants B, C > 0, if and only if the conditions (49)-(52) hold ∀(i, k, j, ℓ) ∈ S_SIC. Otherwise, denote by S̄ the set of quadruples (i, k, j, ℓ) ∈ S_SIC for which either µ_x^(ik) − µ_x^(jℓ) ∈ Im(Σ_x^(ik) + Σ_x^(jℓ)) or the conditions (49)-(52) do not hold. Then, with probability 1, in the low-noise regime, i.e., when σ² → 0, the upper bound to the misclassification probability for classification with side information (31) can be expanded as

P̄_err(σ²) = A · (σ²)^d + o((σ²)^d), with d = min_{(i,k,j,ℓ) ∈ S̄} d(ik, jℓ), (53)

for a fixed constant A > 0, where d(ik, jℓ) is obtained as in Theorem 1.
Proof: See Appendix C.
Note that classification based on the joint observation of y₁ and y₂ can guarantee infinite diversity-order even when classification based on y₁ or y₂ alone cannot. In particular, if there exists an index quadruple for which both µ_x1^(ik) − µ_x1^(jℓ) ∈ Im(Σ_x1^(ik) + Σ_x1^(jℓ)) and µ_x2^(ik) − µ_x2^(jℓ) ∈ Im(Σ_x2^(ik) + Σ_x2^(jℓ)), then, irrespective of the numbers of features m₁ and m₂ and of the specific values of the projection kernels Φ₁ and Φ₂, the conditions in [67, Theorem 3] are not verified, thus implying that the upper bounds to the error probability associated with classification based on y₁ alone or on y₂ alone do not decay exponentially with 1/σ². However, if µ_x^(ik) − µ_x^(jℓ) ∉ Im(Σ_x^(ik) + Σ_x^(jℓ)) for all quadruples in S_SIC, then classification based on both y₁ and y₂ is characterized by an exponential decay of the upper bound to the misclassification probability, provided that the conditions (49)-(52) are satisfied.

IV. RECONSTRUCTION WITH SIDE INFORMATION
We now consider signal reconstruction in the presence of side information. We are interested in the asymptotic characterization of the minimum mean-squared error (MMSE) incurred in reconstructing x₁ from the observation of the signal features y₁ and the side information features y₂, given by

MMSE_{1|1,2}(σ²) = E[‖x₁ − x̂₁(y₁, y₂)‖²], (56)

where we emphasize that MMSE_{1|1,2}(σ²) is a function of σ², and where x̂₁(y₁, y₂) is the conditional mean estimator in (8). In particular, we are interested in determining conditions on the numbers of linear features m₁ and m₂ that guarantee perfect reconstruction in the low-noise regime, i.e.,

lim_{σ²→0} MMSE_{1|1,2}(σ²) = 0, (57)

thus generalizing the results in [64] to the case when side information is available at the decoder; the misclassification results will be key to addressing this problem.

A. Preliminaries: Gaussian Sources
We first consider the simplified case in which K₁ = K₂ = 1, i.e., when the signals x₁ and x₂ obey the joint Gaussian distribution N(µ_x, Σ_x), with

µ_x = [µ_x1ᵀ, µ_x2ᵀ]ᵀ and Σ_x = [Σ_x1, Σ_x12; Σ_x12ᵀ, Σ_x2], (58)

and with ranks r_x1 = rank(Σ_x1), r_x2 = rank(Σ_x2) and r_x = rank(Σ_x).
For this case, the conditional mean estimator is given by [70]

x̂₁(y₁, y₂) = µ_x1 + W (y − Φµ_x), (59)

where

W = Σ_x̃ Φᵀ (Φ Σ_x Φᵀ + σ²I)⁻¹, (60)

and Σ_x̃ ≜ [Σ_x1, Σ_x12] denotes the cross-covariance matrix between x₁ and x. Moreover, the MMSE in this case can be expressed as

MMSE_G(σ²) = tr(Σ_x1 − W Φ (Σ_x̃)ᵀ). (61)

In the following, we provide necessary and sufficient conditions on the numbers of features m₁, m₂ that guarantee that, in the low-noise regime, the reconstruction MMSE for Gaussian sources approaches zero. Sufficient conditions are based on the analysis of two different upper bounds to MMSE_G(σ²). The first upper bound is obtained by considering the MMSE associated with the reconstruction of the signal x₁ from the observation of y₁ alone, i.e., without side information, which is denoted by MMSE_{1|1}(σ²), where the underlying estimator is x̂₁(y₁) = E[x₁|y₁] and whose behavior in the low-noise regime has been analyzed in [64].
The second upper bound is obtained by considering the MMSE associated with the distributed reconstruction problem, i.e., the joint recovery of x₁ and x₂ from the observation of both y₁ and y₂ (i.e., the reconstruction of x from y), which is denoted by MMSE_{1,2|1,2}(σ²), where the underlying estimator is the conditional mean

x̂(y) = E[x|y]. (64)

Note that the analysis of this second upper bound cannot be directly performed on the basis of the results in [64], due to the particular block-diagonal structure of Φ.
Based on the properties of the MMSE [70], it is straightforward to verify that MMSE_{1|1,2}(σ²) ≤ MMSE_{1|1}(σ²) and MMSE_{1|1,2}(σ²) ≤ MMSE_{1,2|1,2}(σ²). On the other hand, necessary conditions are derived from the analysis of a lower bound to the MMSE, obtained by feeding the decoder not only with the noisy features y₁ and y₂, but also with the values of the realizations of the noise vectors w₁ and w₂. The following theorem stems from the fact that the resulting necessary and sufficient conditions for the phase transition of the MMSE coincide.

Theorem 3: Consider the model in (1) and (2), where the signals x₁ and x₂ are jointly Gaussian, with mean and covariance matrix as in (58), and with r_x1 = rank(Σ_x1), r_x2 = rank(Σ_x2) and r_x = rank(Σ_x). Then, with probability 1, lim_{σ²→0} MMSE_G(σ²) = 0 if and only if

m₁ ≥ r_x − r_x2 and m₁ + m₂ ≥ r_x. (65)

Proof: See Appendix D.

Without side information, it is known that m₁ ≥ r_x1 represents a necessary and sufficient condition on the number of features needed to drive the MMSE to zero in the low-noise regime [64]. With side information, it is possible to reliably recover the input signal x₁ with a lower number of features, as described by the conditions in (65). In fact, whenever r_x < r_x1 + r_x2, it is possible to perfectly reconstruct x₁ when σ² → 0 even with fewer than r_x1 features, provided that m₁ + m₂ ≥ r_x. This happens when the dimension of the overall space spanned by the projected signals obtained by concatenating the input signal and the side information signal, i.e., Φx, is greater than or equal to the dimension of the space spanned by x in the signal domain. Moreover, the m₁ features extracted from x₁ need to be enough to span a space of dimension equal to the difference between the dimension of the space spanned by x and that spanned by x₂ alone. In this sense, the linear projections extracted from the input signal must be enough to capture signal features that are characteristic of x₁ and are not "shared" with x₂, meaning that they are not correlated.
The values of m 1 and m 2 that satisfy the necessary and sufficient conditions (65) are reported in Fig. 3.
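A quick numerical check of Theorem 3 is sketched below: the joint covariance is built from hypothetical common/innovation factors with r_x1 = 3, r_x2 = 3 and r_x = 4, so that (65) asks for m₁ ≥ 1 and m₁ + m₂ ≥ 4; feature pairs that satisfy or violate (65) then yield a vanishing or floored MMSE, respectively.

    import numpy as np

    def mmse_x1_from_y(Sigma_x, n1, Phi, sigma2):
        """MMSE of x1 from y = Phi x + w in the zero-mean jointly Gaussian
        case, i.e., the trace expression (61) with C = Cov(x1, x)."""
        C = Sigma_x[:n1, :]
        G = Phi @ Sigma_x @ Phi.T + sigma2 * np.eye(Phi.shape[0])
        return np.trace(Sigma_x[:n1, :n1] - C @ Phi.T @ np.linalg.solve(G, Phi @ C.T))

    rng = np.random.default_rng(3)
    n1, n2 = 5, 4
    Pc1, Pc2 = rng.standard_normal((n1, 2)), rng.standard_normal((n2, 2))
    P1, P2 = rng.standard_normal((n1, 1)), rng.standard_normal((n2, 1))
    P = np.block([[Pc1, P1, np.zeros((n1, 1))],
                  [Pc2, np.zeros((n2, 1)), P2]])
    Sigma_x = P @ P.T                       # r_x = 4, r_x2 = 3, so r_x - r_x2 = 1

    for m1, m2 in [(1, 3), (2, 2), (1, 2)]:   # first two satisfy (65), last does not
        Phi = np.block([[rng.standard_normal((m1, n1)), np.zeros((m1, n2))],
                        [np.zeros((m2, n1)), rng.standard_normal((m2, n2))]])
        print(m1, m2, mmse_x1_from_y(Sigma_x, n1, Phi, 1e-10))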

B. GMM Sources
We now consider the more general case where the signals x 1 and x 2 follow the models in Section II-A. It is possible to express the conditional mean estimator in closed form, but not the MMSE, which we denote by MMSE GM 1|1,2 (σ 2 ). Therefore, we will determine necessary and sufficient conditions on the numbers of features m 1 and m 2 that guarantee MMSE GM 1|1,2 (σ 2 ) → 0 in the low-noise regime, by leveraging the result in Theorem 3 together with steps akin to those in [64, Section IV].

1) Sufficient Conditions:
In order to provide sufficient conditions for the MMSE to approach zero in the low-noise regime, we analyze the upper bound to the MMSE corresponding to the mean squared error (MSE) associated with a (sub-optimal) classify and reconstruct decoder, which we denote by MSE_CR(σ²). This decoder operates in two steps as follows:
• First, the decoder estimates the pair of class indices associated with the input signal and the side information signal via the MAP classifier

(Ĉ₁, Ĉ₂) = arg max_{(i,k)} p(C₁ = i, C₂ = k|y₁, y₂). (66)

• Second, in view of the fact that, conditioned on (C₁, C₂) = (i, k), the vectors x₁ and x₂ are jointly Gaussian distributed with mean µ_x^(ik) and covariance Σ_x^(ik), the decoder reconstructs the input signal x₁ by using the conditional mean estimator corresponding to the estimated classes Ĉ₁, Ĉ₂, i.e., the estimator (59)-(60) evaluated with the moments µ_x^(Ĉ₁Ĉ₂) and Σ_x^(Ĉ₁Ĉ₂).

The optimality of the MMSE estimator immediately implies that MMSE^GM_{1|1,2}(σ²) ≤ MSE_CR(σ²). Therefore, we can immediately leverage the analysis of the misclassification probability carried out in Section III and the result in Theorem 3 in order to characterize the behavior of MSE_CR(σ²) in the low-noise regime, and hence determine sufficient conditions for the phase transition of MMSE^GM_{1|1,2}(σ²).

Theorem 4: Consider the model in (1) and (2). Assume that the input signal x₁ is drawn according to the class-conditioned distribution (13), x₂ is drawn according to the class-conditioned distribution (15) and the class-conditioned joint distribution of x₁ and x₂ is given by (9). Then, with probability 1, lim_{σ²→0} MMSE^GM_{1|1,2}(σ²) = 0 if, ∀(i, k) ∈ S,

m₁ ≥ r_x^(ik) − r_x2^(ik) + 1 and m₁ + m₂ ≥ r_x^(ik) + 1. (69)

Proof: See Appendix E.
The sufficient conditions in (69) show that, akin to the Gaussian case, the numbers of features extracted from x₁ and x₂ have to be collectively greater than the largest among the dimensions of the spaces spanned by the signals x drawn from the individual Gaussian components within the GMM. Appendix E shows that the conditions in (69) guarantee that the decoder can reliably estimate the class indices (C₁, C₂) and hence reliably reconstruct the signal x₁ in the low-noise regime.
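A sketch of the classify and reconstruct decoder is given below: it chains the MAP step (66) with the class-conditional Wiener estimate, and by construction its MSE upper-bounds the MMSE. The (K₁, K₂) array p_c and the nested containers for the class moments are illustrative assumptions.

    import numpy as np
    from scipy.stats import multivariate_normal

    def classify_and_reconstruct(y, Phi, p_c, mu, Sigma, sigma2, n1):
        """Two-step decoder: MAP class-pair estimate (66), then the Wiener
        filter of the estimated class applied to y (cf. Section IV-B)."""
        m = Phi.shape[0]
        best, best_score = None, -np.inf
        for (i, k), p in np.ndenumerate(p_c):
            if p == 0:
                continue
            score = p * multivariate_normal.pdf(
                y, Phi @ mu[i][k], Phi @ Sigma[i][k] @ Phi.T + sigma2 * np.eye(m))
            if score > best_score:
                best, best_score = (i, k), score
        i, k = best
        # Conditional mean estimate of x1 under the estimated Gaussian class pair.
        C = Sigma[i][k][:n1, :]                      # Cov(x1, x) under class (i, k)
        G = Phi @ Sigma[i][k] @ Phi.T + sigma2 * np.eye(m)
        x1_hat = mu[i][k][:n1] + C @ Phi.T @ np.linalg.solve(G, y - Phi @ mu[i][k])
        return x1_hat, (i, k)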
2) Necessary conditions: We now derive necessary conditions for the phase transition of the MMSE of GMM sources with side information. We obtain such conditions from the analysis of a lower bound to the MMSE, which follows by observing that

MMSE^GM_{1|1,2}(σ²) = Σ_{(i,k)∈S} p_{C1,C2}(i, k) · E[‖x₁ − x̂₁(y₁, y₂)‖² | C₁ = i, C₂ = k] (71)
≥ Σ_{(i,k)∈S} p_{C1,C2}(i, k) · MMSE^(ik)_{1|1,2}(σ²) ≜ MSE_LB(σ²), (72)

where MMSE^(ik)_{1|1,2}(σ²) denotes the MMSE associated with the reconstruction of the Gaussian signal x₁ corresponding to the class indices (i, k) from the observation of the vector y₁ and the side information y₂. Note that the equality in (71) is obtained via the total probability formula and the inequality in (72) is a consequence of the optimality of the MMSE estimator for jointly Gaussian input and side information signals.

The analysis of MSE_LB(σ²) leads to the following necessary conditions on the numbers of features m₁ and m₂ needed to drive MMSE^GM_{1|1,2}(σ²) to zero when σ² → 0.

Theorem 5: Consider the model in (1) and (2). Assume that the input signal x₁ is drawn according to the class-conditioned distribution (13), x₂ is drawn according to the class-conditioned distribution (15) and the class-conditioned joint distribution of x₁ and x₂ is given by (9). Then, with probability 1, lim_{σ²→0} MMSE^GM_{1|1,2}(σ²) = 0 only if, ∀(i, k) ∈ S,

m₁ ≥ r_x^(ik) − r_x2^(ik) and m₁ + m₂ ≥ r_x^(ik). (73)

Proof: The proof is based on the result in Theorem 3, which implies that, if MMSE^(ik)_{1|1,2}(σ²) → 0 when σ² → 0, ∀(i, k) ∈ S, then, with probability 1, the conditions on the numbers of features m₁ and m₂ in (73) must be satisfied for all (i, k) ∈ S.
It is interesting to note that the necessary conditions for the phase transition of the MMSE of GMM inputs are one feature away from the corresponding sufficient conditions, akin to our previous results for the case without side information [64]. In this way, Theorems 4 and 5 provide a sharp characterization of the region associated to the phase transition of the MMSE of GMM inputs with side information.

V. DISTRIBUTED CLASSIFICATION AND RECONSTRUCTION
The problems of classification and reconstruction with side information are intimately related to the problems of distributed classification and reconstruction, and our tools immediately generalize to provide insight into these problems.

A. Distributed Classification
The problem of distributed classification involves the estimation of the pair of classes (C₁, C₂) from the low-dimensional feature vectors y₁ and y₂, i.e.,

(Ĉ₁, Ĉ₂) = arg max_{(i,k)} p(C₁ = i, C₂ = k|y₁, y₂). (74)

Via the Bhattacharyya bound and the union bound, it is possible to write an upper bound to the misclassification probability of distributed classification as

P_err ≤ Σ √(p_{C1,C2}(i, k) · p_{C1,C2}(j, ℓ)) ∫ √(p(y₁, y₂|C₁ = i, C₂ = k) · p(y₁, y₂|C₁ = j, C₂ = ℓ)) dy₁ dy₂ ≜ P̄_err, (75)

where the sum runs over all unordered couples of distinct index pairs in

S_DC = {(i, k, j, ℓ) : (i, k) ∈ S, (j, ℓ) ∈ S, (i, k) ≠ (j, ℓ)}, (76)

which is the set corresponding to all couples of class index pairs that differ in at least one of the values of C₁ or C₂. We can also exploit the previous methodology to construct an asymptotic characterization of the upper bound to the misclassification probability. We first consider the case of zero-mean classes, i.e., we assume µ_x^(ik) = 0, ∀(i, k) ∈ S.

Theorem 6: Consider the model in (1) and (2), where x₁ is drawn according to the class-conditioned distribution (13), x₂ is drawn according to the class-conditioned distribution (15), and the class-conditioned joint distribution of x₁ and x₂ is given by (9) with µ_x^(ik) = 0, ∀(i, k). Then, with probability 1, in the low-noise regime, i.e., when σ² → 0, the upper bound to the misclassification probability for distributed classification (75) can be expanded as

P̄_err(σ²) = A · (σ²)^d + o((σ²)^d), (77)

for a fixed constant A > 0, where

d = min_{(i,k,j,ℓ) ∈ S_DC} d(ik, jℓ), (78)

and where d(ik, jℓ), r^(ik,jℓ) and r^(ik) are as in (39), (41) and (43).

The main difference between the asymptotic expansion in Theorem 1 and the asymptotic expansion in this theorem relates to the computation of the diversity-order. In particular, the diversity-order for classification with side information is obtained as the worst-case diversity-order associated with the classification between Gaussian distributions identified by the couples of index pairs (i, k) and (j, ℓ), among all couples for which the indices corresponding to C₁ are not the same (i ≠ j), and among all possible choices of the indices corresponding to C₂. On the other hand, the diversity-order of distributed classification is dictated by the lowest diversity-order in classifying all possible pairs of Gaussian distributions, whether or not they are associated with different input classes C₁. This reflects the fact that a misclassification event occurs even when the estimate of C₁ is correct, that is, when Ĉ₁ = C₁ but Ĉ₂ ≠ C₂. In other terms, the diversity-order for the classification with side information problem is entirely determined by the inter-class diversity, which means the diversity among Gaussian distributions pertaining to different classes C₁. On the other hand, the performance in discriminating among all possible pairs (C₁, C₂), i.e., the distributed classification problem, depends also on the intra-class diversity. Therefore, we can also note that, as expected, the diversity-order associated with distributed classification is a lower bound to the diversity-order corresponding to classification with side information.
We can also leverage the previous analysis -together with the expansion in Theorem 1 -to determine conditions on the number of features that are both necessary and sufficient for the upper bound to the misclassification probability to approach zero and hence conditions on the number of features that are sufficient for the true misclassification probability to approach zero.
Corollary 2: Consider the model in (1) and (2), where x₁ is drawn according to the class-conditioned distribution (13), x₂ is drawn according to the class-conditioned distribution (15), and the class-conditioned joint distribution of x₁ and x₂ is given by (9) with µ_x^(ik) = 0, ∀(i, k). If there exists an index quadruple (i, k, j, ℓ) ∈ S_DC such that r_x^(ik,jℓ) = r_x^(ik) = r_x^(jℓ), then d = 0 and the upper bound to the misclassification probability for distributed classification (75) exhibits an error floor in the low-noise regime. Otherwise, with probability 1, the upper bound (75) approaches zero when σ² → 0 if and only if conditions akin to (44)-(47) hold for all (i, k, j, ℓ) ∈ S_DC.

B. Distributed Reconstruction
The problem of distributed reconstruction involves the estimation of the pair of vectors x₁ and x₂ from the low-dimensional feature vectors y₁ and y₂ via the conditional mean estimator in (64) [38]. We can immediately specialize the results in Theorem 3 to distributed reconstruction of Gaussian sources, thus providing necessary and sufficient conditions for the low-noise phase transition of MMSE^G_{1,2|1,2}(σ²).

Theorem 7: Consider the model in (1) and (2). Assume that the vectors x₁, x₂ are jointly Gaussian, with distribution N(µ_x, Σ_x), with mean and covariance matrix specified in (58), and with r_x1 = rank(Σ_x1), r_x2 = rank(Σ_x2) and r_x = rank(Σ_x). Then, with probability 1, lim_{σ²→0} MMSE^G_{1,2|1,2}(σ²) = 0 if and only if

m₁ ≥ r_x − r_x2, m₂ ≥ r_x − r_x1 and m₁ + m₂ ≥ r_x. (79)

Proof: The proof of this result is presented in Appendix D as part of the proof of Theorem 3. It is based on the low-noise expansion of the MMSE for Gaussian sources provided in [64] and on the rank characterization offered by Lemma 1.
We immediately observe that the conditions (79) mimic the Slepian-Wolf condition for joint source coding [40], where r x − r x2 is the counterpart of the conditional entropy of x 1 given x 2 , r x − r x1 is the counterpart of the conditional entropy of x 2 given x 1 , and r x corresponds to the joint entropy of x 1 and x 2 .
It is also interesting to note that these conditions (which apply to signals that are sparse over different bases) immediately specialize (for the two-sensor case) to the conditions in [51, Theorems 1 and 2], which pertain to signals that are sparse over the same basis, and provide necessary and sufficient conditions for the joint reconstruction of sparse signals when the common/innovation ensemble sparsity model is known at the decoder.
We can also specialize the results in Theorems 4 and 5 to distributed reconstruction of GMM sources, thus providing sufficient and necessary conditions for the low-noise phase transition of the MMSE, which is denoted in this case by MMSE^GM_{1,2|1,2}(σ²).

Theorem 8: Consider the model in (1) and (2). Assume that the input signal x₁ is drawn according to the class-conditioned distribution (13), x₂ is drawn according to the class-conditioned distribution (15) and the class-conditioned joint distribution of x₁ and x₂ is given by (9). Then, with probability 1, lim_{σ²→0} MMSE^GM_{1,2|1,2}(σ²) = 0 if, ∀(i, k) ∈ S,

m₁ ≥ r_x^(ik) − r_x2^(ik) + 1, m₂ ≥ r_x^(ik) − r_x1^(ik) + 1 and m₁ + m₂ ≥ r_x^(ik) + 1. (80)

Proof: We briefly outline the proof of this theorem, which follows steps similar to those in Appendix E. Also in this case, the derivation of sufficient conditions for the low-noise phase transition of the MMSE is based on the analysis of the MSE associated with a classify and reconstruct approach akin to that specified in Section IV-B, where the decoder first derives an estimate (Ĉ₁, Ĉ₂) of the pair of classes from which x₁ and x₂ are drawn and then uses the Wiener filter associated with the estimated classes in order to recover x from y.
Then, the analysis of such upper bound is carried out by leveraging the characterization of the phase transition of the upper bound to the misclassification probability given by Corollary 2 and the conditions for the phase transition of the MMSE for Gaussian sources in Theorem 7.
Theorem 9: Consider the model in (1) and (2). Assume that the input signal x₁ is drawn according to the class-conditioned distribution (13), x₂ is drawn according to the class-conditioned distribution (15) and the class-conditioned joint distribution of x₁ and x₂ is given by (9). Then, with probability 1, lim_{σ²→0} MMSE^GM_{1,2|1,2}(σ²) = 0 only if, ∀(i, k) ∈ S,

m₁ ≥ r_x^(ik) − r_x2^(ik), m₂ ≥ r_x^(ik) − r_x1^(ik) and m₁ + m₂ ≥ r_x^(ik). (81)

Proof: The proof is based on the analysis of a lower bound to the MMSE akin to that described in (71) and (72).

Note that conditions (80) imply that the conditions for reliable reconstruction of Gaussian inputs (79) must be met for all possible Gaussian distributions within the GMM, with a further gap of one feature.
We also note that conditions (80) can be interpreted as a generalization of the result in [51, Theorem 3] for the two-sensor scenario, encompassing also the case where the signals have different sizes. Namely, on considering the joint GMM prior (11) and its common and innovation component representation in (17) and (18), if each class (i, k) is chosen so that the corresponding P^(ik) in (19)-(20) can be mapped into a matrix of the common/innovation ensemble sparsity model in [51], then conditions (80) show that only one additional feature per signal is needed in order to reconstruct both x₁ and x₂ when their joint sparsity pattern is not known a priori.

VI. NUMERICAL RESULTS
We now report a series of numerical results, both with synthetic and real data, that cast further light on the role of side information to aid signal classification or reconstruction. Results with synthetic data aim to showcase how theory is able to predict the phase transition and the diversity-order of the true misclassification probability for classification problems, or to approximate well the phase transition of the MMSE.

A. Synthetic Data: Classification
We first present numerical results that showcase how the diversity-order predictions based on the upper bound (from Theorem 1) match well the behavior of the experimental misclassification probability.
We consider x₁ and x₂ with dimensions n₁ = 20 and n₂ = 12, respectively, with K₁ = K₂ = 2, so that the marginal pdf of each signal is given by a mixture of two GMMs, each consisting of two Gaussian classes. All Gaussian classes are assumed to be zero-mean, i.e., µ_x^(ik) = 0, ∀(i, k); the remaining parameters of the joint GMM are reported in Table I. The projection kernels Φ₁, Φ₂ are taken with i.i.d., zero-mean, Gaussian entries with fixed variance.
We compare the phase transition and the diversity-orders yielded by the Bhattacharyya-based upper bound (31) with the error probability obtained by numerical simulation. We report in Fig. 4(a) the experimental error probability and in Fig. 4(b) the upper bound P̄_err^U in (94) for the case in which no side information is available at the decoder (cf. [67]), i.e., m₂ = 0. In this case, the phase transition for the misclassification probability is obtained when m₁ > 7 [67], and we note how the analysis based on the upper bound reflects well the behavior of the true error probability both in terms of phase transition and diversity-order.

We now evaluate the impact of the side information y₂ on the classification of the input signal x₁. We consider the case in which the number of features representing the side information is m₂ = 4, for different values of m₁. In Fig. 5(a) we show the experimental error probability and in Fig. 5(b) the upper bound P̄_err^U in (94). We observe how the presence of side information can be leveraged in order to obtain the phase transition of the error probability with only m₁ > 5 features extracted from the input signal. In fact, when m₁ + m₂ > 9, the linear spaces spanned collectively by the projections of signals x₁ and x₂ drawn from different Gaussian components are not completely overlapping, since they are 9-dimensional spaces in R^{m1+m2}. Moreover, increasing the number of linear features extracted above this threshold leads to increased diversity-order values. Also in this case, we note how the behavior analytically predicted from the characterization of the Bhattacharyya-based upper bound matches well the true behavior of the actual error probability, both in terms of phase transition and diversity-order.

B. Synthetic Data: Reconstruction
We now aim to show how numerical results for the reconstruction of synthetic signals also align well with the analysis reported in Section IV, in particular as regards the characterization of the number of features needed to drive the MMSE to zero when σ² → 0. We start by considering the case in which x₁ and x₂ are described by a single joint Gaussian distribution. In particular, we set the signal sizes to n₁ = 5 and n₂ = 4, and we build the joint input covariance matrix using the common/innovation component representation in (17) and (18), where P_c1 ∈ R^{5×2}, P_c2 ∈ R^{4×2}, P_1 ∈ R^{5×1} and P_2 ∈ R^{4×1} have i.i.d., zero-mean, unit-variance Gaussian entries, thus obtaining r_x1 = 3, r_x2 = 3 and r_x = 4. We also assume that the projection kernels Φ₁ and Φ₂ have i.i.d., zero-mean, Gaussian entries with fixed variance.

We now consider signal reconstruction for GMM inputs. In particular, we assume that the vectors x₁ and x₂ are drawn from the joint GMM prior described in Section VI-A for the case of signal classification, and we again use projection kernels with i.i.d., zero-mean, Gaussian entries. Reconstruction is performed via the conditional mean estimator, which is now given by

x̂(y) = Σ_{(i,k)∈S} p_{C1,C2}(i, k|y) · x̂^(ik)(y), (87)

where x̂^(ik)(y) is the (Gaussian) conditional mean estimator of x associated with the class pair (i, k), the posterior class probabilities satisfy p_{C1,C2}(i, k|y) ∝ p_{C1,C2}(i, k) · N(y; Φµ_x^(ik), ΦΣ_x^(ik)Φᵀ + σ²I), and we have used the notation N(x; µ, Σ) to express explicitly the argument of the Gaussian distribution. Then, on marginalizing out x₂, we obtain

x̂₁(y) = Σ_{(i,k)∈S} p_{C1,C2}(i, k|y) · x̂₁^(ik)(y), (88)

and, as expected from the properties of the MMSE estimator [70], x̂₁(y) can also be obtained by retaining the first n₁ entries of the joint conditional mean estimator x̂(y) = E[x|y] [38].
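A direct implementation of (87)-(88) is sketched below: each class pair contributes a Wiener estimate weighted by its posterior probability, so the cost grows linearly with the number of Gaussian components, as noted in Section I-B; the container types are again illustrative assumptions.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_conditional_mean_x1(y, Phi, p_c, mu, Sigma, sigma2, n1):
        """Closed-form conditional mean estimate of x1 for a joint GMM prior,
        cf. (87)-(88): a posterior-weighted sum of per-class Wiener estimates."""
        m = Phi.shape[0]
        weights, estimates = [], []
        for (i, k), p in np.ndenumerate(p_c):
            if p == 0:
                continue
            G = Phi @ Sigma[i][k] @ Phi.T + sigma2 * np.eye(m)
            # Posterior class weight p(C1 = i, C2 = k | y), up to normalization.
            weights.append(p * multivariate_normal.pdf(y, Phi @ mu[i][k], G))
            C = Sigma[i][k][:n1, :]
            estimates.append(mu[i][k][:n1]
                             + C @ Phi.T @ np.linalg.solve(G, y - Phi @ mu[i][k]))
        w = np.asarray(weights) / np.sum(weights)
        return sum(wi * ei for wi, ei in zip(w, estimates))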

C. Experimental Results: Natural Imaging
In this section, we consider a two-dimensional image compressive sensing example. The input signal is represented by the image "Lena" with resolution 512 × 512, and the side information is given by a low-resolution version of the same subject, consisting of 128 × 128 pixels (see Fig. 8). Note also that, as a low-resolution version of the signal of interest, the side information can in this case be related to the scaling coefficients of the wavelet transform of the input signal. Both images are then partitioned into non-overlapping patches, so that the input vector $x_1$ represents 8 × 8 patches extracted from the 512 × 512 picture and $x_2$ represents 2 × 2 patches extracted from the low-resolution image. In this way, the input image and the side information are divided into the same number of patches, which represent the same spatial portion of the subject (see [5], [64] for a detailed description of the imaging setup).
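As an illustration of this patch pairing, the following sketch extracts co-located patch pairs; a simple 4x average-downscaling is assumed here as a stand-in for the actual low-resolution acquisition.

```python
import numpy as np

img = np.random.rand(512, 512)        # stand-in for the test image
# crude 128x128 proxy for the low-resolution side information (4x averaging)
side = img.reshape(128, 4, 128, 4).mean(axis=(1, 3))

def paired_patches(img, side, p1=8, p2=2):
    # each 8x8 patch of the image is paired with the co-located 2x2 patch
    # of the low-resolution image (same spatial portion of the subject)
    for i in range(0, img.shape[0], p1):
        for j in range(0, img.shape[1], p1):
            x1 = img[i:i + p1, j:j + p1].ravel()                 # n1 = 64
            x2 = side[i // 4:i // 4 + p2, j // 4:j // 4 + p2].ravel()  # n2 = 4
            yield x1, x2

x1, x2 = next(paired_patches(img, side))
print(x1.shape, x2.shape)   # (64,) (4,)
```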
We assume that the signals $x_1$ and $x_2$ are described by a joint GMM of the form introduced in Section II-A, with $K_1 = K_2 = 20$.
The class variables $C_1$ and $C_2$ are assumed to be perfectly correlated, so that $p_{C_1,C_2}(i,k) = 0$ if $i \neq k$. The parameters of the joint GMM, i.e., the prior probabilities and the mean and covariance matrices associated with each Gaussian distribution within the GMM, are learned via the expectation-maximization (EM) algorithm [72] from the "Caltech 101" dataset [71], which does not include the test image. The covariance matrices obtained with the training algorithm have full rank. Therefore, in order to fit the trained model to an exactly low-rank GMM, and to showcase further how theory aligns with practice, we modify the covariance matrices by retaining only the first 15 principal components and setting to zero the remaining 53 eigenvalues, thus obtaining exactly low-rank class covariances with $r_x^{(ik)} = 15$. We notice that this procedure does not introduce substantial distortion, as the resulting peak signal-to-noise ratio (PSNR) values for the test image and the side information image are equal to 34 dB and 46 dB, respectively. This is due to the fact that the eigenvalues of the trained covariance matrices decay exponentially fast, which implies that "almost low-rank" GMM priors model patches extracted from natural images well.
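The rank-truncation step can be sketched as follows; the covariance below is a random stand-in for an EM-learned 68 × 68 joint covariance (68 = 64 + 4, for 8 × 8 input patches and 2 × 2 side information patches).

```python
import numpy as np

def truncate_covariance(Sigma, rank=15):
    # eigh returns eigenvalues in ascending order for a symmetric matrix;
    # zero all but the 'rank' largest, then rebuild V diag(w) V^T
    w, V = np.linalg.eigh(Sigma)
    w[:-rank] = 0.0
    return (V * w) @ V.T

rng = np.random.default_rng(2)
A = rng.standard_normal((68, 68))
Sigma = A @ A.T / 68                     # stand-in for a learned covariance
Sigma_lr = truncate_covariance(Sigma)
print(np.linalg.matrix_rank(Sigma_lr))   # 15
```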
We consider the problem of reconstructing the test image "Lena" from noisy linear features, in the presence of the side information represented by the low-resolution image. Linear features are extracted from the input signal via the matrix $\Phi_1 \in \mathbb{R}^{m_1 \times n_1}$, which is generated with i.i.d., zero-mean, Gaussian entries with fixed variance. On the other hand, we assume that the side information image is not compressed, i.e., we set $\Phi_2 = I$. Reconstruction is performed via the closed-form conditional mean estimator (88). Fig. 9 reports the PSNR vs. $1/\sigma^2$ curves obtained for different numbers of linear features extracted from the test image patches. For comparison, we also show the corresponding curves for the case of reconstruction without side information [64]. We clearly observe that the phase transition for reconstruction obtained with natural images matches the predictions of the mathematical analysis developed in Section IV. In particular, when side information is not available at the decoder, the phase transition occurs when $m_1 > \max_{(i,k)} r^{(ik)}_{x_1} = 15$ [64]. On the other hand, the presence of side information allows for reliable reconstruction in the low-noise regime with a reduced number of features extracted from each patch. Namely, in the case under consideration, the conditions (69) are equivalent to $m_1 > 11$, which is shown to match well the PSNR behavior in Fig. 9(b).
As already observed in [64], we underline that the theory developed in Sections III and IV yields significant results only for low-rank GMMs, rather than "approximately low-rank" models. On the other hand, GMM priors obtained via learning algorithms applied to natural images are not exactly low-rank. In fact, input covariance matrices obtained with such methods are typically described as $\Sigma = \Sigma_0 + \varepsilon I$, where $\Sigma_0$ is low-rank and the term $\varepsilon I$ accounts for the model mismatch between the real data and their projection onto the principal components containing the majority of the information associated with the data. In this way, we can observe that the feature extraction model in (3) with full-rank covariance $\Sigma$ is equivalent to the same model with the exactly low-rank covariance $\Sigma_0$ and where the additive noise $w$ is substituted by additive noise with distribution $\mathcal{N}(0, \varepsilon\Phi\Phi^T + \sigma^2 I)$. Therefore, on noticing that the matrix $\Phi\Phi^T$ approximates the identity matrix well when the entries of $\Phi$ are i.i.d., zero-mean, Gaussian with variance $1/n$ (so that its rows are approximately orthonormal), we can conclude that the performance associated with "approximately low-rank" models for a given noise variance $\sigma^2$ is comparable to the performance of the corresponding exactly low-rank models for noise variance equal to $\sigma^2 + \varepsilon$.
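A quick numeric check of this equivalence, under the assumption that the entries of $\Phi$ have variance $1/n$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, r, eps, sigma2 = 64, 20, 15, 1e-3, 1e-4
P = rng.standard_normal((n, r))
Sigma0 = P @ P.T                                   # exactly low-rank part
Phi = rng.standard_normal((m, n)) / np.sqrt(n)     # entry variance 1/n

# measurement covariance under the approximately low-rank prior ...
cov_full = Phi @ (Sigma0 + eps * np.eye(n)) @ Phi.T + sigma2 * np.eye(m)
# ... equals the one under the low-rank prior with inflated noise
cov_equiv = Phi @ Sigma0 @ Phi.T + eps * Phi @ Phi.T + sigma2 * np.eye(m)
print(np.allclose(cov_full, cov_equiv))            # True: identical models
# deviation of Phi@Phi.T from I (moderate for n = 64; shrinks as n grows)
print(np.abs(Phi @ Phi.T - np.eye(m)).max())
```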

D. Experimental Results: Compressive Hyperspectral Imaging
Finally, we present an example showcasing how the proposed framework offers a principled approach for designing systems able to leverage side information effectively in reconstruction tasks. In this case, we do not observe phase transitions in the reconstruction error, due to the fact that the signal model describing the data is a full-rank model. However, we show how side information can still be used to improve reconstruction performance.
We consider a compressive hyperspectral imaging example, in which hyperspectral images of a subject are recovered from compressive measurements in the presence of side information. In particular, we consider measurements collected by the coded aperture snapshot spectral imager (CASSI) apparatus described in [73]. Side information is represented in this case by an RGB snapshot of the same scene, which can be easily obtained without requiring expensive hyperspectral imaging devices. The information contained in the RGB image is expected to improve the reconstruction quality of the input signal, also due to the fact that, in contrast to the measurements taken by the CASSI camera, the RGB image is not affected by coded aperture modulation.
In this case, the vector $x_1$ represents patches extracted from the hyperspectral image, whereas $x_2$ represents patches extracted from the corresponding RGB image (see [73] for details on how data from this system are analyzed). The vectors $x_1$ and $x_2$ are assumed to be modeled by the joint GMM described in Section II-A with $K_1 = K_2 = 20$.
The parameters of the joint GMM are learned from the hyperspectral image dataset used in [74], again via the EM algorithm. Note that the images in the training dataset are associated with wavelength values that do not match perfectly those characterizing the CASSI camera. Therefore, the training algorithm is run by selecting each time the wavelengths that are closest to the nominal values of the CASSI camera. We consider real data captured by the CASSI camera, so that the entries of the projection kernel $\Phi_1$ reflect the physical implementation of the compressive imaging system [75], and they are constrained to belong to the interval $[0, 1]$. In this setup, image frames corresponding to 24 different wavelengths, from 398.6 nm to 699.5 nm, are compressed into a single snapshot of the same size. In order to evaluate the reconstruction accuracy, reference images are acquired using a different (and non-compressive) hyperspectral imaging setup. Therefore, the reference images and the side information image are not perfectly aligned with the CASSI measurement shown in the right part of Fig. 10. The reconstructed hyperspectral images without and with side information are shown in Fig. 11. It can be seen clearly that the reconstruction with side information has better quality. Furthermore, although the reference is not aligned well with the CASSI measurement, we can still compare the reconstruction PSNR over selected blocks in the image. Fig. 12 shows the reconstruction of six channels, and the corresponding PSNR values are reported in Table II. It can be noticed that the PSNR improvement due to side information is significant.
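For intuition only, a schematic sketch of a CASSI-like shift-and-sum forward model is given below; it is not a description of the exact apparatus of [73], and the mask, dimensions and RGB proxy are illustrative assumptions.

```python
import numpy as np

# Schematic CASSI-like model: each spectral band is modulated by the same
# coded aperture, shifted by one pixel per band by the disperser, and all
# bands are summed into a single snapshot.
rng = np.random.default_rng(4)
H, W, L = 64, 64, 24                      # spatial size and number of bands
cube = rng.random((H, W, L))              # stand-in hyperspectral cube
mask = (rng.random((H, W)) > 0.5).astype(float)   # binary coded aperture

snapshot = np.zeros((H, W + L - 1))
for l in range(L):
    snapshot[:, l:l + W] += mask * cube[:, :, l]  # shift-and-sum measurement
rgb = cube[:, :, [20, 12, 4]]             # crude RGB side information proxy
print(snapshot.shape, rgb.shape)
```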

VII. CONCLUSIONS
We have developed a principled framework that can be used not only to study fundamental limits in the classification and reconstruction of high-dimensional signals from low-dimensional signal features in the presence of side information, but also to obtain state-of-the-art results in imaging problems.
In particular, we have considered a linear feature-extraction model, where a decoder has access to noisy linear features of both the signal of interest and the side information signal, in order to carry out either classification or reconstruction. We have also considered a model where the joint distribution of the signal of interest and the side information, conditioned on some underlying class labels, is multivariate Gaussian, which embodies the correlation between these signals. The marginal distribution of the signal conditioned on its class label is then a Gaussian mixture, and likewise for the marginal distribution of the side information conditioned on its class label.
This modeling approach, which can be used to encapsulate a wide range of distributions, has then offered the opportunity to capitalize on tractable bounds to the misclassification probability and the reconstruction error, to construct an asymptotic characterization of the behavior of these quantities in the low-noise regime. In addition, this modeling approach has also led to a characterization of sharp sufficient conditions for a phase transition in the misclassification probability and necessary and sufficient conditions for the phase transition of the reconstruction error (the performance quantities under consideration), as a function of the geometry of the sources, the geometry of the linear feature extraction process and their interplay, reminiscent of the Slepian-Wolf and the Wyner-Ziv conditions.
It has been shown that our theory is well aligned with practice via a range of numerical results associated with low-rank data models. Of particular relevance, it has also been shown that our framework offers a principled mechanism to integrate side information in data classification and reconstruction problems, in the context of compressive hyperspectral imaging.
This work also points to various possible future directions:
• It is of interest to extend the results from a single side information source to settings with multiple sources of side information. The models generalize immediately, but the analysis is considerably more complex (as pointed out in Appendix A).
• It is also of interest to generalize the results from the scenario where the linear features are extracted randomly to scenarios where the linear features are designed [5], [6], [64] (or indeed where nonlinear features are designed [21]). This could lead to additional gains in the phase transition.
• The generalization of the results from scenarios where only the decoder has access to the side information to scenarios where both the encoder and the decoder have access to the side information is also relevant. This may lead to additional gains both with random linear features and with designed ones.
• Finally, it is believed that the framework, which applies to settings where both the signal of interest and the side information signal follow correlated Gaussian mixture models, can also be generalized to other data models; this can then translate into applications of the framework to scenarios where signals conform to different modalities.
APPENDIX A PROOF OF THEOREM 1

We start by considering the case $K_1 = 2$. We recall that the Bhattacharyya upper bound to the misclassification probability of $C_1$ is given by the expression in (91). An upper and a lower bound to the expression in (91) are simply obtained by considering the following fact.
Given $n$ non-negative numbers $a_1, \dots, a_n \ge 0$, it holds that
$$\frac{1}{n}\sum_{i=1}^{n}\sqrt{a_i} \;\le\; \sqrt{\sum_{i=1}^{n} a_i} \;\le\; \sum_{i=1}^{n}\sqrt{a_i}, \qquad (92)$$
where the first inequality derives from the concavity of the function $f(x) = \sqrt{x}$ and the second inequality can be simply proved by induction starting from $n = 2$.
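This elementary fact is easy to check numerically, e.g.:

```python
import numpy as np

rng = np.random.default_rng(5)
for _ in range(1000):
    a = rng.random(rng.integers(2, 20)) * 10   # random non-negative numbers
    s = np.sqrt(a).sum()
    # (1/n) sum(sqrt(a)) <= sqrt(sum(a)) <= sum(sqrt(a))
    assert s / len(a) <= np.sqrt(a.sum()) <= s
print("inequality verified on 1000 random draws")
```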
Then, an upper bound to $\bar{P}_{err}$ is obtained as in (93) and, similarly, a lower bound is given by $\bar{P}^{L}_{err} = \bar{P}^{U}_{err}/K_2$. The generalization of this result to the case $K_1 > 2$ is based on the evaluation of the union bound (31), which, together with (92), yields the upper bound in (94) and the corresponding lower bound $\bar{P}^{L}_{err} = \bar{P}^{U}_{err}/K_2$. Note that the lower and upper bounds differ only by the multiplicative constant $K_2$. Therefore, they are tight in terms of the diversity-order, and it is possible to derive the diversity-order associated with $\bar{P}_{err}$ from the analysis of such bounds.
We now observe that the integral in (94) also appears in the analysis of the upper bound to the misclassification probability associated with the classification between two Gaussian distributions without side information, as described in [67]. In particular, such integral can be expressed as in (95), with the relevant quantities defined in (96). For the case of zero-mean classes, i.e., assuming $\mu = 0$, a low-noise expansion for the integral in (94) is given by [67, Theorem 1] as in (97), for a fixed constant $A^{(ik,j\ell)} > 0$ and with the exponent $d(ik, j\ell)$ given by (98). Therefore, we can conclude that a low-noise expansion for the upper bound to the misclassification probability (31) is given by
$$\bar{P}^{U}_{err} \le A \cdot (\sigma^2)^{d} \quad \text{as } \sigma^2 \to 0,$$
where $A > 0$ is a fixed constant and
$$d = \min_{(i,k,j,\ell)\,:\, i \neq j} d(ik, j\ell)$$
is the worst-case diversity-order associated with the misclassification of pairs of Gaussian distributions identified by the index pairs $(i, k)$ and $(j, \ell)$, such that $i \neq j$.
It is then clear that the computation of the expansion of $\bar{P}_{err}$ for classification with side information requires the computation of the diversity-order terms (98) and, therefore, of the ranks $r^{(ik)} = \mathrm{rank}(\Gamma^{(ik)})$ and $r^{(ik,j\ell)} = \mathrm{rank}(\Gamma^{(ik,j\ell)})$, with $\Gamma^{(ik)} = \Phi P^{(ik)}$ and $\Gamma^{(ik,j\ell)} = \Phi P^{(ik,j\ell)}$, where $P^{(ik)}$ and $P^{(ik,j\ell)}$ denote the corresponding covariance factors. Therefore, in the following we provide a characterization of such ranks as a function of the numbers of features $m_1$ and $m_2$. For ease of notation, we drop superscripts when results hold for all possible choices of index pairs $(i, k)$ or quadruples $(i, k, j, \ell)$, and we assume in the following that $n_1 \ge m_1$ and $n_2 \ge m_2$; the extension to the case where $n_1 < m_1$ or $n_2 < m_2$ is straightforward.
Lemma 1: Let $\Phi$ be as in (104), such that the row spaces associated with $\Phi_1$ and $\Phi_2$ are $m_1$- and $m_2$-dimensional subspaces, isotropically distributed at random in $\mathbb{R}^{n_1}$ and $\mathbb{R}^{n_2}$, respectively, and let $P$ be as in (105). Then, with probability 1, the rank of the matrix $\Gamma = \Phi P$ is given by
$$r = \min\{r_x,\ \min\{m_1, r_{x_1}\} + \min\{m_2, r_{x_2}\}\}, \qquad (106)$$
where $r_{x_1} = \mathrm{rank}[P_{c1}\ P_1]$, $r_{x_2} = \mathrm{rank}[P_{c2}\ P_2]$ and $r_x = \mathrm{rank}(P)$.
Proof: It is easy to observe that the expression in (106) represents an upper bound to the rank $r = \mathrm{rank}(\Phi P)$, as $\mathrm{rank}(\Phi P) \le \mathrm{rank}(P)$ and $\mathrm{rank}(\Gamma)$ is always less than or equal to the sum of the ranks of the matrices obtained by considering separately its first $m_1$ and its remaining $m_2$ rows, i.e., $\mathrm{rank}(\Phi_1 [P_{c1}\ P_1\ 0])$ and $\mathrm{rank}(\Phi_2 [P_{c2}\ 0\ P_2])$.
Therefore, in the rest of the proof we aim to show that this upper bound is actually tight, by proving that we can find at least $r$ linearly independent columns in $\Gamma$.
We start by considering the special case in which we impose $\Phi_2 = I_{n_2}$, and we show that in this case it holds that $\mathrm{rank}(\Gamma) = r_{I_2} = \min\{r_x, \min\{m_1, r_{x_1}\} + r_{x_2}\}$. On recalling Sylvester's rank theorem [76], which relates the rank of a product to the ranks of its factors, we can write a first lower bound to $\mathrm{rank}(\Phi P)$. Then, we consider the matrix $\Psi_1 \in \mathbb{R}^{n_1 \times (n_1 - m_1)}$, whose columns form a basis for the null space $\mathrm{Null}(\Phi_1)$, which is isotropically distributed among the $(n_1 - m_1)$-dimensional subspaces of $\mathbb{R}^{n_1}$. It is then straightforward to show that the columns of the matrix $\Psi = [\Psi_1^T\ \ 0^T_{n_2 \times (n_1 - m_1)}]^T$ span the null space of $\Phi$, and we can write $\mathrm{rank}(\Gamma) = \mathrm{rank}[P\ \ \Psi] - (n_1 - m_1)$, in which we have leveraged the rank equality for block matrices [77] and the facts that $\mathrm{rank}(\Psi_1) = n_1 - m_1$ and $\mathrm{rank}(P) = r_x$. Consider now the computation of the rank $\mathrm{rank}[P\ \ \Psi]$. In order to compute this rank, we leverage the generalized singular value decomposition (GSVD) described in [78]. In particular, consider two matrices $A \in \mathbb{R}^{n \times p}$ and $B \in \mathbb{R}^{m \times p}$, with the same number of columns, and let $r_A = \mathrm{rank}(A)$, $r_B = \mathrm{rank}(B)$, $r_{AB} = \mathrm{rank}[A^T\ B^T]^T$ and $s_{AB} = r_A + r_B - r_{AB}$. Then, there exist orthogonal matrices $U \in \mathbb{R}^{n \times n}$, $V \in \mathbb{R}^{m \times m}$ and a non-singular matrix $X \in \mathbb{R}^{p \times p}$ such that $A = U \Sigma_A X^T$ and $B = V \Sigma_B X^T$, where $\Sigma_A$ and $\Sigma_B$ collect identity blocks, zero blocks and the diagonal blocks $D_A = \mathrm{diag}(\alpha_1, \dots, \alpha_{s_{AB}})$ and $D_B = \mathrm{diag}(\beta_1, \dots, \beta_{s_{AB}})$, with $0 < \alpha_1 \le \dots \le \alpha_{s_{AB}} < 1$, $0 < \beta_1 \le \dots \le \beta_{s_{AB}} < 1$ and $\alpha_i^2 + \beta_i^2 = 1$, for $i = 1, \dots, s_{AB}$. Therefore, on applying the GSVD to the two matrices $[P_{c1}\ P_1\ 0]$ and $[P_{c2}\ 0\ P_2]$, we can rewrite $[P\ \ \Psi]$ in the form (118), where $\tilde{\Psi}_1 = U^T \Psi_1$ is a matrix whose column space is still isotropically distributed at random among the $(n_1 - m_1)$-dimensional subspaces of $\mathbb{R}^{n_1}$. Now, by considering the first $r_x - r_{x_2}$ columns of the matrix in (118) together with its last $n_1 - m_1$ columns, and given that the columns of $\tilde{\Psi}_1$ span a random subspace of $\mathbb{R}^{n_1}$, we can conclude that, with probability 1, we can pick from such columns $\min\{r_x - r_{x_2} + n_1 - m_1, n_1\}$ linearly independent columns, which are also independent from the remaining $(r_{x_1} + r_{x_2} - r_x) + (r_x - r_{x_1}) = r_{x_2}$ non-zero columns of the same matrix. Therefore, we have
$$\mathrm{rank}[P\ \ \Psi] = \min\{r_x - r_{x_2} + n_1 - m_1, n_1\} + r_{x_2},$$
and then
$$r_{I_2} = \mathrm{rank}[P\ \ \Psi] - (n_1 - m_1) = \min\{r_x, \min\{m_1, r_{x_1}\} + r_{x_2}\}, \qquad (123)$$
where the last equality is obtained by observing that $r_x \le r_{x_1} + r_{x_2}$.
Consider now the general case, in which $\Phi_2$ is not forced to be equal to the identity matrix. In this case, by leveraging (123), we can express the rank $r$ in terms of the auxiliary quantity $r_{I_1}$, where we have used the fact that $\mathrm{rank}[\Phi_2 P_{c2}\ 0\ \Phi_2 P_2] = \min\{m_2, r_{x_2}\}$. Then, with a procedure similar to that used to compute $r_{I_2}$, it is possible to show that $r_{I_1} = \min\{r_x, r_{x_1} + \min\{m_2, r_{x_2}\}\}$, thus leading to
$$r = \min\{\min\{r_x, r_{x_1} + \min\{m_2, r_{x_2}\}\},\ \min\{m_1, r_{x_1}\} + \min\{m_2, r_{x_2}\}\} \qquad (129)$$
$$= \min\{r_x,\ \min\{m_1, r_{x_1}\} + \min\{m_2, r_{x_2}\}\}. \qquad (130)$$
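The rank expression (106) can also be verified numerically for randomly drawn factors; the dimensions below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n2, rc, ri1, ri2 = 8, 6, 2, 1, 1      # common and innovation ranks
m1, m2 = 4, 3

Pc1, P1 = rng.standard_normal((n1, rc)), rng.standard_normal((n1, ri1))
Pc2, P2 = rng.standard_normal((n2, rc)), rng.standard_normal((n2, ri2))
P = np.block([[Pc1, P1, np.zeros((n1, ri2))],
              [Pc2, np.zeros((n2, ri1)), P2]])
Phi = np.block([[rng.standard_normal((m1, n1)), np.zeros((m1, n2))],
                [np.zeros((m2, n1)), rng.standard_normal((m2, n2))]])

rx1 = np.linalg.matrix_rank(np.hstack([Pc1, P1]))
rx2 = np.linalg.matrix_rank(np.hstack([Pc2, P2]))
rx = np.linalg.matrix_rank(P)
lemma = min(rx, min(m1, rx1) + min(m2, rx2))       # expression (106)
print(np.linalg.matrix_rank(Phi @ P), lemma)       # equal with probability 1
```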
Finally, note that Lemma 1 can be immediately applied to compute $r^{(ik)}$, $r^{(j\ell)}$ and $r^{(ik,j\ell)}$, thus concluding the proof of Theorem 1.
We also note in passing that the generalization of Lemma 1 to the case of multiple side information sources $x_2, \dots, x_L$ appears to be considerably more complex, due to the absence of a transform akin to the GSVD for more than two matrices.

APPENDIX B PROOF OF COROLLARY 1

Note first that $r_x^{(ik)} = r_x^{(j\ell)} = r_x^{(ik,j\ell)}$ implies that $\mathrm{Im}(\Sigma_x^{(ik)}) = \mathrm{Im}(\Sigma_x^{(j\ell)})$. Assume now $r_x^{(ik,j\ell)} > \min\{r_x^{(ik)}, r_x^{(j\ell)}\}$. We can then use the rank expression (106) and consider separately the following cases:
1) If $m_1 \le \min\{r_{x_1}^{(ik)}, r_{x_1}^{(j\ell)}\}$ and $m_2 \le \min\{r_{x_2}^{(ik)}, r_{x_2}^{(j\ell)}\}$, then we have $r^{(ik)} = \min\{r_x^{(ik)}, m_1 + m_2\}$, $r^{(j\ell)} = \min\{r_x^{(j\ell)}, m_1 + m_2\}$ and $r^{(ik,j\ell)} = \min\{r_x^{(ik,j\ell)}, m_1 + m_2\}$. Then, if $m_1 + m_2 > \min\{r_x^{(ik)}, r_x^{(j\ell)}\}$, we have immediately that $r^{(ik,j\ell)} > \min\{r^{(ik)}, r^{(j\ell)}\}$, and thus $d(ik, j\ell) > 0$. Such sufficient conditions on the minimum numbers of measurements $m_1, m_2$ needed to guarantee $d(ik, j\ell) > 0$ are also necessary: in fact, if $m_1 + m_2 \le \min\{r_x^{(ik)}, r_x^{(j\ell)}\}$, then $r^{(ik,j\ell)} = r^{(ik)} = r^{(j\ell)} = m_1 + m_2$, so that $d(ik, j\ell) = 0$.
2) If instead $m_1 > \min\{r_{x_1}^{(ik)}, r_{x_1}^{(j\ell)}\}$ or $m_2 > \min\{r_{x_2}^{(ik)}, r_{x_2}^{(j\ell)}\}$, we can split the analysis into further subcases, in each of which the rank expression (106) is used to verify that $r^{(ik,j\ell)} > \min\{r^{(ik)}, r^{(j\ell)}\}$, and therefore $d(ik, j\ell) > 0$.
Finally, we can combine the expressions obtained in the previous cases and write necessary and sufficient conditions for $d(ik, j\ell) > 0$ as in the statement of the corollary.

APPENDIX C PROOF OF THEOREM 2

The characterization of the low-noise expansion of the upper bound to the misclassification probability in (31) for the case of nonzero-mean classes starts from the analysis of its lower and upper bounds presented in Appendix A.
We focus on the expressions in (94), (95) and (96), and we leverage the low-noise expansion of the integral in (95) presented in [67, Theorem 3] for the case of two nonzero-mean Gaussian classes. Namely, we recall that the integral in (95) decays exponentially with $1/\sigma^2$, i.e., it can be upper bounded by terms of the form $B^{(ik,j\ell)} e^{-C^{(ik,j\ell)}/\sigma^2}$ for fixed constants $B^{(ik,j\ell)}, C^{(ik,j\ell)} > 0$, if and only if condition (135) holds; otherwise, the integral in (95) can be expanded as in (97). Therefore, if condition (135) is verified for all the index quadruples $(i,k,j,\ell) \in \mathcal{S}_{SIC}$, then we can expand the upper bound to the misclassification probability in (31) as
$$\bar{P}^{U}_{err} \le B\, e^{-C/\sigma^2},$$
for fixed constants $B, C > 0$. Otherwise, the upper bound to the misclassification probability is expanded as $\bar{P}^{U}_{err} \le A \cdot (\sigma^2)^{d}$, for a fixed $A > 0$ and where
$$d = \min_{(i,k,j,\ell) \in S} d(ik, j\ell),$$
with $S$ the set of the index quadruples $(i,k,j,\ell) \in \mathcal{S}_{SIC}$ for which (135) is not verified and $d(ik, j\ell)$ as in (98).
We can now provide necessary and sufficient conditions on $m_1$ and $m_2$ such that (135) holds. Assume first that $\mu^{(ik)} - \mu^{(j\ell)} \in \mathrm{Im}(\Sigma_x^{(ik,j\ell)})$: in this case, (135) does not hold, irrespective of the exact matrix $\Phi$.
In the remaining cases, we can use the rank expression (106) and steps similar to those in the proof of Corollary 1 in order to consider separately the possible regimes of $m_1$ and $m_2$; on combining the resulting expressions, we can write necessary and sufficient conditions to guarantee (135) in each case, with the proof of the last case following steps entirely similar to those of the previous one.

APPENDIX D PROOF OF THEOREM 3
We start by proving that conditions (65) are sufficient to drive the MMSE to zero in the low-noise regime. The first condition in (65) is trivial, as it reflects the fact that it is possible to drive the reconstruction MMSE to zero in the low-noise regime from the observation of $y_1$ alone, provided that $m_1 \ge r_{x_1}$ [64].
Consider now the upper bound associated with the joint reconstruction problem, i.e., the MMSE incurred in recovering both $x_1$ and $x_2$ from $y_1$ and $y_2$ (or, equivalently, $x$ from $y$). Then, we can write
$$\mathrm{MMSE}^{G}_{1|1,2}(\sigma^2) \le \mathrm{MMSE}^{G}_{1,2|1,2}(\sigma^2),$$
and we can follow steps similar to those in [64, Appendix B] to show that, in the low-noise regime, $\mathrm{MMSE}^{G}_{1,2|1,2}(\sigma^2)$ approaches zero if and only if
$$\mathrm{rank}\left(\Phi \Sigma_x \Phi^T\right) = \mathrm{rank}\left(\Sigma_x\right). \qquad (145)$$
We can now determine conditions on the numbers of features $m_1$ and $m_2$ needed to verify (145) by leveraging the rank expression (106). In particular, note that (145) holds if and only if
$$\min\{m_1, r_{x_1}\} + \min\{m_2, r_{x_2}\} \ge r_x.$$
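The role of condition (145) can be illustrated numerically: for a low-rank Gaussian input, the MMSE saturates at a nonzero floor when the number of features is below the rank of the input covariance, and vanishes with the noise otherwise. The sketch below uses a single (non-block) random kernel for simplicity.

```python
import numpy as np

rng = np.random.default_rng(7)
n, r = 10, 4
P = rng.standard_normal((n, r))
Sigma = P @ P.T                          # low-rank input covariance, rank 4

def mmse(Phi, sigma2):
    # MMSE of estimating x ~ N(0, Sigma) from y = Phi x + w, w ~ N(0, sigma2 I)
    Sy = Phi @ Sigma @ Phi.T + sigma2 * np.eye(Phi.shape[0])
    return np.trace(Sigma - Sigma @ Phi.T @ np.linalg.solve(Sy, Phi @ Sigma))

for m in (2, 4, 6):                      # m >= rank(Sigma) = 4 is critical
    Phi = rng.standard_normal((m, n))
    print(m, [round(float(mmse(Phi, s)), 4) for s in (1e-2, 1e-5, 1e-8)])
```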
We can consider separately four different cases, and observe that, if $m_1 \le r_{x_1}$ and $m_2 \le r_{x_2}$, then (145) is verified if and only if $m_1 + m_2 \ge r_x$; proceeding similarly in the remaining cases leads to the conditions in (152). Then, the proof of sufficiency is concluded by simply considering the union of the set of values $(m_1, m_2)$ which verify (152) with the set $m_1 \ge r_{x_1}$.
We now prove that conditions (65) are also necessary to guarantee that the MMSE approaches zero when $\sigma^2 \to 0$.
In the remainder of this proof, we denote the MMSE associated with the estimation of the random vector $u$ from the observation vector $v$ by
$$\mathrm{MMSE}(u|v) = \mathbb{E}\left[\|u - \mathbb{E}[u|v]\|^2\right],$$
where the expectation is taken with respect to the joint distribution of $(u, v)$. Then, we obtain a lower bound to $\mathrm{MMSE}^{G}_{1|1,2}(\sigma^2)$ as
$$\mathrm{MMSE}^{G}_{1|1,2}(\sigma^2) = \mathrm{MMSE}(x_1|y_1, y_2) \ge \mathrm{MMSE}(x_1|y_1, y_2, w_1, w_2) = \mathrm{MMSE}(x_1|\Phi_1 x_1, \Phi_2 x_2). \qquad (154)$$
On the other hand, by observing that the MMSE does not depend on the value of the mean of the input signal to be estimated, and by taking the expectation on the right hand side of (154) with respect to the random variables $x_1|\Phi_2 x_2$ and $\Phi_2 x_2$, separately, it is possible to show that
$$\mathrm{MMSE}(x_1|\Phi_1 x_1, \Phi_2 x_2) = \mathrm{MMSE}(z|\Phi_1 z), \qquad (155)$$
where $z \in \mathbb{R}^{n_1}$ is a Gaussian vector with covariance matrix equal to the conditional covariance of $x_1$ given $\Phi_2 x_2$, i.e., $z \sim \mathcal{N}(0, \Sigma_z)$, where
$$\Sigma_z = \Sigma_{x_1} - \Sigma_{x_1 x_2}\Phi_2^T \left(\Phi_2 \Sigma_{x_2} \Phi_2^T\right)^{\dagger} \Phi_2 \Sigma_{x_2 x_1}. \qquad (156)$$
Then, by leveraging the result in [64, Theorem 1], or by simply considering the set of linear equations corresponding to the rows of the matrix $\Phi_1 \Sigma_z^{1/2}$, a necessary condition for $\mathrm{MMSE}(z|\Phi_1 z) = 0$, and, therefore, a necessary condition for $\lim_{\sigma^2 \to 0} \mathrm{MMSE}^{G}_{1|1,2}(\sigma^2) = 0$, is given by
$$m_1 \ge r_z = \mathrm{rank}(\Sigma_z). \qquad (157)$$
We complete the proof by computing the rank $r_z$ using a result on the generalized Schur complement of a positive semidefinite matrix [80]. Namely, $\Sigma_z$ can be viewed as the generalized Schur complement of the block $\Phi_2 \Sigma_{x_2} \Phi_2^T$ of the positive semidefinite matrix
$$\Sigma_{x_1 \Phi_2 x_2} = \begin{bmatrix} \Sigma_{x_1} & \Sigma_{x_1 x_2}\Phi_2^T \\ \Phi_2 \Sigma_{x_2 x_1} & \Phi_2 \Sigma_{x_2} \Phi_2^T \end{bmatrix}, \qquad (158)$$
and, with probability 1, we have [80]
$$\mathrm{rank}(\Sigma_{x_1 \Phi_2 x_2}) = r_z + \mathrm{rank}(\Phi_2 \Sigma_{x_2} \Phi_2^T) = r_z + \min\{m_2, r_{x_2}\}. \qquad (159)$$
In addition, on considering the matrix
$$\Sigma_{\Phi_2 x_2 x_1} = \begin{bmatrix} \Phi_2 \Sigma_{x_2} \Phi_2^T & \Phi_2 \Sigma_{x_2 x_1} \\ \Sigma_{x_1 x_2}\Phi_2^T & \Sigma_{x_1} \end{bmatrix}, \qquad (160)$$
and on applying the same rank computation, we also have
$$\mathrm{rank}(\Sigma_{\Phi_2 x_2 x_1}) = \mathrm{rank}(\Sigma_{x_1 \Phi_2 x_2}) = r_{x_1} + \mathrm{rank}(\mathrm{Cov}(\Phi_2 x_2 | x_1)). \qquad (161)$$
Then, on recalling that the projection kernel $\Phi_2$ is rotation-invariant, and by using again the generalized Schur complement rank computation, with probability 1, we have
$$\mathrm{rank}(\mathrm{Cov}(\Phi_2 x_2 | x_1)) = \min\{m_2, r_x - r_{x_1}\}. \qquad (162)$$
Finally, by substituting (161) and (162) in (159), we can rewrite (157) as
$$m_1 \ge r_{x_1} + \min\{m_2, r_x - r_{x_1}\} - \min\{m_2, r_{x_2}\}, \qquad (163)$$
which can be immediately shown to be equivalent to conditions (65), thus concluding the necessity part of the proof.
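The generalized Schur complement rank identity used in (159) can be checked numerically as follows (with a pseudo-inverse handling the rank-deficient case, and an explicit tolerance to keep the numerical rank stable).

```python
import numpy as np

rng = np.random.default_rng(8)
n1, n2, r, m2 = 6, 5, 4, 3
P = rng.standard_normal((n1 + n2, r))
S = P @ P.T                                  # joint covariance of (x1, x2)
S11, S12, S22 = S[:n1, :n1], S[:n1, n1:], S[n1:, n1:]
F = rng.standard_normal((m2, n2))            # plays the role of Phi_2

B = F @ S22 @ F.T
M = np.block([[S11, S12 @ F.T], [F @ S12.T, B]])   # PSD block matrix (158)
Sigma_z = S11 - S12 @ F.T @ np.linalg.pinv(B) @ F @ S12.T  # Schur complement
print(np.linalg.matrix_rank(M, tol=1e-8),
      np.linalg.matrix_rank(Sigma_z, tol=1e-8)
      + np.linalg.matrix_rank(B, tol=1e-8))   # the two values coincide
```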

APPENDIX E PROOF OF THEOREM 4
This proof is based on steps similar to those in [64, Appendix C]; nevertheless, we report here the key ideas for completeness. On defining the per-class estimates $W^{(ik)}_{x_1}(y)$, where $W^{(ik)}_{x_1}$ is as in (68), and by using the law of total probability, we can write the MSE as the sum over $(j, \ell)$ of products of the form
$$p(\hat{C}_1 = j, \hat{C}_2 = \ell\, |\, C_1 = i, C_2 = k) \cdot \mathbb{E}\left[\left\|x_1 - W^{(j\ell)}_{x_1}(y)\right\|^2 \Big|\, \hat{C}_1 = j, \hat{C}_2 = \ell, C_1 = i, C_2 = k\right].$$
Consider first the case in which $\mathrm{Im}(\Sigma^{(ik)}_x) \neq \mathrm{Im}(\Sigma^{(j\ell)}_x)$. Observe that the misclassification probability $p(\hat{C}_1 = j, \hat{C}_2 = \ell\, |\, C_1 = i, C_2 = k)$ is the measure of the set representing the decision region of the MAP classifier for the distributed classification problem associated with the classes $(j, \ell)$, with respect to the Gaussian measure induced by the Gaussian distribution of the classes $(i, k)$. Then, it is possible to show that, in the limit $\sigma^2 \to 0$, the product in (166) is upper bounded by the integral of a measurable function over a set with measure zero, and hence it converges to zero.
In the second case, instead, we have $\mathrm{Im}(\Sigma^{(ik)}_x) = \mathrm{Im}(\Sigma^{(j\ell)}_x)$, and we can consider separately two further cases. If the conditions of Theorem 2 are verified, then Theorem 2 states that $p(\hat{C}_1 = j, \hat{C}_2 = \ell\, |\, C_1 = i, C_2 = k)$ approaches zero in the low-noise regime, and we can prove that (166) holds by following a procedure similar to that used for the first case. On the other hand, if such conditions are not verified, then the misclassification probability is not guaranteed to approach zero in the low-noise regime. However, on using the law of total probability and the definition of the MSE, we can notice that the argument of the limit in (166) is the mismatched MSE incurred when estimating an input drawn from the classes $(i, k)$ via the estimator tuned to the classes $(j, \ell)$, i.e., via $W^{(j\ell)}_x(y) = \mu^{(j\ell)}_x + W^{(j\ell)}(y - \Phi\mu^{(j\ell)}_x)$, with $W^{(j\ell)}$ the corresponding Wiener filter matrix. Then, we can show that the right hand side of (170) approaches zero when $\sigma^2 \to 0$ by using steps similar to those in [64, Appendix C]. This reflects the fact that the mismatched MSE for Gaussian sources reaches zero in the low-noise regime, provided that the estimated input covariance has the same range space as the true input covariance. In particular, on denoting by $\Sigma^{(ik)}_y = \sigma^2 I + \Phi\Sigma^{(ik)}_x\Phi^T$ the covariance matrix of $y$ conditioned on $(C_1, C_2) = (i, k)$, and on introducing the symbol $M^{(ik,j\ell)}$ collecting the class means, we can write the mismatched MSE in closed form and prove that it vanishes as $\sigma^2 \to 0$. The proof is based on the matrix inversion lemma [79]
$$A\left(c^{-1}I + BA\right)^{-1}B = I - (I + cAB)^{-1},$$
in which we choose $A = \Phi^T$ and $B = \Phi\Sigma^{(j\ell)}_x$. Then, on noting that the matrix $\Phi^T\Phi\Sigma^{(j\ell)}_x$ is diagonalizable with probability 1, and by following steps similar to those adopted in the proof of Theorem 3, we are able to prove (176). Finally, (177), (178) and (179) are proved by following a completely similar approach.
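The quoted inversion lemma is easy to verify numerically:

```python
import numpy as np

# Check of A (c^{-1} I + B A)^{-1} B = I - (I + c A B)^{-1}
rng = np.random.default_rng(9)
n, m, c = 5, 7, 0.3
A = rng.standard_normal((n, m))
B = rng.standard_normal((m, n))
lhs = A @ np.linalg.inv(np.eye(m) / c + B @ A) @ B
rhs = np.eye(n) - np.linalg.inv(np.eye(n) + c * A @ B)
print(np.allclose(lhs, rhs))   # True
```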