Bounds on the Number of Measurements for Reliable Compressive Classification

This paper studies the classification of high-dimensional Gaussian signals from low-dimensional noisy, linear measurements. In particular, it provides upper bounds (sufficient conditions) on the number of measurements required to drive the probability of misclassification to zero in the low-noise regime, both for random measurements and designed ones. Such bounds reveal two important operational regimes that are a function of the characteristics of the source: i) when the number of classes is less than or equal to the dimension of the space spanned by signals in each class, reliable classification is possible in the low-noise regime by using a one-vs-all measurement design; ii) when the dimension of the spaces spanned by signals in each class is lower than the number of classes, reliable classification is guaranteed in the low-noise regime by using a simple random measurement design. Simulation results with both synthetic and real data show that our analysis is sharp, in the sense that it is able to gauge the number of measurements required to drive the misclassification probability to zero in the low-noise regime.


I. INTRODUCTION
Compressive sensing (CS) is an emerging paradigm that offers the means to simultaneously sense and compress a signal without significant loss of information [3]-[5] (under appropriate conditions on the signal model and measurement process). The sensing process is based on computing the inner product of the signal of interest with a set of vectors, which are typically constituted randomly [3]-[5], and the recovery process is based on the resolution of an inverse problem. The result that has captured the imagination of the signal and information processing community is that it is possible to perfectly reconstruct an n-dimensional s-sparse signal (sparse in some orthonormal dictionary or frame) with overwhelming probability with only O(s log(n/s)) linear random measurements [3], [5], [6] using tractable ℓ1 minimization methods [4] or iterative methods, like greedy matching pursuit [7]. (This paper was presented in part at the 2013 IEEE International Symposium on Information Theory [1] and the 2013 IEEE Global Conference on Signal and Information Processing [2]. The work of H. Reboredo )
The focus of compressive sensing has been primarily on exact or near-exact signal reconstruction from a set of linear signal measurements. However, it is also natural to leverage the paradigm to perform other relevant information processing tasks, such as detection, classification and estimation of certain parameters, from the set of compressive measurements. One could in fact argue that the paradigm is a better fit to decision support tasks such as signal detection, signal classification or pattern recognition rather than signal reconstruction, since it may be easier to discriminate between signal classes than reconstruct an entire signal using only partial information about the source signal.
This paper concentrates on the classification of signals from a set of compressive linear and noisy measurements. In particular, we consider the case where signals associated to different classes lie on low-dimensional linear subspaces. This problem is fundamental to the broad fields of signal and image processing [8]- [10], computer vision [11], [12] and machine learning [13], [14], as pre-processing often relies on dimension reduction to increase the speed and reliability of classification as well as reduce the complexity and cost of data processing and computation.
Compressive classification appears in the machine learning literature as feature extraction or supervised dimensionality reduction. For example, linear dimensionality reduction methods based on geometrical characterizations of the source have been developed, with linear discriminant analysis (LDA) [15] and principal component analysis (PCA) [15] depending only on second-order statistics. In particular, LDA, which is one of the most well-known supervised dimensionality reduction methods [16], addresses simultaneously the between-class scattering and the within-class scattering of the measured data. Linear dimensionality reduction methods based on higher-order statistics of the data have therefore also been developed [14], [17]-[23]. In particular, an information-theoretic approach to supervised dimensionality reduction, which uses the mutual information between the data class labels and the data measurements [14] or approximations of the mutual information via the quadratic Rényi entropy [18], [23], [24] as a criterion to linearly reduce dimensionality, has been shown to lead to state-of-the-art classification results. More recently, learning methods for linear dimensionality reduction based on nuclear norm optimization have also been proposed [25], which have been shown to lead to state-of-the-art results for face clustering, face recognition and motion segmentation applications. Low-dimensional random linear measurements have also been used in conjunction with linear classifiers in scenarios where the number of training samples is smaller than the data dimension [26]. In particular, [26] derives bounds on the generalization error of a binary Fisher linear discriminant (FLD) classifier with linear random measurements.
Compressive classification also appears in the compressive information processing literature in view of recent advances in compressive sensing [13], [27]- [33]. Reference [27] presents algorithms for signal detection, classification, estimation and filtering from random compressive measurements. References [28], [29], [30] and [31] study the performance of compressive detection and compressive classification for the case of random measurements. References [32] and [33] consider the problem of detection of spectral targets based on noisy incoherent projections. Reference [13] notes that a small number of random measurements captures sufficient information to allow robust face recognition. The common thread in this line of research relates to the demonstration that the detection and classification problems can be solved directly in the measurement domain, without requiring the transformation of the data from the compressive to the original data domain, i.e. without requiring the reconstruction of the data.
Other works associated with compressive classification that have arisen in the computational imaging literature, and developed under the rubric of task-specific sensing, include [8]- [10], [34]- [37]. In particular, task-specific sensing, which advocates that the sensing procedure has to be matched to the task-specific nature of the sensing application, has been shown to lead to substantial gains in performance over compressive sensing in applications such as localization [34], target detection [8], (face) recognition [9], [10], and reconstruction [35].
The majority of the contributions in the literature to date have focused on the proposal of linear measurement design algorithms for two- and multiple-class classification problems (e.g. [14], [15], [17]-[23], [38], [39]). Such algorithms - with the exception of two-class problems [38], [39] - do not typically lead to closed-form measurement designs, thereby not providing clear insight into the geometry of the measurements and preventing us from understanding how classification performance behaves as a function of the number of measurements. This paper attempts to fill this gap by asking the question: What is the number of measurements that guarantees reliable classification in compressive classification applications?
We answer this question both for the scenario where the measurements are random and the more challenging scenario where the measurements are designed, when the distribution of the signal conditioned on the class is multivariate Gaussian with zero mean and a certain (rank-deficient) covariance matrix. In addition, our answer to this question also leads to simple and insightful closed-form measurement designs both for two-class and multi-class classification problems.¹ Analytical bounds on the number of measurements required for reliable classification are derived in this work for the asymptotic regime of low noise; numerical results also showcase the validity of such predictions at positive noise levels.
We adopt this data model for three main reasons: first, our classification problem corresponds to the Bayesian counterpart of low-dimensional subspace classification problems that are ubiquitous in practice; second, this model often leads to state-of-the-art results in compressive classification applications such as character and digit recognition as well as image classification [14], [23], [24]; third, this model - which entails that the source distribution is a Gaussian mixture model (GMM) - also relates to various well-known models in the literature, including unions of sub-spaces [40], [41], wavelet trees [40], [42] and manifolds [43], [44], that aim to capture additional signal structure beyond primitive sparsity in order to yield further gains. The framework based on GMM priors has also been used for the problem of signal recovery. In that case, the objective is not to determine from which Gaussian distribution the observed signal was drawn, but to reconstruct its value from compressive, noisy, linear measurements. Analytical bounds on the number of measurements needed for reliable signal reconstruction in the low-noise regime have been derived in [45], [46].
The remainder of this paper is organized as follows: Section II defines the problem, including the measurement model, source model, and performance metrics. Section III presents an upper bound to the misclassification probability and its expansion at low noise that is the basis of our analysis. Sections IV and V derive upper bounds on the number of measurements sufficient for reliable classification. In Section VI we report numerical results that validate the theoretical analysis with both synthetic data and real data from video segmentation and face recognition applications. Section VII contains a discussion on the impact of model mismatch in real data scenarios and, finally, we draw conclusions in Section VIII. The proofs of some of the results are relegated to the Appendices.
The article adopts the following notation: boldface upper-case letters denote matrices (X), boldface lower-case letters denote column vectors (x) and italics denote scalars (x); the context defines whether the quantities are deterministic or random. I_N represents the N × N identity matrix, 0_{M×N} represents the M × N zero matrix (the subscripts that refer to the dimensions of such matrices will be dropped when evident from the context) and diag(a_1, a_2, . . . , a_N) represents an N × N diagonal matrix with diagonal elements a_1, a_2, . . . , a_N. The operators (·)^T, rank(·), det(·), pdet(·) and tr(·) represent the transpose operator, the rank operator, the determinant operator, the pseudo-determinant operator and the trace operator, respectively. Null(·) and Im(·) denote the null space and the (column) image of a matrix, respectively, and dim(·) denotes the dimension of a linear subspace. We also use the symbol (·)^⊥ to denote the orthogonal complement of a linear space. The multivariate Gaussian distribution with mean µ and covariance matrix Σ is denoted by N(µ, Σ) and the symbol P[E] is used to denote the probability of the event E. log(·) denotes the natural logarithm. For the sake of a compact notation, we also use the symbols N_i = Null(Σ_i) and R_i = Im(Σ_i), as well as N_ij = Null(Σ_i + Σ_j) = N_i ∩ N_j and R_ij = Im(Σ_i + Σ_j) = R_i + R_j, where + denotes the sum of linear subspaces. The article also uses the symbol [x]_+ = max{x, 0}, the floor operator ⌊x⌋, which represents the largest integer less than or equal to x, and the little-o notation o(·).

¹ Note that the problem of compressive classification of signals drawn from Gaussian distributions has also been considered in the preliminary papers [1], [2], where the behavior of the misclassification probability in the low-noise regime was studied for the case of random measurements [1] and for the case of designed measurements [2], but offering a closed-form characterization only for binary classifiers.
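As a quick numerical companion to this notation, the pseudo-determinant, image and null space of a rank-deficient covariance can all be realized from an eigendecomposition. The sketch below is our own illustration (the helper names are not from the paper):

```python
import numpy as np

def pdet(A, tol=1e-10):
    """Pseudo-determinant: product of the nonzero eigenvalues of a
    symmetric positive semidefinite matrix A."""
    w = np.linalg.eigvalsh(A)
    nz = w[w > tol]
    return float(np.prod(nz)) if nz.size else 1.0

def image_basis(A, tol=1e-10):
    """Columns form an orthonormal basis of Im(A) for symmetric PSD A."""
    w, V = np.linalg.eigh(A)
    return V[:, w > tol]

def null_basis(A, tol=1e-10):
    """Columns form an orthonormal basis of Null(A) for symmetric PSD A."""
    w, V = np.linalg.eigh(A)
    return V[:, w <= tol]

# Example: a rank-2 covariance in R^4 with eigenvalues {3, 2}.
U = np.linalg.qr(np.random.default_rng(0).standard_normal((4, 2)))[0]
Sigma = U @ np.diag([3.0, 2.0]) @ U.T
```

Here `pdet(Sigma)` evaluates to 6 (the product of the nonzero eigenvalues), while `image_basis` and `null_basis` each return two orthonormal columns, so dim(Im(Σ)) + dim(Null(Σ)) = N.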

II. PROBLEM STATEMENT
We consider the standard measurement model given by:

y = Φx + n,    (1)

where y ∈ R^M represents the measurement vector, x ∈ R^N represents the source vector, Φ ∈ R^{M×N} represents the measurement matrix or kernel² and n ∼ N(0, σ²I) ∈ R^M represents white Gaussian noise.³ We also consider that the source model is such that:

A.1 The source class C ∈ {1, . . . , L} is drawn with probability p_i, i = 1, . . . , L.

A.2 The source signal conditioned on the class C = i is drawn from a multivariate Gaussian distribution with zero mean and (rank-deficient) covariance matrix Σ_i ∈ R^{N×N}.

We should point out that we use a low-rank modeling approach even though many natural signals (e.g. patches extracted from natural images, face images, motion segmentation features, handwritten digits images, etc.) are not always low-rank but rather "approximately" low-rank [44]. The justification for the use of such a low-rank modeling approach is two-fold: first, a low-rank representation is often a very good approximation to real scenarios, particularly as the eigenvalues of the class-conditioned covariances often decay rapidly; second, it is then standard practice to account for the mismatch between the low-rank and the "approximately" low-rank model by adding extra noise in the measurement model in (1) (see [46]).⁴

It is assumed that the classifier - which infers the true signal class from the signal measurements using a maximum a posteriori (MAP) classifier - is provided with the knowledge of the true model parameters, i.e., the prior probabilities p_i, i = 1, . . . , L, the source covariance matrices Σ_i, i = 1, . . . , L, the measurement matrix Φ and the noise variance σ². In particular, the signal class estimate produced by the classifier is given by:

Ĉ = arg max_{i ∈ {1,...,L}} p(C = i|y) = arg max_{i ∈ {1,...,L}} p_i · p(y|C = i),    (2)

where p(C = i|y) is the a posteriori probability of class C = i given the measurement vector y and p(y|C = i) represents the probability density function of the measurement vector y given the class C = i, which is zero-mean Gaussian, with covariance matrix ΦΣ_iΦ^T + Iσ².

² We refer to Φ as the measurement or sensing matrix/kernel interchangeably throughout the paper.
³ The results presented in the remainder of the paper can be easily generalized to the case when the noise covariance matrix is a positive definite matrix Σ_n.
⁴ We also note that our analysis focuses on zero-mean models, since various datasets (e.g., face images and motion segmentation features) can be well represented via zero-mean classes [44], [47]. However, some of the results in the paper can also be generalized to the case of nonzero-mean classes.
Our objective is to characterize the number of measurements sufficient for reliable classification in the asymptotic limit of low noise, i.e. such that

lim_{σ²→0} P_e = 0,    (3)

where P_e is the misclassification probability of the MAP classifier. Note also that the asymptotic regime of low noise plays a fundamental role in various signal and image processing scenarios, e.g., digit recognition and satellite data classification [14]. In particular, by using the law of total probability, we can write

P_e = Σ_{i=1}^{L} p_i · P[y ∈ R^M \ D_i | C = i],    (4)

where D_i is the decision region associated to class i, that is, the set of values y corresponding to the output Ĉ = i. Moreover, we can express the indicator function of the set R^M \ D_i in terms of the unit step function u(·) as

1_{R^M \ D_i}(y) = u( max_{j≠i} p_j p(y|C = j) − p_i p(y|C = i) ),    (5)

and, by leveraging the definition of the MAP classifier in (2), we can write the misclassification probability as

P_e = Σ_{i=1}^{L} p_i ∫ u( max_{j≠i} p_j p(y|C = j) − p_i p(y|C = i) ) p(y|C = i) dy.    (6)

The low-noise characterization of the misclassification probability will be carried out both for random measurements and designed measurements.
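The MAP rule in (2) amounts to comparing Gaussian log-likelihoods of y under each class-conditional measurement distribution N(0, ΦΣ_iΦ^T + σ²I). The following Monte Carlo sketch (our own illustration, with hypothetical dimensions) estimates P_e for a random kernel:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
N, M, L, r, sigma2 = 16, 6, 3, 2, 1e-4   # hypothetical dimensions

# Rank-r class covariances with random r-dimensional images (A.2).
Us = [np.linalg.qr(rng.standard_normal((N, r)))[0] for _ in range(L)]
Sigmas = [U @ U.T for U in Us]
priors = np.full(L, 1.0 / L)                     # uniform priors (A.1)
Phi = rng.standard_normal((M, N)) / np.sqrt(N)   # random Gaussian kernel

# Class-conditional measurement covariances: Phi Sigma_i Phi^T + sigma^2 I.
covs = [Phi @ S @ Phi.T + sigma2 * np.eye(M) for S in Sigmas]

def map_classify(y):
    # MAP rule (2): maximize log p_i + log p(y | C = i).
    scores = [np.log(priors[i])
              + multivariate_normal.logpdf(y, mean=np.zeros(M), cov=covs[i])
              for i in range(L)]
    return int(np.argmax(scores))

# Monte Carlo estimate of the misclassification probability P_e.
trials, errors = 1000, 0
for _ in range(trials):
    c = int(rng.integers(L))
    x = Us[c] @ rng.standard_normal(r)           # x ~ N(0, Sigma_c)
    y = Phi @ x + np.sqrt(sigma2) * rng.standard_normal(M)
    errors += map_classify(y) != c
Pe_hat = errors / trials
```

With M larger than the subspace dimension r and a low noise level, the estimated P_e is close to zero, consistent with the phase-transition behavior analyzed in the sequel.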
Our characterization will also be based on the following additional assumptions:

A.3 The linear spaces R_i = Im(Σ_i), i = 1, . . . , L, are of equal dimension, i.e. dim(R_i) = r_Σ < N, i = 1, . . . , L;⁶

A.4 The linear spaces R_i are independently drawn from a continuous probability density function (pdf) over the Grassmann manifold of subspaces of dimension r_Σ in R^N, so that the null spaces N_i = Null(Σ_i) are also of equal dimension, i.e. dim(N_i) = N − r_Σ, and are also drawn independently from a continuous pdf over the Grassmann manifold of subspaces of dimension N − r_Σ in R^N.⁷

The assumptions A.3 and A.4 imply that, with probability 1,

dim(R_ij) = min{N, 2r_Σ}

and

dim(N_ij) = N − dim(R_ij) = [N − 2r_Σ]_+.

Our characterization will also use the quantities:

dim(R_ij) − dim(R_i ∩ R_j),

that relates to the difference between the dimension of the sub-spaces spanned by source signals in classes i or j and the dimension of the intersection of such sub-spaces; and

r_i = rank(ΦΣ_iΦ^T) and v_i = pdet(ΦΣ_iΦ^T),

which measure the dimension of the sub-space spanned by the linear transformation of the signals in class i and the volume occupied by those signals in R^M, respectively, and

r_ij = rank(Φ(Σ_i + Σ_j)Φ^T) and v_ij = pdet(Φ(Σ_i + Σ_j)Φ^T),

which measure the dimension of the direct sum of sub-spaces spanned by the linear transformation of the signals in classes i or j and the volume occupied by the measured signals from classes i and j in R^M, respectively.
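The generic-position statements in A.3-A.4 can be checked numerically: subspaces drawn via QR factorizations of Gaussian matrices are in general position with probability 1, so the dimension formulas hold exactly. A sketch with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, r = 10, 4   # hypothetical ambient and subspace dimensions (r = r_Sigma)

def random_image_basis(N, r, rng):
    # Orthonormal basis of a subspace drawn from a continuous
    # distribution on the Grassmann manifold (assumption A.4).
    return np.linalg.qr(rng.standard_normal((N, r)))[0]

U1 = random_image_basis(N, r, rng)
U2 = random_image_basis(N, r, rng)

# dim(R_1 + R_2) = rank([U1 U2]) = min{N, 2 r_Sigma} with probability 1.
dim_sum = int(np.linalg.matrix_rank(np.hstack([U1, U2])))

# dim(N_1 ∩ N_2) = N - dim(R_1 + R_2), since N_i = R_i^perp.
dim_null_int = N - dim_sum
```

For N = 10 and r_Σ = 4 this yields dim(R_12) = 8 and dim(N_12) = 2, matching min{N, 2r_Σ} and [N − 2r_Σ]_+.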

III. MISCLASSIFICATION PROBABILITY, BOUNDS AND EXPANSIONS
The basis of our characterization of an upper bound to the number of random or designed measurements sufficient for reliable classification is an asymptotic expansion of an upper bound to the misclassification probability of the MAP classifier in (6). We work with an upper bound to the misclassification probability in lieu of the true misclassification probability, in view of the lack of closed-form expressions for the misclassification probability of the MAP classifier.
In particular, the Bhattacharyya bound [31] represents an upper bound to the misclassification probability associated to the binary MAP classifier which is based on the inequality min{a, b} ≤ √(ab), for a, b > 0. Then, it is possible to establish, by using the union bound in conjunction with the Bhattacharyya bound, that the misclassification probability of the MAP classifier can be upper bounded as follows:

P_e ≤ P̄_e = Σ_{i=1}^{L} Σ_{j=i+1}^{L} √(p_i p_j) e^{−K_ij},    (16)

where

K_ij = (1/2) log [ det( (ΦΣ_iΦ^T + ΦΣ_jΦ^T)/2 + σ²I ) / √( det(ΦΣ_iΦ^T + σ²I) det(ΦΣ_jΦ^T + σ²I) ) ].

Note that the exponent K_ij is a function of the ratio between the volume collectively occupied by measured signals belonging to classes i and j and the product of the volumes occupied distinctly by measured signals in class i and measured signals in class j.

⁷ Note that this assumption on the linear spaces occupied by signals in different classes reflects well the behavior of many real data ensembles for various applications as face recognition, video motion segmentation, or digits classification [14], [47].
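The bound (16) can be evaluated directly from the model parameters. Below is a minimal sketch (the function name is ours), using the standard Bhattacharyya distance between the zero-mean Gaussians N(0, ΦΣ_iΦ^T + σ²I) and N(0, ΦΣ_jΦ^T + σ²I):

```python
import numpy as np

def bhattacharyya_bound(Sigmas, priors, Phi, sigma2):
    """Union/Bhattacharyya upper bound on the MAP misclassification
    probability for zero-mean Gaussian classes measured as
    y = Phi x + n, with n ~ N(0, sigma2 I)."""
    M = Phi.shape[0]
    A = [Phi @ S @ Phi.T + sigma2 * np.eye(M) for S in Sigmas]
    bound = 0.0
    for i in range(len(A)):
        for j in range(i + 1, len(A)):
            # Bhattacharyya distance between N(0, A_i) and N(0, A_j):
            # K_ij = 1/2 log det((A_i+A_j)/2) - 1/4 log det A_i - 1/4 log det A_j
            _, ld_m = np.linalg.slogdet(0.5 * (A[i] + A[j]))
            _, ld_i = np.linalg.slogdet(A[i])
            _, ld_j = np.linalg.slogdet(A[j])
            K = 0.5 * ld_m - 0.25 * (ld_i + ld_j)
            bound += np.sqrt(priors[i] * priors[j]) * np.exp(-K)
    return bound

# Example: two rank-2 classes in R^8, M = 4 random measurements.
rng = np.random.default_rng(0)
Us = [np.linalg.qr(rng.standard_normal((8, 2)))[0] for _ in range(2)]
Sigmas_ex = [U @ U.T for U in Us]
Phi_ex = rng.standard_normal((4, 8))
b_hi = bhattacharyya_bound(Sigmas_ex, [0.5, 0.5], Phi_ex, 1e-2)
b_lo = bhattacharyya_bound(Sigmas_ex, [0.5, 0.5], Phi_ex, 1e-6)
```

Since here M = 4 exceeds r_Σ = 2, the bound decays as σ² decreases (b_lo ≪ b_hi), previewing the phase-transition analysis of the next sections.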
The following lemma now provides the low-noise expansion of the upper bound to the probability of error. It expresses the upper bound via two quantities: one characterizes the slope of the decay of the upper bound to the misclassification probability (in a log σ² scale) and the other defines the power offset of the upper bound to the misclassification probability at low noise levels.
Lemma 1: Consider the measurement model in (1) and the assumptions A.1, A.2 in Section II. Then, in the regime of low noise where σ² → 0, the upper bound to the probability of misclassification can be expanded as:

P̄_e = c · (σ²)^d + o((σ²)^d),

where

d = min_{(i,j): i≠j} (2r_ij − r_i − r_j)/4

and c > 0 is a constant, independent of σ², that collects the prior probabilities and the volumes occupied by the measured signals of the dominant (slowest-decaying) class pairs. This lemma leads immediately to the following corollary that provides conditions for lim_{σ²→0} P̄_e = 0 and hence conditions for lim_{σ²→0} P_e = 0.
Corollary 1: Consider the measurement model in (1) and the assumptions A.1, A.2 in Section II. We have that

lim_{σ²→0} P̄_e = 0, if d > 0,

and

lim_{σ²→0} P̄_e = c > 0, if d = 0.

The conditions that guarantee that lim_{σ²→0} P̄_e = 0 stem directly from conditions that guarantee d > 0. The ensuing analysis then concentrates on how to characterize the effect of the number of random measurements or designed measurements on the value of the exponent d, as a proxy to characterize the phase transition in (3).

IV. RANDOM MEASUREMENTS
We first consider the simpler problem where the measurement matrix Φ is random. In particular, we consider that the measurement matrix is randomly drawn from a left rotation-invariant distribution.⁸ We consider the following problem: Determine the minimum number of random measurements needed to guarantee that

lim_{σ²→0} − log P̄_e / log(1/σ²) > d_0.

The following proposition provides a solution for the case d_0 = 0 that leads precisely to the minimum number of measurements for lim_{σ²→0} P̄_e = 0, hence an upper bound on the minimum number of measurements for lim_{σ²→0} P_e = 0.

Proposition 1: Consider the measurement model in (1), where the assumptions A.1-A.4 in Section II are verified and Φ is drawn from a left rotation-invariant distribution. Then, the minimum number of measurements for lim_{σ²→0} P̄_e = 0 is

M = r_Σ + 1.

Proof: The proof of this proposition follows immediately from the characterization in Corollary 1 and from the observation that, with probability 1 over the distribution of Φ,

r_i = min{M, r_Σ} and r_ij = min{M, 2r_Σ, N},

so that d = min_{i≠j} (2r_ij − r_i − r_j)/4 > 0 if and only if M ≥ r_Σ + 1.

The following proposition provides a generalization of this result from the case d_0 = 0 to d_0 > 0.

Proposition 2: Consider the measurement model in (1), where the assumptions A.1-A.4 in Section II are verified and Φ is drawn from a left rotation-invariant distribution. Then, for 0 < d_0 < R/4, the minimum number of measurements for lim_{σ²→0} − log P̄_e / log(1/σ²) > d_0 is

M = r_Σ + ⌊2d_0⌋ + 1.

Proof: The proof of this proposition follows immediately from the characterization of the exponent d in Lemma 1 and from the observation that, with probability 1 over the distribution of Φ,

d = [min{M, 2r_Σ, N} − r_Σ]_+ / 2,

so that d > d_0 if and only if M ≥ r_Σ + ⌊2d_0⌋ + 1.

We note that the result in Proposition 1 implies that reliable classification with random measurements is obtained when the signals are embedded into a linear space with dimension strictly greater than the dimension of the spaces spanned by the class-conditioned input signals, i.e., r_Σ; in fact, when this is not the case, the measured signals occupy the entire space R^M and, therefore, they are not distinguishable with arbitrarily low misclassification probability when σ² → 0.

⁸ A random matrix A ∈ R^{m×n} is said to be (left or right) rotation-invariant if the joint pdf of its entries p(A) satisfies p(ΘA) = p(A), or p(AΨ) = p(A), respectively, for any orthogonal matrix Θ or Ψ. A special case of (left and right) rotation-invariant random matrices is represented by matrices with independent identically distributed (i.i.d.), zero-mean Gaussian entries with fixed variance, which is common in the CS literature [3], [5].
On the other hand, the results in Proposition 2 unveil the interplay between the decay rate of the upper bound to the misclassification probability, the measurements and the geometry of the source. In particular, the results imply that the decay rate scales linearly with the number of measurements up to the maximum decay rate associated with the upper bound in (16), i.e., R/4, which is achieved when signals are embedded into a linear space with dimension equal to dim(R ij ) = min{N, 2r Σ }, i.e., the dimension of the sum of any pair of spaces spanned by signals in a given class.
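This interplay can be observed numerically by sweeping the number of random measurements M and evaluating the exponent d = min_{i<j}(2r_ij − r_i − r_j)/4 used in the proof of Proposition 3 (a sketch under hypothetical dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, r, L = 20, 4, 3   # hypothetical dimensions (r = r_Sigma)

Us = [np.linalg.qr(rng.standard_normal((N, r)))[0] for _ in range(L)]
Sigmas = [U @ U.T for U in Us]

def exponent(Phi):
    # d = min_{i<j} (2 r_ij - r_i - r_j) / 4, evaluated with generic ranks
    # r_i = rank(Phi Sigma_i Phi^T), r_ij = rank(Phi (Sigma_i+Sigma_j) Phi^T).
    rk = lambda A: int(np.linalg.matrix_rank(A, tol=1e-9))
    ri = [rk(Phi @ S @ Phi.T) for S in Sigmas]
    return min((2 * rk(Phi @ (Sigmas[i] + Sigmas[j]) @ Phi.T) - ri[i] - ri[j]) / 4
               for i in range(L) for j in range(i + 1, L))

# With a Gaussian kernel, d = [min{M, 2r} - r]_+ / 2 with probability 1:
# an error floor for M <= r, then growth of 1/2 per extra measurement.
d_by_M = {M: exponent(rng.standard_normal((M, N))) for M in range(1, 2 * r + 2)}
```

The sweep exhibits exactly the behavior described above: d = 0 up to M = r_Σ, then linear growth until it saturates once M reaches dim(R_ij) = min{N, 2r_Σ}.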

V. DESIGNED MEASUREMENTS
We now consider the more challenging problem where the measurement matrix Φ is designed. In particular, we now consider the following problem: Determine the minimum number of designed measurements needed to guarantee that

lim_{σ²→0} − log P̄_e / log(1/σ²) > d_0.

Note once again that by setting d_0 = 0 one obtains an upper bound to the minimum number of measurements for lim_{σ²→0} P_e = 0, thereby guaranteeing a phase transition in the misclassification probability; and by setting d_0 > 0 one obtains an upper bound to the minimum number of measurements for lim_{σ²→0} − log P_e / log(1/σ²) > d_0, thereby guaranteeing a certain decay rate of the misclassification probability.
We will consider separately the case of two classes and the multiple classes scenario.

A. Two classes
The following propositions provide an upper bound to the minimum number of measurements required to drive the misclassification probability to zero at a rate higher than a given value d 0 .
Proposition 3: Consider the measurement model in (1) where the assumptions A.1-A.4 in Section II are verified and L = 2. Then, an upper bound on the minimum number of measurements for lim_{σ²→0} P̄_e = 0 is

M = 1,    (29)

and a possible measurement matrix that achieves (29) is

Φ = φ^T, with φ ∈ (N_1 ∪ N_2) \ (N_1 ∩ N_2).    (30)

Proof: The proof of this proposition follows immediately from the evaluation of the expansion exponent d of the upper bound (16). Namely, when Φ = φ^T, where φ ∈ R^{N×1} is a vector in N_1 or N_2 that is not contained in the intersection N_1 ∩ N_2, we obtain d = (2r_12 − r_1 − r_2)/4 = 1/4 > 0, which immediately implies (29). Note also that the existence of the vector φ is guaranteed by the fact that, if r_Σ < N, then R_1 ≠ R_2 and, therefore, N_1 ∩ N_2 is strictly contained in both N_1 and N_2.

Proposition 4: Consider the measurement model in (1) where the assumptions A.1-A.4 in Section II are verified and L = 2. Then, for 0 ≤ d_0 < R/4, an upper bound on the minimum number of measurements for lim_{σ²→0} − log P̄_e / log(1/σ²) > d_0 is

M = ⌊4d_0⌋ + 1,    (31)

and a measurement matrix Φ that achieves (31) is obtained by choosing arbitrarily ⌊4d_0⌋ + 1 out of the R rows of the matrix

[b_1, . . . , b_{R/2}, c_1, . . . , c_{R/2}]^T,

where the sets {a_1, . . . , a_{dim(N_12)}}, {a_1, . . . , a_{dim(N_12)}} ∪ {b_1, . . . , b_{R/2}} and {a_1, . . . , a_{dim(N_12)}} ∪ {c_1, . . . , c_{R/2}} constitute an orthonormal basis of the linear spaces N_12, N_1 and N_2, respectively, and

R = 2 (min{2r_Σ, N} − r_Σ).

We can observe that a designed kernel can offer marked improvements over a random one in the low-noise regime. Namely, perfect separation of the measured signals can be achieved with a single measurement - with a random measurement kernel we require M ≥ r_Σ + 1 - and the maximum decay exponent d = R/4 associated with the upper bound (16) can be achieved with only R designed measurements.

We also observe that the kernel design embedded in Proposition 4 relates to previous results in the literature about measurement kernel optimization for the two-class classification problem. In particular, for the case of zero-mean classes, it was shown in [38] that the measurement kernel minimizing the Bhattacharyya bound of the misclassification probability for two zero-mean classes is obtained via the eigenvalue decomposition of the matrix Σ_1^{−1} Σ_2, where the covariance matrices Σ_1 and Σ_2 are assumed to be full rank.
A generalization of this construction for the case when Σ_1 and Σ_2 are not invertible is presented in [39]. Such kernel design leverages the generalized singular value decomposition (GSVD) [48] of the pair of matrices (Σ_1, Σ_2) in order to minimize the corresponding Bhattacharyya upper bound. In particular, it is shown that the most discriminant measurements are those corresponding to generalized eigenvectors which lie in the intersections R_1 ∩ N_2 or R_2 ∩ N_1. Then, on recalling that R_i = N_i^⊥, we can note that the most discriminant measurements are picked from a subspace contained in N_1 (N_2) that is also orthogonal to (and therefore, not contained in) N_2 (N_1). In this sense, the construction described by Proposition 4 is similar to this result. However, there are significant differences between our results and the results in [39]. First, our analysis applies to a sensing scenario in lieu of feature extraction; so the measurements in (1) are contaminated by noise whereas the measurements in [39] are not. More importantly, the analysis in [39] does not offer an explicit characterization of the number of measurements needed to guarantee a given misclassification probability performance. On the other hand, our analysis offers sufficient conditions for reliable classification in the low-noise regime and a direct connection between the number of measurements taken on the source signal and the low-noise behavior of the corresponding upper bound to the misclassification probability via the exponent d.
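The single-measurement construction of Proposition 3 can be sketched as follows (hypothetical dimensions; the vector φ is obtained by projecting a generic vector onto Im(Σ_1)^⊥ = N_1):

```python
import numpy as np

rng = np.random.default_rng(4)
N, r = 12, 5   # hypothetical dimensions with r_Sigma < N

U1 = np.linalg.qr(rng.standard_normal((N, r)))[0]
U2 = np.linalg.qr(rng.standard_normal((N, r)))[0]
Sigma1, Sigma2 = U1 @ U1.T, U2 @ U2.T

# Pick phi in N_1 = Null(Sigma_1) = Im(Sigma_1)^perp by projecting a
# random vector onto the orthogonal complement of Im(Sigma_1).
v = rng.standard_normal(N)
phi = v - U1 @ (U1.T @ v)
phi /= np.linalg.norm(phi)

# A single measurement y = phi^T x + n carries zero signal power for
# class 1 and, with probability 1, nonzero power for class 2 - the
# property that drives the exponent d = 1/4 in Proposition 3.
power1 = float(phi @ Sigma1 @ phi)
power2 = float(phi @ Sigma2 @ phi)
```

Here `power1` vanishes (up to machine precision) while `power2` is strictly positive, so the two classes are perfectly separable from one measurement as σ² → 0.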

B. Multiple classes
The following propositions offer an upper bound to the minimum number of measurements required to drive the misclassification probability to zero, and a procedure to determine an upper bound to the minimum number of measurements required to guarantee that the misclassification probability decays to zero with an exponent higher than a given value d_0.

Proposition 5: Consider the measurement model in (1) where the assumptions A.1-A.4 in Section II are verified. Then, an upper bound on the minimum number of measurements for lim_{σ²→0} P̄_e = 0 is

M = min{L − 1, r_Σ + 1}.    (34)

Moreover, a measurement matrix Φ that achieves (34) is obtained as follows: let N_i be a matrix that contains a basis for the null space N_i. Then, the M = min{L−1, r_Σ +1} rows of the matrix Φ are obtained by randomly picking one row from each of the matrices N^T_π(1), . . . , N^T_π(min{L−1, r_Σ+1}), where π(·) is any permutation function of the integers 1, . . . , L.
Proof: See Appendix C.
Note that the characterization embodied in Proposition 5 is obtained by taking the measurement matrix to belong to a certain restricted subset of R M ×N rather than the entire R M ×N .
The choice of such subset of R M ×N is inspired by our characterization pertaining to the two-class problem embodied in Propositions 3 and 4. Namely, let N i ∈ R N ×(N −r Σ ) be a matrix that contains a basis for the null space N i and let N = [N 1 , . . . , N L ] be a matrix that contains the concatenation of the bases for all the null spaces N 1 , . . . , N L . Then, we take the measurement matrix to consist of M rows of N T rather than M arbitrary vectors from R N .
Note also that the result embodied in Proposition 5 -which is shown to be very sharp both with synthetic data and real data simulations -provides a fundamental insight in the role of measurement design in comparison with random measurement kernels in the discrimination of subspaces. In particular, we can clearly identify two operational regimes that depend on the relationship between two fundamental geometrical parameters describing the source: the number of classes and the dimension of the linear subspaces associated to the different classes.
• When the number of classes in the source is lower than or equal to the dimension of the spaces spanned by signals in each class, the designed measurement matrix is such that we take one measurement from L−1 out of the L null spaces N_i, i = 1, . . . , L. In this sense, the construction that achieves the upper bound implements a one-vs-all approach, where each measurement is able to perfectly detect the presence of signals coming from a specific class against signals from all the remaining classes. Note that in this regime, proper design of the measurement kernel can provide a dramatic performance advantage with respect to random measurements, as it can guarantee that the misclassification probability approaches zero, in the low-noise regime, even when random measurements yield an error floor.

• On the other hand, when the number of classes is larger than the dimension of the spaces spanned by signals in a given class (more precisely, when L > r_Σ + 1), then r_Σ + 1 measurements are sufficient to drive to zero the misclassification probability in the low-noise regime. In this case, the designed measurement kernel achieves the same performance as random measurements in terms of the phase transition of upper bounds to the misclassification probability. However, properly designing the measurement kernel can have an impact on the value of the error floor or the speed of the decay of the misclassification probability with 1/σ².
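The one-vs-all construction of Proposition 5 can be sketched as follows (hypothetical sizes, in the regime L − 1 ≤ r_Σ + 1; each row of Φ is a generic vector of one null space):

```python
import numpy as np

rng = np.random.default_rng(5)
N, r, L = 16, 6, 4   # hypothetical sizes, with L - 1 <= r + 1

Us = [np.linalg.qr(rng.standard_normal((N, r)))[0] for _ in range(L)]
Sigmas = [U @ U.T for U in Us]

def row_from_null_space(U, rng):
    # A generic unit vector in N_i = Im(Sigma_i)^perp.
    v = rng.standard_normal(U.shape[0])
    v -= U @ (U.T @ v)
    return v / np.linalg.norm(v)

# One-vs-all design: one row from each of the first min{L-1, r+1}
# null spaces (Proposition 5).
M = min(L - 1, r + 1)
Phi = np.stack([row_from_null_space(Us[i], rng) for i in range(M)])

# Row i is "silent" on class i and, with probability 1, active on
# every other class.
silent = [float(Phi[i] @ Sigmas[i] @ Phi[i]) for i in range(M)]
active = [float(Phi[i] @ Sigmas[j] @ Phi[i])
          for i in range(M) for j in range(L) if j != i]
```

Each row thus detects the absence of one specific class against all the others, which is exactly the one-vs-all behavior described above.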
Proposition 6: Consider the measurement model in (1) where the assumptions A.1-A.4 in Section II are verified and let d_0 < R/4. Then, an upper bound on the minimum number of measurements for

lim_{σ²→0} − log P̄_e / log(1/σ²) > d_0    (36)

is given by the solution to the integer programming problem

minimize Σ_{i=1}^{L} M_i  subject to  d(i, j) > d_0, ∀ (i, j), i ≠ j, and M_i ∈ {0, 1, . . . , dim(N_i)},    (37)

where M_i denotes the number of rows of Φ picked from N^T_i and d(i, j) denotes the decay exponent associated with the class pair (i, j).

Proof: Note that R/4 is the maximum decay exponent d associated with the upper bound in (16), and note also that, if d_0 < R/4, a sufficient condition for (36) is given by d(i, j) > d_0, for all (i, j), i ≠ j. Then, the proposed upper bound follows from taking the measurement matrix to belong to the same restricted subset considered in Proposition 5. In this case, on denoting by M_i the number of measurements in Φ that are also columns of N_i, we have

d(i, j) = ( 2 min{Σ_k M_k, 2r_Σ, N} − min{Σ_{k≠i} M_k, r_Σ} − min{Σ_{k≠j} M_k, r_Σ} ) / 4,    (38)

which leads to the formulation of the problem (37). The details are provided in Appendix C.
Note that, although a general closed-form solution to the optimization problem in (37) is difficult to provide, our formulation allows us to drastically reduce the number of (integer) optimization variables, which is now equal to the number of classes L. Moreover, the integer programming problem in (37) involves a linear objective function and constraints that are expressed via linear functions combined via the max function, thus allowing the use of efficient numerical methods for its solution.
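As a toy illustration, for small problems the integer program can be solved by exhaustive search over per-class measurement budgets. The closed form for d(i, j) used below is our min-capped reconstruction from generic rank counts (an assumption, used only for illustration):

```python
import itertools

# Exhaustive-search sketch of the integer program (37): choose M_i rows
# from each null-space basis N_i so that every pairwise exponent exceeds
# d_0 while minimizing the total number of measurements.
L_cls, r, N, d0 = 3, 4, 20, 0.6   # hypothetical parameters (r = r_Sigma)

def d_pair(Ms, i, j):
    M_tot = sum(Ms)
    r_ij = min(M_tot, 2 * r, N)    # generic rank of Phi (Sigma_i+Sigma_j) Phi^T
    r_i = min(M_tot - Ms[i], r)    # rows drawn from N_i are silent on class i
    r_j = min(M_tot - Ms[j], r)
    return (2 * r_ij - r_i - r_j) / 4

best = None
for Ms in itertools.product(range(r + 2), repeat=L_cls):
    feasible = all(d_pair(Ms, i, j) > d0
                   for i in range(L_cls) for j in range(L_cls) if i != j)
    if feasible and (best is None or sum(Ms) < sum(best)):
        best = Ms
M_star = sum(best)
```

For these parameters the search returns a total budget of M = 5 measurements; real instances would of course use an integer programming solver rather than enumeration.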
It is also important to emphasize the differences between the result in Proposition 5 and other results in the literature. The result in Proposition 5 is reminiscent of a result associated with multiclass LDA, which involves the extraction of L − 1 linear features from the data using the LDA rule [15]. However, such LDA construction does not provide conditions on the number of measurements needed for reliable classification. Moreover, in contrast with the analysis proposed here, LDA approaches are usually applied to the nonzero-mean classes scenario rather than the zero-mean case considered here. In fact, LDA methods are shown to be ineffective in the case of zero-mean classes, due to the measurement kernel construction approach that is based on the computation of the GSVD of inter-class and intra-class scatter matrices, where the former is a function of the class means.
A modified version of LDA that can also cope with zero-mean classes has been presented in [39]. This method is based on recasting a multiclass classification problem as a binary pattern classification problem. However, in this case the measurement kernel Φ is not determined on the basis of the statistical description of the classes; rather, it is derived via a non-parametric approach that involves the computation of scatter matrices from labeled training samples. In particular, on denoting by Σ_b the between-class scatter matrix and by Σ_w the within-class scatter matrix, measurements are designed to maximize an objective function based on these matrices, leading to measurement designs that are associated with the generalized eigenvectors corresponding to the largest generalized eigenvalues of (Σ_b, Σ_w). In addition, in this case, conditions on the number of measurements needed for reliable classification are not available in general.

VI. NUMERICAL RESULTS
We now show how our theory aligns with practice, both for synthetic data and for real data associated with a video segmentation application and a face recognition application. We also show how our upper bound on the minimum number of measurements required for the phase transition compares to those associated with state-of-the-art measurement designs, such as the information discriminant analysis (IDA) method [21] and methods based on the maximization of Shannon mutual information and quadratic Rényi entropy [14].

A. Synthetic data
We first consider experiments with synthetic data by concentrating on two examples that reflect the two regimes embodied in Proposition 5. In the first example, the data is generated by a mixture of L = 11 Gaussian distributions with dimension N = 64, with probability p_i = 1/11 for i = 1, . . . , 11. The input covariance matrices all have rank r_Σ = 14, and their images are drawn uniformly at random from the Grassmann manifold of 14-dimensional spaces in R^64. Figure 1 reports the upper bound on the misclassification probability and the true misclassification probability versus 1/σ^2, both for random kernel designs and for measurement designs that obey the construction embodied in Proposition 5 (see footnote 10). The measurement kernels are also normalized such that tr(Φ^T Φ) ≤ M.
Note that theoretical results are aligned with experimental results in the sense that both theory and practice suggest that the low-noise phase transition occurs with M ≥ L−1 = 10 for designed kernels and M ≥ r Σ + 1 = 15 for random kernels. This is observed from Fig. 1, suggesting that our analysis is sharp.
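The synthetic-experiment pipeline just described (random low-rank Gaussian classes drawn from the Grassmann manifold, random measurement kernels, MAP classification) can be sketched in a few lines. The dimensions below are scaled down (N = 16, r_Σ = 3, L = 4 instead of the values in the text), and equiprobable classes are assumed so that the MAP rule reduces to maximum likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
N, r, L, sigma2, trials = 16, 3, 4, 1e-6, 100

# Low-rank, zero-mean Gaussian classes with subspaces drawn at random.
U = [np.linalg.qr(rng.standard_normal((N, r)))[0] for _ in range(L)]
Sigma = [Ui @ Ui.T for Ui in U]

def error_rate(M):
    """Empirical misclassification rate of the (equiprobable-class) MAP
    classifier with an M x N random Gaussian measurement kernel."""
    Phi = rng.standard_normal((M, N)) / np.sqrt(N)
    covs = [Phi @ S @ Phi.T + sigma2 * np.eye(M) for S in Sigma]
    logdets = [np.linalg.slogdet(C)[1] for C in covs]
    invs = [np.linalg.inv(C) for C in covs]
    errs = 0
    for i in range(L):
        for _ in range(trials):
            x = U[i] @ rng.standard_normal(r)          # signal from class i
            y = Phi @ x + np.sqrt(sigma2) * rng.standard_normal(M)
            # Gaussian log-likelihood of y under each class model.
            ll = [-0.5 * (logdets[j] + y @ invs[j] @ y) for j in range(L)]
            errs += int(np.argmax(ll) != i)
    return errs / (L * trials)

e_good = error_rate(r + 1)   # M >= r_Sigma + 1: past the phase transition
e_bad = error_rate(1)        # too few measurements: error floor
print(e_good, e_bad)
```

With M ≥ r_Σ + 1 random measurements the empirical error is essentially zero at this noise level, whereas a single measurement exhibits a pronounced error floor, in line with the phase-transition behavior discussed above.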
In the second example, the data is drawn from a mixture of L = 12 Gaussian distributions with dimension N = 64, with probability p_i = 1/12 for i = 1, . . . , 12. The input covariance matrices all have rank r_Σ = 9, and their images are drawn uniformly at random from the Grassmann manifold of 9-dimensional spaces in R^64. Figure 2 showcases the upper bound on the misclassification probability and the true misclassification probability versus 1/σ^2, both for random kernel designs and for measurement designs that obey the construction embodied in Proposition 5. It is evident, as predicted by Proposition 5, that both random and designed kernels achieve a low-noise phase transition in the upper bound to the misclassification probability with M ≥ r_Σ + 1 = 10. However, designed kernels offer a lower misclassification probability than random kernels at finite noise levels. It is also evident, by comparing the true misclassification probability values and the upper bounds in Fig. 2, that our analysis is sharp.
Our upper bound on the minimum number of measurements required for the phase transition of the misclassification probability relied on a specific construction. It is therefore relevant to examine how such a bound compares to the number of measurements required for the phase transition associated with state-of-the-art kernel designs. To that end, we consider three state-of-the-art measurement kernel designs applied to the two previous examples: the IDA method in [21] and methods based on the maximization of Shannon mutual information (MI) and Rényi quadratic entropy [14], respectively.
10 Note that the construction embodied in Proposition 5 is shown to achieve the low-noise phase transition with a number of measurements equal to (35).
Table I reports the minimum number of measurements needed by such methods to drive the numerically simulated misclassification probability to zero, as well as the theoretical predictions derived in the previous sections for both random and designed kernels. It is interesting to see that the bound embodied in Proposition 5 predicts very well the behavior of state-of-the-art kernel design methods. This means that our bound can be used to gauge a suitable number of measurements for state-of-the-art kernel design approaches.

B. Real data: Motion segmentation
We now consider experiments with real data by concentrating on a motion segmentation application, where the goal is to segment a video into multiple rigidly moving objects. Such an application involves the extraction of feature points from the video, whose positions are tracked over different frames. Motion segmentation then aims at partitioning pixels extracted from different frames into spatiotemporal regions. In particular, feature points are clustered into different groups, each corresponding to a given motion [47]. The data to be processed by the clustering algorithm is obtained by stacking the coordinate values associated with a given feature point across different frames. For a detailed description of how clustering data are obtained from feature point coordinates, please refer to [47].
[Table II (fragment): eigenvalue magnitudes of the input covariance matrices, e.g., 3.1756, 0.7410, 0.0267, 0.0022, 0.000 and, for Σ_3, 11.2797, 5.9315, 0.0672, 0.0004, 0.000.]
We use the Hopkins 155 motion segmentation dataset [49], which consists of video sequences with two or three motions in each video. Each video with two motions consists of 30 frames, whereas each video with three motions consists of 29 frames. In particular, the results reported in this section are obtained by considering the video with three motions having the largest number of samples per motion/class 11 , namely, 142 samples for class 1, 114 samples for class 2 and 236 samples for class 3.
We consider in particular a supervised learning approach, in which 50% or 30% of the vectors corresponding to feature points are manually labeled, whereas the remaining points are classified automatically, starting from the observation of noisy measurements, where the noise variance is set to σ^2 = −60 dB. The manually labeled points represent labeled training samples from which the input signal parameters p_i, Σ_i, i = 1, . . . , L are inferred using maximum likelihood (ML) estimators.
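For zero-mean classes, the ML inference step reduces to empirical class frequencies and per-class sample covariances. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def fit_class_models(X, labels, L):
    """ML estimates of the priors p_i and zero-mean covariances Sigma_i
    from labeled training samples (rows of X)."""
    n = len(labels)
    p, Sigma = [], []
    for i in range(L):
        Xi = X[labels == i]
        p.append(len(Xi) / n)
        Sigma.append(Xi.T @ Xi / len(Xi))   # zero-mean ML covariance
    return np.array(p), Sigma

# Tiny synthetic check with two classes in R^3.
rng = np.random.default_rng(2)
X = np.vstack([rng.standard_normal((60, 3)), 2.0 * rng.standard_normal((40, 3))])
labels = np.array([0] * 60 + [1] * 40)
p, Sigma = fit_class_models(X, labels, 2)
print(p)
```

The estimated models (p_i, Σ_i) are then handed to the MAP classifier operating on the noisy compressive measurements.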
As described in [47], [50], [51], feature point trajectories belonging to a given motion can be shown to lie approximately on three-dimensional affine spaces or four-dimensional linear spaces. In fact, the covariance matrices obtained from the training samples present only two dominant principal components, as demonstrated by the magnitudes of the eigenvalues of the input covariance matrices reported in Table II. Then, based on the results presented in Propositions 1 and 5, we can expect that at least 3 random measurements and 2 designed measurements, respectively, are needed for reliable classification. Figures 3 (a) and (b) report the misclassification probability versus the number of measurements for random kernels, kernels designed via the construction embodied in Proposition 5, and the designs in [14], [21]. In particular, in view of the fact that the analysis is conducted for the scenario where the MAP classifier is provided with the true model parameters, our results consider both a scenario where a significant number of training samples (50%) is used to learn the underlying models and a scenario where a lower number of training samples (30%) is used to derive the models, in order to assess the robustness of the theoretical insights against model mismatch. Note that now the misclassification probability does not exhibit a perfect phase transition, since the data covariance matrices are no longer exactly low-rank but only approximately low-rank, and because of the mismatch between the model inferred from training data and the actual test data. However, one can still conclude that our theoretical results align with practical ones, since they can unveil the number of measurements required for the misclassification probability to fall below a given low value.
11 Denoted as "1RT2RCR" in the dataset.
In particular, Table III reports the minimum number of measurements required by the random and designed kernels to achieve a misclassification probability below 15%, 10% and 5%, for the cases when 50% and 30% of the vectors in the dataset are used as training samples. It can be observed that our characterization of the upper bound on the number of measurements required for the phase transition matches well the number of measurements required to achieve a low misclassification probability with IDA and with methods based on the maximization of Shannon mutual information and Rényi quadratic entropy, in both scenarios.

C. Real data: Face recognition
We now consider a different real-world compressive classification application. In particular, we consider a face recognition problem where the orientation of the faces of different individuals relative to the camera remains fixed, but the illumination conditions vary. On assuming that faces are approximately convex and that they reflect light according to Lambert's law, it is possible to show that the set of images of the same individual under different illuminations lies approximately on a 9-dimensional linear subspace [52]. Therefore, face recognition from linear measurements extracted from such images can be performed via subspace classification.
In this section, we show classification results using cropped images from the Extended Yale Face Database B [53]. In particular, we consider 16 × 16 images of L = 5 different individuals from the 38 available in the dataset. For each individual, 63 images corresponding to 63 different illumination conditions are considered. As for the video motion segmentation application described in Section VI-B, classification is performed via the MAP classifier (2), where we assume a Gaussian distribution for each class and the parameters p_i, Σ_i are obtained via ML estimators using 50% or 30% of the available images as training samples. Moreover, we set the noise variance to σ^2 = −60 dB.
In contrast with the case of the Hopkins 155 dataset, samples in the Extended Yale Face Database B are described by an approximately low-rank model characterized by a slower decay of the eigenvalues of the corresponding covariance matrices, as reported in Fig. 4. In this sense, experimental results for this dataset represent a way to test the predictions provided by our analysis also for a scenario that departs further from the assumption of signals lying on a union of low-dimensional subspaces. Figures 5 (a) and (b) report the misclassification probability versus the number of measurements for random kernels, kernels designed via the construction embodied in Proposition 5, and the designs in [14], [21]. Also in this case, motivated by the fact that the analysis is conducted for the scenario where the MAP classifier is provided with the true model parameters, our results consider both a scenario where a significant number of training samples (50%) is used to learn the underlying models and a scenario where a lower number of training samples (30%) is used to derive the models, in order to assess the robustness of the theoretical insights against model mismatch. We note that in this case, due to the slow eigenvalue decay reported in Fig. 4, the measurement design described in Section V-B does not provide state-of-the-art classification results, as classification based on measurements extracted via the methods in [14], [21] guarantees lower misclassification probabilities.
On the other hand, it is possible to observe that the theoretical results in Proposition 1 and Proposition 5 indeed capture the actual behavior of classification with state-of-the-art measurement designs. In fact, the upper bounds (25) and (35), applied to the face recognition scenario under examination, predict that M = r_Σ + 1 = 10 random measurements or M = L − 1 = 4 designed measurements are required for reliable classification. Then, based on numerical simulations of classification with non-compressive measurements, we set the baseline misclassification probability for reliable classification at 25%. We observe that the predictions offered by Proposition 1 and Proposition 5 are in line with the trends shown in Table IV, which reports the minimum number of measurements required by random and designed kernels to achieve a misclassification probability below 25%, for both cases when 50% and 30% of the vectors in the dataset are used as training samples.

VII. DISCUSSION: IMPACT OF MODEL MISMATCH
It is also instructive to discuss the impact of model mismatch on the classification performance of the MAP classifier (2) in practical application scenarios. In fact, the analysis carried out in the previous sections assumed that the MAP classifier is given the true model parameters. On the other hand, in practical applications the conditional pdfs p(y|C = i) and the prior probabilities p_i are usually learnt from training data, thus introducing mismatch between the model adopted by the classifier and the actual statistical description of the test data.
A proper derivation of the number of measurements required for reliable classification in practical application scenarios would therefore require a more in-depth analysis that takes into account the model mismatch induced by the learning process. In particular, it would require: i) expressions that characterize the behaviour of the misclassification probability as a function of the true underlying model and the learnt model; ii) a further analysis that determines how compressive random or designed measurements influence the phase transition associated with the misclassification probability.
A preliminary analysis of the impact of model mismatch in classification problems has been conducted in [54], [55]. These works consider the classification of signals drawn from Gaussian distributions with mismatched classifiers. In particular, they provide sufficient conditions on the relationship between the true model parameters and the learnt model parameters that guarantee reliable classification in the low-noise regime. However, the results in [54], [55] are derived for a non-compressive classification scenario, therefore they cannot explain how compressive random or designed measurements influence the misclassification probability.
A generalization of our analysis of the minimum number of measurements sufficient for reliable classification to capture the impact of model mismatch does not seem immediate. However, our simulation results associated with the real-data subspace classification problems in Sections VI-B and VI-C suggest that our theory can still provide meaningful insights, both in the situation where we use a significant number of training samples (as expected, because we can learn an accurate data model) and in the situation where we use a lower number of training samples. This is despite the fact that the learning process produces distributions that do not correspond exactly to the true ones, and that the modelling process assumes Gaussian distributions that do not necessarily match the true distributions pertaining to the motion segmentation or face recognition examples.
We conjecture that the reasons for this phenomenon are related to the fact that reliable classification is achieved when compressive measurements are able to discriminate among the linear subspaces spanned by signals in the different classes, irrespective of the particular shape of the marginal distributions supported on such subspaces.
In this sense, motion segmentation is more immune to model mismatch than face recognition because, as implied by the quick decay of the eigenvalues of the covariance matrices, the majority of the energy of the samples in the motion segmentation dataset is concentrated in linear subspaces of dimension 2 or 3. Then, even a reduced number of training samples is sufficient to identify the dominant principal components of each class. On the other hand, in face recognition the energy of samples drawn from a given class is only approximately concentrated on a low-dimensional subspace. In this case, training sets of larger cardinality can guarantee a refined estimation of the principal components associated with each class.

VIII. CONCLUSIONS
In this paper we have offered a characterization of the number of measurements required to reliably classify linear subspaces modeled via low-rank, zero-mean Gaussian distributions. In particular, we have provided upper bounds to the number of measurements required to drive the misclassification probability to zero both for random measurements as well as designed measurements for two-class classification problems and more challenging multi-class problems. Our characterization suggests that the minimum number of measurements required for phase transition may be achieved by either a one-vs-all approach, or by randomly spreading measurements over the Grassmann manifold, depending on the relationship between the number of classes and the dimension of the spaces spanned by signals in each class.
One of the hallmarks of our characterization relates to its ability to predict the minimum number of measurements required to achieve a low misclassification probability with state-of-the-art measurement design methods. Therefore, it offers engineers a concrete tool to gauge the number of measurements needed for reliable classification, thereby bypassing the need for time-consuming simulations.
Then, we recall the expression of the upper bound P̄_e on the misclassification probability in (16), and we re-express K_ij accordingly. On letting σ^2 → 0, we note that the term in square brackets converges to a positive constant; moreover, the decay of P̄_e as a function of σ^2 is dominated by the terms in the sum corresponding to the minimum value of the exponent d(i, j) = (2r_ij − r_i − r_j)/4, thus leading to the result in (18)-(20).

APPENDIX B PROOF OF PROPOSITION 4
The derivation of the upper bound on the number of measurements needed to verify (31) is based on the analysis of the upper bound P̄_e in (16).
Recall that the low-noise expansion exponent d of the upper bound to the misclassification probability for the classification problem of two zero-mean classes is given by d = (2r_12 − r_1 − r_2)/4. We first show that, for all possible choices of Φ, it holds d ≤ R/4, so that there is no M satisfying (46) when d_0 ≥ R/4. Then, we consider the case d_0 < R/4 and derive the minimum number of measurements M needed to verify (46), which represents an upper bound on the minimum number of measurements needed to verify (31).
A. Case where d_0 ≥ R/4
Note that r_1, and likewise r_2 and r_12, can be expressed in terms of the columns of Φ and of bases of the images of the covariance matrices; the ranks of the input covariance matrices can be expressed analogously. Let us now define k_1, k_2 and k_c as the cardinalities of the corresponding sets of basis vectors. Then, it becomes evident that r_{Σ12} − r_Σ = k_1 + k_2 − k_c − k_1 = k_2 − k_c = k_1 − k_c and, in view of the possible dependence between the columns of Φ̃, r_12 − r_1 ≤ k_2 − k_c and r_12 − r_2 ≤ k_1 − k_c, thus concluding the proof of (47).
B. Case where d_0 < R/4
We start by describing an explicit measurement matrix construction that achieves an expansion exponent of the upper bound to the misclassification probability strictly greater than d_0 with M = 4d_0 + 1 measurements. After that, we prove that M ≤ 4d_0 implies d ≤ d_0 for all possible choices of Φ.
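The converse claim can be spot-checked numerically: under the rank constraints used in the proof (r_1 + r_2 ≥ r_12 and r_1, r_2, r_12 ≤ M), the largest achievable exponent (2r_12 − r_1 − r_2)/4 equals M/4, so M ≤ 4d_0 forces d ≤ d_0. A brute-force sketch:

```python
import itertools

def best_exponent(M):
    """Maximize d = (2*r12 - r1 - r2) / 4 over nonnegative integer ranks
    subject to r1 + r2 >= r12 and r1, r2, r12 <= M (the constraints of the
    relaxed integer program in the proof)."""
    best = -float("inf")
    for r1, r2, r12 in itertools.product(range(M + 1), repeat=3):
        if r1 + r2 >= r12:
            best = max(best, (2 * r12 - r1 - r2) / 4)
    return best

for M in range(1, 9):
    assert best_exponent(M) == M / 4
print("maximum exponent equals M/4 for M = 1..8")
```

The maximum is attained with r_12 = M and r_1 + r_2 = M, which matches the (non-unique) optimizers mentioned in the proof.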
Assume by contradiction that the vectors q_i are linearly dependent. Then, there exists a set of n_Σ scalars α_i (with α_i ≠ 0 for at least one index i) such that Σ_i α_i q_i = 0. It is known that Σ_i α_i w_i ≠ 0, because the w_i are linearly independent by construction. Therefore, the linear dependence among the vectors q_i implies that Σ_i α_i w_i ∈ N_1, which is false since, by construction, Σ_i α_i w_i ∈ N_2 and Σ_i α_i w_i ∉ N_12. Therefore, we can establish that r_1 = rank(Φ_0 Σ_1 Φ_0^T) = rank(Q) = n_Σ, and a similar argument applies to r_2. Finally, we generate Φ by picking arbitrarily only M = 4d_0 + 1 among the R row vectors of the matrix Φ_0 in (64). In particular, we take M_1 rows from the set [v_1, . . . , v_{n_Σ}] and M_2 rows from the set [w_1, . . . , w_{n_Σ}], where M_1 + M_2 = 4d_0 + 1, which is always possible as 4d_0 + 1 ≤ R. Then, by following steps similar to the previous ones, it is possible to show that the resulting exponent satisfies d > d_0.
2) Converse: Assume now M ≤ 4d_0. In this case, we can show that, for all possible choices of Φ, it holds d ≤ M/4 ≤ d_0. This upper bound follows from the solution to the following integer-valued optimization problem 12 : maximize d = (2r_12 − r_1 − r_2)/4 subject to: r_1 + r_2 ≥ r_12, r_1 ≤ M, r_2 ≤ M, r_12 ≤ M and r_1, r_2, r_12 ∈ Z_0^+. The solution, which can be obtained by considering a linear programming relaxation along with a Branch and Bound approach [56], is given by d = M/4. 13
12 Note that this problem represents a relaxation of the problem which aims at maximizing d, as it incorporates only some of the constraints dictated by the geometrical description of the scenario. For example, it does not take into account the actual value of some parameters of the input description, such as r_Σ and r_{Σ12}.
13 The solution of the optimization problem is not unique. Nevertheless, the maximum value achieved by the objective function is indeed unique.
APPENDIX C
PROOF OF PROPOSITION 5
Let N_i ∈ R^{N×(N−r_Σ)} be a matrix that contains a basis for the null space N_i and let N = [N_1, . . . , N_L] be a matrix that contains the concatenation of the bases for all the null
spaces N_1, . . . , N_L. Then, consider the measurement matrices Φ ∈ R^{M×N} that consist of M rows of N^T. More precisely, such matrices Φ are obtained by picking M_i rows from N_i^T, so that Σ_{i=1}^{L} M_i = M (see footnote 14). A sufficient condition for (34) is represented by d > 0, where d is the decay exponent associated with the misclassification probability upper bound (16). Moreover, d > 0 if and only if d(i, j) > 0 for all pairs (i, j) with i ≠ j.
We can now express the conditions d(i, j) > 0 in terms of the values M_i as follows. On recalling Sylvester's rank theorem [57], which states

rank(AB) = rank(B) − dim(Im(B) ∩ Null(A)), (71)

we can write each term d(i, j) as in (73). We first show (74). Notice that, since the images R_i are independently drawn from a continuous pdf over the Grassmann manifold, any min{N, L(N − r_Σ)} columns of N are linearly independent with probability 1. Then, by leveraging the expression of the dimension of the intersection of two linear spaces, we can write (75), where Φ̃^T is obtained from Φ^T by deleting the M_i columns corresponding to vectors taken from the basis of the null space N_i. Then, given that the columns of Φ̃^T are picked from spaces drawn at random from the Grassmann manifold, we can conclude (77), and on replacing (77) into (75) we immediately obtain (74). Consider now the last term in (73) and recall that, since the linear spaces N_i are drawn independently at random from a continuous pdf,

dim(N_ij) = dim(N_i ∩ N_j) = max{N − 2r_Σ, 0}, (78)

thus implying immediately that dim(Im(Φ^T) ∩ N_ij) = 0 if N ≤ 2r_Σ. Therefore, we assume N > 2r_Σ and we show that

dim(Im(Φ^T) ∩ N_ij) = max{M − 2r_Σ, M_i − r_Σ, M_j − r_Σ, 0}. (79)

In order to do that, we first note that we can leverage the expression of the dimension of the intersection of two linear spaces, where the columns of N_ij form a basis of the linear space N_ij. Let us also write Φ as in (81), where the M_i columns of Φ_i^T are vectors picked from a basis of N_i, the M_j columns of Φ_j^T are vectors picked from a basis of N_j and the M − M_i − M_j columns of Φ̃^T are vectors picked from the bases of the remaining null spaces. Then, on leveraging again the assumption that the linear spaces associated with the different classes are picked independently at random from a continuous distribution, we can write (82). On the other hand, on introducing the notation r_{Φij N_ij} = rank[Φ_i^T Φ_j^T N_ij], we obtain (83). In fact, dim(Im[Φ_i^T N_ij]) = min{N − r_Σ, M_i + N − 2r_Σ} derives from the fact that the columns of Φ_i^T and N_ij are all picked at random from the space N_i, which has dimension N − r_Σ. Moreover, we have used (84), which follows from the same argument. Then, on using the symbol r_{ΦN_ij} = rank[Φ^T N_ij], we obtain (85). Finally, it is possible to show that (85) is equivalent to (79) by considering separately the cases M_i ≤ r_Σ and M_i > r_Σ (and similarly for M_j). Therefore, by using (73), the conditions d(i, j) > 0 can be expressed in terms of the values M_i as in (89), and the minimum number of measurements follows from the minimization of M = Σ_{i=1}^{L} M_i subject to (89). In the remainder of this appendix, we will show that the solution of such minimization problem is given by M = min{L − 1, r_Σ + 1}, by considering separately two cases. In particular, when L − 1 ≤ r_Σ, we can show that the optimal solution is given by M = Σ_{i=1}^{L} M_i = L − 1. We first observe that such a value represents a feasible solution: in fact, by picking only 1 measurement from L − 1 out of L null spaces, e.g., by choosing M_1 = · · · = M_{L−1} = 1 and M_L = 0, we can immediately verify that all the constraints are satisfied.
14 Throughout the proof, we assume M ≤ N, since the decay exponent d associated with any matrix Φ is always smaller than or equal to the decay exponent associated with the identity matrix I_N, as was shown in Appendix B-A.
Then, we also observe that any solution with M < L − 1 is not feasible: in fact, if M < L − 1, there exist at least two indexes k and ℓ such that M_k = M_ℓ = 0, and therefore at least one of the constraints (89) is not verified.
Consider now the case L − 1 > r_Σ. In this case the optimal solution of the minimization problem yields M = r_Σ + 1. In a similar way to the previous case, we start by observing that M = r_Σ + 1 is a feasible solution, which can be achieved by picking 1 measurement from each of r_Σ + 1 different null spaces, e.g., by setting M_1 = · · · = M_{r_Σ+1} = 1 and M_{r_Σ+2} = · · · = M_L = 0. Also in this case it is straightforward to verify that all the constraints are satisfied. Moreover, there is no feasible solution with M < r_Σ + 1, since r_Σ < L − 1 implies that there exist at least two indexes k and ℓ such that M_k = M_ℓ = 0, and therefore at least one of the constraints (89) is not verified.
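The allocation result M = min{L − 1, r_Σ + 1} can be checked empirically. The sketch below draws random class subspaces, builds Φ with one row from each of min{L − 1, r_Σ + 1} null spaces, and verifies that d(i, j) > 0 for all pairs. Here r_i and r_ij are taken, as an assumption consistent with the two-class exponent, to be rank(ΦΣ_iΦ^T) and rank(Φ(Σ_i + Σ_j)Φ^T):

```python
import numpy as np

rng = np.random.default_rng(3)

def min_pairwise_exponent(N, r, L):
    """Empirical check: M = min(L-1, r+1) null-space measurements give
    d(i,j) = (2*r_ij - r_i - r_j)/4 > 0 for all pairs, where r_i and r_ij
    are taken (as assumed here) as rank(Phi Sigma_i Phi^T) and
    rank(Phi (Sigma_i + Sigma_j) Phi^T)."""
    U, nulls = [], []
    for _ in range(L):
        full = np.linalg.svd(rng.standard_normal((N, r)), full_matrices=True)[0]
        U.append(full[:, :r])        # class subspace basis
        nulls.append(full[:, r:])    # null space basis
    Sigma = [Ui @ Ui.T for Ui in U]
    M = min(L - 1, r + 1)
    Phi = np.stack([nulls[i][:, 0] for i in range(M)])  # one row per null space
    rank = np.linalg.matrix_rank
    return min(
        (2 * rank(Phi @ (Sigma[i] + Sigma[j]) @ Phi.T)
         - rank(Phi @ Sigma[i] @ Phi.T) - rank(Phi @ Sigma[j] @ Phi.T)) / 4
        for i in range(L) for j in range(i + 1, L))

d1 = min_pairwise_exponent(N=16, r=3, L=6)   # L - 1 > r_Sigma: M = r + 1 = 4
d2 = min_pairwise_exponent(N=16, r=8, L=4)   # L - 1 <= r_Sigma: M = L - 1 = 3
print(d1, d2)
```

In both regimes the minimum pairwise exponent is strictly positive, as required for the misclassification probability to decay in the low-noise regime.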