An introduction to Compressive Sensing.

Short, fat matrices are useful, actually.

Introduction \label{sec:csinto}

This post discusses Compressive Sensing \gls{cs}: an alternative signal acquisition method to Nyquist sampling, which is capable of accurate sensing at rates well below those predicted by the Nyquist theorem. This strategy hinges on using the structure of a signal, and the fact that many signals we are interested in can be compressed successfully. Thus, Compressive Sensing acquires the most informative parts of a signal directly.

The first section surveys the mathematical foundation of CS. It covers the Restricted Isometry Property and Stable Embeddings - necessary and sufficient conditions under which a signal can be successfully acquired. Informally, these conditions require that the sensing operator preserves pairwise distances between points when projecting from a high dimensional space to a (relatively) low dimensional space. We then discuss which operators satisfy this condition, and why. In particular, random ensembles such as the Bernoulli and Gaussian ensembles satisfy this property - we discuss how this can be applied to the problem of wideband spectrum sensing. We also survey a small amount of the theory of Wishart matrices.

The section concludes with an overview of reconstruction algorithms for CS - methods for unpacking the original signal from its compressed representation. We give insight into the minimum number of samples for reconstruction. We survey convex, greedy, and Bayesian approaches; as well as algorithms which are blends of all three. The choice of algorithm is affected by the amount of undersampling required for system performance, the complexity of the algorithm itself, and the desired reconstruction accuracy. In general: the more complex the algorithm, the better the reconstruction accuracy. The allowable undersampling depends upon the signal itself, and the prior knowledge available to the algorithm. Greedy algorithms are the simplest class, making locally optimal updates within a predefined dictionary of signal atoms. Greedy algorithms have relatively poor performance, yet are the fastest algorithms. Convex algorithms are the next most complex, based upon minimising a global functional of the signal. This class is based upon generalised gradient descent, and has no fixed number of steps. Finally, Bayesian algorithms are the slowest and most complex, but offer the best performance - both in terms of undersampling (as these algorithms incorporate prior knowledge in an elegant way), and in terms of reconstruction accuracy.

We survey some distributed approaches to compressive sensing, in particular some models of joint sparsity, and joint sparsity with innovations.

Finally, we survey some of the approaches to wideband spectrum sensing based upon compressive sensing. In particular we survey the Random Demodulator and the Modulated Wideband Converter. Both of these systems make use of chipping sequences (also used in spread spectrum communication systems). These sequences provide the basis for CS - several different sequences, each convolved with the signal, are sufficient to accurately sense the signal.

Preliminaries \label{sec:prelims}

Compressive sensing is a modern signal acquisition technique in which randomness is used as an effective sampling strategy. This is in contrast to traditional, or Nyquist, sampling, which requires that a signal is sampled at regular intervals. The motivation for this new method comes from two disparate sources: data compression and sampling hardware design.

The work of Shannon, Nyquist, and Whittaker \cite{unser2000sampling} has been an extraordinary success - digital signal processing enables the creation of sensing systems which are cheaper, more flexible, and which offer superior performance to their analogue counterparts. For example, radio dongles supporting the RTLSDR standard, capable of processing millions of samples per second, can now be bought for as little as £12. However, the sampling rates underpinning these advances have a doubling time of roughly 6 years, due to physical limitations in Analogue to Digital conversion hardware. Specifically, these devices will always be limited in bandwidth and dynamic range (number of bits), whilst applications are creating a deluge of data to be processed downstream.

Data compression means that in practice many signals encountered ‘in the wild’ can be fully specified by far fewer bits than required by the Nyquist sampling theorem. This is either a natural property of the signals, for example, images have large areas of similar pixels, or a conscious design choice, as with training sequences in communication transmissions. Such signals are not statistically white, and so they may be compressed (to save on storage). For example, lossy image compression algorithms can reduce the size of a stored image to about 1% of the size required by Nyquist sampling. In fact, the JPEG2000 standard uses wavelets to exploit the inter-pixel redundancy of images.

Whilst this vein of research has been extraordinarily successful, it poses the question: if the reconstruction algorithm is able to reconstruct the signal from this compressed representation, why collect all the data in the first place when most of the information can be thrown away?

Compressed Sensing answers this question by providing an alternative signal acquisition method to the Nyquist theorem. Specifically, situations are considered where fewer samples are collected than traditional sensing schemes require. That is, in contrast to Nyquist sampling, Compressive Sensing is a method of measuring the informative parts of a signal directly, without acquiring unessential information at the same time.

These ideas have not come out of the ether recently, however. Prony, in 1795 \cite{prony1795essai}, proposed a method for estimating the parameters of a number of exponentials corrupted by noise. This work was extended by Caratheodory in 1907 \cite{Caratheodory1907}, who proposed a method for estimating a linear combination of $k$ sinusoids from its value at time $0$ and at any other $2k$ points. In the 1970s, geophysicists proposed minimising the $\ell_1$-norm to reconstruct the structure of the earth from seismic measurements. Claerbout and Muir proposed in 1973 \cite{claerbout1973robust} using the $\ell_1$-norm as an alternative to least squares, whilst Taylor, Banks, and McCoy showed in \cite{taylor1979deconvolution} how to use the $\ell_1$-norm to deconvolve spike trains (used for reconstructing layers in the earth). Santosa and Symes in \cite{Santosa1986} introduced the constrained $\ell_1$-program to perform the inversion of band-limited reflection seismograms. The innovation of \gls{cs} is to tell us under which circumstances these problems are tractable.

The key insight in CS is that signals which are sparse or compressible - signals which are non-zero at only a fraction of the indices over which they are supported, or signals which can be described by far fewer bits than the representation they are traditionally captured in - may be measured in a non-adaptive way through a measurement system which is incoherent with the domain in which the signal is sparse.

Examples of sparse signals are:

  1. A sine wave at frequency $\omega$ is represented by a single spike in the frequency domain, yet has infinite support in the time domain.
  2. An image will have values for every pixel, yet the wavelet decomposition of the image will typically only have a few non-zero coefficients.

Informally, CS posits that for $s$-sparse signals $\alpha \in \mathbb{R}^{n}$ (signals with $s$ non-zero amplitudes at unknown locations), $\mathcal{O}(s \log{n})$ measurements are sufficient to exactly reconstruct the signal.

In practice, this can be far fewer samples than conventional sampling schemes. For example, a megapixel image requires 1,000,000 Nyquist samples but can be perfectly recovered from 96,000 compressive samples in the wavelet domain \cite{candes2008introduction}.

The measurements are acquired linearly, by forming inner products between the signal and some arbitrary sensing vector:

$$y_i = \langle \alpha, \psi_i \rangle \label{inner-product-repr}$$

or

$$y = \Psi \alpha \label{vector-repr}$$

where $y_i$ is the $i^{th}$ measurement, $\alpha \in \mathbb{R}^n$ is the signal, and $\psi_i$ is the $i^{th}$ sensing vector. We pass from \eqref{inner-product-repr} to \eqref{vector-repr} by concatenating all the $y_i$ into a single vector; thus the matrix $\Psi$ has the vectors $\psi_i$ as its rows.

If $\alpha$ is not $s$-sparse in the basis it is acquired in, then we can often find another basis $\Phi$ in which its coefficient vector $x$ is sparse:

$$\alpha = \Phi x \label{vector-repr-2}$$

Note that the measurements may be corrupted by noise, in which case our model is:

$$y = Ax + e \label{CSequation}$$

where $e \in \mathbb{R}^m$, with each component sampled from a $\mathcal{N}(0, 1/n)$ distribution, and $A = \Psi \Phi \in \mathbb{R}^{m \times n}$ is the product of the sensing matrix $\Psi$ and a basis $\Phi$ in which the signal is sparse, so that $x \in \mathbb{R}^n$ is the sparse coefficient vector.
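To make the model concrete, the following minimal NumPy sketch simulates \eqref{CSequation}; it assumes, purely for illustration, that the sparsity basis is the identity (so $x = \alpha$) and that the sensing matrix has i.i.d. $\mathcal{N}(0, 1/m)$ entries as in section \ref{sec:mtx-contruction}. The dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s = 300, 100, 5              # ambient dimension, number of measurements, sparsity

# s-sparse coefficient vector x (sparse in the identity basis, i.e. Phi = I)
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = rng.normal(size=s)

# Sensing matrix with i.i.d. N(0, 1/m) entries, and additive noise e ~ N(0, 1/n)
A = rng.normal(0, np.sqrt(1 / m), size=(m, n))
e = rng.normal(0, np.sqrt(1 / n), size=m)

y = A @ x + e                      # the m compressive measurements of the n-dimensional signal
print(A.shape, y.shape)            # (100, 300) (100,)
```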

We require that the sensing vectors satisfy two technical conditions: an Isotropy property, which means that the components of the sensing vectors have unit variance and are uncorrelated, and an Incoherence property, which means that the sensing vectors have small inner products with the elements of the basis in which the signal is sparse. Once the set of measurements has been taken, the signal may be reconstructed via a simple linear program. We describe these conditions in detail in the next section.

RIP and Stable Embeddings

We begin with a formal definition of sparsity:

\begin{definition}[Sparsity]

A high-dimensional signal is said to be $s$-sparse if at most $s$ coefficients $x_i$ in the linear expansion

$$\alpha = \sum_{i=1}^{n} \phi_i x_i \label{sparse-basis-expansion}$$

are non-zero, where $x \in \mathbb{R}^n$, $\alpha \in \mathbb{R}^n$, and the $\phi_i$ are a set of basis functions of $\mathbb{R}^n$.

We can write \eqref{sparse-basis-expansion} as:

$$\alpha = \Phi x \label{def:alpha}$$

We can make the notion of sparsity precise by defining $\Sigma_s$ as the set of $s$-sparse signals in $\mathbb{R}^n$:

$$\Sigma_s = \{ x \in \mathbb{R}^n : |\mathrm{supp}(x)| \leq s \}$$

where $\mathrm{supp}(x)$ is the set of indices on which $x$ is non-zero.

\end{definition}

\begin{figure}[h] \centering \includegraphics[height=7cm, width=\textwidth]{compressive_sensing_example.jpg} \caption{A visualisation of the Compressive Sensing problem as an under-determined system. Image from \cite{cstutwatkin}} \label{l1l2} \end{figure}

We may not be able to directly obtain these coefficients $x$, as we may not possess an appropriate measuring device, or one may not exist, or there is considerable uncertainty about where the non-zero coefficients are.

Given a signal $\alpha \in \mathbb{R}^n$ and a matrix $A \in \mathbb{R}^{m \times n}$, with $m \ll n$, we can acquire the signal via the set of linear measurements:

$$y = \Psi \alpha = \Psi\Phi x = Ax \label{cs-model}$$

where we have combined \eqref{vector-repr} and \eqref{def:alpha}. In this case, $A$ represents the sampling system (i.e., each column of $A$ is the product of $\Psi$ with the corresponding column of $\Phi$). We can work with the abstract model \eqref{cs-model}, bearing in mind that $x$ may be the coefficient sequence of the object in the proper basis.

In contrast to classical sensing, which requires that $m = n$ for there to be no loss of information, it is possible to reconstruct $x$ from an under-determined set of measurements as long as $x$ is sparse in some basis.

There are two conditions the matrix $A$ needs to satisfy for recovery below Nyquist rates:

  1. Restricted Isometry Property.
  2. Incoherence between sensing and signal bases.

\begin{definition}[RIP]\label{def:RIP} We say that a matrix $A$ satisfies the RIP of order $s$ if there exists a $\delta \in (0, 1)$ such that for all $x \in \Sigma_s$:

$$(1 - \delta) \|x\|_2^2 \leq \|Ax\|_2^2 \leq (1 + \delta) \|x\|_2^2$$

i.e., $A$ approximately preserves the lengths of all $s$-sparse vectors in $\mathbb{R}^n$.

\end{definition}

\begin{remark} Although the matrix $A$ is not square, the RIP (Definition \ref{def:RIP}) ensures that $A^TA$ is close to the identity, and so $A$ behaves approximately as if it were orthogonal. This is formalised in the following lemma from \cite{shalev2014understanding}:

\begin{lemma}[Identity Closeness \cite{shalev2014understanding}] Let $A$ be a matrix that satisfies the RIP of order $2s$ with RIP constant $\delta$. Then for two disjoint subsets $I, J \subset [n]$, each of size at most $s$, and for any vector $u \in \mathbb{R}^n$:

$$\langle Au_I, Au_J \rangle \leq \delta \|u_I\|_2 \|u_J\|_2$$

where $u_I$ is the vector with component $u_i$ if $i \in I$ and zero elsewhere.

\end{lemma}

\end{remark}

\begin{remark} \label{rem:rip-delta-comment} The restricted isometry property of order $s$ is equivalent to stating that all eigenvalues of the Gram matrices $A_S^T A_S$, where $A_S$ is any submatrix formed from at most $s$ columns of $A$, lie in the interval $[1 - \delta, 1 + \delta]$. Thus, the meaning of the constant $\delta$ in Definition \ref{def:RIP} is now apparent: $\delta$ is called the \textit{restricted isometry constant} in the literature.

The constant $\delta$ in Definition \ref{def:RIP} measures how close to an isometry the action of the matrix $A$ is on vectors with a few non-zero entries (as measured in the $\ell_2$ norm). For random matrices $A$ whose components are drawn from a $\mathcal{N}(0, 1/m)$ distribution, $\delta < \sqrt{2} - 1$ holds with high probability, given sufficiently many measurements \cite{candes2008restricted}. \end{remark}
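The following sketch gives an empirical feel for the restricted isometry constant: it draws a Gaussian matrix and probes the eigenvalues of $A_S^T A_S$ over many random supports $S$. This only lower-bounds $\delta$ (a true RIP certificate would require checking all supports, which is combinatorial); the sizes used are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, s = 256, 128, 8

# Gaussian sensing matrix with E[A_ij^2] = 1/m
A = rng.normal(0, np.sqrt(1 / m), size=(m, n))

# For each random support S, the eigenvalues of A_S^T A_S should lie in [1 - delta, 1 + delta]
deltas = []
for _ in range(2000):
    S = rng.choice(n, size=s, replace=False)
    eig = np.linalg.eigvalsh(A[:, S].T @ A[:, S])
    deltas.append(max(1.0 - eig.min(), eig.max() - 1.0))

print(f"empirical lower bound on the order-{s} RIP constant: {max(deltas):.3f}")
```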

\begin{remark} [Information Preservation \cite{davenport2010signal}] A necessary condition to recover all $s$-sparse vectors from the measurements $Ax$ is that $Ax_1 \neq Ax_2$ for any pair $x_1 \neq x_2$, $x_1, x_2 \in \Sigma_s$, which is equivalent to:

$$\|A(x_1 - x_2)\|_2^2 > 0$$

This is guaranteed as long as $A$ satisfies the RIP of order $2s$ with constant $\delta$. The vector $x_1 - x_2$ will have at most $2s$ non-zero entries, and so will be distinguishable after multiplication with $A$. To complete the argument, take $x = x_1 - x_2$ in Definition \ref{def:RIP}, guaranteeing:

$$\|A(x_1 - x_2)\|_2^2 > 0$$

and requiring the RIP order of $A$ to be $2s$. \end{remark}

\begin{remark} [Stability \cite{davenport2010signal}] We also require that the dimensionality reduction of compressed sensing preserves relative distances: that is, if $x_1$ and $x_2$ are far apart in $\mathbb{R}^n$, then their projections $Ax_1$ and $Ax_2$ are far apart in $\mathbb{R}^m$. This will guarantee that the dimensionality reduction is robust to noise. \end{remark}

A requirement on the matrix $A$ that satisfies both of these conditions is the following:

\begin{definition}[$\delta$-stable embedding \cite{davenport2010signal}] We say that a mapping is a $\delta$-stable embedding of $U, V \subset \mathbb{R}^n$ if

$$(1 - \delta) \|u - v\|_2^2 \leq \|Au - Av\|_2^2 \leq (1 + \delta) \|u - v\|_2^2$$

for all $u \in U$ and $v \in V$.

\label{def:d-stable} \end{definition}

\begin{remark}[\cite{davenport2010signal}] Note that a matrix $A$, satisfying the RIP of order $2s$, is a $\delta$-stable embedding of $\Sigma_s, \Sigma_s$. \end{remark}

\begin{remark}[\cite{davenport2010signal}] Definition \ref{def:d-stable} has a simple interpretation: the matrix $A$ must approximately preserve Euclidean distances between all points in the signal model $\Sigma_s$. \end{remark}
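A short simulation illustrates the stable embedding property: pairwise distances between random $s$-sparse vectors are approximately preserved after projection by a random $A$. This is a sketch under the same Gaussian construction as above, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, s = 512, 128, 10
A = rng.normal(0, np.sqrt(1 / m), size=(m, n))

def random_sparse(n, s, rng):
    """Return a random s-sparse vector in R^n."""
    v = np.zeros(n)
    v[rng.choice(n, size=s, replace=False)] = rng.normal(size=s)
    return v

# The ratio ||Au - Av||^2 / ||u - v||^2 should lie in [1 - delta, 1 + delta]
ratios = []
for _ in range(1000):
    u, v = random_sparse(n, s, rng), random_sparse(n, s, rng)
    ratios.append(np.sum((A @ (u - v)) ** 2) / np.sum((u - v) ** 2))

print(f"min ratio {min(ratios):.3f}, max ratio {max(ratios):.3f}")
```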

Incoherence

Given that we know a basis $\phi$ in which our signal is sparse, how do we choose $\psi$ so that we can accomplish this sensing task? In classical sensing, we choose the $\psi_k$ to be the set of $T_s$-spaced delta functions (or, equivalently, the set of $1/T_s$-spaced delta functions in the frequency domain). A simple set of $\psi_k$ would be a (random) subset of the delta functions above.

In general, we seek waveforms in which the signals’ representation would be dense.

\begin{definition}[Incoherence] The coherence of a pair of bases is the largest inner product between an element of the sensing basis ($\psi$) and an element of the representation basis ($\phi$), scaled by $\sqrt{n}$, where $n$ is the dimension of the signal; it always lies in the range $[1, \sqrt{n}]$ and is denoted by $\mu$. A pair of bases is said to be incoherent if $\mu$ is close to $1$. \end{definition}

Examples of pairs of incoherent bases are:

  1. The spike (time-domain delta) basis and the Fourier basis, which are maximally incoherent ($\mu = 1$).
  2. A random basis (e.g., one drawn from the Gaussian ensemble) and any fixed basis, which are incoherent with high probability.

This implies that sensing with incoherent systems is good (in the sine wave example above, it would be better to sample randomly in the time domain as opposed to the frequency domain), and efficient mechanisms ought to acquire correlations with random waveforms (e.g., white noise).

\begin{theorem}[Reconstruction from Compressive measurements \cite{Candes2006}] Fix a signal $f \in \mathbb{R}^n$ with a sparse coefficient sequence $x_i$ in the basis $\phi$. Then a reconstruction from $m$ random measurements in $\psi$ is possible with probability $1 - \delta$ if:

$$m \geq C \mu^2(\phi, \psi) S \log \left( \frac{n}{\delta} \right) \label{minsamples}$$

where $\mu(\phi, \psi)$ is the coherence of the two bases, and $S$ is the number of non-zero entries on the support of the signal. \end{theorem}
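The coherence $\mu(\phi, \psi)$ is straightforward to compute numerically. The sketch below, using the spike and Fourier bases as an assumed example, evaluates $\mu(\phi, \psi) = \sqrt{n}\,\max_{i,j} |\langle \phi_i, \psi_j \rangle|$ and returns a value close to $1$, the maximally incoherent case.

```python
import numpy as np

n = 64

Phi = np.eye(n)                            # representation basis: spikes (identity)
Psi = np.fft.fft(np.eye(n)) / np.sqrt(n)   # sensing basis: orthonormal Fourier basis

# mu(phi, psi) = sqrt(n) * max_{i,j} |<phi_i, psi_j>|, which always lies in [1, sqrt(n)]
mu = np.sqrt(n) * np.max(np.abs(Psi.conj().T @ Phi))
print(f"coherence of the spike/Fourier pair: {mu:.3f}")   # approximately 1.0
```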

Random Matrix Constructions \label{sec:mtx-contruction}

To construct matrices satisfying Definition \ref{def:RIP}, given $m, n$ we generate $A$ by drawing each entry $A_{ij}$ as an i.i.d. random variable from a distribution satisfying the following conditions \cite{davenport2010signal}:

\begin{condition}[Norm preservation] $\mathbb{E} A_{ij}^2 = \frac{1}{m}$ \label{cond:norm-pres} \end{condition}

\begin{condition}[sub-Gaussian] There exists a $C > 0$ such that: $\mathbb{E}\left( e^{A_{ij}t} \right) \leq e^{C^2 t^2 /2}$ \label{cond:sub-Gauss} for all $t \in \mathbb{R}$. \end{condition}

\begin{remark} The term $\mathbb{E}\left( e^{A_{ij}t} \right)$ in Condition \ref{cond:sub-Gauss} is the moment generating function of the entries of the sensing matrix. Condition \ref{cond:sub-Gauss} says that the moment-generating function of the distribution producing the sensing matrix is dominated by that of a Gaussian distribution, which is equivalent to requiring that the tails of our distribution decay at least as fast as the tails of a Gaussian distribution. Examples of sub-Gaussian distributions include the Gaussian distribution, the Rademacher distribution, and the uniform distribution. In general, any distribution with bounded support is sub-Gaussian. The constant $C$ measures the rate of fall-off of the tails of the sub-Gaussian distribution. \end{remark}

Random variables $A_{ij}$ satisfying Conditions \ref{cond:norm-pres} and \ref{cond:sub-Gauss} satisfy the following concentration inequality \cite{baraniuk2008simple}:

\begin{lemma}[sub-Gaussian concentration \cite{baraniuk2008simple}] $$\mathbb{P}\left( \left\lvert \|Ax\|_2^2 - \|x\|_2^2 \right\rvert \geq \varepsilon \|x\|_2^2 \right) \leq 2e^{-cm\varepsilon^2}$$ \label{cond:sub-Gauss concetration} \end{lemma}

\begin{remark} Lemma \ref{cond:sub-Gauss concetration} says that, for a matrix $A$ with sub-Gaussian entries and any $x \in \mathbb{R}^n$, the random variable $\|Ax\|_2^2$ is highly concentrated about $\|x\|_2^2$. \end{remark}

Then in \cite{baraniuk2008simple} the following theorem is proved:

\begin{theorem} Suppose that $m$, $n$ and $0 < \delta < 1$ are given. If the probability distribution generating $A$ satisfies the concentration inequality of Lemma \ref{cond:sub-Gauss concetration}, then there exist constants $c_1, c_2$ depending only on $\delta$ such that the RIP of Definition \ref{def:RIP} holds for $A$ with the prescribed $\delta$ and any $s \leq \frac{c_1 m}{\log{(n/s)}}$ with probability $\geq 1-2e^{-c_2 m}$. \end{theorem}

For example, if we take $A_{ij} \sim \mathcal{N}(0, 1/m)$, then the matrix $A$ will satisfy the RIP with probability \cite{baraniuk2008simple}:

$$\geq 1 - 2\left(\frac{12}{\delta}\right)^k e^{-c_0\frac{\delta}{2}m}$$

where $\delta$ is the RIP constant, $c_0$ is a constant, and $k$ is the sparsity of the signal being sensed.
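The sketch below draws sensing matrices from two sub-Gaussian ensembles satisfying Conditions \ref{cond:norm-pres} and \ref{cond:sub-Gauss} (Gaussian and Rademacher entries, both scaled so that $\mathbb{E} A_{ij}^2 = 1/m$) and checks empirically that $\|Ax\|_2^2$ concentrates around $\|x\|_2^2$, as Lemma \ref{cond:sub-Gauss concetration} predicts. The sizes and the number of trials are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 1000, 200
x = rng.normal(size=n)

def sensing_matrix(kind):
    """Sub-Gaussian ensembles normalised so that E[A_ij^2] = 1/m."""
    if kind == "gaussian":
        return rng.normal(0, np.sqrt(1 / m), size=(m, n))
    if kind == "rademacher":
        return rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)
    raise ValueError(kind)

# ||Ax||^2 / ||x||^2 should concentrate around 1 for both ensembles
for kind in ("gaussian", "rademacher"):
    ratios = [np.sum((sensing_matrix(kind) @ x) ** 2) / np.sum(x ** 2) for _ in range(200)]
    print(f"{kind}: mean {np.mean(ratios):.3f}, std {np.std(ratios):.3f}")
```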

Wishart Matrices \label{sec:wishart}

Let $\{X_i\}_{i=1}^r$ be a set of i.i.d. $1 \times p$ random vectors drawn from the multivariate normal distribution with mean $0$ and covariance matrix $H$.

$$X_i = \left(x_1^{(i)}, \ldots, x_p^{(i)}\right) \sim N\left(0, H\right)$$

We form the matrix $X$ by concatenating the $r$ random vectors into an $r \times p$ matrix.

\begin{definition}[Wishart Matrix] Let

$$W = \sum_{j=1}^r X_j^T X_j = X^T X$$

Then $W \in \mathbb{R}^{p \times p}$ has the Wishart distribution with parameters:

$$W_p(H, r)$$

where $r$ is the number of degrees of freedom. \end{definition}

\begin{remark} This distribution is a generalization of the Chi-squared distribution, which is recovered when $p = 1$ and $H = I$. \end{remark}

\begin{theorem}[Expected Value] \label{thm:wishart-mean} $\mathbb{E}(W) = rH$ \end{theorem}

\begin{proof} \begin{align} \mathbb{E}(W) &= \mathbb{E}\left(\sum_{j=1}^r X_j^T X_j\right) \\
&= \sum_{j=1}^r \mathbb{E}(X_j^T X_j) \\
&= \sum_{j=1}^r \left( \mathrm{Var}(X_j) + \mathbb{E}(X_j)^T \mathbb{E}(X_j) \right) \\
&= rH \end{align}
where the last line follows as $X_j$ is drawn from a distribution with zero mean. \end{proof}
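Theorem \ref{thm:wishart-mean} is easy to verify by Monte Carlo; the sketch below averages $W = X^T X$ over many draws and compares the result with $rH$ (the covariance $H$ and the sizes are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(4)
p, r = 3, 5

# Covariance of each 1 x p row X_i ~ N(0, H)
H = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])

# Monte Carlo estimate of E(W) for W = X^T X, where X is r x p with i.i.d. N(0, H) rows
trials = 20000
W_mean = np.zeros((p, p))
for _ in range(trials):
    X = rng.multivariate_normal(np.zeros(p), H, size=r)   # r x p
    W_mean += (X.T @ X) / trials

print(np.round(W_mean, 2))     # approximately r * H
print(np.round(r * H, 2))
```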

\begin{remark} The matrix $M = A^T A$, where $A$ is constructed by the methods from Section \ref{sec:mtx-contruction}, will have a Wishart distribution. In particular, it will have:

$$\mathbb{E}(M) = \frac{1}{m} I_n \label{remark: exp AtA}$$ \end{remark}

The joint distribution of the eigenvalues is given by \cite{levequeMatrices}:

$$p\left(\lambda_1, \ldots, \lambda_p\right) = c_p \prod_{i=1}^p e^{-\lambda_i} \prod_{i<j} \left(\lambda_i - \lambda_j\right)^2$$

The eigenvectors are uniformly distributed on the unit sphere in $\mathbb{R}^p$.

Reconstruction Objectives

Compressive sensing places the computational load on reconstructing the coefficient sequence xx, from the set of compressive samples yy. This is in contrast to Nyquist sampling, where the bottleneck is in obtaining the samples themselves—reconstructing the signal is a relatively simple task.

Many recovery algorithms have been proposed, and all are based upon minimizing some functional of the data. This objective is based upon two terms: a data fidelity term, minimizing the discrepancy between the reconstructed and true signal, and a regularization term—biasing the reconstruction towards a class of solutions with desirable properties, for example sparsity. Typically the squared error 12yAx22\frac{1}{2} \|y - Ax\|_2^2 is chosen as the data fidelity term, while several regularization terms have been introduced in the literature.

A particularly important functional is:

$$\arg \min_x \|x\|_1 \text{ s.t. } y = Ax \label{program:bp}$$

This is known as Basis Pursuit \cite{Chen1998a}; the following program, known as the LASSO \cite{tibshirani1996regression}, is its noisy generalization:

$$\arg \min_x \frac{1}{2} \|Ax - y\|_2^2 + \lambda \|x\|_1 \label{program:lasso}$$

The statistical properties of LASSO have been well studied. The program performs both regularization and variable selection: the parameter λ\lambda trades off data fidelity and sparsity, with higher values of λ\lambda leading to sparser solutions.

The LASSO shares several features with Ridge regression \cite{hoerl1970ridge}:

$$\arg \min_x \frac{1}{2} \|Ax - y\|_2^2 + \lambda \|x\|_2^2 \label{program:Ridge-regression}$$

and with best subset regression (related to the Non-negative Garrote \cite{breiman1995better}):

$$\arg \min_x \frac{1}{2} \|Ax - y\|_2^2 + \lambda \|x\|_0 \label{program:ell0}$$

The solutions to these programs can all be related to each other. In the case of an orthonormal design matrix $A$, it can be shown \cite{hastie2005elements} that the solution to \eqref{program:lasso} can be written as:

$$\hat{x}_i = S_{\lambda}\left(x^{OLS}_i\right) = \mathrm{sign}\left(x^{OLS}_i\right)\left(\left|x^{OLS}_i\right| - \lambda\right)_{+} \label{soln:lasso}$$

where $x^{OLS} = (A^T A)^{-1} A^T y$ is the ordinary least squares solution, whereas the solution to Ridge regression can be written as:

$$\hat{x} = \left( 1 + \lambda \right)^{-1} x^{OLS} \label{soln:ridge}$$

and the solution to best subset regression \eqref{program:ell0}, where $\|x\|_0 = |\{ i : x_i \neq 0 \}|$, can be written as:

$$\hat{x}_i = H_{\lambda}\left(x^{OLS}_i\right) = x^{OLS}_i \, \mathbb{I}\left( \left|x^{OLS}_i\right| > \lambda \right) \label{soln:l0}$$

where $\mathbb{I}$ is the indicator function. Comparing \eqref{soln:lasso} with \eqref{soln:ridge} and \eqref{soln:l0}, we can see that the LASSO solution translates coefficients towards zero by a constant amount, and sets coefficients to zero if they are too small; thus, the LASSO is able to perform both model selection (choosing relevant covariates) and regularization (shrinking model coefficients).
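The three component-wise rules are easy to compare side by side. The sketch below implements the soft-thresholding, ridge-shrinkage, and hard-thresholding maps of \eqref{soln:lasso}, \eqref{soln:ridge}, and \eqref{soln:l0}, applied to an illustrative vector of OLS coefficients (recall that these closed forms assume an orthonormal design matrix).

```python
import numpy as np

def soft_threshold(x_ols, lam):
    """LASSO solution (orthonormal design): shrink towards zero, kill small entries."""
    return np.sign(x_ols) * np.maximum(np.abs(x_ols) - lam, 0.0)

def ridge_shrink(x_ols, lam):
    """Ridge solution (orthonormal design): uniform shrinkage, never exactly zero."""
    return x_ols / (1.0 + lam)

def hard_threshold(x_ols, lam):
    """Best-subset / l0 solution (orthonormal design): keep or kill, no shrinkage."""
    return x_ols * (np.abs(x_ols) > lam)

x_ols = np.array([3.0, -0.4, 1.2, 0.1, -2.5])
lam = 0.5
print(soft_threshold(x_ols, lam))   # [ 2.5 -0.   0.7  0.  -2. ]
print(ridge_shrink(x_ols, lam))     # all coefficients scaled by 1/1.5
print(hard_threshold(x_ols, lam))   # [ 3.  -0.   1.2  0.  -2.5]
```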

Solutions to the Compressive Sensing optimization problem intersect the $\ell_1$ ball at points where all components (but one) of the vector are zero (i.e., it is sparsity promoting) \cite{Tibshirani1996}. \label{fig:l1l2}

Figure \ref{fig:l1l2} provides a graphical demonstration of why the LASSO promotes sparse solutions. \eqref{program:lasso} can also be thought of as the best convex approximation of the $\ell_0$ problem \eqref{program:ell0}, as the $\ell_1$-norm is the convex hull of the points defined by $\|x\|_p$ for $p < 1$ as $p \rightarrow 0$.

Other examples of regularizers are:

$$\arg \min_x \frac{1}{2} \|Ax - y\|_2^2 + \lambda \|x\|_2^2 + \mu \|x\|_1 \label{program:enat}$$

This program, known as the Elastic Net, has the benefits of both Ridge and LASSO regression: feature selection from the LASSO, and regularization for numerical stability (useful in the under-determined case we consider here) from Ridge regression. The Elastic Net will outperform the LASSO \cite{zou2005regularization} when there is a high degree of collinearity between the relevant predictors (columns of $A$).

$$\arg \min_x \frac{1}{2} \|Ax - y\|_2^2 + \lambda \|\nabla x\|_1 \label{program:tc}$$

This is known as Total Variation regularization; it is used when preserving edges while simultaneously denoising a signal is required. It is used extensively in image processing, where signals exhibit large flat patches alongside large discontinuities between groups of pixels.

$$\min_{x \in \mathbb{R}^n} \|x\|_1 \text{ s.t. } \|A^T(Ax - y)\|_{\infty} \leq t\sigma \label{program:danzig}$$

with $t = c \sqrt{2 \log{n}}$. This program is known as the Dantzig selector. Similarly to the LASSO, it selects sparse vectors consistent with the data, in the sense that the correlations of the residual $r = y - Ax$ with the columns of $A$ are no larger than the maximum correlation expected from the noise alone. In \cite{candes2007dantzig} it was shown that the $\ell_2$ error of the solution is within a factor of $\log{n}$ of the ideal $\ell_2$ error. More recent work by Bickel, Ritov, and Tsybakov \cite{bickel2009simultaneous} has shown that the LASSO enjoys similar properties.

Reconstruction Algorithms

Broadly, reconstruction algorithms fall into three classes: convex optimisation/linear programming, greedy algorithms, and Bayesian inference. Convex optimisation methods offer better reconstruction accuracy than greedy methods, at the cost of greater computational complexity. Greedy methods are relatively simpler, but don’t have the reconstruction guarantees of convex algorithms. Bayesian methods offer the best reconstruction performance, as well as uncertainty estimates about the quality of the reconstruction, but come with considerable computational complexity.

| Algorithm Type | Accuracy | Complexity | Speed |
| --- | --- | --- | --- |
| Greedy | Low | Low | Fast |
| Convex | Medium | Medium | Medium |
| Bayesian | High | High | Slow |

Convex Algorithms

Convex methods cast the optimisation objective either as a linear program with linear constraints or as a second-order cone program with quadratic constraints. Both of these types of programs can be solved with interior point methods. However, their practical application to compressive sensing problems is limited due to their polynomial dependence upon the signal dimension and the number of constraints.

Compressive sensing poses a few difficulties for convex optimisation-based methods. In particular, many of the unconstrained objectives are non-smooth, meaning methods based on descent down a smooth gradient are inapplicable.

To overcome these difficulties, a series of algorithms originally proposed for wavelet-based image de-noising have been applied to CS, known as iterative shrinkage methods. These have the desirable property that they boil down to matrix-vector multiplications and component-wise thresholding.

Iterative shrinkage algorithms replace searching for a minimal facet of a complex polytope with an iteratively denoised gradient descent. The choice of the (component-wise) denoiser depends upon the regulariser used in \eqref{program:lasso}. These algorithms have an interpretation as Expectation-Maximisation \cite{figueiredo2003algorithm}, where the E-step is performed as gradient descent and the M-step is the application of the denoiser.

Iterative Soft Thresholding Algorithm (IST)

\begin{algorithm} \begin{algorithmic}[1] \Procedure{IST}{$y, A, \mu, \tau, \varepsilon$} \State $x^0 \gets 0$ \While{$\|x^{t} - x^{t-1}\|_2^2 \geq \varepsilon$} \State $z^t \gets y - A x^t$ \State $x^{t+1} \gets S_{\mu\tau}\left(x^t + \tau A^T z^t \right)$ \EndWhile \State \textbf{return} $x^{t+1}$ \EndProcedure \end{algorithmic} \caption{The Iterative Soft Thresholding Algorithm \cite{donoho1995noising}}\label{alg:IST} \end{algorithm}
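A possible NumPy implementation of Algorithm \ref{alg:IST} is sketched below (an illustration rather than a reference implementation); the step size $\tau$ is chosen from the spectral norm of $A$ to ensure convergence, and the test problem is synthetic.

```python
import numpy as np

def soft(v, threshold):
    """Component-wise soft-thresholding operator S_threshold."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def ist(y, A, mu, tau, eps=1e-10, max_iter=5000):
    """Iterative Soft Thresholding for min_x 0.5 * ||Ax - y||_2^2 + mu * ||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        z = y - A @ x                                # residual
        x_new = soft(x + tau * (A.T @ z), mu * tau)  # gradient step followed by the denoiser
        if np.sum((x_new - x) ** 2) < eps:
            return x_new
        x = x_new
    return x

# Synthetic sparse recovery problem
rng = np.random.default_rng(5)
n, m, s = 200, 80, 5
A = rng.normal(0, np.sqrt(1 / m), size=(m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, size=s, replace=False)] = rng.normal(size=s)
y = A @ x_true

tau = 0.99 / np.linalg.norm(A, 2) ** 2               # step size just below 1 / ||A||_2^2
x_hat = ist(y, A, mu=1e-3, tau=tau)
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))   # relative reconstruction error
```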

Bayesian Algorithms

Bayesian methods reformulate the optimisation problem into an inference problem. These methods come with a unified theory and standard methods to produce solutions. The theory is able to handle hyper-parameters in an elegant way, provides a flexible modelling framework, and is able to provide desirable statistical quantities such as the uncertainty inherent in the prediction.

Previous sections have discussed how the weights xx may be found through optimisation methods such as basis pursuit or greedy algorithms. Here, an alternative Bayesian model is described.

Equation \eqref{CSequation} implies that we have a Gaussian likelihood model:

$$p \left(y \mid x, \sigma^2 \right) = (2 \pi \sigma^2)^{-m/2} \exp{\left(- \frac{1}{2 \sigma^2} \|y - Ax\|_{2}^{2} \right)}$$

The above has converted the CS problem of inverting the sparse weight vector $x$ into a linear regression problem with a constraint (prior) that $x$ is sparse.

To seek the full posterior distribution over $x$ and $\sigma^2$, we can choose a sparsity-promoting prior. A popular sparseness prior is the Laplace density function:

$$p\left(x \mid \lambda\right) = \left(\frac{\lambda}{2}\right)^N \exp{\left(-\lambda \sum_{i=1}^{N} |x_i|\right)}$$

Note that the solution to the convex optimisation problem \eqref{program:lasso} corresponds to a maximum a posteriori estimate for $x$ under this prior, i.e., this prior is equivalent to using the $\ell_1$ norm as a regulariser (see figure \ref{laplacenormal}) \cite{Tibshirani1996}.

\begin{figure}[h] \centering \includegraphics[height=7cm]{LaplaceandNormalDensity.png} \caption{The Laplace ($\ell_1$-norm, bold line) and Normal ($\ell_2$-norm, dotted line) densities. Note that the Laplace density is sparsity-promoting as it penalises solutions away from zero more than the Gaussian density. \cite{Tibshirani1996}} \label{laplacenormal} \end{figure}

The full posterior distribution on $x$ and $\sigma^2$ may be realised by using a hierarchical prior instead. To do this, define a zero-mean Gaussian prior on each element of $x$:

$$p\left(x \mid \alpha\right) = \prod_{i=1}^{N} \mathcal{N}\left(x_i \mid 0, \alpha_{i}^{-1}\right)$$

where $\alpha_i$ is the precision of the distribution of $x_i$. A Gamma prior is then imposed on each $\alpha_i$:

$$p\left(\alpha \mid a, b \right) = \prod_{i=1}^{N} \Gamma\left( \alpha_i \mid a, b \right)$$

The overall prior is found by marginalising over the hyperparameters:

$$p\left( x \mid a, b \right) = \prod_{i=1}^{N} \int_{0}^{\infty} \mathcal{N}\left(x_i \mid 0, \alpha_{i}^{-1}\right) \Gamma\left( \alpha_i \mid a, b \right) \, d\alpha_i$$

This integral can be done analytically and results in a Student-t distribution. Choosing the parameters $a$ and $b$ appropriately, we can make the Student-t distribution peak strongly around $x_i = 0$, i.e., it is sparsifying. This process can be repeated for the noise variance $\sigma^2$. The hierarchical model for this process is shown in figure \ref{fig:bayesiancs}. This model, and other CS models which do not necessarily have closed-form solutions, can be solved via belief propagation \cite{Baron2010}, or via Monte-Carlo methods.

\begin{figure}[h] \centering \includegraphics[height=7cm]{bayesiancs.png} \caption{The hierarchical model for the Bayesian CS formulation \cite{Ji2008}} \label{fig:bayesiancs} \end{figure}
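To see why the hierarchical prior is sparsity-promoting, one can sample a long coefficient vector from it and ask how much of its $\ell_1$ mass is carried by its largest entries; draws from the marginal Student-t are far more compressible than Gaussian draws. The hyperparameters below are illustrative choices, not values from the cited work.

```python
import numpy as np

rng = np.random.default_rng(8)
a, b, N = 0.5, 0.5, 10000          # illustrative Gamma hyperparameters and vector length

# Hierarchical prior: alpha_i ~ Gamma(a, rate b), then x_i ~ N(0, 1 / alpha_i)
alpha = rng.gamma(shape=a, scale=1.0 / b, size=N)
x = rng.normal(0.0, 1.0 / np.sqrt(alpha))

def top_mass(v, frac=0.05):
    """Fraction of the total l1 mass carried by the largest `frac` of the entries."""
    k = int(frac * len(v))
    return np.sort(np.abs(v))[-k:].sum() / np.abs(v).sum()

g = rng.normal(size=N)             # i.i.d. Gaussian vector for comparison
print(f"hierarchical prior: {top_mass(x):.2f}, Gaussian: {top_mass(g):.2f}")
```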

However, as with all methodologies, Bayesian algorithms have their drawbacks. Most notably, they use the most computationally complex recovery algorithms. In particular, MCMC methods suffer in high-dimensional settings, such as those considered in compressive sensing. There has been an active line of work to address this: most notably, Hamiltonian Monte Carlo (see \cite{neal2011mcmc}), an MCMC sampling method designed to follow the typical set of the posterior density.

Belief propagation (BP) \cite{Yedidia2011} is a popular iterative algorithm, offering improved reconstruction quality and undersampling performance. However, it is computationally complex and difficult to implement. Approximate message passing (AMP) solves this issue by blending BP and iterative thresholding (Algorithm \ref{alg:IST}). The algorithm proceeds like iterative thresholding but computes an adjusted residual at each stage. The final term in the update:

$$z^{t+1} = y - Ax^t + \frac{\|x^t\|_0}{m} z^t$$

comes from a first-order approximation to the messages passed by BP \cite{metzler2014denoising}. This is in contrast to the update from IST (Algorithm \ref{alg:IST}):

$$z^{t+1} = y - Ax^t$$
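The following sketch shows the AMP iteration with a soft-thresholding denoiser (the threshold is set from the residual energy, one common choice; this is an illustration rather than a reference implementation). Note that the only difference from IST is the extra Onsager term added to the residual.

```python
import numpy as np

def soft(v, threshold):
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def amp(y, A, theta=2.0, n_iter=50):
    """AMP with a soft-thresholding denoiser and a residual-based threshold."""
    m, n = A.shape
    x, z = np.zeros(n), y.copy()
    for _ in range(n_iter):
        lam = theta * np.linalg.norm(z) / np.sqrt(m)       # adaptive threshold
        x = soft(x + A.T @ z, lam)
        z = y - A @ x + (np.count_nonzero(x) / m) * z      # residual plus Onsager correction
    return x

rng = np.random.default_rng(6)
n, m, s = 200, 80, 5
A = rng.normal(0, np.sqrt(1 / m), size=(m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, size=s, replace=False)] = rng.normal(size=s)
y = A @ x_true

print(np.linalg.norm(amp(y, A) - x_true) / np.linalg.norm(x_true))   # relative error
```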

The choice of prior is key in Bayesian inference, as it encodes all knowledge about the problem. Penalising the least-squares estimate with the $\ell_1$ norm, for example, corresponds to placing an i.i.d. Laplace prior on the coefficients.

Compressive Estimation \label{sec:estimation}

In this section, we develop some intuition into constructing estimators for the signal $s$ directly from the compressive measurements:

\begin{theorem}

Given a set of measurements of the form:

$$y = As + e$$
where $A \in \mathbb{R}^{m \times n}$, $A_{ij} \sim \mathcal{N}\left(0,1/m\right)$, and $e \in \mathbb{R}^m$ is AWGN, i.e., $e \sim \mathcal{N}\left(0, \sigma^2 I\right)$. We again assume that $s$ comes from a fixed set of models, parametrised by some set $\Theta$.

Then the maximum likelihood estimator of $s$, for the case where $s$ can be expanded in an orthonormal basis as $s = \sum_{i=1}^n \alpha_i\phi_i$, is:

$$\hat{s} = \sum_{i=1}^n m \langle y, A\phi_i \rangle \phi_i$$

\end{theorem} \begin{proof} The likelihood for this model is (as $y$ is a normal random variable):

$$f\left(y \mid s\right) = \left(\frac{1}{\left(2\pi\right)^{n/2}}\right) \exp{\left(- \frac{\left(y-As\right)^T \left(y-As\right)}{2} \right)}$$

Taking the logarithm and expanding, we find

$$\ln{f\left(y \mid s\right)} = -y^Ty - s^TA^TAs + 2\langle y, As \rangle + c$$

which is equal to:

$$\ln{f} = - \|y\|_2^2 - \|As\|_2^2 + 2\langle y, As \rangle \label{log-like}$$

(where the constant has been dropped). The first term of \eqref{log-like} does not depend on $s$. The term

$$\|As\|_2^2 = \langle As, As \rangle$$

can be written as

$$\langle A^TAs, s\rangle \label{ata}$$

We will assume that $A^TA$ concentrates in the sense of Lemma \ref{cond:sub-Gauss concetration}, and replace \eqref{ata} with its expectation $\mathbb{E}\left( \langle A^TAs, s\rangle \right)$:

\begin{align} \mathbb{E}\left(\langle A^TAs, s\rangle\right) &= \mathbb{E}\left(\sum_{i=1}^n (A^TAs)_i s_i\right) \\
&= \sum_{i=1}^n \mathbb{E}\left((A^TAs)_i\right) s_i \\
&= \sum_{i=1}^n \frac{1}{m} s_i s_i \\
&= \frac{1}{m} \langle s, s \rangle \end{align}

because

$$\mathbb{E}\left(A^TA\right) = \frac{1}{m} I$$

as it is a Wishart matrix (see Section \ref{sec:wishart}). So we can further approximate \eqref{log-like}:

$$\ln{f\left(y \mid s\right)} = - \|y\|_2^2 - \frac{1}{m} \|s\|_2^2 + 2 \langle y, As \rangle \label{approx-log-like}$$

The only non-constant part of \eqref{approx-log-like} is the third term, and so we define the estimator:

$$\hat{s} = \argmax_{\Theta} \langle y , As\left(\Theta\right)\rangle \label{eq: compressive-estimator}$$ \end{proof}

\begin{corollary} Consider the case where $y = As$ (no noise). Then

\begin{align} y^TA\phi_j &= \sum_i \alpha_i \phi_i^TA^TA\phi_j \end{align}

So

\begin{align} y^TA\phi_j &= \sum_i \alpha_i \phi_i^TA^TA\phi_j \approx \sum_i \alpha_i \frac{1}{m} \delta_{ij} = \frac{\alpha_j}{m} \end{align}

giving

$$\widehat{\alpha_j} = m \left( y^T A \phi_j \right)$$ \end{corollary}

\begin{remark} The matrix $M = A^TA$ is the projection onto the row-space of $A$. It follows that $\|Ms\|_2^2$ is simply the norm of the component of $s$ which lies in the row-space of $A$. This quantity is at most $\|s\|_2^2$, but can also be $0$ if $s$ lies in the null space of $A$. However, because $A$ is random, we can expect that $\|Ms\|_2^2$ will concentrate around $\sqrt{m/n} \|s\|_2^2$ (this follows from the concentration property of sub-Gaussian random variables, Lemma \ref{cond:sub-Gauss concetration}). \end{remark}

\begin{example}{Example: Single Spike} We illustrate these ideas with a simple example: estimate which of $n$ possible spike positions the signal $s$ is composed of.

A signal $s \in \mathbb{R}^{300}$ is composed of a single (random) delta function, with coefficient drawn from a Normal distribution (with mean 100, and variance 1), i.e.,

$$s = \alpha_i \delta_i$$
with

$$\alpha_i \sim \mathcal{N}\left(100, 1\right)$$

and the index $i$ chosen uniformly at random from $[1, n]$.
The signal was measured via a random Gaussian matrix $A \in \mathbb{R}^{100 \times 300}$, with entries of variance $\sigma^2 = 1/100$, and the inner product between $y = As$ and all 300 delta functions projected onto $\mathbb{R}^{100}$ was calculated:

$$\hat{\alpha}_j = m \langle A \alpha_i \delta_i, A \delta_j \rangle$$

We plot the $\hat{\alpha}_j$ below in figure \ref{fig:new_basis_25} (red circles), with the original signal (in blue, continuous line). Note how the maximum of the $\hat{\alpha}_j$ coincides with the true signal.
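The experiment can be reproduced, in spirit, with a few lines of NumPy; this sketch is an assumed reconstruction of the setup described above (not the original code), and only checks that the location of the maximum of $\hat{\alpha}_j$ identifies the true spike.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 300, 100

# Single spike with amplitude ~ N(100, 1) at a uniformly random location
i = rng.integers(n)
s = np.zeros(n)
s[i] = rng.normal(100, 1)

# Random Gaussian sensing matrix with entries N(0, 1/m); noiseless measurements
A = rng.normal(0, np.sqrt(1 / m), size=(m, n))
y = A @ s

# Correlate the measurements with the projection of every candidate delta function
alpha_hat = m * (A.T @ y)                     # alpha_hat[j] = m * <y, A delta_j>
print(i, int(np.argmax(np.abs(alpha_hat))))   # the two indices should agree
```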

\begin{figure}[h] \centering \includegraphics[height=7.3cm]{1spike_legend.jpg} \caption{An example estimation of a single spike using Compressive Inference methods. Note how the maximum of the estimator $\hat{\alpha}_j$ corresponds to the true signal.} \label{fig:new_basis_25} \end{figure} \end{example}