Abstract: Sparse signal recovery finds use in a variety of practical applications, such as signal and image restoration and the recovery of signals acquired by compressive sensing. In this paper, we present two generic very-large-scale integration (VLSI) architectures that implement the approximate message passing (AMP) algorithm for sparse signal recovery. The first architecture, referred to as AMP-M, employs parallel multiply-accumulate units and is suitable for recovery problems based on unstructured (e.g., random) matrices. The second architecture, referred to as AMP-T, takes advantage of fast linear transforms, which arise in many real-world applications. To demonstrate the effectiveness of both architectures, we present corresponding VLSI and field-programmable gate array implementation results for an audio restoration application. We show that AMP-T is superior to AMP-M with respect to silicon area, throughput, and power consumption, whereas AMP-M offers more flexibility.
Since many natural or man-made signals exhibit a sparse representation in certain bases (e.g., speech signals are approximately sparse in the Fourier domain) and many real-world problems can be formulated in terms of a system of linear equations, sparse signal recovery finds use in a large number of practical applications. Prominent examples are the restoration of audio signals or images from saturation, impulse noise, or narrowband interference [2]-[4], signal separation [5], de-noising [6], de-blurring [7], super-resolution [6], and in-painting [8], as well as compressive sensing (CS) [9], [10]. CS has recently gained significant attention in the research community by enabling the sampling of sparse signals using fewer measurements than the Nyquist rate suggests. In particular, CS has the potential of lowering the costs of sampling (compared to conventional analog-to-digital converters) and is used in a large number of practical applications, such as magnetic resonance imaging (MRI) [11], electroencephalography [12], imaging devices [13], radar [14], or wireless communication [15], [16].
Unfortunately, high-performance sparse signal recovery algorithms typically require a significant computational effort for the problem sizes occurring in most practical applications. While the computational complexity is not a major issue for applications where offline processing on central processing units (CPUs) or graphics processing units (GPUs) can be afforded (e.g., in MRI), it becomes extremely challenging for applications requiring real-time processing at high throughput or for implementations on battery-powered (mobile) devices. Hence, to meet the stringent throughput, latency, and power-consumption constraints of real-time audio, image, and video restoration, CS-based imaging devices, radar, or wireless systems, developing dedicated hardware implementations, such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs), is of paramount importance.
While significant research effort has been devoted to the design of high-performance and low-complexity sparse signal recovery algorithms, e.g., [17]-[25], much less is known about their economical implementation in dedicated hardware. Notable exceptions are the ASIC designs reported in [15], where the authors compared several implementations of greedy pursuit (GP) algorithms for sparse channel estimation in wireless communication systems. A similar recovery algorithm specifically designed for signals acquired by the modulated wideband converter was implemented on an FPGA in [16]. Another FPGA implementation for generic CS problems of dimension 32 × 128 was developed in [26]. All these implementations rely on GP algorithms, which are well suited for the recovery of very sparse signals in hardware. However, for applications featuring less sparse (or approximately sparse) signals, such as audio signals, images, or videos, these algorithms quickly become inefficient in terms of throughput, silicon area, and power consumption, as their complexity scales roughly linearly in the signal's sparsity level (see [15] for a corresponding discussion).
A. Contributions
In this paper, we present, to the best of our knowledge, the first VLSI designs of a basis pursuit denoising (BPDN) solver for signal restoration and signal recovery from CS measurements. We compare possible candidate algorithms and identify the approximate message passing (AMP) algorithm [25] to be well suited for the recovery of approximately sparse signals, such as audio signals, images, or videos, in hardware. To demonstrate the suitability of AMP for VLSI implementations, we develop two generic architectures.
• The first architecture, referred to as AMP-M, is a general-purpose solution employing multiply-accumulate (MAC) units, which is suitable for sparse signal recovery problems relying on arbitrary (e.g., unstructured or random) linear measurements.
• The second architecture, referred to as AMP-T, is specifically designed for recovery problems for which the measurement matrices (i.e., the aggregation of the linear measurement operators) have fast transform algorithms.
In order to demonstrate the efficacy of both sparse signal recovery architectures, we present corresponding VLSI designs for a real-time audio restoration example. For the AMP-T architecture, we employ a fast implementation of the discrete cosine transform (DCT), which substantially improves upon the MAC-based AMP-M solution in terms of silicon area, throughput, and power consumption. For both architectures, we present reference VLSI and FPGA implementation results to highlight the effectiveness of AMP-T and AMP-M for sparse signal recovery and CS in practical systems.
B. Outline of the Paper
The remainder of the paper is organized as follows. Section II briefly introduces CS and evaluates prominent sparse signal recovery algorithms, including AMP. Section III discusses the application of AMP for signal restoration. The VLSI architectures for AMP-M and AMP-T are detailed in Section IV; corresponding reference VLSI and FPGA implementation results and a comparison to existing solutions are given in Section V. We conclude in Section VI.
C. Notation
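Lowercase and uppercase boldface letters stand for column vectors and matrices, respectively. The $i$th entry of a vector $\mathbf{a}$ is $a_i$; the $k$th column and the $i$th entry on the $k$th column of a matrix $\mathbf{A}$ are denoted by $\mathbf{a}_k$ and $A_{i,k}$, respectively. $\mathbf{I}_M$ and $\mathbf{0}_{M \times N}$ stand for the $M \times M$ identity and the $M \times N$ all-zeros matrix, respectively; the transpose of a matrix $\mathbf{A}$ is designated by $\mathbf{A}^T$. The Euclidean (or $\ell_2$) norm of a vector $\mathbf{a}$ is denoted by $\|\mathbf{a}\|_2$. The $\ell_1$-norm of $\mathbf{a}$ is designated by $\|\mathbf{a}\|_1$, and $\|\mathbf{a}\|_0$, often referred to as the $\ell_0$-norm, corresponds to the number of nonzero entries in $\mathbf{a}$. The complex conjugate, real part, and imaginary part of $a \in \mathbb{C}$ are denoted by $a^*$, $\Re\{a\}$, and $\Im\{a\}$, respectively. For $a \in \mathbb{R}$, we define $[a]^+ = \max\{a, 0\}$.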
II. SPARSE SIGNAL RECOVERY AND COMPRESSIVE SENSING
We next review the basics of sparse signal recovery and CS and evaluate potential candidate recovery algorithms, with a focus on their suitability for VLSI. We then summarize the AMP algorithm [25] , which is considered in the remainder of the paper.
A. Compressive Sensing in a Nutshell
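Compressive sensing (CS), as put forward in [9], [10], aims to sample a signal vector $\mathbf{y} \in \mathbb{R}^N$ using fewer measurements than the Nyquist rate suggests. Specifically, CS considers the acquisition of $\mathbf{y}$ through $M$ linear (and nonadaptive) measurements as follows:
$$\mathbf{z} = \boldsymbol{\Phi}\mathbf{y} + \mathbf{n}. \tag{1}$$
Here, $\boldsymbol{\Phi} \in \mathbb{R}^{M \times N}$ is a sensing matrix satisfying $M < N$, and $\mathbf{n} \in \mathbb{R}^M$ represents additive measurement noise. Since the recovery of $\mathbf{y}$ from the noiseless measurements $\boldsymbol{\Phi}\mathbf{y}$ corresponds to solving an under-determined set of linear equations, the estimation of $\mathbf{y}$ from the noisy measurements (1) is, in general, an ill-posed problem. Nevertheless, many natural or man-made signals have a sparse representation in a given orthonormal basis $\boldsymbol{\Psi}$, i.e., $\mathbf{y} = \boldsymbol{\Psi}\mathbf{x}$, where only a few entries of $\mathbf{x}$ are nonzero. Sparsity enables us to estimate the signal $\mathbf{y}$ if the effective matrix $\mathbf{D} = \boldsymbol{\Phi}\boldsymbol{\Psi}$ satisfies the so-called restricted isometry property (RIP) [10]. For instance, if the entries of $\boldsymbol{\Phi}$ are i.i.d. zero-mean Gaussian, then $\mathbf{D}$ is known to satisfy the RIP with overwhelming probability provided that [27]
$$M \ge c\,K \log(N/K), \tag{2}$$
where $K = \|\mathbf{x}\|_0$ denotes the number of nonzero entries of $\mathbf{x}$ and $c > 0$ is a constant.
In this case, a fundamental result of CS states that a stable estimate for $\mathbf{x}$ can be obtained with the aid of sparse signal recovery algorithms. More specifically, recovery from (1) is commonly achieved by a convex optimization-based method known as basis pursuit de-noising [28]
$$(\text{BPDN}) \quad \underset{\tilde{\mathbf{x}}}{\text{minimize}}\ \|\tilde{\mathbf{x}}\|_1 \quad \text{subject to} \quad \|\mathbf{z} - \mathbf{D}\tilde{\mathbf{x}}\|_2 \le \varepsilon$$
with $\|\mathbf{n}\|_2 \le \varepsilon$. Finally, an estimate $\hat{\mathbf{y}}$ for the desired signal vector can be obtained by computing $\hat{\mathbf{y}} = \boldsymbol{\Psi}\hat{\mathbf{x}}$, where $\hat{\mathbf{x}}$ denotes the solution to BPDN.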
B. Evaluation of Sparse Signal Recovery Algorithms
In order to compute the solution to BPDN, a variety of optimal and suboptimal sparse signal recovery algorithms have been proposed in the literature [17]-[19], [22]-[25]. We next evaluate several potential candidate algorithms with respect to their suitability for VLSI implementation.
1) Interior-Point Methods: Convex optimization problems, such as BPDN, can be solved accurately using interior point methods [17] . Such methods are known to exhibit high computational complexity for moderate-to-large problem sizes and typically require high numerical precision; both drawbacks render an efficient implementation in VLSI challenging.
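2) First-Order Methods: To alleviate the complexity and precision requirements of interior-point methods, a variety of first-order methods (i.e., algorithms involving matrix-vector multiplications with the effective matrix $\mathbf{D}$ and its transpose $\mathbf{D}^T$ only) for solving the Lagrangian BPDN problem
$$(\text{BPDN}_\lambda) \quad \hat{\mathbf{x}} = \arg\min_{\tilde{\mathbf{x}}}\ \tfrac{1}{2}\|\mathbf{z} - \mathbf{D}\tilde{\mathbf{x}}\|_2^2 + \lambda\|\tilde{\mathbf{x}}\|_1$$
have been proposed in the literature [18], [23]. Here, $\mathbf{z}$ denotes the measurement vector, and the regularization parameter $\lambda > 0$ controls a trade-off between fidelity to the measurements and the $\ell_1$-norm of the solution $\hat{\mathbf{x}}$.
Iterative soft-thresholding (IST) [18], for example, is a simple first-order method that computes
$$\mathbf{x}^{i} = \eta_{\theta_i}\!\left(\mathbf{x}^{i-1} + \mathbf{D}^T\mathbf{r}^{i-1}\right), \qquad \mathbf{r}^{i} = \mathbf{z} - \mathbf{D}\mathbf{x}^{i}$$
for each iteration $i = 1, \ldots, I_{\max}$. Here, $I_{\max}$ denotes the maximum number of iterations, $\mathbf{x}^{i}$ the current estimate for $\hat{\mathbf{x}}$, $\mathbf{r}^{i}$ represents the residual error, $\theta_i$ is an iteration-dependent threshold, and $\eta_\theta(\cdot)$ implements an entry-wise soft-thresholding policy as follows:
$$\eta_\theta(x) = \operatorname{sign}(x)\,\max\{|x| - \theta,\ 0\}. \tag{3}$$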
The IST algorithm is able to deliver the result to BPDN given that and the thresholds satisfy certain properties [29] . Unfortunately, IST exhibits slow convergence, which eventually leads to high computational complexity. To this end, first-order algorithms achieving faster convergence than IST have been developed in the literature, e.g., [20] , [21] , [23] . The associated computational complexity is, however, still too high for most real-time applications in dedicated hardware.
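The soft-thresholding policy and the IST recursion described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `soft_threshold` and `ist`, the all-zeros starting estimate, and the representation of the matrix as a list of rows are choices made here, not taken from the paper.

```python
def soft_threshold(values, theta):
    """Entry-wise soft-thresholding: shrink every entry toward zero by theta."""
    out = []
    for v in values:
        mag = abs(v) - theta
        if mag <= 0.0:
            out.append(0.0)          # entries below the threshold are zeroed
        else:
            out.append(mag if v > 0 else -mag)
    return out


def ist(D, z, thresholds):
    """Run one IST iteration per entry of `thresholds` (assumed zero start)."""
    M, N = len(D), len(D[0])
    x = [0.0] * N                    # current signal estimate
    r = list(z)                      # residual for the all-zeros estimate
    for theta in thresholds:
        # Correlate the residual with every column of D (multiplication by D^T).
        corr = [sum(D[i][j] * r[i] for i in range(M)) for j in range(N)]
        x = soft_threshold([x[j] + corr[j] for j in range(N)], theta)
        # The IST residual depends on the current estimate only.
        r = [z[i] - sum(D[i][j] * x[j] for j in range(N)) for i in range(M)]
    return x


print(soft_threshold([3.0, -0.5, 1.0, -2.5], 1.0))
```

Each iteration costs one multiplication with the matrix and one with its transpose, which is why such methods are called first-order methods in the text.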
3) Greedy Pursuit (GP): Rather than solving (BPDN) or (BPDN$_\lambda$) altogether, a variety of GP-based algorithms that deliver approximations to these convex problems have been proposed in the literature, including the following.
• Matching pursuit (MP) [19] iteratively identifies the column of the matrix that is most correlated with a current signal estimate, followed by a simple update that computes an improved signal estimate. While each iteration of MP requires very low computational effort, the number of iterations heavily depends on the sparsity level and hence, MP is only suitable for extremely sparse signals.
• Orthogonal matching pursuit (OMP) [22] and compressive sampling matching pursuit (CoSaMP) [24] are more sophisticated GP-based algorithms that incorporate a least squares (LS) step to compute a signal estimate. The LS step significantly reduces the number of required iterations compared to MP, but it induces a high computational complexity per iteration and requires considerable numerical precision (see, e.g., [15]). While GPs are well suited for recovering very sparse signals in VLSI, as in certain applications in wireless communication or radar [15], [16], [26], for example, they turn out to be rather inefficient for large-dimensional problems and/or (approximately sparse) signals exhibiting moderate-to-high sparsity levels, such as audio signals, images, or videos.
C. Approximate Message Passing (AMP)
AMP [25] is a recently developed sparse signal recovery algorithm that delivers excellent recovery performance, exhibits fast convergence at low computational complexity per iteration, while requiring low arithmetic precision. All these properties render AMP advantageous for the implementation in VLSI compared to the algorithms evaluated above.
1) Algorithm: The pseudo-code for AMP is given in Algorithm 1. One can immediately see that AMP has a structure similar to that of IST (cf. Section II-B2), with the key difference that the residual is computed not only based upon the current estimate but also using the residual obtained in the previous iteration (cf. line 6). In order to identify a suitable thresholding policy, we decided to set the threshold proportionally to the regularization parameter and the root mean square error (RMSE) of the residual (see line 3 of Algorithm 1) as proposed in [30]. All these differences to IST render AMP vastly superior in terms of convergence rate, without noticeably penalizing the complexity per iteration. Moreover, AMP provably delivers the solution to (BPDN$_\lambda$) for matrices having zero-mean i.i.d. Gaussian distributed entries 1 of order [25]. For arbitrary (e.g., deterministic) matrices, AMP does not necessarily converge to the (BPDN$_\lambda$) solution. We emphasize, however, that AMP has been shown (empirically) to deliver excellent recovery performance for a large class of deterministic and highly structured matrices, such as subsampled Fourier or DCT matrices [31].
If the multiplications with the matrix and its transpose can be replaced by a fast transform, e.g., using a fast Fourier transform (FFT), then each iteration can be performed with very low computational complexity (see Section IV-C for a corresponding example). The maximum number of iterations required by the algorithm to converge usually ranges between 10 and 100 (depending on the target accuracy, the signal sparsity, and the dimensionality of the matrix). The square-root computation in the RMSE (cf. line 3) requires a specialized hardware unit, but can be implemented at low cost (see Section IV-A1 for the details). All remaining computations can be carried out efficiently using standard circuitry, which renders AMP well suited for implementation in VLSI.
3) Early Termination: The complexity of AMP can be reduced further by means of early termination (ET). In particular, the iterations of AMP can be terminated as soon as the RMSE is small enough (depending on the target application). To this end, we define an ET threshold and stop the iterative procedure (cf. lines 2-7) as soon as the RMSE falls below this threshold. Determining when ET occurs comes at virtually no additional hardware cost, since the computation of the RMSE is required anyway for the soft-thresholding parameter (cf. line 3). Furthermore, the ET threshold in combination with the maximum number of iterations allows us to trade performance for complexity; this trade-off is analyzed in Section III-B for an audio restoration application.
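Since Algorithm 1 is only referred to by line number here, the following plain-Python sketch may help to see how the pieces described above fit together: the threshold proportional to the regularization parameter and the RMSE of the residual, the soft-thresholding step, the residual update that reuses the previous residual scaled by the normalized number of nonzero entries, and the ET check. It follows the standard form of AMP and the description in the text; the function name `amp`, its arguments, the exact order of operations, and the toy problem are assumptions made for illustration.

```python
import math
import random


def soft_threshold(v, theta):
    """Scalar soft-thresholding."""
    mag = abs(v) - theta
    return math.copysign(mag, v) if mag > 0.0 else 0.0


def amp(D, z, lam, max_iter=20, et_threshold=0.0):
    """AMP sketch; returns (estimate, number of iterations actually run)."""
    M, N = len(D), len(D[0])
    x = [0.0] * N                      # signal estimate
    r = list(z)                        # residual, initialized with the input
    done = 0
    for _ in range(max_iter):
        rmse = math.sqrt(sum(v * v for v in r) / M)
        if rmse < et_threshold:        # early termination (never fires for 0.0)
            break
        theta = lam * rmse             # threshold from lambda and the RMSE
        # Estimate update: soft-threshold x + D^T r.
        x = [soft_threshold(x[j] + sum(D[i][j] * r[i] for i in range(M)), theta)
             for j in range(N)]
        # Number of nonzero entries of the estimate, normalized by M.
        b = sum(1 for v in x if v != 0.0) / M
        # Residual update: uses the new estimate AND the previous residual.
        r = [z[i] - sum(D[i][j] * x[j] for j in range(N)) + b * r[i]
             for i in range(M)]
        done += 1
    return x, done


# Hypothetical toy problem: 5-sparse vector, 64 x 128 i.i.d. Gaussian matrix.
rng = random.Random(0)
M, N, K = 64, 128, 5
D = [[rng.gauss(0.0, 1.0 / math.sqrt(M)) for _ in range(N)] for _ in range(M)]
x_true = [0.0] * N
for j in rng.sample(range(N), K):
    x_true[j] = rng.choice([-1.0, 1.0])
z = [sum(D[i][j] * x_true[j] for j in range(N)) for i in range(M)]

x_hat, iters = amp(D, z, lam=2.0, max_iter=30)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_hat, x_true)))
print(f"iterations: {iters}, l2 error: {err:.4f}")
```

Compared with an IST loop, the only structural change is the extra term in the residual update; the RMSE computed for the threshold is reused for the ET decision, which mirrors the observation that ET comes at virtually no additional cost.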
III. SIGNAL RESTORATION
In addition to CS, sparse signal recovery has been employed in the restoration of signals corrupted by impulse noise and/or saturation [2]-[4]. We next show how signal restoration can be formulated as a sparse signal recovery problem and demonstrate the suitability of AMP for this particular application.
A. Signal Restoration as a Sparse Signal Recovery Problem
As shown in [2]-[4], signals corrupted by impulse noise and saturation can be modeled as
$$\mathbf{z} = \mathbf{A}\mathbf{a} + \mathbf{B}\mathbf{b} = \mathbf{D}\mathbf{x} \tag{4}$$
with $\mathbf{D} = [\,\mathbf{A}\ \ \mathbf{B}\,]$, $\mathbf{x} = [\,\mathbf{a}^T\ \ \mathbf{b}^T\,]^T$, and the corrupted observation $\mathbf{z} \in \mathbb{R}^M$. Here, the matrix $\mathbf{A}$ sparsifies the signal $\mathbf{y} = \mathbf{A}\mathbf{a}$ to be restored and the matrix $\mathbf{B}$ sparsifies the corruptions on the signal $\mathbf{y}$. Signal restoration now amounts to the recovery of the sparse vector $\mathbf{x}$ from (4) using (BPDN) or (BPDN$_\lambda$), for example, followed by computing $\hat{\mathbf{y}} = \mathbf{A}\hat{\mathbf{a}}$. In certain cases, one can identify the locations of the corrupted entries in $\mathbf{z}$ prior to recovery, which typically results in improved restoration performance [2]-[4].
For the restoration to succeed, the matrix $\mathbf{A}$ must not only sparsify the (uncorrupted) signal $\mathbf{y}$, but also be incoherent to $\mathbf{B}$; i.e., the mutual coherence [3]
$$\mu = \max_{k,\ell}\ \frac{|\mathbf{a}_k^T\mathbf{b}_\ell|}{\|\mathbf{a}_k\|_2\,\|\mathbf{b}_\ell\|_2}$$
should be small (see [3] and [4] for a theoretical analysis). This important observation allows one to select a matrix pair $\mathbf{A}$, $\mathbf{B}$ that is suitable for the given signal-restoration application.
For the restoration of audio signals from clicks/pops and saturation, as considered in the remainder of the paper, setting $\mathbf{A}$ to the $M \times M$ (unitary) DCT matrix defined as
$$A_{k,\ell} = c_k\sqrt{2/M}\,\cos\!\left(\frac{(2\ell-1)(k-1)\pi}{2M}\right)$$
with $c_k = 1/\sqrt{2}$ for $k = 1$ and $c_k = 1$ otherwise, enables one to sparsify audio signals; setting $\mathbf{B} = \mathbf{I}_M$ sparsifies clicks/pops and saturation artifacts. Since $\mathbf{A}$ is incoherent to $\mathbf{B}$, excellent performance for audio restoration can be achieved by using this pair of matrices. In particular, as shown in [3] and [4], if the number of corrupted entries of $\mathbf{z}$ is small compared to its dimension $M$, then the DCT-identity pair is guaranteed to enable stable recovery of $\mathbf{x}$ and, hence, of the desired (uncorrupted) signal $\mathbf{y}$.
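To make the incoherence of the DCT-identity pair concrete, the short sketch below builds a small unitary DCT matrix, concatenates it with the identity, and evaluates the mutual coherence between the two halves. It uses the standard unitary DCT-II definition with 0-based indices; the helper names and the block length of 8 are illustrative assumptions.

```python
import math


def dct_matrix(M):
    """Unitary DCT-II matrix with 0-based row index k and column index l."""
    A = []
    for k in range(M):
        c = math.sqrt(1.0 / M) if k == 0 else math.sqrt(2.0 / M)
        A.append([c * math.cos(math.pi * (2 * l + 1) * k / (2 * M))
                  for l in range(M)])
    return A


def coherence_with_identity(A):
    """Largest normalized inner product between a column of A and a unit vector."""
    M = len(A)
    best = 0.0
    for l in range(M):                           # column l of A
        norm = math.sqrt(sum(A[k][l] ** 2 for k in range(M)))
        for k in range(M):                       # identity column k picks A[k][l]
            best = max(best, abs(A[k][l]) / norm)
    return best


M = 8
A = dct_matrix(M)
# Concatenated matrix [A I] of size M x 2M.
D = [A[k] + [1.0 if j == k else 0.0 for j in range(M)] for k in range(M)]
mu = coherence_with_identity(A)
print(f"D is {len(D)} x {len(D[0])}, mutual coherence = {mu:.4f}")
```

Because every entry of the unitary DCT matrix has magnitude at most $\sqrt{2/M}$, the coherence shrinks as the block length grows, which is the property the restoration guarantees rely on.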
B. Numerical Results for AMP
We next evaluate the performance of AMP for audio restoration. Fig. 1 shows a snapshot of the old phonograph recording "Mussorgsky" from [32] and the restored signal via AMP. Restoration is performed in blocks of length using the DCT-identity pair; 16 samples between each pair of adjacent blocks are added and overlapped using raised-cosine windows to avoid artifacts at the block boundaries. Each block is recovered using AMP with iterations, , and no ET (i.e., ). Fig. 1 illustrates the fact that AMP using the DCT-identity pair efficiently removes clicks/pops from old phonograph recordings, without knowing the locations of the sparse corruptions (also see [4] for similar results).
The performance and complexity of AMP for audio restoration are studied in Fig. 2. We artificially corrupt a 16 bit 44.1 kHz speech signal from [33]; the clicks/pops are modeled by adding Gaussian pulses consisting of five samples whose peak value is uniformly distributed in [-1, 1]. We define the click rate as the number of clicks per sample. The restoration performance is measured using the recovery signal-to-noise-ratio (RSNR), defined as $\mathrm{RSNR} = \|\mathbf{y}\|_2^2 / \|\mathbf{y} - \hat{\mathbf{y}}\|_2^2$, where $\mathbf{y}$ corresponds to the original (uncorrupted) signal and $\hat{\mathbf{y}}$ to the signal restored by AMP. Fig. 2(a) shows the RSNR for different maximum numbers of iterations $I_{\max}$; the regularization parameter has been optimized for each click rate. One can immediately see that AMP converges quickly; i.e., setting $I_{\max} = 20$ turns out to be sufficient for near-optimal performance for the considered click rates.
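The evaluation setup described above can be mimicked with two small helpers: one that computes the RSNR (reported here in decibels, which is an assumption about the plotted unit) and one that adds five-sample Gaussian pulses with a peak drawn uniformly from [-1, 1]. The pulse width parameter `sigma` and the rule that every sample starts a click with probability equal to the click rate are assumptions made for this sketch.

```python
import math
import random


def rsnr_db(original, restored):
    """Recovery SNR in dB: signal energy over restoration-error energy."""
    signal = sum(v * v for v in original)
    error = sum((a - b) ** 2 for a, b in zip(original, restored))
    return float("inf") if error == 0.0 else 10.0 * math.log10(signal / error)


def add_clicks(signal, click_rate, rng, sigma=1.0):
    """Add five-sample Gaussian pulses with peak uniform in [-1, 1]."""
    out = list(signal)
    offsets = (-2, -1, 0, 1, 2)
    pulse = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    for n in range(len(signal)):
        if rng.random() < click_rate:            # a click is centered at sample n
            peak = rng.uniform(-1.0, 1.0)
            for offset, weight in zip(offsets, pulse):
                if 0 <= n + offset < len(out):
                    out[n + offset] += peak * weight
    return out


rng = random.Random(1)
clean = [0.5 * math.sin(2.0 * math.pi * 440.0 * n / 44100.0) for n in range(2000)]
corrupted = add_clicks(clean, click_rate=1.0 / 200.0, rng=rng)
print(f"RSNR of the corrupted signal: {rsnr_db(clean, corrupted):.2f} dB")
```

Running a restoration algorithm on `corrupted` and passing its output to `rsnr_db` gives the kind of figure of merit used in the comparison.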
The impact of ET on the performance and complexity is studied in Fig. 2(b) . We see that, for , the number of average iterations is reduced while slightly degrading the RSNR. Lowering the ET threshold leads to a smaller reduction of average iterations but results in higher RSNR. Thus, carefully selecting enables one to reduce the complexity of AMP at no loss in terms of RSNR. For example, Fig. 2(b) shows that for , , and , only 16.2 iterations are necessary to achieve the same performance of without ET. Hence, in practice, ET can either be used to increase the average restoration throughput or for power reduction, e.g., by silencing the entire circuit during idle clock cycles.
IV. VLSI ARCHITECTURES OF THE AMP ALGORITHM
In this section, we present two novel VLSI architectures for the AMP algorithm. The first architecture, referred to as AMP-M, is a generic MAC-based solution that is applicable to arbitrary sparsity-based signal restoration and CS problems. The second architecture, referred to as AMP-T, is a generic solution for situations where multiplications of a vector with the matrix and its transpose can be carried out by a fast transform. While the computational complexity of AMP-M is dominated by matrix-vector multiplications scaling with the number of matrix entries, a fast transform computes the same operation with lower asymptotic complexity. We start by describing the architectural principles for AMP-M and AMP-T suitable for arbitrary sparse signal recovery applications and then derive corresponding optimized architectures for audio restoration.
A. AMP-M: MAC-Based AMP Architecture
The first architecture implements the matrix-vector multiplications on lines 4 and 6 of Algorithm 1 using a predefined number of parallel MAC units. This MAC-based architecture has the advantage of being suitable for arbitrary matrices, including unstructured (e.g., random) matrices or matrices obtained through dictionary learning [34], as used, e.g., in many signal restoration or de-noising problems. Moreover, if the matrix is explicitly stored in a memory, the matrices used in AMP-M can be reconfigured at run-time, without the need to redesign and re-implement the circuit.
1) Architecture: Fig. 3(a) shows the high-level block diagram of the AMP-M architecture. The AMP algorithm requires memories to store the input signal $\mathbf{z}$, the residual $\mathbf{r}$, and the signal estimate $\mathbf{x}$ obtained after applying the thresholding function. The input vector and the residual can be stored in the same memory [referred to as ZR-RAM in Fig. 3(a)], i.e., each coefficient of $\mathbf{z}$ and $\mathbf{r}$ can be stored at the same address, which allows for memory instances having suitable address-to-wordlength ratios leading to small S-RAM macro cells. The signal estimate is stored in a separate memory, referred to as X-RAM. Depending on the application, the entries of the matrix are either stored in a RAM or ROM, or can be generated on the fly.
All matrix-vector multiplications are carried out in parallel MAC instances; the number of MAC units is configurable at compile time of the architecture and determines the maximum achievable throughput (see Section V-B). Pipeline registers are added at the multiplier inputs to increase the maximum achievable clock frequency. Each MAC unit is used to sequentially compute an inner product of a row of the matrix with the signal estimate or of a row of its transpose with the residual. Hence, each MAC unit requires access to a different entry of the matrix in each clock cycle, while the same vector entry is shared among all units [see Fig. 3(a)].
The RMSE is computed using a separate unit that is specialized for computing sums of squares. The subsequent square root computation is implemented using the approximation developed in [35], which requires neither multipliers nor look-up tables (LUTs). The RMSE is computed in parallel to the matrix-vector multiplication (line 4 of Algorithm 1). Note that the rather limited numerical accuracy of the deployed square-root approximation was found to be sufficient for our purposes (see the discussion in Section V-A).
To implement the thresholding function (3), we instantiated a subtract-compare-select unit that applies thresholding in a serial and element-wise manner (performed in the TRSH unit). The $\ell_0$-norm on line 5 of Algorithm 1 is computed in the L0-unit, which counts the nonzero entries of the signal estimate in a serial manner and concurrently to the matrix-vector multiplications. To avoid additional hardware resources, all remaining arithmetic operations, e.g., computation of the residual (line 6 of Algorithm 1), are performed using the available MAC units in a time-shared fashion.
2) Optimization for Audio Restoration: The main bottleneck of the AMP-M architecture is the memory bandwidth required to deliver the entries of the matrix to the parallel MAC units. For unstructured (e.g., random) matrices, (multi-port) LUTs, ROMs, or on-chip S-RAMs can be used for small dimensions (in the order of a few hundred kbit). For large-dimensional problems, external memories (e.g., off-chip D-RAMs) become necessary, which shifts the memory bottleneck to the bandwidth of the external memory interface. Fortunately, in many real-world applications, the matrix is highly structured; hence, it is often possible to generate its coefficients on the fly at very high throughput. Specifically, for the DCT-identity pair often used in audio restoration, we can avoid the explicit storage of all entries of the DCT matrix (which would result in prohibitively large on-chip memory). Instead, we exploit the regular structure of the DCT matrix and make use of symmetries to generate the matrix at high throughput by using a small cosine LUT. The LUT address is calculated on the basis of the row and column of the required DCT entry. The parallel LUT outputs (or their negative values) are directly fed to the MAC units. Thus, instead of a multi-port memory for explicitly storing the DCT matrix, only the LUT entries need to be stored; this results in a 1024× memory-size reduction for a block size of 512 samples. The multiplications with the identity basis obviously do not require any memory and are implemented by simple control logic.
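The idea of generating DCT entries from a small cosine table can be illustrated as follows. The sketch stores one quarter period of the cosine and maps the row and column index of the requested entry to a table address and a sign. The table size of M + 1 entries and the addressing rule are assumptions chosen for clarity; the LUT size and address logic of the actual design are not specified in this section and may exploit further symmetries.

```python
import math


def build_cos_lut(M):
    """Quarter-wave table holding cos(pi * q / (2 * M)) for q = 0..M."""
    return [math.cos(math.pi * q / (2 * M)) for q in range(M + 1)]


def dct_cosine(k, l, M, lut):
    """Return cos(pi * (2l + 1) * k / (2M)) for 0-based k, l using only the LUT."""
    p = ((2 * l + 1) * k) % (4 * M)      # phase index; the cosine has period 4M
    if p <= M:                           # first quadrant: direct look-up
        return lut[p]
    if p <= 2 * M:                       # second quadrant: mirrored and negated
        return -lut[2 * M - p]
    if p <= 3 * M:                       # third quadrant: shifted and negated
        return -lut[p - 2 * M]
    return lut[4 * M - p]                # fourth quadrant: mirrored


M = 8
lut = build_cos_lut(M)
worst = max(
    abs(dct_cosine(k, l, M, lut) - math.cos(math.pi * (2 * l + 1) * k / (2 * M)))
    for k in range(M) for l in range(M)
)
print(f"LUT entries: {len(lut)}, matrix entries: {M * M}, max deviation: {worst:.2e}")
```

The unitary scaling factors of the DCT are omitted here; in a MAC-based datapath they could be applied once per output rather than per matrix entry.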
B. AMP-T: Transformation-Based AMP Architecture
While the AMP-M architecture is well suited for unstructured matrices, small-scale problems, or applications for which the matrix must be reconfigurable at run time, the complexity scaling, storage requirements, and memory bandwidth requirements (which are roughly proportional to the number of entries of the matrix) render its application difficult for throughput-intensive and/or large-scale problems. Fortunately, in many practical applications the matrix has a fast transform, e.g., the fast Fourier, DCT, Hadamard, or wavelet transform (or combinations thereof), which allows for the design of more efficient VLSI implementations. The AMP-T architecture described next exploits these advantages.
1) Architecture: Fig. 3(b) shows the high-level block diagram of the AMP-T architecture. The structure of AMP-T is similar to that of the AMP-M architecture, apart from the following key differences.
• No storage for the matrix or logic to generate its entries on the fly is required.
• The parallel MAC units have been replaced by a specialized fast transform unit, which must support both the fast forward transform and its inverse.
• The residual, which was calculated in the MAC units in the AMP-M architecture, is now computed in a dedicated unit (referred to as R-CALC); this unit only consists of a small multiplier and a few adders.
• The RMSE is calculated simultaneously to the fast forward transform, whereas the $\ell_0$-norm is computed simultaneously to the fast inverse transform.
The architecture for carrying out the fast forward transform and its inverse heavily depends on the used transform and algorithm. Hence, AMP-T is less flexible compared to AMP-M, as the transform unit must be redesigned for each target application. However, as shown in Section V, AMP-T substantially improves upon AMP-M in terms of throughput, silicon area, and power consumption.
2) Optimization for Audio Restoration: For the audio restoration application considered in this paper, we use an architecture implementing a fast DCT (FCT) and its inverse (IFCT). The corresponding algorithm and the resulting VLSI architecture are detailed in Section IV-C. The additions required to implement the identity basis are carried out in the R-CALC unit. The X-RAM has been divided into two memories to support parallel access, which enables fast thresholding.
C. VLSI Implementation of the FCT/IFCT
Existing VLSI implementations of a fast DCT/IDCT have mainly been designed for MPEG-2 video compression, which relies on problems of size 8 × 8 (see, e.g., [36]). For the targeted audio restoration application, however, the problem size is $M = 512$, for which, to the best of our knowledge, no VLSI architecture has been described in the open literature. To this end, we next evaluate potential algorithms for efficiently computing a large-dimensional FCT/IFCT and then, we detail the architecture used in the final AMP-T implementation.
1) Algorithm Evaluation: A variety of algorithms to compute the FCT/IFCT have been proposed in the literature [37]-[40]. A high-level comparison of some of the most prominent algorithms is provided in Table I. We consider the algorithm's regularity, memory requirements, and computational complexity. While the computational complexity is a key metric for most implementations, regularity and low memory requirements are of similar importance when designing dedicated VLSI circuits. As a reference, we compare all candidate algorithms to a straightforward matrix-vector multiplication-based DCT/IDCT approach.
The algorithm proposed in [37] directly performs divide-and-conquer on the DCT matrix to achieve very low computational complexity. This algorithm, however, exhibits an irregular data flow and corresponding architectures cannot easily be parametrized to support different problem sizes. Another direct approach is the recursive method proposed in [38], which is more efficient than [37] (in terms of operation count and memory), but still lacks a regular data flow. Another line of fast DCT algorithms relies on the well-established FFT. A straightforward approach is based on an FFT of larger size, which exhibits high regularity and requires almost no overhead for the DCT-to-FFT conversion [39]. An improved algorithm relying on an $M$-dimensional FFT only, where $M$ denotes the DCT length, was proposed in [40]. This approach exhibits lower complexity while causing only a small conversion overhead. An even faster method replaces the real-valued FFT by a complex-valued $M/2$-dimensional FFT followed by a few additional computations [40]. This approach reduces the computational complexity and memory requirements compared to the $M$-FFT approach, while maintaining high regularity. Hence, we decided to implement this $M/2$-FFT-based algorithm in the remainder of the paper.
2) $M/2$-FFT-Based FCT/IFCT Algorithm: The $M/2$-FFT approach described in [40] is summarized in Table II and performs the FCT and IFCT in multiple steps. For the FCT, the entries of the real-valued input vector of length $M$ are first reordered and stored to a second vector. Then, the reordered vector is converted into a complex-valued vector of half the length. The main task of the FCT algorithm is to compute an $M/2$-length FFT of this complex-valued vector. The result is expanded into a conjugate-symmetric vector, which corresponds to an $M$-point FFT of the real-valued reordered vector. To obtain the result of the FCT, the entries of the expanded vector are rotated by certain twiddle factors (as used in the FFT), defined in (5), followed by extracting the real value of the rotated entries. The procedure for the IFCT is analogous to that of the FCT; see Table II for the details.
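The phases named above (reorder, reduce, FFT, expand, rotate) can be traced in the following plain-Python sketch, which computes an unnormalized DCT-II through a single complex FFT of half the length and checks it against a direct evaluation. It follows the well-known FFT-based construction for real-valued inputs rather than Table II, which is not reproduced here; the function names, the recursive radix-2 FFT, and the omission of the unitary scaling are assumptions made for illustration.

```python
import cmath
import math


def fft(a):
    """Recursive radix-2 FFT for a list whose length is a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def fct(x):
    """Unnormalized DCT-II of a real list of length M (power of two, M >= 2)."""
    M = len(x)
    H = M // 2
    # Reorder: even-indexed samples first, odd-indexed samples reversed.
    v = x[0::2] + x[1::2][::-1]
    # Reduce: pack pairs of real samples into H complex samples.
    t = [complex(v[2 * n], v[2 * n + 1]) for n in range(H)]
    T = fft(t)                                   # single H-point complex FFT
    # Expand: recover the M-point FFT of the real vector v.
    V = [0j] * M
    for k in range(H):
        Tc = T[(H - k) % H].conjugate()
        ev = 0.5 * (T[k] + Tc)                   # FFT of even-indexed entries
        od = -0.5j * (T[k] - Tc)                 # FFT of odd-indexed entries
        V[k] = ev + cmath.exp(-2j * cmath.pi * k / M) * od
    V[H] = complex(T[0].real - T[0].imag, 0.0)
    for k in range(H + 1, M):
        V[k] = V[M - k].conjugate()              # conjugate symmetry
    # Rotate by twiddle factors and keep the real part.
    return [(cmath.exp(-1j * cmath.pi * k / (2 * M)) * V[k]).real for k in range(M)]


def dct_direct(x):
    """Reference: direct evaluation of the same unnormalized DCT-II."""
    M = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * M))
                for n in range(M)) for k in range(M)]


x = [math.sin(0.3 * n) + 0.1 * n for n in range(16)]
dev = max(abs(a - b) for a, b in zip(fct(x), dct_direct(x)))
print(f"max deviation from direct DCT: {dev:.2e}")
```

All phases other than the FFT touch each entry only once, which is consistent with the remark that the conversion overhead is small compared to the transform itself.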
3) Architecture: We now detail the FCT/IFCT architecture used in AMP-T for audio restoration. To process a stereo 192 kHz audio signal with a block size of $M = 512$ samples and 16 samples overlap, restoration of one block must be completed within 1.29 ms. Since $I_{\max} = 20$ iterations are carried out per block, each requiring one FCT and one IFCT, an FCT/IFCT operation must be computed in no more than 32.3 µs. Fig. 4 shows the high-level block diagram of an FCT/IFCT architecture achieving the specified throughput in 65 nm CMOS technology. The input vectors and intermediate results are stored in a single-port S-RAM with complex-valued entries. The address generator computes the FFT addressing scheme proposed in [41] and also controls the operations carried out during the other phases of the algorithm.
We perform a complex-valued in-place FFT/IFFT using a single radix-2 butterfly in a time-shared fashion. With a single memory access per clock cycle, each butterfly operation requires four clock cycles. The complex-valued multiplier in the butterfly is implemented using four real-valued multipliers and two adders. All the additional operations (carried out in the reorder, reduce, expand, and rotate phases) are also calculated on the same butterfly unit by reusing the existing arithmetic circuitry. A dedicated twiddle generator unit is used to provide the necessary factors for the FFT as well as for the FCT/IFCT steps in Table II . This unit contains a real-valued LUT with 512 entries; the real and imaginary parts of the twiddle factors are assembled from two consecutive table look-ups.
The resulting architecture is able to compute an FCT/IFCT in 8 200 clock cycles, which is sufficiently fast for the targeted audio restoration assuming a clock frequency of at least 255 MHz. We note that for applications requiring substantially higher throughput, such as for sparse signal recovery of high-resolution images or videos, significantly faster FFT/IFFT architectures become necessary; this can be achieved by parallel and higher-order butterfly units, as well as by using parallel multi-port S-RAM macro cells. The FCT/IFCT architecture developed here enables us to achieve the specified throughput at minimum silicon area and, hence, was chosen for the VLSI designs described next.
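The timing figures quoted for the FCT/IFCT unit can be cross-checked with a short calculation. The block size of 512 samples and the 20 iterations per block used below are assumptions that reproduce the stated 1.29 ms and 32.3 µs; all operations other than the two transforms per iteration are neglected.

```latex
\begin{align*}
T_{\text{block}} &= \frac{512 - 16}{2 \cdot 192\,000\ \text{s}^{-1}}
                  = \frac{496}{384\,000\ \text{s}^{-1}} \approx 1.29\ \text{ms},\\[2pt]
T_{\text{FCT}}   &\le \frac{T_{\text{block}}}{2 \cdot 20} \approx 32.3\ \mu\text{s},\\[2pt]
f_{\text{clk}}   &\ge \frac{8200\ \text{cycles}}{32.3\ \mu\text{s}} \approx 254\ \text{MHz}.
\end{align*}
```

The result of roughly 254 MHz is in line with the clock frequency of at least 255 MHz stated above.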
V. IMPLEMENTATION RESULTS
In this section, we provide reference VLSI and FPGA implementation results of AMP-M and AMP-T for the audio restoration application described in Section III-B. We emphasize that these implementation results also reflect the performance, implementation complexity, and power consumption of the proposed VLSI designs for signal recovery from CS measurements.
A. Fixed-Point Parameters for Audio Restoration
In order to optimize the hardware efficiency (in terms of area per throughput) and the power dissipation, fixed-point arithmetic is employed in our AMP architectures. For the targeted audio restoration application, the most critical word length of the AMP-T architecture resides in the FCT/IFCT unit. To achieve the performance of a floating-point implementation with a 16 bit quantized audio input/output, 26 bits are sufficient for each of the real and imaginary parts in the FFT/IFFT block. Another important word-length parameter is the accuracy of the RMSE and the resulting thresholding parameter. Simulation results have shown that 11 bits are sufficient to represent this parameter while still achieving the full RSNR performance with respect to the original audio signal. Therefore, we employ the fast and low-area square-root approximation in [35] in this computation. The Z-RAM uses a word width of 16 bits; the XU-, XL-, and R-RAM use 26 bits. The AMP-M architecture requires the same memory word lengths as AMP-T; the accumulator of the MAC units in AMP-M, however, requires 30 bits to achieve the performance of AMP-T. The implementation loss of both AMP architectures is less than 0.013 dB RSNR compared to their floating-point models.
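To make the role of the word lengths concrete, the following Python sketch quantizes a real value to a signed fixed-point format with saturation. It is a generic model only: the function name and the split between integer and fraction bits are hypothetical, since the text specifies total word widths (e.g., 16 and 26 bits) but not the position of the binary point.

```python
def quantize(value, word_bits, frac_bits):
    """Quantize value to a signed two's-complement fixed-point number.

    word_bits: total word width, including the sign bit
    frac_bits: number of fraction bits (ASSUMPTION: chosen by the caller)
    Returns the quantized value as a float. Rounding uses Python's round(),
    i.e., round-half-to-even; out-of-range values saturate.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (word_bits - 1))
    highest = (1 << (word_bits - 1)) - 1
    code = round(value * scale)
    code = max(lowest, min(highest, code))
    return code / scale


if __name__ == "__main__":
    x = 0.123456789
    for bits in (11, 16, 26):
        q = quantize(x, bits, bits - 1)  # hypothetical format: sign + fraction
        print(bits, "bits:", q, "error:", abs(q - x))
```

For in-range inputs, the quantization error of such a format is bounded by half of the least significant bit, which is why widening the FFT/IFFT words from 16 to 26 bits reduces the accumulated rounding noise.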
In the final designs, the regularization parameter and the ET threshold are both configurable at run-time: the regularization parameter is tunable to the values {0.125, 0.25, 0.5, 1, 2, 4}, whereas the ET threshold can be set to any positive number representable by 13 fraction bits.
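Because the admissible regularization values are all powers of two, scaling by the regularization parameter can in principle be realized with bit shifts rather than a multiplier. The sketch below illustrates this idea on integer fixed-point codes together with a soft-thresholding function. It rests on two assumptions that are not confirmed by the text above: that the threshold is formed as the regularization parameter times the RMSE, and that the hardware uses shifts for this scaling. All names are hypothetical.

```python
# Exponents e such that 2**e is one of {0.125, 0.25, 0.5, 1, 2, 4}.
ALLOWED_EXPONENTS = (-3, -2, -1, 0, 1, 2)


def scale_by_power_of_two(code, exponent):
    """Multiply a non-negative integer code by 2**exponent using shifts."""
    if exponent not in ALLOWED_EXPONENTS:
        raise ValueError("unsupported regularization value")
    if exponent >= 0:
        return code << exponent
    return code >> -exponent  # truncates the bits shifted out


def soft_threshold(code, theta):
    """Soft-thresholding: shrink the magnitude by theta, clip at zero."""
    magnitude = abs(code) - theta
    if magnitude <= 0:
        return 0
    return magnitude if code > 0 else -magnitude


if __name__ == "__main__":
    rmse_code = 40                                # hypothetical RMSE code
    theta = scale_by_power_of_two(rmse_code, -3)  # regularization 0.125
    print([soft_threshold(v, theta) for v in (-10, -3, 0, 4, 12)])
```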
B. Comparison of AMP-M and AMP-T
In order to compare the hardware complexity of AMP-M and AMP-T, we synthesized both architectures in a 65 nm CMOS technology. The target throughputs for both designs were set such that real-time restoration of audio signals can be performed with different sampling rates up to 384 ksample/s, which corresponds to a high-quality 192 ksample/s stereo signal. For AMP-M, the throughput can be increased by instantiating more parallel MAC units. In order to process a single audio channel with 48 ksample/s in real-time, four parallel MAC units are required; processing 384 ksample/s in real-time necessitates 32 parallel MAC units. The throughput of AMP-T can be adjusted (up to a certain speed) by reducing the critical path of the AMP-T architecture during synthesis. Fig. 5 shows the standard cell and memory area (in mm²) after synthesis of AMP-M and AMP-T. The number of required MAC units in AMP-M is annotated in parentheses. When targeting a high throughput, the silicon area of AMP-T is substantially smaller than that of AMP-M. The reason for this behavior is that implementing the FCT/IFCT rather than using MAC units to perform matrix-vector multiplications requires substantially fewer operations and, therefore, fewer hardware resources. This behavior is reflected in the number of clock cycles required by AMP-M and AMP-T, which can be approximated as follows:
Here, is the number of parallel MAC units. Since, for large , AMP-M scales approximately with and AMP-T with , we conclude that the transform-based architecture is both faster and more efficient (in terms of area and power consumption) for large-scale problems.
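The difference in scaling between a MAC-based and a transform-based architecture can be illustrated with textbook operation counts. The following sketch does not reproduce the cycle-count approximations referred to above; it uses generic models as assumptions: a square matrix-vector product spread over a given number of parallel MAC units, and a radix-2 FFT executed on a single butterfly at four cycles per butterfly. The function names are hypothetical.

```python
def matvec_cycles(rows, cols, mac_units):
    """Cycles for one rows-by-cols matrix-vector product, assuming each
    MAC unit performs one multiply-accumulate per cycle."""
    total_macs = rows * cols
    return (total_macs + mac_units - 1) // mac_units  # ceiling division


def fft_cycles(n, cycles_per_butterfly=4):
    """Cycles for an n-point radix-2 FFT on one time-shared butterfly.

    n must be a power of two; there are (n / 2) * log2(n) butterflies.
    """
    stages = n.bit_length() - 1
    return cycles_per_butterfly * (n // 2) * stages


if __name__ == "__main__":
    mac_units = 32
    for n in (256, 512, 1024, 2048, 4096):
        mv = matvec_cycles(n, n, mac_units)
        ft = fft_cycles(n)
        print(f"N={n:5d}  matvec={mv:8d}  fft={ft:8d}  ratio={mv / ft:.2f}")
```

Under these simplified models, the ratio between the two cycle counts grows steadily with the problem dimension for a fixed number of MAC units, which matches the qualitative conclusion that the transform-based approach is favorable for large-scale problems.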
C. ASIC Implementation
To demonstrate the efficacy of AMP-M and AMP-T, we designed a reference ASIC including both designs for real-time audio restoration in 1P8M 65 nm CMOS technology. The target is to process 192 ksample/s stereo audio signals with 16 samples overlap between adjacent blocks; this requires an AMP throughput of 396 ksample/s. Fig. 6 shows the corresponding chip layout, where we highlighted both designs and their main processing blocks. The corresponding post-layout results are summarized in Table III. A detailed area and power breakdown of both ASIC designs is provided in Table IV.
From Table III , we see that both designs achieve the specified target throughput of 396 ksample/s. AMP-M runs at a higher clock frequency of 333 MHz compared to 256 MHz for AMP-T, since more pipelining stages are used in AMP-M. We furthermore observe that AMP-T is roughly five times smaller than AMP-M. Note that AMP-M requires less memory compared to AMP-T, which is due to the facts that 1) we do not store the DCT matrix in AMP-M but compute its entries on the fly, and 2) AMP-T requires an additional memory (compared to AMP-M) within the FCT/IFCT unit.
The power figures shown in Table III and Table IV are extracted from post-layout simulations using node activities obtained from simulations with actual audio data at maximum clock frequency, 1.2 V core voltage, and at 298 K. AMP-T turns out to be roughly 7× more energy efficient than AMP-M in terms of energy per sample, which highlights the effectiveness of the AMP-T design.
From Table IV we see that the FCT/IFCT unit of AMP-T occupies almost 3/4 of the overall circuit area. The remaining blocks, i.e., RMSE calculation and thresholding, make up for around 1/4 of the design. In the AMP-M ASIC, almost 2/3 of the circuit area is occupied by the 32 parallel MAC units and almost 1/3 is required to generate the entries of the DCT matrix on-the-fly.
D. FPGA Implementation
In addition to the ASIC design shown above, we mapped both AMP architectures to Xilinx Spartan-6 FPGAs, which are fabricated in a 45 nm low-power CMOS technology. To improve the throughput of both architectures (compared to a straightforward implementation), we performed optimizations for the underlying FPGA structure. The basic parameters, such as the number of MAC units in AMP-M or the FCT architecture in AMP-T are, however, equivalent to those of the ASIC design. We also included an AC'97 audio interface to process audio signals from analog audio sources in real time. The interface consists mainly of control circuitry and in-/output buffers to implement windowing and overlapping.
1) AMP-T Optimization:
For the AMP-T architecture, we replaced the single-port S-RAMs with dual-port memories, since dual-port block RAMs are readily available in the FPGA considered here. This modification enables us to compute each butterfly operation in two clock cycles (compared to four cycles required by the architecture used in the ASIC), resulting in a 2× speed-up of the FCT/IFCT. The number of clock cycles required by this modified AMP-T architecture is approximately which leads to an overall throughput increase of 40% compared to the architecture used in the AMP-T ASIC at almost no increase in FPGA logic complexity.
2) AMP-M Optimizations: In the AMP-M architecture, the synthesized LUT in the matrix generator is replaced by 16 dual-port ROMs. Moreover, additional pipeline registers are introduced after the multipliers, which increases the maximum clock frequency by 20% while slightly increasing the processing latency (i.e., less than 1%).
3) Comparison:
The FPGA implementation results of the two optimized designs are shown in Table V. Note that AMP-T can be mapped to a very small XC6SLX9 FPGA, whereas AMP-M requires the much larger XC6SLX75 FPGA.
The optimized AMP-T architecture is able to process stereo signals with the standard sampling rate of 44.1 ksample/s. Despite the larger FPGA, AMP-M achieves only half the throughput, which, however, still allows us to process a single audio channel in real time. For stereo processing, the number of MAC units must be doubled, which would require an FPGA of twice the logic capacity. Note that the audio interface requires only 140 slices, 2 RAM blocks, and a single DSP slice.
Note to Table V: Power consumption is measured on a Digilent Atlys platform featuring an XC6SLX45 FPGA (a down-sized version of AMP-M with 4 MAC units was measured; the power figures were scaled to 32 MAC units). Since other devices are connected to the same power supply, only the difference between an active and inactive AMP core is reported.
We additionally conducted power measurements using the integrated power monitor of a Digilent Atlys prototype board. Stereo audio data is fed into the line-in port, sampled at 44.1 ksample/s, restored using AMP, and then fed to a digital-to-analog converter. The resulting power consumption and energy efficiency are reported in Table V, which demonstrates that AMP-T is roughly 10 times more energy efficient than AMP-M. Hence, if the flexibility advantage of AMP-M is not required, then the AMP-T architecture is the preferred solution for FPGA implementations with respect to complexity, throughput, and power consumption.
We emphasize that the FPGA designs of AMP-T and AMP-M achieve only 1/4 and 1/8, respectively, of the throughput of the corresponding ASIC designs. An even more pronounced difference can be observed in terms of power efficiency. Specifically, the ASIC designs outperform the FPGA implementations by a factor of 17 and 24 for AMP-T and AMP-M, respectively.
E. Comparison With Existing Sparse Signal Recovery Circuits
We finally compare both AMP designs to the ASIC implementations of MP and OMP presented in [15] for channel estimation in 3GPP-LTE. A direct comparison is difficult, because the ASICs in [15] perform sparse signal recovery in 0.5 ms of signals with roughly 12-18 significant entries and problems of dimension 200 × 256; moreover, both applications have different precision requirements.
Nevertheless, by scaling the required operations per time unit of the MP and OMP implementation using results of [15, Table I] to the throughput required by the single-channel audio restoration problem considered here, the estimated circuit area of OMP is more than 4× larger than that of AMP-M, which is mainly caused by the complexity required by LS estimations for the high sparsity levels typically arising in audio signals or images. For signals having very low sparsity levels (as is the case in sparse channel estimation, for example), however, OMP is likely to be more efficient than AMP.
The scaled circuit area of MP is only half that of AMP-T, but MP delivers inferior performance when used for signal restoration or CS applications with strong undersampling. Nevertheless, MP remains a valid low-complexity alternative to AMP in applications where suboptimal sparse signal recovery performance can be tolerated.
VI. CONCLUSION
Of the two generic VLSI architectures of the AMP algorithm for sparse signal recovery presented in this paper, the first, referred to as AMP-M, was shown to be suitable for the recovery of signals acquired by CS or signal restoration problems relying on unstructured (e.g., random or learned) matrices. The second architecture, referred to as AMP-T, is able to exploit fast transforms, which significantly reduces circuit area and power dissipation compared to AMP-M. To demonstrate the suitability of AMP for real-time sparse signal recovery in dedicated hardware, we have implemented both architectures in 65 nm CMOS technology for a high-rate audio restoration application. Moreover, we demonstrated the real-time restoration capabilities of both architectures using an FPGA prototype implementation.
There are many avenues for future work. A theoretical performance analysis of AMP in the presence of fixed-point arithmetic and ET is a challenging open research topic. On the VLSI implementation side, developing an AMP-T architecture suitable for real-time recovery of images or videos from CS measurements is part of ongoing work. 
