Compressive sensing (CS) is perceived as a breakthrough in sampling theory, as it proves that sparse signals can be reconstructed from fewer samples than required by the Shannon-Nyquist theorem. However, CS can hardly be applied to real-time applications because reconstruction algorithms are computationally demanding. To tackle this problem, in this paper, we propose a new high-speed architecture for implementing the orthogonal matching pursuit (OMP), which is one of the most popular algorithms for CS reconstruction. Specifically, a novel pipelined systolic architecture and an optimized scheduling strategy are proposed. From the synthesis results, we find that the proposed design takes 1.638 µs to reconstruct 16-sparse signal, which is 19.2 times faster than the existing VLSI implementation of the OMP.
Introduction
Over the past decade, compressive sensing (CS) has became popular in acquisition of sparse signals (i.e. signals that have only a few non-zero values) [1, 2, 3] , as it allows for reconstruction of such a signal from much fewer measurements than required by the conventional sampling at the Nyquist rate. Therefore, CS makes it possible to dramatically reduce resource consumption, processing time, and power consumption required for data acquisition, transmission, and manipulation in diverse applications.
One of CS applications is in medical image acquisition. The sparsity property of a magnetic resonance image (MRI) made the CS technique a successful application in improving the image quality through the reduction of the measurement amount [1] . As for application in photography, CS allows enhancement in the acquisition speed of the single-pixel Terahertz camera [2] . In high-speed applications like the 1080p HD video, CS helps relieve the burden on data acquisition process, as the number of acquired datum is significantly fewer [3] . These high resolution signal processing applications require an ultrafast reconstruction to meet their real-time requirements.
Orthogonal matching pursuit (OMP) is one of the most widely used reconstruction algorithms for the CS. In [4] , hardware solution is presented to accelerate the computation for OMP. Modified versions of the original OMP have also been proposed to speed up the algorithm in [5] . However, the OMP is not well suited for pipelined hardware architectures because of its iterative nature. It is also difficult to develop a high-throughput implementation of the OMP to meet real-time requirements, as the algorithm is based on matrix computations of high computational complexity.
In this paper, we propose solutions to these problems: 1) a novel pipelined systolic architecture for the OMP, 2) new processing elements that facilitate pipelining, and 3) a scheduling strategy for pipelined iterations. In the proposed design, all stages are designed to consume the same number of clock cycles in order to avoid cycle wasting and to maximize resource utilization.
The rest of this paper is organized as follows. Section 2 presents brief reviews of compressive sensing technique and orthogonal matching pursuit algorithm. Section 3 describes the proposed pipeline architecture with newly designed processing elements. Synthesis results and comparison between the proposed design and other existing implementations are discussed in Section 4. Finally, conclusions are given in Section 5.
2 Background review 2.1 Compressive sensing Compressive sensing, since introduced by Candes et al. [6] in 2006, has been widely used in various applications of signal processing. CS is built upon the fundamental that signal can be represented in a proper domain or basis using only a few non-zero coefficients. Nonlinear optimization then can be used to recover such compressively sensed signals. The recovery is possible as long as the signal is sparse in the represented basis. Assume that signal f is sparse in the basis È with the sparsity representation x, then f can be described by the equation
where f 2 R N , È 2 R NÂN and x 2 R N . As most of the energy is distributed only on a few coefficients, the other ones can be zeros; yet the signal is not distorted. Hence, x will be mostly zero. Acquisition process of the signal vector f via linear transformation with measurement matrix É results in measurement vector y. That is
(2) is an underdetermined system of linear equations. Minimization of the L pnorm is the solution to such problems. Solution for Eq. (2) by L 2 minimization can be expressed as
The problem of Eq. (3) can be solved by greedy iterative methods. For hardware implementation, greedy methods such as matching pursuit (MP) and orthogonal matching pursuit (OMP) are more preferable for their simplicity, high speed of reconstruction and low implementation cost [7, 8, 9] .
Orthogonal matching pursuit
Orthogonal matching pursuit is the derivative of the original matching pursuit algorithm, and the algorithm is summarized in Table I . The reconstruction matrix Â and measurement vector y are the inputs and the sparse reconstruction x is the output of the algorithm. OMP is iterative algorithm and the number of iterations depends on the signal sparsity K. In each iteration, one column of Â which has the most correlation with the residual y is chosen and then its effect is removed to create new residual. A new value is reconstructed in each iteration, and the procedure is repeated until all the K values are reconstructed. The LS problem in
Step 3 can be solved by
which involves finding the inverse of matrix. In this paper, modified method of Cholesky factorization, known as LDL decomposition, is used to compute matrix inversion as it avoids using square-root operation.
3 Proposed novel pipeline architecture of OMP
Hardware architecture
In the existing hardware implementation of the OMP [8, 9] , a signal frame is generally processed as shown in Fig. 1 , using three circuits: the column selector, least-square problem (LSP) solver, and residue calculator. The largest and most complicated one is the LSP solver as it performs four sequential operations on matrices: computation, decomposition, inversion, and composition. In the existing implementation, a high-speed solver has been constructed based on a multifunctional systolic array, in which each cell computes four totally distinctive independent functions and stores the value separately inside the cell for each function. This leads to a tremendous overhead in the area. Furthermore, data stored in each cell is required for the next steps, which means that the last iteration of the current frame must be completed before the first iteration of the next frame begins. Therefore, the existing approaches, in which the algorithms has been coarsely partitioned among hardware blocks, is characterized by a limited throughput. In order to improve performance and save area, we propose a new 8-stages pipelined design of the OMP algorithm as shown in 
3. Solve least-squares (LS) problem x i ¼ arg min ky À Â i xk 2 . This is to find out the value of non-zero location in x. 4. Compute the residual r i ¼ y À Â i x i . This is what remains after removing contribution of i . Repeat step 1 to step 4 until i ¼ K. 
As matrix C is symmetric, only half the number of elements of C need to be computed using KðK þ 1Þ=2 MACs. Stage 3 computes the first half of the matrix multiplication and feeds the results into stage 4 with temporary values for final computation. Stages 3 and 4 can be completed in M=2 cycles. Stage 5 performs the LDL decomposition of matrix C in 2K À 1 cycles, in accordance with the following equations:
for 0 < m K, 0 < n < K, and m > n. Stage 6 performs triangular matrix inversion and matrix composition for the LSP solver. Let us denote A as the inverse matrix of L, that is, A ¼ L À1 , then its ðm; nÞth entry can be calculated as
in K À 1 cycles, and a mn ¼ 1 for m ¼ n. Then, C À1 can be computed as 
As L and D are triangular and diagonal matrices, respectively, their multiplication can be simply computed by the accumulation of DPs of their appropriate elements. Because all the three matrices of A T , D À1 , and A have the same size of K Â K, the matrix composition subpart takes K cycles to compute the final result. Thus, stage 6 requires 2K À 1 cycles in total. Stage 7 employs two matrix multipliers and one subtractor to compute the residual r ¼ y À Âx. As each multiplier requires K cycles to execute the multiplication, the total number of cycles consumed by stage 7 becomes 2K.
Processing elements
The structures of processing elements (PE) used from stage 0 to stage 7 are delineated in Fig. 3. Fig. 3(a) shows PEs used in stages 0, 3, and 7 to multiply matrices. Stages 1 and 4 compute matrix multiplication with accumulated values from stages 0 and 3, respectively, by using PEs shown in Fig. 3(b) . PEs in Figs. 3(c) and (d) are used to compute the elements of diagonal matrix D and elements on the lower-left half of triangular matrix L for stage 5, respectively. As elements that lie on the diagonal line of L will be 1 after inversion, only nondiagonal elements of L need to be computed in the matrix inversion subpart of stage 6 by using PEs shown in Fig. 3(e) . Finally, the matrix composition subpart of stage 6 is performed by PEs shown in Fig. 3(f ) . The detailed struture of each PE is described in the following subsections. 
LSP solver -matrix multiplication
The proposed design exploits the advantage of parallel processing as well as the dependency between PEs in order to accelerate the computational speed of the matrix multiplication. Fig. 4(a) shows the dataflow for matrix multiplication in stages 3 and 4, and Fig. 4(b) shows its detailed architecture. For each cycle, KðK þ 1Þ=2 PEs of each stage fetch corresponding data to compute the matrix multiplication. Two types of PEs for the first-half and the second-half multiplications denoted as PE fh and PE sh , respectively, are employed to perform the MAC operations. It should be known that the parallel processing with KðK þ 1Þ=2 PEs leads to a fast matrix multiplication.
LSP solver -LDL decomposition
Diagonal matrix D and triangular matrix L can be calculated by Eqs. (5) and (6), respectively, where the rightmost summation in Eq. (6) can be expressed in the accumulation form as
where
Eq. (9) can be realized by the systolic array with PE tm for triangular matrix and PE dm for diagonal matrix. Fig. 5(a) using the fetched-in c in , s in , and d in , and also computes the value of s out according to Eq. (9). The inner register after the multiplexer selectively stores either l in or l mn , then transmits to the output l out for the use by the next PE. PE dm first computes the value of diagonal element by Eq. (5) with s in received from PE tm and then produces d out and d À1 out . Note that the registers to store l out , s out , d out , and d À1 out need to be updated properly by the enable signals. Fig. 5(b) shows an example for computing l 41 , l 42 , l 43 , and d 44 in consecutive cycles to demonstrate the dataflow.
LSP solver -matrix inversion
Dataflow for the inversion of triangular matrix L represented as Eq. (7) is shown in Fig. 6(a) , and detail architecture and connection are described in Fig. 6(c) . The inner register after the subtractor in each PE inv is first initialized by l p , which is the same as l mn from the previous LDL decomposition stage, then each PE inv operates as a multiply-accumulator to compute a mn . l out and a out are transferred to the right PE inv and the lower PE inv , respectively for the subsequent computations. The inner registers to store l out and a out are updated properly by the control signal. Fig. 6(b) illustrates example dataflow between PE inv s for the computations of a 41 , a 42 , and a 43 . Fig. 7 shows the time scheduling of the proposed design where f, i, and s represent the frame, iteration, and stage indices, respectively. Each stage is designed to be completed in the same number of cycles (denoted as TS in Fig. 7 ), e.g., in 32 cycles for M ¼ 64, N ¼ 256, and K ¼ 16. Specifically, stage 0 receives frame 0 for the first iteration of the OMP and completes the processing in 32 cycles. While the processed frame 0 is delivered to stage 1, frame 1 starts to be fed into stage 0. Frames 2-7 in order are fed into stage 0 for the first iteration in a similar way, and in the meantime, frame 0 travels through all the eight stages in 32 Â 8 cycles and goes back to stage 0. Then, stage 0 takes frame 0 for the second iteration, rather than pulling a new frame 8, so that the first 8 frames are reconstructed first and the next 8 frames are reconstructed subsequently. The proposed pipelined systolic architecture can dramatically increase the throughput by reconstructing a frame every 32 cycles. On the other hand, in the existing design, a frame cannot be processed before finishing all iterations for the previous frame, so the throughput is much lower than in the proposed design.
Time scheduling

Synthesis results and comparison
The proposed architecture of OMP has been coded in Verilog hardware description language and synthesized using the Synopsys Design Compiler with the CMOS standard cell library [10, 11] . In Table II , we show the synthesis results of the proposed designs and existing designs [8, 9] in terms of the maximum frequency, clock period, number of cycles (NOC) per frame, reconstruction time, area, and area-delay-product (ADP). The second and third column of Table II show the synthesis results with 90 nm CMOS library of the proposed and existing design [8] when the input word length is 16 bits, M ¼ 64, N ¼ 256, and K ¼ 16. The proposed design operates at a maximum frequency of 312 MHz and completes the processing of every 256-sample frame in 32 3.2-nanosecond cycles. The number of iterations is 16 for a 16-sparse signal. Hence, the number of samples processed per second is 256=ð32 Â 3:2 Â 16 Â 10 À9 Þ ¼ 156:25 Â 10 6 whereas the existing design of [8] can process 256=ð474 Â 4:15 Â 16 Â 10 À9 Þ ¼ 8:133 Â 10 6 samples per second. As the input word length is set to 16, the proposed design can support a processing speed of up to 156:25 Â 10 6 Â 16 ¼ 2:5 Â 10 9 bits per second (bps), which is faster than the speed required for the reconstruction of a 1080p, 30 fps, 4:2:2 YCbCr HD video stream. The proposed design offers nearly 19 times higher throughput than the known approach, whereas the area overhead is Fig. 7 . Scheduling strategy for the proposed pipelining for the OMP. only 1.16 times; therefore, the area-delay product is 16.49 times lower than in the existing VLSI implementation of the OMP.
The proposed design has also been synthesized using different OMP parameters, i.e. M ¼ 64, N ¼ 256, and K ¼ 8 and different technology library of 65 nm CMOS cell library [11] in order to be compared with the previous OMP implementation of [9] , whose inversion matrix has been implemented by employing Q-R decomposition (QRD) process. Their synthesis results are listed in the fourth and fifth column in Table II . One of the significant drawback of [9] is that it is not reconfigurable, i.e. it does not allow for the reconstruction of input signals with different values of sparsity. Additionally, the hardware implementation is only suitable for those signals which have high sparsity and hence, not applicable for practical use. Compared to OMP architecture in [9] , our proposed design requires 45.68 times less reconstruction time for each signal frame with only 6.8 times larger in area, which leads to 6.79 times less area-delay product.
Conclusion
By developing a novel pipelined systolic architecture, we have shown that the OMP can be implemented as a high-throughput VLSI circuit. The fine-grained pipeline design is constructed from stages which consume the same number of clock cycles, so that time and resource consumption are optimized. Additionally, the newly designed PEs which are specialized for matrix computation bring about a significant area reduction of the whole design. In particular, the proposed design, compared with the existing solutions, can offer a 19.2 times faster reconstruction time at the price of a marginal 1.16 times area overhead, also 45.68 times faster speed with 6.8 times area overhead for the reconstruction of 16-sparse and 8-sparse signals, respectively. The proposed design is capable of reconstructing a compressively sensed 1080p, 30 fps HD video stream in real time, therefore, can be used in the real-time CS applications with stringent performance requirements. 
