Abstract: This work presents an electrocardiogram (ECG) compression processor for wireless sensors with configurable lossless and lossy data compression. The 9/7-M and 5/3 lifting wavelet transforms are employed for signal decomposition instead of traditional wavelets. A hybrid encoding scheme improves compression efficiency by encoding the higher scales of decomposed coefficients with a modified embedded zero-tree wavelet (EZW) method and the lowest scale with Huffman encoding. In addition, a transposable register matrix for coefficient buffering during EZW encoding lowers the processing frequency without extra register resources. Implemented in the SMIC 40 nm CMOS process, the processor has a total gate count of only 10.8K and consumes 92 nW at 0.5 V. It achieves a compression ratio of 2.71 for lossless compression and 14.9 for lossy compression with a percentage root-mean-square difference (PRD) of 0.39%.
Introduction
Wireless and wearable healthcare devices have attracted considerable attention over the past decade owing to the growing demand for real-time, continuous monitoring of physiological signals. Such monitoring is a prerequisite for disease prevention, diagnosis, and timely alarms based on big data analysis in the Internet of Things (IoT), which is of great importance to the evolution of the healthcare system. The electrocardiogram (ECG) is a physiological signal widely used for health monitoring and the diagnosis of cardiovascular diseases. However, because ECG monitoring runs around the clock and continuously produces sampled data for processing, data transmission dominates the total power consumption of such a wireless system [1].
Researchers have therefore focused on ECG signal compression to reduce the amount of data to transmit. For lossless compression, temporal methods such as linear predictors [2, 3] are commonly used because of their small hardware cost, but they yield only a limited compression ratio (CR) and a limited reduction in transmission power. Lossy compression algorithms such as [4], [5], and [6] achieve higher CR and greater power savings at the price of some signal distortion. These transform-based algorithms are usually carried out in software and require substantial resources for computation and data buffering. In hardware this can lead to a large resource overhead and high power consumption, which might offset or even exceed the power saved in transmission. A discrete wavelet transform (DWT) based ECG compression ASIC was proposed in [7], but its power consumption is still high due to the complicated bior4.4 wavelet computation and large memory use.
Moreover, because signal quality and power efficiency take different priorities in different situations, a design that offers only lossless or only lossy compression can be too rigid for practical applications. A combined lossless and lossy compression scheme based on a simple temporal method, the Fan algorithm, was put forward in [8], but it achieves only modest compression performance. Another DWT-based design for lossless and lossy compression was presented in [9]. Although a simpler wavelet basis is applied to reduce cost, the reconstructed signal quality drops quickly as CR increases. As in many transform-based methods, fractional arithmetic is also involved, which introduces rounding errors into the result to be coded, so the compression is only near-lossless.
In this work, we present the design of an ECG compression processor with configurable lossless and lossy compression based on the integer-to-integer lifting wavelet transform (LWT). The 9/7-M and 5/3 wavelets are applied at different scales of the LWT to obtain high decomposition performance with less resource overhead than traditional wavelets. Based on the characteristics of the decomposed coefficients, a hybrid encoding scheme is proposed that improves compression efficiency by applying a modified embedded zero-tree wavelet (EZW) method to the higher scales of coefficients and Huffman encoding to the lowest scale. In addition, the hardware is optimized with a transposable register matrix that stores coefficients orthogonally to achieve a low processing frequency and a balanced workload across clock cycles. Its low complexity and high compression efficiency make the proposed work a suitable choice for wireless monitoring systems.
Method
ECG compression processor
The block diagram of the proposed ECG compression design is illustrated in Fig. 1. The sampled ECG signal is fed into a 5-scale LWT for data decomposition. The highest 4 scales of decomposed coefficients are buffered for EZW scanning and adaptive region encoding. The lowest-scale coefficients are encoded with Huffman encoding, and only for lossless compression. The encoded data from the two encoding modules are then packaged into fixed-length words for transmission.
Data decomposition
In a traditional DWT such as bior4.4 (also known as 9/7), which is widely used for ECG compression [7, 10], fractional arithmetic is involved in calculating the coefficients. For a multi-scale DWT, either extra buffers must hold the fractional part of the intermediate results, or that part is discarded or rounded with information loss at each scale, which hinders its use for lossless compression. To overcome this, the LWT is applied in this work for data decomposition. The advantage of the LWT is that it is a reversible integer-to-integer transform [11], so no resolution is lost when scales are cascaded.
Considering resource overhead and decomposition performance, the 9/7-M and 5/3 lifting wavelets [11], which have short step sizes and light computation loads, are employed to perform a 5-scale LWT of the ECG signal. The 9/7-M wavelet is applied to the first and second scales, because samples in these scales are more closely correlated and the 9/7-M wavelet, with its longer step size, performs better on fast-changing parts of the signal such as the QRS complex. With down-sampling at each scale, however, the correlation decreases and 9/7-M performs no better than the 5/3 wavelet. We therefore apply the 5/3 wavelet to the higher three scales for its lower complexity. Table I shows the hardware resources required for the 9/7-M and 5/3 wavelets and compares them with those of the commonly used bior4.4 wavelet and the bior3.1 wavelet adopted in [9]. The bior4.4 wavelet requires a large buffer and extra multiplications, whereas the others use much simpler arithmetic. Although bior3.1 is also simple, it is not an integer-to-integer transform. To evaluate the decomposition performance of these wavelet bases, a 5-scale wavelet transform is performed with each and the results are compared in Fig. 2. Fig. 2(a) shows that more coefficients fall into small ranges for the combination of 9/7-M and 5/3, which makes encoding more efficient. Also, after applying the typical wavelet-based compression method [10] that sets a fixed percentage of coefficients to zero, the proposed method achieves reconstructed signal quality comparable to bior4.4 and higher than bior3.1, as depicted in Fig. 2(b). This high decomposition performance at low complexity is highly desirable for low-power wireless systems.
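To make the integer-to-integer property concrete, the following is a minimal sketch of one scale of a reversible 5/3 lifting step. It uses the standard predict/update form from the lifting literature; the function names, the restriction to even-length blocks, and the symmetric boundary extension are assumptions of this sketch and are not taken from the processor's FIFO-based streaming implementation.

```python
from typing import List, Tuple


def lwt53_forward(x: List[int]) -> Tuple[List[int], List[int]]:
    """One scale of the reversible 5/3 lifting transform (even-length input)."""
    assert len(x) >= 2 and len(x) % 2 == 0
    even, odd = x[0::2], x[1::2]
    half = len(odd)
    # Predict: detail = odd sample minus the floored mean of its even neighbours.
    # Assumption: symmetric extension at the right edge (last even sample reused).
    d = [odd[n] - ((even[n] + even[min(n + 1, half - 1)]) >> 1)
         for n in range(half)]
    # Update: approximation = even sample plus a rounded quarter of nearby details.
    # Assumption: symmetric extension at the left edge (first detail reused).
    a = [even[n] + ((d[max(n - 1, 0)] + d[n] + 2) >> 2) for n in range(half)]
    return a, d


def lwt53_inverse(a: List[int], d: List[int]) -> List[int]:
    """Undo lwt53_forward exactly by running the lifting steps in reverse."""
    half = len(a)
    even = [a[n] - ((d[max(n - 1, 0)] + d[n] + 2) >> 2) for n in range(half)]
    odd = [d[n] + ((even[n] + even[min(n + 1, half - 1)]) >> 1)
           for n in range(half)]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x
```

Because the inverse recomputes the same rounded terms and subtracts them, reconstruction is exact for any integer input, which is the property that allows scales to be cascaded without resolution loss.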
Hybrid encoding strategy
The 5-scale decomposed coefficients are encoded by a hybrid encoding strategy that involves two coding methods applied to different scales.
Modified EZW
EZW [12] is an efficient method for encoding wavelet transform coefficients that exploits the correlation between coefficients at different scales. We find it well suited to hardware implementation, because the coefficients can be mapped to a binary matrix and the encoding can be performed by bit-by-bit scanning. To further lower hardware overhead and improve compression efficiency, the algorithm is modified in several respects, as described below. Owing to the slow variation of the ECG signal and the high decomposition performance of 9/7-M, the detail coefficients of the first scale (D1), which constitute half of all coefficients, generally have small amplitudes and carry very little signal information. We therefore exclude them from EZW, which halves the resources needed for coefficient buffering at the cost of a minor information loss in signal reconstruction.
The higher 4 scales of coefficients (D2 to D5 and A5) are buffered in a binary matrix, with each coefficient represented by its absolute value and a sign bit, as shown in Fig. 3. With this binary format, the encoding can be done by scanning the matrix bit by bit from MSB to LSB. In the m-th loop, the m-th bit of each coefficient is scanned. If the scanned bit belongs to the searching list, it is coded according to the following rules: 1) If the scanned bit is 1, it is coded as significant (S). 2) If the scanned bit is 0 and the corresponding bits of all its descendants in the searching list are 0, it is coded as root (R). Fig. 3 shows the root-descendant relationship of the coefficients. 3) Otherwise it is coded as zero (Z). For a scanned bit belonging to the refinement list, the bit value itself is sent as the code. The searching list and the refinement list are updated in the same way as in traditional EZW, but only three types of code (S/R/Z) are possible, because the sign bit of each coefficient is coded as an extra bit following the LSB.
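The three coding rules can be sketched for a single bit plane as follows. The index layout (0 = A5, 1 = D5, 2-3 = D4, 4-7 = D3, 8-15 = D2), the parent link from A5 to D5, and the skipping of a root's descendants (as in traditional EZW) are assumptions of this sketch; the actual relationship is the one shown in Fig. 3, and refinement-list handling and sign bits are omitted.

```python
from typing import Dict, List, Set

N_COEF = 16  # assumed block: A5, D5, 2 x D4, 4 x D3, 8 x D2


def children(i: int) -> List[int]:
    """Hypothetical tree: A5 -> D5, and each D_j[k] -> D_(j-1)[2k], D_(j-1)[2k+1]."""
    if i == 0:
        return [1]
    return [c for c in (2 * i, 2 * i + 1) if c < N_COEF]


def descendants(i: int) -> List[int]:
    """All coefficients below index i in the assumed tree."""
    out: List[int] = []
    stack = children(i)
    while stack:
        c = stack.pop()
        out.append(c)
        stack.extend(children(c))
    return out


def scan_plane(mags: List[int], plane: int, searching: Set[int]) -> Dict[int, str]:
    """Code the searching-list coefficients of one bit plane as S, R or Z."""
    codes: Dict[int, str] = {}
    skipped: Set[int] = set()  # descendants already covered by a root
    for i in range(N_COEF):
        if i not in searching or i in skipped:
            continue
        if (mags[i] >> plane) & 1:
            codes[i] = "S"  # rule 1: scanned bit is 1
            continue
        desc = [j for j in descendants(i) if j in searching]
        if all(((mags[j] >> plane) & 1) == 0 for j in desc):
            codes[i] = "R"  # rule 2: bit is 0 and so are all descendants
            skipped.update(desc)
        else:
            codes[i] = "Z"  # rule 3: bit is 0 but some descendant is 1
    return codes
```

Note that a D2 coefficient has no descendants in this layout, so a 0 bit there is always coded as R, which is consistent with only S and R being possible for D2.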
To improve encoding efficiency, we propose an adaptive region encoding method that encodes S/R/Z according to the region separation of the binary matrix depicted in Fig. 3. In Region A, since A5 contains much of the signal energy and usually has the largest amplitude, S is assigned the shortest code. In Region B, because coefficient amplitudes generally decrease from higher scales to lower scales, R is the most common case and is coded with the highest priority. In Region C, since D2 has no descendants, only two cases (S/R) are possible, which requires only a 1-bit code. In addition, when the scan reaches the less significant bits and the sign bit, the coefficients remaining in the searching list have similar ranges, so they are treated as individual coefficients without descendants and coded like D2. With adaptive region encoding, the compression performance improves over both the original encoding [12] and entropy encoding based on the overall symbol probabilities, as shown in Fig. 4. The proposed method outperforms the other two methods by about 10% to 30% under different coding precisions.
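As an illustration of the idea, the sketch below maps S/R/Z to prefix-free bit patterns per region. The text only states which symbol is favoured in each region, so the concrete bit patterns, the table name `REGION_CODES`, and the ordering of the two longer codes are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical code tables: only the "shortest symbol" per region follows the text.
REGION_CODES: Dict[str, Dict[str, str]] = {
    "A": {"S": "0", "R": "10", "Z": "11"},  # S is the shortest code in Region A
    "B": {"R": "0", "S": "10", "Z": "11"},  # R is the shortest code in Region B
    "C": {"R": "0", "S": "1"},              # only S/R occur, so 1 bit suffices
}


def encode_symbols(symbols: List[Tuple[str, str]]) -> str:
    """Concatenate the region-dependent codes of (region, symbol) pairs."""
    return "".join(REGION_CODES[region][sym] for region, sym in symbols)
```

For example, `encode_symbols([("A", "S"), ("B", "R"), ("B", "Z"), ("C", "R")])` spends 1 + 1 + 2 + 1 bits, whereas a fixed 2-bit code per symbol would spend 8.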
Huffman encoding for D1
When lossless compression is required, the first-scale detail coefficients D1 that are excluded from EZW are encoded with Huffman coding. Because D1 usually has a small amplitude concentrated around zero, an entropy coding method such as Huffman coding is more efficient than EZW, and it is also less complex since no resources are needed for coefficient buffering.
The hardware architecture of the proposed ECG compression processor is shown in Fig. 5. The 5-scale LWT is implemented with multiple FIFOs and serial logic adders. Because the timing constraint on the LWT computation is loose owing to the low sampling rate of ECG signals (at most in the kHz range), data bypass is applied between scales to reduce the FIFO depth needed for coefficient computation, which saves about 10% of the resources for the LWT implementation. For EZW encoding, a 16×16-bit transposable register matrix holds the latest wavelet coefficients, with an FSM managing the writing and reading and several buffers controlling the scanning process. In addition, a packaging unit orders the encoded data from the EZW and Huffman coding modules and packages them into fixed-length words for storage and transmission.
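As a rough illustration of why Huffman coding suits the D1 coefficients described above, the sketch below builds a Huffman table from symbol frequencies. The text does not say whether the processor uses a fixed, precomputed table or which symbols it contains, so building the table from a sample block, and the function name `huffman_code`, are assumptions.

```python
import heapq
from collections import Counter
from typing import Dict, Iterable


def huffman_code(symbols: Iterable[int]) -> Dict[int, str]:
    """Build a prefix-free code in which frequent symbols get shorter codes."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)  # two lightest subtrees
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]
```

When the values cluster around zero, the zero symbol receives the shortest code, so the average code length falls below that of a fixed-width representation.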
Transposable register matrix
EZW scanning handles a binary matrix of 16 coefficients of 16 bits each. With one bit processed per cycle, a scan takes at most 16×16 cycles. EZW scanning can only start once all 16 coefficients are ready in the binary matrix, and the scan takes multiple cycles, so if new coefficients become ready before the current scan completes, they must be buffered elsewhere. A resource-saving solution is to complete the scan before the next valid coefficient is ready, as processing clock A shows in Fig. 6. The next coefficient (D2) is ready 4 sample clock cycles later, and the scan must be finished before then. For an ECG signal with sampling frequency f_s, the processing frequency would need to be f_p = 16 × 16 × f_s / 4 = 64 f_s. Such a high processing frequency is inefficient, because the logic works for only a short time and then sits idle. To balance the workload across clock cycles without extra resources for coefficient buffering, we propose a novel transposable register matrix with a time-sharing mechanism, as depicted in Fig. 7. Normally, the 16 coefficients are buffered in the register matrix at fixed locations and the scan goes from MSB to LSB, as in the case shown in Fig. 7(a). Once one bit has been scanned in all coefficients, the information in that column is no longer needed, and a newly arriving coefficient can be buffered there, as shown in Fig. 7(b). In this way, while the old coefficients are being scanned, the new coefficients are buffered in the same registers as the transpose of the original matrix. The register matrix is thus written alternately in the horizontal and vertical directions and is fully used. The processing clock only needs to ensure that one scanning loop is finished before that group of registers is required for a new coefficient. The processing frequency can therefore be set to f_p = 16 × f_s / 2 = 8 f_s to balance the workload across cycles, as processing clock B shows in Fig. 6. Each scanning loop is allocated exactly 16 cycles, one for each of the 16 bits it processes.
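The alternating row/column use of the register matrix, together with the two processing-frequency formulas, can be modelled in software as below. This is a behavioural sketch only: the class and function names are hypothetical, and it assumes for simplicity that exactly one new coefficient is written after each scanned bit plane, whereas in the processor the coefficients arrive at scale-dependent times.

```python
from typing import List

N = 16  # 16 coefficients x 16 bits, as in the text


def to_bits(value: int) -> List[int]:
    """A 16-bit word as a list of bits, MSB first."""
    return [(value >> (N - 1 - b)) & 1 for b in range(N)]


class TransposableMatrix:
    """Model of a register matrix written alternately by rows and by columns."""

    def __init__(self, first_block: List[int]) -> None:
        self.m = [to_bits(v) for v in first_block]  # coefficient i in row i
        self.row_major = True

    def scan_and_refill(self, next_block: List[int]) -> List[List[int]]:
        """Read the 16 bit planes (MSB first), reusing each freed line at once."""
        planes = []
        for k in range(N):
            new_bits = to_bits(next_block[k])
            if self.row_major:
                planes.append([self.m[i][k] for i in range(N)])  # read column k
                for i in range(N):
                    self.m[i][k] = new_bits[i]  # new coefficient into column k
            else:
                planes.append(list(self.m[k]))  # read row k
                self.m[k] = new_bits            # new coefficient into row k
        self.row_major = not self.row_major     # next block is stored transposed
        return planes


def processing_clock_naive(fs_hz: float) -> float:
    """f_p = 16 x 16 x f_s / 4: whole scan done within 4 sample periods."""
    return 16 * 16 * fs_hz / 4


def processing_clock_balanced(fs_hz: float) -> float:
    """f_p = 16 x f_s / 2: the rate allowed by the transposable matrix."""
    return 16 * fs_hz / 2
```

For a 360 Hz sampling rate the two formulas give 23040 Hz and 2880 Hz respectively, an eightfold reduction in processing frequency with no additional coefficient storage.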
Code packaging
After encoding, the coded data are packaged into fixed-length 16-bit words for temporary storage and transmission. For lossy compression, the encoded data from EZW scanning are packaged in order, as shown in Fig. 8(a). At the beginning of an EZW loop, a 4-bit max code indicates the bit location of the first significant bit scanned. The scanned and encoded data then follow in order. For lossless compression, new detail coefficients D1 keep arriving and require encoding during EZW scanning, as shown previously in Fig. 6. To ensure correct decoding, the Huffman code for D1 is placed after every loop of EZW scanning, as depicted in Fig. 8(b).
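The packaging order described above can be sketched as a bit-string builder followed by a splitter into 16-bit words. The function names, the representation of codes as strings of '0' and '1', and the zero padding of the final word are assumptions of this sketch; the exact frame layout is the one given in Fig. 8.

```python
from typing import List


def frame_bits(max_code: int, ezw_bits: str, d1_huffman_bits: str = "") -> str:
    """4-bit max code, then EZW codes, then (lossless only) Huffman codes for D1."""
    assert 0 <= max_code < 16
    return format(max_code, "04b") + ezw_bits + d1_huffman_bits


def pack_words(bitstream: str, word_bits: int = 16) -> List[int]:
    """Split a bit string into fixed-length words (assumed zero padding at the end)."""
    pad = (-len(bitstream)) % word_bits
    padded = bitstream + "0" * pad
    return [int(padded[i:i + word_bits], 2)
            for i in range(0, len(padded), word_bits)]
```

In lossy mode the third argument is simply left empty, so the same packer serves both configurations.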
Experimental results
The proposed ECG processor is implemented in the SMIC 40 nm CMOS process with a total area of 12882 µm² and a gate count of 10.8K, as shown in Fig. 9. The 5-scale LWT and the transposable register matrix take more than 80% of the total resources. Although the variable length of the Huffman code for D1 complicates the packaging logic, which accounts for 9% of the area, the register matrix for coefficient buffering is cut by half and the total area is reduced. To further decrease power consumption, a near-threshold supply voltage is applied in the design, and the processor is able to run at 23 kHz under a 0.5 V supply.
Fig. 9. The layout photograph of the processor and its specifications.
The well-known MIT-BIH Arrhythmia database (MITDB), with 11-bit resolution and a 360 Hz sampling rate, is used for evaluation. The processor works at a low frequency of 2.88 kHz for MITDB, with a power consumption of 92 nW at 0.5 V. The compression performance is evaluated by CR and signal distortion. Signal distortion is measured by the percentage root-mean-square difference (PRD), as shown in equation (1), where x_i is the raw signal, y_i is the reconstructed signal, and n is the total number of samples.
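As a hedged illustration of the two metrics, the sketch below assumes the commonly used definition PRD = 100% × sqrt(Σ (x_i − y_i)² / Σ x_i²), with both sums running over the n samples, and CR as the ratio of original to compressed bit counts. Whether a baseline or mean value is removed from x_i before computing the PRD is not stated, so none is removed here.

```python
import math
from typing import Sequence


def prd_percent(x: Sequence[float], y: Sequence[float]) -> float:
    """PRD in percent between raw samples x and reconstructed samples y."""
    assert len(x) == len(y) and len(x) > 0
    num = sum((xi - yi) ** 2 for xi, yi in zip(x, y))  # error energy
    den = sum(xi ** 2 for xi in x)                     # raw signal energy
    return 100.0 * math.sqrt(num / den)


def compression_ratio(original_bits: int, compressed_bits: int) -> float:
    """CR as original size divided by compressed size."""
    return original_bits / compressed_bits
```

A perfectly reconstructed signal gives a PRD of 0, which is the lossless case.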
Table II lists the compression performance of the proposed work and some previous works. The proposed work supports different degrees of lossy compression by configuring the number of EZW scanning loops. It outperforms the wavelet-based method of [9], whose PRD degrades as CR increases. It also shows performance comparable to more complicated software-based methods such as DCT [5, 13] and Fourier decomposition [6]. For example, when CR is below 10, the proposed work achieves better performance than the others except for [13], which shows a higher CR at a similar PRD. For CR around 20, the PRD of the proposed work is only 0.68%, much lower than the rest.
A comparison with existing hardware implementations is shown in Table III. The design in [8] uses a simple temporal method for compression and has the smallest gate count. The designs in [7] and [9] employ DWT methods and show a larger gate count or area. The proposed work applies the integer-to-integer LWT and has a smaller gate count than [9] and a smaller area than both [7] and [9]. Since leakage power dominates the total power of ECG monitoring systems with low (kHz) processing frequencies, the smaller gate count and area also give the proposed work lower power consumption than [7, 9]. As for compression efficiency, although [8] has the lowest complexity, it also has the lowest performance for both lossless and lossy compression. The design in [9] achieves the highest lossless CR of 2.89, yet its compression is only near-lossless because fractional results are encoded as integers. With the LWT and an optimized hybrid encoding strategy, this work achieves a high lossless CR of 2.71 and the highest lossy CR of 14.87 with the lowest PRD of 0.39% among the four works.
Conclusion
This work presents a low-complexity lossless and lossy ECG compression processor. The combination of 9/7-M and 5/3 LWT for ECG signal decomposition and the proposed hybrid encoding strategy yield an efficient compression processor with lower complexity than state-of-the-art works. In addition, a transposable register matrix lowers the processing frequency without adding resource overhead. Implemented in a 40 nm CMOS process, the proposed processor has a small gate count of 10.8K and a low power consumption of 92 nW. It achieves a lossless CR of 2.71 and a scalable lossy CR of 4.24 to 33.34 with a low PRD of 0.11% to 1.34% for MITDB. Its high power efficiency and compression performance make the proposed processor an attractive choice for wireless ECG monitoring systems.
