Abstract-In this brief, we propose a stand-alone system-on-a-programmable-chip (SOPC)-based cloud system to accelerate massive electrocardiogram (ECG) data analysis. The proposed system tightly couples network I/O handling hardware to data processing pipelines in a single field-programmable gate array (FPGA), offloading both networking operations and ECG data analysis. In this system, we first propose a massive-sessions optimized TCP/IP hardware stack using a macropipeline architecture to accelerate network packet processing. Second, we propose a streaming architecture to accelerate ECG signal processing, including QRS detection, feature extraction, and classification. We verify our design on a Xilinx XC6VLX550T FPGA using real ECG data. Compared to commercial servers, our system shows up to 38× improvement in performance and 142× improvement in energy efficiency.
I. INTRODUCTION

CARDIOVASCULAR disease is still one of the deadliest diseases in the world. Therefore, real-time electrocardiogram (ECG) monitoring and diagnosis are of great significance in the early detection, prevention, and treatment of heart disease. There are two research directions in building ECG analysis systems. One is the portable solution based on wearable sensor technology and low-power specialized hardware [1], [2]. However, these portable systems are primarily aimed at real-time ECG monitoring, which is insufficient for professional diagnosis that requires a sophisticated ECG data processing procedure. The other is cloud-based ECG telemonitoring and remote diagnosis, which collects personal ECG data from wearable ECG devices and performs ECG analysis on cloud-side healthcare platforms. The cloud-based solution has two major benefits. First, moving the ECG processing from the terminal side to the cloud side makes wearable ECG devices simpler, lower in power, and cheaper, which is more cost effective for users. Second, with large amounts of ECG data gathered from different users, it can optimize the ECG analysis models by big data analysis methods and eventually provide a more accurate diagnosis to users. A recent study [3] built a remote ECG data analysis service on commercial cloud infrastructure with a large number of connected commodity computers to support the automated analysis of large patient populations. However, commercial servers are not efficient for this application, in terms of both performance and energy efficiency. Remote ECG data analysis, like many other cloud services, is a web-based application that incurs a large number of concurrent requests; each request involves an individual task that demands a short response latency from energy-hungry servers.
Thus, for existing commercial servers, the performance is typically limited by the overheads of network packet processing and connection management in the network interface controller and the operating system kernel. Furthermore, the process of ECG analysis is complex and time consuming, which imposes a heavy computation burden on commercial servers. As a result, network I/O handling and the high complexity of ECG data analysis are the two bottlenecks in building an efficient cloud computing platform for ECG analysis.
To address these issues, we present an FPGA-based cloud system for massive ECG data analysis. Although the FPGA is widely used in massive data processing systems [4], little research has been performed to build a stand-alone FPGA-based system. Differing from previous work on ECG monitoring hardware [2], [5], which mainly focused on front-end preprocessing of ECG data such as QRS detection, our system is an all-in-one cloud system that tightly couples network I/O handling and a full ECG diagnosis pipeline in a single FPGA. First, we propose a massive-sessions optimized TCP/IP hardware stack to offload networking operations. We adopt a macropipeline architecture with hardware-based connection management, in which all TCP sessions share a coarse-grained pipeline and centralized memory. Compared to the current third-party TCP offload engines (TOEs) that are primarily optimized for interserver data transfer among a few TCP connections in data centers, our TCP/IP hardware stack is mainly optimized for high concurrent connections, supporting up to 100 K TCP sessions under 10-Gb Ethernet. Second, we propose a streaming architecture to accelerate ECG signal processing and diagnosis, including not only QRS detection but also feature extraction (FE) and classification. Our streaming architecture is optimized for massive concurrent ECG processing and tightly coupled to our TCP/IP hardware stack, which can achieve much higher throughput. We verified our design on a Xilinx XC6VLX550T FPGA. Compared to commercial servers, our system shows up to 38× improvement in performance and 142× improvement in energy efficiency.
1549-7747 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
II. SYSTEM ARCHITECTURE FOR MASSIVE ECG DATA ANALYSIS
To support remote ECG data analysis for a massive number of concurrent users, we propose a system-on-a-programmable-chip (SOPC)-based system architecture (see Fig. 1) that tightly couples network I/O handling and data processing in a single FPGA. It includes a network I/O handling subsystem and an ECG data processing subsystem, as well as an embedded microprocessor and a memory subsystem. In the network I/O subsystem, a massive-sessions optimized TCP/IP hardware stack is implemented to support fast network packet processing and rapid connection management. Moreover, a highly pipelined ECG data processing accelerator is directly connected to the hardware TCP/IP stack, which responds to massive user requests without any software interaction.
A. TCP/IP Hardware Stack
Mainstream TOEs are primarily optimized for transmitting massive data volumes rather than for high concurrent connections, because they target interserver data transfer among a few TCP connections in data centers. They typically adopt a one-TCP/IP-session-per-pipeline architecture, in which each TCP session contains a full TCP/IP pipeline, as well as private TX and RX buffers using on-chip SRAMs. However, cloud-based ECG analysis is quite a different application scenario in that the TOE needs to maintain more than 10 000 concurrent connections, where each connection contributes only a small portion of the total throughput. Thus, these third-party TOEs no longer satisfy this requirement because their limited on-chip resources cannot support more TCP sessions. To address the problem, we propose a massive-sessions optimized TOE. We use a macropipeline architecture in which all TCP sessions share a coarse-grained pipeline with carefully organized TX and RX buffers. In addition, we develop hardware-based connection management to support the rapid scheduling of up to 100 000 TCP sessions.
1) Macropipeline Architecture: In order to achieve high system throughput, all processing modules in the macropipeline operate asynchronously, each of which handles a portion of a processing stage or directly responds to a specific type of TCP packet (see Fig. 2). These processing modules are event triggered: they continuously read events or messages from the input first-in first-out (FIFO) buffer, process the events, and send messages to the next stage without any synchronization. Specifically, for incoming data processing, data from the network are first offloaded from the 10-Gb Ethernet and unpacked by the IP datagram receive engine to obtain TCP packets. Then, these TCP packets are divided into different categories, based on the control bits in their TCP headers, and delivered to the corresponding processing engines. Each processing engine deals with a particular type of TCP packet, such as a SYN, FIN, or ACK packet, and directly generates feedback packets in response. The connection management engine deals with the connection and disconnection of TCP links and schedules the data transmission order among all established links. For outgoing data processing, the data fragment engine segments the data to be transmitted into small data packets. Then, these data packets are encapsulated with the TCP header and the IP header successively by the TCP packaging engine and the IP encapsulation engine. When a timeout occurs, the retransmission engine retransmits the lost TCP packets based on the most recent data ACK number.
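As an illustration only, the event-triggered organization described above can be modeled in software as a set of stages, each with its own input FIFO, with incoming packets dispatched by their TCP control bits. The following Python sketch is not the hardware design: the names `Stage`, `build_pipeline`, and `run` are hypothetical, the packet representation is a plain dictionary, and the replies generated by the SYN and FIN engines are simplified.

```python
from collections import deque

# Standard TCP header flag bit values.
FIN, SYN, ACK = 0x01, 0x02, 0x10


class Stage:
    """One event-triggered module with a private input FIFO (hypothetical model)."""

    def __init__(self, name, handler):
        self.name = name
        self.fifo = deque()
        self.handler = handler  # returns a list of (target_stage, message)

    def step(self):
        """Process at most one pending event; return True if work was done."""
        if not self.fifo:
            return False
        event = self.fifo.popleft()
        for target, message in self.handler(event):
            target.fifo.append(message)
        return True


def build_pipeline():
    wire = []  # packets leaving toward the Ethernet side

    def ip_encapsulate(segment):
        wire.append({"ip": True, **segment})
        return []

    ip_tx = Stage("ip_encapsulation", ip_encapsulate)

    # Each engine handles one packet type and replies directly (simplified replies).
    syn = Stage("syn", lambda p: [(ip_tx, {"conn": p["conn"], "flags": SYN | ACK})])
    fin = Stage("fin", lambda p: [(ip_tx, {"conn": p["conn"], "flags": FIN | ACK})])
    ack = Stage("ack", lambda p: [])  # state update only in this sketch

    def classify(pkt):
        # Dispatch on the control bits of the TCP header.
        if pkt["flags"] & SYN:
            return [(syn, pkt)]
        if pkt["flags"] & FIN:
            return [(fin, pkt)]
        return [(ack, pkt)]

    rx = Stage("ip_receive", classify)
    return rx, [rx, syn, fin, ack, ip_tx], wire


def run(stages):
    # Step every stage repeatedly until all FIFOs are drained.
    while any([stage.step() for stage in stages]):
        pass
```

The stages communicate only through FIFOs, so no stage waits on another; in the sketch this decoupling is what allows `run` to step the stages in any order.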
2) Connection Management: Due to the limited on-chip memory resources, connection information can only be stored in an external memory, and this shared memory may become the bottleneck of the system when it incurs concurrent random accesses from different processing modules. We mitigate the problem in two ways: by reducing the access latency and by reducing the number of memory accesses. We first choose SRAM, instead of DRAM, to store TCP sessions because of its much lower access latency. Then, as described in Fig. 3, the memory space of the external SRAM is segmented into a number of consecutive TCP connection slots, in which we use a hash mechanism with the source IP address and port number as the key to quickly look up a dedicated TCP connection. We avoid the round-robin scheduling scheme, which is widely used in TOEs that support a small number of TCP sessions, for scheduling data transmission among established TCP connections because, in a web-based cloud computing scenario with high concurrent connections, it is time consuming to check every connection slot to find an active connection. Instead, we maintain two linked lists to quickly locate active and timed-out connections. The active linked list stores TCP connections that have an event to be processed. When an event (new ACK received, new data received, or new data to send) occurs for a certain connection, the connection is inserted at the tail of the active linked list. The connection management engine continuously takes the connection at the head, executes the event, and then puts the connection at the tail of the timeout linked list. The timeout linked list, on the other hand, is used for monitoring the transmission timeout of idle connections. Idle connections are inserted at the tail of the timeout linked list with a timeout point obtained by adding a fixed timeout value to the current time. Since the timeout linked list is therefore naturally sorted by timeout point, the retransmission engine only needs to check the head of the list to obtain the connections that have timed out, which avoids repeatedly traversing all established TCP connection slots and the access burden this would impose on the shared memory.
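The two-list scheme can be sketched in software as follows. This is a minimal Python model under stated assumptions, not the hardware implementation: a dictionary stands in for the hashed SRAM connection slots, `ConnectionManager` and its method names are hypothetical, and refreshing an existing deadline when a connection is serviced again is an assumption that the text does not spell out.

```python
from collections import OrderedDict, deque


class ConnectionManager:
    """Software model of the active and timeout lists (names are hypothetical)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.slots = {}                # (src_ip, src_port) -> state, hash lookup
        self.active = deque()          # (key, event), served in FIFO order
        self.timeouts = OrderedDict()  # key -> deadline, kept sorted by insertion

    def open(self, src_ip, src_port):
        self.slots[(src_ip, src_port)] = {"events_handled": 0}

    def post_event(self, key, event):
        """New ACK, new data received, or new data to send: append to the tail."""
        self.active.append((key, event))

    def process_next(self, now):
        """Serve the head of the active list, then move it to the timeout tail."""
        if not self.active:
            return None
        key, event = self.active.popleft()
        self.slots[key]["events_handled"] += 1
        self.timeouts.pop(key, None)             # assumption: refresh old deadline
        self.timeouts[key] = now + self.timeout  # fixed timeout added to current time
        return key, event

    def expired(self, now):
        """Inspect only the head; stop at the first connection not yet timed out."""
        out = []
        while self.timeouts:
            key, deadline = next(iter(self.timeouts.items()))
            if deadline > now:
                break
            self.timeouts.popitem(last=False)
            out.append(key)
        return out
```

Because every deadline is the current time plus the same constant, appending at the tail keeps the list ordered without sorting, so `expired` never scans beyond the first live entry.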
B. ECG Data Processing Module
The ECG data processing module in our work differs from the previous portable ECG monitoring hardware designs [2], [5] in that: 1) we target ECG diagnosis rather than only ECG monitoring; thus, our design includes the full ECG diagnosis procedure, including not merely QRS complex detection but also FE and classification; and 2) we consider ECG data analysis for massive concurrent users rather than only one user; thus, our design objective is to maximize system throughput within limited on-chip resources.
The architecture of our ECG data processing module is composed of three processing stages: task assignment, FE, and classification (see Fig. 4 ). After the TCP/IP stack offloads users' ECG data to the external DDR3 memory, the task assignment module dispatches these ECG data to different FE pipelines. In the FE pipeline, we first detect the QRS complex of the ECG data. Then, we extract several features from the QRS complex as the inputs of the classification stage. We further implement an artificial neural network (ANN) to classify the extracted features. Finally, the diagnostic results are sent back to the users by the TCP/IP engine.
1) FE Pipelines:
In our work, we first use a wavelet transform (WT) method [6] to detect the QRS complex, which is the most important and most informative segment in the ECG for determining heart conditions. The WT decomposes a target signal into an approximate signal and a set of high-frequency wavelets. We implement a two-stage WT pipeline, each stage of which consists of a high-pass filter G[·], a low-pass filter H[·], and two down-samplers by 2 (↓2) (see Fig. 5). The downsampled outputs of the first high- and low-pass filters provide the detail D1 and the approximation A1, respectively. The approximation A1 is further decomposed in the second level to produce the second-level approximation A2 and detail D2. As shown in Fig. 6, the detection of the QRS complex focuses on the detection of R peaks in the raw ECG data. In D2, each special wave in which a negative minimum point is followed by a positive maximum point corresponds to an R peak. Thus, we capture the position of R peaks by identifying the zero crossing of extremum pairs in D2. Then, the QRS complex is extracted from the raw ECG data by segmenting a fixed number of samples centered at each R peak.
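The decomposition and the extremum-pair search can be illustrated with a small Python sketch. The filters of the method in [6] are not given here, so the sketch assumes Haar filters (pairwise average for H, pairwise difference for G) purely for illustration; `haar_level`, `two_level_wt`, `detect_r_peaks`, `qrs_segment`, the threshold, and the `max_gap` parameter are all hypothetical. With a decimated transform, the estimated R-peak position is only accurate to the resolution of D2, that is, four raw samples.

```python
def haar_level(x):
    """One level: low-pass H and high-pass G, each followed by downsampling by 2.

    Haar filters are an assumption made for this sketch only.
    """
    pairs = list(zip(x[0::2], x[1::2]))
    approx = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return approx, detail


def two_level_wt(x):
    """Return the subbands D1, D2, and A2 of a two-level decomposition."""
    a1, d1 = haar_level(x)
    a2, d2 = haar_level(a1)
    return d1, d2, a2


def detect_r_peaks(d2, threshold, max_gap=3):
    """Locate negative-minimum / positive-maximum pairs in D2.

    Returns estimated R-peak positions as raw-sample indices.
    """
    peaks = []
    n = len(d2)
    i = 0
    while i < n:
        if d2[i] <= -threshold:
            # Walk down to the bottom of the negative lobe.
            while i + 1 < n and d2[i + 1] < d2[i]:
                i += 1
            neg = i
            # Look for a positive lobe shortly after the negative one.
            for j in range(neg + 1, min(n, neg + 1 + max_gap)):
                if d2[j] >= threshold:
                    while j + 1 < n and d2[j + 1] > d2[j]:
                        j += 1
                    # Zero crossing lies midway; one D2 sample spans 4 raw samples.
                    peaks.append(2 * (neg + j) + 2)
                    i = j
                    break
        i += 1
    return peaks


def qrs_segment(x, peak, width=64):
    """Cut a fixed-width segment of the raw signal centered at an R peak."""
    return x[peak - width // 2: peak + width // 2]
```

For a synthetic triangular peak, the rising edge produces the negative lobe and the falling edge the positive lobe, so the zero crossing between them marks the peak.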
To support massive ECG data analysis within limited on-chip resources, we do not use all sample points of the QRS complex, but rather extract features from the QRS complex as the inputs of the classification, which reduces the size of the classification hardware. We use 11 statistics over the set of wavelet coefficients to represent the time-frequency distribution of the ECG signals: the variance of the original QRS complex, the variance of the wavelet coefficients in each subband (D1, D2, A2), the variance of the autocorrelation function of the wavelet coefficients in each subband (D1, D2, A2), the ratio of the minimum to the maximum of the wavelet coefficients in each subband (D1, D2, A2), and the instantaneous RR interval. For more details, please refer to [7].
Since our design objective is to maximize the system throughput rather than to reduce the processing latency, we adopt a more resource-saving FE pipeline architecture (see Fig. 5), and we achieve higher throughput by instantiating multiple pipelines (ten pipelines to match the 10-Gb/s processing rate). First, we tightly combine the QRS detection and the FE procedures in our FE pipeline. We share one WT module between these two procedures to save resources. When ECG data are ready in the local block RAM, the WT module executes a pipelined WT. The D2 output is sent to the R-peak detection module, in which all R-peak positions, as well as the 64-point QRS segments centered at the R peaks, are extracted from the record. After the QRS detection, a second WT is applied to the QRS segments using the same WT hardware, which generates the subband wavelet coefficients that feed the FE procedure. Second, we share the execution engines among the FE computations of all subband data. We use controller logic to manage the execution order of the Min/Max, autocorrelation, and variance computations. Finally, each FE pipeline outputs the 11 statistical features and the user's connection information to one feature FIFO for further classification.
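The 11 features listed above (1 + 3 + 3 + 3 + 1) can be written down compactly in software. The Python sketch below is illustrative only and assumes a particular feature order, a population variance, and an unnormalized autocorrelation, none of which is specified in the text; `extract_features`, `autocorrelation`, and `min_max_ratio` are hypothetical names, and the zero-maximum guard is an added assumption.

```python
from statistics import pvariance


def autocorrelation(c):
    """Unnormalized autocorrelation r[k] = sum_i c[i] * c[i + k] (assumed form)."""
    n = len(c)
    return [sum(c[i] * c[i + k] for i in range(n - k)) for k in range(n)]


def min_max_ratio(c):
    """Ratio of minimum to maximum coefficient; the zero guard is an assumption."""
    hi = max(c)
    return min(c) / hi if hi != 0 else 0.0


def extract_features(qrs, d1, d2, a2, rr_interval):
    """Assemble the 11 statistics for one heartbeat (ordering is an assumption)."""
    subbands = (d1, d2, a2)
    feats = [pvariance(qrs)]                                     # 1 feature
    feats += [pvariance(c) for c in subbands]                    # 3 features
    feats += [pvariance(autocorrelation(c)) for c in subbands]   # 3 features
    feats += [min_max_ratio(c) for c in subbands]                # 3 features
    feats.append(rr_interval)                                    # 1 feature
    return feats
```

The same three primitives (variance, autocorrelation, min/max) are reused for every subband, which mirrors the reason the hardware can share its execution engines across subbands.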
2) ANN-Based Classification:
The classification module contains a pretrained ANN with 11 input neurons, 20 hidden neurons, and 6 output neurons. Since the FE stage reduces the size of the input data to the ANN, the computation complexity of the ANN is greatly reduced. Thus, a more lightweight neuron design is adopted to save on-chip resources. In each neuron, we place only one multiply-accumulate unit and a finite-state machine that controls the access to the outputs of the preceding neurons. Thus, in each cycle, the neuron executes one multiply-accumulate operation on one of its inputs and the corresponding pretrained weight. In addition, we use a shared lookup table (64 K depth and 16-bit width) stored in the on-chip block RAM to implement the sigmoid function, which improves its processing speed.
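A rough software analogue of this neuron design is sketched below in Python. It is an assumption-laden illustration, not the implemented datapath: the lookup table is indexed with an assumed signed Q4.12 input covering [-8, 8), saturation outside that range and the bias term are assumptions, and floating-point arithmetic replaces the 32-bit fixed-point units. The names `sigmoid_lut`, `neuron`, and `forward` are hypothetical.

```python
import math

LUT_DEPTH = 1 << 16   # 64 K entries, as stated in the text
FRAC_BITS = 12        # assumed Q4.12 index format, covering [-8, 8)
OUT_MAX = 65535       # 16-bit unsigned output codes

# Precomputed table shared by all neurons.
SIGMOID_LUT = [
    round(OUT_MAX / (1 + math.exp(-(i - LUT_DEPTH // 2) / (1 << FRAC_BITS))))
    for i in range(LUT_DEPTH)
]


def sigmoid_lut(x):
    """Approximate sigmoid(x) by table lookup, saturating outside [-8, 8)."""
    idx = int(round(x * (1 << FRAC_BITS))) + LUT_DEPTH // 2
    idx = max(0, min(LUT_DEPTH - 1, idx))
    return SIGMOID_LUT[idx] / OUT_MAX


def neuron(inputs, weights, bias):
    """Serial neuron: one multiply-accumulate per loop iteration."""
    acc = bias  # the bias term is an assumption of this sketch
    for x, w in zip(inputs, weights):
        acc += x * w
    return sigmoid_lut(acc)


def forward(features, hidden, output):
    """Evaluate an 11-20-6 network; each layer is a list of (weights, bias)."""
    h = [neuron(features, w, b) for w, b in hidden]
    return [neuron(h, w, b) for w, b in output]
```

Replacing the exponential with a table lookup trades block RAM for arithmetic, and sharing one table among all neurons keeps that memory cost fixed regardless of the network size.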
III. EXPERIMENTAL RESULTS AND DISCUSSION
A. Experimental Setup
We evaluate our design on one computing node of mimicry computer [8] based on a Xilinx XC6VLX550T FPGA, which contains two 8-GB DDR3 SDRAMs, one 72-Mb SRAM, and 10-Gb Ethernet. The embedded microprocessor is implemented by MicroBlaze, a lightweight soft core provided by Xilinx. The system clock frequency is 156.25 MHz, which is equal to the frequency of 10-Gb/s physical layer (PHY) interface.
In our study of cloud-based ECG classification, 23 ECG records were selected from the widely used MIT-BIH arrhythmia database [9]. All records were digitized at 360 samples per second with 11-bit resolution for slightly over 30 min. These records include six ECG beat types. For each beat type, we choose 50% of the beats for training and the other 50% for testing. The parameters of the neural network were trained offline using software and then fixed in the FPGA implementation. Ten FE pipelines were integrated in the FE module to maximize the throughput and make full use of the reconfigurable resources. Since the ECG input data have 11-bit resolution and there is a square term in the variance computation, we choose a unified 32-bit fixed-point format for all arithmetic logic units. It is worth noting that we implement the sigmoid function in a 16-bit 64 K-depth lookup table. According to [10], even 8-bit fixed-point precision in all ANN forward retrieving computation causes an error of less than 10^-4 relative to the floating-point version. Thus, our higher-precision configuration reduces the accuracy error even further. Our experiment shows that the accuracy of diagnosis in our implementation is the same as that of the software version.
TABLE I: DEVICE UTILIZATION SUMMARY
TABLE II: LATENCY OF THREE ECG DATA PROCESSING STEPS
Table I shows the device utilization summary of our design generated by Xilinx ISE 14.1. It uses only 19.0% of the total slice registers, 34.7% of the total slice LUTs, 31.1% of the total BRAMs, and 53.1% of the total DSP48E1s. Table II illustrates the latency of the three ECG data processing steps in our design. The latency of the FE pipeline to process 4 K points of ECG data containing 12 heartbeats is 9728 cycles. Although the latency of reading data from the DDR3 memory to the block RAM is 4096 cycles, this data acquisition latency is hidden in the execution pipeline using double buffering.
Since the system clock is 156.25 MHz, the throughput of a single FE pipeline is 8 kB ÷ (9728 cycles × 1/156.25 MHz) = 128.5 MB/s. Considering the maximum theoretical upstream throughput of 10-Gb Ethernet, ten FE pipelines are sufficient to meet the requirement of the system-level throughput: 10 Gb/s ÷ (128.5 × 8) Mb/s = 9.72 < 10. On the other hand, the latency of the neural network module to classify one heartbeat is 61 cycles. As a result, its throughput can be calculated as 8 kB ÷ (61 cycles × 1/156.25 MHz × 12 beats) = 1.71 GB/s (about 13.7 Gb/s) > 10 Gb/s, which also meets the requirement of the system-level throughput.
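The arithmetic above can be checked with a few lines of Python. The constants are taken from the text; the only assumption is that "8 kB" means 8000 bytes, which is the interpretation that reproduces the stated 128.5 MB/s. The variable names are hypothetical.

```python
CLOCK_HZ = 156.25e6         # system clock
RECORD_BYTES = 8_000        # "8 kB" per 4 K-point record, taken as 8000 bytes
FE_CYCLES = 9728            # FE pipeline latency per record
ANN_CYCLES_PER_BEAT = 61    # ANN latency per heartbeat
BEATS_PER_RECORD = 12       # heartbeats in one 4 K-point record
LINK_BPS = 10e9             # 10-Gb Ethernet line rate

# Throughput of one FE pipeline, in bytes per second.
fe_bytes_per_s = RECORD_BYTES / (FE_CYCLES / CLOCK_HZ)

# Number of FE pipelines needed to keep up with the line rate.
pipelines_needed = LINK_BPS / (fe_bytes_per_s * 8)

# Throughput of the ANN module, in bytes per second.
ann_bytes_per_s = RECORD_BYTES / (
    ANN_CYCLES_PER_BEAT * BEATS_PER_RECORD / CLOCK_HZ
)

print(f"FE pipeline: {fe_bytes_per_s / 1e6:.1f} MB/s")
print(f"Pipelines needed for 10 Gb/s: {pipelines_needed:.2f}")
print(f"ANN module: {ann_bytes_per_s / 1e9:.2f} GB/s")
```

The pipeline count evaluates to just under 10, so ten FE pipelines cover the line rate with little headroom, while the single ANN module has a margin of roughly 37% over 10 Gb/s.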
B. Performance Evaluation
To evaluate the performance, we simulate user requests using client PCs, which open multiple connections to our prototype board and upload ECG data from the testing set. Each connection uploads 4 K sequential sample points to the server. We implemented a software version of the same ECG analysis application on an IBM x3635 M3 server with an Intel x5650 CPU (2.66 GHz), 8-GB DDR3 memory, and an Intel x520-SR2 10-Gb Ethernet network adapter. The software version was programmed in standard C using Linux sockets for network communication.
1) Maximum Transactions per Second:
We used maximum transactions per second (Tps) as the performance metric to evaluate the system-level throughput. A completed transaction included the uploading of the 4 K points of ECG data, the ECG data classification, and the sending back of the result. Our FPGA-based solution was able to accomplish 41 830 transactions per second, which is 38 times that of the software version [see Fig. 7(a)]. In the software version, both the I/O-intensive workload (network handling) and the computation-intensive workload (ECG data processing) contend for the same CPU time. In contrast, our stand-alone SOPC-based architecture handles the two parts of the workload with separate hardware modules in pipelined form and thus achieves better performance.
2) Energy Efficiency:
We use transactions per joule (Trans/J) to define the energy efficiency of the two systems. We measured the total energy consumption P (J) and the total number of completed transactions T during a 1-h test, and then we calculated the energy efficiency as T/P (Trans/J). As shown in Fig. 7(b), our energy efficiency was 909 Trans/J, which is 142 times that of the software version.
3) Benchmarking With the State-of-the-Art Designs: A comparison with prior works in processing throughput [ECG samples processed per second (Sams/s)] is given in Table III. Since most ECG processing systems [2], [5], [11] are designed for a portable, single-user scenario, they typically employ embedded processors or lightweight hardware to save power and chip size, so these systems have a moderate ECG processing capacity (e.g., Hashim et al. [11] achieve only 29.06 × 10^3 Sams/s). Ieong et al. [12], by contrast, designed throughput-optimized QRS detection hardware in an FPGA, which achieves 57.53 × 10^6 Sams/s. However, it only accelerates QRS detection, whereas our work presents an entire cloud-based ECG analysis system, including network handling, connection management, QRS detection, FE, and classification. Zhou et al. [13] proposed an FPGA-assisted cloud framework for massive ECG processing. It uses a commercial server to handle network connections and attaches an FPGA via PCI Express (PCIE) to accelerate ECG processing. However, the data transmission latency between the PC host and the FPGA degrades the overall performance (17.75 × 10^6 Sams/s). In addition, it does not contain FE and classification logic. Compared with that framework, our system is a stand-alone SOPC system. By offloading both networking operations and ECG data analysis in a closely coupled architecture, our system can achieve 63.89 × 10^6 Sams/s.
IV. CONCLUSION
In this brief, we have presented the design and implementation of a stand-alone SOPC-based system that tightly couples the network I/O handling and data processing in a single FPGA to accelerate massive ECG data analysis. We have designed and implemented a novel TCP/IP hardware stack that supports 10-Gb Ethernet and 100 K concurrent TCP connections to meet the requirement of high concurrent requests. Due to the hardware pipeline implementation of network protocols and full-featured ECG signal processing, our implementation can achieve higher performance and lower energy consumption with stand-alone cloud service functionalities of ECG data analysis. We evaluated our design using a prototype FPGA implementation, showing 38× speedup in performance and 142× improvement in energy efficiency over existing servers.
