I. INTRODUCTION
One of the great challenges of the next decade is the integration of information technologies and health care. Such integration should improve the quality and reduce the cost of services for patients and health care providers by reducing misdiagnoses and by giving more patients access to advanced modalities [1], [2], [3], [4], [5]. In addition, wearable biomedical devices are used in inpatient, outpatient, and at-home e-Patient care [6], where they must monitor the patient's biomedical and physiological signals around the clock.
Several biomedical applications require execution of digital signal, image, and video processing algorithms. For instance, ultrasound and seizure detection both involve filtering, FFT blocks, up/down sampling, and windowing techniques that can be parallelized on DSP processors. On the other hand, such portable devices have extremely small budgets for size and power, and therefore currently rely on application-specific integrated circuits (ASICs) or highly customized SoCs [7], [8].
In this paper, we present a programmable low-power many-core processing platform that efficiently implements biomedical signal processing workloads for computer-aided diagnosis. The paper is organized as follows. Section II discusses background on many-core platforms, seizure detection, and ultrasound. Then the many-core architecture and its enhanced features are discussed. Finally, the CMOS implementation of the many-core along with the application mapping results are presented.
978-1-4673-4953-6/13/$31.00 ©2013 IEEE
II. BACKGROUND
A. Many-core Trend
Many-core architectures have been well studied as a potential rival to reconfigurable fabric based platforms [9]. One common many-core architecture is the multiple instruction and multiple data (MIMD) architecture. In such a design, the processors are independent cores capable of executing their own instruction streams and processing data through their input/output (I/O) ports and/or local data memories. A low power seizure detection and analysis architecture was initially proposed by the authors [26].
14th Int'l Symposium on Quality Electronic Design
III. PROPOSED ARCHITECTURE
The proposed many-core architecture consists of in-order processors with a RISC-like DSP instruction set, a 16-bit integer datapath, and minimal instruction and data memory suitable for task-level parallelism [27]. The network is designed as a 4-ary tree architecture, where each processor in a cluster of four can communicate with any other core within its own cluster through direct connections. The network is designed to support processors operating on asynchronous clock signals in Globally Asynchronous Locally Synchronous (GALS) fashion [19], [28]. Therefore, processors can halt or change their clock based on the workload requirement to achieve the maximum energy efficiency.
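As a rough illustration of the 4-ary tree organization, the sketch below computes the lowest tree level at which two cores share a common parent. It assumes a hypothetical numbering in which cores 4k through 4k+3 form one cluster; the function names and the numbering scheme are illustrative assumptions and are not taken from the design itself.

```python
def tree_level(src: int, dst: int, arity: int = 4) -> int:
    """Return the lowest tree level at which cores src and dst meet.

    Level 0 means the same core, level 1 the same cluster of `arity` cores,
    level 2 the same group of `arity` clusters, and so on.
    Assumes (hypothetically) that core IDs are assigned consecutively so that
    integer division by `arity` yields the parent index.
    """
    if src < 0 or dst < 0:
        raise ValueError("core IDs must be non-negative")
    level = 0
    while src != dst:
        src //= arity  # move both endpoints up to their parents
        dst //= arity
        level += 1
    return level


def same_cluster(src: int, dst: int, arity: int = 4) -> bool:
    """True when two distinct cores sit in the same cluster (direct link)."""
    return src != dst and tree_level(src, dst, arity) == 1
```

Under this numbering, cores 0 and 3 share a cluster and would use a direct connection, while cores 3 and 4 meet only one level higher in the tree.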
Each core processor is based on a RISC architecture with a 6-stage pipeline. Figure 3
A. DSP Enhanced Architecture
The primary purpose of the platform is to map and run biomedical applications, which are heavily focused on digital signal processing. DSP applications have a more predictable runtime than an application on a general purpose processor (GPP), and this predictability can be leveraged to decrease the runtime. As a result, the architecture has been designed to improve upon existing GPPs. The goal is to decrease processing overhead so that arithmetic operations become the primary runtime component.
computations to addition and multiplication. However, the data in memory must be carefully managed in order to correctly calculate the transform. For each iteration, two input data points and the corresponding twiddle factor must be addressed and read in from memory before the data can be manipulated. In other processors, the addresses for the data and twiddle factor must be calculated in sequence before the FFT calculations.
The additional calculations to generate the data and twiddle factor addresses can be a costly time penalty for the FFT.
Instead, our proposed core includes an FFT block that performs these address calculations in parallel with the FFT calculations. The addresses are generated using simple data reordering and bit reversal and then passed to the decode stage in the pipelined architecture.
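The address pattern described above can be sketched in software. The following is a minimal model that assumes a radix-2 decimation-in-time FFT with bit-reversed input ordering; the radix, the function names, and the floating-point arithmetic are assumptions made for illustration and do not describe the hardware FFT block itself. For each stage it produces the two data addresses and the twiddle-factor address of every butterfly, which are the values a processor would otherwise compute in sequence.

```python
import cmath


def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def butterfly_addresses(n: int, stage: int):
    """Yield (top, bottom, twiddle) addresses for one radix-2 DIT stage.

    `stage` runs from 0 to log2(n) - 1; `twiddle` indexes a table of
    W_n^k = exp(-2j*pi*k/n) for k in 0 .. n/2 - 1.
    """
    half = 1 << stage          # distance between the two butterfly inputs
    span = half << 1           # size of one butterfly group
    stride = n // span         # twiddle table step for this stage
    for block in range(0, n, span):
        for k in range(half):
            yield block + k, block + k + half, k * stride


def fft(samples):
    """Reference FFT driven only by the generated addresses."""
    n = len(samples)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    # Data reordering: load the input in bit-reversed order.
    data = [complex(samples[bit_reverse(i, bits)]) for i in range(n)]
    twiddles = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    for stage in range(bits):
        for top, bottom, tw in butterfly_addresses(n, stage):
            t = data[bottom] * twiddles[tw]
            data[top], data[bottom] = data[top] + t, data[top] - t
    return data
```

In this model the arithmetic inside the inner loop is only a multiplication and two additions; everything else is address generation, which is the part the text says is moved into dedicated hardware.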
To take advantage of both registers and data SRAM, pointer support has been added so that a memory address can be referenced using the value of a register. Pointer dereferencing is built into the hardware and requires no additional cycles to execute. This is possible because register values are read one cycle before memory reads, so the value of the register can be used as the address of the memory read. Pointers are useful for applications with simple data structures such as lists.
The register value can then be incremented or decremented to reference a different address in memory.
[Figure: caption fragment "... instruction and data memory, and input/output FIFOs"; block labels: Instruction SRAM, Data SRAM]
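The register-indirect addressing described above can be illustrated with a small software model. This is only a sketch: the class name, the register count, and the memory size are hypothetical, and the 16-bit wraparound simply mirrors the 16-bit integer datapath mentioned earlier. A register holds the address, the memory read uses that value, and the register is then stepped to walk through a list.

```python
class CoreModel:
    """Toy model of a core with registers and a small data memory.

    num_regs and mem_words are illustrative assumptions, not the
    actual sizes of the proposed core.
    """

    def __init__(self, num_regs: int = 16, mem_words: int = 128):
        self.regs = [0] * num_regs
        self.mem = [0] * mem_words

    def load_indirect(self, ptr_reg: int, step: int = 0) -> int:
        """Read memory at the address held in a register, then step the pointer.

        step=+1 models a post-increment, step=-1 a post-decrement,
        and step=0 leaves the pointer unchanged.
        """
        addr = self.regs[ptr_reg] % len(self.mem)
        value = self.mem[addr]
        self.regs[ptr_reg] = (self.regs[ptr_reg] + step) & 0xFFFF  # 16-bit register
        return value


def sum_list(core: CoreModel, ptr_reg: int, count: int) -> int:
    """Sum `count` consecutive words starting at the pointer register."""
    total = 0
    for _ in range(count):
        total = (total + core.load_indirect(ptr_reg, step=1)) & 0xFFFF
    return total
```

A list stored at consecutive addresses can then be traversed by loading a start address into one register and calling `sum_list`, without a separate address calculation per element.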
IV. CMOS IMPLEMENTATION AND APPLICATION MAPPING RESULTS
Each processor was synthesized, placed, and routed in the 65 nm TSMC CMOS process. It occupies 0.070 mm² and runs at 1.18 GHz at 1 V. We used a standard-cell RTL-to-GDSII flow with synthesis and automatic place and route.
The hardware was developed using Verilog to describe the architecture, synthesized with Cadence RTL Compiler, and placed and routed using Cadence SOC Encounter. Fig. 5 shows the layout of a single core. 
