Abstract-This paper presents an energy-efficient processing platform for wearable sensor nodes, designed to support diverse biological signals and algorithms. The platform features a 0.5 V-1.0 V 16-bit microcontroller, SRAM, and accelerators for biomedical signal processing. Voltage scaling and block-level power gating allow energy efficiency to be optimized for applications of varying complexity. Programmable accelerators support numerous usage scenarios and perform signal processing tasks at 133× to 215× lower energy than the general-purpose CPU. When running complete EEG and EKG applications using both CPU and accelerators, the platform achieves 10.2× and 11.5× energy reduction respectively compared to CPU-only implementations.
I. INTRODUCTION
ADVANCES in sensor and circuit technologies are fueling new possibilities in ambulatory medical monitoring, where wearable devices can monitor a subject's vital signs while the subject is moving about freely. A local processor on such devices can extract key information from raw physiological signals, thus greatly reducing the amount of data to be transmitted or stored. In this context, a processor offers some advantages over custom ASICs: it can be used across many applications and allows ongoing algorithm improvement.
A processor with a general-purpose instruction set offers great flexibility but is less energy-efficient than an ASIC designed for a particular algorithm. However, since many algorithms share common signal processing tasks, we integrate custom accelerators with a CPU to strike a balance between flexibility and energy efficiency. Accelerators are widely employed in many systems to speed up specific tasks, and more recently, their use to reduce energy in low-power processors was explored in [1]. Alternatively, the voltage-scalable SIMD processor in [2] contains many arithmetic processing elements suitable for filtering. It is targeted at image processing, but similar architectures exist in other domains. However, accelerators in the context of ambulatory medical monitoring have been relatively unexplored, as previous digital signal processors (DSPs) have focused on full-custom solutions [3] or general-purpose CPUs [4]. The processor in [5] used dedicated hardware for tasks such as data compression along with a RISC CPU, and the system is intended for EKG processing. The system of [6] is designed more broadly for low-power biomedical applications and includes a 32-bit CPU plus an FFT accelerator.
In contrast, this work focuses on a processing platform that accommodates a range of applications and domains through the use of several accelerators to perform key computations. The accelerators have programmable features to handle different use cases. The overall platform employs voltage scaling and power gating to adapt to algorithms of varying complexity in an energy-efficient manner. We also propose several architectural optimizations in the accelerator designs to reduce power and improve speed.
The platform pictured in Fig. 1 [7] leverages a 16-bit CPU based on [8], but here the CPU has been extended with logic to support software debugging as well as a direct memory access (DMA) block. Besides the accelerators, the platform also includes a custom SRAM, timers, serial ports, and a hardware multiplier. The CPU initializes these modules by writing to their register interfaces, while the modules drive an interrupt or DMA trigger at task completion. To accommodate different performance demands, the system is designed to function at variable supply voltages from 1 V to 0.5 V. In the prototype system, the power supply is provided off-chip and adjusted to meet frequency constraints of the given application. The user can clock- and power-gate unused modules by writing to control registers in software. Power gating of 15 domains is achieved with on-chip, high-$V_T$ switches controlled by the power management unit.
To illustrate the potential impact of signal processing accelerators, cycle count profiling results are presented in Section II on several biomedical applications. Section III to Section V describe the implemented accelerators and several architectural optimizations to reduce power, decrease cycle count, and improve usability. A glitch reduction scheme in the system SRAM is described in Section VI. Finally, test-chip measurement results in Section VII quantify the energy savings afforded by the accelerators in two applications.
II. DIGITAL SIGNAL PROCESSING IN BIOMEDICAL APPLICATIONS
A survey of biomedical signal processing algorithms in the literature reveals several common operations that can benefit from hardware acceleration. Filtering is a prevalent task since noise in the acquired signals must be removed. The Fast Fourier Transform (FFT) is employed to analyze the frequency content of various physiological signals, for example in [9], [10]. In addition, many applications make use of adaptive thresholds [11], which often require the largest or smallest sample in a window of data. Similarly, median filtering, which is helpful for removing noise spikes without degrading signal edges (unlike a low pass filter), involves finding the median of a window of data. Many algorithms, such as [12] and [13], utilize mathematical functions that typically involve expensive software emulation in fixed point CPUs.
0018-9200/$26.00 © 2011 IEEE
To understand what proportion of clock cycles are spent on signal processing on a general-purpose processor, we profiled several open-source applications on the 16-bit CPU core in the test-chip of Fig. 1 , which is compatible with a commercial microcontroller instruction set [8] . The code was compiled and run on the CPU (supported by a multiplier but no accelerators) in cycle-accurate simulations. We then recorded the number of cycles spent on executing different portions of the algorithms.
The cycle breakdowns of algorithms for EKG arrhythmia classification [14], pulse-oximetry [15], and heart sound processing [10], [16] are shown in Fig. 2. It should be noted that the code was not hand-optimized for the execution platform (for example, the heart sound application was in floating point), but default compiler optimizations were enabled. The arrhythmia classification algorithm involves many control tasks that are suited to a general-purpose CPU (e.g. loops, comparisons, searches). However, many clock cycles are spent on matching an incoming heart beat against templates. Part of this pattern matching involves computing the sum of absolute differences (SAD). It is possible to replace SAD with matched filters, which can then be mapped to a hardware FIR module. FIR filtering contributes the majority of clock cycles in the pulse-oximetry application as well. In the third algorithm, heart sounds are transformed into the frequency domain, processed, then transformed back into the time domain. Consequently, the FFT/IFFT is a key component consuming roughly half of the total cycle count.
The profiling results illustrate that on a low-power processor, operations such as filtering and FFT can dominate the cycle count in an application. Accordingly, we extend a microcontroller core with accelerators targeted for biomedical signal processing. Based on analysis and profiling of the algorithms of interest, we include four accelerators in our processor: an FFT module, a Coordinate Rotation Digital Computer (CORDIC) engine, an FIR filter, and a median filter.
III. FFT ACCELERATOR
The FFT accelerator pictured in Fig. 3 uses a serial architecture, chosen for area reasons, and our implementation is based on [1]. Our contribution is a different control scheme to reduce switching activity and thus the FFT power.
A. Power Optimization: Switching Activity Reduction
In this serial architecture, the datapath computes one butterfly per clock cycle, and hence the control logic must manage a sequence of butterflies to complete an FFT. Various control schemes have been proposed (e.g. [17] , [18] ), but here we focus on reducing the datapath switching activity and power. Specifically, we present a scheme that orders the butterflies according to two constraints:
• In each iteration of an FFT, butterflies with the same twiddle factor are performed consecutively.
• On every clock cycle, two inputs are read from memory and two outputs are written back. Reads and writes should occur on different memory banks. This allows the use of single-port SRAMs that are smaller and simpler than multiport memories. The memory banking structure is detailed below.
The proposed ordering is shown conceptually in Fig. 3(b) for the second iteration of an 8-point FFT. A straightforward ordering would process the butterflies in sequence (from top to bottom), causing the twiddle factor to toggle between $W_8^0$ and $W_8^2$. To reduce the switching activity, we instead process the butterflies associated with $W_8^0$, then the ones with $W_8^2$. We generalize this sequencing for all iterations of the FFT and different FFT sizes.
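The effect of the first constraint can be illustrated with a short sketch. It assumes an in-place radix-2 decimation-in-time indexing, which the text does not spell out, and the helper names (`butterflies`, `group_by_twiddle`, `twiddle_changes`) are hypothetical. It only counts how often the twiddle exponent changes between consecutive butterflies and does not model the memory banking constraint.

```python
def butterflies(n, stage):
    """Butterflies of one radix-2 stage, listed top to bottom.

    Each entry is (upper_index, lower_index, twiddle_exponent), assuming
    in-place decimation-in-time indexing (an assumption of this sketch).
    """
    half = 1 << stage                 # distance between the two inputs
    step = n // (2 * half)            # twiddle exponent increment
    order = []
    for group_start in range(0, n, 2 * half):
        for j in range(half):
            order.append((group_start + j, group_start + j + half, j * step))
    return order


def group_by_twiddle(order):
    """Reorder so butterflies sharing a twiddle factor run consecutively."""
    # sorted() is stable, so butterflies keep their relative order per twiddle
    return sorted(order, key=lambda b: b[2])


def twiddle_changes(order):
    """Count cycles on which the twiddle exponent differs from the previous one."""
    exps = [b[2] for b in order]
    return sum(1 for a, b in zip(exps, exps[1:]) if a != b)


natural = butterflies(8, 1)           # second iteration of an 8-point FFT
reordered = group_by_twiddle(natural)
print(twiddle_changes(natural), twiddle_changes(reordered))
```

For the second iteration of an 8-point FFT, the top-to-bottom order alternates between the two twiddle exponents on every butterfly, while the grouped order switches only once.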
To realize the sequencing in hardware, we construct the control logic to generate the memory addresses of the two inputs to the butterfly, as well as the address of the twiddle factor. All three addresses are derived from a butterfly counter. In all but the last iteration of an N-point FFT, they are generated by (1)-(3), which are built from two operations on the counter bits: concatenation of bit fields (including individual bits such as bit 0 of the counter), and left rotation of a fixed-width number by a given number of bits.
In the last iteration, the first input address is generated with a Gray code counter, while the second is the same address but with the MSB set to 1. The above scheme lends itself to a compact circuit realization as shown in Fig. 4. The butterfly counter drives the address generation, while the rotate operations are implemented by two bit-rotators that support variable widths in order to handle different FFT sizes. A Gray code counter is also used during the post-processing step in a real-valued FFT [19].
The above scheme also helps prevent memory hazards when the local FFT memory is divided into four banks according to the parity and MSB of the addresses [19]. As seen from (2) and (3), the two inputs to a butterfly have different parity and hence will be stored in different SRAM banks. Moreover, on two consecutive clock cycles, the MSBs of the input addresses will be different (because bit 0 of the butterfly counter toggles), so reading the current inputs and writing the previous outputs will also occur on different banks.
B. Power Comparison
Fig. 5(a) shows the simulated waveform of the proposed FFT against a reference design. The reference design has the same datapath and memory structure as the proposed FFT, but uses a straightforward butterfly ordering without switching activity reduction, thus allowing us to quantify the impact of the control logic. The proposed control scheme greatly reduces the switching activity of the twiddle factor signals, especially in the early iterations when only a few distinct twiddle factors are in use. This is reflected in Fig. 5(b), which compares the simulated power of the reference design with the proposed accelerator, including wiring parasitics extracted from layout. The proposed control scheme reduces the datapath power by 50%, leading to an overall power reduction of 29% in a 128-point complex-valued FFT.
IV. CORDIC ENGINE
Biomedical applications employ mathematical functions that must be emulated in software on low-power processors, often requiring several thousand clock cycles to complete. Fortunately, the CORDIC algorithm [20] , [21] computes these functions in a few clock cycles at relatively low hardware cost. However, the classic algorithm has several shortcomings which restrict its usefulness in some applications of interest. In this section we describe these shortcomings and address them with a modified CORDIC architecture.
A. Overview of CORDIC
To aid understanding of the hardware architecture and its limitations, we first give an overview of the classic CORDIC algorithm [20] , [21] . For brevity we omit some details, but more thorough descriptions can be found in the literature.
The left portion of Fig. 6 illustrates how $\sin\theta$ and $\cos\theta$ can be computed with CORDIC. We begin with a unit vector and rotate it successively by $\pm\alpha_i$ until it makes the desired angle $\theta$ with the x-axis. By restricting $\alpha_i$ such that $\tan\alpha_i$ equals $2^{-i}$, the rotation can be realized by shifting and addition. To determine the direction of rotation, an angle accumulator is first initialized with $\theta$, then each rotation is taken in the direction that decreases the magnitude of the angle accumulator. Mapping this algorithm onto hardware leads to the classic architecture consisting of the solid shaded blocks in Fig. 6, where two registers store the x- and y-components of the current vector, and a third register is the angle accumulator.
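The rotation-mode iteration described above can be sketched in a few lines. This is an illustrative floating point model, not the fixed point hardware datapath: the function name `cordic_sin_cos` and the choice of 16 iterations are assumptions, and the starting vector is pre-scaled by the inverse CORDIC gain (a common variant) so that no final correction is needed.

```python
import math


def cordic_sin_cos(theta, iterations=16):
    """Rotation-mode CORDIC in the circular coordinate system.

    Converges for |theta| up to roughly 1.74 rad (the sum of all
    elementary angles). Multiplications by 2**-i stand in for the
    right shifts used in hardware.
    """
    # Elementary angles alpha_i with tan(alpha_i) = 2**-i (the lookup table)
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; pre-scale to compensate
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, theta   # z is the angle accumulator
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0    # rotate so that |z| shrinks
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x                        # (sin(theta), cos(theta))


s, c = cordic_sin_cos(0.5)
print(s, c)
```

Each extra iteration contributes roughly one more bit of angular resolution, which is why the number of iterations and the datapath width are usually chosen together.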
The above summarizes the rotation mode of operation of CORDIC in the circular coordinate system. Reversing this process gives rise to the vectoring mode of operation. Similar operations can be performed in three different coordinate systems (circular, linear, hyperbolic) [21], where the angle accumulator is adjusted by values stored in three different lookup tables (Fig. 6). Many CORDIC designs operate in the circular coordinate system to compute sin, cos, etc. However, the accelerator in this work operates in all three coordinate systems in order to support a wider variety of functions, such as the division and square root used in Section VII-C.
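For comparison with the rotation mode, a minimal sketch of the vectoring mode in the circular coordinate system is given here. It is a floating point illustration only: the name `cordic_vectoring`, the 16 iterations, and the restriction to inputs with a positive x-component are assumptions of this sketch rather than properties of the accelerator.

```python
import math


def cordic_vectoring(x, y, iterations=16):
    """Vectoring-mode CORDIC: rotate (x, y) onto the x-axis.

    Returns (magnitude, angle) for inputs with x > 0. The direction of
    each micro-rotation is chosen to drive y toward zero, and the
    accumulated angle z converges to atan(y / x).
    """
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(iterations))
    z = 0.0
    for i in range(iterations):
        d = -1.0 if y >= 0 else 1.0    # rotate toward the x-axis
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x / gain, z                 # undo the CORDIC gain on the magnitude


mag, ang = cordic_vectoring(3.0, 4.0)
print(mag, ang)
```

The same shift-and-add loop serves both modes; only the quantity that steers the rotation direction changes (the angle accumulator in rotation mode, the y-component in vectoring mode).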
B. Extending Input Range
In conventional CORDIC, the range of inputs for which CORDIC gives valid results is extremely limited for certain functions, as reported in the second column of Table I . Since many previous implementations of CORDIC focus on trigonometric functions, this issue has not received much attention. In this work we propose architectural improvements to implement an extended algorithm described in [22] .
Intuitively, the valid input range is limited because the angle accumulator in Fig. 6 can only be increased/decreased by values in a finite-size lookup table. The algorithm in [22] solves this by adding several large entries to the lookup table which satisfy specific properties. If the input argument is large, these large entries are subtracted from the angle accumulator in several extra iterations, in order to quickly reduce the input down to the range supported by conventional CORDIC.
Since we are implementing a fixed point hardware engine, we must address several numerical precision issues not discussed in the theoretical work. In particular, we only perform extra iterations as needed based on the magnitude of the input. This is because unnecessary iterations will degrade numerical precision. Therefore, before starting the conventional CORDIC iterations we compare the input against fixed ranges to determine how many extra steps are needed. The lookup table labeled "8 constants" and the dotted multiplexers in Fig. 6 are added for this purpose. If needed, the extra steps (as per [22]) involve adding/subtracting the x and y registers with copies of themselves shifted to the right, and subtracting lookup table values from the angle accumulator. Since these steps occur before the conventional CORDIC iterations, existing arithmetic blocks can be reused.
C. Reducing Quantization Error
Although the CORDIC algorithm can compute functions far more quickly than floating point emulation in software, it is affected by two inherent sources of quantization errors [23]. Further, our CORDIC engine supports operations in the hyperbolic coordinate system, in which the intermediate datapath values have a large dynamic range, exacerbating quantization issues.
In this work, we improve the accuracy by:
1) widening the datapath and increasing the number of iterations
2) dynamically adjusting the binary point, shifting right for overflow and shifting left to increase precision
3) computing $e^x$ with an alternate method
We performed extensive simulations to quantify the impact of these techniques and to characterize the quantization error in the CORDIC accelerator. First, Fig. 7(a) plots the RMS error introduced by the CORDIC algorithm, where error is the difference between the CORDIC output and the floating point result. Widening the fixed point datapath reduces the RMS error exponentially at the cost of a linear increase in energy/cycle. In this design we employ a 24-bit datapath to achieve good accuracy in the hyperbolic coordinate system. Second, the previous section described a technique to extend the input range, which unfortunately increases the dynamic range of the intermediate datapath values and thus the quantization error. Binary point adjustment is particularly effective in these cases. To illustrate, Fig. 8(a) and Fig. 8(b) plot the relative error versus the input value when the function is computed without and with binary point adjustment respectively. This can be implemented with very small hardware cost, as indicated by the components in Fig. 6 with a zigzag pattern. Table II summarizes the RMS error improvements in different modes of operation.
Lastly, CORDIC typically provides $e^x$ by computing $\cosh x + \sinh x$. However, when $x$ is negative with large magnitude, this amounts to adding two large numbers of opposite signs to obtain a small number, which results in large inaccuracies in a fixed point system. We solve this by computing $e^{-x}$ and then inverting the result, also with CORDIC. The associated input range and error are shown in Fig. 7(b), and are important in biomedical applications such as [12] that require this function.
D. Cycle Count Improvement
After addressing two main limitations of conventional CORDIC, it becomes practical to use the CORDIC engine for computing a variety of functions, leading to cycle count savings. Table III compares the cycle count of several functions based on software emulation in 32-bit floating point versus using the accelerator. Note that the accelerator computes in fixed point while software emulation is in floating point. However, in ambulatory monitoring applications where the acquired data has limited resolution, floating point precision may not be necessary. As an example, we will show an EKG algorithm in Section VII-C in which fixed point CORDIC computations do not affect the end result. 
V. ACCELERATORS FOR FILTERING

Since filtering is critical to many applications, this platform includes FIR and median filter accelerators.
A. FIR Filter
The FIR filter utilizes a serial architecture with one multiply-accumulate (MAC) unit. One pair of data and coefficient is provided to the multiply-accumulate unit per clock cycle. Because the sequence of multiplications is managed by the control logic rather than hard-wired as in a parallel filter, this architecture offers the flexibility to optimize cycle count in special cases. In addition, it can support high-order filters with much smaller area than a parallel implementation.
1) Local Memory:
To reduce expensive accesses to the system SRAM, the FIR filter contains two 32-word, 16-bit local memories to store coefficients and data. Due to their relatively small size, the local memories are implemented as synthesized register files. However, instead of edge-triggered flip-flops, we employ level-sensitive latches as storage elements, with approximately half the size of flip-flops and half the clock loading. Although the use of latches presents a potential race condition which must be carefully avoided and verified, latches help reduce the overall FIR power by 31% compared to a flip-flop-based memory, as seen from simulation with extracted parasitics. A power breakdown in Fig. 9 contrasts the two designs.
2) Cycle Count Improvement:
Since FIR filtering is widely used, we include several features to reduce the cycle count of some special cases. Specifically, the FIR accelerator supports:
• symmetric/anti-symmetric filters [24]
• high-order filters (up to 128 taps) with more coefficients than the size of the local memory. 32 data samples and coefficients are stored in local memory and the remainder are fetched from the system SRAM. This leads to cycle count savings compared to a software implementation since the accesses to local memory take only one clock cycle, while software accessing system SRAM takes at least two cycles.
• storing two separate filters in local memory and context switching between them
• multiplication of an input sequence by a window function
• efficient decimation-by-2 filtering as described below
Polyphase decomposition is a well-known technique to efficiently implement decimation filters, reducing the number of multiplications by the decimation factor as compared to a straightforward implementation [24]. In the serial FIR architecture that computes 1 MAC per clock cycle, the cycle count can be reduced by the same factor. In Fig. 10 we demonstrate that this can be mapped to a serial FIR architecture by storing the even and odd coefficients as two different filters in local memory. The control logic is then designed to alternate between the two filters at every input sample, and add the outputs of the odd and even filters to get one output sample. The FIR accelerator implements this for decimation-by-2 but the same concept can be extended to decimation by larger factors.
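The equivalence behind the decimation-by-2 mapping can be checked with a small software model. This is a sketch under assumptions, not the accelerator's control logic: the function names are hypothetical, the filter starts from a zero initial state, and plain Python lists stand in for the local memories holding the even and odd coefficient sets.

```python
def fir(x, h):
    """Direct-form FIR with zero initial state: y[n] = sum_k h[k] * x[n-k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def decimate2_direct(x, h):
    """Filter at the input rate, then discard every other output."""
    return fir(x, h)[::2]


def decimate2_polyphase(x, h):
    """Split coefficients into even and odd filters and sum their outputs.

    y[2m] = sum_j h[2j] * x[2(m-j)] + sum_j h[2j+1] * x[2(m-j)-1]
    """
    h_even, h_odd = h[0::2], h[1::2]
    x_even = x[0::2]                   # samples x[2m]
    x_odd = [0] + x[1::2]              # samples x[2m-1], with x[-1] = 0
    y_even = fir(x_even, h_even)
    y_odd = fir(x_odd, h_odd)
    return [y_even[m] + y_odd[m] for m in range(len(x_even))]


x = list(range(1, 11))
h = [1, 2, 3, 4, 5]
print(decimate2_direct(x, h))
print(decimate2_polyphase(x, h))
```

Both functions produce the same output sequence, but the polyphase form evaluates each coefficient only for the outputs that are kept, which is where the factor-of-two reduction in multiply-accumulate operations comes from.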
B. Median Filter
The median filter provides the median, minimum and maximum of a sliding window of data. In the context of biomedical applications, this is useful in setting adaptive thresholds or removing noise spikes without degrading the edges of a signal. The median accelerator is based on [25] and contains a bank of registers to keep a window of data in sorted order. When a new sample arrives, the oldest sample is removed from the window, then the new sample is compared with all entries in the window (in parallel) to find the correct position for insertion. Therefore, the median is simply the sample located at the middle of the window. The minimum and maximum are also readily found as the samples located at the first and last positions. The accelerator accommodates variable window sizes from 3 to 65. While the design in [25] employs two comparators per sample in the window, the median filter in this work shares one comparator between four samples to reduce area.
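The sorted-window behavior described above can be modeled in software as follows. This is a behavioral sketch only: the class name `SlidingMedian` is hypothetical, and a binary search replaces the parallel comparators of the hardware, so it reproduces the outputs but not the timing or structure of the accelerator.

```python
from bisect import bisect_left, insort
from collections import deque


class SlidingMedian:
    """Keeps a sliding window in sorted order, as the register bank does."""

    def __init__(self, window):
        self.window = window
        self.fifo = deque()            # arrival order, to find the oldest sample
        self.sorted = []               # same samples, kept sorted

    def push(self, sample):
        """Insert a sample and return (minimum, median, maximum)."""
        if len(self.fifo) == self.window:
            oldest = self.fifo.popleft()
            # Remove one copy of the oldest sample from the sorted window
            del self.sorted[bisect_left(self.sorted, oldest)]
        self.fifo.append(sample)
        insort(self.sorted, sample)    # find the insertion position
        middle = len(self.sorted) // 2
        return self.sorted[0], self.sorted[middle], self.sorted[-1]


filt = SlidingMedian(3)
for value in [5, 1, 9, 3, 7]:
    print(filt.push(value))
```

Because the window is always sorted, the minimum, median, and maximum are read directly from fixed positions, with no per-output search.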
VI. SRAM WITH GLITCH REDUCTION
This processor employs a 0.5 V-1.0 V SRAM which operates across the same voltage range as the logic, to avoid the delay and power overhead of level conversion between logic and memory. At 0.5 V, the conventional 6-T bit-cell is severely affected by process variation, making reliable reading and writing difficult [26] , [27] . The SRAM in this processor is based on an 8-transistor design in [27] . We add a self-timed mechanism to reduce glitches on the data bus connecting 16 SRAM blocks. Self-timing for this purpose has been shown before (e.g. [28] ) but here we consider leakage overhead and add minimal logic to prevent most glitches in the average case, rather than attempt to remove all glitches at high leakage cost.
The top right corner of Fig. 11 shows the top-level organization of a 64 kb SRAM macro. 16 bit-cell sub-blocks of 64 rows by 64 columns are connected via a tri-state data bus used for both reading and writing. During a read operation, the accessed sub-block drives the value being read via the bus to the top-level output buffers. In low-voltage SRAMs, the energy to drive the bus can be particularly large for two reasons. First, full-swing signaling is employed for increased robustness instead of the low-swing schemes found in above-threshold SRAMs. Second, the area density of low-voltage SRAMs tends to be lower, because having fewer bit-cells in a column helps improve read speed and relieve bitline leakage. However, the data bus must now span a larger physical area for the same memory capacity.
The SRAM column read circuit is shown in the top left of Fig. 11. During a read, the sense-amplifier (SA) in the column periphery compares the read bitline against a reference voltage. The result is then stored in an SR latch and driven onto the data bus by tri-state drivers. Because of the SR latch, the data bus does not change until the next read cycle provided that the next read occurs in the same sub-block. On the other hand, when the next read occurs in another sub-block, this new sub-block drives data previously held in its SR latches before its SAs resolve, thus causing glitches on the data bus. This is illustrated in Fig. 12. From the software point of view, consecutive memory reads to different sub-blocks can occur often, for example when the program instructions and data are stored in two different blocks of memory, or when the program is accessing the software stack.
To prevent glitches, we can wait until each differential SA has resolved before enabling the tri-state drivers. Due to variation, each SA takes a different time to resolve, which can be detected by taking the XOR of the differential SA outputs. However, adding logic to every column (the XOR plus an AND to enable the tri-state driver) imposes significant leakage overhead. When the extra leakage is integrated over several clock cycles at 0.5 V, it will overcome any glitch energy savings.
Since glitches do not affect SRAM functionality, we can instead consider the average case. We add logic to two SAs and wait until both have resolved before enabling all 64 drivers in the block. The resulting savings can be computed as follows: label the SAs in a block with integers from 1 to 64, with 1 being the fastest (due to local variation). Then, randomly draw two integers without replacement; the expected value of their maximum is also the expected number of bits protected from glitches. Let $M$ be this maximum; then $E[M]$ is given by $\sum_{m=2}^{64} m \cdot \frac{m-1}{\binom{64}{2}} = \frac{2 \cdot 65}{3} \approx 43.3$. On average this prevents glitches on 43 out of 64 bits on the data bus with much less leakage overhead, as listed in Fig. 11.
Fig. 11. Low-leakage self-timing scheme to reduce glitching on data bus, with simulated glitch reduction and leakage overhead for a 128 kb SRAM.
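The expected-maximum argument can be verified by direct enumeration. The sketch below is a check of the arithmetic only, with a hypothetical function name; it assumes, as the text does, that the two monitored SAs are equally likely to be any pair among the 64.

```python
from fractions import Fraction
from itertools import combinations


def expected_max_of_two(n):
    """Exact expected maximum of two distinct integers drawn from 1..n."""
    pairs = list(combinations(range(1, n + 1), 2))
    return Fraction(sum(max(pair) for pair in pairs), len(pairs))


expected = expected_max_of_two(64)
print(expected, float(expected))
```

The enumeration agrees with the closed form 2(n + 1)/3, which rounds to the 43 protected bits quoted for n = 64.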
VII. MEASUREMENT AND APPLICATION RESULTS
A. Test-Chip Measurements
The processing platform pictured in Fig. 1 was fabricated in a low-leakage 0.13 μm CMOS process. The core comprises the CPU together with the DMA and software debug support logic. The core energy decreases monotonically with the supply voltage, reflecting its relatively high active energy component relative to leakage. The energy per access of the SRAM (averaged over reading and writing) reaches a minimum at an intermediate supply voltage and increases at lower voltages, which is consistent with its low activity factor [29].
Importantly, since the CPU is compatible with a commercial instruction set, we can use the platform to quantify how the accelerators reduce energy for signal processing compared to conventional microcontrollers. Therefore we measure the energy required for common tasks in two ways: 1) specified in C software, compiled and executed on the CPU and hardware multiplier, and 2) programmed and executed on the hardware accelerator. In the latter case, the energy of transferring data to and from the accelerators is included. Silicon measurements in Table IV show that the accelerators provide 133× to 215× energy savings in the listed processing tasks. The accelerator gate counts are also reported to show the area/performance trade-off.
As noted in Section I, other low power processors with accelerators have been demonstrated (e.g. [6] with a 32-bit Cortex-M3 core (0.13 μm) and [1] with a 16-bit CPU supporting a custom low-power instruction set (90 nm)). Due to major differences between the CPU architectures, it is difficult to fairly compare the savings afforded by accelerators over the CPU. However, the accelerators themselves can be contrasted more easily. The FFT accelerator energy in this work compares favorably to those in [6] and [1]. The FFT in [6] consumes 100 nJ for a 256-point real-valued transform with 16-bit data, while in [1] this requires 82 nJ (using data from [30]). The FFT in this work consumes 63 nJ due to higher parallelism than [6] and more aggressive voltage scaling than [1], enabled by the low-voltage SRAM.
B. EEG Feature Extraction Application
Since an application involves much more than an FIR filter or an FFT, we also quantify the impact of accelerators in the context of two applications. The first is the feature extraction stage of a machine learning algorithm for real-time epileptic seizure detection [31] . The feature extraction stage involves estimating the energy in seven frequency bands of each EEG channel over 2-second windows. As illustrated in Fig. 15(a) , this is achieved with an FIR filter bank followed by magnitude summation; results are then used in a classification stage to detect seizures. Fig. 15(b) plots one EEG input channel and the feature extraction results.
The shaded blocks in Fig. 15(a) indicate portions of the algorithm, namely the decimation filter and the modulated filter bank, that can take advantage of the FIR accelerator. Note that the filters consist of 62 and 39 taps respectively, larger than the 32-word local FIR memory. Consequently, it would not be possible to use the FIR accelerator for this application without the feature for high order filters (Section V-A).
Since the accelerated version finishes computation in fewer clock cycles, the platform can operate at a lower supply voltage while achieving the same latency as the CPU-based version. The combination of lower cycle count and lower energy per cycle contributes to 10.2× energy savings overall in the accelerated version. The energy breakdown of the accelerated application is illustrated in Fig. 16(a). The FIR filtering category includes the energy consumed by the FIR accelerator and by the transfer of data/coefficients to the multiply-accumulate unit as it is executing one filter. Because the algorithm involves eight filters while the test-chip contains one FIR accelerator, the filters are executed one at a time on a 2-second block of data. Switching between filters involves saving the current filter state from the FIR local memory to SRAM, then loading the saved state of the next filter into the FIR local memory. The required energy is captured in the Other category of Fig. 16(a). In this application, the FIR accelerator is effective in enabling real-time processing and decreasing the energy spent on filtering. The addition of a second FIR accelerator would help reduce the data transfer energy spent on context switching between filters.
C. EKG Processing
The second algorithm [13] finds the onset and duration of a feature in the EKG called the QRS complex, which often appears as a pulse in the EKG and is therefore often used for heartbeat detection. The algorithm is available as open source software from [32]. While many QRS detectors in the literature find the peak of the R wave, the work in [13] does more processing to locate the beginning of the QRS complex, which can improve the accuracy of heart rate variability analysis. In addition, it finds the duration of the QRS complex, a helpful feature for beat classification.
The algorithm is illustrated in Fig. 17(a), where the EKG is first low pass filtered, then its arc length is computed over a sliding window, which accentuates the long excursions that are typical of a QRS complex. Next, the algorithm compares the transformed signal against an adaptive threshold and searches near the crossing point to locate the start and end of the QRS complex.
The shaded blocks in Fig. 17 (a) indicate steps that can leverage the accelerators. The arc length transform requires integer division (for scaling) and the square root which can both be performed with CORDIC. The last stage uses the minimum and maximum value within a portion of the transformed signal, which can be found with the median filter. Shown in Fig. 17(b) are segments of two EKG records from the MIT/BIH Arrhythmia Database [32] and the QRS start and end points as computed by the test-chip.
This application revealed some important observations about the CORDIC accelerator. First, it would be impossible to compute the arc length transform with a conventional CORDIC design due to its limited input range. Only by improving the input range, as discussed in Section IV-B, were we able to utilize the CORDIC engine. Second, although the CORDIC engine computes in fixed point, its accuracy is sufficient for this application. The implementation employing CORDIC for both division and the square root gave final results (the QRS onset and duration) identical to a floating point C version.
As before, we compare the energy to complete this algorithm with and without the accelerators. The accelerators complete key steps in fewer clock cycles, allowing us to reduce the supply voltage from 1 V to 0.7 V and lower the clock frequency while meeting latency constraints. The use of accelerators provides 11.5× energy savings over the entire application. Fig. 16(b) plots the measured energy breakdown as the algorithm (using accelerators) processes one heart beat. Here the arc length transform is updated for every input sample, implying that the square root is calculated for each input as well. Consequently, the square root function contributes 24% of the total energy even when it is computed efficiently by the CORDIC engine. In the local search phase, the median filter is used to find the maximum of a signal. However, the local search occurs once per heart beat, and thus the overall energy impact of the median filter is small in this particular application.
The platform's versatility allows us to implement these two algorithms from different domains while leveraging the accelerators to save energy. As summarized in Table V, the accelerators reduce the total energy to complete the EEG and EKG applications by 10.2× and 11.5× respectively.
TABLE V. MEASURED ENERGY SAVINGS PROVIDED BY ACCELERATORS IN TWO APPLICATIONS
VIII. CONCLUSION
This paper presented a flexible, energy-efficient processor for medical monitoring devices that can be applied across multiple application domains. While a wide variety of algorithms has been developed for medical monitoring, many employ common signal processing operations. Therefore, the processor features accelerators for signal processing to reduce both cycle count and energy consumption. For further energy savings, we demonstrate two techniques to decrease switching activity: reordering computations in an FFT to lower switching activity in the datapath, and using a low-leakage self-timed approach to remove glitches on the SRAM data bus. To improve flexibility, we design the accelerator architectures such that the datapaths can be leveraged for various special cases. For example, changes to the CORDIC engine allow it to support a wider range of input values with improved accuracy. Such optimizations proved useful in mapping two published EEG and EKG algorithms onto the processor, where accelerators and voltage scaling contribute to order-of-magnitude energy savings compared to a conventional microcontroller.
