A 65 nm CMOS integrated circuit implementation of a bio-physiological signal compression device is presented, reporting exceptionally low power, and extremely low silicon area cost, relative to state-of-the-art. A novel 'xor-log2-sub-band' data compression scheme is evaluated, achieving modest compression, but with very low resource cost. With the intent to design the 'simplest useful compression algorithm', the outcome is demonstrated to be very favourable where power must be saved by trading off compression effort against data storage capacity, or data transmission power, even where more complex algorithms can deliver higher compression ratios. A VLSI design and fabricated Integrated Circuit implementation are presented, and estimated performance gains and efficiency measures for various bio-medical use-cases are given. Power costs as low as 1.2 pJ per sample-bit are suggested for a 10 kSa/s data-rate, whilst utilizing a power-gating scenario, and dropping to 250 fJ/bit at continuous conversion data-rates of 5 MSa/sec. This is achieved with a diminutive circuit area of 155 µm². Both power and area appear to be state-of-the-art in terms of compression versus resource cost, and this yields benefit for system optimization.
I. INTRODUCTION
Data acquisition systems, in the domain of continuous time-varying signals, are increasingly operated under highly constrained resource limitations. This is especially so in the field of wearable devices, remote and self-powered sensors, Internet of Things (IoT), body-sensor networks, and in bio-medical applications. In such systems, power constraints may well demand reductions in data transfer and temporary local storage capacity, not least because this can be a critical factor in extending battery-limited operating times [1]-[3].
It is important to make a distinction between temporary local storage (in a non-volatile recording device memory, for instance), as compared to offline storage (for example, after transfer of data to some processing center or base station). In the case of offline storage, one may require data compression for reasons of practicality of data volume, and power is not necessarily a primary concern. But, where transmission requires a wired or wireless link, and/or data requires temporary storage in a device such as flash-memory, every bit of data has an associated power cost, and every bit that can be eliminated via compression will potentially save power, given appropriate conditions. Therefore, where a power reduction can be gained, by virtue of compression of the data to be stored or transmitted, this can extend operating times, improve data-rates, and make possible systems that would otherwise not meet power targets. This is where data compression can offer significant opportunity: in principle, if a data compression element consumes less power than that saved by the reduction in storage or transmission power otherwise consumed, then there is potential to realize useful power optimizations for the overall system. This tradeoff is therefore a function of compression ratio (CR) and power cost. Whilst it may seem counter-intuitive, it can be shown that the system with the highest CR is not necessarily the best overall solution: a point well-demonstrated by this paper.
Although chip area appears to be an ever increasing resource in standard integrated circuit (IC) design, the silicon area cost of such a design, and therefore any sub-component such as a data compressor, is still of significant concern. Where techniques such as printed organic semiconductor circuits are concerned, the area-cost constraint is even more demanding [4], [5], and such technologies will surely become more evident in wearable device implementations in the future. Either way, saving circuit area for other useful functionality is always desirable.
Whereas state-of-the-art design philosophy in this field often leads down the path of achieving maximum possible compression rates, with or without information loss, this typically comes at a high cost in terms of hardware area and power. The best compression ratio is not necessarily the best compression outcome as a function of area or power consumption, and this means that there is an interest also in very simple circuits that achieve modestly desirable compression ratios. Such circuits may even be near-optimal in terms of compression of sample bits per picoJoule, even if they are not superior in raw compression ratio terms. Being able to measure these factors using suitable figures of merit (FOM) would also be a very useful tool of convenience with which to compare such candidate systems and their competitors. This paper presents a compression circuit meeting those expectations of very low power and area cost. This circuit was fabricated as a 65 nm CMOS test IC (see footnote 1), a sample of which is shown in a packaged prototype glass-lid format, in Fig. 1. Using a novel algorithm, 'Log2-Sub-Band' Encoding (L2SB), the authors are able to achieve modest compression ratios with meagre transistor counts, and ultra-low power per bit.
After briefly considering some common approaches to compression in this field of application, this paper will introduce 'Log2 Sub-band encoding', and describe its fabrication on a 65 nm CMOS test-chip, subsequent bench testing, and consideration of the implications for compression efficiency using this design. Following this, a performance comparison is made between some alternative compression algorithms, with suitable data-sets, including EEG, ECG, and MEG data-sets. Power savings and area trade-offs are estimated for each case, based on the core power and area data obtained from the initial evaluations presented. The authors introduce several FOM measures to support this comparison process. It is concluded that the power efficiency of the presented circuit is highly competitive for pJ/bit and area cost, both of which are extremely low. Projections are made for storage and transmission power trade-offs achievable by using various compression scenarios. In spite of the presented circuit's inherent simplicity, it is concluded that L2SB could be a favourable choice for future system designs.
II. A BRIEF OVERVIEW OF COMPRESSION OPTIONS
Compression algorithms in the domain of acquisition of continuous time-varying signals, particularly those associated with bio-physiological monitoring, seek to exploit the fact that most signal content consists of long-term low frequency variations and shorter term and smaller scale fluctuations on a more local scale. These signal transitions are rarely extreme in nature. Consequently, adjacent samples are often numerically close together with respect to the full scale signal range possible from one sample to the next. Whereas a signal range might occupy 12 bits, the state-change between two successive samples may typically cover only a few of the least-significant bits. To achieve some form of compression of this information, the simplest possible option might be to encode the difference between adjacent samples, also known as DPCM, or Differential Pulse Code Modulation, a technique which has already been in effective use for a very considerable time [6]. If some method is then employed to transmit only the necessary bits representative of the change, then compression would be achieved. There are a variety of ways in which this might be obtained.
Footnote 1: The NOMAD IC was fabricated as part of the NOMAD project, funded by UK Gov. Innovate-UK Grant REF 26172-182148.
In some systems, it is acceptable to use lossy compression to reduce real information content by approximating features of the signal of interest [7]-[9], or to utilize complex hardware to deliver highly domain-specific compression rates [10], [11]. In some cases, techniques exploit relationships between channels in multi-channel recording tasks [12], and any of these methods may be targeted specifically at real-time mobile data acquisition scenarios [13]. However, lossless compression is often an essential requirement, especially in safety-critical and biomedical domains. Again, approaches vary in complexity. Simple Huffman code-table approaches are widely used, whereas some other techniques employ predictive techniques to reduce the data needing to be transferred whilst retaining lossless characteristics [14], [15]. This leaves the question of complexity to deal with: a complex circuit may well reduce data transmission power significantly, but if it consumes a lot of power in doing so, then this may optimize data volume more than power. In particular, where low data-rates are used, of the order of thousands of samples per second, and indeed sometimes hundreds of samples per second, it may well be observed that the dynamic power of the compressor is rather low, but in contrast, the static power (per sample) could become significant. Small, simple compression circuits could in theory deliver very low static and dynamic power in these scenarios, and offer valuable gains, even without achieving state-of-the-art compression ratios in themselves.
III. LOG2 SUBBAND ENCODING

A. Design Motivations
The motivation of the authors in developing Log2-SubBand Encoding (L2SB), was to derive an effective, and configurable, compression algorithm for bio-physiological signals with a minimal hardware footprint. A further consideration was that the compression algorithm should be lossless, ruling out a number of algorithms which are capable of high compression ratios (which deliver CR of the order of 5-fold to 20-fold). In any case, such algorithms often require fairly demanding and complex mathematical approaches which immediately create questions in terms of hardware constraints, especially in low-power and real-time scenarios.
In terms of hardware footprint, the authors have three specific concerns: area, static power, and dynamic power. These factors are all considered in the design of the L2SB encoding scheme, and this means that the compression ratio, in isolation, is not an overriding performance metric. Rather, the goal is to have moderate compression with minimal cost, and in doing so, demonstrate that a desirable gain in overall system power should be obtainable. Two Figure-of-Merit (FOM) performance metrics are also defined, which will be utilized later in the paper:
• Data Reduction per pJ (DR/pJ), which measures the reduction in data bits transmissible after compression, versus the pico-Joule cost of achieving that degree of compression.
• Data Reduction per µm² (DR/µm²), which measures the average circuit area utilized to achieve the reported data reduction in a given channel of interest.
Data reduction per pJ is calculated as the percentage of data bits saved by compression, divided by the energy consumed by each compression operation. Similarly, data reduction per µm² is calculated as the percentage reduction in data bits saved by compression, divided by the area cost attributed to the compression operation. Note the term 'attributed' is used carefully here: where a single circuit can serve multiple data channels, then the area attributable per compression operation is divided by the number of channels supported. If a circuit supports eight channels successfully at a chosen data-rate, and latency, then one eighth of its area is equitably attributable to each channel in terms of compression versus area efficiency.
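The two figures of merit can be sketched numerically as follows. This is a minimal illustration, assuming that the percentage data reduction is obtained from the compression ratio as DR = 100 × (1 − 1/CR); the function names and the example values (CR of 1.5, 10 pJ per operation, 160 µm² shared by eight channels) are hypothetical and are not results from this paper.

```python
def data_reduction_percent(cr: float) -> float:
    """Percentage of data bits saved for a given compression ratio (CR)."""
    return 100.0 * (1.0 - 1.0 / cr)


def dr_per_pj(cr: float, energy_pj_per_op: float) -> float:
    """DR/pJ: percentage data reduction divided by energy per compression operation."""
    return data_reduction_percent(cr) / energy_pj_per_op


def dr_per_um2(cr: float, area_um2: float, channels: int = 1) -> float:
    """DR/um^2: percentage data reduction divided by the area attributed to one channel."""
    attributed_area = area_um2 / channels  # area is shared equally between channels
    return data_reduction_percent(cr) / attributed_area


if __name__ == "__main__":
    # Hypothetical example values, chosen only to exercise the formulas.
    print(dr_per_pj(1.5, 10.0))
    print(dr_per_um2(1.5, 160.0, channels=8))
```

Dividing the area by the channel count before forming the ratio mirrors the 'attributed area' definition above, so a circuit shared by eight channels scores eight times better per channel than the same circuit serving one.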
B. Important Points to Note
It is important to note that data-rate, typically in units of kSa/sec, is not always equivalent to clock-rate, typically in the kHz range here. This is because some (though not all) implementations require multiple clock cycles per sample conversion. Also, whilst continuous operation can be envisaged for sample conversion at a given clock rate, a higher clock rate can allow interleaved periods of idle time. For example, a duty-cycle ratio of 10:1 would imply that the circuit spends 90% of its time idle (potentially in a sleep mode) and 10% of its time actively compressing sample words.
C. Log2 Amplitude Sub-Band Compression (L2SB)
The basic principle of L2SB is founded upon the idea of defining amplitude sub-bands, and then comparing the current sample word with the preceding sample, to detect changes between them. In theory, only the changes between samples need to be transmitted in order to convey the original information content, without loss of accuracy. However, in practice this is not easily achieved on a bit by bit basis. Instead, the L2SB encoder sub-divides a given sample word into multiple regions or 'sub-bands', each of which represents a part of the whole sample word. If changes are detected in one of these bands, it will imply that the new state of that particular band must be transmitted.
There are a large number of band permutations that are possible (as will be discussed later). The number of valid combinations increases as a function of sample bit width. In this paper, and for the case chosen for fabrication of the prototype integrated circuit, we choose a relatively straightforward case based upon a 12-bit sample word. This case is illustrated in Fig. 2 , where it can be seen that a 12-bit sample word is decomposed into three four-bit bands. In this case there is also a notional zero-width band, which represents the case where no bits change anywhere in the sample word. This results in four prefix-code and data-payload combinations, which are also shown, along with the total number of bits to be transmitted in each case.
Based on the relative frequencies of each band combination being utilized to encode data, one can see that compression ratio (CR) can vary from 0.85 to 6.00. Consider an example CR estimation as given in Table I . Here, the average contribution of bits per band combination sums to 8.4 bits in total, meaning that the CR in this case is 1.43 (12 ÷ 8.4), or a data reduction (DR) of approximately 30% of total bits representing successive samples in the original sample words.
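The CR estimate described above can be reproduced with a short calculation. The sketch below assumes the {4, 4, 4, 0} configuration with a fixed two-bit prefix, so the four codes cost 2, 6, 10 and 14 bits. The band frequencies used in the example are hypothetical values chosen to give an average of 8.4 bits; they are not the entries of Table I.

```python
PREFIX_BITS = 2
PAYLOAD_BITS = (0, 4, 8, 12)  # no change, one band, two bands, three bands
WORD_BITS = 12


def average_bits(frequencies):
    """Expected transmitted bits per sample for the given code frequencies."""
    return sum(p * (PREFIX_BITS + n) for p, n in zip(frequencies, PAYLOAD_BITS))


def compression_ratio(frequencies):
    """CR = raw bits per sample / expected encoded bits per sample."""
    return WORD_BITS / average_bits(frequencies)


if __name__ == "__main__":
    freqs = (0.1, 0.3, 0.5, 0.1)  # hypothetical relative frequencies, sum to 1
    avg = average_bits(freqs)
    cr = compression_ratio(freqs)
    print(f"average bits = {avg:.2f}, CR = {cr:.2f}, DR = {100 * (1 - 1 / cr):.0f}%")
    # Limiting cases: every sample unchanged, or every sample fully changed.
    print(compression_ratio((1, 0, 0, 0)), compression_ratio((0, 0, 0, 1)))
```

The two limiting cases correspond to the bounds quoted in the text: 12/2 = 6.00 when no bits ever change, and 12/14 ≈ 0.86 when all three bands are always transmitted.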
An important point needs to be noted about how bands may be combined. This is not an arbitrary combination, but utilizes progressive aggregation of bands, such that it is assumed that if a band 'n' has changed, then band 'n-1' is also highly likely to have changed. Less significant bands are always transmitted when a more significant band is transmitted. This means that the actual band combinations that need to be transmitted are fewer, and the prefix code is kept short. Typically the prefix code might be a two-bit code. However where band frequencies make it advantageous, it is possible to have an alternative prefix code such as '0','10','110','111', where the shortest code is assigned to the most frequent band combination. In an extreme case, a zero-width band could be encoded with a single prefix bit, allowing a CR of 12.0 to be achieved, though this is unlikely to be observed happening over sustained periods in realistic data streams.
In this case we chose to make the fourth band a zero-width band, and use a simple two-bit prefix code. Using the authors' proposed notation, this can be represented as a {4, 4, 4, 0} configuration. However, we could easily have chosen four three-bit bands with a configuration of {3, 3, 3, 3}, or indeed any combination of band widths that accumulate to 12 source-bits in total.
Having explored the compression format itself, the next point to note is what is actually being sub-divided. Taking the raw sample data is one option. Here, if any band is non-zero, then it is treated as requiring transmission. However, it is known that this approach is overly sensitive to signal level drift, because low frequency and near-dc signal components in the source signal create a bias toward sample values with more significant bits being set persistently. To overcome this problem, a common solution is to apply DPCM pre-processing, and thus derive a normalized signal consisting of differences between samples, rather than absolute values. Accumulated differences at the receiver allow the original signal content to be restored. It might be noted, however, that this introduces more susceptibility to bit-errors, an area that is worthy of more investigation, but outside the scope of this paper. However, an equally effective, but often overlooked difference method, is to use an XOR operation to determine locally significant changes. This has the advantage of a significantly simpler circuit design and no cascading of arithmetic stages, and is therefore potentially much faster. Consider that an arithmetic-differential DPCM circuit requires one full-subtractor circuit per bit, each consisting of two XOR-gates, two AND-gates, two inverters, and one OR-gate. Meanwhile, the XOR differential method requires only one XOR gate per bit, a very substantial reduction in gate cost and area when a simple compression system is being designed (less so if the system is of greater complexity). This is an excellent optimization, since the encoder only needs to know that one or more bits have changed in each sample word band, rather than their numeric differences. With either method, the final step is to OR together all of the changed bits within each band to create a band indicator signal to determine if that band has any active changes. Such a scheme is presented in Fig. 3, where the same {4, 4, 4, 0} scheme is implemented, using a serial-in-serial-out (SISO) arrangement.
The algorithm is described as follows:
• Let S_0 represent the previous sample, and S_1 represent the current sample.
• Let X_1, Y_1 and Z_1 represent three n-bit sub-bands of S_1, and X_0, Y_0 and Z_0 represent three correspondingly sized sub-bands of S_0.
• Let a, b, and c represent the True-False or 0-1 result of detecting a difference between corresponding bands, such that a compares X_1 and X_0, b compares Y_1 and Y_0, and c compares Z_1 and Z_0.
• The prefix is transmitted along with no data, band Z_1, bands Y_1, Z_1, or bands X_1, Y_1, Z_1, according to the first, second, third, or fourth prefix code being selected.
• Untransmitted bands can be optionally set/cleared, depending upon circuit design needs. For example, in a serial implementation one can simply not shift out the unused bits.
• Once the comparison process for a, b, and c is completed, the current sample S_1 may be used to overwrite the previous sample S_0, and this becomes the new previous sample ready for the next encoding cycle.
Circuit functionality is as follows: After the input word is clocked into the input register, it is fed through the XOR array to generate band indicators. Then WE_2 is enabled, capturing the output word (bottom left of diagram), which is formed from the active bands, and the relevant header. Inactive bands are forced to logic '1' in this implementation, as this automatically pads the serial output line high after the valid encoded bits have been shifted out. Finally, the circuit retains the current input word, by loading it into the relevant register (top right of diagram) when WE_1 is enabled, and this then becomes the 'previous' word for the next compression cycle.
The output word is prepended by a header prefix code, in this case a two-bit code H[1-0], generated from a very simple logic function, almost identical to a standard 4:2 priority encoder. The timing circuit is based upon a simple binary counter. Fig. 4 illustrates the coding-decoding mechanism with a simple example, to demonstrate that this coding method works correctly. As one can observe, the received data is identical to the original data. The XOR column represents the bit changes between the current sample and the preceding one. The first sample in the sequence (typically after reset) is always treated as if all bits have changed, since a full sample word is required at the start to establish the reference point. If this reference point is refreshed at regular intervals, as a frame start value, it becomes possible to detect transmission errors within frames of chosen length, since progressive reconstruction of erroneous data will eventually be found to disagree with the full-word value at these points. Indeed, full-range values are often incidentally transmitted within a frame too, thus allowing for potential earlier detection. This is a topic worthy of further investigation, but outside the scope of this paper. In the given example, 36 bits are used to convey four 12-bit samples, giving a CR of 1.33 (48/36).
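The coding and decoding steps described above can be modelled in software as a behavioural sketch. The code below assumes the {4, 4, 4, 0} configuration in XOR mode, with the number of transmitted bands standing in for the two-bit prefix (0 = no data, 1 = Z, 2 = Y and Z, 3 = X, Y and Z). The actual prefix bit patterns of the fabricated circuit are not modelled, and the sample values are hypothetical rather than those of Fig. 4.

```python
BAND_BITS = 4      # width of each band in the {4, 4, 4, 0} configuration
PREFIX_BITS = 2    # fixed two-bit prefix selects one of four band combinations


def encode(samples):
    """Return a list of (n_bands, payload) pairs for a stream of 12-bit samples."""
    codes = []
    prev = None
    for s in samples:
        if prev is None:
            n_bands = 3  # first sample: treated as if all bits have changed
        else:
            changed = s ^ prev  # XOR marks every bit position that differs
            # Highest changed band decides how many bands are sent
            # (less significant bands are always sent with it).
            n_bands = (changed.bit_length() + BAND_BITS - 1) // BAND_BITS
        mask = (1 << (n_bands * BAND_BITS)) - 1
        codes.append((n_bands, s & mask))  # payload is taken from the current sample
        prev = s
    return codes


def decode(codes):
    """Rebuild the samples by overwriting the low bands of the previous sample."""
    samples = []
    prev = 0
    for n_bands, payload in codes:
        mask = (1 << (n_bands * BAND_BITS)) - 1
        prev = (prev & ~mask) | payload
        samples.append(prev)
    return samples


def total_bits(codes):
    """Encoded stream length in bits, including prefixes."""
    return sum(PREFIX_BITS + n * BAND_BITS for n, _ in codes)


if __name__ == "__main__":
    data = [0x123, 0x125, 0x125, 0x1F5, 0xA00]  # hypothetical 12-bit samples
    codes = encode(data)
    assert decode(codes) == data  # lossless round trip
    print(total_bits(codes), "bits instead of", 12 * len(data))
```

Because the payload carries the current band values rather than the XOR result, the decoder needs no arithmetic at all: it only overwrites the low-order bands of its stored previous word.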
D. L2SB Encoder Implementation and Verification
Now that the basic functionality of the Log2 Sub-Band Encoder has been established, and described, we turn attention to its implementation. The authors were fortunate to have the opportunity to include an L2SB encoder test circuit on a more complex IC fabrication project, utilizing spare pins and chip area. The low pin-count available meant that a serial-in-serial-out (SISO) implementation was chosen for the fabrication and verification. This design had the advantage of being pipelined, such that new data words can be shifted inward to the compression stage, whilst the previous compressed bit pattern is shifted outward.
The circuit was modelled in VHDL, and then synthesized to a gate-level description. Cadence design tools were used throughout this phase of development. The 65 nm Faraday standard cell library was used for HDL synthesis, and the design was then targeted at a 65 nm CMOS fabrication technology: UMC 65N Logic/Mixed-Mode/RF CMOS process, with core and I/O voltages of 1.2 V and 1.8 V respectively. Automatic clock gating was enabled at the synthesis stage, and this resulted in approximately two-thirds of flip-flops being gated, with 70% reduction in dynamic transitions. Fig. 5 shows a trial layout for the synthesized circuit, where the dimensions of the module are 13 µm by 14 µm, giving a maximum circuit area of 182 µm², if one ignores the unused space at top right, or around 162 µm² otherwise. For this particular tapeout, the layout relied upon standard cell abutment, but if a full custom layout methodology were used, with transistor folding and other layout optimizations, then this area cost could no doubt be further reduced. Maximum frequency was 714 MHz, giving a minimum input-output latency of 22 ns for a raw sample to be compressed to an encoded state.
A PIPO implementation was also designed and tested at the layout level (but not fabricated) in order to give area cost, power data, etc., with identical compression behaviour. This design had an area cost of 5% less than the SISO model (approximately 155 µm²), and used 38 flip-flops compared to 43 for SISO as a result of eliminating the cycle-state counter needed for the SISO model. This design is very similar to the design shown for the SISO model, but the input sample word and output sample word are written to and read from (respectively) in a bit-parallel fashion, rather than bit-serial as is the case for SISO.
The L2SB circuit was incorporated into the larger project chip layout, and fabricated via the European Europractice service to academia, via the IMEC centre. The chips were then packaged with an 84-pin CFPGA package, with some also supplied as glass-lid samples for display (one of which is shown earlier in Fig. 1 ).
Validation of the fabricated L2SB encoder was performed with bench-test equipment, comprising a Zynq FPGA board to generate test signals, and measurements taken on a LeCroy WaveSurfer-440 digital oscilloscope. Oscilloscope screen-shots, from the operational bench tests, are shown in Fig. 6. Screen colors are inverted for clarity in print. This shows three test cases, covering the single, double and triple band encoding cases. This implementation includes a start-bit (logic-low) and end-bit (logic-high) feature, shown in blue in the figure, which allowed data items to be framed for testing. More comprehensive tests were performed using automated test-pattern stimulus. The L2SB encoder passed all validation tests, and the fabricated data compressor module was fully operational. One limitation of incorporating the SISO L2SB test circuit into a more complex system design was that taking isolated power measurements from the chip was not possible. In this paper we therefore use a data-driven post-synthesis power estimation methodology, as described in the following section. On the other hand, this does mean that SISO and PIPO can be compared on equal terms.
IV. L2SB PERFORMANCE: EVALUATION TEST CASES
In order to evaluate L2SB encoder performance, test cases that relate to real-world application scenarios were chosen, hardware implementations of suitable L2SB compressor circuits were implemented, and power measurements were obtained from synthesis tools using Value Change Dump (VCD) stimulus files generated from simulations of each compression test-case using actual data streams. Two L2SB models were tested, the SISO (Serial In Serial Out) implementation as fabricated on the test chip, and an additional PIPO (Parallel in Parallel Out) model, whereby whole data words are clocked in on each successive clock cycle, with compressed data clocked out one cycle later.
A. Initial Power Analysis
For initial power tests, the Bonn University EEG Epilepsy data-set was utilized [23]. Three EEG test data-sets from this collection are used (referred to by the originators as File-Sets O, F, and S). Results are given in Table II, where CR is tabulated alongside static and dynamic power estimates for both SISO and PIPO implementations, with maxima, minima, mean, and standard deviation to 95% limits. The scatter-graph of compression ratios is also given for the three data subsets within the whole data-set, in Fig. 7, where it can be seen that CR is noticeably banded according to the three test cases. It can also be seen that compression ratios are significantly higher in the seizure patients during non-seizure EEG monitoring cases. This may well relate to the reported differences in EEG power spectra components for patients under similar conditions [16], which could account for reduced inter-sample differences and thus the higher compression rates. It can be seen that CR is, on average, around 1.56 ± 16% at 95% limit, with CR as high as 2.05. This translates into an average data reduction (DR) of over 35%. However, this figure uses an equally weighted average. In practice, a patient may have seizures infrequently (one would hope that even a 100:1 ratio of seizure versus non-seizure data is pessimistic), or a system may record only the seizure events (with assistance of a detection algorithm), though data volume in this case is relatively small anyway. This means that there are actually multiple CR scenarios to consider, some of which are postulated in Table III.
Examining the power data from post-synthesis simulations, with actual data-set stimulus, static power (at a 1 kSa/sec continuous conversion mode) is found to be almost constant for both implementations, and significantly larger than dynamic power at this data-rate. Dynamic power is relatively small by comparison, and shows small variance of a few percent.
It can be concluded that at these sample compression rates, power is highly consistent across a fairly significant range of signal behaviour. At 1 kSa/sec data-rate, overall power consumption averages 234 nW for PIPO. For easier comparison, it is potentially more convenient to measure power per-bit for each compression event. This measure equates to 19.6 pJ/bit for PIPO, at this data-rate.
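The conversion from average power to energy per bit is simple arithmetic, sketched below. It assumes continuous conversion of 12-bit sample words, so that the energy per sample is the average power divided by the sample rate; the function name is hypothetical.

```python
def energy_per_bit_pj(power_w: float, sample_rate_hz: float,
                      bits_per_sample: int = 12) -> float:
    """Energy per sample bit, in pJ, for continuous back-to-back conversion."""
    energy_per_sample_j = power_w / sample_rate_hz  # joules spent per sample period
    return energy_per_sample_j / bits_per_sample * 1e12


if __name__ == "__main__":
    # 234 nW average power at 1 kSa/sec, as quoted for the PIPO implementation.
    print(f"{energy_per_bit_pj(234e-9, 1e3):.1f} pJ/bit")
```

With the rounded 234 nW figure this gives about 19.5 pJ/bit, close to the 19.6 pJ/bit quoted; the small difference presumably reflects rounding of the quoted mean power.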
B. Further Optimization
At low device clock-rates, such that sample compression is a continuous back-to-back operation for successive samples, the device remains powered up 100% of the time. Alternatively, power-gating, with an on/off duty cycle, allows conversions to happen at higher clock-rates, with intervening power-down 'sleep' phases. Although we have not implemented a power-gated L2SB design in silicon, our initial evaluation suggests that static power could be reduced by 50% to 75% using simple on/off power gating, since almost all of the logic has non-persistent data content. The principle here is a partial power-gating strategy, to retain the previous sample in the relevant data-latches without any power-gating, whilst applying power-gating to most of the remaining circuitry, since only the previous sample represents persistent state information between successive sample encodings.
Evaluations of power consumption per bit are given in Table IV. With a 100:1 duty cycle, a moderate 50% reduction in leakage current by power-gating, and a 10 kSa/sec data-rate (heading toward the upper end of usual per-channel bio-physiological signal sampling rates of tens of kHz), it is estimated that the PIPO L2SB compression circuit would consume only 1.22 pJ per bit, and about 1.7 pJ per bit for this SISO CMOS implementation, at a 1.2 V core voltage. It is also noted that at higher frequencies, PIPO and SISO move closer together for consumption per bit, regardless of power-gating usage. Above 1 MSa/sec, power consumption reaches as low as 250 femtoJoules per bit for the PIPO implementation, and this is roughly constant beyond this point, up to maximum operating data-rates. This diminishing benefit of power gating, at high frequencies, is due to static power becoming negligible with respect to increasingly dominant dynamic power consumption.
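A simple first-order model illustrates how duty-cycled power gating lowers the energy per bit. This is a sketch under stated assumptions, not the authors' estimation method: leakage is reduced by a fixed fraction during idle time only, and dynamic energy per bit is independent of sample rate. The static power (about 231 nW) and dynamic energy (about 0.25 pJ/bit) are inferred from the PIPO figures quoted in the text and are not entries from Table IV.

```python
def gated_energy_per_bit_pj(static_w: float, dynamic_pj_per_bit: float,
                            sample_rate_hz: float, active_fraction: float,
                            leakage_reduction: float,
                            bits_per_sample: int = 12) -> float:
    """Energy per bit (pJ) with partial power gating applied during idle time."""
    period_s = 1.0 / sample_rate_hz
    idle_fraction = 1.0 - active_fraction
    # Full leakage while active, reduced leakage while gated.
    effective_static_w = static_w * (
        active_fraction + idle_fraction * (1.0 - leakage_reduction))
    static_pj_per_bit = effective_static_w * period_s / bits_per_sample * 1e12
    return static_pj_per_bit + dynamic_pj_per_bit


if __name__ == "__main__":
    # Assumed inputs: 231 nW static, 0.25 pJ/bit dynamic, 10 kSa/sec,
    # 100:1 duty cycle (1% active), 50% leakage reduction when gated.
    e = gated_energy_per_bit_pj(231e-9, 0.25, 10e3, 0.01, 0.5)
    print(f"{e:.2f} pJ/bit")
```

Under these assumptions the model lands near the 1.22 pJ/bit estimate quoted for PIPO, and it also shows why the benefit of gating fades at high data-rates: the static term shrinks with the sample period while the dynamic term stays fixed.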
It is clear that the SISO implementation is outperformed by PIPO L2SB in terms of power consumption, particularly for higher data-rates. For comparison, Fig. 8 shows the picoJoule-per-bit energy consumption of the four cases for selected sample rates up to 20 kSa/sec, showing the comparative performances with and without power-gating.
C. Broader Compression Evaluations
Evaluation of direct power cost is one factor of interest for L2SB encoding. However, to gauge the overall benefit of L2SB encoding in a complete system, it is also necessary to evaluate its compression performance over a broader range of data-sets. Furthermore, whilst one particular configuration has been chosen for the implementation, there are many possible configurations of L2SB encoding. For an n-bit sample word, and an L2SB band configuration comprising bands 'a', 'b', 'c', and 'd', every possible set of values of a, b, c and d that sum to a total of 'n' is potentially a valid permutation. So for example, a configuration {1, 2, 4, 5} has a total of 12 bits, but each band has its own unique width, and each permutation will deliver a different compression ratio to the default {4, 4, 4, 0} configuration. If some validity constraints are applied, for example, only allowing the least significant band to have the option of zero width, then a 12-bit sample word has around 300 valid permutations, out of over 1300 candidates.
Taking these valid permutations, and applying each of them in turn to a data-set, allows us to determine the best permutation(s) for a given data behaviour. In effect, we can tune the algorithm to suit the dynamics of the particular kind of data being compressed. For example, applying L2SB encoding, in XOR mode, to the whole Bonn EEG data-set (file groups O, F, and S), the compression ratio for the {4, 4, 4, 0} configuration is found to be 1.31 (a saving of 24% of total sample data). However, when all valid band configurations are considered, it is apparent that there were better choices that could have been made. This can be seen in Fig. 9, which plots all 299 permutations in terms of CR achieved in each case. Analysis of the individual results identifies configuration {−4, 3, 5} as the best choice, with a CR ...
Further work may allow this band-tuning concept to be derived from the distribution of dynamic changes in the data (such as a Gaussian curve). This is an interesting idea, but providing a mathematical basis and proof is beyond the scope of this paper.
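The band-tuning search can be sketched as an exhaustive evaluation of candidate configurations against a sample stream. The sketch assumes XOR mode, a fixed two-bit prefix, and a configuration written most significant band first, so that the four codes transmit the lowest d, c+d, b+c+d, or all 12 bits. The only validity rule applied here is that bands a, b and c are at least one bit wide; the authors' exact rules are not stated in full, so the candidate count produced here need not match the figure quoted above.

```python
from itertools import product

WORD_BITS = 12
PREFIX_BITS = 2


def candidate_configs(word_bits=WORD_BITS):
    """Yield (a, b, c, d) band widths, most significant band first, with d >= 0."""
    for a, b, c in product(range(1, word_bits + 1), repeat=3):
        d = word_bits - (a + b + c)
        if d >= 0:
            yield (a, b, c, d)


def encoded_bits(samples, config):
    """Total encoded bits for a sample stream under one band configuration."""
    a, b, c, d = config
    # Cumulative payload sizes for the four prefix codes, smallest first.
    payload_sizes = [d, c + d, b + c + d, a + b + c + d]
    total = 0
    prev = None
    for s in samples:
        # Number of low-order bits needed to cover the highest changed bit.
        changed = WORD_BITS if prev is None else (s ^ prev).bit_length()
        payload = next(p for p in payload_sizes if p >= changed)
        total += PREFIX_BITS + payload
        prev = s
    return total


def best_config(samples):
    """Configuration giving the fewest encoded bits for this sample stream."""
    return min(candidate_configs(), key=lambda cfg: encoded_bits(samples, cfg))


if __name__ == "__main__":
    data = [0x123, 0x125, 0x125, 0x1F5, 0xA00]  # hypothetical 12-bit samples
    cfg = best_config(data)
    print(cfg, encoded_bits(data, cfg), "vs", encoded_bits(data, (4, 4, 4, 0)))
```

An exhaustive search is practical because the candidate set is only a few hundred configurations, and because the band configuration is fixed for a recording session the search runs offline rather than in hardware.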
A further observation may be made regarding band choices. In a typical system, we envisage the band configuration being static for a given application, or recording session, and either fixed in software or hardware or reset at power-on initialisation. No information about band settings needs to be transmitted. A more complex solution might employ dynamically changing bands, and these might be indicated at the start of each new frame; however, we do not investigate this idea here.
For the purposes of an initial 'reality check' comparison, a Huffman code-book, optimistically trained on 100% of the test data, and using DPCM precoding, delivers a CR of 1.98, a 49% reduction in data bits. However, there are significantly higher costs associated with this compression approach, as will be highlighted later, whilst it is also true that compression rates may be lower if the Huffman code-book is trained using a reference data set and then used with new 'unseen' signal input. There is now a potential choice: save about 50% of data storage and/or transmission with high hardware cost, or save 30% with a much lower hardware footprint.
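The relationship between a compression ratio and the percentage of data saved, used for the figures quoted above, can be checked with a short sketch. The function name `fraction_saved` is purely illustrative, and the only assumption is the usual definition of CR as original bits divided by compressed bits.

```python
def fraction_saved(cr: float) -> float:
    """Fraction of the original bits removed at compression ratio `cr`.

    Assumes CR = original bits / compressed bits, so the compressed
    stream occupies 1/CR of the original volume.
    """
    if cr <= 0:
        raise ValueError("compression ratio must be positive")
    return 1.0 - 1.0 / cr


# CR values quoted in the text for the Bonn EEG data-set.
for label, cr in [("XOR-L2SB {4, 4, 4, 0}", 1.31), ("DPCM-Huffman", 1.98)]:
    print(f"{label}: CR = {cr:.2f}, saving = {fraction_saved(cr):.1%}")
```

A CR of 1.31 therefore corresponds to a saving of just under 24%, and a CR of 1.98 to a saving of just under 50%.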
Proceeding with more comprehensive tests, Table V presents a number of data-sets, evaluated for achievable compression ratios. This tabulates the compression ratio achieved by the best L2SB configuration in each case, alongside that of a Huffman code-book. In all cases the Huffman code-book is trained on the whole data-set, and code-books are fully populated with all possible input symbols. This latter point is essential: it is possible to create a code-book that partially populates a 12-bit symbol table (for example only 800 symbols out of a possible 4096 may be used by the training set), but this is not a valid representation for a real system. Otherwise, what would the hardware do when presented with a previously unseen symbol? A valid Huffman code-book must assign a code-word to every possible lookup value if it is used in a real-time compression application. The data-sets have the following details:
• Bonn University EEG Database, 12-bit 173 Hz, 300 traces, no rescaling required.
• MIT CHB EEG scalp electrode data, 256 Hz [25], [27].
• MIT BIH ECG arrhythmia database, 360 Hz, 11 bit, rescaled to 12 bit full scale [26], [27].
• York Instruments nanoTesla MEG data, 24-bit 678 Hz, rescaled to occupy 2/3rds of 12 bit scale.
For the MIT BIH Arrhythmia ECG data-set, 48 separate data files were analyzed. Each data file is relatively short and contains differing aspects of ECG observation, therefore variations are likely. In this case, the standard deviation at 95% limits is 13%, centered around an average CR of 1.57.
For the MIT-CHB Scalp-electrode database, due to the size of the complete database, fourteen files were analyzed, each containing twenty-eight channels of data (40 Mbyte each file), four of which contained seizures within the data. Although this data is stated as being acquired as 16-bit, all of the files used were able to map onto a 12-bit range such that, on average, the sample range covers 90% or more of the full signal range without rescaling. L2SB compression performs well on this data-set, achieving results quite close to the Huffman compression model, where CR is found to be 2.02 for Huffman vs 1.94 for the optimal L2SB band configuration. This is much better than for the Bonn EEG data-set, perhaps due to the higher sample rate.
For the MEG data-set, provided by York Instruments Ltd (UK), the data was analyzed as a whole, yielding a CR of 1.36 for the default choice of configuration {4, 4, 4, 0}. However, the best-case configuration, {−, 5, 2, 5}, with three bands and prefix codes '0', '10' and '11', achieved a CR of 1.54, highlighting the importance of identifying the best band configuration for a given data behaviour. Note that these permutation analyses included every possible variation of 1, 2, 3 and 4 band encoding, with all possible prefix options. Further comparisons, with other work in the field with the same or broadly comparable data sets, are given in Table VI.
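The requirement that the reference Huffman code-book be fully populated can be illustrated with a minimal sketch. It is not the code-book construction used for Table V: the add-one weighting of unseen symbols, the synthetic training data, and the helper names `huffman_code_lengths` and `fully_populated_weights` are all assumptions made for illustration.

```python
import heapq
from collections import Counter
from fractions import Fraction

SYMBOL_BITS = 12                    # sample word width used in the paper
NUM_SYMBOLS = 1 << SYMBOL_BITS      # 4096 possible input symbols


def huffman_code_lengths(weights):
    """Return {symbol: code-word length} for a dict {symbol: positive weight}."""
    # Heap entries: (subtree weight, unique tie-breaker, symbols in the subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    lengths = {s: 0 for s in weights}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        merged = syms1 + syms2
        for s in merged:            # each merge adds one bit to every member
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return lengths


def fully_populated_weights(training_symbols):
    """Assumed add-one weighting: symbols never seen in training get weight 1."""
    seen = Counter(training_symbols)
    return {s: seen.get(s, 0) + 1 for s in range(NUM_SYMBOLS)}


# Synthetic training data (illustrative only) touching 50 of the 4096 symbols.
training = [2048 + (7 * i) % 50 for i in range(1000)]
lengths = huffman_code_lengths(fully_populated_weights(training))

# Kraft sum of a complete prefix code is exactly 1.
kraft_sum = sum(Fraction(1, 2 ** n) for n in lengths.values())
print(len(lengths), "code-words, longest =", max(lengths.values()), "bits")
```

Every one of the 4096 possible inputs receives a code-word, so a previously unseen sample can always be encoded, at the cost of longer code-words for rare symbols and a wider code-book memory.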
D. Performance Comparisons
To assess the usefulness of any compression method, and implementation, there are three major concerns. The first relates to the compression ratio achieved, since this determines how much data transmission or storage effort is saved. The second concern is the power consumed whilst achieving the reduction in data needing to be managed. The third consideration is circuit area, since large circuits may be unwelcome additions to an SOC or FPGA design. A good starting point, for the basis of comparative performance analysis, is Huffman Encoding, since this is very well defined, and widely used by researchers as a benchmark for their own compression algorithms. It is effectively a 'standard measurement' and allows those novel techniques to be compared with those proposed here, using Huffman performance as a common reference point.
Techniques that employ code-books, such as variations of Huffman encoding, can potentially imply rather large circuit area cost, due to the need for look-up ROMs or RAMs with thousands of locations and tens of bits per location. However, one advantage of a code-book is that it can be accessed multiple times per sample epoch to permit one code-book to generate compressed encodings for multiple simultaneously acquired channels. Thus, in the case of ECG, where there are only a handful of channels, a code-book may be quite costly in terms of circuit area, whilst in a 300-channel MEG system, it might potentially be more desirable. In contrast, for the simple implementations of L2SB discussed in this paper, a multi-channel system would typically require a separate circuit for each channel, albeit each being an instance of a very small circuit.
1) A Basis for Huffman Power Estimation:
In order to evaluate Huffman compression circuitry, suitable reported 65 nm CMOS memory costs have been considered [28]-[30], from which it was established that 200 pW for static power per bit and 3.3 µm² area cost per bit-cell is a reasonable assumption. Huffman code-book area and static power cost can thus easily be estimated. For example, a 4 Kword by 39 bit code-book has a total of 159,744 bits, total leakage power of 31.9 µW, and an area cost of 527,155 µm². This area cost is highly significant even for a low-nm silicon CMOS design, but even more so for emerging technologies such as thin-film organic semiconductor sensor circuits. An important point to note here is that the Huffman circuit methodology is a 'plain' Huffman baseline approach. Whilst techniques exist to truncate and augment Huffman code books to reduce hardware cost [33], these are often intimately interlinked with particular predictor algorithms, and are therefore not easy to generalise. There are also many potential ways to condense a Huffman code book, and it therefore makes sense here to use a single well-defined case as a baseline.
Dynamic read energy per code-book symbol look-up can also be estimated. Based upon work in the field, a typical value was found to be 5 pJ/bit per read, implying 195 pJ for a 39-bit read. Finally, therefore, at 1 kSa/sec, total power is 32.15 µW, energy per sample is 32.15 nJ, and energy per bit is 2.7 nJ. Huffman coding is more power efficient per-sample at higher data rates, which may include interleaving multiple channels. For example, supporting 10 channels at 10 kSa/sec gives a code-book throughput of 100 kSa/sec and power consumption of 428 pJ per sample bit.
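The single-channel estimate above can be reproduced with a short sketch. The constants are the per-bit figures quoted in the text; treating the look-up cost as an energy per bit read (so that it scales with the look-up rate) is an assumption, and the function name `codebook_cost` is illustrative.

```python
WORDS = 4096                  # 4 Kword code-book (one entry per 12-bit symbol)
WORD_BITS = 39                # code-book word width from the example above
LEAK_W_PER_BIT = 200e-12      # static power per bit-cell, in watts
AREA_UM2_PER_BIT = 3.3        # area per bit-cell, in square micrometres
READ_J_PER_BIT = 5e-12        # assumption: look-up cost as energy per bit read
SAMPLE_BITS = 12              # bits per input sample


def codebook_cost(sample_rate_hz: float) -> dict:
    """Estimate area, power and energy for one code-book at a given look-up rate."""
    bits = WORDS * WORD_BITS
    leakage_w = bits * LEAK_W_PER_BIT
    read_j = WORD_BITS * READ_J_PER_BIT            # energy per look-up
    total_w = leakage_w + read_j * sample_rate_hz
    energy_per_sample_j = total_w / sample_rate_hz
    return {
        "bits": bits,
        "area_um2": bits * AREA_UM2_PER_BIT,
        "leakage_w": leakage_w,
        "total_w": total_w,
        "energy_per_sample_j": energy_per_sample_j,
        "energy_per_bit_j": energy_per_sample_j / SAMPLE_BITS,
    }


cost = codebook_cost(1_000.0)   # 1 kSa/sec, single channel
print(f"{cost['bits']} bits, {cost['area_um2']:.0f} um^2, "
      f"{cost['total_w'] * 1e6:.2f} uW, "
      f"{cost['energy_per_bit_j'] * 1e9:.2f} nJ per sample bit")
```

At 1 kSa/sec the leakage term dominates, which is why the per-bit energy falls as the look-up rate rises.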
2) Huffman Power and Area Projections:
Taking relevant scenarios implemented using L2SB and the described Huffman code-books, a comparison is presented in Table VII, where data is provided for power and area cost for Huffman and L2SB compression circuits operating at a 1 kSa/sec data throughput. Power and area calculations are given for systems with 1, 4, and 8 channels. For L2SB compression, the power and area per channel is constant, whilst total power and total area increase in proportion to channel count. For Huffman compression, total power rises as a function of channel count, but power cost per channel reduces, since a single circuit with associated leakage is serving 1, 4, 8, or more channels. Note that the Huffman code-book word width is calculated as follows: suppose a 12-bit sample word is compressed by Huffman encoding and the maximum resulting code-word size in the code-book is 18 bits. Now, although the code-word 'w' occupies 18 bits, an associated code-word size indicator 's' is also needed to inform the next stage of the system as to the code-word size being provided in the current compression cycle. Since 's' must be a binary number, it must be 5 bits in this case, since 2^5 = 32, which is greater than or equal to 18, whilst 2^4 = 16, which is too small. Therefore, the total code-book word width is the binary lengths of 'w' and 's' combined (in this example case 23 bits).
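The code-book word-width rule described above can be written as a small sketch. The function names are illustrative, and the rule implemented is exactly the one stated in the text: 's' is the smallest width for which 2^s is greater than or equal to the maximum code-word length.

```python
def size_field_bits(max_codeword_bits: int) -> int:
    """Smallest s such that 2**s >= max_codeword_bits (rule stated in the text)."""
    if max_codeword_bits < 1:
        raise ValueError("maximum code-word length must be at least 1 bit")
    return (max_codeword_bits - 1).bit_length()


def codebook_word_bits(max_codeword_bits: int) -> int:
    """Total code-book word width: code-word 'w' plus size indicator 's'."""
    return max_codeword_bits + size_field_bits(max_codeword_bits)


w = 18                                   # worked example from the text
print(f"w = {w}, s = {size_field_bits(w)}, total = {codebook_word_bits(w)} bits")
print(f"4096 entries -> {4096 * codebook_word_bits(w)} bits of code-book memory")
```

The total memory requirement then follows by multiplying the word width by the 4096 entries needed for a 12-bit sample word.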
Examining again Table VII, it can be seen, then, that a single channel Huffman compression circuit might consume around 32.2 nJ per sample (2.69 nJ per bit) at 1 kSa/sec. A PIPO L2SB encoder using a 100:1 duty cycle to support power gating, and assuming a 50% static power saving in sleep mode, would consume around 120 pJ per sample (10 pJ/bit), representing an approximate 270:1 power advantage for L2SB. However, the L2SB circuit (in its simplest embodiment) must be duplicated in area for each additional channel concurrently supported, whereas the single Huffman code-book can be interleaved between channels, effectively sharing the static power burden. It is seen that for eight channels, Huffman consumes only 350 pJ/bit, a much improved figure, but still a 35:1 ratio of power consumption per channel compared to L2SB. If a wide range of channel counts is considered, as given in Table VIII, a single-stage Huffman-DPCM model tends eventually toward a figure of around 16 pJ/bit for very high channel counts (of the order of 1000's). However, few bio-physiological sensing applications have such demands.
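The way a shared code-book amortizes its leakage over interleaved channels can be sketched as follows. This is a simplified model under stated assumptions, not the calculation behind Tables VII and VIII: it uses the per-bit memory figures quoted earlier, treats the look-up cost as an energy, and takes the L2SB figure as a constant 10 pJ per sample bit at 1 kSa/sec.

```python
LEAKAGE_W = 4096 * 39 * 200e-12   # static power of the 4 Kword x 39-bit code-book
READ_J = 39 * 5e-12               # assumed energy per 39-bit look-up
SAMPLE_BITS = 12                  # bits per input sample
L2SB_J_PER_BIT = 10e-12           # power-gated L2SB figure quoted in the text


def huffman_j_per_bit(channels: int, rate_hz: float = 1_000.0) -> float:
    """Energy per sample bit when one code-book is interleaved over `channels`."""
    lookups_per_s = channels * rate_hz
    energy_per_lookup = LEAKAGE_W / lookups_per_s + READ_J
    return energy_per_lookup / SAMPLE_BITS


for n in (1, 4, 8, 1_000, 10_000):
    e = huffman_j_per_bit(n)
    print(f"{n:>6} channels: {e * 1e12:8.1f} pJ/bit, "
          f"{e / L2SB_J_PER_BIT:6.1f}x the L2SB figure")
```

As the channel count grows, the leakage share per look-up vanishes and the result approaches the read energy alone, 195 pJ / 12 ≈ 16 pJ per sample bit, which still exceeds the L2SB figure.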
For circuit area overhead, L2SB and Huffman are also compared in Table VII. Here it is possible to see that for a single channel, Huffman is vastly larger than L2SB in terms of circuit area, averaging 529.14 × 10^3 µm² for one-channel DPCM-Huffman, versus 155 µm² for L2SB. For multiple channels, L2SB area cost is a linear product of channel count, whereas it is a fixed cost for Huffman. Consequently, area cost per channel is always 155 µm² for L2SB, but reduces for Huffman as channel counts increase.
3) Figure of Merit (FOM) Measurements:
The power and area data, quoted in the preceding section, is a raw power and area cost estimate, and does not take into account the differing compression ratios delivered by each case. Yet, compression ratio is an important element. Therefore, a further perspective on relative performance can be obtained by taking the calculated power and area data for the L2SB/Huffman comparisons, and utilizing these to generate figures of merit. There are two figures of merit used here, as introduced earlier: Data Reduction per pJ (DR/pJ), and Data Reduction per µm² (DR/µm²).
This FOM data is also calculated in Table VII, and it can be seen that the relative efficiency of L2SB and Huffman algorithms in achieving a given compression goal can be evaluated using these measures. In power terms, Huffman is almost 27 times less power efficient per bit of data reduction achieved through compression than L2SB, for an 8-channel system, and 192 times less efficient for a single-channel mode. For area efficiency, L2SB is several thousand times more efficient in delivering data bit reduction via compression for a one-channel system, and still over 300 times better for eight channels, compared to Huffman. A further area comparison (albeit less precise) can be made with other work in the field, provided that area-costs are scaled to the same process node (65 nm). Such a comparison is given in Table IX, where a variety of loss-less compression schemes are reported. For these cases, it can be observed that L2SB delivers considerably more data compression per µm² of silicon than the cited cases.
It is important to remember, here, that these figures relate to data compression as a function of power or area, not absolute compression ratio. L2SB is not compressing data volume 30 times more than Huffman; indeed, Huffman delivers better absolute compression than L2SB. What the data shows is that L2SB achieves 30 times more compression per pico-Joule. Huffman consumes disproportionately more power and area to achieve its superiority, whilst L2SB is moderately inferior in compression ratio, but with much lower resource cost. At this point, a question must arise: is more compression at much higher cost better than less compression at extremely low cost? This can be answered when the goal of compression is finally dropped into place: we wish to reduce one or more system overheads, primarily memory storage requirements, memory storage power, and data transmission power. This is examined in the next section.
V. SYSTEM LEVEL PERFORMANCE TRADE-OFFS
At the system level, there are two major areas of concern in the context of this study. The first concern relates to systems in which data is stored in non-volatile on-board memory, for later access. This is typically achieved via on-board flash memory components, often of the order of several gigabits capacity. Examples of this use-case include wire-free miniature data-recorders [17]-[19]. Such systems are designed to operate in a wearable ambulatory mode of operation, with neither a wired (umbilical) connection to a master unit, nor a wireless (radio tethered) connection to a base-station. Here, the major power costs associated with data are the storage costs in terms of (a) storage capacity, and (b) data write power.
A. Compression-Storage Trade-Offs
Considering non-volatile memory storage first, it is self-evident that reducing data by 30% via compression will increase the effective storage capacity of the system correspondingly (by a factor of about 1.43). There are no major insights here, other than to say a higher compression rate is better if storage capacity is the only concern. However, for power consumption, the picture is somewhat different. Typical flash memory chips have a write power consumption of around 3-4 pW per bit [20], [21], though write operations are typically performed in blocks after accumulating enough bytes to fill a write page. It is often overlooked that bytes must first be written to the flash page buffer, before the actual write is completed internally, and this consumes more power. Therefore, whilst the typical page write time, averaged on a per-byte basis, may be of the order of 120-150 ns per byte, the data-transfer to internal buffer may require an additional 20-40 ns of time spent in active power mode per byte written. On this basis, a figure of 5 pW per bit appears to be a reasonable benchmark for comparing flash write power against data compression power, on a bit for bit basis.
Taking the figure of 5 pW per bit for flash-memory write-power, and employing this in a trade-off between compression-power consumed, versus flash-memory write-power saved (by reduced volume of writes), the analysis presented in Table X is derived. This analysis utilizes the same data-sets as those used for Tables VII and VIII. For both Huffman code-book compression, and L2SB, it is clear that the compression ratios achieved yield no direct net benefit in flash storage power cost. However, for a 16-channel system L2SB has relatively small power cost to achieve worthwhile storage compression, whilst Huffman has relatively high power penalty. The message here is that L2SB can deliver useful data storage capacity compression outcomes at minimal power cost in some cases. However, neither of the algorithms examined can reduce flash memory data-write power enough to compensate for the additional cost of the associated data compression.
TABLE XI. SYSTEM POWER TRADEOFF - TRANSMISSION POWER VERSUS COMPRESSION
B. Compression-Transmission Trade-Offs
A brief survey of data transmission approaches at data-rates up to the order of 1 Mb/sec, as used in relevant work in the field, yields the data presented in Table XI, which shows a wide range of power consumptions.
Power tradeoffs for compression versus transmission power can be estimated on the following simplified basis: Given a particular sample rate, such as 10 kSa/sec, with perhaps 12 bits per sample, then the minimum data-rate required to support the data transmission without compression must be 120 kbits/sec in this case. Assuming an average of 111 nJ/bit, then the data transmission cost must be at least 13,320,000 nJ per second (13.32 mW continuous). If compression reduces the bit rate by 30%, then the power consumption attributable to transmission would reduce by 30% also. In this case, 3,996,000 nJ per second would be saved. If the compression cost was, for example, 1.3 pJ per bit, then the total compression cost is found to be 156,000 pJ (156 nJ) per second, and the net saving is approximately 3,995,844 nJ per second, or about 4.0 mW continuous operating power reduction.
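The simplified trade-off can be evaluated with every quantity expressed in joules and watts, which avoids mixing nJ and pJ. The sketch below uses the example values from the text; the function name and the returned field names are illustrative assumptions.

```python
def transmission_tradeoff(sample_rate_hz: float,
                          bits_per_sample: int,
                          tx_j_per_bit: float,
                          fraction_saved: float,
                          comp_j_per_bit: float) -> dict:
    """Net continuous power saved by compressing before transmission (watts)."""
    raw_bit_rate = sample_rate_hz * bits_per_sample   # bits/s without compression
    tx_power_w = raw_bit_rate * tx_j_per_bit          # uncompressed transmit power
    tx_saved_w = tx_power_w * fraction_saved          # power no longer transmitted
    comp_power_w = raw_bit_rate * comp_j_per_bit      # cost of compressing every bit
    return {
        "raw_bit_rate": raw_bit_rate,
        "tx_power_w": tx_power_w,
        "tx_saved_w": tx_saved_w,
        "comp_power_w": comp_power_w,
        "net_saved_w": tx_saved_w - comp_power_w,
    }


# Example from the text: 10 kSa/sec, 12 bits, 111 nJ/bit link, 30% reduction,
# 1.3 pJ/bit compression cost.
r = transmission_tradeoff(10e3, 12, 111e-9, 0.30, 1.3e-12)
print(f"transmit {r['tx_power_w'] * 1e3:.2f} mW, "
      f"compression {r['comp_power_w'] * 1e6:.3f} uW, "
      f"net saving {r['net_saved_w'] * 1e3:.3f} mW")
```

With these inputs the compression cost is several orders of magnitude below the transmission power saved, so the net saving is governed almost entirely by the data reduction achieved. The balance shifts as the per-bit transmission cost falls or the per-bit compression cost rises.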
Clearly, the critical factors here are (a) the data reduction/compression ratio, (b) the power consumed per bit compressed, and (c) the transmission power saved for each bit no longer needing to be transmitted. Taking several compression scenarios, including L2SB and Huffman, it is possible to compare systems in terms of power saved.
For this comparison, we assume several test-case scenarios, including those already introduced (notably DPCM-Huffman and XOR-L2SB), but also including other reported work where power data is readily interpreted for known data-rates and repeatable configurations. To give representative transmission cases, power figures are based upon literature in the field, and several chosen aggregate cases, as defined in Table XI.
TABLE XII. CHUA ET AL VS HUFFMAN AND L2SB, 24 MHZ TEST CASE
For the primary comparison with the low-power lossless encoder [22], separate static and dynamic power data is not reported, and so cannot easily be translated to a nominal operating sample rate. However, this work reports a test case of a 24 MHz operating frequency, and 170 µW power consumption at an equivalent data-rate of 1226 × 12 bits and one channel. It is possible, therefore, to align L2SB and Huffman operating conditions to this model. For this scenario, L2SB is configured to operate as a single channel PIPO compressor, clocked at 24 MHz, with a sample data-rate of 1226 Sa/sec, thus allowing a power-gating duty cycle of approximately 19500:1 to be assumed. Huffman encoding also operates in a 1226 sample per second, single-channel look-up mode. This comparison results in the data presented in Table XII.
The best-case band configuration is chosen for L2SB, with power as measured in the earlier described CMOS implementation.
Although Huffman produces a higher power saving for both Bluetooth and Zigbee test cases, correlating to its higher compression ratio, the L2SB model delivers power savings very similar to Chua's low-power lossless compressor, in spite of having a lower compression rate. A lower compression rate, obtained at much lower power cost, allows overall power saving to be comparable to a system with higher CR. Even Huffman encoding does not deliver a very substantial overall power saving as compared to L2SB, and when area cost is also considered, it is clear that L2SB offers a very desirable combination of overall power saving versus area cost invested at the chip layout level.
For more power-efficient transmission cases, L2SB becomes even more attractive, apparently outperforming Chua, and almost matching Huffman, and of course with much lower chip area. This is particularly clear for the 21 nJ/bit transmission cost scenario, where Huffman starts to enter a region where it begins to lose its power benefit, and the 'Chua' test case actually results in a negative power optimization (i.e. it consumes more power than it saves).
This last point is a very important observation: as transmission power cost is gradually improved by better design of transmitters, the proposed XOR-L2SB methodology has significant advantages to offer here. If future improvements in compression can be made, then L2SB may well be envisaged as a preferred method of achieving valuable system power gains in radio-linked systems.
VI. SUMMARY AND CONCLUSIONS
A 65 nm CMOS circuit was designed, fabricated and validated, comprising a novel bio-physiological signal compression circuit with state-of-the-art power-per-bit and gate-area cost. Employing an XOR bit-change detection scheme, and a 'log2-subband' data encoding scheme, allows an exceptionally simple circuit design. The specific configuration of Log2-Subband encoding is capable of being tuned to the characteristics of the data stream in question, and thus able to maximize available compression ratios under this algorithm. This paper demonstrates that careful choice of configuration can boost CR significantly for a given type of data-set. EEG and ECG, for instance, are very different in their dynamic content. Designs for dynamically configurable L2SB encoders have been envisaged, and could offer further compression improvements and flexibility. For example, bi-modal compression configurations might boost overall CR for an ECG where there are two distinct signal behaviours, or where an EEG records both seizure and non-seizure data. This would be achieved with relatively small power and area penalty, especially given that the design is already extremely lightweight in these terms.
Power comparisons with alternative schemes have been presented, though this proves difficult to do comprehensively, as most work in the field reports overall power rather than static and dynamic power, or reports power with compression as an indivisible part of a system. This makes extrapolation to relevant sample rates and normalized operating conditions difficult. Nonetheless, comparisons are made where possible, and L2SB is found to have favourable capabilities.
Where power-gating techniques are employed, it has been demonstrated that power consumption per sample bit could be of the order of pico-joules, with a 100:1 power-gating scheme delivering 1.2 pJ per bit power consumption for a 10 kSa/s datarate. Again, this is believed to be state-of-the-art, and at the extreme low-end of what is envisaged to be possible at 65 nm, or indeed where scaled to other process technologies. There is every reason to believe that as lower process nodes are targeted, data compression with sub-picojoule per-bit power cost would be readily achievable for bio-physiological signal measurement systems.
Using the proposed figure of merit (FOM) efficiency measures, comparisons can easily be made between L2SB and other systems. A limited survey and comparison was indeed provided in this paper (see Tables IX and VII particularly) and illustrates the value of a technology-neutral figure of merit.
The authors consider the presented circuit to be potentially one of the simplest possible encoders available, yet it delivers useful compression ratios in the context of flash-memory storage compression, and data transmission power drain. Although L2SB delivers modest compression ratios compared to a wide variety of other algorithms, those algorithms come at the cost of higher complexity, large gate-area costs, and higher power consumption. Because of this, their ability to trade off compression versus transmission power is hindered in spite of their high CR values. Consequently, even though L2SB is inferior in compression ratio, it can be shown to deliver similar or superior performance for transmission power trade-offs in conjunction with extremely small silicon area. Indeed, in extreme resource-limited situations, such as printed organic thin film semiconductors, L2SB may even enable compression where other alternatives are simply not viable. This is a very interesting conclusion: in this case, the concept of 'less is more' appears to be well observed.
