Abstract-Heterogeneously integrated and miniaturized neural sensing microsystems are crucial for brain function investigation. In this paper, a 2.5D heterogeneously integrated bio-sensing microsystem with µ-probes and embedded through-silicon vias (TSVs) is presented for high-density neural sensing applications. This microsystem is composed of µ-probes with embedded TSVs, 4 dies and a silicon interposer. To capture 16-channel neural signals, a 24 × 24 µ-probe array with embedded TSVs is fabricated on a chip and bonded on the back side of the interposer. Thus, each channel contains 6 × 6 µ-probes with embedded TSVs. Additionally, the 4 dies are bonded on the front side of the interposer and designed for biopotential acquisition, feature extraction and classification via low-power analog front-end (AFE) circuits, area-power-efficient analog-to-digital converters (ADCs), configurable discrete wavelet transforms (DWTs), filters, and an MCU. An on-interposer bus (µ-SPI) is designed for transferring data on the interposer. Finally, a successful in-vivo test demonstrated the proposed 2.5D heterogeneously integrated bio-sensing microsystem. The overall power of this microsystem is only 676.3 µW for 16-channel neural sensing.
2.5D Heterogeneously Integrated Microsystem for High-Density Neural Sensing Applications
I. INTRODUCTION
Heterogeneously integrated neural prosthesis devices have a promising future for neural processing, networking, sensing and prosthesis. One application is restoring and replacing lost functions in paralyzed humans [1], [2]. Various approaches have been proposed to provide stable observation with a small form factor and biocompatible properties [3]. These approaches can capture neural signals accurately from subjects in their natural habitat without the subjects being burdened by the implantable devices. For brain function investigation and the realization of neural prostheses, however, highly integrated neural microsystems are crucial [4]. Neural-sensing microsystems typically consist of three main parts, as shown in Fig. 1: CMOS circuits for signal acquisition, conditioning and processing; neural probes or electrodes for brain signal collection; and bonding, cables or wires for the interconnection between the probes/electrodes and the circuits [5].
Many neural sensing microsystems have been proposed to provide a small form factor and biocompatible properties, including stacked multichip systems [6]-[8], microsystems with separated neural sensors [9]-[11], a monolithic packaged microsystem [12] and through-silicon-via (TSV) based double-side integrated microsystems [13], [14]. These heterogeneous biomedical devices are composed of sensors and CMOS circuits for biopotential acquisition, signal processing and transmission. However, the weak signals detected by the sensors in [6]-[12] have to pass through a string of interconnections and interfaces to reach the CMOS circuits by wire bonding. Detailed comparisons of these schemes are given in [14]. In view of this, TSV-based double-side integration [13], [14] uses TSV arrays to transfer the weak signals from the µ-probe arrays to the CMOS circuits in order to reduce noise. Nevertheless, double-side integration requires reserving a large area for separate µ-probe arrays and TSV arrays, and the TSV fabrication process may damage the CMOS circuits. Additionally, all CMOS circuits must be fabricated in the same process.
Several neural acquisition ICs have been proposed to achieve area-energy-efficient neural recording systems [15]-[17]. In this paper, a 2.5D heterogeneously integrated bio-sensing microsystem with µ-probes is presented for high-density multi-channel neural sensing applications. The subcomponents and the integration of this heterogeneously integrated microsystem have been presented in [18]-[20] and [21], respectively. The rest of this paper is organized as follows. Section II presents the overall system architecture and physical structure of this heterogeneously integrated bio-sensing microsystem. Section III describes the proposed µ-probes with embedded TSVs. Section IV describes the proposed neural-signal acquisition circuits, which involve the low-power analog front-end (AFE) circuitry and the area-power-efficient analog-to-digital converter (ADC). Sections V and VI elucidate the design of the configurable discrete wavelet transforms (DWTs) and the low-power on-interposer bus (µ-SPI), respectively. Section VII summarizes the implementation, experimental results and in-vivo test. Finally, conclusions are given in Section VIII.
II. 2.5D HETEROGENEOUSLY INTEGRATED BIO-SENSING MICROSYSTEM
For neural sensing applications such as human brain mapping, high-density neural sensing microsystems can provide neuroimaging with high spatiotemporal resolution [22], [23]. In view of this, the proposed 2.5D heterogeneously integrated bio-sensing microsystem with µ-probes is designed to capture high-density electrocorticography (ECoG) signals in a small local area, called µ-ECoG sensing. Fig. 2 presents the structure of the 2.5D integration with µ-probes. The MEMS µ-probes are formed by TSV fabrication on an 8-inch wafer, which is diced into 5 mm × 5 mm chips. Within the area of 25 mm², 24 × 24 µ-probes with embedded TSVs are fabricated for collecting ECoG signals individually. The detailed process flow of the µ-probes with embedded TSVs is described in Section III. The other wafer is diced into 16.5 mm × 10.4 mm pieces that serve as the silicon interposer for the 2.5D heterogeneous integration. Then, the 4 dies and the 5 mm × 5 mm µ-probe array with embedded TSVs are bonded on the top side and bottom side of the silicon interposer, respectively. This silicon interposer provides the supporting platform for the multi-chip solution and the connections between the dies and the µ-probes through 2 series-connected TSVs. Hence, the ECoG signal sensed by each µ-probe is transmitted to the front side through these 2 cascaded TSVs. On the top side of the interposer, two redistribution layers (RDLs) are fabricated for the connections between the 4 dies and the TSVs. Moreover, the configurable sensing channels can be defined by the routing of the RDLs. In this 2.5D heterogeneously integrated microsystem, the 24 × 24 µ-probes with embedded TSVs are divided into 16 channels, and each channel contains 6 × 6 µ-probes. The block diagram and the cross-section view of the bio-sensing microsystem are shown in Fig. 3. This microsystem is composed of µ-probes with embedded TSVs, 4 dies and 1 interposer.
These 4 dies are designed for biopotential acquisition, ECoG signal processing, feature extraction and classification via AFE readout circuits, ADCs, configurable DWT circuits, filters, and 1 microcontroller unit (MCU, Renesas RX210). Die-1 is fabricated in the TSMC 0.18 µm CMOS process, and the other three dies are bare dies from Renesas and Lattice. Die-1 is designed for 16-channel signal acquisition with 16-channel AFE circuits and 4 area-power-efficient ADCs. Die-2 and die-3 are two Lattice FPGAs fabricated in a 65 nm Low Power (LP) CMOS process, and the DWT and filters are implemented on these two low-power FPGAs. Die-4 is an MCU for system control and feature classification. Additionally, an on-interposer bus (µ-SPI, Serial Peripheral Interface) is designed for transferring data in the bio-sensing microsystem. The 2-layer RDLs on the interposer provide the links between the 4 dies, the connectors and the µ-probes. Moreover, the RDLs are also used to define the configurable sensing channels through the connection of the µ-probes. Within the 5 mm × 5 mm µ-probe array, 16 ECoG channels are defined for µ-ECoG sensing, and each channel contains 36 µ-probes.
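The grouping of the 24 × 24 µ-probes into 16 channels of 6 × 6 µ-probes can be illustrated with a short Python sketch. The text only states that the channels are defined by the RDL routing; the contiguous, row-major tiling used here, and the names `probe_to_channel` and `probes_per_channel`, are assumptions made for illustration.

```python
# Illustrative sketch: map a probe position to a sensing channel.
# Assumption (not from the source): channels are contiguous 6 x 6 tiles,
# numbered row-major across a 4 x 4 grid of tiles.
ARRAY_SIZE = 24                            # probes per side of the array
TILE_SIZE = 6                              # probes per side of one channel
TILES_PER_SIDE = ARRAY_SIZE // TILE_SIZE   # 4 tiles per side, 16 channels


def probe_to_channel(row: int, col: int) -> int:
    """Return the channel index (0..15) of the probe at (row, col)."""
    if not (0 <= row < ARRAY_SIZE and 0 <= col < ARRAY_SIZE):
        raise ValueError("probe index out of range")
    return (row // TILE_SIZE) * TILES_PER_SIDE + (col // TILE_SIZE)


def probes_per_channel() -> dict:
    """Count how many probes are shunted together in each channel."""
    counts = {}
    for r in range(ARRAY_SIZE):
        for c in range(ARRAY_SIZE):
            ch = probe_to_channel(r, c)
            counts[ch] = counts.get(ch, 0) + 1
    return counts
```

Under this assumed tiling, every channel collects exactly 36 probes, which matches the 6 × 6 grouping stated in the text.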
III. PROCESS FLOW OF BIO-SENSING MICROSYSTEM
The fabrication process flow of the proposed heterogeneously integrated bio-sensing microsystem with µ-probes and embedded TSVs is illustrated in Fig. 4. The process consists of an 8-inch full-wafer standard CMOS process, a thin-film process and a bulk silicon process. In Step-1, 200-µm-deep, 30-µm-diameter TSVs with a fully filled Cu plating process are fabricated on one side of both the probe wafer and the interposer wafer. The TSV process includes the CMOS passivation/oxide/field-oxide layer opening process, isolation layer deposition and metal electroplating in the TSVs. In Step-2, a deep inductively coupled plasma (ICP) etching process is used for µ-probe formation. Thus, the tip and shaft of the µ-probes are formed around the TSVs on the TSV side of the probe wafer. By carefully controlling the etching process, the 50 µm µ-probes with 30 µm TSVs protrude at the end of the shaft-forming process. The two wafers are ground and thinned to 200 µm via chemical mechanical polishing (CMP) to expose the TSVs. Platinum (Pt) and titanium (Ti) thin-film layers are sputtered on the µ-probes to provide the biopotential acquisition material. For the interposer wafer, one redistribution layer and two redistribution layers (RDLs) are fabricated to integrate the µ-probe chip on the back side and the 4 dies on the front side, respectively.
In Step-3, the µ-probe and interposer wafers are diced into a 5 mm × 5 mm µ-probe chip and a supporting platform, respectively. This µ-probe chip contains 24 × 24 µ-probes with embedded TSVs. The µ-probe chip, interposer, 4 dies and connectors are bonded and stacked using silver (Ag) and tin (Sn) glue, and the gaps between the stacked layers are filled with isolation layers. In Step-4, the microsystem is coated with parylene-C for insulation. To provide tissue contact for the µ-probes, the tips of the µ-probes are then opened by a laser process that removes the parylene-C on the Pt.
IV. 16-CHANNEL NEURAL-SIGNAL ACQUISITION CIRCUITS
Die-1 is designed for 16-channel neural-signal acquisition with 16-channel AFE circuits and 4 area-power-efficient ADCs, as shown in Fig. 5. Neural signals are amplified by the AFE circuits and converted to digital codes by the ADCs. The 16 AFE circuits operate independently. Additionally, analog multiplexers are designed to route the corresponding signals from the AFEs to the ADCs via low-impedance paths. After the area-power optimization, four AFEs share one biasing circuit and one ADC.
A. Design of Energy-Efficient Low-Noise AFE
Each AFE channel is composed of two amplification stages and one output stage for amplifying neural signals and providing enough current to drive the following ADC. Fig. 6 presents the schematic of the 1-channel AFE circuit with feedback capacitors and pseudo-resistors. The AFE comprises two differential difference amplifier (DDA) stages and pseudo-resistors. The DDA is designed to provide mismatch immunity, a high common-mode rejection ratio (CMRR) and good amplification for neural sensing applications. The high-pass cut-off frequency is set by the pseudo-resistors and the feedback capacitors. The ideal DDA amplifies the differential voltage by a nearly infinite amount but fully suppresses all common-mode voltages. Additionally, a DDA can provide a differential input with very high input impedance and a well-controlled differential gain. Generally, the major advantage of DDAs is that the input current (i.e., the drain current of the input stage) is two to three times that of a conventional operational amplifier [24]. The input noise and CMRR are mainly dominated by the mismatch of the input stage. With a larger input stage, the noise and leakage of the amplifier decrease, and thus the input impedance and CMRR are enhanced. However, the area of a DDA is much larger than that of a differential amplifier.
The schematic of the proposed first DDA stage is shown in Fig. 7, with a DDA input stage, current mirrors, and a class-AB output stage. The input voltage is converted to a current by the current mirror to control the class-AB output stage. Additionally, the voltages VP1, VP2, VN2, and VREF are generated by internal biasing circuits. At low frequencies, flicker noise dominates over the input signal, so the amplitude and resolution of the output signal are limited by the flicker noise. To reduce the power consumption and flicker noise of the DDA, large-width PMOS transistors are used in the input stage. These PMOS transistors operate in the weak-inversion (sub-threshold) region to realize low power consumption, low flicker noise, and high voltage gain. The drain current of each input transistor is 0.5 µA. In AFE design, increasing the number of cascaded stages increases the power consumption. At the system level, however, a conventional single-stage AFE cannot achieve the optimal power efficiency [25]. The 1/2 LSB tracking error of the ADC is modelled in (1) in terms of the bandwidth of the amplifier and the maximum frequency of the input signal. The model also includes the ratio of the holding time in each conversion cycle, and n is the resolution of the ADC.
(1)
To achieve an acceptable conversion rate of the ADC with the corresponding low-pass cut-off frequency, the proposed AFE consists of two DDA stages and one output stage for optimizing the power efficiency. Fig. 8 shows the schematics of the 2nd stage and output stage of the AFE. The circuitry of the 2nd stage is similar to that of the 1st stage. The voltage gain and current consumption of the 2nd stage are smaller than those of the 1st stage for energy efficiency. The output stage is designed to provide a large output current for driving the analog multiplexer, which is connected to an ADC. The output stage consists of a common-drain structure and a class-AB output stage controlled by two diode-connected transistors to achieve a large voltage swing. The output current is 5 µA. A high-pass filter is designed to remove the DC offset from the input signal. Due to the limited chip size, pseudo resistors are used in the AFE design. The equivalent resistance of the pseudo resistors is large, so that together with the internal feedback capacitors it provides a large time constant and hence a low high-pass cut-off frequency. Many types of pseudo resistors have been proposed [26]. However, these pseudo resistors exhibit asymmetric and nonlinear resistance when the voltage across them sweeps from negative to positive levels. Furthermore, the drain-body junction of the pseudo resistor contributes leakage current on the order of pA and affects the performance of the amplifier. As such, both the variation of the pseudo-resistor resistance and the distortion of the output signal increase near the quiescent point. The resistance of the conventional pseudo resistors cannot be higher than 100 due to the leakage current, even with ultra-low gate bias. Hence, the high-pass frequency is limited to around 100 Hz, and some neural signals will be filtered out.
Therefore, a modified pseudo resistor with M1/M2 is proposed to mitigate the variation of the resistance, as shown in Fig. 9(a). This pseudo resistor is composed of four current-controlled transistors and two voltage-controlled transistors, M1 and M2. Fig. 9(b) presents the simulated resistance of the pseudo resistors with and without M1/M2, showing the symmetric resistance property. The modified pseudo resistor is a fully balanced pseudo resistor that achieves higher resistance by connecting the two transistors M1 and M2 in series. The bias voltages, including Vbias, are fixed voltages generated by the internal biasing circuitry.
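To make the link between pseudo-resistor resistance and high-pass corner concrete, the following Python sketch evaluates the standard first-order relation f_c = 1/(2πRC). The first-order model and the component values in the usage note are assumptions for illustration only; the source does not give the feedback capacitance or the achieved resistance.

```python
import math


def highpass_cutoff_hz(r_ohm: float, c_farad: float) -> float:
    """First-order high-pass corner frequency f_c = 1 / (2*pi*R*C)."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


def required_resistance_ohm(f_cutoff_hz: float, c_farad: float) -> float:
    """Resistance needed to place the corner at f_cutoff_hz for a given C."""
    return 1.0 / (2.0 * math.pi * f_cutoff_hz * c_farad)
```

For example, with a hypothetical 200 fF feedback capacitor, `highpass_cutoff_hz(1e12, 200e-15)` gives a corner slightly below 1 Hz, which shows why a small on-chip capacitor needs a very large equivalent resistance to pass low-frequency neural signals.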
B. 11-Bit SAR ADC With Delay-Line Enhanced Tuning
An area-power-efficient 11-bit hybrid ADC with delay-line enhanced tuning is designed for neural-signal acquisition. The hybrid ADC consists of a 3-bit delay-line-based ADC and an 8-bit SAR ADC, as shown in Fig. 10. The 8-bit SAR ADC is designed as a fine-tune block to provide moderate resolution and low energy consumption. Additionally, the delay-line-based ADC is designed as a coarse-tune block to enhance tuning and reduce the total capacitance of the SAR ADC.
The delay-line-based enhanced tuning block is a combination of a voltage-to-time converter (VTC) and a time-to-digital converter (TDC), as shown in Fig. 11. Based on P-type and N-type voltage-controlled delay cells [27], the time differences corresponding to different voltages can be derived, and hence the input voltage can be divided into 8 groups with 3-bit detection. The lower part and the upper part are sensitive to 0 V to 0.9 V and 0.9 V to 1.8 V, based on the P-type buffer and N-type buffer shown in Fig. 11, respectively. The conventional TDC consists of a set of buffers and flip-flops. Moreover, the vernier structure of the TDC can reduce the effective delay step of each buffer stage to the difference between two buffer delays [27]. If the voltage to be quantized lies between a minimum and a maximum level with R-bit resolution, the total processing time window lies between a corresponding minimum and maximum. Assuming that the time window is a linear function of the input voltage, the resulting relation is given in (2).
However, maintaining the linearity of the time window over its whole range is a design bottleneck for delay-line-based ADCs. Therefore, a modified vernier structure of the TDC is proposed for the delay-line-based enhanced tuning block, as shown in Fig. 11. Instead of designing a linear VTC, the delay between the two buffers in the modified TDC is changed from a single fixed value to a stage-specific value. The time difference of the two buffers in each stage is thus adjusted for the corresponding input voltage, as shown in (3). The proposed structure can provide better precision than the original vernier structure.
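The vernier principle described above can be sketched behaviorally in Python. This is a generic model of a vernier delay line, not the circuit of Fig. 11: each stage closes the gap between the start and stop edges by the difference between a slow and a fast buffer delay, and the code is the number of stages the start edge still leads. The function name, the picosecond units and the per-stage values in the usage note are hypothetical.

```python
def vernier_tdc(delta_t_ps: int, stage_steps_ps: list) -> int:
    """Quantize a start-to-stop time difference with a vernier delay line.

    delta_t_ps      time by which the start edge leads the stop edge
    stage_steps_ps  per-stage step (slow-buffer delay minus fast-buffer
                    delay); equal values model the original vernier line,
                    unequal values model stage-specific tuning
    Returns the number of stages passed before the stop edge overtakes
    the start edge; 7 stages give 8 codes, i.e. 3 bits.
    """
    remaining = delta_t_ps
    code = 0
    for step in stage_steps_ps:
        if remaining < step:      # stop edge overtakes start in this stage
            break
        remaining -= step         # gap shrinks by one vernier step
        code += 1
    return code
```

With seven equal 100 ps steps, `vernier_tdc(350, [100] * 7)` returns 3. Passing unequal steps, for example `[50, 75, 100, 125, 150, 175, 200]`, shows how stage-specific delays can absorb a nonlinear voltage-to-time characteristic without requiring a linear VTC.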
In the hybrid ADC, the coarse-tune ADC propagates the three most significant bits (MSBs) to the SAR control logic. Based on these three bits, a lifting-based searching algorithm is proposed for the SAR ADC to relax the accuracy requirement of the coarse tune. After the input voltage is sampled, a lifting procedure lifts one comparator input to a level slightly higher than the other, as shown in Fig. 12. The whole operation region is divided into 8 blocks, and the lifting procedure lifts one block higher than that of the input voltage. After lifting, the split capacitor array of INP operates as the conventional DAC of a SAR ADC, as shown in Fig. 12. To further reduce the total capacitance of the capacitor array, the split-capacitor technique [28] is used in the SAR ADC. Additionally, the starting bit in the SAR ADC is the 4th bit of the 11-bit output, so the reference voltage for the capacitor array of INP, and with it the difference between the two comparator inputs, must be scaled down accordingly.
In the hybrid ADC, three clock cycles are required per sample, as shown in Fig. 13. The first clock cycle is for the coarse tune to detect the 3 MSBs, and the second clock cycle is for the fine tune. In this cycle, the sample stage and the comparison stage are executed when the clock is "High" and "Low", respectively. However, a 30 mV offset is designed into the 8 detection blocks of the coarse-tune ADC to guard against uncertain process-voltage-temperature (PVT) variations of the delay cells. Because of this offset in the detection regions, the 3 MSBs may indicate the wrong block for the fine tune. In that case, the lifted level would be too high, producing a totally wrong output (i.e., all ones). Since the offset is inevitable, a re-comparison procedure is inserted in the third clock cycle. As soon as a comparison of the SAR ADC is complete, a verification process examines whether the output is "1111'1111". Once this all-ones output is detected, the fine-tune ADC starts over in the third clock cycle with the 3 MSBs incremented by one. The re-comparison procedure relaxes the accuracy requirement of the coarse tune with acceptable overhead in power consumption and performance.
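The coarse/fine flow with re-comparison can be illustrated with a purely behavioral Python model. It does not model the charge-domain lifting or the split-capacitor DAC; it only shows how an all-ones fine result triggers a second search with the 3 MSBs incremented. The 1.8 V full scale and the 30 mV offset come from the text, while the way the offset shifts the coarse thresholds (so the coarse result can only be one block too low) and all function names are assumptions.

```python
FULL_SCALE = 1.8                       # volts, 0 V to 1.8 V input range
COARSE_BITS = 3
FINE_BITS = 8
BLOCK = FULL_SCALE / (1 << COARSE_BITS)    # width of one coarse block
LSB = BLOCK / (1 << FINE_BITS)             # 11-bit LSB
ALL_ONES = (1 << FINE_BITS) - 1
MAX_COARSE = (1 << COARSE_BITS) - 1


def coarse_detect(v_in: float, offset: float = 0.03) -> int:
    """3-bit coarse estimate; the assumed offset can report one block low."""
    block = int((v_in - offset) / BLOCK)
    return max(0, min(MAX_COARSE, block))


def sar_fine(v_in: float, coarse: int) -> int:
    """8-bit successive approximation inside the block chosen by `coarse`."""
    base = coarse * BLOCK
    code = 0
    for bit in reversed(range(FINE_BITS)):
        trial = code | (1 << bit)          # tentatively set the next bit
        if v_in >= base + trial * LSB:     # comparator decision
            code = trial
    return code


def hybrid_convert(v_in: float) -> int:
    """Coarse tune, fine tune, then re-comparison on an all-ones result."""
    coarse = coarse_detect(v_in)
    fine = sar_fine(v_in, coarse)
    if fine == ALL_ONES and coarse < MAX_COARSE:
        coarse += 1                        # third cycle: 3 MSBs + 1
        fine = sar_fine(v_in, coarse)
    return (coarse << FINE_BITS) | fine
```

In this model an input of 0.235 V, which lies just above the first block boundary, is initially assigned to block 0, saturates the fine search, and is then resolved correctly in block 1. One side effect of the rule is that a genuine all-ones code inside a block is reported one LSB high after the re-comparison.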
To further improve the power efficiency of the SAR ADC, a self-timed SAR control is designed to reduce the operating clock frequency. With the self-timed circuitry, once the comparator completes the comparison of one bit, the comparison of the next bit is activated by itself without any reference clock, and the period of each comparison is not fixed but dynamically adjusted according to the corresponding delay. Thus, the speed of the SAR ADC can be optimized, and the frequency of the system clock can be reduced to cover only the three cycles. The self-timed SAR control is implemented with Muller C-elements and SR-latches. With the cascaded Muller C-element structure, an efficient signal transmission circuit can be realized for the self-timed control.
C. Area-Power Optimization of AFEs and ADCs
A trade-off between power and area exists in the neural-signal acquisition circuits. The ratio between the number of AFEs and the number of ADCs affects the overall power. As the number of ADCs decreases, the frequency of the system clock and the output current of the AFEs increase, and thus the overall power also increases significantly. For example, the output current of each AFE in a 16-channel AFE with one ADC would exceed 50 µA for the required conversion rate. Based on the trade-off between power and area shown in Fig. 14, a 16-channel AFE with four ADCs is designed in the microsystem. When the sampling clock of each channel is 2 kHz, the output current and the switching frequency of each analog multiplexer are 5 µA and 8 kHz, respectively.
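The multiplexer switching frequency quoted above follows directly from how many AFE channels share one ADC. The short Python sketch below reproduces that arithmetic; the function name is hypothetical, and the model simply assumes that each ADC converts its channels in turn at the per-channel sampling rate.

```python
def acquisition_rates(n_channels: int, n_adcs: int, fs_channel_hz: int):
    """Return (channels per ADC, multiplexer/ADC conversion rate in Hz)."""
    if n_adcs <= 0 or n_channels % n_adcs != 0:
        raise ValueError("channels must divide evenly among the ADCs")
    channels_per_adc = n_channels // n_adcs
    # Each ADC must convert every one of its channels once per sample period.
    return channels_per_adc, channels_per_adc * fs_channel_hz
```

For the configuration in the text, `acquisition_rates(16, 4, 2000)` returns `(4, 8000)`, i.e. the 8 kHz switching frequency. A single shared ADC would need to run at four times that rate, which is the direction of the power penalty the text describes.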
V. DESIGN OF ENERGY-EFFICIENT CONFIGURABLE DWT
The energy-efficient configurable lifting-based DWT is designed to extract the features of neural signals by filtering the signals into different frequency bands. Additionally, both the time window and the mother wavelet can be adjusted via the configurable datapath. Moreover, power-gating and clock-gating techniques are used to further reduce the energy consumption for energy-limited bio-systems. The configurable DWT supports different applications with four types of mother wavelet: Haar, Daubechies 2 (D2), Symmlet 4 (Sym4) and Symmlet 6 (Sym6). These four types differ in the length of the time window and the filter coefficients. Additionally, the DWT is designed based on the lifting algorithm to reduce the arithmetic circuits [29]. With the lifting algorithm, the DWT is decomposed into lifting steps consisting of "predict" and "update" functions. First, the input data is split into even and odd samples, and the predict and update filters are then applied sequentially. The last step is a multiplication by the scaling factors K and 1/K. To simplify the datapath of the configurable lifting-based DWT, the lifting steps are designed on a common backbone, as shown in Fig. 15. The backbone of the lifting steps is based on the Sym6 wavelet, which has the maximum time-window length, and the decompositions of the other mother wavelets are derived from this backbone. Therefore, the configurable datapath can realize the other mother wavelets by adjusting only the filter coefficients, which reduces the complexity of the controller.
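The split, predict, update and scale sequence can be illustrated with the simplest of the four supported wavelets. The Python sketch below implements the textbook lifting factorization of the Haar wavelet in floating point; it is not the fixed-point Sym6 backbone of Fig. 15, and the function names are hypothetical.

```python
import math


def haar_lifting_forward(x):
    """One DWT level via lifting: split, predict, update, scale."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    even, odd = x[0::2], x[1::2]                          # split
    detail = [o - e for e, o in zip(even, odd)]           # predict step
    approx = [e + d / 2 for e, d in zip(even, detail)]    # update step
    k = math.sqrt(2)                                      # scaling factor K
    return [a * k for a in approx], [d / k for d in detail]


def haar_lifting_inverse(approx, detail):
    """Undo the lifting steps in reverse order to recover the input."""
    k = math.sqrt(2)
    a = [v / k for v in approx]
    d = [v * k for v in detail]
    even = [ai - di / 2 for ai, di in zip(a, d)]          # undo update
    odd = [e + di for e, di in zip(even, d)]              # undo predict
    out = []
    for e, o in zip(even, odd):                           # merge
        out.extend([e, o])
    return out
```

Because every lifting step is inverted by simply reversing its sign, `haar_lifting_inverse(*haar_lifting_forward(x))` reproduces `x` up to floating-point rounding. Longer wavelets such as Sym6 add more predict/update steps with different coefficients but keep the same structure.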
A. Datapath of Lifting-Based Configurable DWT
The lifting factorization algorithm is used for the configurable DWT, and the maximum number of operations is 8 steps for the Sym6 mother wavelet, which is the basic model for the other three mother wavelets [30]. Fig. 16 presents the multi-level multi-channel lifting-based DWT architecture. Based on the correlated arithmetic functions in these eight steps, a computation core (CC) is realized that serves each step. The lifting-based DWT can be realized using only 1 CC block to calculate the equations of a 1-level iteration. However, data access is a crucial issue in the design of the datapath. Two separate input buffers are realized for efficient data access. The multi-channel input buffer stores the input data for the different neural sensing channels, and the input buffer (CC) holds the sequential input samples or the output data from the last level. Additionally, the temporary data for a 1-level iteration are stored in the buffer (computation core), which is composed of six shift registers. The buffer (multi-level/multi-channel) stores the intermediate data from different channels and levels. The coefficients of the different wavelets are placed in the coefficient memory. Furthermore, the a_temp register is designed for the last step of a 1-level iteration, which scales the data to the approximation by the scaling factor.
A 1-level iteration is composed of 8 steps, and each step can be realized by the computation core. The functionality of each step can be implemented in a simple form that combines X with the coefficient-weighted inputs Y and Z. Based on the regularity of these 8 steps, the complexity of the CC block can be reduced to two multipliers and one adder. Additionally, the three 10-bit inputs X, Y and Z are defined in two's complement form, and the two filter coefficients are quantized to 6 bits. The outputs of the two multipliers must be down-scaled by a hardwired shift operation to remove the ×16 scaling of the filter coefficients. X is then extended to 12 bits for the 3-term adder. Finally, truncating the output of this adder is the last action of the CC operation.
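A fixed-point sketch of one computation-core step is given below in Python. The exact expression is not legible in this copy, so the form X + a·Y + b·Z is inferred from the description (three inputs, two multipliers, one 3-term adder). The names `quantize_coeff`, `wrap_signed` and `computation_core`, and the choice to model the final truncation as keeping the low 10 bits, are assumptions.

```python
COEFF_SHIFT = 4      # coefficients are stored scaled by 16
DATA_BITS = 10       # X, Y, Z are 10-bit two's complement values


def quantize_coeff(value: float) -> int:
    """Quantize a filter coefficient to a 6-bit signed integer scaled by 16."""
    q = round(value * (1 << COEFF_SHIFT))
    if not -32 <= q <= 31:
        raise ValueError("coefficient does not fit in 6 bits")
    return q


def wrap_signed(value: int, bits: int) -> int:
    """Keep the low `bits` bits and interpret them as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def computation_core(x: int, y: int, z: int, a_q: int, b_q: int) -> int:
    """Assumed step form: X + a*Y + b*Z with hardwired down-scaling."""
    prod_y = (a_q * y) >> COEFF_SHIFT     # multiplier 1, remove x16 scaling
    prod_z = (b_q * z) >> COEFF_SHIFT     # multiplier 2, remove x16 scaling
    total = x + prod_y + prod_z           # 3-term adder on widened data
    return wrap_signed(total, DATA_BITS)  # truncate back to 10 bits
```

For instance, with coefficients 0.5 and -1.0 quantized to 8 and -16, `computation_core(100, 40, -20, 8, -16)` evaluates 100 + 20 + 20 = 140. A saturating output stage would be an equally plausible reading of the truncation step.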
For the multi-channel and multi-level DWT operations, a 1-level iteration taking 10 execution cycles is the baseline. Fig. 16 also shows that a 1-level DWT iteration is composed of three states with a total of ten execution cycles, fixed by Sym6. The read state takes one cycle to prepare the input data and the temporary data from the previous sample, and the following eight cycles form the computation state for the 8 steps. Finally, the last cycle is the write state.
B. Power Gating and Clock Gating for DWT
Low energy dissipation is a critical challenge for miniaturized neural sensing microsystems. Therefore, the interleaving architecture for multi-level DWT has been proposed to reduce the operation frequency for active power saving [31] . However, with scaled technologies, leakage current increasingly dominates the overall power consumption, especially at low operating frequencies. A gating architecture for multi-channel and multi-level DWT is designed to reduce the active power and leakage power via clock gating and power gating, respectively. process of different levels is executed sequentially from level-1 to level-5. The inactive time is variable and shorter than that in the odd sampling period. Hence, only the clock gating is applied within the inactive iterations during the even sampling periods.
The frequency of the interleaving architecture is only related to the number of channels. However, the frequency of the gating architecture is related to both the number of channels and the number of levels. For a 5-level 4-channel DWT with a sampling frequency of 2 kHz and 10 execution cycles per 1-level iteration, the clock frequencies of the gating and interleaving architectures are 400 kHz and 80 kHz, respectively.
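The two quoted clock frequencies can be reproduced with simple arithmetic. The Python sketch below assumes that the gating architecture must fit all levels of all channels into one sampling period in the worst case, while the interleaving architecture needs only one 10-cycle iteration per channel per sampling period. These formulas are inferred from the numbers rather than stated in the source, and the function names are hypothetical.

```python
CYCLES_PER_ITERATION = 10   # read (1) + compute (8) + write (1) cycles


def gating_clock_hz(fs_hz: int, channels: int, levels: int) -> int:
    """Worst case: every level of every channel runs in one sample period."""
    return CYCLES_PER_ITERATION * levels * channels * fs_hz


def interleaving_clock_hz(fs_hz: int, channels: int) -> int:
    """Interleaving schedules one iteration per channel per sample period."""
    return CYCLES_PER_ITERATION * channels * fs_hz
```

With `fs_hz=2000`, `channels=4` and `levels=5`, these functions give 400 kHz and 80 kHz, matching the figures in the text. The factor of five between them equals the number of levels.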
C. Power/Area Analyses for Multi-Channel DWT
For multi-channel DWT, the data of different channels are independent, so the computation core can be shared. To obtain an optimal trade-off between energy and area, a time-multiplexing scheme is used. 5-level 16-channel lifting-based DWTs are implemented with different folding numbers of the time-multiplexing scheme, with the interleaving or gating architecture, and in different CMOS processes, including TSMC 90 nm GP (General Purpose), 90 nm LP (Low Power), 65 nm GP and 65 nm LP. The supply voltages of the GP and LP processes are 1.0 V and 1.2 V, respectively. Fig. 18 presents the power and area analysis of the different architectures in the different processes. In the GP processes, both the power consumption and the area decrease as the folding number of the time-multiplexing scheme increases. The power of the gating architecture is much smaller than that of the interleaving one at a similar area because the leakage current is reduced significantly. In the LP processes, the share of dynamic power is larger. Therefore, the power consumption of the 16-folded DWT increases rapidly due to the increasing operation frequency. Nevertheless, the power of the gating architecture is still smaller than that of the interleaving one, since the leakage current is reduced by power gating. Based on the area-power optimization, 4-folded time-multiplexing 4-channel DWTs are implemented in the 2 low-power FPGA dies in the 65 nm LP CMOS process for 16-channel feature extraction.
VI. LOW-POWER ON-INTERPOSER BUS
The low-power on-interposer bus (µ-SPI) is designed to provide low-power inter-chip data communication in 2.5D heterogeneous integration.
A. Abstraction Layers of µ-SPI
Based on the characteristics of on-interposer buses, the protocol of the proposed µ-SPI is designed using three abstraction layers. The transport layer provides end-to-end communication services for the overall system. Additionally, the transport layer provides convenient services such as connection-oriented data streams, reliability, flow control, and low-power bus coding. In the transport layer of µ-SPI, the back-end interface can adopt cyclic redundancy check (CRC) and crosstalk avoidance coding (CAC) to provide low-power and reliable data communication. In the data-link layer of µ-SPI, the master controls the data bus by generating the clock signal (SCLK) and the corresponding headers for all slaves. The µ-SPI can provide point-to-point or broadcast communication, half-duplex or full-duplex transmission, and receiver-controlled acknowledgment. Moreover, the µ-SPI can support pseudo multi-master operation via a master passing technique. The physical layer is implemented to synchronize signals according to the standard SPI but with bidirectional links.
B. Hierarchical Packetization
To reduce the header overhead, the header of a packet is divided into two levels by the hierarchical packetization technique, as shown in Fig. 19. The length of the 1st-level header is fixed at 12 bits and indicates the functionality of the packet. Based on the information in the 1st-level header, the 2nd-level header has a variable length to support a wide range of burst lengths, broadcast or point-to-point selection, and variable-length addresses.
C. Pseudo Multi-Master
The pseudo multi-master scheme is proposed in µ-SPI to replace complex arbitration circuits with master passing. All the master devices are implemented as master/slave modules, as shown in Fig. 20. A master/slave device can therefore act as either a master or a slave, and only 1 master can exist among the master/slave modules, which is controlled by MS_Flag. The M/S flag indicates the direction of SS, SCLK and data. If the M/S flag is one in a master/slave device, this device is the only master in µ-SPI until it passes the flag to another master/slave device via a specific packet. For this purpose, the 1st-level header contains a pseudo multi-master mode for master passing.
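The master passing idea can be sketched as a small behavioral model in Python. It only captures the invariant stated above, namely that exactly one master/slave module holds the M/S flag at any time and that mastership moves by an explicit hand-over. The class names, the device names in the usage note and the method `pass_master` are hypothetical; the real packet format is not modelled.

```python
class MasterSlaveDevice:
    """A module that can act as master or slave depending on its M/S flag."""

    def __init__(self, name: str, ms_flag: bool = False):
        self.name = name
        self.ms_flag = ms_flag      # True: this device drives SS, SCLK, data


class PseudoMultiMasterBus:
    """Bus model that keeps exactly one master among its devices."""

    def __init__(self, devices):
        self.devices = list(devices)
        if sum(d.ms_flag for d in self.devices) != 1:
            raise ValueError("exactly one device must start as master")

    def current_master(self) -> MasterSlaveDevice:
        return next(d for d in self.devices if d.ms_flag)

    def pass_master(self, target_name: str) -> None:
        """Model the master-passing packet: hand the flag to the target."""
        target = next((d for d in self.devices if d.name == target_name), None)
        if target is None:
            raise KeyError(target_name)
        self.current_master().ms_flag = False   # old master becomes a slave
        target.ms_flag = True                   # target becomes the master
```

For example, a bus built from devices named `"mcu"` (initial master), `"fpga1"` and `"fpga2"` has `"fpga1"` as its only master after `bus.pass_master("fpga1")`. No arbitration logic is needed because the flag is never duplicated.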
D. Design of Master Module and Slave Module
The block diagrams of a master module and a slave module are shown in Fig. 21. In the master module, the back-end interface transfers the communication mode, selected slaves, address mode, burst length and valid bit to the header encoder for generating hierarchical packets. The slave is constructed from the two-layer header decoder and the PHY, which are the white blocks in Fig. 21.
VII. IMPLEMENTATION AND EXPERIMENTAL RESULTS
The proposed 2.5D heterogeneously integrated bio-sensing microsystem with µ-probes is implemented using one µ-probe chip with embedded TSVs, 4 dies and 1 silicon interposer. Fig. 22 shows micrographs of the microsystem in 2.5D integration. The size of this microsystem is 16.5 mm × 10.4 mm including 4 connectors, and the total size of the four active dies is only 10.3 mm × 4.8 mm. These 4 dies are designed for neural-signal acquisition, feature extraction and classification via AFE readout circuits, ADCs, configurable DWT circuits, filters, and an MCU. Additionally, the µ-probe chip with embedded TSVs is bonded on the back side of the interposer, as shown in Fig. 22. Fig. 23 shows the measured impedance of an embedded TSV and scanning electron micrographs (SEM) of the fabricated MEMS µ-probe array with embedded TSVs, including one cross-section view and one 60-degree view. The measured impedance of an embedded TSV is 0.17 with phase of at 1 kHz. The white vertical metal tube in the middle of the µ-probe is the Cu TSV. The µ-probe chip consists of 24 × 24 µ-probes, which are divided into 16 channels by the connection of the RDLs. Variations in the diameter, height and platinum coating of the µ-probes may result in different characteristics among the 16 channels. To reduce the risk in collecting ECoG signals with these µ-probes, a conservative design is adopted for this microsystem. Each channel contains 6 × 6 µ-probes with embedded TSVs connected in shunt to increase the contact area and reduce the equivalent impedance of the µ-probe array, as shown in Fig. 23. The measured impedance of each channel at 1 kHz is tolerable for ECoG sensing. Additionally, the pitch between two µ-probes is 250 µm.
A. Experimental Results of Circuits
A 16-channel AFE with 4 ADCs is implemented in Die-1 and fabricated in the TSMC 0.18 μm 1P6M CMOS process. The total area of this die is 3 mm × 3 mm, and the chip microphotographs of the whole chip, a 1-channel AFE and one ADC are shown in Fig. 24. The active area of 1-channel AFE and 1 ADC are and , respectively. Fig. 25(a) presents the frequency response of a 1-channel AFE. The AFE provides a mid-band gain of 60.3 dB with a bandwidth from 3.52 Hz to 8.21 kHz. A CMRR of 85 dB and a power supply rejection ratio (PSRR) of 72 dB are achieved to suppress the common-mode noise from the human body and the power supply noise. Fig. 25(b) presents the CMRR and PSRR under different input voltage offsets; both remain above 50 dB within 50 mV input voltage offset. Owing to the high current gain and large area of the differential pairs in the first DDA stage, the total input-referred noise is 0.826 from 0.1 Hz to 13 kHz. The comparisons between the proposed AFE and other similar amplifiers [32]-[34] are listed in Table I. The NEF of the proposed AFE is only 2.72, close to the theoretical limit of 2.02 [33].
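To put the decibel figures into linear terms, the short sketch below converts the reported mid-band gain to a voltage ratio and derives the common-mode gain implied by the CMRR. It uses the standard relation CMRR = A_d / A_cm (a subtraction in dB); the function and constant names are illustrative only.

```python
def db_to_voltage_ratio(db):
    """Convert a voltage gain in dB to a linear voltage ratio."""
    return 10.0 ** (db / 20.0)


MIDBAND_GAIN_DB = 60.3  # reported mid-band gain
CMRR_DB = 85.0          # reported common-mode rejection ratio

# CMRR = A_d / A_cm, so in dB the common-mode gain is A_d(dB) - CMRR(dB).
common_mode_gain_db = MIDBAND_GAIN_DB - CMRR_DB

if __name__ == "__main__":
    print(f"differential gain: {db_to_voltage_ratio(MIDBAND_GAIN_DB):.1f} V/V")
    print(f"common-mode gain:  {common_mode_gain_db:.1f} dB "
          f"({db_to_voltage_ratio(common_mode_gain_db):.4f} V/V)")
```

The 60.3 dB gain corresponds to roughly 1035 V/V, while a common-mode disturbance sees a gain of about −24.7 dB, that is, it is attenuated rather than amplified.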
In the area-power-efficient 11-bit ADC, the total capacitance of the DAC for the SAR ADC is 7.45 pF and the unit capacitance is 0.15 pF. In the coarse tune, the capacitance of the two capacitors for the N-buffer and P-buffer is only 80 fF. The area of the coarse tune is . With the delay-line enhanced tuning, the size of the DAC in the SAR ADC can be reduced significantly. Compared with a conventional 11-bit SAR ADC of the same resolution, the area of the proposed hybrid ADC is reduced by 42%. The total power consumption of the hybrid ADC is 0.6 μW at 8 kS/s. The maximum sampling rate of this ADC is 1.2 MS/s. Fig. 26 presents the FFT spectrum with an 8 kS/s sampling frequency and a 180 Hz input frequency. The DNL and INL are also shown in Fig. 26. The DNL is 0.7/ 1.0 LSB and the INL is [37] under similar sampling rates and resolutions. To compare the effective area of different SAR ADCs, a normalization factor considering the FoM and area is applied. This normalization factor is the product of the FoM and the active area divided by to normalize the effect of different resolutions, where R represents the resolution of an ADC. Compared with other ADCs, the proposed hybrid ADC achieves the best power-area efficiency.
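The following sketch shows how such a comparison could be computed. It assumes the ADC power is 0.6 μW (the unit is inferred from the 676.3 μW system total) and uses the widely used Walden figure of merit, P / (2^ENOB · f_s), as a stand-in, since the exact FoM definition is not given in this section. The resolution-dependent divisor of the normalization factor is likewise not recoverable here, so it is left as a caller-supplied parameter; the ENOB value in the usage example is purely hypothetical.

```python
def energy_per_sample(power_w, fs_hz):
    """Energy spent per conversion, in joules."""
    return power_w / fs_hz


def walden_fom(power_w, fs_hz, enob_bits):
    """Walden figure of merit in joules per conversion step (assumed FoM form)."""
    return power_w / ((2.0 ** enob_bits) * fs_hz)


def area_normalized_fom(fom, active_area, resolution_divisor):
    """FoM x active area, divided by a resolution-dependent term.

    The divisor is supplied by the caller because its exact form is an
    unknown in this sketch.
    """
    return fom * active_area / resolution_divisor


if __name__ == "__main__":
    power_w = 0.6e-6   # 0.6 uW (unit assumed)
    fs_hz = 8e3        # 8 kS/s
    print(f"energy per sample: {energy_per_sample(power_w, fs_hz) * 1e12:.1f} pJ")
    # ENOB = 10.0 is a hypothetical placeholder, not a measured value.
    print(f"Walden FoM: {walden_fom(power_w, fs_hz, 10.0) * 1e15:.1f} fJ/conv-step")
```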
In this heterogeneously integrated bio-sensing microsystem, the 4-folded 16-channel DWTs are implemented on two Lattice MachXO2-1200 FPGA dies, fabricated in a 65 nm LP CMOS process, for early evaluation with both clock gating and power gating. Power gating and clock gating are used to reduce the leakage current and the clock power, respectively, for energy-limited bio-systems. Clock gating is applied during inactive iterations, and power gating is applied only during inactive sampling periods. The power gating is realized by the power-saving mode of these FPGAs. Fig. 27 presents the power consumption of different DWT architectures on these 2 FPGA dies at 1.2 V. Leakage power is also a critical issue in FPGAs. The dynamic power of the gating architecture is slightly larger than that of the interleaving architecture because it runs at five times the clock frequency (800 kHz). However, power gating reduces the leakage power substantially, from 223.2 μW to 91.1 μW. Compared with the interleaving architecture, the gating architecture is therefore the better solution for FPGA implementations.
Table III lists the specifications of this heterogeneously integrated bio-sensing microsystem and the comparison with the TSV-based double-side integration [13]. In the proposed microsystem, μ-probes with embedded TSVs are designed to collect the ECoG signals. Moreover, the number of μ-probes per channel is configurable and defined by the RDL links. For the AFE design, the off-chip capacitors are removed, and modified on-chip pseudo resistors are utilized to provide large resistance for DC-offset cancellation. Four area-power-efficient 11-bit ADCs are implemented to convert the ECoG signals to digital codes for feature extraction and feature classification. The power consumption of each ADC is only 0.6 μW at 8 kS/s. Furthermore, the area of the ADC is only 0.032 mm².
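As a quick check on the benefit of power gating, the relative leakage reduction follows directly from the two reported values. The helper name is illustrative; the figures are those quoted above for the two FPGA dies.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return 100.0 * (before - after) / before


LEAKAGE_WITHOUT_GATING = 223.2  # reported leakage before power gating
LEAKAGE_WITH_GATING = 91.1      # reported leakage with power gating

if __name__ == "__main__":
    saved = reduction_percent(LEAKAGE_WITHOUT_GATING, LEAKAGE_WITH_GATING)
    print(f"leakage reduced by {saved:.1f} %")
```

The reported values correspond to a leakage reduction of roughly 59%, which outweighs the slightly higher dynamic power of the gating architecture.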
In the proposed 2.5D bio-sensing microsystem, the 4-folded 16-channel DWTs and filters are implemented on two low-power FPGAs for feature extraction. The power consumption of each FPGA is only 105 μW. Additionally, the DWTs are analyzed by post-layout simulation in the TSMC 65 nm LP CMOS process, where the power consumption of the 4-folded 16-channel DWTs is only 27.6 μW. The MCU is utilized for system control and feature classification. In addition, the low-power μ-SPI is designed to transfer data between the 4 dies on the interposer. The overall power of this 2.5D heterogeneously integrated microsystem is only 676.3 μW.
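The per-block figures quoted in this section can be set against the system total with a small budget calculation. The sketch assumes that the 676.3 μW total is based on the FPGA implementation of the DWTs and that all per-block values are in μW; the remainder would then cover the blocks for which no individual figure is quoted here (AFE, MCU, μ-SPI). This attribution is an assumption, not a breakdown taken from Table III.

```python
TOTAL_UW = 676.3  # reported overall power of the microsystem

# (power per instance in uW, number of instances), as quoted in this section.
ITEMIZED = {
    "ADC": (0.6, 4),
    "FPGA (DWTs and filters)": (105.0, 2),
}


def power_budget(total_uw, items):
    """Return (itemized sum, remainder) for a dict of (power, count) entries."""
    itemized = sum(power * count for power, count in items.values())
    return itemized, total_uw - itemized


if __name__ == "__main__":
    itemized, remainder = power_budget(TOTAL_UW, ITEMIZED)
    print(f"itemized blocks: {itemized:.1f} uW "
          f"({100.0 * itemized / TOTAL_UW:.1f} % of total)")
    print(f"remaining blocks: {remainder:.1f} uW")
```

Under these assumptions the ADCs and FPGAs account for 212.4 μW, about 31% of the total, which leaves 463.9 μW for the remaining blocks.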
B. In-Vivo Test
For the in-vivo animal test, a 5 μm-thick layer of biocompatible parylene-C is deposited to isolate the individual μ-probes and to encapsulate the whole bio-sensing microsystem. The connectors are bonded to provide power and to capture the data on the μ-SPI. One adult Wistar rat, aged eight to nine months and weighing 800-900 g, was used for the in-vivo test. The rat was raised in a room with a daily cycle of 12 hours of light and 12 hours of dark. All surgical and experimental procedures were approved by the National Chiao-Tung University Animal Care Committee. The rat was placed in a standard stereotaxic apparatus and anesthetized with pentobarbital. The skin of the head near the sensorimotor cortex was opened to expose the skull, and a burr hole was drilled in the skull to expose the right brain. The dura and pia of the brain in the hole were removed, and the proposed microsystem was placed into the hole. Screws were put in the skull to fix the microsystem, and UV-hardened dental cement was used to close the opening in the skull and to prevent the microsystem from detaching. Fig. 28 shows the setup of the in-vivo test, the measured 16-channel ECoG signals from the neural acquisition circuits, and 2 ECoG feature extractions from the Symlet4 and Haar DWTs. These 16-channel ECoG signals were measured as spontaneous seizure spike-wave discharges while the rat was awake. CH-16 served as the reference channel for the in-vivo test. The size of this microsystem is designed and implemented for human brains, and the brain of the Wistar rat is too small for it. Thus, CH-01 and CH-09 contain only noise: because of the limited brain size of the rat, their 6 × 6 μ-probe arrays touch the skull. For the ECoG feature extractions from the configurable DWTs, the mother wavelets and time windows can be adjusted for different applications.
Haar and Symlet4 are selected for CH-06 and CH-03, respectively. The Haar lifting-based DWT shows a clear distinction between the high frequencies and the low frequencies, and the Symlet4 DWT can detect epilepsy by distinguishing the spikes in the low frequencies. Therefore, the configurable DWTs can be adapted to different applications and their corresponding feature extractions.
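To illustrate how a lifting-based Haar DWT separates high and low frequencies, the following minimal sketch performs one decomposition level with a predict step and an update step, together with its exact inverse. It is a textbook, unnormalized floating-point formulation and is not the fixed-point hardware architecture of the configurable DWT; the function names are illustrative.

```python
def haar_lifting_forward(samples):
    """One level of the (unnormalized) Haar DWT via lifting.

    Returns (approx, detail): the low-frequency and high-frequency halves.
    """
    if len(samples) % 2:
        raise ValueError("the number of samples must be even")
    even, odd = samples[0::2], samples[1::2]
    # Predict: each odd sample is predicted by its even neighbour.
    detail = [o - e for e, o in zip(even, odd)]
    # Update: correct the even samples so that approx holds the pair averages.
    approx = [e + d / 2 for e, d in zip(even, detail)]
    return approx, detail


def haar_lifting_inverse(approx, detail):
    """Invert haar_lifting_forward by undoing the update, then the predict."""
    even = [a - d / 2 for a, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    samples = []
    for e, o in zip(even, odd):
        samples.extend((e, o))
    return samples


if __name__ == "__main__":
    signal = [2, 4, 6, 8]
    approx, detail = haar_lifting_forward(signal)
    print(approx, detail)
    print(haar_lifting_inverse(approx, detail))
```

A sharp spike produces a large detail coefficient while slowly varying activity stays in the approximation coefficients. Each lifting step needs only additions and a halving, which is one reason lifting structures suit low-power hardware.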
VIII. CONCLUSION
Heterogeneously integrated and miniaturized neural sensing microsystems that accurately capture and classify signals are crucial for brain function investigation. In this paper, a 2.5D heterogeneously integrated bio-sensing microsystem with μ-probes is presented for neural sensing applications. This microsystem is composed of μ-probes with embedded TSVs, 4 dies and 1 interposer. The 4 dies are designed for neural-signal acquisition, feature extraction and classification via 16-channel AFE circuits, 4 ADCs, 4-folded configurable DWTs, filters, and an MCU. Additionally, the on-interposer bus, μ-SPI, is designed for transferring data on the interposer. The overall power of this microsystem is only 676.3 μW for 16-channel ECoG sensing. Moreover, the in-vivo test demonstrated the proposed 2.5D heterogeneously integrated bio-sensing microsystem.
Currently, he is an Assistant Research Fellow at NCTU. His research interests focus on low power/low cost TSV 2.5D/3D integrations, brain neural sensing microsystems, embedded memory design and low power SoC, and SiP design with particular emphasis on on-chip interconnection networks and memory sub-systems. He has authored or coauthored more than 40 technical papers in renowned international journals and conferences and holds over 10 patents.
Shang-Lin Wu received the M.S. degree in electronics engineering from National Tsing Hua University, Hsinchu, Taiwan.
Currently, he is working toward the Ph.D. degree at the Institute of Electronics, National Chiao Tung University, Hsinchu, Taiwan. His research interests focus on digital memory system design and low-power, low-noise analog circuit design for biomedical applications. Currently, he is working toward the Ph.D. degree in electrical and computer engineering at the University of Texas at Austin, Austin, TX, USA. His research interests include low-power ADC design for biomedical acquisition microsystems.
Yu-Chieh Huang
Tang-Hsuan Wang received the B.Sc. degree from National Chung Cheng University, Minxiong, Taiwan, and the M.Sc. degree from the Department of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, in 2011 and 2013, respectively.
Currently, she is a Design Engineer at MediaTek, Hsinchu, Taiwan.
Yu-Rou Lin received the B.Sc. degree from National Chung Cheng University, Minxiong, Taiwan, and the M.Sc. degree from the Department of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, in 2011 and 2013, respectively.
Currently, she is a Design Engineer at MStar Semiconductor, Hsinchu, Taiwan.
Chuan-An Cheng received the B.S. degree in physics from National Chung Hsing University, Taichung, Taiwan, in 2009. Currently, he is working toward the Ph.D. degree at the Institute of Electronics, National Chiao Tung University, Hsinchu, Taiwan. His research interests focus on 2.5D electronic packaging, TSV bonding, wafer thinning, heterogeneous integration, and 3-D integrated circuit technologies.
