In this work, we report on an unprecedented design where digital, analog, and MEMS technologies are combined to realize a general-purpose single-chip CMOS microsystem. The convergence of these technologies has enabled the development of a low-power, portable microinstrument ideally suited for controlling environmental and bio-implantable sensors.
A 16-Bit Mixed-Signal Microsystem with Integrated CMOS-MEMS Clock Reference

INTRODUCTION
Microprocessors and microcontrollers have become ubiquitous in electronic applications. However, the systems supported by these devices have become exceedingly complex in terms of functionality and the quantity of peripheral components. For example, a typical end-to-end system architecture might include sensors and actuators, signal conditioning and data conversion circuitry, a microprocessor, a wireless interface, and supporting electronics such as the system clock reference. In recent years, the boundaries of these systems have stretched to include additional peripheral components, marking the advent of system-on-chip (SoC) and microsystem development. The motivation to develop such devices has been to increase functionality and performance while reducing system cost, integration complexity, and power dissipation. One of the challenges that hinders the development of more advanced systems that expand these boundaries is that not all peripheral components are CMOS-compatible, while microcontrollers are manufactured almost exclusively in CMOS technology.
In this work, we have succeeded in expanding the system boundaries further while maintaining focus on key design constraints including power, size, and performance. Specifically, a complete analog front end (AFE) has been developed that supports data conversion and signal conditioning while operating at 900mV. Additionally, with the use of MEMS technology, clock generation has been merged on-chip to reduce overall system complexity by eliminating an off-chip crystal reference. Lastly, a highly efficient low-power instruction set has been combined with architectural power-reduction techniques to ensure minimal power dissipation in the digital core. In the sections that follow, we report on the development and performance of this microsystem, which has been fabricated in the 0.18µm mixed-mode CMOS process available from Taiwan Semiconductor Manufacturing Company (TSMC). The microsystem is currently under test.
MICROSYSTEM ARCHITECTURE
This microsystem was developed with a focus on sensor control applications, but retains enough flexibility for a variety of other general-purpose uses. The microsystem shown in Figure 1 comprises three major subsystems: a digital core, an analog front end, and an on-chip clock reference. The core consists of an efficient 16-bit, three-stage pipeline with 64KB of on-chip SRAM. A timer and multiple serial interfaces were included as necessary microcontroller peripherals for communication with external devices. The AFE conditions an analog input signal, typically from sensors, and performs analog-to-digital conversion before sending the digital results to the core. The clock reference supplies the digital core and AFE with a low-jitter, frequency-selectable, tunable reference while requiring less power than an off-chip crystal reference. Due to power constraints, the entire microsystem would ideally operate at 900mV. However, currently only the AFE has been designed for this extremely low supply. The digital core supply is limited to 1.8V because the SRAM and digital libraries are characterized at this voltage. In the near future, we hope to migrate both the digital core and the clock reference to the same 900mV supply.
ANATOMY OF THE MICROSYSTEM
Microcontroller
Core
A 16-bit load-store architecture with dual-operand register-to-register instructions was chosen to satisfy the power and performance requirements of the microsystem. The 16-bit datapath was selected to reduce the complexity and power consumption of the core while providing adequate precision in calculations, given that the sensors controlled by this chip require 12 bits of resolution. The datapath pipeline consists of three stages: fetch, decode, and execute. Typically in sensor applications, processing throughput requirements are minimal and power dissipation is a key design constraint; therefore, clock frequencies should be kept as low as possible. A three-stage pipeline was chosen to obtain acceptable performance at low clock frequencies.
A 24-bit address space for unified data and instruction memory satisfies the potentially large storage requirements of remote sensor systems. The 16MB of supported memory is byte addressable and provides sufficient storage for program, data, and memory-mapped peripheral components. The current implementation of the core has 64KB of on-chip SRAM with an off-chip memory bus interface to allow full utilization of the remaining memory space.
The machine contains sixteen general-purpose data registers and four address registers that are evenly split into two access windows. The windowing scheme permits instructions to be encoded in 16 bits by reducing the number of bits required to encode register operands. Three additional non-windowed address registers (a stack pointer, frame pointer, and link register) are used by the compiler for subroutine and stack support. The hardware supports maskable interrupts, which are prioritized into up to 64 different levels.
Instruction Set Architecture
The ISA includes seventy-seven instructions and eight addressing modes. The addition of bit-manipulation instructions allows bits in memory to be set or reset in two cycles via one instruction, instead of the separate read, mask, and store instructions that would otherwise be required. Two-word instructions were necessary to support 24-bit absolute addressing modes with 16-bit instructions. Address update modes provide for easy manipulation of the addresses stored in the address registers by allowing both pre- and post-update operations. Load and store instructions are available with or without update and in word or byte mode. Standard arithmetic and logical instructions are included along with support for multi-word add and subtract operations. The multiply and divide instructions take multiple cycles to complete and are available with single- or two-word results and source operands. For hardware simplicity, they are implemented by shifting and accumulating results using the ALU instead of having dedicated functional units. Subroutine and conditional change-of-flow instructions are included along with special test instructions that are described in Section 5.
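The shift-and-accumulate multiplication mentioned above can be illustrated with a short sketch. This is a generic textbook formulation and not the core's actual microarchitecture: the function name, the unsigned operand handling, and the one-bit-per-iteration loop are assumptions made for illustration only.

```python
def shift_add_multiply(a: int, b: int, width: int = 16) -> int:
    """Unsigned multiply built only from shifts and adds.

    One multiplier bit is examined per iteration, loosely mirroring a
    multi-cycle multiply that reuses the ALU adder (hypothetical model).
    """
    mask = (1 << width) - 1
    multiplicand = a & mask
    multiplier = b & mask
    accumulator = 0
    for _ in range(width):
        if multiplier & 1:            # add the shifted multiplicand when the bit is set
            accumulator += multiplicand
        multiplicand <<= 1            # shift left for the next bit position
        multiplier >>= 1              # move to the next multiplier bit
    return accumulator & ((1 << (2 * width)) - 1)  # two-word (2 * width) result


product = shift_add_multiply(1234, 5678)
print(hex(product))
```

With 16-bit operands the loop runs sixteen times, which is consistent with the statement that multiplication takes multiple cycles; the actual cycle count of the core is not given in the text.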
Memory Architecture
By subdividing the memory structure into different blocks, at the cost of extra area for duplicated sense amps and other peripheral circuitry, a memory structure was obtained that dissipates less power than a monolithic memory. Using the Artisan SRAM compiler for TSMC's 0.18µm process, we found that the optimal configuration for 64KB of on-chip memory was eight banks of 8KB each. This topology allows all single-port memory banks that are not being accessed to be disabled on a cycle-by-cycle basis. It also allows one instruction access and one data access to different banks of memory on the same cycle without stalling the pipeline. A dedicated memory management unit in the core routes data from the correct bank to the requesting unit and disables inactive banks of memory. The memory speed is sufficient to allow all accesses to complete within one cycle without the need for caches. As a power-saving feature, a modified loop cache, as shown in [1], was added to the chip. The loop cache is not a true cache, but rather a small, low-power, 512-byte memory that is pre-loaded with the most commonly executed instructions (typically loop code) or frequently accessed data as determined through compiler profiling. This greatly reduces the power consumption of the controller, since embedded controllers typically run the same software throughout their lifetime and much of that time is spent executing loop code. Further studies will be conducted to verify the optimal size of the loop cache once the application software has been completed.
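The per-cycle bank enabling described above can be sketched as follows. The sketch assumes that the eight 8KB banks are mapped contiguously in the low 64KB of the address space and that an instruction fetch and a data access to the same single-port bank would conflict; neither detail is stated in the text, and the function names are hypothetical.

```python
BANK_SIZE = 8 * 1024   # 8KB per bank (from the text)
NUM_BANKS = 8          # eight banks, 64KB total (from the text)


def bank_of(addr: int) -> int:
    """Return the bank index for a byte address (contiguous mapping is assumed)."""
    if not 0 <= addr < BANK_SIZE * NUM_BANKS:
        raise ValueError("address is outside the on-chip SRAM")
    return addr // BANK_SIZE


def bank_enables(instr_addr: int, data_addr=None):
    """Return (enables, conflict) for one cycle.

    enables[i] is True only for banks accessed this cycle; every other bank
    stays disabled. conflict is True when the fetch and the data access
    target the same single-port bank.
    """
    enables = [False] * NUM_BANKS
    enables[bank_of(instr_addr)] = True
    conflict = False
    if data_addr is not None:
        data_bank = bank_of(data_addr)
        conflict = enables[data_bank]
        enables[data_bank] = True
    return enables, conflict


enables, conflict = bank_enables(0x0000, 0xE000)
print(enables, conflict)
```

In this model at most two of the eight banks are active in any cycle, which is the source of the power saving relative to a monolithic 64KB array.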
Peripherals
Supported peripherals include two Universal Synchronous Asynchronous Receive/Transmit (USART) units, a 12-bit general-purpose parallel output port, a Serial Peripheral Interface (SPI) unit and a multifunction programmable timer. One USART is dedicated to communication with external components; the other is for general use and for loading programs into memory. The SPI is ideal for communicating with multiple sensors that share a single sensor bus, as presented in [2]. The timer is capable of timing both internal and external events. With these peripheral components easily accessible through software, communication with the microsystem is both versatile and efficient.
Functional Verification
An extensive environment was developed using Perl scripts to facilitate functional verification of the digital core. Focused assembly language test cases were used to test the basic operation of each instruction in the ISA, as well as anticipated corner cases. Additional test cases were written with the sole purpose of verifying interrupts and the timing of peripherals such as the USART, SPI, and timers. A random assembly code generator was developed and used to generate millions of lines of random test cases. These detected functional bugs that might have been missed in the focused test cases, particularly any unexpected interdependencies between instructions as they progressed down the pipeline. The same verification process was repeated after logic synthesis in Synopsys Design Compiler and again after Automatic Place and Route (APR) in Cadence Silicon Ensemble, exposing functional bugs that might have been introduced by the synthesis/APR tools.
Performance Estimation
The digital core consists of 120,000 transistors, excluding SRAMs. Nanosim estimates that the core will dissipate 6.75mW at 62.5MHz. The switching vectors used by Nanosim were obtained by running assembly language test cases on the Verilog-XL simulator and capturing the resulting switching activity. The power associated with a read or write to one of the 8KB SRAM banks is significant (specific numbers are proprietary to Artisan), while the leakage and standby power are very small for this process. RAM power dominates the digital logic power. However, due to design time constraints, we were restricted to using a standard RAM generator. In the future, power-intelligent RAMs will be designed to provide much lower power dissipation. Another improvement would be to use non-volatile FlashROM for instruction memory.
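As a quick sanity check on the estimate above, the simulated power and clock frequency can be converted into an average energy per clock cycle. The figure applies to the digital logic only, since the quoted 6.75mW appears to exclude SRAM access power; the function name is hypothetical.

```python
def energy_per_cycle(power_w: float, clock_hz: float) -> float:
    """Average energy per clock cycle in joules (E = P / f)."""
    return power_w / clock_hz


# 6.75mW at 62.5MHz, both taken from the Nanosim estimate in the text.
core_energy = energy_per_cycle(6.75e-3, 62.5e6)
print(f"{core_energy * 1e12:.0f} pJ per cycle")
```

This works out to roughly 108pJ per cycle for the core logic.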
Analog Front End
Overview
The analog front end (AFE) consists of three major components: buffers, a programmable gain amplifier (PGA), and a second-order sigma-delta (ΣΔ) modulator. Each component has unique features that will be highlighted in this section. General design issues are discussed first to motivate the design decisions.
Design Considerations
The AFE has been designed to operate at supply voltages ranging from 900mV to 1.8V. In this work, we demonstrate the feasibility of a 900mV system using current technologies. For the targeted sensor applications, a 100Hz Nyquist conversion speed with a resolution of 12 bits is sufficient. The interface to the sensors must be high impedance with near-zero current draw. With these requirements, many issues in architecture and circuit design arise that would not ordinarily be encountered.
The AFE was realized using a differential architecture. This offers the advantages of increased substrate noise rejection, reduced effects of charge injection, and most significantly, increased signal swing. To understand the importance of increased signal swing, consider a single-transistor abstraction of an amplifier in a switched-capacitor circuit. It can be shown that power is proportional to dynamic range, DR, and inversely proportional to the supply voltage, VDD [3]. This proportionality clearly shows the importance of maximizing signal swing in a reduced supply voltage environment. Therefore, most circuits comprising the AFE must support rail-to-rail input and output stages.
A switched-capacitor (SC) approach was employed because it offers superior matching, inherent linearity, and implementation efficiency. However, if VDD is less than the sum of the threshold voltages of the n- and p-type devices, a region exists in which the switch does not conduct. When operating at 900mV this poor conduction situation arises, thus requiring unique circuit design techniques.
Buffers and PGA
In a data acquisition system, such as an AFE, a signal conditioning stage is usually required. Typically, PGAs are used to adjust the signal amplitude level to the maximum input level prior to analog-to-digital conversion. The PGA, shown in Figure 2, consists of two main components: a fully differential switched operational amplifier and a variable SC feedback network. In [4], the switched-opamp technique was shown to enable low-voltage SC circuits. In this work, the switched-opamp technique is applied to the PGA, which permits removal of the input sampling switches of the ADC, thereby enhancing the dynamic range. The gain is programmed by switching capacitors in parallel with the input sampling capacitor of the feedback network. During the sampling phase, φ1 is ON and φ2 is OFF; the opamp is enabled and the output voltage is proportional to C1/C2. During the second phase, φ1 is OFF and φ2 is ON; the opamp is disabled and the output of the PGA is pulled to ground. Typically, the variable sampling capacitor, C1, is implemented as multiple capacitors selectively connected in parallel with switches. Also, most switches in this circuit are not required to pass a mid-rail voltage, thus making this architecture ideal for low-voltage applications.
The fully differential operational amplifier that was used to realize the PGA is a class AB input stage, class AB rail-to-rail output stage amplifier. The performance characteristics are summarized in Table I. Resistive common-mode feedback was used because it does not limit the output swing. Devices biased in weak inversion were used to enhance operation at low-voltage levels.
Table I. Operational amplifier performance characteristics
Design Parameter              Value
Unity Gain Bandwidth (GBW)    3.6MHz
Output Swing                  20mV to Rail
Power                         40µW
[Table II. ADC performance characteristics. Recovered row label: Effective Number of Bits; the values are missing from the source.]
As noted previously, the switches at the input of the PGA were implemented using a switched operational-amplifier technique, which provided two benefits. First, the switched operational amplifier is capable of passing mid-rail signals, and second, it provides a high-impedance buffer between the PGA and the input.
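The gain programming of the PGA, where capacitors are switched in parallel with the input sampling capacitor and the output is proportional to C1/C2, can be sketched numerically. The capacitor values, the binary weighting, and the function name below are hypothetical; the paper does not state the actual capacitor sizes or the available gain steps.

```python
def pga_gain(c_base, c_switched, select, c2):
    """Ideal switched-capacitor gain C1/C2.

    C1 is the fixed sampling capacitor plus every selected parallel capacitor.
    All capacitances must share one unit (for example pF).
    """
    if len(c_switched) != len(select):
        raise ValueError("one select bit is required per switched capacitor")
    c1 = c_base + sum(c for c, on in zip(c_switched, select) if on)
    return c1 / c2


# Hypothetical binary-weighted bank: 1pF base, 1/2/4pF switched, 1pF for C2.
for code in range(8):
    bits = [bool(code & (1 << i)) for i in range(3)]
    print(code, pga_gain(1.0, [1.0, 2.0, 4.0], bits, 1.0))
```

Under these assumed values the three select bits step the ideal gain from 1 to 8 in unit increments; a real design would choose the capacitor ratios to match the required gain range.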
Second-Order ΣΔ Modulator
A second-order ΣΔ modulator was chosen because of a superior trade-off between resolution and bandwidth compared with oversampling alone. The ΣΔ topology utilizes feedback to provide noise shaping, thus pushing the quantization noise produced by the 1-bit ADC to higher out-of-band frequencies. The feedback also reduces the performance requirements of the analog circuits, making this architecture suitable for deep-submicron digital processes.
Since the ΣΔ modulator represents the digital data in a pulse-code-modulated (PCM) format, digital filtering is needed. The filtering, which is done through software on the microcontroller, removes the out-of-band noise and recovers the digital data. A summary of the performance characteristics of the ADC is shown in Table II. These results have been verified through SPICE simulation.
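The noise-shaping loop and the software filtering described above can be illustrated with a behavioral sketch. This is a textbook second-order loop with unity coefficients and a plain averaging (boxcar) decimator; the paper specifies neither the modulator coefficients nor the filter that runs on the microcontroller, so every name and parameter below is an assumption.

```python
def sigma_delta_2nd_order(samples):
    """Behavioral second-order modulator with a 1-bit quantizer.

    Inputs are expected in the range (-1, 1); outputs are +1.0 or -1.0.
    """
    int1 = 0.0   # first integrator state
    int2 = 0.0   # second integrator state
    bits = []
    for x in samples:
        y = 1.0 if int2 >= 0.0 else -1.0   # 1-bit quantizer decision
        int1 += x - y                       # first integrator with feedback
        int2 += int1 - y                    # second integrator with feedback
        bits.append(y)
    return bits


def boxcar_decimate(bits, ratio):
    """Average non-overlapping blocks of the bitstream (simplest low-pass filter)."""
    return [sum(bits[i:i + ratio]) / ratio
            for i in range(0, len(bits) - ratio + 1, ratio)]


dc_input = 0.3
stream = sigma_delta_2nd_order([dc_input] * 4096)
estimates = boxcar_decimate(stream, 1024)
print(estimates)
```

Each decimated value should land close to the DC input of 0.3. A boxcar average is used here only for brevity; a higher-order filter would normally be preferred for a second-order loop because it suppresses the shaped out-of-band noise more effectively.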
CMOS-MEMS Clock Reference
Overview
The clock reference has been built around a high-frequency MEMS-LC oscillator. From this oscillator, the signal is squared and divided by a series of flip-flops. The architecture was designed to provide multiple clock frequencies, each with a 50% duty cycle. The primary motivation for using MEMS technology is simply that the quality factor (Q-factor) of these components is higher than that of alternative integrated technologies, such as MOS varactors or simple spiral inductors.
Devices
The MEMS technology in this work includes a suspended inductor (L) and a micromechanical varactor (C). Together, these components comprise a high quality factor, or high-Q, LC-tank as a precision reference for the clock oscillator. A significant focus of this work has been to develop these components in a manner that is compatible with commercial CMOS manufacturing processes. Specifically, a simple maskless post-process has been developed to release the micromachined components without damaging the active electronics. Moreover, the components have been designed such that the structural layers of each device are defined by the material layers in the standard CMOS process. 
3.8mW/4.1mW 69ps/48ps
The micromechanical varactor in this work is of a parallel plate topology similar to that presented in [5]. The device is constructed by mechanically suspending a metal top plate in air above a fixed metal bottom plate. A mechanical suspension network provides support for the top plate. The device presents a nominal capacitance set by the device geometry and the gap between the plates, x0. When a positive DC voltage is applied across the plates, an electrostatic force will deflect the moveable top plate a distance x in the vertical direction, thus modulating the capacitance. This variable capacitance is described by C = εA/(x0 - x), where ε is the permittivity of air, A is the plate overlap area, x0 is the nominal distance between the plates, and x is the displacement forced by the DC tuning voltage.
The varactor was designed and modeled with theoretical mechanical analysis and then verified with the finite element analysis (FEA) tool, Coventorware. Figure 3 shows an image of an electrostatic FEA of the device. The color contours indicate relative displacement, x, from the nominal position, x0. The figure represents the results of a coupled electromechanical simulation where top plate displacement is forced by an applied voltage. Performance of the device is dependent on material and device geometry. Specifically, the mechanical spring constant associated with the mechanical suspension network will determine the tuning voltage response. Moreover, the device geometry determines the achievable capacitance. The varactor was designed to realize a nominal capacitance near 1pF and to respond to a tuning voltage ranging from 0 to 1.2V.
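The parallel-plate relation C = εA/(x0 - x) can be evaluated directly to see how the capacitance grows as the top plate deflects. The gap and plate area below are hypothetical values chosen only so that the undeflected capacitance equals the 1pF nominal value quoted in the text; the actual device geometry is not given. The permittivity of air is approximated by that of free space.

```python
EPSILON_0 = 8.854e-12  # F/m, used as an approximation for the permittivity of air


def varactor_capacitance(area_m2, gap_m, displacement_m):
    """Parallel-plate capacitance C = eps * A / (x0 - x), in farads."""
    if not 0.0 <= displacement_m < gap_m:
        raise ValueError("displacement must satisfy 0 <= x < x0")
    return EPSILON_0 * area_m2 / (gap_m - displacement_m)


GAP = 1.0e-6                        # hypothetical nominal gap x0 of 1 micrometer
AREA = 1.0e-12 * GAP / EPSILON_0    # plate area that yields 1pF at x = 0

for fraction in (0.0, 0.1, 0.2, 0.3):
    c = varactor_capacitance(AREA, GAP, fraction * GAP)
    print(f"x = {fraction:.1f} * x0 -> C = {c * 1e12:.3f} pF")
```

The mapping from tuning voltage to displacement depends on the spring constant of the suspension, which the text does not quantify, so the sketch sweeps displacement rather than voltage.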
Coupled with the MEMS varactor is a suspended inductor. The device is also fabricated in standard CMOS and is suspended above the substrate by anchors that are defined by the standard process. The dielectric material around the device is removed to increase the quality factor and reduce capacitive coupling between the inductor and the substrate, thus minimizing energy loss from eddy currents in the substrate. The device is fabricated in the topmost, thickest metal layer to minimize loss due to series resistance. Using previously reported inductor modeling techniques [6], a theoretical model was derived for the device and again the Coventorware environment was employed to determine the Q-factor and the mechanical stability of the device.
Oscillator and Clock Generation Circuitry
In this work we have designed and enhanced a 2GHz doubly-balanced cross-coupled CMOS-LC oscillator, shown in Figure 4. For matched devices, this topology presents the tank with a negative resistance of -2/gm and thus cancels the loss in the tank. The total capacitance was realized by combining two varactors in parallel with the suspended inductor in order to manage the dimensions of the device. These LC devices were modeled with parasitic components that were extracted from simulation of the inductor and varactor structures. Bypass capacitors were required to isolate the varactor tuning voltage, Vtune, from the remainder of the circuit. The design has been completed to realize a loop gain of at least 5 at a phase shift of 0 degrees in order to satisfy the Barkhausen startup criterion with adequate margin.
Table III. Clock reference performance (recovered rows)
Design Parameter              Value
Duty Cycle (High/Low)         44/56
Period Jitter (1GHz)          8.5fs
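The relationship between the tank components and the oscillation frequency can be sketched with the standard LC resonance formula f0 = 1/(2π√(LC)). The inductance is not stated in the text; the calculation below assumes that the two nominally 1pF varactors in parallel give about 2pF of total tank capacitance and ignores all parasitics, so the resulting inductance is illustrative only.

```python
import math


def resonant_frequency(inductance_h, capacitance_f):
    """Ideal LC tank resonance f0 = 1 / (2 * pi * sqrt(L * C)), in hertz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def required_inductance(frequency_hz, capacitance_f):
    """Inductance that resonates with C at the given frequency, in henries."""
    return 1.0 / ((2.0 * math.pi * frequency_hz) ** 2 * capacitance_f)


TANK_C = 2.0e-12    # assumed: two 1pF varactors in parallel, no parasitics
TARGET_F = 2.0e9    # 2GHz oscillator frequency from the text

inductance = required_inductance(TARGET_F, TANK_C)
print(f"L = {inductance * 1e9:.2f} nH")
print(f"f0 = {resonant_frequency(inductance, TANK_C) / 1e9:.3f} GHz")
```

Under these assumptions the inductor would need to be a little over 3nH. Parasitic capacitance from the cross-coupled devices and interconnect would lower the required value in practice.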
The clock oscillator provides a differential output signal that drives a differential-to-single-ended converting amplifier with unity gain. A series of flip-flops then divide the signal to the appropriate frequency. Six different clock frequencies are synthesized (1GHz to 31.25MHz) and each flip-flop output is gated by an enable signal in order to manage power dissipation. The clock signals are generated from high to low frequency in order to overcome the jitter accumulation inherent in phase-locked loop systems. In such systems, clock frequencies are generated by multiplying a low-frequency reference signal to a high frequency. In this system, top-down frequency synthesis is employed by dividing a high-frequency reference to lower frequencies, all with improved jitter.
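The top-down frequency synthesis can be sketched as a chain of divide-by-two stages. The sketch assumes that each flip-flop stage halves the frequency, which is consistent with six outputs spanning 1GHz to 31.25MHz from a 2GHz oscillator, and it models the enable gating simply by reporting a disabled output as `None`. The function names and the gating model are hypothetical.

```python
def divider_chain(f_osc_hz, enables):
    """Return the output frequency of each divide-by-two stage.

    A stage whose enable is False still feeds the next stage in this model,
    but its gated output is reported as None.
    """
    outputs = []
    frequency = f_osc_hz
    for enabled in enables:
        frequency /= 2.0
        outputs.append(frequency if enabled else None)
    return outputs


def divide_by_two(wave):
    """Ideal toggle flip-flop: the output flips on every rising input edge."""
    out, state, previous = [], 0, 0
    for sample in wave:
        if sample and not previous:
            state ^= 1
        out.append(state)
        previous = sample
    return out


clocks = divider_chain(2.0e9, [True] * 6)
print([f / 1e6 for f in clocks])

# A 25% duty-cycle input still produces a 50% duty-cycle divided output.
divided = divide_by_two([1, 0, 0, 0] * 8)
print(sum(divided) / len(divided))
```

Because an ideal toggle stage changes state only on rising edges, its output duty cycle is 50% regardless of the input duty cycle, which is why a divider chain is a natural way to obtain the 50% duty cycle targeted by the architecture. Real circuits deviate from this ideal because of unequal rise and fall delays.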
Performance
Worst-case frequency stability was analyzed only for the 1GHz clock signal since it is the least stable signal in the system. Results were acquired in both the frequency and time domains using the Cadence SpectreRF and Agilent ADS environments, respectively. Time-domain and DC performance were determined using the Cadence SpectreRF environment. The clock reference supply voltage is 1.8V in order to interface to the digital core. However, the supply can be scaled easily to 900mV because start-up is based on current, not voltage, and the current source for the reference has been designed to be power supply independent over a broad range. A summary of these performance parameters is given in Table III.
DESIGN METHODOLOGY
Building this microsystem involved several challenges as this design includes not only a union of the analog and digital circuit domains, but also the mechanical and electrical domains. Although no single design framework met all of the design requirements, we found that the Cadence AMS environment is well-suited for system-level development of microsystems technology. Additional tools used in this work included Spectre for analog subsystem and transistor-level design, Coventorware for FEA of MEMS components, Synopsys Design Compiler for digital synthesis, Cadence Silicon Ensemble for APR, and Mentor Graphics Calibre for DRC and LVS. Results from the subsystem designs have been reported in previous sections.
A top-down design methodology was employed and is summarized in Figure 5. The system specification was developed with the C programming language and Verilog-AMS. MEMS and analog components were modeled in Verilog-A, while the microprocessor core and digital peripherals were modeled in Verilog. From the system model, a natural partition of top-down subsystem design activities followed. Each block was specified with an abstraction for the hardware. In parallel with behavioral verification of the digital section, the blocks in the mechanical and analog domains were developed and performance metrics were determined. Updated Verilog-A was developed to model performance from FEA in the mechanical domain, while device-level design and analysis using Spectre helped achieve the analog specification. The digital electronics were developed such that a complete behavioral description of the hardware was realized. At this point, the first cross-domain verification of the system was performed. Once the hardware description language (HDL) from each domain had been updated with the simulated performance, verification of the system model was trivial. In the Cadence AMS environment, HDL and primitives may be mixed and critical subsystem performance metrics can be determined quickly with a detailed model for the subsystem and an abstract model for the remainder of the system. This was particularly significant when considering analog and MEMS device-level performance that required digital programming which was described only by HDL.
A system-wide simulation was completed and iteration in the mechanical and analog design activities continued, dependent upon system performance. This initial cross-domain simulation reduced the design cycle on three fronts. First, design effort had not been expended synthesizing the digital electronics. Second, iteration in the design of the MEMS and analog circuits occurred early in the design flow. Last, the system simulation was fast, as it was described by behavioral HDL, not a device-level netlist.
System development continued with a typical physical design methodology. The digital sections were synthesized, placed, and routed while the mechanical and analog sections were custom designed. Timing information from the APR tool and back-annotated parasitics were used to achieve timing closure in all domains, after which a second cross-domain simulation was executed for system verification based on physical design. Again, the HDL for the subsystems was updated for accurate system simulation. Physical design iteration continued until timing closure was achieved for the system. The domain-specific design activities concluded with the delivery of a hard macro. Final system development activities included system-level APR, physical design verification (DRC, LVS), layout parasitic extraction (LPE), and back-annotation. A final cross-domain verification was completed once parasitic extraction data for the interconnect between macros was determined. APR iteration was also necessary. Finally, the design was transferred to TSMC and fabricated. A die micrograph of the microsystem is shown in Figure 6. Figure 7 shows an electron micrograph of the coupled MEMS varactors.
TEST
Many testability features have been incorporated into the digital core to facilitate post-fabrication functional testing. Most notable is the Test Interface, which is a modified USART that provides read or write access to any one of the internal system registers through specialized test instructions. The test interface supports the injection of an instruction into the pipeline and then single-stepping that instruction through the pipeline. Additionally, the test interface provides the benefit of testability using little more than a notebook computer with a standard RS-232 serial port. This will be especially valuable for verifying sensor systems in the field.
At-speed testing of the fabricated digital core will be performed using our in-house HP82000 digital IC tester. The same test cases that were used in simulation have been converted into test vectors and will be run through the HP82000 tester. Initial tests will be loaded into the tester memory and fetched through the chip's external memory bus. A special feature was integrated into the boot ROM so that when an external interrupt is asserted during boot-up, the external interrupt handler in the boot ROM immediately jumps to an address in external memory (in this case it will be mapped to the tester memory). This bypasses the remainder of the boot ROM and most importantly, allows testing of the chip even if the on-chip memory is not functioning. An additional testability feature includes hardware breakpoint modes to assist in stopping the microcontroller accurately.
Testing of the AFE will be done through external probe points. Visibility and overdrive pads were added at the outputs of the buffers, PGA, and the ΣΔ modulator. This allows for an input signal to be traced through the AFE into the microcontroller at crucial points. Also, all current and voltage reference points were made externally visible for monitoring. The output of the ADC, in PCM format, is accessible externally for recording and subsequent signal processing in Matlab. The internal clock reference signal has been routed to the padframe. Additionally, an external clock input has been included for diagnostics. The microcontroller contains a programmable multiplexer that selects between the on-chip reference and the external clock. The on-chip clock frequency and time-domain stability will be measured using external instrumentation.
CONCLUSION
As the field of microsystems matures, the boundaries of system integration will continue to be challenged. The focus of future work will include achieving the goals of reduced power dissipation, cost and size, while realizing increased integration and functionality. In this work we have taken a significant and unprecedented step toward reaching these goals. We have demonstrated a low-power, compact, fully integrated microsystem that was developed in TSMC's 0.18µm mixed-mode CMOS process. The presented design merges not only the analog and digital domains, but also the electrical and mechanical domains. The system contains a 16-bit microcontroller, a low-voltage analog front end, and a MEMS clock reference, making it ideal for a variety of microinstrumentation applications, particularly micro-sensor control. Rigorous testing is in progress and results are expected to be reported in 2003.
ACKNOWLEDGEMENTS
