36 research outputs found
Developing Trustworthy Hardware with Security-Driven Design and Verification
Over the past several decades, computing hardware has evolved to become smaller, yet more performant and energy-efficient. Unfortunately these advancements have come at a cost of increased complexity, both physically and functionally. Physically, the nanometer-scale transistors used to construct Integrated Circuits (ICs), have become astronomically expensive to fabricate. Functionally, ICs have become increasingly dense and feature rich to optimize application-specific tasks. To cope with these trends, IC designers outsource both fabrication and portions of Register-Transfer Level (RTL) design. Outsourcing, combined with the increased complexity of modern ICs, presents a security risk: we must trust our ICs have been designed and fabricated to specification, i.e., they do not contain any hardware Trojans.
Working in a bottom-up fashion, I initially study the threat of outsourcing fabrication. While prior work demonstrates fabrication-time attacks (modifications) on IC layouts, it is unclear what makes a layout vulnerable to attack. To answer this, in my IC Attack Surface (ICAS) work, I develop a framework that quantifies the security of IC layouts. Using ICAS, I show that modern ICs leave a plethora of both placement and routing resources available for attackers to exploit. Next, to plug these gaps, I construct the first routing-centric defense (T-TER) against fabrication-time Trojans. T-TER wraps security-critical interconnects in IC layouts with tamper-evident guard wires to prevent foundry-side attackers from modifying a design.
After hardening layouts against fabrication-time attacks, outsourced designs become the most critical threat. To address this, I develop a dynamic verification technique (Bomberman) to vet untrusted third-party RTL hardware for Ticking Timebomb Trojans (TTTs). By targeting a specific type of Trojan behavior, Bomberman does not suffer from false negatives (missed TTTs), and therefore systematically reduces the overall design-time attack surface. Lastly, to generalize the Bomberman approach to automatically discover other behaviorally-defined classes of malicious logic, I adapt coverage-guided software fuzzers to the RTL verification domain. Leveraging software fuzzers for RTL verification enables IC design engineers to optimize test coverage of third-party designs without intimate implementation knowledge. Overall, this dissertation aims to make security a first-class design objective, alongside power, performance, and area, throughout the hardware development process.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/169761/1/trippel_1.pd
Multi-Threshold Low Power-Delay Product Memory and Datapath Components Utilizing Advanced FinFET Technology Emphasizing the Reliability and Robustness
Indiana University-Purdue University Indianapolis (IUPUI)In this thesis, we investigated the 7 nm FinFET technology for its delay-power product performance. In our study, we explored the ASAP7 library from Arizona State University, developed in collaboration with ARM Holdings. The FinFET technology was chosen since it has a subthreshold slope of 60mV/decade that enables cells to function at 0.7V supply voltage at the nominal corner. An emphasis was focused on characterizing the Non-Ideal effects, delay variation, and power for the FinFET device. An exhaustive analysis of the INVx1 delay variation for different operating conditions was also included, to assess the robustness.
The 7nm FinFET device was then employed into 6T SRAM cells and 16 function ALU. The SRAM cells were approached with advanced multi-corner stability evaluation. The system-level architecture of the ALU has demonstrated an ultra-low power system operating at 1 GHz clock frequency
Recommended from our members
Ultra-Low Leakage, Energy-Efficient Digital Integrated Circuit and System Design
The advances of the complementary metal-oxide-semiconductor (CMOS) technology manufacturing and design over the years have enabled a diverse range of applications across the power consumption, performance, and area (PPA) spectra. Many of the recent and prospective applications rely on the availability of energy-autonomous, miniaturized systems, i.e., ultra-low power (ULP) VLSI systems, which are generally characterized by extreme resource limitations. Some examples of applications are wireless sensing platforms, body-area sensor networks (BASN), biomedical and implantable devices, wearables, hearables, and monitors. Within the context of such applications, the key requirements are long lifetime and miniaturized size (sub-/millimeter-scale). In order to enable both requirements, energy-efficiency is the key metric. It allows for extended battery lifetime and operation with the energy that can be harvested from the environment, and it limits the size (volume) of the energy sources utilized to power these systems.
Ultra-low voltage (ULV) operation is a key technique in which the VDD of circuits is reduced from nominal to near or below the threshold voltage of the transistor. It is a powerful knob that has been largely exploited by designers in order to achieve ultra-low power consumption and high energy-efficiency in CMOS. Existing ULP VLSI systems typically operate at a lower supply voltage thereby reducing their energy consumption by one to two orders of magnitude in order to enable the aforementioned applications.
While supply voltage scaling is a promising measure for achieving low power and reducing energy consumption, it brings up several challenges. One critical issue is the leakage energy dissipated by the devices, which is magnified in portion to the total energy consumption at ULV. The reason is that, as VDD scales from nominal to near-threshold and sub-threshold, transistors become increasingly slower and they accumulate more leakage (i.e., static) power over longer cycle times. This energy waste accounts for a significant portion of the system's total energy consumption, offsets the gains provided by voltage scaling, defines the minimum energy per operation, and poses a practical limit for the system's energy-efficiency.
This thesis presents selected research works on ultra-low leakage, energy-efficient digital integrated circuit design. More specifically, it describes novel and key techniques for minimizing the energy waste of idle/underutilized and always-on hardware. The main goal of such techniques is to push the envelope of energy-efficiency in energy-autonomous, miniaturized VLSI systems. Such techniques are applied to key building blocks of emerging mobile and embedded computing devices resulting in state-of-the-art energy-efficiencies
Design and Optimization for Resilient Energy Efficient Computing
Heutzutage sind moderne elektronische Systeme ein integraler Bestandteil unseres Alltags. Dies wurde unter anderem durch das exponentielle Wachstum der Integrationsdichte von integrierten Schaltkreisen ermöglicht zusammen mit einer Verbesserung der Energieeffizienz, welche in den letzten 50 Jahren stattfand, auch bekannt als Mooreâs Gesetz. In diesem Zusammenhang ist die Nachfrage von energieeffizienten digitalen Schaltkreisen enorm angestiegen, besonders in Anwendungsfeldern wie dem Internet of Things (IoT). Da der Leistungsverbrauch von Schaltkreisen stark mit der Versorgungsspannung verknĂŒpft ist, wurden effiziente Verfahren entwickelt, welche die Versorgungsspannung in den nahen Schwellenspannung-Bereich skalieren, zusammengefasst unter dem Begriff Near-Threshold-Computing (NTC). Mithilfe dieser Verfahren kann eine Erhöhung der Energieeffizienz von Schaltungen um eine ganze GröĂenordnung ermöglicht werden.
Neben der verbesserten Energiebilanz ergeben sich jedoch zahlreiche Herausforderungen was den Schaltungsentwurf angeht. Zum Beispiel fĂŒhrt das Reduzieren der Versorgungsspannung in den nahen Schwellenspannungsbereich zu einer verzehnfachten Erhöhung der SensibilitĂ€t der Schaltkreise gegenĂŒber Prozessvariation, Spannungsfluktuationen und TemperaturverĂ€nderungen. Die EinflĂŒsse dieser Variationen reduzieren die ZuverlĂ€ssigkeit von NTC Schaltkreisen und sind ihr gröĂtes Hindernis bezĂŒglich einer umfassenden Nutzung. Traditionelle AnsĂ€tze und Methoden aus dem nominalen Spannungsbereich zur Kompensation von VariabilitĂ€t können
nicht effizient angewandt werden, da die starken Performance-Variationen und SensitivitĂ€ten im nahen Schwellenspannungsbereich dessen KapazitĂ€ten ĂŒbersteigen. Aus diesem Grund sind neue Entwurfsparadigmen und Entwurfsautomatisierungskonzepte fĂŒr die Anwendung von NTC erforderlich.
Das Ziel dieser Arbeit ist die zuvor erwÀhnten Probleme durch die Bereitstellung von ganzheitlichen Methoden zum Design von NTC Schaltkreisen sowie dessen Entwurfsautomatisierung anzugehen, welche insbesondere auf der Schaltungs- sowie Logik-Ebene angewandt werden. Dabei werden tiefgehende Analysen der ZuverlÀssigkeit von NTC Systemen miteinbezogen und Optimierungsmethoden werden vorgeschlagen welche die ZuverlÀssigkeit, Performance und Energieeffizienz verbessern. Die BeitrÀge dieser Arbeit sind wie folgt:
Schaltungssynthese und Timing Closure unter Einbezug von Variationen: Das Einhalten von Anforderungen an das zeitliche Verhalten und ZuverlĂ€ssigkeit von NTC ist eine anspruchsvolle Aufgabe. Die Auswirkungen von VariabilitĂ€t kommen bei starken Performance-Schwankungen, welche zu teuren zeitlichen Sicherheitsmargen fĂŒhren, oder sich in Hold-Time VerstöĂen ausdrĂŒcken, verursacht durch funktionale Störungen, zum Vorschein. Die konventionellen AnsĂ€tze beschrĂ€nken sich dabei alleine auf die Erhöhung von zeitlichen Sicherheitsmargen. Dies ist jedoch sehr ineffizient fĂŒr NTC, wegen dem starken AusmaĂ an Variationen und den erhöhten Leckströmen.
In dieser Arbeit wird ein Konzept zur Synthese und Timing Closure von Schaltkreisen unter Variationen vorgestellt, welches sowohl die SensitivitĂ€t gegenĂŒber Variationen reduziert als auch die Energieeffizienz, Performance und ZuverlĂ€ssigkeit verbessert und zugleich den Mehraufwand von Timing Closures [1, 2] verringert. Simulationsergebnisse belegen, dass unser vorgeschlagener Ansatz die Verzögerungszeit um 87% reduziert und die Performance und Energieeffizienz um 25% beziehungsweise 7.4% verbessert, zu Kosten eines erhöhten FlĂ€chenbedarfs von 4.8%.
SchichtĂŒbergreifende ZuverlĂ€ssigkeits-, Energieeffizienz- und Performance-Optimierung von Datenpfaden: SchichtĂŒbergreifende Analyse von Prozessor-Datenpfaden, welche den ganzen Weg spannen vom Kompilierer zum Schaltungsentwurf, kann potenzielle OptimierungsansĂ€tze aufzeigen. Ein Datenpfad ist eine Kombination von mehreren funktionalen Einheiten, welche diverse Instruktionen verarbeiten können. Unsere Analyse zeigt, dass die AusfĂŒhrungszeiten von Instruktionen bei niedrigen Versorgungsspannungen stark variieren, weshalb eine Klassifikation in schnelle und langsame Instruktionen vorgenommen werden kann. Des Weiteren können funktionale Instruktionen als hĂ€ufig und selten genutzte Instruktionen kategorisiert werden.
Diese Arbeit stellt eine Multi-Zyklen-Instruktionen-Methode vor, welche die Energieeffizienz und Belastbarkeit von funktionalen Einheiten erhöhen kann [3]. ZusÀtzlich stellen wir einen Partitionsalgorithmus vor, welcher ein fein-granulares Power-gating von selten genutzten Einheiten ermöglicht [4] durch Partition von einzelnen funktionalen Einheiten in mehrere kleinere Einheiten. Die vorgeschlagenen Methoden verbessern das zeitliche Schaltungsverhalten signifikant, und begrenzen zugleich die Leckströme betrÀchtlich, durch Einsatz einer Kombination von Schaltungs-Redesign- und Code-Replacement-Techniken. Simulationsresultate
zeigen, dass die entwickelten Methoden die Performance und Energieeffizienz von arithmetisch-logischen Einheiten (ALU) um 19% beziehungsweise 43% verbessern. Des Weiteren kann der Zuwachs in Performance der optimierten Schaltungen in eine Verbesserung der ZuverlÀssigkeit umgewandelt werden [5, 6].
Post-Fabrication und Laufzeit-Tuning: Prozess- und Laufzeitvariationen haben einen starken Einfluss auf den Minimum Energy Point (MEP) von NTC-Schaltungen, welcher mit der energieeffizientesten Versorgungsspannung assoziiert ist. Es ist ein besonderes Anliegen, die NTC-Schaltung nach der Herstellung (post-fabrication) so zu kalibrieren, dass sich die Schaltung im MEP-Zustand befindet, um die beste Energieeffizient zu erreichen.
In dieser Arbeit, werden Post-Fabrication und Laufzeit-Tuning vorgeschlagen, welche die Schaltung basierend auf Geschwindigkeits- und Leistungsverbrauch-Messungen nach der Herstellung auf den MEP kalibrieren. Die vorgestellten Techniken ermitteln den MEP per Chip-Basis um den Einfluss von Prozessvariationen mit einzubeziehen und dynamisch die Versorgungsspannung und Frequenz zu adaptieren um zeitabhĂ€ngige Variationen wie Workload und Temperatur zu adressieren. Zu diesem Zweck wird in die Firmware eines Chips ein Regression-Modell integriert, welches den MEP basierend auf Workload- und Temperatur-Messungen zur Laufzeit extrahiert. Das Regressions-Modell ist fĂŒr jeden Chip einzigartig und basiert lediglich auf Post-Fabrication-Messungen. Simulationsergebnisse zeigen das der entwickelte Ansatz eine sehr hohe prognostische Treffsicherheit und Energieeffizienz hat, Ă€hnlich zu hardware-implementierten Methoden, jedoch ohne hardware-seitigen Mehraufwand [7, 8].
Selektierte Flip-Flop Optimierung: Ultra-Low-Voltage Schaltungen mĂŒssen im nominalen Versorgungsspannungs-Mode arbeiten um zeitliche Anforderungen von laufenden Anwendungen zu erfĂŒllen. In diesem Fall ist die Schaltung von starken Alterungsprozessen betroffen, welche die Transistoren durch Erhöhung der Schwellenspannungen degradieren. Unsere tiefgehenden Analysen haben gezeigt das gewisse Flip-Flop-Architekturen von diesen Alterungserscheinungen beeinflusst werden indem fĂ€lschlicherweise konstante Werte ( \u270\u27 oder \u271\u27) fĂŒr eine lange Zeit gespeichert sind. Im Vergleich zu anderen Komponenten sind Flip-Flops sensitiver zu Alterungsprozessen und versagen unter anderem dabei einen neuen Wert innerhalb des vorgegebenen zeitlichen Rahmens zu ĂŒbernehmen. AuĂerdem kann auch ein geringfĂŒgiger Spannungsabfall zu diesen zeitlichen VerstöĂen fĂŒhren, falls die betreffenden gealterten Flip-Flops zum kritischen Pfad zuzuordnen sind.
In dieser Arbeit wird eine selektiver Flip-Flop-Optimierungsmethode vorgestellt, welche die Schaltungen bezĂŒglich Robustheit gegen statische Alterung und Spannungsabfall optimieren. Dabei werden zuerst optimierte robuste Flip-Flops generiert und diese dann anschlieĂend in die Standard-Zellen-Bibliotheken integriert. Flip-Flops, die in der Schaltung zum kritischen Pfad gehören und Alterung sowie Spannungsabfall erfahren, werden durch die optimierten robusten Versionen ersetzt, um das Zeitverhalten und die ZuverlĂ€ssigkeit der Schaltung zu verbessern [9, 10]. Simulationsergebnisse zeigen, dass die erwartete Lebenszeit eines Prozessors
um 37% verbessert werden kann, wÀhrend Leckströme um nur 0.1% erhöht werden.
WĂ€hrend NTC das Potenzial hat groĂe Energieeffizienz zu ermöglichen, ist der Einsatz in neue Anwendungsfeldern wie IoT wegen den zuvor erwĂ€hnten Problemen bezĂŒglich der hohen SensitivitĂ€t gegenĂŒber Variationen und deshalb mangelnder ZuverlĂ€ssigkeit, noch nicht durchsetzbar. In dieser Dissertation und in noch nicht publizierten Werken [11â17], stellen wir Lösungen zu diesen Problemen vor, die eine Integration von NTC in heutige Systeme ermöglichen
Doctor of Philosophy
dissertationCommunication surpasses computation as the power and performance bottleneck in forthcoming exascale processors. Scaling has made transistors cheap, but on-chip wires have grown more expensive, both in terms of latency as well as energy. Therefore, the need for low energy, high performance interconnects is highly pronounced, especially for long distance communication. In this work, we examine two aspects of the global signaling problem. The first part of the thesis focuses on a high bandwidth asynchronous signaling protocol for long distance communication. Asynchrony among intellectual property (IP) cores on a chip has become necessary in a System on Chip (SoC) environment. Traditional asynchronous handshaking protocol suffers from loss of throughput due to the added latency of sending the acknowledge signal back to the sender. We demonstrate a method that supports end-to-end communication across links with arbitrarily large latency, without limiting the bandwidth, so long as line variation can be reliably controlled. We also evaluate the energy and latency improvements as a result of the design choices made available by this protocol. The use of transmission lines as a physical interconnect medium shows promise for deep submicron technologies. In our evaluations, we notice a lower energy footprint, as well as vastly reduced wire latency for transmission line interconnects. We approach this problem from two sides. Using field solvers, we investigate the physical design choices to determine the optimal way to implement these lines for a given back-end-of-line (BEOL) stack. We also approach the problem from a system designer's viewpoint, looking at ways to optimize the lines for different performance targets. This work analyzes the advantages and pitfalls of implementing asynchronous channel protocols for communication over long distances. Finally, the innovations resulting from this work are applied to a network-on-chip design example and the resulting power-performance benefits are reported
Cross-Layer Optimization for Power-Efficient and Robust Digital Circuits and Systems
With the increasing digital services demand, performance and power-efficiency
become vital requirements for digital circuits and systems. However, the
enabling CMOS technology scaling has been facing significant challenges of
device uncertainties, such as process, voltage, and temperature variations. To
ensure system reliability, worst-case corner assumptions are usually made in
each design level. However, the over-pessimistic worst-case margin leads to
unnecessary power waste and performance loss as high as 2.2x. Since
optimizations are traditionally confined to each specific level, those safe
margins can hardly be properly exploited.
To tackle the challenge, it is therefore advised in this Ph.D. thesis to
perform a cross-layer optimization for digital signal processing circuits and
systems, to achieve a global balance of power consumption and output quality.
To conclude, the traditional over-pessimistic worst-case approach leads to
huge power waste. In contrast, the adaptive voltage scaling approach saves
power (25% for the CORDIC application) by providing a just-needed supply
voltage. The power saving is maximized (46% for CORDIC) when a more aggressive
voltage over-scaling scheme is applied. These sparsely occurred circuit errors
produced by aggressive voltage over-scaling are mitigated by higher level error
resilient designs. For functions like FFT and CORDIC, smart error mitigation
schemes were proposed to enhance reliability (soft-errors and timing-errors,
respectively). Applications like Massive MIMO systems are robust against lower
level errors, thanks to the intrinsically redundant antennas. This property
makes it applicable to embrace digital hardware that trades quality for power
savings.Comment: 190 page
Recommended from our members
Variation-Tolerant and Voltage-Scalable Integrated Circuits Design
Ultra-low-voltage (ULV) operation where the supply voltage of the digital computing hardware is scaled down to the level near or below transistor threshold voltage (e.g. 300-500mV) is a key technique to achieve high computing energy efficiency. It has enabled many new exciting applications in the field of Internet of Things (IoT) devices and energy-constrained applications such as medical implants, environment sensors, and micro-robots. Ultra-low-voltage (ULV) operation is also commonly used with the emerging architectures that are often non Von-Neumann style to empower energy-efficient cognitive computing.
One the biggest challenge in realizing ULV design is the large circuit delay variability. To guarantee functionality in the worst-case process, voltage, and temperature (PVT) condition, the traditional safety margin approach requires operating at a slower clock frequency or higher supply voltage which significantly limits the achievable energy efficiency of the hardware. To fully claim the energy efficiency of ULV, the large circuit delay variation needs to be adaptively handled. However, the existing adaptive techniques that are optimized for nominal supply voltage operation and traditional Von-Neumann architectures become inefficient for ULV designs and emerging architectures.
This thesis presents adaptive techniques based on timing error detection and correction (EDAC) that are more suitable for the energy-constrained ULV designs and the emerging architectures. The proposed techniques are demonstrated in three test chips: (1) R-Processor: A 0.4V resilient processor with a voltage-scalable and low-overhead in-situ EDAC technique. It achieves 38% energy efficiency improvement or 2.3X throughput improvement as compared to the traditional safety margin approach. (2) A 450mV timing-margin-free waveform sorter for brain computer interface (BCI) microsystem. It achieves 49.3% higher energy efficiency and 35.6% higher throughput than the traditional safety margin approach. (3) Ultra-low-power and robust power-management system which consists of a microprocessor employing ULV EDAC, 63-ratio integrated switched-capacitor DC-DC converter, and a fully-digital error based regulation controller.
In this thesis, we also explore circuits for emerging techniques. The first is temperature sensors for dynamic-thermal-management (DTM). The modern high-performance microprocessors suffer from ever-increasing power densities which has led to reliability concerns and increased cooling costs from excessive heat. In order to monitor and manage the thermal behavior, DTM techniques embed multiple temperature sensors and use its information. The size, accuracy, and voltage-scalability of the sensor are critical for the performance of DTM. Therefore, we propose a temperature sensor that directly senses transistor threshold voltage and the test chip demonstrates 9X smaller area, 3X higher accuracy, and 200mV lower voltage scalability (down to 400mV) than the previous state-of-art.
Another area of exploration is interconnect design for ultra-dynamic-voltage-scaling (UDVS) systems. UDVS has been proposed for applications that require both high performance and high energy efficiency. UDVS can provide peak performance with nominal supply voltage when work load is high. When work load is moderate or low, UDVS systems can switch to ULV operation for higher energy efficiency. One of the critical challenges for developing UDVS systems is the inflexibility in various circuit fabrics that are often optimized for a single supply voltage. One critical example is conventional repeater based long interconnects which suffers from non-optimal performance and energy efficiency in UDVS systems. Therefore, in this thesis, we propose a reconfigurable interconnect design based on regenerators and demonstrate near optimal performance and energy efficiency across the supply voltage of 0.3V and 1V
Power Reductions with Energy Recovery Using Resonant Topologies
The problem of power densities in system-on-chips (SoCs) and processors has become more exacerbated recently, resulting in high cooling costs and reliability issues. One of the largest components of power consumption is the low skew clock distribution network (CDN), driving large load capacitance. This can consume as much as 70% of the total dynamic power that is lost as heat, needing elaborate sensing and cooling mechanisms. To mitigate this, resonant clocking has been utilized in several applications over the past decade. An improved energy recovering reconfigurable generalized series resonance (GSR) solution with all the critical support circuitry is developed in this work. This LC resonant clock driver is shown to save about 50% driver power (\u3e40% overall), on a 22nm process node and has 50% less skew than a non-resonant driver at 2GHz. It can operate down to 0.2GHz to support other energy savings techniques like dynamic voltage and frequency scaling (DVFS).
As an example, GSR can be configured for the simpler pulse series resonance (PSR) operation to enable further power saving for double data rate (DDR) applications, by using de-skewing latches instead of flip-flop banks. A PSR based subsystem for 40% savings in clocking power with 40% driver active area reduction xii is demonstrated. This new resonant driver generates tracking pulses at each transition of clock for dual edge operation across DVFS. PSR clocking is designed to drive explicit-pulsed latches with negative setup time. Simulations using 45nm IBM/PTM device and interconnect technology models, clocking 1024 flip-flops show the reductions, compared to non-resonant clocking. DVFS range from 2GHz/1.3V to 200MHz/0.5V is obtained. The PSR frequency is set \u3e3Ă the clock rate, needing only 1/10th the inductance of prior-art LC resonance schemes. The skew reductions are achieved without needing to increase the interconnect widths owing to negative set-up times.
Applications in data circuits are shown as well with a 90nm example. Parallel resonant and split-driver non-resonant configurations as well are derived from GSR. Tradeoffs in timing performance versus power, based on theoretical analysis, are compared for the first time and verified. This enables synthesis of an optimal topology for a given application from the GSR
Recommended from our members
High-Speed Wide-Field Time-Correlated Single-Photon Counting Fluorescence Lifetime Imaging Microscopy
Fluorescence microscopy is a powerful imaging technique used in the biological sciences to identify labeled components of a sample with specificity. This is usually accomplished through labeling with fluorescent dyes, isolating these dyes by their spectral signatures with optical filters, and recording the intensity of the fluorescent response. Although these techniques are widely used, fluorescence intensity images can be negatively affected by a variety of factors that impact the fluorescence intensity. Fluorescence lifetime imaging microscopy (FLIM) is an imaging technique that is relatively immune to intensity fluctuations and also provides the unique ability to directly monitor the microenvironment surrounding a fluorophore. Despite the benefits associated with FLIM, the applications to which it is applied are fairly limited due to long image acquisition times and high cost of traditional hardware. Recent advances in complementary metal-oxide-semiconductor (CMOS) single-photon avalanche diodes (SPADs) have enabled the design of low-cost imaging arrays that are capable of recording lifetime images with acquisition times greater than one order of magnitude faster than existing systems. However, these SPAD arrays have yet to realize the full potential of the technology due to limitations in their ability to handle the vast amount of data generated during the commonly used time-correlated single-photon counting (TCSPC) lifetime imaging technique. This thesis presents the design, implementation, characterization, and demonstration of a high speed FLIM imaging system. The components of this design include a CMOS imager chip in a standard 0.13 ÎŒm technology containing a custom CMOS SPAD, a 64-by-64 array of these SPADs, pixel control circuitry, independent time-to-digital converters (TDCs), a FLIM specific datapath, and high bandwidth output buffers. In addition to the CMOS imaging array, a complete system was designed and implemented using a printed circuit board (PCB) for capturing data from the imager, creating histograms for the photon arrival data using field-programmable gate arrays, and transferring the data to a computer using a cabled PCIe interface. Finally, software is used to communicate between the imaging system and a computer.The dark count rate of the SPAD was measured to be only 231 Hz at room temperature while maintaining a photon detection probability of up to 30\%. TDCs included on the array have a 62.5 ps resolution and a 64 ns range, which is suitable for measuring the lifetime of most biological fluorophores. Additionally, the on-chip datapath was designed to handle continuous data transfers at rates capable of supporting TCSPC-based lifetime imaging at 100 frames per second. The system level implementation also provides sufficient data throughput for transferring up to 750 frames per second from the imaging system to a computer. The lifetime imaging system was characterized using standard techniques for evaluating SPAD performance and an electrical delay signal for measuring the TDC performance. This thesis concludes with a demonstration of TCSPC-FLIM imaging at 100 frames per second -- the fastest 64-by-64 TCSPC FLIM that has been demonstrated. This system overcomes some of the limitations of existing FLIM systems and has the potential to enable new application domains in dynamic FLIM imaging