285 research outputs found

    FPGA-SPICE: A Simulation-Based Architecture Evaluation Framework for FPGAs

    Get PDF
    In this paper, we developed a simulation-based architecture evaluation framework for field-programmable gate arrays (FPGAs), called FPGA-SPICE, which enables automatic layout-level estimation and electrical simulations of FPGA architectures. FPGA-SPICE can automatically generate Verilog and SPICE netlists based on realistic FPGA configurations and a high-level eTtensible Markup Language-based FPGA architectural description language. The outputted Verilog netlists can be used to generate layouts of full FPGA fabrics through a semicustom design flow. SPICE simulation decks can be generated at three levels of complexity, namely, full-chip-level, grid-level, and component-level, providing different tradeoff between accuracy and simulation time. In order to enable such level of analysis, we presented two SPICE netlist partitioning techniques: loads extraction and parasitic net activity estimation. Electrical simulations showed that averaged over the selected benchmarks, the grid-/component-level approach can achieve 6.1x/7.5x execution speed-up with 9.9%/8.3% accuracy loss, respectively, compared to the full-chip level simulation. FPGA-SPICE was showcased through three different case studies: 1) an area breakdown analysis for static random access memory-based FPGAs, showing that configuration memories are a dominant factor; 2) a power breakdown comparison to analytical models, analyzing the source of accuracy loss; and 3) a robustness evaluation against process corners, studying their impact on energy consumption of full FPGA fabrics

    Approximate and timing-speculative hardware design for high-performance and energy-efficient video processing

    Get PDF
    Since the end of transistor scaling in 2-D appeared on the horizon, innovative circuit design paradigms have been on the rise to go beyond the well-established and ultraconservative exact computing. Many compute-intensive applications – such as video processing – exhibit an intrinsic error resilience and do not necessarily require perfect accuracy in their numerical operations. Approximate computing (AxC) is emerging as a design alternative to improve the performance and energy-efficiency requirements for many applications by trading its intrinsic error tolerance with algorithm and circuit efficiency. Exact computing also imposes a worst-case timing to the conventional design of hardware accelerators to ensure reliability, leading to an efficiency loss. Conversely, the timing-speculative (TS) hardware design paradigm allows increasing the frequency or decreasing the voltage beyond the limits determined by static timing analysis (STA), thereby narrowing pessimistic safety margins that conventional design methods implement to prevent hardware timing errors. Timing errors should be evaluated by an accurate gate-level simulation, but a significant gap remains: How these timing errors propagate from the underlying hardware all the way up to the entire algorithm behavior, where they just may degrade the performance and quality of service of the application at stake? This thesis tackles this issue by developing and demonstrating a cross-layer framework capable of performing investigations of both AxC (i.e., from approximate arithmetic operators, approximate synthesis, gate-level pruning) and TS hardware design (i.e., from voltage over-scaling, frequency over-clocking, temperature rising, and device aging). The cross-layer framework can simulate both timing errors and logic errors at the gate-level by crossing them dynamically, linking the hardware result with the algorithm-level, and vice versa during the evolution of the application’s runtime. Existing frameworks perform investigations of AxC and TS techniques at circuit-level (i.e., at the output of the accelerator) agnostic to the ultimate impact at the application level (i.e., where the impact is truly manifested), leading to less optimization. Unlike state of the art, the framework proposed offers a holistic approach to assessing the tradeoff of AxC and TS techniques at the application-level. This framework maximizes energy efficiency and performance by identifying the maximum approximation levels at the application level to fulfill the required good enough quality. This thesis evaluates the framework with an 8-way SAD (Sum of Absolute Differences) hardware accelerator operating into an HEVC encoder as a case study. Application-level results showed that the SAD based on the approximate adders achieve savings of up to 45% of energy/operation with an increase of only 1.9% in BD-BR. On the other hand, VOS (Voltage Over-Scaling) applied to the SAD generates savings of up to 16.5% in energy/operation with around 6% of increase in BD-BR. The framework also reveals that the boost of about 6.96% (at 50°) to 17.41% (at 75° with 10- Y aging) in the maximum clock frequency achieved with TS hardware design is totally lost by the processing overhead from 8.06% to 46.96% when choosing an unreliable algorithm to the blocking match algorithm (BMA). We also show that the overhead can be avoided by adopting a reliable BMA. This thesis also shows approximate DTT (Discrete Tchebichef Transform) hardware proposals by exploring a transform matrix approximation, truncation and pruning. The results show that the approximate DTT hardware proposal increases the maximum frequency up to 64%, minimizes the circuit area in up to 43.6%, and saves up to 65.4% in power dissipation. The DTT proposal mapped for FPGA shows an increase of up to 58.9% on the maximum frequency and savings of about 28.7% and 32.2% on slices and dynamic power, respectively compared with stat

    New FPGA design tools and architectures

    Get PDF

    AI/ML Algorithms and Applications in VLSI Design and Technology

    Full text link
    An evident challenge ahead for the integrated circuit (IC) industry in the nanometer regime is the investigation and development of methods that can reduce the design complexity ensuing from growing process variations and curtail the turnaround time of chip manufacturing. Conventional methodologies employed for such tasks are largely manual; thus, time-consuming and resource-intensive. In contrast, the unique learning strategies of artificial intelligence (AI) provide numerous exciting automated approaches for handling complex and data-intensive tasks in very-large-scale integration (VLSI) design and testing. Employing AI and machine learning (ML) algorithms in VLSI design and manufacturing reduces the time and effort for understanding and processing the data within and across different abstraction levels via automated learning algorithms. It, in turn, improves the IC yield and reduces the manufacturing turnaround time. This paper thoroughly reviews the AI/ML automated approaches introduced in the past towards VLSI design and manufacturing. Moreover, we discuss the scope of AI/ML applications in the future at various abstraction levels to revolutionize the field of VLSI design, aiming for high-speed, highly intelligent, and efficient implementations

    Design Disjunction for Resilient Reconfigurable Hardware

    Get PDF
    Contemporary reconfigurable hardware devices have the capability to achieve high performance, power efficiency, and adaptability required to meet a wide range of design goals. With scaling challenges facing current complementary metal oxide semiconductor (CMOS), new concepts and methodologies supporting efficient adaptation to handle reliability issues are becoming increasingly prominent. Reconfigurable hardware and their ability to realize self-organization features are expected to play a key role in designing future dependable hardware architectures. However, the exponential increase in density and complexity of current commercial SRAM-based field-programmable gate arrays (FPGAs) has escalated the overhead associated with dynamic runtime design adaptation. Traditionally, static modular redundancy techniques are considered to surmount this limitation; however, they can incur substantial overheads in both area and power requirements. To achieve a better trade-off among performance, area, power, and reliability, this research proposes design-time approaches that enable fine selection of redundancy level based on target reliability goals and autonomous adaptation to runtime demands. To achieve this goal, three studies were conducted: First, a graph and set theoretic approach, named Hypergraph-Cover Diversity (HCD), is introduced as a preemptive design technique to shift the dominant costs of resiliency to design-time. In particular, union-free hypergraphs are exploited to partition the reconfigurable resources pool into highly separable subsets of resources, each of which can be utilized by the same synthesized application netlist. The diverse implementations provide reconfiguration-based resilience throughout the system lifetime while avoiding the significant overheads associated with runtime placement and routing phases. Evaluation on a Motion-JPEG image compression core using a Xilinx 7-series-based FPGA hardware platform has demonstrated the potential of the proposed FT method to achieve 37.5% area saving and up to 66% reduction in power consumption compared to the frequently-used TMR scheme while providing superior fault tolerance. Second, Design Disjunction based on non-adaptive group testing is developed to realize a low-overhead fault tolerant system capable of handling self-testing and self-recovery using runtime partial reconfiguration. Reconfiguration is guided by resource grouping procedures which employ non-linear measurements given by the constructive property of f-disjunctness to extend runtime resilience to a large fault space and realize a favorable range of tradeoffs. Disjunct designs are created using the mosaic convergence algorithm developed such that at least one configuration in the library evades any occurrence of up to d resource faults, where d is lower-bounded by f. Experimental results for a set of MCNC and ISCAS benchmarks have demonstrated f-diagnosability at the individual slice level with average isolation resolution of 96.4% (94.4%) for f=1 (f=2) while incurring an average critical path delay impact of only 1.49% and area cost roughly comparable to conventional 2-MR approaches. Finally, the proposed Design Disjunction method is evaluated as a design-time method to improve timing yield in the presence of large random within-die (WID) process variations for application with a moderately high production capacity

    Degradation in FPGAs: Monitoring, Modeling and Mitigation

    Get PDF
    This dissertation targets the transistor aging degradation as well as the associated thermal challenges in FPGAs (since there is an exponential relation between aging and chip temperature). The main objectives are to perform experimentation, analysis and device-level model abstraction for modeling the degradation in FPGAs, then to monitor the FPGA to keep track of aging rates and ultimately to propose an aging-aware FPGA design flow to mitigate the aging

    Potential and Challenges of Analog Reconfigurable Computation in Modern and Future CMOS

    Get PDF
    In this work, the feasibility of the floating-gate technology in analog computing platforms in a scaled down general-purpose CMOS technology is considered. When the technology is scaled down the performance of analog circuits tends to get worse because the process parameters are optimized for digital transistors and the scaling involves the reduction of supply voltages. Generally, the challenge in analog circuit design is that all salient design metrics such as power, area, bandwidth and accuracy are interrelated. Furthermore, poor flexibility, i.e. lack of reconfigurability, the reuse of IP etc., can be considered the most severe weakness of analog hardware. On this account, digital calibration schemes are often required for improved performance or yield enhancement, whereas high flexibility/reconfigurability can not be easily achieved. Here, it is discussed whether it is possible to work around these obstacles by using floating-gate transistors (FGTs), and analyze problems associated with the practical implementation. FGT technology is attractive because it is electrically programmable and also features a charge-based built-in non-volatile memory. Apart from being ideal for canceling the circuit non-idealities due to process variations, the FGTs can also be used as computational or adaptive elements in analog circuits. The nominal gate oxide thickness in the deep sub-micron (DSM) processes is too thin to support robust charge retention and consequently the FGT becomes leaky. In principle, non-leaky FGTs can be implemented in a scaled down process without any special masks by using “double”-oxide transistors intended for providing devices that operate with higher supply voltages than general purpose devices. However, in practice the technology scaling poses several challenges which are addressed in this thesis. To provide a sufficiently wide-ranging survey, six prototype chips with varying complexity were implemented in four different DSM process nodes and investigated from this perspective. The focus is on non-leaky FGTs, but the presented autozeroing floating-gate amplifier (AFGA) demonstrates that leaky FGTs may also find a use. The simplest test structures contain only a few transistors, whereas the most complex experimental chip is an implementation of a spiking neural network (SNN) which comprises thousands of active and passive devices. More precisely, it is a fully connected (256 FGT synapses) two-layer spiking neural network (SNN), where the adaptive properties of FGT are taken advantage of. A compact realization of Spike Timing Dependent Plasticity (STDP) within the SNN is one of the key contributions of this thesis. Finally, the considerations in this thesis extend beyond CMOS to emerging nanodevices. To this end, one promising emerging nanoscale circuit element - memristor - is reviewed and its applicability for analog processing is considered. Furthermore, it is discussed how the FGT technology can be used to prototype computation paradigms compatible with these emerging two-terminal nanoscale devices in a mature and widely available CMOS technology.Siirretty Doriast

    Printed Electronics-Based Physically Unclonable Functions for Lightweight Security in the Internet of Things

    Get PDF
    Die moderne Gesellschaft strebt mehr denn je nach digitaler Konnektivität - überall und zu jeder Zeit - was zu Megatrends wie dem Internet der Dinge (Internet of Things, IoT) führt. Bereits heute kommunizieren und interagieren „Dinge“ autonom miteinander und werden in Netzwerken verwaltet. In Zukunft werden Menschen, Daten und Dinge miteinander verbunden sein, was auch als Internet von Allem (Internet of Everything, IoE) bezeichnet wird. Milliarden von Geräten werden in unserer täglichen Umgebung allgegenwärtig sein und über das Internet in Verbindung stehen. Als aufstrebende Technologie ist die gedruckte Elektronik (Printed Electronics, PE) ein Schlüsselelement für das IoE, indem sie neuartige Gerätetypen mit freien Formfaktoren, neuen Materialien auf einer Vielzahl von Substraten mit sich bringt, die flexibel, transparent und biologisch abbaubar sein können. Darüber hinaus ermöglicht PE neue Freiheitsgrade bei der Anpassbarkeit von Schaltkreisen sowie die kostengünstige und großflächige Herstellung am Einsatzort. Diese einzigartigen Eigenschaften von PE ergänzen herkömmliche Technologien auf Siliziumbasis. Additive Fertigungsprozesse ermöglichen die Realisierung von vielen zukunftsträchtigen Anwendungen wie intelligente Objekte, flexible Displays, Wearables im Gesundheitswesen, umweltfreundliche Elektronik, um einige zu nennen. Aus der Sicht des IoE ist die Integration und Verbindung von Milliarden heterogener Geräte und Systeme eine der größten zu lösenden Herausforderungen. Komplexe Hochleistungsgeräte interagieren mit hochspezialisierten, leichtgewichtigen elektronischen Geräten, wie z.B. Smartphones mit intelligenten Sensoren. Daten werden in der Regel kontinuierlich gemessen, gespeichert und mit benachbarten Geräten oder in der Cloud ausgetauscht. Dabei wirft die Fülle an gesammelten und verarbeiteten Daten Bedenken hinsichtlich des Datenschutzes und der Sicherheit auf. Herkömmliche kryptografische Operationen basieren typischerweise auf deterministischen Algorithmen, die eine hohe Schaltungs- und Systemkomplexität erfordern, was sie wiederum für viele leichtgewichtige Geräte ungeeignet macht. Es existieren viele Anwendungsbereiche, in denen keine komplexen kryptografischen Operationen erforderlich sind, wie z.B. bei der Geräteidentifikation und -authentifizierung. Dabei hängt das Sicherheitslevel hauptsächlich von der Qualität der Entropiequelle und der Vertrauenswürdigkeit der abgeleiteten Schlüssel ab. Statistische Eigenschaften wie die Einzigartigkeit (Uniqueness) der Schlüssel sind von großer Bedeutung, um einzelne Entitäten genau unterscheiden zu können. In den letzten Jahrzehnten hat die Hardware-intrinsische Sicherheit, insbesondere Physically Unclonable Functions (PUFs), eine große Strahlkraft hinsichtlich der Bereitstellung von Sicherheitsfunktionen für IoT-Geräte erlangt. PUFs verwenden ihre inhärenten Variationen, um gerätespezifische eindeutige Kennungen abzuleiten, die mit Fingerabdrücken in der Biometrie vergleichbar sind. Zu den größten Potenzialen dieser Technologie gehören die Verwendung einer echten Zufallsquelle, die Ableitung von Sicherheitsschlüsseln nach Bedarf sowie die inhärente Schlüsselspeicherung. In Kombination mit den einzigartigen Merkmalen der PE-Technologie werden neue Möglichkeiten eröffnet, um leichtgewichtige elektronische Geräte und Systeme abzusichern. Obwohl PE noch weit davon entfernt ist, so ausgereift und zuverlässig wie die Siliziumtechnologie zu sein, wird in dieser Arbeit gezeigt, dass PE-basierte PUFs vielversprechende Sicherheitsprimitiven für die Schlüsselgenerierung zur eindeutigen Geräteidentifikation im IoE sind. Dabei befasst sich diese Arbeit in erster Linie mit der Entwicklung, Untersuchung und Bewertung von PE-basierten PUFs, um Sicherheitsfunktionen für ressourcenbeschränkte gedruckte Geräte und Systeme bereitzustellen. Im ersten Beitrag dieser Arbeit stellen wir das skalierbare, auf gedruckter Elektronik basierende Differential Circuit PUF (DiffC-PUF) Design vor, um sichere Schlüssel für Sicherheitsanwendungen für ressourcenbeschränkte Geräte bereitzustellen. Die DiffC-PUF ist als hybride Systemarchitektur konzipiert, die siliziumbasierte und gedruckte Komponenten enthält. Es wird eine eingebettete PUF-Plattform entwickelt, um die Charakterisierung von siliziumbasierten und gedruckten PUF-Cores in großem Maßstab zu ermöglichen. Im zweiten Beitrag dieser Arbeit werden siliziumbasierte PUF-Cores auf Basis diskreter Komponenten hergestellt und statistische Tests unter realistischen Betriebsbedingungen durchgeführt. Eine umfassende experimentelle Analyse der PUF-Sicherheitsmetriken wird vorgestellt. Die Ergebnisse zeigen, dass die DiffC-PUF auf Siliziumbasis nahezu ideale Werte für die Uniqueness- und Reliability-Metriken aufweist. Darüber hinaus werden die Identifikationsfähigkeiten der DiffC-PUF untersucht, und es stellte sich heraus, dass zusätzliches Post-Processing die Identifizierbarkeit des Identifikationssystems weiter verbessern kann. Im dritten Beitrag dieser Arbeit wird zunächst ein Evaluierungsworkflow zur Simulation von DiffC-PUFs basierend auf gedruckter Elektronik vorgestellt, welche auch als Hybrid-PUFs bezeichnet werden. Hierbei wird eine Python-basierte Simulationsumgebung vorgestellt, welche es ermöglicht, die Eigenschaften und Variationen gedruckter PUF-Cores basierend auf Monte Carlo (MC) Simulationen zu untersuchen. Die Simulationsergebnisse zeigen, dass die Sicherheitsmetriken im besten Betriebspunkt nahezu ideal sind. Des Weiteren werden angefertigte PE-basierte PUF-Cores für statistische Tests unter verschiedenen Betriebsbedingungen, einschließlich Schwankungen der Umgebungstemperatur, der relativen Luftfeuchtigkeit und der Versorgungsspannung betrieben. Die experimentell bestimmten Resultate der Uniqueness-, Bit-Aliasing- und Uniformity-Metriken stimmen gut mit den Simulationsergebnissen überein. Der experimentell ermittelte durchschnittliche Reliability-Wert ist relativ niedrig, was durch die fehlende Passivierung und Einkapselung der gedruckten Transistoren erklärt werden kann. Die Untersuchung der Identifikationsfähigkeiten basierend auf den PUF-Responses zeigt, dass die Hybrid-PUF ohne zusätzliches Post-Processing nicht für kryptografische Anwendungen geeignet ist. Die Ergebnisse zeigen aber auch, dass sich die Hybrid-PUF zur Geräteidentifikation eignet. Der letzte Beitrag besteht darin, in die Perspektive eines Angreifers zu wechseln. Um die Sicherheitsfähigkeiten der Hybrid-PUF beurteilen zu können, wird eine umfassende Sicherheitsanalyse nach Art einer Kryptoanalyse durchgeführt. Die Analyse der Entropie der Hybrid-PUF zeigt, dass seine Anfälligkeit für Angriffe auf Modellbasis hauptsächlich von der eingesetzten Methode zur Generierung der PUF-Challenges abhängt. Darüber hinaus wird ein Angriffsmodell eingeführt, um die Leistung verschiedener mathematischer Klonangriffe auf der Grundlage von abgehörten Challenge-Response Pairs (CRPs) zu bewerten. Um die Hybrid-PUF zu klonen, wird ein Sortieralgorithmus eingeführt und mit häufig verwendeten Classifiers für überwachtes maschinelles Lernen (ML) verglichen, einschließlich logistischer Regression (LR), Random Forest (RF) sowie Multi-Layer Perceptron (MLP). Die Ergebnisse zeigen, dass die Hybrid-PUF anfällig für modellbasierte Angriffe ist. Der Sortieralgorithmus profitiert von kürzeren Trainingszeiten im Vergleich zu den ML-Algorithmen. Im Falle von fehlerhaft abgehörten CRPs übertreffen die ML-Algorithmen den Sortieralgorithmus

    Design of approximate overclocked datapath

    Get PDF
    Embedded applications can often demand stringent latency requirements. While high degrees of parallelism within custom FPGA-based accelerators may help to some extent, it may also be necessary to limit the precision used in the datapath to boost the operating frequency of the implementation. However, by reducing the precision, the engineer introduces quantisation error into the design. In this thesis, we describe an alternative circuit design methodology when considering trade-offs between accuracy, performance and silicon area. We compare two different approaches that could trade accuracy for performance. One is the traditional approach where the precision used in the datapath is limited to meet a target latency. The other is a proposed new approach which simply allows the datapath to operate without timing closure. We demonstrate analytically and experimentally that for many applications it would be preferable to simply overclock the design and accept that timing violations may arise. Since the errors introduced by timing violations occur rarely, they will cause less noise than quantisation errors. Furthermore, we show that conventional forms of computer arithmetic do not fail gracefully when pushed beyond the deterministic clocking region. In this thesis we take a fresh look at Online Arithmetic, originally proposed for digit serial operation, and synthesize unrolled digit parallel online arithmetic operators to allow for graceful degradation. We quantify the impact of timing violations on key arithmetic primitives, and show that substantial performance benefits can be obtained in comparison to binary arithmetic. Since timing errors are caused by long carry chains, these result in errors in least significant digits with online arithmetic, causing less impact than conventional implementations.Open Acces
    • …
    corecore