Benchmark methodologies for the optimized physical synthesis of RISC-V microprocessors
As technology continues to advance and chip sizes shrink, the complexity and design time required for integrated circuits have increased significantly. To address these challenges, Electronic Design Automation (EDA) tools have been introduced to streamline the design flow. These tools offer various methodologies and options to optimize power, performance, and chip area. However, selecting the most suitable methods from these options can be challenging, as they may lead to trade-offs among power, performance, and area. While architectural and Register Transfer Level (RTL) optimizations have been studied extensively in the existing literature, the impact of the optimization methods available in EDA tools on performance has not been thoroughly researched. This thesis aims to optimize a processor through EDA tools within the physical synthesis domain to achieve increased performance while maintaining a balance between power efficiency and area utilization. By leveraging floorplanning tools and carefully selecting technology libraries and optimization options, the CV32E40P open-source processor is subjected to various floorplans to analyze their impact on chip performance. The employed techniques, including the multibit component prefer option, the multiplexer tree prefer option, the identification and exclusion of problematic cells, and placement blockages, lead to significant improvements in cell density, congestion mitigation, and timing. The optimized synthesis results demonstrate a 71% improvement in chip performance without a substantial increase in area, showcasing the effectiveness of these techniques in improving the performance, efficiency, and manufacturability of large-scale integrated circuits. By exploring and implementing the available options in EDA tools, this study demonstrates how a processor's performance can be significantly improved while maintaining a balanced and efficient chip design. The findings contribute valuable insights to the field of electronic design automation, offering guidance to designers in selecting suitable methodologies for optimizing processors and other integrated circuits.
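As a rough illustration of the kind of power/performance/area (PPA) trade-off that such option selection involves, the Python sketch below scores hypothetical synthesis runs against a baseline using a weighted PPA metric. All run names, numbers, and weights are invented placeholders, not results from the thesis.

```python
# Hypothetical sketch, not thesis data: rank physical-synthesis runs by a
# weighted power/performance/area (PPA) score relative to a baseline run.
from dataclasses import dataclass

@dataclass
class SynthesisRun:
    name: str          # floorplan / option combination (invented)
    fmax_mhz: float    # achieved clock frequency (higher is better)
    power_mw: float    # total power (lower is better)
    area_um2: float    # total cell area (lower is better)

def ppa_score(run: SynthesisRun, ref: SynthesisRun,
              w_perf: float = 0.5, w_power: float = 0.25,
              w_area: float = 0.25) -> float:
    """Weighted score: reward frequency, penalize power and area."""
    return (w_perf * run.fmax_mhz / ref.fmax_mhz
            - w_power * run.power_mw / ref.power_mw
            - w_area * run.area_um2 / ref.area_um2)

baseline = SynthesisRun("baseline", 400.0, 120.0, 250_000.0)
candidates = [
    baseline,
    SynthesisRun("multibit_prefer", 520.0, 118.0, 252_000.0),
    SynthesisRun("blockages_and_cell_exclusion", 560.0, 125.0, 258_000.0),
]

best = max(candidates, key=lambda r: ppa_score(r, baseline))
speedup = best.fmax_mhz / baseline.fmax_mhz - 1.0
print(f"best run: {best.name} ({speedup:.0%} faster than baseline)")
```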
Low Power Memory/Memristor Devices and Systems
This reprint focuses on achieving low-power computation using memristive devices. The topic was designed as a convenient reference point: it contains a mix of techniques, from the fundamental manufacturing of memristive devices all the way to applications such as physically unclonable functions, and also covers perspectives on, e.g., in-memory computing, which is inextricably linked with emerging memory devices such as memristors. Finally, the reprint contains a few articles representing how other communities (from typical CMOS design to photonics) are fighting on their own fronts in the quest towards low-power computation, as a comparison with the memristor literature. We hope that readers will enjoy discovering the articles within.
TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference
Automated co-design of machine learning models and evaluation hardware is critical for efficiently deploying such models at scale. Despite the state-of-the-art performance of transformer models, they are not yet ready for execution on resource-constrained hardware platforms. High memory requirements and low parallelizability of the transformer architecture exacerbate this problem. Recently proposed accelerators attempt to optimize the throughput and energy consumption of transformer models. However, such works are either limited to a one-sided search of the model architecture or a restricted set of off-the-shelf devices. Furthermore, previous works only accelerate model inference and not training, which requires substantially more memory and compute resources, making the problem even more challenging. To address these limitations, this work proposes a dynamic training framework, called DynaProp, that speeds up the training process and reduces memory consumption. DynaProp is a low-overhead pruning method that prunes activations and gradients at runtime. To effectively execute this method on hardware for a diverse set of transformer architectures, we propose ELECTOR, a framework that simulates transformer inference and training on a design space of accelerators. We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models with high accuracy on the given task while minimizing latency, energy consumption, and chip area. The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair while incurring 5.2× lower latency and 3.0× lower energy consumption.
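For intuition, the following sketch shows runtime magnitude pruning in the spirit of DynaProp: activation and gradient entries below a dynamic threshold are zeroed so that their downstream work can be skipped. The top-k-by-magnitude criterion and the tensors here are assumptions for illustration, not the paper's exact method.

```python
# Hedged sketch of runtime magnitude pruning (DynaProp-like, assumed
# criterion): zero out small-magnitude activations and gradients on the fly.
import numpy as np

def prune_by_magnitude(x: np.ndarray, keep_fraction: float = 0.5) -> np.ndarray:
    """Zero all but the largest-magnitude `keep_fraction` of entries."""
    k = max(1, int(keep_fraction * x.size))
    threshold = np.partition(np.abs(x).ravel(), -k)[-k]  # k-th largest |x|
    return np.where(np.abs(x) >= threshold, x, 0.0)

rng = np.random.default_rng(0)
activations = rng.standard_normal((4, 8)).astype(np.float32)  # forward pass
grads = rng.standard_normal((4, 8)).astype(np.float32)        # backward pass

sparse_act = prune_by_magnitude(activations, keep_fraction=0.25)
sparse_grad = prune_by_magnitude(grads, keep_fraction=0.25)
print(f"activation sparsity: {np.mean(sparse_act == 0):.0%}")
```

The sparsity produced this way is what a hardware accelerator can exploit to skip multiply-accumulate work and reduce memory traffic during training.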
It is too hot in here! A performance, energy and heat aware scheduler for asymmetric multiprocessing processors in embedded systems
Modern architectures in self-powered devices such as mobile phones or tablet computers use asymmetric processors that allow either energy-efficient or performant computation on the same SoC. For energy-efficiency and performance reasons, the asymmetry resides in differences in CPU micro-architecture design and results in diverging raw computing capabilities. Other components, such as the processor memory subsystem, also differ, resulting in different memory transaction timings. Moreover, cache coherency between processors, based on a bus-snooping protocol, introduces memory-latency peculiarities that depend on the processors' operating frequencies. All these differences lead to challenging decisions on both application schedulability and processor operating frequencies. In addition, because of the small form factor of such embedded systems, these devices generally cannot afford active cooling systems; thermal mitigation therefore relies on dynamic software solutions. Current operating systems for embedded systems, such as Linux or Android, do not consider all these particularities. As such, they often fail to satisfy user expectations of a powerful device with long battery life. To remedy this situation, this thesis proposes a unified approach to deliver high-performance and energy-efficient computation in each of its flavours, considering the memory subsystem and all computation units available in the system. Performance is maximized even when the device is under heavy thermal constraints. The proposed unified solution is based on accurate models targeting both performance and thermal behaviour and resides at the operating system kernel level to manage all running applications in a global manner. In particular, the performance model considers both the computation units and the memory subsystem of the symmetric or asymmetric processors present in embedded devices, while the thermal model relies on the accurate physical thermal properties of the device. Using these models, application schedulability and processor frequency-scaling decisions to either maximize performance or energy efficiency within a thermal budget are studied extensively. To cover a large range of application behaviour, both models are built and designed using a generative workload that considers fine-grain details of the underlying microarchitecture of the SoC; the approach can therefore be derived and applied to multiple devices with little effort. An extended evaluation on real-world benchmarks for high-performance and general computing, as well as on common applications targeting the mobile and tablet market, shows the accuracy and completeness of the models used in this unified approach to deliver high performance and energy efficiency under tight thermal constraints for embedded devices.
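As a toy illustration of the kind of decision such a scheduler makes, the Python sketch below picks an operating point for big/LITTLE clusters under a thermal budget. The operating points, the linear steady-state thermal model, and the temperature limit are invented placeholders, not the thesis's models.

```python
# Illustrative sketch only: choose a cluster/frequency operating point
# under a thermal budget. All numbers and the thermal model are assumptions.
OPERATING_POINTS = [  # (cluster, freq_mhz, perf_score, power_mw)
    ("little", 600, 1.0, 150),
    ("little", 1200, 1.8, 400),
    ("big", 1000, 2.5, 700),
    ("big", 2000, 4.4, 1800),
]

def steady_state_temp(power_mw: float, ambient_c: float = 25.0,
                      r_thermal_c_per_w: float = 20.0) -> float:
    """Crude steady-state temperature: ambient + R_th * P (assumed model)."""
    return ambient_c + r_thermal_c_per_w * power_mw / 1000.0

def pick_point(temp_limit_c: float = 60.0, prefer: str = "performance"):
    feasible = [p for p in OPERATING_POINTS
                if steady_state_temp(p[3]) <= temp_limit_c]
    if prefer == "performance":
        return max(feasible, key=lambda p: p[2])      # fastest within budget
    return max(feasible, key=lambda p: p[2] / p[3])   # best perf per mW

print(pick_point(temp_limit_c=60.0, prefer="performance"))
```

In this toy model the big cluster's highest frequency exceeds the thermal budget, so the scheduler settles on the fastest point that stays within it, mirroring the performance-within-a-thermal-budget objective described above.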
Combining dynamic and static scheduling in high-level synthesis
Field Programmable Gate Arrays (FPGAs) are starting to become mainstream devices for custom computing, particularly deployed in data centres. However, using these FPGA devices requires familiarity with digital design at a low abstraction level. In order to enable software engineers without a hardware background to design custom hardware, high-level synthesis (HLS) tools automatically transform a high-level program, for example in C/C++, into a low-level hardware description.
A central task in HLS is scheduling: the allocation of operations to clock cycles. The classic approach to scheduling is static, in which each operation is mapped to a clock cycle at compile time, but recent years have seen the emergence of dynamic scheduling, in which an operation’s clock cycle is only determined at run-time. Both approaches have their merits: static scheduling can lead to simpler circuitry and more resource sharing, while dynamic scheduling can lead to faster hardware when the computation has a non-trivial control flow.
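To make the static case concrete, here is a minimal Python sketch of ASAP (as-soon-as-possible) scheduling, a textbook starting point for compile-time schedulers: each operation is assigned the earliest clock cycle at which all of its producers have finished. The tiny dataflow graph and unit latencies are invented for illustration.

```python
# Minimal sketch of static (compile-time) scheduling: ASAP assignment of
# operations to clock cycles. Graph and latencies are invented examples.
OP_LATENCY = {"load": 2, "mul": 3, "add": 1}

# op name -> (op type, list of producer ops it depends on)
DFG = {
    "a": ("load", []),
    "b": ("load", []),
    "m": ("mul", ["a", "b"]),
    "s": ("add", ["m", "a"]),
}

def asap_schedule(dfg):
    """Return the compile-time start cycle of every operation."""
    start = {}
    def cycle(op):
        if op not in start:
            _, deps = dfg[op]
            # an op may start once all of its producers have finished
            start[op] = max((cycle(d) + OP_LATENCY[dfg[d][0]] for d in deps),
                            default=0)
        return start[op]
    for op in dfg:
        cycle(op)
    return start

print(asap_schedule(DFG))  # {'a': 0, 'b': 0, 'm': 2, 's': 5}
```

A dynamically scheduled circuit would instead resolve these dependencies at run-time with handshaking, which is what makes it faster on non-trivial control flow at the cost of extra circuitry.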
This thesis proposes a scheduling approach that combines the best of both worlds. My idea is to use existing program analysis techniques from software design, such as probabilistic analysis and formal verification, to optimize the HLS hardware. First, this thesis proposes a tool named DASS that uses a heuristic-based approach to identify the code regions in the input program that are amenable to static scheduling and synthesises them into statically scheduled components, also known as static islands, leaving the top-level hardware dynamically scheduled. Second, this thesis addresses a problem with this approach: the analysis of static islands and of their dynamically scheduled surroundings is carried out separately, with each treating the other as a black box. We apply static analysis, including dependence analysis between static islands and their dynamically scheduled surroundings, to optimize the offsets of static islands for high performance. We also apply probabilistic analysis to estimate the performance of the dynamically scheduled part and use this information to optimize the static islands for high area efficiency. Finally, this thesis addresses the problem of conservatism in sequential control-flow designs, which can limit the throughput of the hardware. We show that this challenge can be solved by formally proving that certain control flows can be safely parallelised for high performance. This thesis demonstrates how to use automated formal verification to find out-of-order loop pipelining solutions and multi-threading solutions from a sequential program.
AI/ML Algorithms and Applications in VLSI Design and Technology
An evident challenge ahead for the integrated circuit (IC) industry in the nanometer regime is the investigation and development of methods that can reduce the design complexity ensuing from growing process variations and curtail the turnaround time of chip manufacturing. Conventional methodologies employed for such tasks are largely manual and thus time-consuming and resource-intensive. In contrast, the unique learning strategies of artificial intelligence (AI) provide numerous exciting automated approaches for handling complex and data-intensive tasks in very-large-scale integration (VLSI) design and testing. Employing AI and machine learning (ML) algorithms in VLSI design and manufacturing reduces the time and effort needed to understand and process the data within and across different abstraction levels via automated learning algorithms. This, in turn, improves IC yield and reduces manufacturing turnaround time. This paper thoroughly reviews the AI/ML automated approaches introduced in the past for VLSI design and manufacturing. Moreover, we discuss the scope of AI/ML applications at various abstraction levels in the future to revolutionize the field of VLSI design, aiming for high-speed, highly intelligent, and efficient implementations.
On the Reliability of Neural Networks Implemented on SRAM-based FPGAs for Low-cost Satellites
Recent developments in neural network inference frameworks for Field-Programmable Gate Arrays (FPGAs) enable the rapid deployment of neural network applications on low-power FPGA devices. FPGAs are a promising platform for implementing neural network capabilities on board satellites, thanks to the high energy efficiency of quantised neural networks on FPGAs. Furthermore, the reconfigurability of FPGAs allows neural network accelerators to share the FPGA with other onboard computer systems for reduced hardware complexity. However, the reliability of existing neural network inference frameworks on commercial FPGA devices against radiation-induced upsets has not previously been studied.
The reliability of neural network applications on FPGAs is complicated by the perceptrons' inherent algorithm-based fault tolerance, quantisation techniques, the varying sensitivity of non-neural layers such as pooling layers, the architecture of the accelerator, and the software stack. This thesis explores the effect of single event upsets (SEUs) in potential spaceborne FPGA-based neural network applications using fully connected and convolutional networks, on applications using binary, 4-bit and 8-bit quantisation levels, and on applications created with both the FINN and Vitis AI frameworks. We study the failure modes in neural network applications caused by SEUs, including loss of accuracy, reduction of throughput or timeout, and catastrophic system failure on FPGA SoCs.
We conducted fault injection experiments on fully connected and convolutional neural networks (CNNs) trained to classify images from the MNIST handwritten digits dataset and the Airbus ship detection dataset. We found that SEUs have an insignificant impact on fully connected binary networks trained on the MNIST dataset. However, the more complex CNN applications created with the FINN and Vitis AI frameworks showed much higher sensitivity to SEUs and had more failure modes, including loss of accuracy, hardware hang-up, and even catastrophic failure in the OS of SoC devices due to erroneous driver behaviour. We found that the SEU cross-section of model-specific neural network accelerators such as FINN can be reduced significantly by quantising the network to a lower precision. We also studied the efficacy of fault-tolerant design techniques, including full TMR and partial TMR, on the binary neural network and the FINN accelerator.
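The sketch below illustrates the general shape of such a fault-injection experiment: a single random bit flip (an SEU) is injected into an 8-bit quantized weight tensor of a toy linear classifier, and the accuracy impact is measured against golden outputs. The model, data, and injection point are placeholders, not the FINN or Vitis AI setups used in the thesis.

```python
# Hedged sketch of SEU fault injection into int8 weights (toy setup only).
import numpy as np

rng = np.random.default_rng(42)

def inject_seu(weights_int8: np.ndarray) -> np.ndarray:
    """Flip one random bit in one random int8 weight (single event upset)."""
    faulty = weights_int8.copy()
    flat = faulty.view(np.uint8).ravel()          # bit-level view of weights
    idx = rng.integers(flat.size)
    flat[idx] ^= np.uint8(1 << rng.integers(8))   # flip a random bit
    return faulty

def accuracy(weights: np.ndarray, x: np.ndarray, labels: np.ndarray) -> float:
    logits = x @ weights.astype(np.float32)       # toy linear classifier
    return float(np.mean(np.argmax(logits, axis=1) == labels))

w = rng.integers(-128, 128, size=(16, 10), dtype=np.int8)
x = rng.standard_normal((256, 16)).astype(np.float32)
labels = np.argmax(x @ w.astype(np.float32), axis=1)  # fault-free "golden" outputs

w_faulty = inject_seu(w)
print(f"accuracy after one SEU: {accuracy(w_faulty, x, labels):.2%}")
```

Repeating this over many injections and bit positions yields the kind of sensitivity statistics (cross-sections per failure mode) that the thesis reports for the real accelerators.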
Analysis and Mitigation of Remote Side-Channel and Fault Attacks on the Electrical Level
The ongoing miniaturization of integrated circuits is approaching physical limits; single-atom transistors, for example, represent one possible lower bound on feature sizes. Moreover, manufacturing the latest generations of microchips is nowadays financially feasible only for large multinational companies. As a result of this development, miniaturization is no longer the driving force behind further increases in the performance of electronic components. Instead, classical computer architectures with generic processors are evolving into heterogeneous systems with high parallelism and specialized accelerators. In these heterogeneous systems, however, protecting private data against attackers also becomes increasingly difficult. New kinds of hardware components, new kinds of applications, and a generally increased complexity are some of the factors that make security in such systems a challenge. Cryptographic algorithms are often truly secure only under certain assumptions about the attacker. For example, it is often assumed that the attacker can access only the inputs and outputs of a module, while internal signals and intermediate values remain hidden. In real implementations, however, side-channel and fault attacks expose the limits of this so-called black-box model. In side-channel attacks, the attacker exploits data-dependent observables such as power consumption or electromagnetic emanation, whereas in fault attacks the computation is actively disturbed and the faulty outputs are used to recover the secret data. This class of implementation attacks was originally considered only in the context of a local attacker with access to the target device. However, attacks based on measuring the timing of certain memory accesses have already shown that the threat also exists from remote attackers.
This thesis addresses the threat of remote side-channel and fault attacks, which is closely linked to the trend towards more heterogeneous systems. One example of novel hardware in heterogeneous computing is Field-Programmable Gate Arrays (FPGAs), with which almost arbitrary circuits can be realized in programmable logic. These logic chips are already deployed as accelerators both in the cloud and in end-user devices. However, it has been shown how the flexibility of these accelerators can be exploited to implement sensors that estimate the supply voltage, and how, through a particular way of activating large amounts of logic, computations in other circuits can be disturbed to mount fault attacks. This threat is analyzed further here, for example by extending existing attacks, and mitigation strategies against it are developed.
Designing Effective Logic Obfuscation: Exploring Beyond Gate-Level Boundaries
The need for high-end performance and cost savings has driven hardware design houses to outsource integrated circuit (IC) fabrication to untrusted manufacturing facilities. During fabrication, the entire chip design is exposed to these potentially malicious facilities, raising concerns of intellectual property (IP) piracy, reverse engineering, and counterfeiting. This is a major concern of both government and private organizations, especially in the context of military hardware. Logic obfuscation techniques have been proposed to prevent these supply-chain attacks. These techniques lock a chip by inserting additional key logic into combinational blocks of a circuit. The resulting design only exhibits correct functionality when a correct key is applied after fabrication. To date, the majority of obfuscation research centers on evaluating combinational constructions with gate-level criteria. However, this approach ignores critical high-level context, such as the interaction between modules and application error resilience. For this dissertation, we move beyond the traditional gate-level view of logic obfuscation, developing criteria and methodologies to design and evaluate obfuscated circuits for hardware-oriented security guarantees that transcend gate-level boundaries.
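As a minimal sketch of the key-logic idea (not any specific scheme from this dissertation), the Python snippet below locks a two-gate circuit with XOR key gates: each gate's output passes through unchanged only when the matching key bit is correct, and any wrong key corrupts the truth table. The example circuit is invented; real schemes mix XOR and XNOR key gates so that the unlocking key is not trivially all zeros.

```python
# Minimal sketch of XOR-based combinational logic locking (invented circuit).
def locked_circuit(a: int, b: int, c: int, key: tuple) -> int:
    k0, k1 = key
    n1 = (a & b) ^ k0        # AND gate, locked with key bit k0
    n2 = (n1 | c) ^ k1       # OR gate, locked with key bit k1
    return n2

CORRECT_KEY = (0, 0)         # XOR with 0 is the identity, so (0, 0) unlocks

def error_rate(key: tuple) -> float:
    """Fraction of input patterns on which a key gives a wrong output."""
    inputs = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    golden = [locked_circuit(a, b, c, CORRECT_KEY) for a, b, c in inputs]
    trial = [locked_circuit(a, b, c, key) for a, b, c in inputs]
    return sum(g != t for g, t in zip(golden, trial)) / len(inputs)

for key in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(key, f"error rate: {error_rate(key):.0%}")
```

The per-key error rates printed here are exactly the gate-level criterion that, as the dissertation argues, misses higher-level context such as inter-module interaction and application error resilience.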
To begin our work, we characterize the security of obfuscation when viewed in the context of a larger IC and consider how to effectively apply logic obfuscation for security beyond gate-level boundaries. We derive a fundamental trade-off, underlying all logic obfuscation, between security and attack resilience. We then develop an open-source, GEM5-based simulator called ObfusGEM, which evaluates logic obfuscation at the architecture/application level in processor ICs. Using ObfusGEM, we perform an architectural design-space exploration of logic obfuscation in processor ICs. This exploration indicates that current obfuscation schemes cannot simultaneously achieve security and attack-resilience goals. Based on the lessons learned from this design-space exploration, we explore two orthogonal approaches to designing ICs with strong security guarantees beyond gate-level boundaries.
For the first approach, we consider how logic obfuscation constructions can be modified to overcome the limitations identified in our design-space exploration. This approach results in the development of three novel obfuscation techniques targeted at securing three distinct applications. The first technique is Trace Logic Locking, which enhances existing obfuscation techniques to provably expand the derived trade-off between security and attack resilience. The second technique is Memory Locking, which defines an automatable approach to processor design obfuscation by locking the analog timing effects that govern the function of on-chip SRAM arrays. The third technique is High Error Rate Keys, which protect probabilistic circuits against a SAT-based attacker by hiding the correct secret key value under stochastic noise. We demonstrate that all three techniques are capable of overcoming the limitations of obfuscation when viewed beyond gate-level boundaries in their respective applications.
For the second approach, we consider how architectural design decisions can influence hardware security. We begin by exploring security-aware architecture design, an approach in which minor architectural modifications are identified and applied to improve security in processor ICs. We then develop resource-binding algorithms for high-level synthesis that optimally bind operations onto obfuscated functional units to amplify security guarantees. In both cases, we show that by designing logic obfuscation using architectural context, a designer can secure ICs beyond gate-level boundaries despite the presence of the rigid trade-off that rendered prior obfuscation techniques insecure.