5 research outputs found

    Fehlertolerante Mehrkernprozessoren fĂĽr gemischt-kritische Echtzeitsysteme

    Get PDF
    Current and future computing systems must be appropriately designed to cope with random hardware faults in order to provide a dependable service and correct functionality. Dependability has many facets to be addressed when designing a system and that is specially challenging in mixed-critical real-time systems, where safety standards play an important role and where responding in time can be as important as responding correctly or even responding at all. The thesis addresses the dependability of mixed-critical real-time systems, considering three important requirements: integrity, resilience and real-time. More specifically, it looks into the architectural and performance aspects of achieving dependability, concentrating its scope on error detection and handling in hardware -- more specifically in the Network-on-Chip (NoC), the backbone of modern MPSoC -- and on the performance of error handling and recovery in software. The thesis starts by looking at the impacts of random hardware faults on the NoC and on the system, with special focus on soft errors. Then, it addresses the uncovered weaknesses in the NoC by proposing a resilient NoC for mixed-critical real-time systems that is able to provide a highly reliable service with transparent protection for the applications. Formal communication time analysis is provided with common ARQ protocols modeled for NoCs and including a novel ARQ-based protocol optimized for DMAs. After addressing the efficient use of ARQ-based protocols in NoCs, the thesis proposes the Advanced Integrity Q-service (AIQ), a low-overhead mechanism to achieve integrity and real-time guarantees of NoC transactions on an End-to-End (E2E) basis. Inspired by transactions in distributed systems, the mechanism differs from the previous approach in that it does not provide error recovery in hardware but delegates the task to software, making use of existing functionality in cross-layer fault-tolerance solutions. Finally, the thesis addresses error handling in software as seen in cross-layer approaches. It addresses the performance of replicated software execution in many-core platforms. Replicated software execution provides protection to the system against random hardware faults. It relies on hardware-supported error detection and error handling in software. The replica-aware co-scheduling is proposed to achieve high performance with replicated execution, which is not possible with standard real-time schedulers.Um einen zuverlässigen Betrieb und korrekte Funktionalität zu gewährleisten, müssen aktuelle und zukünftige Computersysteme so ausgelegt werden, dass sie mit diesen Fehlern umgehen können. Zuverlässigkeit hat viele Aspekte, die bei der Entwicklung eines Systems berücksichtigt werden müssen. Das gilt insbesondere für Echtzeitsysteme mit gemischter Kritikalität, bei denen Sicherheitsstandards, die ein korrektes und rechtzeitiges Verhalten fordern, eine wichtige Rolle spielen. Diese Dissertation befasst sich mit der Zuverlässigkeit von gemischt-kritischen Echtzeitsystemen unter Berücksichtigung von drei wichtigen Anforderungen: Integrität, Resilienz und Echtzeit. Genauer gesagt, behandelt sie Architektur- und Leistungsaspekte die notwendig sind um Zuverlässigkeit zu erreichen, wobei der Schwerpunkt auf der Fehlererkennung und -behandlung in der Hardware – genauer gesagt im Network-on-Chip (NoC), dem Rückgrat des modernen MPSoC – und auf der Leistung der Fehlerbehandlung und -behebung in der Software liegt. Die Arbeit beginnt mit der Untersuchung der Auswirkung von zufälligen Hardwarefehlern auf das NoC und das System, wobei der Schwerpunkt auf weichen Fehler (soft errors) liegt. Anschließend werden die aufgedeckten Schwachstellen im NoC behoben, indem ein widerstandsfähiges NoC für gemischt-kritische Echtzeitsysteme vorgeschlagen wird, das in der Lage ist, einen höchst zuverlässigen Betrieb mit transparentem Schutz für die Anwendungen zu bieten. Nach der Auseinandersetzung mit der effizienten Nutzung von ARQ-basierten Protokolle in NoCs, wird der Advanced Integrity Q-Service (AIQ) vorgestellt, der ein Mechanismus mit geringem Overhead ist, um Integrität und Echtzeit-Garantien von NoC-Transaktionen auf Ende-zu-Ende (E2E)-Basis zu erreichen. Inspiriert von Transaktionen in verteilten Systemen unterscheidet sich der Mechanismus vom bisherigen Konzept dadurch, dass er keine Fehlerbehebung in der Hardware vorsieht, sondern diese Aufgabe an die Software delegiert. Schließlich befasst sich die Dissertation mit der Fehlerbehandlung in Software, wie sie in schichtübergreifenden Methoden zu sehen ist. Sie behandelt die Leistung der replizierten Software-Ausführung in Many-Core-Plattformen. Es setzt auf hardwaregestützte Fehlererkennung und Fehlerbehandlung in der Software. Das Replika-bewusste Co-Scheduling wird vorgeschlagen, um eine hohe Performance bei replizierter Ausführung zu erreichen, was mit Standard-Echtzeit-Schedulern nicht möglich ist

    Decompose and Conquer: Addressing Evasive Errors in Systems on Chip

    Full text link
    Modern computer chips comprise many components, including microprocessor cores, memory modules, on-chip networks, and accelerators. Such system-on-chip (SoC) designs are deployed in a variety of computing devices: from internet-of-things, to smartphones, to personal computers, to data centers. In this dissertation, we discuss evasive errors in SoC designs and how these errors can be addressed efficiently. In particular, we focus on two types of errors: design bugs and permanent faults. Design bugs originate from the limited amount of time allowed for design verification and validation. Thus, they are often found in functional features that are rarely activated. Complete functional verification, which can eliminate design bugs, is extremely time-consuming, thus impractical in modern complex SoC designs. Permanent faults are caused by failures of fragile transistors in nano-scale semiconductor manufacturing processes. Indeed, weak transistors may wear out unexpectedly within the lifespan of the design. Hardware structures that reduce the occurrence of permanent faults incur significant silicon area or performance overheads, thus they are infeasible for most cost-sensitive SoC designs. To tackle and overcome these evasive errors efficiently, we propose to leverage the principle of decomposition to lower the complexity of the software analysis or the hardware structures involved. To this end, we present several decomposition techniques, specific to major SoC components. We first focus on microprocessor cores, by presenting a lightweight bug-masking analysis that decomposes a program into individual instructions to identify if a design bug would be masked by the program's execution. We then move to memory subsystems: there, we offer an efficient memory consistency testing framework to detect buggy memory-ordering behaviors, which decomposes the memory-ordering graph into small components based on incremental differences. We also propose a microarchitectural patching solution for memory subsystem bugs, which augments each core node with a small distributed programmable logic, instead of including a global patching module. In the context of on-chip networks, we propose two routing reconfiguration algorithms that bypass faulty network resources. The first computes short-term routes in a distributed fashion, localized to the fault region. The second decomposes application-aware routing computation into simple routing rules so to quickly find deadlock-free, application-optimized routes in a fault-ridden network. Finally, we consider general accelerator modules in SoC designs. When a system includes many accelerators, there are a variety of interactions among them that must be verified to catch buggy interactions. To this end, we decompose such inter-module communication into basic interaction elements, which can be reassembled into new, interesting tests. Overall, we show that the decomposition of complex software algorithms and hardware structures can significantly reduce overheads: up to three orders of magnitude in the bug-masking analysis and the application-aware routing, approximately 50 times in the routing reconfiguration latency, and 5 times on average in the memory-ordering graph checking. These overhead reductions come with losses in error coverage: 23% undetected bug-masking incidents, 39% non-patchable memory bugs, and occasionally we overlook rare patterns of multiple faults. In this dissertation, we discuss the ideas and their trade-offs, and present future research directions.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/147637/1/doowon_1.pd

    Adaptive Distributed Architectures for Future Semiconductor Technologies.

    Full text link
    Year after year semiconductor manufacturing has been able to integrate more components in a single computer chip. These improvements have been possible through systematic shrinking in the size of its basic computational element, the transistor. This trend has allowed computers to progressively become faster, more efficient and less expensive. As this trend continues, experts foresee that current computer designs will face new challenges, in utilizing the minuscule devices made available by future semiconductor technologies. Today's microprocessor designs are not fit to overcome these challenges, since they are constrained by their inability to handle component failures by their lack of adaptability to a wide range of custom modules optimized for specific applications and by their limited design modularity. The focus of this thesis is to develop original computer architectures, that can not only survive these new challenges, but also leverage the vast number of transistors available to unlock better performance and efficiency. The work explores and evaluates new software and hardware techniques to enable the development of novel adaptive and modular computer designs. The thesis first explores an infrastructure to quantitatively assess the fallacies of current systems and their inadequacy to operate on unreliable silicon. In light of these findings, specific solutions are then proposed to strengthen digital system architectures, both through hardware and software techniques. The thesis culminates with the proposal of a radically new architecture design that can fully adapt dynamically to operate on the hardware resources available on chip, however limited or abundant those may be.PHDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/102405/1/apellegr_1.pd

    Dependable Embedded Systems

    Get PDF
    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. This book introduces the most prominent reliability concerns from today’s points of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level such circuit level or system level alone, the focus of this book is to deal with the different reliability challenges across different levels starting from the physical level all the way to the system level (cross-layer approaches). The book aims at demonstrating how new hardware/software co-design solution can be proposed to ef-fectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are pro-actively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many core systems

    Asynchronous techniques for new generation variation-tolerant FPGA

    Get PDF
    PhD ThesisThis thesis presents a practical scenario for asynchronous logic implementation that would benefit the modern Field-Programmable Gate Arrays (FPGAs) technology in improving reliability. A method based on Asynchronously-Assisted Logic (AAL) blocks is proposed here in order to provide the right degree of variation tolerance, preserve as much of the traditional FPGAs structure as possible, and make use of asynchrony only when necessary or beneficial for functionality. The newly proposed AAL introduces extra underlying hard-blocks that support asynchronous interaction only when needed and at minimum overhead. This has the potential to avoid the obstacles to the progress of asynchronous designs, particularly in terms of area and power overheads. The proposed approach provides a solution that is complementary to existing variation tolerance techniques such as the late-binding technique, but improves the reliability of the system as well as reducing the design’s margin headroom when implemented on programmable logic devices (PLDs) or FPGAs. The proposed method suggests the deployment of configurable AAL blocks to reinforce only the variation-critical paths (VCPs) with the help of variation maps, rather than re-mapping and re-routing. The layout level results for this method's worst case increase in the CLB’s overall size only of 6.3%. The proposed strategy retains the structure of the global interconnect resources that occupy the lion’s share of the modern FPGA’s soft fabric, and yet permits the dual-rail iv completion-detection (DR-CD) protocol without the need to globally double the interconnect resources. Simulation results of global and interconnect voltage variations demonstrate the robustness of the method
    corecore