
    DeSyRe: On-Demand System Reliability

    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect- and fault-free system would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. To reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.

    A Test Vector Minimization Algorithm Based on Delta Debugging for Post-Silicon Validation of PCIe Rootport

    In silicon hardware design, such as designing PCIe devices, design verification is an essential part of the design process, whereby the devices are subjected to a series of tests that verify the functionality. However, manual debugging is still widely used in post-silicon validation and is a major bottleneck in the validation process, because a large number of test vectors has to be analyzed, which slows the process down. To solve the problem, a test vector minimizer algorithm is proposed to eliminate redundant test vectors that do not contribute to reproduction of a test failure, hence improving the debug throughput. The proposed methodology is inspired by the Delta Debugging algorithm, which has been used in automated software debugging but not in post-silicon hardware debugging. The minimizer operates on the principle of binary partitioning of the test vectors, iteratively testing each subset (or the complement of a subset) on a post-silicon System-Under-Test (SUT) to identify and eliminate redundant test vectors. Test results using test vector sets containing deliberately introduced erroneous test vectors show that the minimizer is able to isolate the erroneous test vectors. In test cases containing up to 10,000 test vectors, the minimizer requires about 16 ns per test vector when only one erroneous test vector is present. In a test case with 1,000 vectors including erroneous vectors, the same minimizer requires about 140 μs per injected erroneous test vector. Thus, the minimizer's CPU consumption is significantly smaller than the typical time a test spends running on the SUT. The factors that most significantly impact the performance of the algorithm are the number of erroneous test vectors and the distribution (spacing) of the erroneous vectors; the effects of the total number of test vectors and of the position of the erroneous vectors are relatively minor. The minimization algorithm was therefore most effective for cases with only a few erroneous test vectors within a large set of test vectors.
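    The binary-partitioning loop described above follows the classic ddmin scheme from Delta Debugging. The sketch below is a minimal Python illustration of that loop; the reproduces_failure callable is a hypothetical hook standing in for replaying a candidate subset on the post-silicon SUT, and the thesis' actual harness, vector format, and timing instrumentation will differ.

        # Minimal ddmin-style sketch (illustrative only): `vectors` is the failing
        # test-vector list and `reproduces_failure(subset)` is a hypothetical hook
        # that replays a subset on the SUT and returns True if the failure recurs.
        def minimize(vectors, reproduces_failure):
            n = 2  # start with binary partitioning
            while len(vectors) >= 2:
                chunk = max(1, len(vectors) // n)
                subsets = [vectors[i:i + chunk] for i in range(0, len(vectors), chunk)]
                reduced = False
                for i, subset in enumerate(subsets):
                    complement = [v for j, s in enumerate(subsets) if j != i for v in s]
                    if reproduces_failure(subset):       # failure isolated within one subset
                        vectors, n, reduced = subset, 2, True
                        break
                    if reproduces_failure(complement):   # the removed subset was redundant
                        vectors, n, reduced = complement, max(n - 1, 2), True
                        break
                if not reduced:
                    if n >= len(vectors):                # already at single-vector granularity
                        break
                    n = min(len(vectors), n * 2)         # otherwise refine the partition
            return vectors

        # Toy check: if only vector 7 triggers the failure, the minimizer isolates it.
        assert minimize(list(range(20)), lambda s: 7 in s) == [7]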

    Integration of FAPEC as data compressor stage in a SpaceFibre link

    SpaceFibre is a new technology for use onboard spacecraft that provides point-to-point and networked interconnections at 3.125 Gbit/s in flight-qualified technology. SpaceFibre is a European Space Agency (ESA) initiative and will replace the ubiquitous SpaceWire for high-speed applications in space. FAPEC is a lossless data compression algorithm that typically offers better ratios than the CCSDS 121.0 Lossless Data Compression Recommendation on realistic data sets. FAPEC was designed for space communications, where requirements are very strong in terms of energy consumption and efficiency. In this project we have demonstrated that FAPEC can be easily integrated on top of SpaceFibre to reduce the amount of information that the spacecraft network has to deal with. The integration of FAPEC with SpaceFibre has been successfully validated on a representative FPGA platform. In the developed design FAPEC operated at ~12 Msamples/s (~200 Mbit/s) using a Xilinx Spartan-6, but it is expected to reach Gbit/s speeds with some additional work. The speed of the algorithm has been improved by a factor of 6 while the resource usage remains low, around 2% of a Xilinx Virtex-5QV or a Microsemi RTG4. The combination of these two technologies can help to reduce the large amounts of data generated by some satellite instruments in a transparent way, without the need for user intervention, and provides a solution to the increasing data volumes in spacecraft. Consequently, the combination of FAPEC with SpaceFibre can help to save mass, reduce power consumption, and reduce system complexity.
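    As a rough illustration of the layering the project describes (a transparent compression stage in front of the link), the Python sketch below uses zlib as a stand-in for FAPEC and a placeholder send_frame callable for the SpaceFibre interface; both names and the frame size are assumptions for illustration only and do not reflect the real IP cores.

        # Conceptual sketch: compress an instrument data block, then hand it to the
        # link layer in fixed-size frames. zlib stands in for FAPEC; `send_frame`
        # stands in for a SpaceFibre virtual-channel transmit hook.
        import zlib

        FRAME_PAYLOAD = 256  # bytes per link frame (illustrative value)

        def send_compressed(samples: bytes, send_frame) -> int:
            payload = zlib.compress(samples)          # stand-in for the FAPEC coder
            frames = 0
            for offset in range(0, len(payload), FRAME_PAYLOAD):
                send_frame(payload[offset:offset + FRAME_PAYLOAD])
                frames += 1
            return frames                             # frames actually sent over the link

        # Example: a smooth, predictable signal compresses well, so far fewer frames
        # cross the link than the raw data would need.
        raw = bytes(i % 32 for i in range(4096))
        sent = send_compressed(raw, send_frame=lambda f: None)
        assert sent < (len(raw) + FRAME_PAYLOAD - 1) // FRAME_PAYLOAD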

    Reining in the Functional Verification of Complex Processor Designs with Automation, Prioritization, and Approximation

    Our quest for faster and more efficient computing devices has led us to processor designs with enormous complexity. As a result, functional verification, which is the process of ascertaining the correctness of a processor design, takes up a lion's share of the time and cost spent on making processors. Unfortunately, functional verification is only a best-effort process that cannot completely guarantee the correctness of a design, often resulting in defective products that may have devastating consequences. Functional verification, as practiced today, is unable to cope with the complexity of current and future processor designs. In this dissertation, we identify extensive automation as the essential step towards scalable functional verification of complex processor designs. Moreover, recognizing that a complete guarantee of design correctness is impossible, we argue for systematic prioritization and prudent approximation to realize fast and far-reaching functional verification solutions. We partition the functional verification effort into three major activities: planning and test generation, test execution and bug detection, and bug diagnosis. Employing a perspective we refer to as the automation, prioritization, and approximation (APA) approach, we develop solutions that tackle challenges across these three major activities. In pursuit of efficient planning and test generation for modern systems-on-chips, we develop an automated process for identifying high-priority design aspects for verification. In addition, we enable the creation of compact test programs, which, in our experiments, were up to 11 times smaller than what would otherwise be available at the beginning of the verification effort. To tackle challenges in test execution and bug detection, we develop a group of solutions that enable the deployment of automatic and robust mechanisms for catching design flaws during high-speed functional verification. By trading accuracy for speed, these solutions allow us to unleash functional verification platforms that are over three orders of magnitude faster than traditional platforms, unearthing design flaws that are otherwise impossible to reach. Finally, we address challenges in bug diagnosis through a solution that fully automates the process of pinpointing flawed design components after detecting an error. Our solution, which identifies flawed design units with over 70% accuracy, eliminates weeks of diagnosis effort for every detected error.

    Enhancing Cloud System Runtime to Address Complex Failures

    As the reliance on cloud systems intensifies in our progressively digital world, understanding and reinforcing their reliability becomes more crucial than ever. Despite impressive advancements in augmenting the resilience of cloud systems, the growing incidence of complex failures now poses a substantial challenge to the availability of these systems. With cloud systems continuing to scale and increase in complexity, failures not only become more elusive to detect but can also lead to more catastrophic consequences. Such failures question the foundational premises of conventional fault-tolerance designs, necessitating the creation of novel system designs to counteract them. This dissertation aims to enhance distributed systems' capabilities to detect, localize, and react to complex failures at runtime. To this end, this dissertation makes contributions to address three emerging categories of failures in cloud systems. The first part delves into the investigation of partial failures, introducing OmegaGen, a tool adept at generating tailored checkers for detecting and localizing such failures. The second part grapples with silent semantic failures prevalent in cloud systems, showcasing our study findings, and introducing Oathkeeper, a tool that leverages past failures to infer rules and expose these silent issues. The third part explores solutions to slow failures via RESIN, a framework specifically designed to detect, diagnose, and mitigate memory leaks in cloud-scale infrastructures, developed in collaboration with Microsoft Azure. The dissertation concludes by offering insights into future directions for the construction of reliable cloud systems.
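    As a conceptual illustration of checker-based partial-failure detection (an illustration of the idea only, not the checkers OmegaGen actually generates), the Python sketch below runs one reduced "probe" per module operation under a timeout, so a probe that hangs or raises an exception points at the failing functionality.

        # Each probe is assumed to mimic a reduced version of one module operation
        # (e.g. write a sentinel record for a logging module). A probe that times
        # out or raises localizes the partial failure to that operation.
        import concurrent.futures
        import time

        def watchdog_sweep(probes, timeout_s=1.0):
            suspects = set()
            with concurrent.futures.ThreadPoolExecutor(max_workers=len(probes)) as pool:
                futures = {name: pool.submit(fn) for name, fn in probes.items()}
                for name, fut in futures.items():
                    try:
                        fut.result(timeout=timeout_s)
                    except Exception:
                        suspects.add(name)   # probe hung or failed: likely partial failure
            return suspects

        # Hypothetical example: the "flush" path hangs while "append" stays healthy.
        probes = {"append": lambda: None, "flush": lambda: time.sleep(1)}
        print(watchdog_sweep(probes, timeout_s=0.1))   # -> {'flush'}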

    DeSyRe: On-demand system reliability

    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect-/fault-free system would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. To reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints. (C) 2013 Elsevier B.V. All rights reserved.
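    A toy sketch of the management idea described above, i.e., a small, assumed-reliable part steering work away from fault-prone resources. The class and its interface are hypothetical and only illustrate the principle, not the DeSyRe runtime or its hardware substrate.

        # A trusted manager keeps a fault map of the fault-prone processing elements
        # and only places work onto elements not yet diagnosed as faulty.
        class ResourceManager:
            def __init__(self, num_elements: int):
                self.num_elements = num_elements
                self.faulty = set()                # elements diagnosed as faulty so far

            def report_fault(self, element: int) -> None:
                """Called when online testing or an error detector flags an element."""
                self.faulty.add(element)

            def place(self, task_id: int) -> int:
                """Map a task onto a healthy element (simple round-robin over survivors)."""
                healthy = [e for e in range(self.num_elements) if e not in self.faulty]
                if not healthy:
                    raise RuntimeError("no healthy elements left")
                return healthy[task_id % len(healthy)]

        # Example: element 2 fails; subsequent tasks are mapped around it.
        mgr = ResourceManager(num_elements=4)
        mgr.report_fault(2)
        assert all(mgr.place(t) != 2 for t in range(8))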

    Specification and verification of network algorithms using temporal logic

    Get PDF
    In software engineering, formal methods are mathematically based techniques used in the specification, development and verification of algorithms and programs in order to provide reliability and robustness of systems. One of the most difficult challenges for software engineering is to tackle the complexity of algorithms and software found in concurrent systems. Networked systems have come to prominence in many aspects of modern life, and therefore software engineering techniques for treating concurrency in such systems have acquired particular importance. Algorithms in the software of concurrent systems are used to accomplish certain tasks which need to comply with the properties required of the system as a whole. These properties can be broadly subdivided into 'safety properties', where the requirement is that 'nothing bad will happen', and 'liveness properties', where the requirement is that 'something good will happen'. As such, specifying network algorithms and their safety and liveness properties through formal methods is the aim of the research presented in this thesis. Since temporal logic has proved to be a successful technique in formal methods, with various practical applications owing to the availability of powerful model-checking tools such as the NuSMV model checker, we investigate the specification and verification of network algorithms using temporal logic and model checking. In the first part of the thesis, we specify and verify safety properties for network algorithms. We use temporal logic to prove the safety property of data consistency, or serializability, for a model of the execution of an unbounded number of concurrent transactions over time, which could represent software schedulers for an unknown number of transactions present in a network. In the second part of the thesis, we specify and verify the liveness properties of networked flooding algorithms. Considering the above in more detail, the first part of this thesis specifies a model of the execution of an unbounded number of concurrent transactions over time in propositional Linear Temporal Logic (LTL) in order to prove serializability. This is made possible by assuming that data items are ordered and that the transactions accessing these data items respect this order, as then there is a bound on the number of transactions that need to be considered to prove serializability. In particular, we make use of recent work which places such bounds on the number of transactions needed when data items are accessed in order but do not have to be accessed contiguously, i.e., there may be 'gaps' in the data items being accessed by individual transactions. Our aim is to specify the concurrent modification of data held on routers in a network as a transactional model. The correctness of the routing protocol, and ensuring safety and reliability, then corresponds to the serializability of the transactions. We specify an example of routing in a network and the corresponding serializability condition in LTL. This is then coded up in the NuSMV model checker and proofs are performed. The novelty of this part is that no previous research has used a method for detecting serializability and cycles for an unbounded number of transactions accessing the data on routers where the transactions' access to the data items may have gaps. In addition, linear temporal logic has not previously been used in this scenario to prove the correctness of the network system.
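    For reference, the two property classes mentioned above have the standard LTL shapes shown below; the serializability condition used in the thesis is a concrete instance of the safety pattern, though its exact formula is more involved than this generic sketch, and the placeholder propositions are assumptions for illustration.

        % Generic LTL patterns (\varphi_{bad} and \varphi_{good} are placeholders
        % for the model-specific propositions):
        \text{safety ("nothing bad ever happens"):}\quad \mathbf{G}\,\neg\varphi_{\mathit{bad}}
        \text{liveness ("something good eventually happens"):}\quad \mathbf{F}\,\varphi_{\mathit{good}}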
This part is particularly relevant to network administration protocols, where it is critical to maintain the correctness of the system. This safety property can be maintained using the presented work, in which cycles in the transactions accessing the data items can be detected by checking only a limited number of cycles rather than all possible cycles that the network transactions could cause. The second part of the thesis offers two contributions. Firstly, we specify the basic synchronous network flooding algorithm, for any fixed size of network, in LTL. The specification can be customized to any single network topology or class of topologies. A specification for the termination problem is formulated and used to compare different topologies with regard to earlier termination. We give a worked example of one topology resulting in earlier termination than another, for which we perform a formal verification using the NuSMV model checker. The novelty of the second part comes in using linear temporal logic and the NuSMV model checker to specify and verify the liveness property of the flooding algorithm. The presented work addresses a very difficult scenario in which the network nodes are memoryless. This makes detecting the termination of network flooding very complicated, especially for networks with complex topologies. In the literature, researchers have focussed on using testing and simulation to detect flooding termination. In this work, we use a rigorous technique to specify and verify the synchronous flooding algorithm and its termination. We also show that linear temporal logic and the NuSMV model checker can be used to compare synchronous flooding termination between topologies. In addition to the synchronous form of the network flooding algorithm, we further provide a formal model of bounded asynchronous network flooding by extending the synchronous flooding model to allow a sent message, non-deterministically, either to be received instantaneously or to enter a transit phase prior to being received. A generalization of 'rounds' from synchronous flooding to the asynchronous case is used as a unit of time to provide a measure of time to termination, as the number of rounds taken, for a run of an asynchronous system. The model is encoded into temporal logic and a proof obligation is given for comparing the termination times of asynchronous and synchronous systems. Worked examples are formally verified using the NuSMV model checker. This work offers a constraint-based methodology for the verification of liveness properties of software algorithms distributed across the nodes in a network.
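    As an executable companion to the flooding part (not the NuSMV model itself), the Python sketch below simulates synchronous, memoryless flooding under the usual reading of the forwarding rule, i.e., a node forwards the message in the next round to every neighbour except those it received it from, and counts rounds until no message is in flight. Termination then corresponds to a liveness property of the shape "eventually no message is ever sent again"; the topologies and the assertion are illustrative examples, not the thesis' case studies.

        def flooding_rounds(adj, initiator):
            """Rounds until synchronous memoryless flooding terminates.

            `adj` maps each node to its set of neighbours. In round 1 the initiator
            sends to all neighbours; in every later round, each node that received
            the message in the previous round forwards it to all neighbours except
            those it received it from."""
            in_flight = {(initiator, nbr) for nbr in adj[initiator]}
            rounds = 0
            while in_flight:
                rounds += 1
                received_from = {}
                for sender, receiver in in_flight:
                    received_from.setdefault(receiver, set()).add(sender)
                in_flight = {(node, nbr)
                             for node, senders in received_from.items()
                             for nbr in adj[node] if nbr not in senders}
            return rounds

        # Example: flooding from node 0 terminates earlier on a star than on a line.
        line = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
        star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
        assert flooding_rounds(star, 0) < flooding_rounds(line, 0)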