
    Decompose and Conquer: Addressing Evasive Errors in Systems on Chip

    Modern computer chips comprise many components, including microprocessor cores, memory modules, on-chip networks, and accelerators. Such system-on-chip (SoC) designs are deployed in a variety of computing devices: from internet-of-things devices, to smartphones, to personal computers, to data centers. In this dissertation, we discuss evasive errors in SoC designs and how these errors can be addressed efficiently. In particular, we focus on two types of errors: design bugs and permanent faults. Design bugs originate from the limited amount of time allowed for design verification and validation; thus, they are often found in functional features that are rarely activated. Complete functional verification, which could eliminate design bugs, is extremely time-consuming and thus impractical for modern complex SoC designs. Permanent faults are caused by failures of fragile transistors in nano-scale semiconductor manufacturing processes: weak transistors may wear out unexpectedly within the lifespan of the design. Hardware structures that reduce the occurrence of permanent faults incur significant silicon area or performance overheads, making them infeasible for most cost-sensitive SoC designs. To tackle these evasive errors efficiently, we propose to leverage the principle of decomposition to lower the complexity of the software analysis or the hardware structures involved. To this end, we present several decomposition techniques, specific to major SoC components. We first focus on microprocessor cores, presenting a lightweight bug-masking analysis that decomposes a program into individual instructions to identify whether a design bug would be masked by the program's execution. We then move to memory subsystems: there, we offer an efficient memory consistency testing framework to detect buggy memory-ordering behaviors, which decomposes the memory-ordering graph into small components based on incremental differences.
We also propose a microarchitectural patching solution for memory subsystem bugs, which augments each core node with small, distributed programmable logic instead of including a global patching module. In the context of on-chip networks, we propose two routing reconfiguration algorithms that bypass faulty network resources. The first computes short-term routes in a distributed fashion, localized to the fault region. The second decomposes application-aware routing computation into simple routing rules so as to quickly find deadlock-free, application-optimized routes in a fault-ridden network. Finally, we consider general accelerator modules in SoC designs. When a system includes many accelerators, a variety of interactions among them must be verified to catch buggy behaviors. To this end, we decompose such inter-module communication into basic interaction elements, which can be reassembled into new, interesting tests. Overall, we show that the decomposition of complex software algorithms and hardware structures can significantly reduce overheads: up to three orders of magnitude in the bug-masking analysis and the application-aware routing, approximately 50 times in the routing reconfiguration latency, and 5 times on average in the memory-ordering graph checking. These overhead reductions come with losses in error coverage: 23% undetected bug-masking incidents, 39% non-patchable memory bugs, and occasionally overlooked rare patterns of multiple faults. In this dissertation, we discuss the ideas and their trade-offs, and present future research directions.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/147637/1/doowon_1.pd
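    The component-wise, incremental checking idea described for the memory consistency framework can be illustrated with a minimal sketch. The graph representation, diff handling, and function names below are hypothetical, not the dissertation's actual implementation: a happens-before graph is split into weakly connected components, and only components touched by an incremental difference are re-checked for ordering cycles (a cycle signals an inconsistent execution).

    ```python
    from collections import defaultdict

    def weakly_connected_components(edges):
        """Group vertices of a directed graph into weakly connected components."""
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        seen, comps = set(), []
        for start in adj:
            if start in seen:
                continue
            comp, stack = set(), [start]
            while stack:
                n = stack.pop()
                if n in comp:
                    continue
                comp.add(n)
                stack.extend(adj[n] - comp)
            seen |= comp
            comps.append(comp)
        return comps

    def has_cycle(edges, vertices):
        """DFS cycle check restricted to one component's vertices."""
        adj = defaultdict(list)
        for u, v in edges:
            if u in vertices:
                adj[u].append(v)
        WHITE, GRAY, BLACK = 0, 1, 2
        color = defaultdict(int)

        def dfs(n):
            color[n] = GRAY
            for m in adj[n]:
                if color[m] == GRAY:      # back edge: ordering cycle found
                    return True
                if color[m] == WHITE and dfs(m):
                    return True
            color[n] = BLACK
            return False

        return any(color[v] == WHITE and dfs(v) for v in vertices)

    def check_incrementally(edges, changed_vertices):
        """Re-verify only the components touched by an incremental diff."""
        violations = []
        for comp in weakly_connected_components(edges):
            if comp & changed_vertices and has_cycle(edges, comp):
                violations.append(comp)
        return violations
    ```

    Because an incremental change typically touches only a few components, most of the graph is skipped on each re-check, which captures the intuition behind decomposing the memory-ordering graph.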

    Artificial Intelligence for Small Satellites Mission Autonomy

    Space mission engineering has always been recognized as a very challenging and innovative branch of engineering: since the beginning of the space race, numerous milestones, key successes and failures, improvements, and connections with other engineering domains have been reached. Despite its relatively young age, the space engineering discipline has not gone through homogeneous times: alternations of leading nations, shifts in public and private interests, and allocations of resources to different domains and goals are all examples of an intrinsic dynamism that has characterized this discipline. This dynamism is even more striking in the last two decades, in which several factors contributed to the fervour of the period. Two of the most important were certainly the increased presence and push of the commercial and private sector, and the overall intent of reducing the size of spacecraft while maintaining a comparable level of performance. A key example of the second driver is the introduction, in 1999, of a new category of space systems called CubeSats. Envisioned and designed to ease access to space for universities, by standardizing spacecraft development and by ensuring high probabilities of acceptance as piggyback customers in launches, the standard was quickly adopted not only by universities, but also by agencies and private companies. CubeSats turned out to be a disruptive innovation, and the space mission ecosystem was deeply changed by them. New mission concepts and architectures are being developed: CubeSats are now considered as secondary payloads of bigger missions, and constellations are being deployed in Low Earth Orbit to perform observation missions at a performance level once considered achievable only by traditional, full-sized spacecraft.
CubeSats, and more generally small-satellite technology, had to overcome important challenges in the last few years that were constraining the diffusion and adoption of smaller spacecraft for scientific and technology demonstration missions. Among these challenges were: the miniaturization of propulsion technologies, to enable concepts such as rendezvous and docking, or interplanetary missions; the improvement of the telecommunication state of the art for small satellites, to enable the downlink to Earth of all the data acquired during the mission; and the miniaturization of scientific instruments, to exploit CubeSats in more meaningful, scientific ways. With the size reduction and with the consolidation of the technology, many aspects of a space mission shrink in consequence: among these, costs and development and launch times can be cited. An important aspect that has not been demonstrated to scale accordingly is operations: even small-satellite missions need human operators and performant ground control centres. In addition, with the possibility of having constellations or interplanetary distributed missions, a redesign of how operations are managed is required, to cope with the innovation in space mission architectures. The present work addresses the issue of operations for small-satellite missions. The thesis presents research, carried out in several institutions (Politecnico di Torino, MIT, NASA JPL), aimed at improving the autonomy level of space missions, and in particular of small satellites. The key technology exploited in the research is Artificial Intelligence, a branch of computer science that has gained extreme interest in disciplines such as medicine, security, image recognition and language processing, and is currently making its way into space engineering as well.
The thesis focuses on three topics, and three related applications have been developed and are presented here: autonomous operations by means of event detection algorithms, intelligent failure detection on small-satellite actuator systems, and decision-making support through intelligent tradespace exploration during the preliminary design of space missions. The Artificial Intelligence technologies explored are: Machine Learning, in particular Neural Networks; Knowledge-based Systems, in particular Fuzzy Logic; and Evolutionary Algorithms, in particular Genetic Algorithms. The thesis covers the domain (small satellites), the technology (Artificial Intelligence), and the focus (mission autonomy), and presents three case studies that demonstrate the feasibility of employing Artificial Intelligence to enhance how missions are currently operated and designed.
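As one concrete illustration of the evolutionary branch mentioned above, a genetic algorithm for tradespace exploration can be sketched in a few lines. The design variables (number of satellites, antenna size) and the utility function are invented placeholders, not the thesis's actual tradespace model:

```python
import random

random.seed(0)  # reproducible demo run

def utility(design):
    """Hypothetical mission utility: reward coverage (more satellites,
    bigger antenna) while penalizing cost."""
    n_sats, antenna_m = design
    coverage = n_sats * antenna_m
    cost = 0.5 * n_sats + 2.0 * antenna_m
    return coverage - cost

def random_design():
    return (random.randint(1, 24), random.uniform(0.1, 1.0))

def mutate(design):
    """Small random perturbation of each design variable, kept in bounds."""
    n, a = design
    return (min(24, max(1, n + random.choice((-1, 0, 1)))),
            min(1.0, max(0.1, a + random.uniform(-0.05, 0.05))))

def crossover(p1, p2):
    """Exchange design variables between two parents."""
    return (p1[0], p2[1])

def evolve(generations=40, pop_size=30):
    pop = [random_design() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=utility, reverse=True)
        elite = pop[: pop_size // 3]          # selection: keep the fittest third
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=utility)
```

In a real tradespace study the utility function would be replaced by mission-level figures of merit (coverage, data return, cost models), but the selection/crossover/mutation loop has the same shape.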

    TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems

    Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems. However, performing RCA on modern microservice systems can be challenging due to their large scale, as they usually comprise hundreds of components, leading to significant human effort. This paper proposes TraceDiag, an end-to-end RCA framework that addresses these challenges for large-scale microservice systems. It leverages reinforcement learning to learn a pruning policy for the service dependency graph that automatically eliminates redundant components, thereby significantly improving RCA efficiency. The learned pruning policy is interpretable and fully adaptive to new RCA instances. With the pruned graph, a causal-based method can be executed with high accuracy and efficiency. The proposed TraceDiag framework is evaluated on real data traces collected from the Microsoft Exchange system, and demonstrates superior performance compared to state-of-the-art RCA approaches. Notably, TraceDiag has been integrated as a critical component in Microsoft M365 Exchange, resulting in a significant improvement in the system's reliability and a considerable reduction in the human effort required for RCA.
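    The prune-then-localize idea can be sketched minimally. The anomaly scores, fixed threshold, and sink heuristic below are simplifications invented for illustration; TraceDiag's actual policy is learned with reinforcement learning, and its causal step is far more sophisticated:

    ```python
    def prune(dependencies, anomaly_score, threshold=0.5):
        """Drop services whose anomaly score falls below the policy threshold,
        keeping only dependency edges between retained services."""
        keep = {s for s in anomaly_score if anomaly_score[s] >= threshold}
        return {(u, v) for (u, v) in dependencies if u in keep and v in keep}

    def root_cause(pruned_edges, anomaly_score):
        """Heuristic causal step: among retained services, blame the most
        anomalous sink (a callee that calls no other retained service),
        since failures tend to propagate upstream to its callers."""
        callees = {v for _, v in pruned_edges}
        callers = {u for u, _ in pruned_edges}
        sinks = (callees - callers) or (callees | callers)
        return max(sinks, key=anomaly_score.get)
    ```

    Pruning first shrinks the graph the causal analysis must examine, which is the source of the efficiency gain the abstract describes.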

    Integrated design optimization methods for optimal sensor placement and cooling system architecture design for electro-thermal systems

    Dynamic thermal management plays a very important role in the design and development of electro-thermal systems as these become more active and complex in their functionality. In highly power-dense electronic systems, heat is concentrated over small spatial domains. Thermal energy dissipation in any electrified system increases temperature and can cause component failure, degradation of heat-sensitive materials, thermal burnout, and failure of active devices. Thermal management therefore needs to be performed both accurately (through thermal monitoring using sensors) and efficiently (through fluid-based cooling techniques). In this work, two important aspects of dynamic thermal management of a highly dense power electronic system have been investigated. The first is the problem of optimal temperature sensor placement for accurate thermal monitoring, aimed toward achieving thermally-aware electrified systems. Strategic placement of temperature sensors can improve the accuracy of real-time temperature distribution estimates. Enhanced temperature estimation supports increased power throughput and density, because Power Electronic Systems (PESs) can be operated in a less conservative manner while still preventing thermal failure. This work presents new methods for temperature sensor placement for 2- and 3-dimensional PESs that 1) improve computational efficiency (by orders of magnitude in at least one case), 2) support use of more accurate evaluation metrics, and 3) are scalable to high-dimension sensor placement problems. These new methods are tested via sensor placement studies based on a 2-kW, 60 Hz, single-phase Flying Capacitor Multi-Level (FCML) prototype inverter. Information-based metrics are derived from a reduced-order Resistance-Capacitance (RC) lumped-parameter thermal model. Other, more general metrics and system models are possible through application of a new continuous relaxation strategy introduced here for placement representation.
A new linear programming (LP) formulation is presented that is compatible with a particular type of information-based metric. This LP strategy is demonstrated to support the efficient solution of finely-discretized, large-scale placement problems. The optimal sensor locations obtained from these methods were tested via physical experiments. The new methods and results presented here may aid the development of thermally-aware PESs with significantly enhanced capabilities. The second aspect is the design of optimal fluid-based thermal management architectures through enumerative methods that help operate the system efficiently within its operating temperature limits using the minimum feasible coolant flow level. Expert intuition based on physics knowledge and vast experience may not be adequate to identify optimal thermal management designs as systems increase in size and complexity. This work also presents a design framework supporting comprehensive exploration of a class of single-phase, fluid-based cooling architectures. The candidate cooling system architectures are represented using labeled rooted tree graphs. Dynamic models are automatically generated from these trees using a graph-based thermal modeling framework. Optimal performance is determined by solving an appropriate fluid flow distribution problem, handling temperature constraints in the presence of exogenous heat loads. Rigorous case studies are performed in simulation, with components having variable sets of heat loads and temperature constraints. Results include optimization of thermal endurance for an enumerated set of 4,051 architectures. In addition, cooling system architectures capable of steady-state operation under a given loading are identified. Optimization of the cooling system design is performed subject to a representative mission consisting of multiple time-varying loads.
Work presented in this thesis clearly shows that the transient effects of heat loads have important impacts on design decisions when compared to steady-state operating conditions.
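To illustrate the sensor placement problem in isolation, here is a minimal greedy sketch that selects sensor sites to maximize a coverage-style information metric over candidate hotspot locations. The influence matrix and benefit function are invented placeholders; the thesis's LP and continuous relaxation formulations are not reproduced here:

```python
def estimation_benefit(chosen, candidate, influence):
    """Marginal benefit of adding `candidate`: how much hotspot 'coverage'
    (best per-hotspot influence among selected sites) improves."""
    hotspots = range(len(influence[candidate]))

    def coverage(sites):
        return sum(max((influence[s][h] for s in sites), default=0.0)
                   for h in hotspots)

    return coverage(chosen | {candidate}) - coverage(chosen)

def place_sensors(influence, budget):
    """Greedily select `budget` sensor sites from the candidate set.
    influence[site] lists how strongly that site's reading reflects
    each hotspot's temperature (e.g. from an RC thermal model)."""
    chosen = set()
    for _ in range(budget):
        best = max((s for s in influence if s not in chosen),
                   key=lambda s: estimation_benefit(chosen, s, influence))
        chosen.add(best)
    return chosen
```

Greedy selection of this kind is a common baseline for information-driven placement; the LP approach in the thesis instead optimizes over a continuous relaxation of the site-selection variables.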

    The assessment of non-technical skills in ENT surgery: a multidisciplinary simulation programme to improve patient safety

    Surgical patients are at particular risk of harm, with 41% of all adverse events in hospital occurring in the operating theatre. Failures in human factors are the leading cause. Despite recognition of the importance of human factors training to patient safety, there is a lack of theatre-based ENT crisis management simulation, and no formal assessment of the requisite skills. Aims: To develop a psychometrically robust tool for assessing non-technical skills in the ENT theatre, to be termed ENT-NOTECHS; and to develop and validate an ENT-themed multidisciplinary simulation programme for the assessment of, and feedback on, non-technical skills. Methods: A multimodal approach was used to create a novel behavioural marker tool to capture non-technical skills in the ENT theatre environment. Alongside this, a prospective, observational study involving a multidisciplinary team training day in ENT- and airway-themed crises in a high-fidelity simulated theatre environment was designed. Teams undertook six high-fidelity simulation scenarios, and non-technical skills were assessed using the ENT-NOTECHS tool. The ENT-NOTECHS tool was assessed for its psychometric robustness: reliability and construct validity. Candidate feedback was obtained to determine the overall effectiveness of training. Results: We successfully designed and delivered a novel multidisciplinary team ENT-themed training day. Over 15 months, 74 trainees (surgeons, anaesthetists and nurses) participated in 6 MDT simulation days, totalling 54 hours of simulation training and 210 assessments. Excellent face and content validity was demonstrated. 100% of participants reported improved confidence in managing ENT crisis scenarios and demonstrated an improvement in non-technical skills (ENT-NOTECHS). The ENT-NOTECHS tool demonstrated excellent psychometric robustness. Good inter-rater reliability (Cronbach's alpha > 0.7) was shown, and the tool discriminated between novice and expert trainees (p < 0.001).
Conclusion: Multidisciplinary team training in ENT-themed crises is a feasible and well-received training intervention. The simulated operating theatre serves as an excellent environment for the assessment and training of non-technical skills. ENT-NOTECHS is a novel assessment tool with evidence for reliability, content and construct validity in ENT teams.
    Open Access
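    For reference, the Cronbach's alpha reliability statistic cited above (values above 0.7 commonly read as acceptable) is a standard formula and can be computed with a small self-contained sketch; any ratings fed to it here would be invented, not the study's data:

    ```python
    def cronbach_alpha(item_scores):
        """item_scores: one inner list per rated item (e.g. per NOTECHS
        behaviour category), aligned across the same set of assessments.
        alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)."""
        k = len(item_scores)            # number of items
        n = len(item_scores[0])         # number of assessments per item

        def variance(xs):               # sample variance (n - 1 denominator)
            m = sum(xs) / len(xs)
            return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

        item_var = sum(variance(item) for item in item_scores)
        totals = [sum(item[j] for item in item_scores) for j in range(n)]
        return (k / (k - 1)) * (1 - item_var / variance(totals))
    ```

    Perfectly correlated items yield alpha = 1, while unrelated items drive it toward 0, which is why a threshold around 0.7 is used as evidence of internal consistency.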