DeSyRe: on-Demand System Reliability
The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect- and fault-free system would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. To reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.
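The split the abstract describes, in which a small fault-free region supervises the fault-prone remainder of the SoC, can be pictured with a toy scheduler. The sketch below is only an illustration of that idea, not the DeSyRe runtime; the class names, fault model, and random remapping policy are assumptions made for the example.

```python
# Toy illustration (not the DeSyRe runtime): a small region assumed to be
# fault-free remaps work away from fault-prone processing elements as faults
# are detected at run time.

import random

class ProcessingElement:
    def __init__(self, pe_id):
        self.pe_id = pe_id
        self.faulty = False

class FaultFreeManager:
    """Models the small fault-free part that supervises the fault-prone fabric."""

    def __init__(self, elements):
        self.elements = elements

    def healthy(self):
        return [pe for pe in self.elements if not pe.faulty]

    def dispatch(self, tasks):
        mapping = {}
        for task in tasks:
            pool = self.healthy()
            if not pool:
                raise RuntimeError("no healthy processing elements left")
            mapping[task] = random.choice(pool).pe_id   # placeholder mapping policy
        return mapping

fabric = [ProcessingElement(i) for i in range(8)]
manager = FaultFreeManager(fabric)
fabric[3].faulty = True                                  # a fault is detected
print(manager.dispatch(["fft", "fir", "crc"]))           # tasks land on healthy PEs only
```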
PULP-HD: Accelerating Brain-Inspired High-Dimensional Computing on a Parallel Ultra-Low Power Platform
Computing with high-dimensional (HD) vectors, also referred to as
hyperdimensional computing, is a brain-inspired alternative to computing with
scalars. Key properties of HD computing include a well-defined set of
arithmetic operations on hypervectors, generality, scalability, robustness,
fast learning, and ubiquitous parallel operations. HD computing is about
manipulating and comparing large patterns (binary hypervectors with 10,000
dimensions), making its efficient realization on minimalistic ultra-low-power
platforms challenging. This paper describes the acceleration of HD computing
and the optimization of its memory accesses and operations on a silicon
prototype of the PULPv3 4-core platform (1.5 mm², 2 mW), surpassing
state-of-the-art classification accuracy (92.4% on average) with a
simultaneous 3.7× end-to-end speed-up and 2× energy saving compared to
single-core execution. We further explore the scalability of our accelerator
by increasing the number of inputs and the classification window on a new
generation of the PULP architecture featuring bit-manipulation instruction
extensions and a larger number of cores (eight). Together, these enable a
near-ideal speed-up of 18.4× compared to the single-core PULPv3.
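As a concrete illustration of the hypervector arithmetic the abstract refers to, the following sketch builds 10,000-dimensional binary hypervectors, bundles noisy samples into class prototypes, and classifies by Hamming distance. It is a generic NumPy example of HD computing under common conventions (XOR binding, majority bundling), not the optimized PULP kernels described in the paper.

```python
# Generic HD-computing sketch (not the PULP kernels): binary hypervectors,
# binding by XOR, bundling by bit-wise majority, classification by Hamming distance.

import numpy as np

D = 10_000  # hypervector dimensionality mentioned in the abstract
rng = np.random.default_rng(0)

def random_hv():
    return rng.integers(0, 2, D, dtype=np.uint8)

def bind(a, b):
    return np.bitwise_xor(a, b)          # binds two hypervectors

def bundle(hvs):
    return (np.sum(hvs, axis=0) > len(hvs) / 2).astype(np.uint8)  # majority vote

def hamming(a, b):
    return np.count_nonzero(a != b) / D

def noisy(hv, flips=500):
    idx = rng.choice(D, flips, replace=False)
    out = hv.copy()
    out[idx] ^= 1
    return out

# Encode two classes from a few noisy samples and classify a query vector.
proto_a, proto_b = random_hv(), random_hv()
class_a = bundle([noisy(proto_a) for _ in range(5)])
class_b = bundle([noisy(proto_b) for _ in range(5)])
query = noisy(proto_a)
print("class:", "A" if hamming(query, class_a) < hamming(query, class_b) else "B")
```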
Engineering Resilient Space Systems
Several distinct trends will influence space exploration missions in the next decade. Destinations are
becoming more remote and mysterious, science questions more sophisticated, and, as mission experience
accumulates, the most accessible targets are visited, advancing the knowledge frontier to more difficult,
harsh, and inaccessible environments. This leads to new challenges including: hazardous conditions that
limit mission lifetime, such as high radiation levels surrounding interesting destinations like Europa or
toxic atmospheres of planetary bodies like Venus; unconstrained environments with navigation hazards,
such as free-floating active small bodies; multielement missions required to answer more sophisticated
questions, such as Mars Sample Return (MSR); and long-range missions, such as Kuiper belt exploration,
that must survive equipment failures over the span of decades. These missions will need to be successful
without a priori knowledge of the most efficient data collection techniques for optimum science return.
Science objectives will have to be revised "on the fly", with new data collection and navigation decisions
on short timescales.
Yet, even as science objectives are becoming more ambitious, several critical resources remain
unchanged. Since physics imposes insurmountable light-time delays, anticipated improvements to the
Deep Space Network (DSN) will only marginally improve the bandwidth and communications cadence to
remote spacecraft. Fiscal resources are increasingly limited, resulting in fewer flagship missions, smaller
spacecraft, and less subsystem redundancy. As missions visit more distant and formidable locations, the
job of the operations team becomes more challenging, seemingly inconsistent with the trend of shrinking
mission budgets for operations support. How can we continue to explore challenging new locations
without increasing risk or system complexity?
These challenges are present, to some degree, for the entire Decadal Survey mission portfolio, as
documented in Vision and Voyages for Planetary Science in the Decade 2013-2022 (National Research
Council, 2011), but are especially acute for the following mission examples, identified in our recently
completed KISS Engineering Resilient Space Systems (ERSS) study:
1. A Venus lander, designed to sample the atmosphere and surface of Venus, would have to perform
science operations as components and subsystems degrade and fail;
2. A Trojan asteroid tour spacecraft would spend significant time cruising to its ultimate destination
(essentially hibernating to save on operations costs), then upon arrival, would have to act as its
own surveyor, finding new objects and targets of opportunity as it approaches each asteroid,
requiring response on short notice; and
3. A MSR campaign would not only be required to perform fast reconnaissance over long distances
on the surface of Mars, interact with an unknown physical surface, and handle degradations and
faults, but would also contain multiple components (launch vehicle, cruise stage, entry and
landing vehicle, surface rover, ascent vehicle, orbiting cache, and Earth return vehicle) that
dramatically increase the need for resilience to failure across the complex system.
The concept of resilience and its relevance and application in various domains was a focus during the
study, with several definitions of resilience proposed and discussed. While there was substantial variation
in the specifics, there was a common conceptual core that emerged: adaptation in the presence of
changing circumstances. These changes were couched in various ways (anomalies, disruptions,
discoveries), but they all ultimately had to do with changes in underlying assumptions. Invalid
assumptions, whether due to unexpected changes in the environment, or an inadequate understanding of
interactions within the system, may cause unexpected or unintended system behavior. A system is
resilient if it continues to perform the intended functions in the presence of invalid assumptions.
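Read operationally, this working definition amounts to monitoring the assumptions a plan depends on and adapting the plan when one of them is observed to be invalid. The following sketch is our own paraphrase of that idea rather than an artifact of the study; the assumption names, thresholds, and replanning step are purely illustrative.

```python
# Illustrative paraphrase of the study's working definition of resilience:
# keep performing the intended functions while detecting and reacting to
# assumptions that turn out to be invalid. Names and checks are hypothetical.

def violated_assumptions(state):
    checks = {
        "comm_link_available": state["link_margin_db"] > 3.0,
        "battery_sufficient": state["battery_pct"] > 20.0,
    }
    return {name for name, ok in checks.items() if not ok}

def adapt(plan, violated):
    # Degrade gracefully instead of failing: keep only the activities whose
    # assumptions still hold.
    return [step for step in plan if step["needs"].isdisjoint(violated)]

plan = [
    {"name": "downlink_science", "needs": {"comm_link_available"}},
    {"name": "acquire_image",    "needs": set()},
]
state = {"link_margin_db": 1.2, "battery_pct": 64.0}   # link assumption now invalid

violated = violated_assumptions(state)
current_plan = adapt(plan, violated) if violated else plan
print([step["name"] for step in current_plan])          # ['acquire_image']
```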
Our study focused on areas of resilience that we felt needed additional exploration and integration,
namely system and software architectures and capabilities, and autonomy technologies. (While also an
important consideration, resilience in hardware is being addressed in multiple other venues, including
other KISS studies.) The study consisted of two workshops, separated by a seven-month focused study
period. The first workshop (Workshop #1) explored the "problem space" as an organizing theme, and the
second workshop (Workshop #2) explored the "solution space". In each workshop, focused discussions
and exercises were interspersed with presentations from participants and invited speakers.
The study period between the two workshops was organized as part of the synthesis activity during the
first workshop. The study participants, after spending the initial days of the first workshop discussing the
nature of resilience and its impact on future science missions, decided to split into three focus groups,
each with a particular thrust, to explore specific ideas further and develop material needed for the second
workshop. The three focus groups and areas of exploration were:
1. Reference missions: address/refine the resilience needs by exploring a set of reference missions
2. Capability survey: collect, document, and assess current efforts to develop capabilities and
technology that could be used to address the documented needs, both inside and outside NASA
3. Architecture: analyze the impact of architecture on system resilience, and provide principles and
guidance for architecting greater resilience in our future systems
The key product of the second workshop was a set of capability roadmaps pertaining to the three
reference missions selected for their representative coverage of the types of space missions envisioned for
the future. From these three roadmaps, we have extracted several common capability patterns that would
be appropriate targets for near-term technical development: one focused on graceful degradation of
system functionality, a second focused on data understanding for science and engineering applications,
and a third focused on hazard avoidance and environmental uncertainty. Continuing work is extending
these roadmaps to identify candidate enablers of the capabilities from the following three categories:
architecture solutions, technology solutions, and process solutions.
The KISS study allowed a collection of diverse and engaged engineers, researchers, and scientists to think
deeply about the theory, approaches, and technical issues involved in developing and applying resilience
capabilities. The conclusions summarize the varied and disparate discussions that occurred during the
study, and include new insights about the nature of the challenge and potential solutions:
1. There is a clear and definitive need for more resilient space systems. During our study period,
the key scientists/engineers we engaged to understand potential future missions confirmed the
scientific and risk reduction value of greater resilience in the systems used to perform these
missions.
2. Resilience can be quantified in measurable terms: project cost, mission risk, and quality of
science return. In order to consider resilience properly in the set of engineering trades performed
during the design, integration, and operation of space systems, the benefits and costs of resilience
need to be quantified. We believe, based on the work done during the study, that appropriate
metrics to measure resilience must relate to risk, cost, and science quality/opportunity. Additional
work is required to explicitly tie design decisions to these first-order concerns.
3. There are many existing basic technologies that can be applied to engineering resilient space
systems. Through the discussions during the study, we found many varied approaches and
research that address the various facets of resilience, some within NASA, and many more
beyond. Examples from civil architecture, Department of Defense (DoD) / Defense Advanced
Research Projects Agency (DARPA) initiatives, "smart" power grid control, cyber-physical
systems, software architecture, and application of formal verification methods for software were
identified and discussed. The variety and scope of related efforts is encouraging and presents
many opportunities for collaboration and development, and we expect many collaborative
proposals and joint research as a result of the study.
4. Use of principled architectural approaches is key to managing complexity and integrating
disparate technologies. The main challenge inherent in considering highly resilient space
systems is that the increase in capability can result in an increase in complexity with all of the
risks and costs associated with more complex systems. What is needed is a better way of
conceiving space systems that enables incorporation of capabilities without increasing
complexity. We believe principled architecting approaches provide the needed means to convey a
unified understanding of the system to primary stakeholders, thereby controlling complexity in
the conception and development of resilient systems, and enabling the integration of disparate
approaches and technologies. A representative architectural example is included in Appendix F.
5. Developing trusted resilience capabilities will require a diverse yet strategically directed
research program. Despite the interest in, and benefits of, deploying resilient space systems, to
date, there has been a notable lack of meaningful demonstrated progress in systems capable of
working in hazardous uncertain situations. The roadmaps completed during the study, and
documented in this report, provide the basis for a real funded plan that considers the required
fundamental work and evolution of needed capabilities.
Exploring space is a challenging and difficult endeavor. Future space missions will require more
resilience in order to perform the desired science in new environments under constraints of development
and operations cost, acceptable risk, and communications delays. Development of space systems with
resilient capabilities has the potential to expand the limits of possibility, revolutionizing space science by
enabling as yet unforeseen missions and breakthrough science observations.
Our KISS study provided an essential venue for the consideration of these challenges and goals.
Additional work and future steps are needed to realize the potential of resilient systems; this study
provided the necessary catalyst to begin this process.
Multi-criteria optimization for energy-efficient multi-core systems-on-chip
The steady down-scaling of transistor dimensions has made possible the evolutionary progress leading to today's high-performance multi-GHz microprocessors and core-based Systems-on-Chip (SoC) that offer superior performance, dramatically reduced cost per function, and much-reduced physical size compared to their predecessors. On the negative side, however, this rapid scaling also translates into high power densities, higher operating temperatures, and reduced reliability, making it imperative to address the design issues that have cropped up in its wake. In particular, aggressive physical miniaturization has increased CMOS fault sensitivity to the extent that many reliability constraints pose a threat to normal device operation and accelerate the onset of wearout-based failures. Among the various wearout-based failure mechanisms, negative bias temperature instability (NBTI) has been recognized as the most critical source of device aging.
The urgent need for reliable, low-power circuits is driving the EDA community to develop new design techniques, circuit solutions, algorithms, and software that can address these critical issues. Unfortunately, this challenge is complicated by the fact that power and reliability are known to be intrinsically conflicting metrics: traditional solutions to improve reliability, such as redundancy, increased voltage levels, and up-sizing of critical devices, contrast with traditional low-power solutions, which rely on compact architectures, scaled supply voltages, and small devices.
This dissertation focuses on methodologies to bridge this gap and establishes an important link between low-power solutions and aging effects. More specifically, we propose new architectural solutions based on power management strategies to enable the design of low-power, aging-aware cache memories.
Cache memories are among the most critical components for guaranteeing reliable and timely operation. However, they are also particularly susceptible to aging effects. Due to the symmetric structure of a memory cell, aging occurs regardless of whether a cell (or word) is accessed. Moreover, aging is a worst-case metric: the line with the worst-case access pattern determines the aging of the entire cache. To stop the aging of a memory cell, it must be put into a proper idle state whenever it is not accessed, which requires proper management of the idleness of each atomic unit of power management.
We have proposed several reliability management techniques based on the idea of cache partitioning to alleviate NBTI-induced aging and obtain joint energy and lifetime benefits. We introduce a graceful degradation mechanism that allows the different blocks into which a cache is partitioned to age at different rates. This implies that the various sub-blocks become unreliable at different times, while the cache keeps functioning with reduced efficiency. We extended the capabilities of this architecture by integrating the concept of reconfigurable caches to maintain the performance of the cache throughout its lifetime. With this strategy, whenever a block becomes unreliable, the remaining cache is reconfigured to work as a smaller cache with only a marginal degradation of performance.
Many mission-critical applications require a guaranteed lifetime for their operations, and therefore for the hardware implementing their functionality. Such constraints are usually enforced by means of various reliability-enhancing solutions, mostly based on redundancy, which are not energy-friendly. In our work, we have proposed a novel cache architecture in which a smart use of cache partitions for redundancy yields a cache that meets a desired lifetime target with minimal energy consumption.
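The graceful-degradation and reconfiguration strategy described above can be sketched abstractly as bookkeeping over cache partitions: a partition accumulates stress only while it serves accesses, and once it is deemed unreliable the cache is reconfigured around it as a smaller cache. The stress model, threshold, and round-robin access pattern below are illustrative placeholders, not the dissertation's NBTI model or cache controller.

```python
# Abstract sketch of aging-aware cache partitioning (illustrative model only):
# a partition accumulates "stress" while it serves accesses; an unreliable
# partition is retired and the cache keeps working at reduced capacity.

class AgingAwareCache:
    def __init__(self, num_partitions, ways_per_partition, stress_limit=1000):
        self.stress = [0] * num_partitions
        self.retired = [False] * num_partitions
        self.ways_per_partition = ways_per_partition
        self.stress_limit = stress_limit

    def active_partitions(self):
        return [i for i, r in enumerate(self.retired) if not r]

    def effective_ways(self):
        # After reconfiguration the cache behaves as a smaller cache made of
        # the remaining partitions.
        return len(self.active_partitions()) * self.ways_per_partition

    def access(self, partition):
        # Only the partition serving the access ages; idle partitions are
        # assumed to sit in a recovery-friendly idle state (a simplification).
        self.stress[partition] += 1
        if self.stress[partition] >= self.stress_limit:
            self.retired[partition] = True      # graceful degradation

cache = AgingAwareCache(num_partitions=4, ways_per_partition=2)
for cycle in range(5000):
    active = cache.active_partitions()
    if not active:
        break
    cache.access(active[cycle % len(active)])   # illustrative round-robin pattern
print("remaining associativity:", cache.effective_ways())
```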
Autonomous Recovery Of Reconfigurable Logic Devices Using Priority Escalation Of Slack
Field Programmable Gate Array (FPGA) devices offer a suitable platform for survivable hardware architectures in mission-critical systems. In this dissertation, active dynamic redundancy-based fault-handling techniques are proposed that exploit the dynamic partial reconfiguration capability of SRAM-based FPGAs. Self-adaptation is realized by employing reconfiguration in the detection, diagnosis, and recovery phases. To extend these concepts to semiconductor aging and process variation in the deep submicron era, resilient adaptable processing systems are sought to maintain quality and throughput requirements despite the vulnerabilities of the underlying computational devices. A new approach to autonomous fault-handling which addresses these goals is developed using only a uniplex hardware arrangement. It operates by observing a health metric to achieve Fault Demotion using Reconfigurable Slack (FaDReS). Here, an autonomous fault isolation scheme is employed which neither requires test vectors nor suspends the computational throughput, but instead observes the value of a health metric based on runtime input. The deterministic flow of the fault isolation scheme guarantees success in a bounded number of reconfigurations of the FPGA fabric. FaDReS is then extended to the Priority Using Resource Escalation (PURE) online redundancy scheme, which considers fault-isolation latency and throughput trade-offs under a dynamic spare arrangement. While deep-submicron designs introduce new challenges, the use of adaptive techniques is seen to provide several promising avenues for improving resilience. The schemes developed are demonstrated by the hardware design of various signal processing circuits and their implementation on a Xilinx Virtex-4 FPGA device. These include a Discrete Cosine Transform (DCT) core, a Motion Estimation (ME) engine, a Finite Impulse Response (FIR) filter, a Support Vector Machine (SVM), and Advanced Encryption Standard (AES) blocks, in addition to MCNC benchmark circuits. A significant reduction in power consumption is achieved, ranging from 83% for low-motion-activity scenes to 12.5% for high-motion-activity video scenes in a novel ME engine configuration. For a typical benchmark video sequence, PURE is shown to maintain a PSNR baseline near 32 dB. The diagnosability, reconfiguration latency, and resource overhead of each approach are analyzed. Compared to previous alternatives, PURE maintains a PSNR within 4.02 dB to 6.67 dB of the fault-free baseline by escalating healthy resources to higher-priority signal processing functions. The results indicate the benefits of priority-aware resiliency over conventional redundancy approaches in terms of fault recovery, power consumption, and resource-area requirements. Together, these provide a broad range of strategies to achieve autonomous recovery of reconfigurable logic devices under a variety of constraints, operating conditions, and optimization criteria.
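The fault-handling flow summarized above, observing a health metric at runtime and reconfiguring suspect regions until the metric recovers, without test vectors, can be sketched as follows. This is a schematic paraphrase; the health metric, region model, threshold value, and relocation step are hypothetical placeholders rather than the FaDReS/PURE implementation.

```python
# Schematic paraphrase of health-metric-driven fault isolation by reconfiguration
# (not the FaDReS/PURE implementation): relocate one region at a time and keep
# the relocation whenever the observed health metric recovers.

HEALTH_THRESHOLD = 30.0   # e.g., an acceptable PSNR in dB (illustrative value)

def health_metric(regions, faulty_region, relocated):
    # Stand-in for an observed runtime metric such as output PSNR: degraded
    # while the faulty region is still hosting logic, healthy otherwise.
    return 25.0 if faulty_region in (set(regions) - relocated) else 34.0

def isolate_fault(regions, faulty_region):
    relocated = set()
    for region in regions:               # bounded number of reconfigurations
        if health_metric(regions, faulty_region, relocated) >= HEALTH_THRESHOLD:
            break
        relocated.add(region)            # move this region's logic onto spare slack
        if health_metric(regions, faulty_region, relocated) >= HEALTH_THRESHOLD:
            return region                # relocating it restored health: it was faulty
        relocated.remove(region)         # otherwise undo and try the next region
    return None

print("isolated region:", isolate_fault(regions=list(range(6)), faulty_region=4))
```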
How realistic is the mixed-criticality real-time system model?
23rd International Conference on Real-Time Networks and Systems (RTNS 2015), 4-6 November 2015, Lille, France, Main Track. Best Paper Award Nominee.
With the rapid evolution of commercial hardware platforms, in most application domains, the industry has shown
a growing interest in integrating and running independently-developed applications of different "criticalities" on
the same multicore platform. Such integrated systems are commonly referred to as mixed-criticality systems (MCS).
Most of the MCS-related research published in the state of the art cites the safety-related standards associated with
each application domain (e.g., aeronautics, space, railway, automotive) to justify its methods and results.
However, those standards are not, in most cases, freely available, and do not always clearly and explicitly specify
the requirements for mixed-criticality systems. This paper addresses the important challenge of unveiling the
relevant information available in some of the safety-related standards, such that the mixed-criticality concept is
understood from an industrialist's perspective. Moreover, the paper evaluates the state-of-the-art mixed-criticality
real-time scheduling models and algorithms against the safety-related standards and clarifies some
misconceptions that are commonly encountered.
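For readers unfamiliar with what the scheduling literature means by a mixed-criticality model, the Vestal-style task model that much of the evaluated work builds on gives each task a criticality level and one WCET estimate per level. The sketch below shows such a task set and a trivial per-mode utilization check; it is a generic textbook-style illustration under assumed parameters, not an algorithm or example taken from this paper.

```python
# Generic illustration of a Vestal-style mixed-criticality task model (not taken
# from the paper): each task carries a criticality level and one WCET estimate
# per level; LO-criticality tasks are dropped when the system switches to HI mode.

from dataclasses import dataclass

@dataclass
class MCTask:
    name: str
    period: float    # implicit deadline equal to the period
    crit: str        # "LO" or "HI"
    wcet_lo: float   # WCET estimate used while the system runs in LO mode
    wcet_hi: float   # more pessimistic estimate used in HI mode

tasks = [
    MCTask("control_loop", period=10.0, crit="HI", wcet_lo=2.0, wcet_hi=4.0),
    MCTask("logging",      period=20.0, crit="LO", wcet_lo=3.0, wcet_hi=3.0),
]

def utilization(tasks, mode):
    total = 0.0
    for t in tasks:
        if mode == "HI" and t.crit == "LO":
            continue                     # LO tasks are abandoned in HI mode
        total += (t.wcet_hi if mode == "HI" else t.wcet_lo) / t.period
    return total

print("U(LO) =", utilization(tasks, "LO"), " U(HI) =", utilization(tasks, "HI"))
```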
Toward Biologically-Inspired Self-Healing, Resilient Architectures for Digital Instrumentation and Control Systems and Embedded Devices
Digital Instrumentation and Control (I&C) systems in safety-related applications of next-generation industrial automation systems require high levels of resilience against different fault classes. One of the more essential concepts for achieving this goal is the notion of resilient and survivable digital I&C systems. In recent years, self-healing concepts based on biological physiology have received attention for the design of robust digital systems. However, many of these approaches have not been architected from the outset with safety in mind, nor have they been targeted at the automation community, where a significant need exists. This dissertation presents a new self-healing digital I&C architecture called BioSymPLe, inspired by the way nature responds, defends, and heals: the stem cells in the immune system of living organisms, the life cycle of the living cell, and the pathway from deoxyribonucleic acid (DNA) to protein. The BioSymPLe architecture integrates biological concepts, fault tolerance techniques, and operational schematics for the international standard IEC 61131-3 to facilitate adoption in the automation industry. BioSymPLe is organized into three hierarchical levels: the local function migration layer at the top, the critical service layer in the middle, and the global function migration layer at the bottom. The local layer monitors the correct execution of functions at the cellular level and activates healing mechanisms at the critical service level. The critical layer allocates a group of functional B cells, which represent the building block that executes the intended functionality of the critical application based on the expression of DNA genetic codes stored inside each cell. The global layer uses the concept of embryonic stem cells, differentiating these cells to repair faulty T cells and supervising all repair mechanisms. Finally, two industrial applications have been mapped onto the proposed architecture; they are capable of tolerating a significant number of faults (transient, permanent, and hardware common cause failures, CCFs) that can stem from environmental disturbances, and we believe the nexus of its concepts can positively impact the next generation of critical systems in the automation industry.
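A heavily simplified reading of the three-layer idea is sketched below: functional cells execute the application from a stored configuration, and a pool of undifferentiated spare cells is differentiated to take over when a cell is diagnosed as faulty. The class names, the single-gene configuration, and the repair flow are assumptions made for the illustration and are not the BioSymPLe design.

```python
# Heavily simplified paraphrase of the three-layer idea (not the BioSymPLe design):
# functional cells execute the application, and a pool of undifferentiated "stem"
# cells is differentiated to replace a cell that is diagnosed as faulty.

class FunctionalCell:
    def __init__(self, gene):
        self.gene = gene        # stands in for the stored "DNA" configuration
        self.faulty = False

    def execute(self, x):
        return None if self.faulty else self.gene(x)

class GlobalLayer:
    def __init__(self, spare_count):
        self.spares = spare_count

    def differentiate(self, gene):
        if self.spares == 0:
            return None
        self.spares -= 1
        return FunctionalCell(gene)      # a spare takes over the faulty cell's role

critical_layer = [FunctionalCell(lambda x: 2 * x)]
global_layer = GlobalLayer(spare_count=2)

critical_layer[0].faulty = True          # fault detected by the local layer
replacement = global_layer.differentiate(critical_layer[0].gene)
critical_layer[0] = replacement
print(critical_layer[0].execute(21))     # intended function continues: prints 42
```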
A Novel Method for Online Detection of Faults Affecting Execution-Time in Multicore-Based Systems
This article proposes a bounded interference method, based on statistical evaluations, for online detection
and tolerance of any fault capable of causing a deadline miss. The proposed method requires data that can be
gathered during the profiling and worst-case execution time (WCET) analysis phase. This article describes
the method and its application, and then presents an avionic mixed-criticality use case for experimental
evaluation, considering both dual-core and quad-core platforms. Results show that faults that can cause
a timing violation are correctly identified, while other faults that do not introduce significant temporal
interference can be tolerated to avoid high recovery overheads.
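The detection idea, comparing execution times observed online against a bound derived statistically from the profiling and WCET-analysis phase and treating a large overrun as a fault before the deadline is missed, can be sketched as follows. The threshold construction (mean plus a multiple of the standard deviation) and the numbers are illustrative assumptions, not the article's statistical formulation.

```python
# Illustrative sketch of online timing-fault detection against a profiled bound
# (not the article's exact statistical method): flag an execution whose observed
# time exceeds the bound derived offline, before the deadline can be missed.

import statistics

def profiled_bound(samples, margin=3.0):
    # Offline phase: derive a detection threshold from profiling data,
    # here a simple mean + margin * stddev rule (illustrative only).
    return statistics.mean(samples) + margin * statistics.stdev(samples)

def check_job(observed_ms, bound_ms, deadline_ms):
    if observed_ms > deadline_ms:
        return "deadline miss"
    if observed_ms > bound_ms:
        return "fault suspected: interference beyond profiled bound"
    return "ok"

profiling_samples = [4.1, 4.3, 4.0, 4.2, 4.4, 4.1, 4.3]   # ms, from the profiling phase
bound = profiled_bound(profiling_samples)

for observed in (4.2, 5.9, 10.5):
    print(f"{observed:4.1f} ms ->", check_job(observed, bound, deadline_ms=10.0))
```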