4 research outputs found

    Runtime resource management for lifetime extension in multi-core systems

    The availability of numerous, possibly heterogeneous, processing resources in multi-core systems makes it possible to exploit them to optimize performance and/or power/energy consumption. In particular, strategies have been defined to map and schedule tasks on the system resources with the aim of optimizing the adopted figure of merit, either at design time, if the working context is known in advance and relatively stable, or at run time, when facing changing or unpredictable working conditions. However, such strategies may affect the overall lifetime of the system because of aging and wear-out mechanisms. Therefore, the management strategies generally adopted for handling performance and power consumption should be enhanced to take these issues into account. Furthermore, specific Dynamic Reliability Management (DRM) policies have been devised to deal with lifetime issues in multi-core systems, acting mainly on the workload distribution (and possibly on architectural knobs such as voltage/frequency scaling) to mitigate the stress caused by the running applications. Here we focus on DRM strategies whose goal is to improve lifetime reliability by means of load distribution policies that identify the resource onto which a new application entering the system should be mapped, or to which tasks should periodically be migrated to balance stress. More precisely, a selection of state-of-the-art solutions is presented and analysed with respect to the achieved expected lifetime, evaluated both when considering the first failure and when considering the sequence of failures that leads to the system being unable to fulfill the user's performance or service requirements.
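
    To make the idea of a load-distribution DRM policy concrete, the following Python sketch maps a newly arriving application onto the core with the lowest accumulated stress that still has spare capacity. It is an illustrative assumption of how such a policy could look (the names Core, map_application, and the linear stress model are hypothetical), not one of the policies analysed in this work.

```python
# Minimal sketch of a stress-balancing DRM mapping policy (illustrative only;
# Core, map_application, and the stress model are assumptions).
from dataclasses import dataclass

@dataclass
class Core:
    core_id: int
    utilization: float = 0.0          # fraction of capacity currently in use
    accumulated_stress: float = 0.0   # proxy for consumed lifetime (aging)

def map_application(cores, app_load):
    """Map a newly arriving application onto the least-stressed core
    that still has enough spare capacity."""
    candidates = [c for c in cores if c.utilization + app_load <= 1.0]
    if not candidates:
        return None  # no feasible core: the application must wait or be rejected
    target = min(candidates, key=lambda c: c.accumulated_stress)
    target.utilization += app_load
    return target.core_id

def advance_epoch(cores, dt=1.0):
    """Accumulate stress proportionally to utilization over one epoch."""
    for c in cores:
        c.accumulated_stress += c.utilization * dt

# Example: four cores, three applications entering the system.
cores = [Core(i) for i in range(4)]
for load in (0.4, 0.6, 0.3):
    print("mapped on core", map_application(cores, load))
advance_epoch(cores)
```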

    Energy-Efficient and Reliable Computing in Dark Silicon Era

    Dark silicon denotes the phenomenon that, due to thermal and power constraints, the fraction of transistors that can operate at full frequency decreases with each technology generation. Moore's law and Dennard scaling went hand in hand for five decades, delivering commensurate exponential performance gains first through single-core and later through multi-core designs. However, with the breakdown of Dennard scaling at recent small technology nodes, the ongoing growth in core counts demands exponentially increasing thermal design power to achieve only a linear performance increase. This leads to a power wall that raises the amount of dark or dim silicon on future multi/many-core chips. Furthermore, as the number of transistors on a single chip grows, susceptibility to internal defects and aging phenomena, both exacerbated by high thermal density, makes monitoring and managing chip reliability before and after the chip enters operation a necessity. The proposed approaches and experimental investigations in this thesis focus on two main tracks: 1) power awareness and 2) reliability awareness in the dark silicon era; the two tracks are then combined. In the first track, the main goal is to increase the returns in terms of key chip-design metrics, such as performance and throughput, while the maximum power limit is honored. In fact, we show that by managing power in the presence of dark silicon, all the traditional benefits of continuing along Moore's law can also be achieved in the dark silicon era, albeit to a lesser extent. In the reliability-awareness track, we show that dark silicon can be regarded as an opportunity to be exploited for several benefits, namely lifetime extension and online testing. We discuss how dark silicon can be exploited to guarantee that the system lifetime stays above a given target value and, furthermore, how it can be exploited to apply low-cost, non-intrusive online testing to the cores. After demonstrating power and reliability awareness in the presence of dark silicon, two approaches are discussed as case studies in which power and reliability awareness are combined. The first approach demonstrates how chip reliability can be used as a supplementary metric for power-reliability management; the second provides a trade-off between workload performance and system reliability by simultaneously honoring the given power budget and the target reliability.
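
    As an illustration of treating dark silicon as an aging-balancing opportunity, the sketch below activates, in each epoch, only the least-worn cores that fit within a fixed power budget and leaves the rest dark. The per-core power, the budget, and the wear model are assumptions made for this example, not values or mechanisms from the thesis.

```python
# Illustrative sketch: under a fixed chip power budget, only a subset of cores
# can be active, and the selection rotates towards the least-aged cores.
# CORE_POWER, POWER_BUDGET, and the wear accounting are hypothetical.

CORE_POWER = 2.0       # watts consumed by one active core (assumed)
POWER_BUDGET = 10.0    # chip-level power cap (assumed)

def select_active_cores(wear):
    """Return indices of cores to switch on this epoch: the least-worn cores
    that fit within the power budget; the rest stay dark."""
    max_active = int(POWER_BUDGET // CORE_POWER)
    ranked = sorted(range(len(wear)), key=lambda i: wear[i])
    return ranked[:max_active]

def run_epoch(wear, load_per_core=1.0):
    active = select_active_cores(wear)
    for i in active:
        wear[i] += load_per_core   # active cores age; dark cores idle and recover
    return active

wear = [0.0] * 8   # eight cores, at most five active at a time
for epoch in range(4):
    print("epoch", epoch, "active:", run_epoch(wear))
```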

    A lifetime-aware runtime mapping approach for many-core systems in the dark silicon era

    In this paper, we propose a novel lifetime reliability-aware resource management approach for many-core architectures. The approach is based on a hierarchical architecture composed of a long-term runtime reliability analysis unit and a short-term runtime mapping unit. The former periodically analyses the aging status of the various processing units with respect to a target value specified by the designer and performs recovery actions on highly stressed cores. The computed reliability metrics are then used when mapping newly arriving applications at run time, so as to maximize system performance while fulfilling the reliability requirements and the available power budget. Our extensive experimental results reveal that the proposed reliability-aware approach can efficiently select the processing cores to be used over time in order to enhance reliability at the end of the operational life (by up to 62%) while offering a performance level comparable to that of the state-of-the-art runtime mapping approach.
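
    A minimal sketch of the hierarchical structure described above might look as follows: a long-term unit periodically refreshes per-core reliability estimates, and a short-term unit maps an incoming application only onto cores that satisfy both a reliability target and the power budget. The exponential reliability model and all constants are illustrative assumptions, not the paper's actual model or parameters.

```python
import math

# Hypothetical two-level structure: long-term reliability analysis plus
# short-term mapping under reliability and power constraints.
ALPHA = 1e-4          # assumed aging rate per unit of accumulated active time
R_TARGET = 0.9        # designer-specified reliability target (assumed)
POWER_BUDGET = 12.0   # available power budget in watts (assumed)
CORE_POWER = 3.0      # power drawn by one busy core (assumed)

def long_term_update(active_time):
    """Long-term unit: recompute each core's reliability estimate."""
    return [math.exp(-ALPHA * t) for t in active_time]

def short_term_map(reliability, busy, cores_needed):
    """Short-term unit: pick the most reliable free cores that meet the
    reliability target, without exceeding the power budget."""
    if cores_needed * CORE_POWER > POWER_BUDGET:
        return None
    eligible = [i for i, r in enumerate(reliability)
                if r >= R_TARGET and not busy[i]]
    eligible.sort(key=lambda i: reliability[i], reverse=True)
    if len(eligible) < cores_needed:
        return None   # reject or defer: constraints cannot be met
    chosen = eligible[:cores_needed]
    for i in chosen:
        busy[i] = True
    return chosen

active_time = [0, 5000, 12000, 800]   # accumulated active time per core
busy = [False] * 4
rel = long_term_update(active_time)
print(short_term_map(rel, busy, cores_needed=2))
```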

    Self-aware reliable monitoring

    Cyber-Physical Systems (CPSs) can be found in almost all technical areas, where they constitute a key enabler for anticipated autonomous machines and devices. They are used in a wide range of applications such as autonomous driving, traffic control, manufacturing plants, telecommunication systems, smart grids, and portable health monitoring systems. CPSs face steadily increasing requirements such as autonomy, adaptability, reliability, robustness, efficiency, and performance. A CPS needs comprehensive knowledge about itself and its environment to meet these requirements, make rational, well-informed decisions, manage its objectives in a sophisticated way, and adapt to a possibly changing environment. To gain such comprehensive knowledge, a CPS must monitor itself and its environment. However, the data obtained during this process comes from physical properties measured by sensors and may differ from the ground truth. Sensors are neither completely accurate nor precise; even if they were, they could still be used incorrectly or break during operation. Besides, it is possible that not all characteristics of physical quantities in the environment are entirely known. Furthermore, some input data may be meaningless as long as it is not transferred into a domain understandable to the CPS. Regardless of the reason, whether erroneous data, incomplete knowledge, or unintelligibility of the data, such circumstances can leave a CPS with an incomplete or inaccurate picture of itself and its environment, which can lead to wrong decisions with possibly negative consequences. Therefore, a CPS must know the reliability of the obtained data and may need to abstract information from it to fulfill its tasks. In addition, a CPS should base its decisions on a measure that reflects its confidence about certain circumstances. Computational Self-Awareness (CSA) is a promising solution for providing a CPS with a monitoring ability that is reliable and robust, even in the presence of erroneous data. This dissertation demonstrates that CSA, especially the properties of abstraction, data reliability, and confidence, can improve a system's monitoring capabilities with respect to robustness and reliability. The extensive experiments conducted are based on two case studies from different fields: the health and industrial sectors.
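
    As a rough illustration of attaching data reliability and confidence to monitored values, the following sketch fuses several sensor readings weighted by their reliability and reports an overall confidence that a decision step can act on. The weighting scheme and the threshold are assumptions for illustration, not the dissertation's method.

```python
# Hypothetical confidence-weighted fusion of sensor readings: each source
# carries a reliability weight in [0, 1]; the fused value comes with an
# overall confidence that downstream decisions can check.

def fuse(readings):
    """readings: list of (value, reliability). Returns (fused_value, confidence).
    Unreliable sources contribute less to the fused estimate."""
    usable = [(v, r) for v, r in readings if r > 0.0]
    if not usable:
        return None, 0.0
    total = sum(r for _, r in usable)
    fused = sum(v * r for v, r in usable) / total
    confidence = total / len(readings)   # penalize missing/unreliable sources
    return fused, confidence

# Example: three heart-rate sensors, one of which is suspected faulty.
readings = [(72.0, 0.9), (75.0, 0.8), (140.0, 0.1)]
value, conf = fuse(readings)
if conf < 0.5:
    print("low confidence, defer decision:", value, conf)
else:
    print("fused estimate:", round(value, 1), "confidence:", round(conf, 2))
```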