3,700 research outputs found

    TANDEM: taming failures in next-generation datacenters with emerging memory

    The explosive growth of online services, leading to unforeseen scales, has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery that minimizes service interruptions. To be recoverable, applications must take additional measures during failure-free execution to maintain a recoverable state of their data and computation logic. However, these precautionary measures have severe implications for performance, correctness, and programmability, making recovery incredibly challenging to realize in practice. Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges: their architectural attributes differ markedly from those of traditional memory devices, introducing new semantic challenges for implementing recovery and complicating correctness and programmability. Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges. When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness, including recovery; and (4) how to preserve programmer productivity (programmability). This thesis addresses these challenges through the following approaches: (a) defining precise consistency models that formally specify correct end-to-end semantics in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery. We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor’s load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for log-free data structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data rarely persist in any specific order, jeopardizing recovery and correctness. Recovery therefore needs primitives that explicitly control the order of updates to NVM, known as persistency models. We outline the precise specification of a novel persistency model, Release Persistency (RP), that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitectural mechanism, lazy release persistence (LRP). Using standard LFD benchmarks, we show that LRP achieves fast recovery while incurring minimal performance overhead. We continue our discussion with memory disaggregation, which decouples memory from traditional monolithic servers and offers a promising pathway to very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where compute and memory are independent. However, there is a challenge: disaggregated memory (DM) does not support RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, that is specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes). Pandora is the first one-sided transactional protocol that ensures correct, non-blocking, and fast recovery in DKVSes. Our experimental implementation demonstrates that Pandora achieves fast recovery and high availability while causing minimal disruption to services. Finally, we introduce a novel targeted litmus-testing framework, DART, to validate the end-to-end correctness of transactional protocols with recovery. Using DART's targeted testing capabilities, we found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, eliminating any intervention from programmers.
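
    A toy sketch (not the thesis's Release Persistency or LRP mechanism) of why recovery for log-free data structures needs explicit persist-ordering primitives: without a barrier between a payload store and its commit flag, an arbitrary subset of pending stores may survive a crash, and recovery can observe the flag without the payload. All names here are illustrative.

    import random

    class ToyNVM:
        """Toy model: stores reach NVM in an arbitrary order unless a
        persist barrier forces all pending stores to drain first."""

        def __init__(self):
            self.persisted = {}   # what survives a crash
            self.pending = []     # stores issued but not yet guaranteed durable

        def store(self, key, value):
            self.pending.append((key, value))

        def persist_barrier(self):
            # Drain every pending store before any later store may persist.
            for key, value in self.pending:
                self.persisted[key] = value
            self.pending = []

        def crash(self):
            # On a crash, an arbitrary subset of pending stores has persisted.
            survivors = random.sample(self.pending, random.randint(0, len(self.pending)))
            for key, value in survivors:
                self.persisted[key] = value
            self.pending = []
            return dict(self.persisted)

    def publish(nvm, payload, use_barrier):
        nvm.store("payload", payload)
        if use_barrier:
            nvm.persist_barrier()       # payload is durable before the flag
        nvm.store("valid", True)        # commit flag, as in a log-free insert

    def recover(state):
        # Recovery trusts the payload only if the commit flag persisted.
        if state.get("valid") and "payload" not in state:
            return "INCONSISTENT: flag persisted before its payload"
        return "consistent"

    if __name__ == "__main__":
        for use_barrier in (False, True):
            outcomes = set()
            for _ in range(1000):
                nvm = ToyNVM()
                publish(nvm, payload=42, use_barrier=use_barrier)
                outcomes.add(recover(nvm.crash()))
            print("barrier" if use_barrier else "no barrier", outcomes)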

    Exploring Hardware Fault Impacts on Different Real Number Representations of the Structural Resilience of TCUs in GPUs

    The most recent generations of graphics processing units (GPUs) boost the execution of the convolutional operations required by machine learning applications by resorting to specialized and efficient in-chip accelerators (Tensor Core Units, or TCUs) that operate on matrix-multiplication tiles. Unfortunately, modern cutting-edge semiconductor technologies are increasingly prone to hardware defects, and the trend of highly stressing TCUs during the execution of safety-critical and high-performance computing (HPC) applications increases the likelihood of TCUs producing different kinds of failures. In fact, the intrinsic resilience of arithmetic units to hardware faults plays a crucial role in safety-critical applications using GPUs (e.g., in automotive, space, and autonomous robotics). Recently, new arithmetic formats have been proposed, particularly ones suited to neural network execution. However, the reliability characterization of TCUs supporting different arithmetic formats has been lacking. In this work, we quantitatively assessed the impact of hardware faults in TCU structures while employing two distinct formats (floating-point and posit) and two different configurations (16 and 32 bits) to represent real numbers. For the experimental evaluation, we resorted to an architectural description of a TCU core (PyOpenTCU) and performed 120 fault-simulation campaigns, injecting around 200,000 faults per campaign and requiring around 32 days of computation. Our results demonstrate that TCUs using the posit format are less affected by faults than those using floating-point (by up to three orders of magnitude for 16 bits and up to twenty orders for 32 bits). We also identified the most sensitive fault locations (i.e., those that produce the largest errors), thus paving the way for adopting smart hardening solutions.
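
    As a rough illustration of the kind of analysis involved (this is not the PyOpenTCU flow, and posit arithmetic is not shown, since no standard library for it is assumed here), the following sketch injects a single bit flip into one float16 product of a dot product and sweeps the fault over bit positions to show how the error magnitude depends on where the fault lands.

    import numpy as np

    def flip_bit_fp16(x, bit):
        """Flip one bit of a float16 value (a crude model of a datapath fault)."""
        raw = np.array([x], dtype=np.float16).view(np.uint16)
        raw[0] ^= np.uint16(1 << bit)
        return raw.view(np.float16)[0]

    def faulty_dot(a, b, lane, bit):
        """Dot product with a single bit flipped in one product term."""
        prods = (a * b).astype(np.float16)
        prods[lane] = flip_bit_fp16(prods[lane], bit)
        return float(prods.astype(np.float32).sum())

    rng = np.random.default_rng(0)
    a = rng.standard_normal(16).astype(np.float16)
    b = rng.standard_normal(16).astype(np.float16)
    golden = float((a * b).astype(np.float16).astype(np.float32).sum())

    # Sweep the injected fault over every bit position of one product term.
    for bit in range(16):
        err = abs(faulty_dot(a, b, lane=3, bit=bit) - golden)
        print(f"bit {bit:2d}: |error| = {err:.4g}")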

    Memory built-in self-repair and correction for improving yield: a review

    Nanometer memories are highly prone to defects due to their dense structure, making memory built-in self-repair a must-have feature for improving yield. Today’s systems-on-chip contain memories that occupy as much as 90% of the chip area. Shrinking technology imposes stricter design rules on memories, making them more prone to manufacturing defects. Further, the use of 3D-stacked memories makes systems vulnerable to newer defects, such as those arising from through-silicon vias (TSVs) and micro-bumps. Increasing memory size also results in more soft errors during system operation. Multiple memory repair techniques based on redundancy and correction codes have been presented to recover from such defects and prevent system failures. This paper reviews recently published memory repair methodologies, including various built-in self-repair (BISR) architectures, repair analysis algorithms, in-system repair, and soft-error handling using error-correcting codes (ECC). It classifies these techniques by method and usage. Finally, it reviews the evaluation methods used to determine the effectiveness of repair algorithms. The paper aims to present a survey of these methodologies and to provide a platform for developing repair methods for upcoming generations of memories.
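
    For illustration only, here is a minimal Hamming(7,4) single-error-correcting code, the style of ECC that in-system repair schemes pair with redundancy-based repair; production memories typically use wider SEC-DED or stronger codes.

    # Minimal Hamming(7,4) single-error-correcting code (illustrative sketch).

    def encode(d):                       # d: list of 4 data bits
        p1 = d[0] ^ d[1] ^ d[3]
        p2 = d[0] ^ d[2] ^ d[3]
        p3 = d[1] ^ d[2] ^ d[3]
        # Codeword positions 1..7: p1 p2 d0 p3 d1 d2 d3
        return [p1, p2, d[0], p3, d[1], d[2], d[3]]

    def decode(c):                       # c: list of 7 code bits
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s3  # 1-based index of the flipped bit, 0 if none
        c = list(c)
        if syndrome:
            c[syndrome - 1] ^= 1         # correct the single-bit error
        return [c[2], c[4], c[5], c[6]]  # recovered data bits

    data = [1, 0, 1, 1]
    word = encode(data)
    word[5] ^= 1                         # inject a single-bit soft error
    assert decode(word) == data
    print("corrected data:", decode(word))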

    Ecology of methanotrophs in a landfill methane biofilter

    Decomposing landfill waste is a significant anthropogenic source of the potent climate-active gas methane (CH₄). To mitigate fugitive methane emissions, Norfolk County Council are trialling a landfill biofilter designed to harness the methane-oxidising potential of methanotrophic bacteria. These methanotrophs can convert CH₄ to CO₂ or biomass and act as CH₄ sinks. The most active CH₄-oxidising regions of the Strumpshaw biofilter were identified from in-situ temperature, CH₄, O₂ and CO₂ profiles, while soil CH₄ oxidation potential was estimated and used to confirm methanotroph activity and to determine optimal soil moisture conditions for CH₄ oxidation. Most CH₄ oxidation was observed to occur in the top 60 cm of the biofilter (up to 50% of the CH₄ input) at temperatures around 50°C, with optimal soil moisture of 10-27.5%. A decrease in in-situ temperature following interruption of the CH₄ supply suggested that the high biofilter temperatures were driven by CH₄ oxidation. The biofilter soil bacterial community was profiled by 16S rRNA gene analysis, with methanotrophs accounting for ~5-10% of bacteria. Active methanotrophs at a range of incubation temperatures were identified by ¹³CH₄ DNA stable-isotope probing coupled with 16S rRNA gene amplicon and metagenome analysis. These methods identified Methylocella, Methylobacter, Methylocystis and Crenothrix as potential CH₄ oxidisers at the lower temperatures (30°C/37°C) observed following system start-up or gas-feed interruption. At the higher temperatures typical of established biofilter operation (45°C/50°C), Methylocaldum and an unassigned Methylococcaceae species were the dominant active methanotrophs. Finally, the novel methanotrophs Methylococcus capsulatus (Norfolk) and Methylocaldum szegediense (Norfolk) were isolated from biofilter soil enrichments. Methylocaldum szegediense (Norfolk) may be very closely related to, or the same species as, one of the most abundant active methanotrophs in a metagenome from a 50°C biofilter soil incubation, based on genome-to-MAG similarity. This isolate was capable of growth over a broad temperature range (37-62°C), including the higher in-situ biofilter temperatures (>50°C).

    Secure storage systems for untrusted cloud environments

    The cloud has become established for applications that need to be scalable and highly available. However, moving data to data centers owned and operated by a third party, i.e., the cloud provider, raises security concerns: a cloud provider could easily access and manipulate the data or the program flow, preventing the cloud from being used for certain applications, such as medical or financial ones. Hardware vendors are addressing these concerns by developing Trusted Execution Environments (TEEs), which make the CPU state and parts of memory inaccessible to the host software. While TEEs protect the current execution state, they do not provide security guarantees for data that does not fit or reside in the protected memory area, such as data on the network or in persistent storage. In this work, we address these limitations of TEEs in three ways: first, we extend the trust of TEEs to persistent storage; second, we extend the trust across the network to multiple nodes; and third, we propose a compiler-based solution for accessing heterogeneous memory regions. More specifically:
    • SPEICHER extends the trust provided by TEEs to persistent storage. SPEICHER implements a key-value interface. Its design is based on LSM data structures but extends them to provide confidentiality, integrity, and freshness for the stored data. Thus, SPEICHER can prove to the client that the data has not been tampered with by an attacker.
    • AVOCADO is a distributed in-memory key-value store (KVS) that extends the trust TEEs provide across the network to multiple nodes, allowing KVSs to scale beyond the boundaries of a single node. On each node, AVOCADO carefully divides data between trusted memory and untrusted host memory to maximize the amount of data that can be stored on each node. AVOCADO leverages the fact that network attacks can be modeled as crash faults in order to trust other nodes via a hardened ABD replication protocol.
    • TOAST is based on the observation that modern high-performance systems often use several heterogeneous memory regions that are not easily distinguishable by the programmer. The number of regions is further increased by the fact that TEEs divide memory into trusted and untrusted regions. TOAST is a compiler-based approach that unifies access to the different heterogeneous memory regions and provides programmability and portability. TOAST uses a load/store interface to abstract most library interfaces for the different memory regions.
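
    A conceptual sketch (not SPEICHER's actual LSM-based design) of how integrity and freshness can be enforced from inside a TEE: values in untrusted storage are bound by a MAC to a per-key version counter kept in trusted memory, so tampering and rollback are detected. Confidentiality would additionally require authenticated encryption; the class and key names below are hypothetical.

    import hmac, hashlib

    class UntrustedStore:
        """Plays the role of host-controlled storage an attacker may roll back."""
        def __init__(self):
            self._data = {}
        def put(self, key, blob):
            self._data[key] = blob
        def get(self, key):
            return self._data[key]

    class TrustedFrontend:
        """Sketch: MACs bind each value to a per-key version counter kept in
        trusted memory, so stale (rolled-back) or modified data is detected."""
        def __init__(self, store, mac_key):
            self.store = store
            self.mac_key = mac_key
            self.versions = {}              # trusted state: key -> latest version

        def _tag(self, key, version, value):
            msg = f"{key}|{version}|".encode() + value
            return hmac.new(self.mac_key, msg, hashlib.sha256).digest()

        def put(self, key, value):
            version = self.versions.get(key, 0) + 1
            self.versions[key] = version
            self.store.put(key, (version, value, self._tag(key, version, value)))

        def get(self, key):
            version, value, tag = self.store.get(key)
            if not hmac.compare_digest(tag, self._tag(key, version, value)):
                raise ValueError("integrity violation")
            if version != self.versions[key]:
                raise ValueError("freshness violation: stale or rolled-back value")
            return value

    store = UntrustedStore()
    kv = TrustedFrontend(store, mac_key=b"key-sealed-inside-the-TEE")
    kv.put("x", b"v1")
    old = store.get("x")                    # attacker snapshots the old record
    kv.put("x", b"v2")
    store.put("x", old)                     # ...and rolls the store back
    try:
        kv.get("x")
    except ValueError as e:
        print(e)                            # freshness violation detected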

    Pathway: a fast and flexible unified stream data processing framework for analytical and Machine Learning applications

    We present Pathway, a new unified data processing framework that can run workloads on both bounded and unbounded data streams. The framework was created with the original motivation of resolving challenges faced when analyzing and processing data from the physical economy, including streams of data generated by IoT and enterprise systems. These workloads require rapid reaction while also calling for advanced computation paradigms (machine-learning-powered analytics, contextual analysis, and other elements of complex event processing). Pathway is equipped with a Table API tailored for Python and Python/SQL workflows, and is powered by a distributed incremental dataflow in Rust. We describe the system and present benchmarking results demonstrating its capabilities in both batch and streaming contexts, where it is able to surpass state-of-the-art industry frameworks. We also discuss streaming use cases handled by Pathway that cannot easily be addressed with state-of-the-art industry frameworks, such as streaming iterative graph algorithms (PageRank, etc.).
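
    A toy sketch of the incremental-dataflow idea underlying such unified frameworks (this is not Pathway's Table API): an operator consumes insert/retract deltas and maintains its aggregate, so the same code serves a bounded batch and an unbounded stream of late updates.

    from collections import defaultdict

    class IncrementalGroupSum:
        """Toy incremental operator: maintains per-key sums from a stream of
        (key, value, diff) deltas, where diff is +1 for insert, -1 for retract."""

        def __init__(self):
            self.sums = defaultdict(float)

        def apply(self, deltas):
            for key, value, diff in deltas:
                self.sums[key] += diff * value
            return dict(self.sums)

    op = IncrementalGroupSum()
    print(op.apply([("a", 10, +1), ("b", 5, +1)]))   # bounded batch: {'a': 10, 'b': 5}
    print(op.apply([("a", 10, -1), ("a", 12, +1)]))  # late update to 'a': {'a': 12, 'b': 5}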

    A Literature Review of Fault Diagnosis Based on Ensemble Learning

    The accuracy of fault diagnosis is an important indicator for ensuring the reliability of key equipment systems. Ensemble learning integrates different weak learners to obtain a stronger learner and has achieved remarkable results in the field of fault diagnosis. This paper reviews recent research on ensemble learning from both technical and application perspectives. It covers 209 papers drawn from 87 journals in recent Web of Science records and other academic resources, and summarizes 78 different ensemble-learning-based fault diagnosis methods involving 18 public datasets and more than 20 different equipment systems. In detail, the paper summarizes the accuracy, fault classification types, fault datasets, data signals used, learners (traditional machine learning or deep-learning-based), and ensemble methods (bagging, boosting, stacking and other ensemble models) of these fault diagnosis models. It uses fault diagnosis accuracy as the main evaluation metric, supplemented by generalization and the ability to handle imbalanced data, to evaluate the performance of these ensemble learning methods. The discussion and evaluation of these methods provide valuable references for identifying and developing appropriate intelligent fault diagnosis models for various equipment. The paper also discusses the technical challenges, lessons learned from the review, and future development directions in the field of ensemble-learning-based fault diagnosis and intelligent maintenance.
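
    A minimal scikit-learn sketch comparing bagging, boosting, and stacking on a synthetic stand-in for a fault-classification dataset; the features and classes are hypothetical and only illustrate the model families the review categorizes.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                                  StackingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for vibration/current features with 4 fault classes.
    X, y = make_classification(n_samples=2000, n_features=20, n_informative=12,
                               n_classes=4, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    models = {
        "bagging (trees)": BaggingClassifier(DecisionTreeClassifier(),
                                             n_estimators=50, random_state=0),
        "boosting": GradientBoostingClassifier(random_state=0),
        "stacking": StackingClassifier(
            estimators=[("svm", SVC(probability=True)),
                        ("tree", DecisionTreeClassifier())],
            final_estimator=LogisticRegression(max_iter=1000)),
    }
    for name, model in models.items():
        acc = model.fit(X_tr, y_tr).score(X_te, y_te)
        print(f"{name:16s} accuracy = {acc:.3f}")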

    Deciphering Radio Emission from Solar Coronal Mass Ejections using High-fidelity Spectropolarimetric Radio Imaging

    Coronal mass ejections (CMEs) are large-scale expulsions of plasma and magnetic fields from the Sun into the heliosphere and are the most important driver of space weather. The geo-effectiveness of a CME is primarily determined by its magnetic field strength and topology. Measurement of CME magnetic fields, both in the corona and in the heliosphere, is essential for improving space weather forecasting. Observations at radio wavelengths provide several remote measurement tools for estimating both the strength and the topology of CME magnetic fields. Among them, gyrosynchrotron (GS) emission produced by mildly relativistic electrons trapped in CME magnetic fields is one of the most promising methods for estimating the magnetic field strength of CMEs at lower and middle coronal heights. However, GS emission from some parts of a CME is much fainter than the quiet-Sun emission and requires high-dynamic-range (DR) imaging for its detection. This thesis presents a state-of-the-art calibration and imaging algorithm capable of routinely producing high-DR spectropolarimetric snapshot solar radio images using data from a new-technology radio telescope, the Murchison Widefield Array. This allows us to detect much fainter GS emission from CME plasma at much greater coronal heights. For the first time, robust circular polarization measurements have been used jointly with total intensity measurements to constrain the GS model parameters, which has significantly improved the robustness of the estimated parameters. We also find observational evidence that the routinely used homogeneous and isotropic GS models may not always be sufficient to model the observations. In the future, with upcoming sensitive telescopes and physics-based forward models, it should be possible to relax some of these assumptions and make this method more robust for estimating CME plasma parameters at coronal heights.
    Comment: 297 pages, 100 figures, 9 tables. Ph.D. thesis submitted at Tata Institute of Fundamental Research, Mumbai, India.
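
    A minimal numpy sketch, not the thesis pipeline, of the kind of quantity involved: the circular polarization fraction V/I over high-significance pixels of hypothetical Stokes I and V snapshot images, with the image dynamic range estimated from an empty region. The array shapes, source region, and thresholds are all assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical Stokes I and V snapshot images (arbitrary flux units),
    # noise-dominated away from the source; in practice these come from
    # the calibrated spectropolarimetric maps.
    stokes_i = rng.normal(0.0, 0.01, (256, 256))
    stokes_v = rng.normal(0.0, 0.002, (256, 256))
    stokes_i[100:110, 100:110] += 5.0          # a bright source region
    stokes_v[100:110, 100:110] += 0.4          # a few percent circular polarization

    noise_i = np.std(stokes_i[:50, :50])       # rms from an empty corner
    mask = stokes_i > 10 * noise_i             # keep only high-significance pixels

    pol_frac = np.full_like(stokes_i, np.nan)
    pol_frac[mask] = stokes_v[mask] / stokes_i[mask]

    print(f"image dynamic range ~ {stokes_i.max() / noise_i:.0f}")
    print(f"median |V|/I over the source ~ {np.nanmedian(np.abs(pol_frac)):.3f}")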

    A distributed and energy‑efficient KNN for EEG classification with dynamic money‑saving policy in heterogeneous clusters

    Get PDF
    Universidad de Granada/CBUA. Spanish Ministry of Science, Innovation, and Universities under Grants PGC2018-098813-B-C31 and PID2022-137461NB-C32. ERDF funds. Funding for open access charge: University of Granada/CBUA.