TANDEM: taming failures in next-generation datacenters with emerging memory
The explosive growth of online services, reaching unforeseen scales, has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery that minimizes service interruptions.
To support recovery, applications must take additional measures during failure-free execution to maintain a recoverable state of data and computation logic. However, these precautionary measures have
severe implications for performance, correctness, and programmability, making recovery incredibly challenging to realize in practice.
Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges: their architectural attributes differ significantly from those of traditional memory devices, introducing new semantic challenges for
implementing recovery and complicating correctness and programmability.
Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges.
When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness, including recovery; and (4) how to preserve programmer productivity (programmability).
This thesis aims to address these challenges through the following approaches: (a)
defining precise consistency models that formally specify correct end-to-end semantics
in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery.
We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor's load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data rarely persist in any specific order, jeopardizing recovery and correctness. Recovery therefore needs primitives that explicitly control the order of updates to NVM (known as persistency models). We outline the precise specification of a novel persistency model, Release Persistency (RP), that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitectural mechanism, lazy release persistence (LRP). Using standard LFD benchmarks, we show that LRP achieves fast recovery while incurring minimal overhead on performance.
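To make the ordering hazard concrete, the following toy Python model (our illustration with invented names such as ToyNVM; the real mechanism is a hardware persistency model) shows how a release-annotated store constrains what can survive a crash:

    import random

    class ToyNVM:
        """Toy crash model of persist ordering. Plain stores may become
        durable in any order; a store marked `release` becomes durable
        only after every program-order-earlier store, which is the
        guarantee Release Persistency (RP) is described as providing."""

        def __init__(self):
            self.stores = []   # (addr, value, is_release) in program order

        def store(self, addr, value, release=False):
            self.stores.append((addr, value, release))

        def crash(self):
            """Return one possible durable state after a crash."""
            persisted = [random.random() < 0.5 for _ in self.stores]
            # RP constraint: a durable release implies that all earlier
            # stores are durable too.
            for i, (_, _, release) in enumerate(self.stores):
                if release and persisted[i]:
                    for j in range(i):
                        persisted[j] = True
            durable = {}
            for ok, (addr, value, _) in zip(persisted, self.stores):
                if ok:
                    durable[addr] = value
            return durable

    # Publishing a node in a log-free structure: write the payload, then
    # release-store the pointer that makes it visible. If "head" survives
    # the crash, RP guarantees the payload survived as well.
    nvm = ToyNVM()
    nvm.store("node.payload", 42)
    nvm.store("head", "node", release=True)
    state = nvm.crash()
    assert "head" not in state or state["node.payload"] == 42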
We continue our discussion with memory disaggregation, which decouples memory from traditional monolithic servers and offers a promising pathway to very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where compute and memory are independent. There is a challenge, however: disaggregated memory (DM) does not work with RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, that is specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes).
Pandora is the first one-sided transactional protocol that ensures correct, non-blocking, and fast recovery in DKVSes. Experiments with our implementation artifacts demonstrate that Pandora achieves fast recovery and high availability while causing minimal disruption to services.
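The ordering pitfall of one-sided transactions can be sketched in a few lines of Python (our simplification with invented names; Pandora's actual protocol, record layout, and recovery path are described in the thesis). Because one-sided operations touch remote memory directly, the client must enforce the ordering itself: lock, write the payload, publish the version, release. Swapping the payload write and the version/lock release would let a concurrent reader, or the recovery procedure, observe a torn value.

    import threading

    remote = {"lock": 0, "version": 0, "value": None}  # memory-node record
    _nic = threading.Lock()  # stands in for the NIC's single-word atomicity

    def rdma_read(addr):
        return remote[addr]

    def rdma_write(addr, val):
        remote[addr] = val

    def rdma_cas(addr, expected, new):
        with _nic:
            if remote[addr] == expected:
                remote[addr] = new
                return True
            return False

    def commit(new_value):
        version = rdma_read("version")
        if not rdma_cas("lock", 0, 1):       # 1. lock the record
            return False                     #    another client owns it
        rdma_write("value", new_value)       # 2. write the payload
        rdma_write("version", version + 1)   # 3. publish the new version
        rdma_write("lock", 0)                # 4. release the lock
        return True

    assert commit("v1") and rdma_read("version") == 1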
Finally, we introduce a novel targeted litmus-testing framework, DART, to validate the end-to-end correctness of transactional protocols with recovery. Using DART's targeted testing capabilities, we have found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, thereby requiring no intervention from programmers.
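For flavor, here is a minimal Python harness in the spirit of such black-box crash testing (the harness, names, and deliberately buggy store are our invention, not DART's interface): drive a tiny transactional workload, crash it at every possible step, recover, and check the surviving state against the allowed outcomes.

    LEGAL = {(0, 0), (1, 1)}  # txn sets x=1 and y=1 atomically, or not at all

    class ToyStore:
        """A deliberately buggy store: transactional writes hit 'disk'
        immediately with no log, so a mid-transaction crash tears state."""
        def __init__(self):
            self.disk = {}
        def txn_write(self, key, val):
            self.disk[key] = val   # bug: neither staged nor logged
        def commit(self):
            pass
        def crash_and_recover(self):
            pass                   # nothing to undo: the bug to be caught
        def get(self, key):
            return self.disk.get(key, 0)

    def run_litmus(store, crash_step):
        txn = [("x", 1), ("y", 1)]
        for step, (key, val) in enumerate(txn):
            if step == crash_step:
                break              # simulate a crash at this step
            store.txn_write(key, val)
        else:
            store.commit()
        store.crash_and_recover()
        state = (store.get("x"), store.get("y"))
        assert state in LEGAL, f"crash at step {crash_step}: torn state {state}"

    for crash_step in (0, 1, 2):
        try:
            run_litmus(ToyStore(), crash_step)
        except AssertionError as err:
            print(err)             # flags the non-atomic outcome (1, 0)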
Exploring Hardware Fault Impacts on Different Real Number Representations of the Structural Resilience of TCUs in GPUs
The most recent generations of graphics processing units (GPUs) boost the execution of the convolutional operations required by machine learning applications by resorting to specialized and efficient in-chip accelerators (Tensor Core Units, or TCUs) that operate on matrix multiplication tiles. Unfortunately, modern cutting-edge semiconductor technologies are increasingly prone to hardware defects, and the trend of heavily stressing TCUs during the execution of safety-critical and high-performance computing (HPC) applications increases the likelihood of TCUs producing different kinds of failures. In fact, the intrinsic resiliency of arithmetic units to hardware faults plays a crucial role in safety-critical applications using GPUs (e.g., in automotive, space, and autonomous robotics). Recently, new arithmetic formats have been proposed, particularly those suited to neural network execution. However, a reliability characterization of TCUs supporting different arithmetic formats has been lacking. In this work, we quantitatively assessed the impact of hardware faults in TCU structures while employing two distinct formats (floating-point and posit) and two different configurations (16 and 32 bits) to represent real numbers. For the experimental evaluation, we resorted to an architectural description of a TCU core (PyOpenTCU) and performed 120 fault simulation campaigns, injecting around 200,000 faults per campaign and requiring around 32 days of computation. Our results demonstrate that the posit format is less affected by faults than the floating-point one (by up to three orders of magnitude for 16 bits and up to twenty orders of magnitude for 32 bits). We also identified the most sensitive fault locations (i.e., those that produce the largest errors), thus paving the way for adopting smart hardening solutions.
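As a concrete illustration of the sensitivity being measured, the NumPy toy experiment below flips single bits of a half-precision operand and observes the resulting dot-product error (our simplification; it models neither PyOpenTCU's internal fault sites nor the posit format, which NumPy does not provide):

    import numpy as np

    def flip_bit(x16, bit):
        """Flip one bit of a float16 scalar via its 16-bit integer view."""
        return (x16.view(np.uint16) ^ np.uint16(1 << bit)).view(np.float16)

    rng = np.random.default_rng(0)
    a = rng.standard_normal(16).astype(np.float16)
    b = rng.standard_normal(16).astype(np.float16)
    golden = float(np.dot(a, b))

    for bit in range(16):            # walk every bit position of a[0]
        faulty = a.copy()
        faulty[0] = flip_bit(a[0], bit)
        error = abs(float(np.dot(faulty, b)) - golden)
        print(f"bit {bit:2d}: |output error| = {error:.4g}")
    # Flips in the exponent bits (10-14) typically dominate the error,
    # which is why the number format strongly shapes fault resilience.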
Memory built-in self-repair and correction for improving yield: a review
Nanometer memories are highly prone to defects due to their dense structure, making memory built-in self-repair a must-have feature for improving yield. Today's system-on-chips contain memories occupying as much as 90% of the chip area. Shrinking technology imposes stricter design rules on memories, making them more prone to manufacturing defects. Further, the use of 3D-stacked memories makes systems vulnerable to newer defects, such as those originating from through-silicon vias (TSVs) and micro-bumps. The increased memory size is also resulting in an increase in soft errors during system operation. Multiple memory repair techniques based on redundancy and correction codes have been presented to recover from such defects and prevent system failures. This paper reviews recently published memory repair methodologies, including various built-in self-repair (BISR) architectures, repair analysis algorithms, in-system repair, and soft error handling using error-correcting codes (ECC). It provides a classification of these techniques based on method and usage. Finally, it reviews the evaluation methods used to determine the effectiveness of the repair algorithms. The paper aims to present a survey of these methodologies and to prepare a platform for developing repair methods for upcoming generations of memories.
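As a reminder of the ECC principle that the surveyed soft-error schemes build on, here is a minimal Hamming(7,4) single-error-correcting round trip in Python (illustrative only; production memory ECC typically uses wider SEC-DED codes such as (72,64)):

    def hamming74_encode(d):
        """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
        p1 = d[0] ^ d[1] ^ d[3]           # parity over positions 1,3,5,7
        p2 = d[0] ^ d[2] ^ d[3]           # parity over positions 2,3,6,7
        p3 = d[1] ^ d[2] ^ d[3]           # parity over positions 4,5,6,7
        return [p1, p2, d[0], p3, d[1], d[2], d[3]]

    def hamming74_correct(c):
        """Correct up to one flipped bit and return the 4 data bits."""
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s3   # 0 means no error detected
        if syndrome:
            c[syndrome - 1] ^= 1          # flip the faulty position back
        return [c[2], c[4], c[5], c[6]]

    word = [1, 0, 1, 1]
    code = hamming74_encode(word)
    code[4] ^= 1                          # inject a single soft error
    assert hamming74_correct(code) == word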
Ecology of methanotrophs in a landfill methane biofilter
Decomposing landfill waste is a significant anthropogenic source of the potent climate-active gas methane (CH₄). To mitigate fugitive methane emissions, Norfolk County Council are trialling a landfill biofilter designed to harness the methane-oxidizing potential of methanotrophic bacteria. These methanotrophs can convert CH₄ to CO₂ or biomass and act as CH₄ sinks.
The most active CH₄-oxidising regions of the Strumpshaw biofilter were identified from in-situ temperature, CH₄, O₂, and CO₂ profiles, while soil CH₄ oxidation potential was estimated and used to confirm methanotroph activity and determine optimal soil moisture conditions for CH₄ oxidation. Most CH₄ oxidation was observed to occur in the top 60 cm of the biofilter (up to 50% of CH₄ input) at temperatures around 50°C, with optimal soil moisture of 10-27.5%. A decrease in in-situ temperature following CH₄ supply interruption suggested that the high biofilter temperatures were driven by CH₄ oxidation.
The biofilter soil bacterial community was profiled by 16S rRNA gene analysis, with methanotrophs accounting for ~5-10% of bacteria. Active methanotrophs at a range of incubation temperatures were identified by ¹³CH₄ DNA stable-isotope probing coupled with 16S rRNA gene amplicon and metagenome analysis. These methods identified Methylocella, Methylobacter, Methylocystis and Crenothrix as potential CH₄ oxidisers at the lower temperatures (30°C/37°C) observed following system start-up or gas-feed interruption. At higher temperatures typical of established biofilter operation (45°C/50°C), Methylocaldum and an unassigned Methylococcaceae species were the dominant active methanotrophs.
Finally, the novel methanotrophs Methylococcus capsulatus (Norfolk) and Methylocaldum szegediense (Norfolk) were isolated from biofilter soil enrichments. Based on genome-to-MAG similarity, Methylocaldum szegediense (Norfolk) may be very closely related to, or the same species as, one of the most abundant active methanotrophs in a metagenome from a 50°C biofilter soil incubation. This isolate was capable of growth over a broad temperature range (37-62°C), including the higher in-situ biofilter temperatures (>50°C).
Secure storage systems for untrusted cloud environments
The cloud has become established for applications that need to be scalable and highly
available. However, moving data to data centers owned and operated by a third party,
i.e., the cloud provider, raises security concerns because a cloud provider could easily
access and manipulate the data or the program flow, preventing the cloud from being
used for certain applications, such as medical or financial ones.
Hardware vendors are addressing these concerns by developing Trusted Execution
Environments (TEEs) that make the CPU state and parts of memory inaccessible from
the host software. While TEEs protect the current execution state, they do not provide
security guarantees for data that cannot fit or reside in the protected memory
area, such as data on the network or in persistent storage.
In this work, we aim to address these limitations of TEEs in three different ways: first,
we extend the trust provided by TEEs to persistent storage; second, we extend that trust
to multiple nodes in a network; and third, we propose a compiler-based solution for
accessing heterogeneous memory regions. More specifically:
• SPEICHER extends the trust provided by TEEs to persistent storage. SPEICHER
implements a key-value interface. Its design is based on LSM data structures but
extends them to provide confidentiality, integrity, and freshness for the stored
data. Thus, SPEICHER can prove to the client that the data has not been tampered
with by an attacker (a toy sketch of such an integrity/freshness check follows this list).
• AVOCADO is a distributed in-memory key-value store (KVS) that extends the
trust that TEEs provide across the network to multiple nodes, allowing KVSs to
scale beyond the boundaries of a single node. On each node, AVOCADO carefully
divides data between trusted memory and untrusted host memory to maximize
the amount of data that can be stored on each node. AVOCADO leverages the
fact that network attacks can be modeled as crash faults, allowing nodes to trust
one another through a hardened ABD replication protocol.
• TOAST is based on the observation that modern high-performance systems
often use several heterogeneous memory regions that are not easily
distinguishable by the programmer. The number of regions is further increased
by TEEs dividing memory into trusted and untrusted regions. TOAST is a
compiler-based approach that unifies access to these heterogeneous memory
regions and provides programmability and portability. It uses a load/store
interface to abstract most library interfaces for the different memory regions.
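To illustrate the integrity/freshness idea referenced in the SPEICHER item above, here is a toy Python sketch (our own naming and simplification, not SPEICHER's LSM-based design): the TEE keeps only a secret key and per-record version counters in trusted state, while the untrusted store holds values tagged with a MAC binding key, value, and version, so stale or tampered data fails verification.

    import hashlib
    import hmac

    TRUSTED_KEY = b"sealed-inside-the-TEE"   # never leaves trusted memory

    class ToyVerifiedKV:
        def __init__(self):
            self.counter = {}      # trusted state: latest version per key
            self.untrusted = {}    # untrusted host memory / disk

        def _mac(self, key, value, version):
            msg = f"{key}|{value}|{version}".encode()
            return hmac.new(TRUSTED_KEY, msg, hashlib.sha256).digest()

        def put(self, key, value):
            version = self.counter.get(key, 0) + 1
            self.counter[key] = version
            self.untrusted[key] = (value, version,
                                   self._mac(key, value, version))

        def get(self, key):
            value, version, tag = self.untrusted[key]
            if version != self.counter[key]:
                raise ValueError("rollback (stale value) detected")
            if not hmac.compare_digest(tag, self._mac(key, value, version)):
                raise ValueError("tampering detected")
            return value

    kv = ToyVerifiedKV()
    kv.put("x", "v1")
    stale = kv.untrusted["x"]
    kv.put("x", "v2")
    kv.untrusted["x"] = stale     # attacker replays the old record
    try:
        kv.get("x")
    except ValueError as err:
        print(err)                # rollback (stale value) detected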
Pathway: a fast and flexible unified stream data processing framework for analytical and Machine Learning applications
We present Pathway, a new unified data processing framework that can run
workloads on both bounded and unbounded data streams. The framework was created
with the original motivation of resolving challenges faced when analyzing and
processing data from the physical economy, including streams of data generated
by IoT and enterprise systems. These required rapid reaction while calling for
the application of advanced computation paradigms (machine-learning-powered
analytics, contextual analysis, and other elements of complex event
processing). Pathway is equipped with a Table API tailored for Python and
Python/SQL workflows, and is powered by a distributed incremental dataflow in
Rust. We describe the system and present benchmarking results which demonstrate
its capabilities in both batch and streaming contexts, where it is able to
surpass state-of-the-art industry frameworks in both scenarios. We also discuss
streaming use cases handled by Pathway that cannot be easily resolved with
state-of-the-art industry frameworks, such as streaming iterative graph
algorithms (PageRank, etc.).
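For readers unfamiliar with the framework, a toy use of the Table API might look as follows (a sketch based on Pathway's publicly documented interface; treat the exact calls as assumptions and defer to the official documentation). The same groupby/reduce-style code is meant to run unchanged over unbounded streams, which is the framework's core claim.

    import pathway as pw

    # A small static table for illustration; in streaming mode the input
    # would come from a connector (e.g., Kafka) instead.
    readings = pw.debug.table_from_markdown("""
        sensor | reading
        a      | 10
        a      | 12
        b      | 7
    """)

    # Incrementally maintained aggregation: per-sensor totals.
    totals = readings.groupby(pw.this.sensor).reduce(
        pw.this.sensor,
        total=pw.reducers.sum(pw.this.reading),
    )

    pw.debug.compute_and_print(totals)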
A Literature Review of Fault Diagnosis Based on Ensemble Learning
The accuracy of fault diagnosis is an important indicator of the reliability of key equipment systems. Ensemble learning integrates different weak learners to obtain a stronger learner and has achieved remarkable results in the field of fault diagnosis. This paper reviews recent research on ensemble learning from both technical and field-application perspectives. It surveys a total of 209 papers from 87 journals indexed in recent Web of Science and other academic resources, and summarizes 78 different ensemble-learning-based fault diagnosis methods involving 18 public datasets and more than 20 different equipment systems. In detail, the paper summarizes the accuracy rates, fault classification types, fault datasets, data signals used, learners (traditional machine learning or deep-learning-based learners), and ensemble learning methods (bagging, boosting, stacking, and other ensemble models) of these fault diagnosis models. The paper uses fault diagnosis accuracy as the primary evaluation metric, supplemented by generalization ability and imbalanced-data handling, to evaluate the performance of those ensemble learning methods. The discussion and evaluation of these methods provide valuable references for identifying and developing appropriate intelligent fault diagnosis models for various equipment. The paper also discusses the technical challenges, lessons learned from the review, and future development directions in the field of ensemble-learning-based fault diagnosis and intelligent maintenance.
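As a minimal illustration of the bagging idea that recurs throughout the surveyed methods, the scikit-learn sketch below compares a single decision tree with its bagged ensemble on a synthetic, imbalanced multi-class "fault" dataset (purely illustrative; the data and settings are not drawn from any of the surveyed papers):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for a fault dataset: 4 fault classes, imbalanced.
    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=8, n_classes=4,
                               weights=[0.7, 0.1, 0.1, 0.1],
                               random_state=0)

    single = DecisionTreeClassifier(random_state=0)
    bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                               n_estimators=50, random_state=0)

    for name, model in [("single tree", single), ("bagging x50", bagged)]:
        accuracy = cross_val_score(model, X, y, cv=5).mean()
        print(f"{name}: mean CV accuracy = {accuracy:.3f}")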
Deciphering Radio Emission from Solar Coronal Mass Ejections using High-fidelity Spectropolarimetric Radio Imaging
Coronal mass ejections (CMEs) are large-scale expulsions of plasma and
magnetic fields from the Sun into the heliosphere and are the most important
driver of space weather. The geo-effectiveness of a CME is primarily determined
by its magnetic field strength and topology. Measurement of CME magnetic
fields, both in the corona and heliosphere, is essential for improving space
weather forecasting. Observations at radio wavelengths can provide several
remote measurement tools for estimating both strength and topology of the CME
magnetic fields. Among them, gyrosynchrotron (GS) emission produced by
mildly-relativistic electrons trapped in CME magnetic fields is one of the
promising methods to estimate magnetic field strength of CMEs at lower and
middle coronal heights. However, GS emissions from some parts of the CME are
much fainter than the quiet Sun emission and require high dynamic range (DR)
imaging for their detection. This thesis presents a state-of-the-art
calibration and imaging algorithm capable of routinely producing high DR
spectropolarimetric snapshot solar radio images using data from a new
technology radio telescope, the Murchison Widefield Array. This allows us to
detect much fainter GS emissions from CME plasma at much higher coronal
heights. For the first time, robust circular polarization measurements have
been jointly used with total intensity measurements to constrain the GS model
parameters, which has significantly improved the robustness of the estimated GS
model parameters. Observational evidence is also found that
routinely used homogeneous and isotropic GS models may not always be sufficient
to model the observations. In the future, with upcoming sensitive telescopes
and physics-based forward models, it should be possible to relax some of these
assumptions and make this method more robust for estimating CME plasma
parameters at coronal heights.
Comment: 297 pages, 100 figures, 9 tables. Submitted as a Ph.D. thesis at the Tata Institute of Fundamental Research, Mumbai, India.
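For reference, the dynamic range referred to in this abstract is conventionally defined in synthesis imaging as the ratio of the peak flux density in the image to the off-source rms noise (a standard working definition, not one specific to this thesis):

    \mathrm{DR} = S_{\mathrm{peak}} / \sigma_{\mathrm{rms}}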
A distributed and energy‑efficient KNN for EEG classification with dynamic money‑saving policy in heterogeneous clusters
Funding: Universidad de Granada/CBUA; Spanish Ministry of Science, Innovation, and Universities under Grants PGC2018-098813-B-C31 and PID2022-137461NB-C32; ERDF funds. Funding for open access charge: Universidad de Granada/CBUA.