1,276 research outputs found
TANDEM: taming failures in next-generation datacenters with emerging memory
The explosive growth of online services, leading to unforeseen scales, has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery, minimizing service interruptions.
Applications, owing to recovery, entail additional measures to maintain a recoverable state of data and computation logic during their failure-free execution. However, these precautionary measures have
severe implications on performance, correctness, and programmability, making recovery incredibly challenging to realize in practice.
Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges; Their distinct architectural attributes, differing significantly from traditional memory devices, introduce new semantic challenges for
implementing recovery, complicating correctness and programmability.
Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges.
When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness including recovery; and (4) how to preserve programmer productivity (Programmability).
This thesis aims to address these challenges through the following approaches: (a)
defining precise consistency models that formally specify correct end-to-end semantics
in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery.
We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor’s load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data hardly persist in any specific order, jeop-
ardizing recovery and correctness. Therefore, recovery needs primitives that explicitly control the order of updates to NVM (known as persistency models). We outline the precise specification of a novel persistency model – Release Persistency (RP) – that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitecture mechanism,
lazy release persistence (LRP). Using standard LFDs benchmarks, we show that LRP achieves fast recovery while incurring minimal overhead on performance.
We continue our discussion with memory disaggregation which decouples memory from traditional monolithic servers, offering a promising pathway for achieving very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where
compute and memory are independent. However, there is a challenge: disaggregated memory (DM) fails to work with RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level
ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, that is specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes).
Pandora is the first one-sided transactional protocol that ensures correct, non-blocking, and fast recovery in DKVS. Our experimental implementation artifacts demonstrate that Pandora achieves fast recovery and high availability while causing minimal disruption to services.
Finally, we introduce a novel target litmus-testing framework – DART – to validate the end-to-end correctness of transactional protocols with recovery. Using DART’s target testing capabilities, we have found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, thereby eliminating
any intervention from the programmers
Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems
This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems, reducing the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems
Towards a centralized multicore automotive system
Today’s automotive systems are inundated with embedded electronics to host chassis, powertrain, infotainment, advanced driver assistance systems, and other modern vehicle functions. As many as 100 embedded microcontrollers execute hundreds of millions of lines of code in a single vehicle. To control the increasing complexity in vehicle electronics and services, automakers are planning to consolidate different on-board automotive functions as software tasks on centralized multicore hardware platforms. However, these vehicle software services have different and contrasting timing, safety, and security requirements. Existing vehicle operating systems are ill-equipped to provide all the required service guarantees on a single machine. A centralized automotive system aims to tackle this by assigning software tasks to multiple criticality domains or levels according to their consequences of failures, or international safety standards like ISO 26262. This research investigates several emerging challenges in time-critical systems for a centralized multicore automotive platform and proposes a novel vehicle operating system framework to address them.
This thesis first introduces an integrated vehicle management system (VMS), called DriveOSâ„¢, for a PC-class multicore hardware platform. Its separation kernel design enables temporal and spatial isolation among critical and non-critical vehicle services in different domains on the same machine. Time- and safety-critical vehicle functions are implemented in a sandboxed Real-time Operating System (OS) domain, and non-critical software is developed in a sandboxed general-purpose OS (e.g., Linux, Android) domain. To leverage the advantages of model-driven vehicle function development, DriveOS provides a multi-domain application framework in Simulink. This thesis also presents a real-time task pipeline scheduling algorithm in multiprocessors for communication between connected vehicle services with end-to-end guarantees. The benefits and performance of the overall automotive system framework are demonstrated with hardware-in-the-loop testing using real-world applications, car datasets and simulated benchmarks, and with an early-stage deployment in a production-grade luxury electric vehicle
Secure storage systems for untrusted cloud environments
The cloud has become established for applications that need to be scalable and highly
available. However, moving data to data centers owned and operated by a third party,
i.e., the cloud provider, raises security concerns because a cloud provider could easily
access and manipulate the data or program flow, preventing the cloud from being
used for certain applications, like medical or financial.
Hardware vendors are addressing these concerns by developing Trusted Execution
Environments (TEEs) that make the CPU state and parts of memory inaccessible from
the host software. While TEEs protect the current execution state, they do not provide
security guarantees for data which does not fit nor reside in the protected memory
area, like network and persistent storage.
In this work, we aim to address TEEs’ limitations in three different ways, first we
provide the trust of TEEs to persistent storage, second we extend the trust to multiple
nodes in a network, and third we propose a compiler-based solution for accessing
heterogeneous memory regions. More specifically,
• SPEICHER extends the trust provided by TEEs to persistent storage. SPEICHER
implements a key-value interface. Its design is based on LSM data structures, but
extends them to provide confidentiality, integrity, and freshness for the stored
data. Thus, SPEICHER can prove to the client that the data has not been tampered
with by an attacker.
• AVOCADO is a distributed in-memory key-value store (KVS) that extends the
trust that TEEs provide across the network to multiple nodes, allowing KVSs to
scale beyond the boundaries of a single node. On each node, AVOCADO carefully
divides data between trusted memory and untrusted host memory, to maximize
the amount of data that can be stored on each node. AVOCADO leverages the
fact that we can model network attacks as crash-faults to trust other nodes with
a hardened ABD replication protocol.
• TOAST is based on the observation that modern high-performance systems
often use several different heterogeneous memory regions that are not easily
distinguishable by the programmer. The number of regions is increased by the
fact that TEEs divide memory into trusted and untrusted regions. TOAST is a
compiler-based approach to unify access to different heterogeneous memory
regions and provides programmability and portability. TOAST uses a
load/store interface to abstract most library interfaces for different memory
regions
BASALISC: Programmable Hardware Accelerator for BGV Fully Homomorphic Encryption
Fully Homomorphic Encryption (FHE) allows for secure computation on encrypted data. Unfortunately, huge memory size, computational cost and bandwidth requirements limit its practicality. We present BASALISC, an architecture family of hardware accelerators that aims to substantially accelerate FHE computations in the cloud. BASALISC is the first to implement the BGV scheme with fully-packed bootstrapping – the noise removal capability necessary for arbitrary-depth computation. It supports a customized version of bootstrapping that can be instantiated with hardware multipliers optimized for area and power.
BASALISC is a three-abstraction-layer RISC architecture, designed for a 1 GHz ASIC implementation and underway toward 150mm2 die tape-out in a 12nm GF process. BASALISC\u27s four-layer memory hierarchy includes a two-dimensional conflict-free inner memory layer that enables 32 Tb/s radix-256 NTT computations without pipeline stalls. Its conflict-resolution permutation hardware is generalized and re-used to compute BGV automorphisms without throughput penalty. BASALISC also has a custom multiply-accumulate unit to accelerate BGV key switching.
The BASALISC toolchain comprises a custom compiler and a joint performance and correctness simulator. To evaluate BASALISC, we study its physical realizability, emulate and formally verify its core functional units, and we study its performance on a set of benchmarks. Simulation results show a speedup of more than 5,000× over HElib – a popular software FHE library
QoS-aware architectures, technologies, and middleware for the cloud continuum
The recent trend of moving Cloud Computing capabilities to the Edge of the network is reshaping how applications and their middleware supports are designed, deployed, and operated. This new model envisions a continuum of virtual resources between the traditional cloud and the network edge, which is potentially more suitable to meet the heterogeneous Quality of Service (QoS) requirements of diverse application domains and next-generation applications. Several classes of advanced Internet of Things (IoT) applications, e.g., in the industrial manufacturing domain, are expected to serve a wide range of applications with heterogeneous QoS requirements and call for QoS management systems to guarantee/control performance indicators, even in the presence of real-world factors such as limited bandwidth and concurrent virtual resource utilization. The present dissertation proposes a comprehensive QoS-aware architecture that addresses the challenges of integrating cloud infrastructure with edge nodes in IoT applications. The architecture provides end-to-end QoS support by incorporating several components for managing physical and virtual resources. The proposed architecture features: i) a multilevel middleware for resolving the convergence between Operational Technology (OT) and Information Technology (IT), ii) an end-to-end QoS management approach compliant with the Time-Sensitive Networking (TSN) standard, iii) new approaches for virtualized network environments, such as running TSN-based applications under Ultra-low Latency (ULL) constraints in virtual and 5G environments, and iv) an accelerated and deterministic container overlay network architecture. Additionally, the QoS-aware architecture includes two novel middlewares: i) a middleware that transparently integrates multiple acceleration technologies in heterogeneous Edge contexts and ii) a QoS-aware middleware for Serverless platforms that leverages coordination of various QoS mechanisms and virtualized Function-as-a-Service (FaaS) invocation stack to manage end-to-end QoS metrics. Finally, all architecture components were tested and evaluated by leveraging realistic testbeds, demonstrating the efficacy of the proposed solutions
A High-performance, Energy-efficient Modular DMA Engine Architecture
Data transfers are essential in today's computing systems as latency and
complex memory access patterns are increasingly challenging to manage. Direct
memory access engines (DMAEs) are critically needed to transfer data
independently of the processing elements, hiding latency and achieving high
throughput even for complex access patterns to high-latency memory. With the
prevalence of heterogeneous systems, DMAEs must operate efficiently in
increasingly diverse environments. This work proposes a modular and highly
configurable open-source DMAE architecture called intelligent DMA (iDMA), split
into three parts that can be composed and customized independently. The
front-end implements the control plane binding to the surrounding system. The
mid-end accelerates complex data transfer patterns such as multi-dimensional
transfers, scattering, or gathering. The back-end interfaces with the on-chip
communication fabric (data plane). We assess the efficiency of iDMA in various
instantiations: In high-performance systems, we achieve speedups of up to 15.8x
with only 1 % additional area compared to a base system without a DMAE. We
achieve an area reduction of 10 % while improving ML inference performance by
23 % in ultra-low-energy edge AI systems over an existing DMAE solution. We
provide area, timing, latency, and performance characterization to guide its
instantiation in various systems.Comment: 14 pages, 14 figures, accepted by an IEEE journal for publicatio
Towards trustworthy computing on untrustworthy hardware
Historically, hardware was thought to be inherently secure and trusted due to its
obscurity and the isolated nature of its design and manufacturing. In the last two
decades, however, hardware trust and security have emerged as pressing issues.
Modern day hardware is surrounded by threats manifested mainly in undesired
modifications by untrusted parties in its supply chain, unauthorized and pirated
selling, injected faults, and system and microarchitectural level attacks. These threats,
if realized, are expected to push hardware to abnormal and unexpected behaviour
causing real-life damage and significantly undermining our trust in the electronic and
computing systems we use in our daily lives and in safety critical applications. A
large number of detective and preventive countermeasures have been proposed in
literature. It is a fact, however, that our knowledge of potential consequences to
real-life threats to hardware trust is lacking given the limited number of real-life
reports and the plethora of ways in which hardware trust could be undermined. With
this in mind, run-time monitoring of hardware combined with active mitigation of
attacks, referred to as trustworthy computing on untrustworthy hardware, is proposed
as the last line of defence. This last line of defence allows us to face the issue of live
hardware mistrust rather than turning a blind eye to it or being helpless once it occurs.
This thesis proposes three different frameworks towards trustworthy computing
on untrustworthy hardware. The presented frameworks are adaptable to different
applications, independent of the design of the monitored elements, based on
autonomous security elements, and are computationally lightweight. The first
framework is concerned with explicit violations and breaches of trust at run-time,
with an untrustworthy on-chip communication interconnect presented as a potential
offender. The framework is based on the guiding principles of component guarding,
data tagging, and event verification. The second framework targets hardware elements
with inherently variable and unpredictable operational latency and proposes a
machine-learning based characterization of these latencies to infer undesired latency
extensions or denial of service attacks. The framework is implemented on a DDR3
DRAM after showing its vulnerability to obscured latency extension attacks. The
third framework studies the possibility of the deployment of untrustworthy hardware
elements in the analog front end, and the consequent integrity issues that might arise
at the analog-digital boundary of system on chips. The framework uses machine
learning methods and the unique temporal and arithmetic features of signals at this
boundary to monitor their integrity and assess their trust level
- …