1,984 research outputs found

    TANDEM: taming failures in next-generation datacenters with emerging memory

    Get PDF
    The explosive growth of online services, leading to unforeseen scales, has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery, minimizing service interruptions. Applications, owing to recovery, entail additional measures to maintain a recoverable state of data and computation logic during their failure-free execution. However, these precautionary measures have severe implications on performance, correctness, and programmability, making recovery incredibly challenging to realize in practice. Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges; Their distinct architectural attributes, differing significantly from traditional memory devices, introduce new semantic challenges for implementing recovery, complicating correctness and programmability. Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges. When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness including recovery; and (4) how to preserve programmer productivity (Programmability). This thesis aims to address these challenges through the following approaches: (a) defining precise consistency models that formally specify correct end-to-end semantics in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery. We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor’s load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data hardly persist in any specific order, jeop- ardizing recovery and correctness. Therefore, recovery needs primitives that explicitly control the order of updates to NVM (known as persistency models). We outline the precise specification of a novel persistency model – Release Persistency (RP) – that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitecture mechanism, lazy release persistence (LRP). Using standard LFDs benchmarks, we show that LRP achieves fast recovery while incurring minimal overhead on performance. We continue our discussion with memory disaggregation which decouples memory from traditional monolithic servers, offering a promising pathway for achieving very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where compute and memory are independent. However, there is a challenge: disaggregated memory (DM) fails to work with RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, that is specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes). Pandora is the first one-sided transactional protocol that ensures correct, non-blocking, and fast recovery in DKVS. Our experimental implementation artifacts demonstrate that Pandora achieves fast recovery and high availability while causing minimal disruption to services. Finally, we introduce a novel target litmus-testing framework – DART – to validate the end-to-end correctness of transactional protocols with recovery. Using DART’s target testing capabilities, we have found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, thereby eliminating any intervention from the programmers

    Guided rewriting and constraint satisfaction for parallel GPU code generation

    Get PDF
    Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend on the user input to choose a subset of hard-coded optimizations or automated exploration of implementation search space. The former suffers from the lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise. This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only. Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewriting. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings. The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation. A comparison with the vendor-provided handwritten kernel ARM Compute Library and the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR yields well to user-guided and automatic rewriting for high-performance code generation

    Secure storage systems for untrusted cloud environments

    Get PDF
    The cloud has become established for applications that need to be scalable and highly available. However, moving data to data centers owned and operated by a third party, i.e., the cloud provider, raises security concerns because a cloud provider could easily access and manipulate the data or program flow, preventing the cloud from being used for certain applications, like medical or financial. Hardware vendors are addressing these concerns by developing Trusted Execution Environments (TEEs) that make the CPU state and parts of memory inaccessible from the host software. While TEEs protect the current execution state, they do not provide security guarantees for data which does not fit nor reside in the protected memory area, like network and persistent storage. In this work, we aim to address TEEs’ limitations in three different ways, first we provide the trust of TEEs to persistent storage, second we extend the trust to multiple nodes in a network, and third we propose a compiler-based solution for accessing heterogeneous memory regions. More specifically, ‱ SPEICHER extends the trust provided by TEEs to persistent storage. SPEICHER implements a key-value interface. Its design is based on LSM data structures, but extends them to provide confidentiality, integrity, and freshness for the stored data. Thus, SPEICHER can prove to the client that the data has not been tampered with by an attacker. ‱ AVOCADO is a distributed in-memory key-value store (KVS) that extends the trust that TEEs provide across the network to multiple nodes, allowing KVSs to scale beyond the boundaries of a single node. On each node, AVOCADO carefully divides data between trusted memory and untrusted host memory, to maximize the amount of data that can be stored on each node. AVOCADO leverages the fact that we can model network attacks as crash-faults to trust other nodes with a hardened ABD replication protocol. ‱ TOAST is based on the observation that modern high-performance systems often use several different heterogeneous memory regions that are not easily distinguishable by the programmer. The number of regions is increased by the fact that TEEs divide memory into trusted and untrusted regions. TOAST is a compiler-based approach to unify access to different heterogeneous memory regions and provides programmability and portability. TOAST uses a load/store interface to abstract most library interfaces for different memory regions

    Efficient concurrent data structure access parallelism techniques for increasing scalability

    Get PDF
    Multi-core processors have revolutionised the way data structures are designed by bringing parallelism to mainstream computing. Key to exploiting hardware parallelism available in multi-core processors are concurrent data structures. However, some concurrent data structure abstractions are inherently sequential and incapable of harnessing the parallelism performance of multi-core processors. Designing and implementing concurrent data structures to harness hardware parallelism is challenging due to the requirement of correctness, efficiency and practicability under various application constraints. In this thesis, our research contribution is towards improving concurrent data structure access parallelism to increase data structure performance. We propose new design frameworks that improve access parallelism of already existing concurrent data structure designs. Also, we propose new concurrent data structure designs with significant performance improvements. To give an insight into the interplay between hardware and concurrent data structure access parallelism, we give a detailed analysis and model the performance scalability with varying parallelism.In the first part of the thesis, we focus on data structure semantic relaxation. By relaxing the semantics of a data structure, a bigger design space, that allows weaker synchronization and more useful parallelism, is unveiled. Investigating new data structure designs, capable of trading semantics for achieving better performance in a monotonic way, is a major challenge in the area. We algorithmically address this challenge in this part of the thesis. We present an efficient, lock-free, concurrent data structure design framework for out-of-order semantic relaxation. We introduce a new two-dimensional algorithmic design, that uses multiple instances of a given data structure to improve access parallelism. In the second part of the thesis, we propose an efficient priority queue that improves access parallelism by reducing the number of synchronization points for each operation. Priority queues are fundamental abstract data types, often used to manage limited resources in parallel systems. Typical proposed parallel priority queue implementations are based on heaps or skip lists. In recent literature, skip lists have been shown to be the most efficient design choice for implementing priority queues. Though numerous intricate implementations of skip list based queues have been proposed in the literature, their performance is constrained by the high number of global atomic updates per operation and the high memory consumption, which are proportional to the number of sub-lists in the queue. In this part of the thesis, we propose an alternative approach for designing lock-free linearizable priority queues, that significantly improve memory efficiency and throughput performance, by reducing the number of global atomic updates and memory consumption as compared to skip-list based queues. To achieve this, our new design combines two structures; a search tree and a linked list, forming what we call a Tree Search List Queue (TSLQueue). Subsequently, we analyse and introduce a model for lock-free concurrent data structure access parallelism. The major impediment to scaling concurrent data structures is memory contention when accessing shared data structure access points, leading to thread serialisation, and hindering parallelism. Aiming to address this challenge, a significant amount of work in the literature has proposed multi-access techniques that improve concurrent data structure parallelism. However, there is little work on analysing and modelling the execution behaviour of concurrent multi-access data structures especially in a shared memory setting. In this part of the thesis, we analyse and model the general execution behaviour of concurrent multi-access data structures in the shared memory setting. We study and analyse the behaviour of the two popular random access patterns: shared (Remote) and exclusive (Local) access, and the behaviour of the two most commonly used atomic primitives for designing lock-free data structures: Compare and Swap, and, Fetch and Add

    Efficient Model Checking: The Power of Randomness

    Get PDF

    Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques

    Full text link
    The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus, typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, the landscape of the semiconductor field in the last 15 years has constituted power as a first-class design concern. As a result, the community of computing systems is forced to find alternative design approaches to facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted an ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at software, hardware, and architectural levels. Over the last decade, there is a plethora of approximation techniques in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing, and it reviews its motivation, terminology and principles, as well it classifies and presents the technical details of the state-of-the-art software and hardware approximation techniques.Comment: Under Review at ACM Computing Survey

    Adaptive Microarchitectural Optimizations to Improve Performance and Security of Multi-Core Architectures

    Get PDF
    With the current technological barriers, microarchitectural optimizations are increasingly important to ensure performance scalability of computing systems. The shift to multi-core architectures increases the demands on the memory system, and amplifies the role of microarchitectural optimizations in performance improvement. In a multi-core system, microarchitectural resources are usually shared, such as the cache, to maximize utilization but sharing can also lead to contention and lower performance. This can be mitigated through partitioning of shared caches.However, microarchitectural optimizations which were assumed to be fundamentally secure for a long time, can be used in side-channel attacks to exploit secrets, as cryptographic keys. Timing-based side-channels exploit predictable timing variations due to the interaction with microarchitectural optimizations during program execution. Going forward, there is a strong need to be able to leverage microarchitectural optimizations for performance without compromising security. This thesis contributes with three adaptive microarchitectural resource management optimizations to improve security and/or\ua0performance\ua0of multi-core architectures\ua0and a systematization-of-knowledge of timing-based side-channel attacks.\ua0We observe that to achieve high-performance cache partitioning in a multi-core system\ua0three requirements need to be met: i) fine-granularity of partitions, ii) locality-aware placement and iii) frequent changes. These requirements lead to\ua0high overheads for current centralized partitioning solutions, especially as the number of cores in the\ua0system increases. To address this problem, we present an adaptive and scalable cache partitioning solution (DELTA) using a distributed and asynchronous allocation algorithm. The\ua0allocations occur through core-to-core challenges, where applications with larger performance benefit will gain cache capacity. The\ua0solution is implementable in hardware, due to low computational complexity, and can scale to large core counts.According to our analysis, better performance can be achieved by coordination of multiple optimizations for different resources, e.g., off-chip bandwidth and cache, but is challenging due to the increased number of possible allocations which need to be evaluated.\ua0Based on these observations, we present a solution (CBP) for coordinated management of the optimizations: cache partitioning, bandwidth partitioning and prefetching.\ua0Efficient allocations, considering the inter-resource interactions and trade-offs, are achieved using local resource managers to limit the solution space.The continuously growing number of\ua0side-channel attacks leveraging\ua0microarchitectural optimizations prompts us to review attacks and defenses to understand the vulnerabilities of different microarchitectural optimizations. We identify the four root causes of timing-based side-channel attacks: determinism, sharing, access violation\ua0and information flow.\ua0Our key insight is that eliminating any of the exploited root causes, in any of the attack steps, is enough to provide protection.\ua0Based on our framework, we present a systematization of the attacks and defenses on a wide range of microarchitectural optimizations, which highlights their key similarities.\ua0Shared caches are an attractive attack surface for side-channel attacks, while defenses need to be efficient since the cache is crucial for performance.\ua0To address this issue, we present an adaptive and scalable cache partitioning solution (SCALE) for protection against cache side-channel attacks. The solution leverages randomness,\ua0and provides quantifiable and information theoretic security guarantees using differential privacy. The solution closes the performance gap to a state-of-the-art non-secure allocation policy for a mix of secure and non-secure applications

    QoS-aware architectures, technologies, and middleware for the cloud continuum

    Get PDF
    The recent trend of moving Cloud Computing capabilities to the Edge of the network is reshaping how applications and their middleware supports are designed, deployed, and operated. This new model envisions a continuum of virtual resources between the traditional cloud and the network edge, which is potentially more suitable to meet the heterogeneous Quality of Service (QoS) requirements of diverse application domains and next-generation applications. Several classes of advanced Internet of Things (IoT) applications, e.g., in the industrial manufacturing domain, are expected to serve a wide range of applications with heterogeneous QoS requirements and call for QoS management systems to guarantee/control performance indicators, even in the presence of real-world factors such as limited bandwidth and concurrent virtual resource utilization. The present dissertation proposes a comprehensive QoS-aware architecture that addresses the challenges of integrating cloud infrastructure with edge nodes in IoT applications. The architecture provides end-to-end QoS support by incorporating several components for managing physical and virtual resources. The proposed architecture features: i) a multilevel middleware for resolving the convergence between Operational Technology (OT) and Information Technology (IT), ii) an end-to-end QoS management approach compliant with the Time-Sensitive Networking (TSN) standard, iii) new approaches for virtualized network environments, such as running TSN-based applications under Ultra-low Latency (ULL) constraints in virtual and 5G environments, and iv) an accelerated and deterministic container overlay network architecture. Additionally, the QoS-aware architecture includes two novel middlewares: i) a middleware that transparently integrates multiple acceleration technologies in heterogeneous Edge contexts and ii) a QoS-aware middleware for Serverless platforms that leverages coordination of various QoS mechanisms and virtualized Function-as-a-Service (FaaS) invocation stack to manage end-to-end QoS metrics. Finally, all architecture components were tested and evaluated by leveraging realistic testbeds, demonstrating the efficacy of the proposed solutions

    Taylor University Catalog 2023-2024

    Get PDF
    The 2023-2024 academic catalog of Taylor University in Upland, Indiana.https://pillars.taylor.edu/catalogs/1128/thumbnail.jp
    • 

    corecore