37 research outputs found

    A Tight Holistic Memory Latency Bound Through Coordinated Management of Memory Resources


    A Quantitative Analysis and Guideline of Data Streaming Accelerator in Intel 4th Gen Xeon Scalable Processors

    As semiconductor power density is no longer constant with the technology process scaling down, modern CPUs are integrating capable data accelerators on chip, aiming to improve performance and efficiency for a wide range of applications and usages. One such accelerator is the Intel Data Streaming Accelerator (DSA) introduced in Intel 4th Generation Xeon Scalable CPUs (Sapphire Rapids). DSA targets data movement operations in memory that are common sources of overhead in datacenter workloads and infrastructure. In addition, it becomes much more versatile by supporting a wider range of operations on streaming data, such as CRC32 calculations, delta record creation/merging, and data integrity field (DIF) operations. This paper sets out to introduce the latest features supported by DSA, deep-dive into its versatility, and analyze its throughput benefits through a comprehensive evaluation. Along with the analysis of its characteristics, and the rich software ecosystem of DSA, we summarize several insights and guidelines for the programmer to make the most out of DSA, and use an in-depth case study of DPDK Vhost to demonstrate how these guidelines benefit a real application

    A holistic scalability strategy for time series databases following cascading polyglot persistence

    Time series databases aim to handle large amounts of data quickly, both when ingesting new data and when retrieving it later. However, depending on the scenario in which these databases operate, reducing the resources they require becomes a further requirement. NagareDB and its Cascading Polyglot Persistence approach were created with this goal in mind. They were intended not just to provide a fast time series solution, but also to strike a good cost-efficiency balance. However, although they provided outstanding results, they lacked a natural way of scaling out across a cluster. Consequently, monolithic deployments could extract the maximum value from the solution, but distributed ones had to rely on general scalability approaches. In this research, we propose a holistic approach specially tailored for databases following Cascading Polyglot Persistence, to further maximize their inherent resource-saving goals. The proposed approach reduced the cluster size by 33% in a setup with just three ingestion nodes, and by up to 50% in a setup with 10 ingestion nodes. Moreover, the evaluation shows that our scaling method provides efficient cluster growth, offering scalability speedups greater than 85% of a theoretically perfect (100%) scaling, while also ensuring data safety via data replication. This research was partly supported by the Grant Agreement No. 857191, by the Spanish Ministry of Science and Innovation (contract PID2019-107255GB) and by the Generalitat de Catalunya (contract 2017-SGR-1414). Peer reviewed. Postprint (published version).

    Complexity effective memory access scheduling for many-core accelerator architectures

    Modern DRAM systems rely on memory controllers that employ out-of-order scheduling to maximize row access locality and bank-level parallelism, which in turn maximizes DRAM bandwidth. This is especially important in graphics processing unit (GPU) architectures, where the large quantity of parallelism places a heavy demand on the memory system. The logic needed for out-of-order scheduling can be expensive in terms of area, especially when compared to an in-order scheduling approach. In this paper, we propose a complexity-effective solution to DRAM request scheduling which recovers most of the performance loss incurred by a naive in-order first-in first-out (FIFO) DRAM scheduler compared to an aggressive out-of-order DRAM scheduler. We observe that the memory request stream from individual GPU "shader cores" tends to have sufficient row access locality to maximize DRAM efficiency in most applications without significant reordering. However, the interconnection network across which memory requests are sent from the shader cores to the DRAM controller tends to finely interleave the numerous memory request streams in a way that destroys the row access locality of the resultant stream seen at the DRAM controller. To address this, we employ an interconnection network arbitration scheme that preserves the row access locality of individual memory request streams and, in doing so, achieves DRAM efficiency and system performance close to that achievable by using out-of-order memory request scheduling while doing so with a simpler design. We evaluate our interconnection network arbitration scheme using crossbar, mesh, and ring networks for a baseline architecture of 8 memory channels, each controlled by its own DRAM controller and 28 shader cores (224 ALUs), supporting up to 1,792 in-flight memory requests. Our results show that our interconnect arbitration scheme coupled with a banked FIFO in-order scheduler obtains up to 91% of the performance obtainable with an out-of-order memory scheduler for a crossbar network with eight-entry DRAM controller queues
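    The locality effect this abstract describes can be illustrated with a toy model. The sketch below is not the paper's simulator or arbitration scheme; it assumes a single DRAM bank with one open row, an in-order (FIFO) controller, and two hypothetical request streams (`core_a`, `core_b`), and it only compares the row-hit rate when the streams are finely interleaved against the rate when each stream is kept together.

```python
from itertools import chain, zip_longest


def row_hit_rate(requests):
    """Fraction of in-order requests that hit the row already open in their bank.

    Each request is a (bank, row) pair. The model keeps one open row per bank,
    which is an assumption made for illustration only.
    """
    open_row = {}
    hits = 0
    for bank, row in requests:
        if open_row.get(bank) == row:
            hits += 1
        open_row[bank] = row  # serving the request leaves its row open
    return hits / len(requests)


def interleave(*streams):
    """Round-robin merge: one request from each stream per turn."""
    merged = chain.from_iterable(zip_longest(*streams))
    return [req for req in merged if req is not None]


# Two hypothetical shader cores, each issuing 8 requests to its own row of bank 0.
core_a = [(0, 10)] * 8
core_b = [(0, 20)] * 8

fine_rate = row_hit_rate(interleave(core_a, core_b))  # streams finely interleaved
preserved_rate = row_hit_rate(core_a + core_b)        # each stream kept together

print(f"interleaved: {fine_rate:.3f}, preserved: {preserved_rate:.3f}")
```

    In this toy case the interleaved order never finds its row open, while the preserved order misses only on the first request of each stream, which is the intuition behind keeping streams intact in the network so that a simple FIFO controller can still see row locality.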

    Adaptive Dual-Mode Arbitration for High-Performance Real-Time Embedded Systems

    Multi-core platforms can deliver substantial computational power together with minimum costs, compact size, weight, and power usage. However, multi-core architectures are shaking the very foundation of modern real-time systems, i.e. deriving the Worst-Case Execution Time (WCET) of the tasks. Modern embedded systems such as those deployed in the automotive and avionic fields face two difficult-to-resolve conflicting requirements due to the interference problem on the shared hardware components amongst cores: delivering high average-case performance and providing tight WCET. This challenge exists in different shared hardware resources including on-chip shared cache, hardware prefetchers, buses, and memory controller. The problem is mainly because various cores in the system interfere with each other while competing to access the aforementioned hardware components. While dedicated real-time controllers provide timing guarantees, they do so at the cost of significantly degrading system performance. This dissertation overcomes this trade-off by introducing Duetto, a general hardware resource management paradigm that pairs a real-time arbiter with a high-performance arbiter and a latency estimator module. Based on the observation that the resource is rarely overloaded, Duetto executes the high-performance arbiter most of the time, switching to the real-time arbiter only in the rare cases when the latency estimator deems that timing guarantees risk being violated. In this thesis, the Duetto paradigm is realized for different shared hardware resources. In the first part, I demonstrate Duetto on the case study of a multi-bank on-chip memory and discuss the foundation of the methodology. The methodology is concerned about designing the real-time arbiter in such a way that it is compatible with Duetto, deriving latency analysis, and designing the latency estimator module. 
In the second part, this thesis addresses the trade-off between maintaining cache coherence in multi-core real-time systems and improving average-case performance by proposing a novel coherency arbiter infrastructure and employing it in the context of Duetto. This is achieved by precisely engineering the multi-core hardware architecture and its underlying interconnect infrastructure such that data sharing is feasible for real-time systems in a manner amenable for timing analysis. The proposed solution provides near-to Commercial-Off-The-Shelf (COTS) performance and does not impose any coherency protocol modifications. The third part of this dissertation proposes DuoMC by applying Duetto to off-chip Memory Controller (MC) which is crucial since Dynamic Random-Access Memory (DRAM) main memory is one of the most complex shared resources in multi-core architectures and it is one of the critical bottlenecks both from latency as well as performance perspectives. As part of the MC evaluation, we release MCsim, an open-source, cycle-accurate simulator for memory controllers
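    The mode-switching idea in this abstract (run a high-performance arbiter by default and fall back to a real-time arbiter when a latency estimator sees a guarantee at risk) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Duetto design: the FIFO real-time arbiter, the row-hit-first high-performance arbiter, the fixed `service_time`, the per-request `deadline` field, and the estimator formula are all hypothetical stand-ins.

```python
def rt_pick(queue):
    """Stand-in real-time arbiter: strictly oldest request first."""
    return min(queue, key=lambda r: r["arrival"])


def hp_pick(queue, open_row):
    """Stand-in high-performance arbiter: prefer row hits, then oldest."""
    return min(queue, key=lambda r: (r["row"] != open_row, r["arrival"]))


def rt_finish_estimate(queue, req, now, service_time):
    """Completion time of req if the real-time arbiter took over right now."""
    ahead = sum(1 for r in queue if r["arrival"] <= req["arrival"])  # includes req
    return now + ahead * service_time


def at_risk(queue, now, service_time):
    """True if spending one more slot in high-performance mode could make
    some request miss its deadline even under the real-time arbiter."""
    return any(
        rt_finish_estimate(queue, r, now, service_time) + service_time > r["deadline"]
        for r in queue
    )


def dual_mode_pick(queue, open_row, now, service_time):
    """Return (mode, request): 'rt' when a deadline is at risk, else 'hp'."""
    if at_risk(queue, now, service_time):
        return "rt", rt_pick(queue)
    return "hp", hp_pick(queue, open_row)


# Hypothetical workload: two pending requests, row 2 currently open.
relaxed = [
    {"arrival": 0, "row": 1, "deadline": 100},
    {"arrival": 1, "row": 2, "deadline": 100},
]
tight = [
    {"arrival": 0, "row": 1, "deadline": 15},
    {"arrival": 1, "row": 2, "deadline": 100},
]

relaxed_choice = dual_mode_pick(relaxed, open_row=2, now=2, service_time=10)
tight_choice = dual_mode_pick(tight, open_row=2, now=2, service_time=10)
print(relaxed_choice[0], tight_choice[0])
```

    With relaxed deadlines the sketch stays in high-performance mode and serves the row hit first; with the tight deadline it switches to the real-time arbiter and serves the oldest request, which still completes in time because the switch happens one slot before the deadline would be lost.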

    Invalidation-based protocols for replicated datastores

    Distributed in-memory datastores underpin cloud applications that run within a datacenter and demand high performance, strong consistency, and availability. A key feature of datastores is data replication. The data are replicated across servers because a single server often cannot handle the request load. Replication is also necessary to guarantee that a server or link failure does not render a portion of the dataset inaccessible. A replication protocol is responsible for ensuring strong consistency between the replicas of a datastore, even when faults occur, by determining the actions necessary to access and manipulate the data. Consequently, a replication protocol also drives the datastore's performance. Existing strongly consistent replication protocols deliver fault tolerance but fall short in terms of performance. Meanwhile, the opposite occurs in the world of multiprocessors, where data are replicated across the private caches of different cores. The multiprocessor regime uses invalidations to afford strongly consistent replication with high performance but neglects fault tolerance. Although handling failures in the datacenter is critical for data availability, we observe that the common operation is fault-free and far exceeds the operation during faults. In other words, the common operating environment inside a datacenter closely resembles that of a multiprocessor. Based on this insight, we draw inspiration from the multiprocessor for high-performance, strongly consistent replication in the datacenter. The primary contribution of this thesis is in adapting invalidating protocols to the nuances of replicated datastores, which include skewed data accesses, fault tolerance, and distributed transactions

    Adaptive memory hierarchies for next generation tiled microarchitectures

    Processor performance and memory performance have improved at different rates during the last decades, limiting processor performance and creating the well-known "memory gap". Closing this performance gap is an active research field, and new solutions are needed in order to build better processors in the future. Several solutions exist, such as caches, which reduce the impact of growing memory latencies and make up the system memory hierarchy. However, most existing memory hierarchy organizations were designed for single processors or traditional multiprocessors. Nowadays, the increasing number of available transistors per chip has allowed the emergence of chip multiprocessors (CMPs), which have different constraints and require purpose-built memory systems able to manage memory resources efficiently. Therefore, in this thesis we have focused on improving the performance and energy efficiency of the memory hierarchy of chip multiprocessors, ranging from caches to DRAM memories. In the first part of this thesis we have studied traditional cache organizations such as shared or private caches, and we have seen that they behave well only for some applications and that an adaptive system would be more efficient. State-of-the-art techniques such as Cooperative Caching (CC) combine the benefits of both organizations. This technique, however, requires a centralized coherence structure and has a high energy consumption. Therefore we propose Distributed Cooperative Caching (DCC), a mechanism that provides coherence in chip multiprocessors and applies the concept of cooperative caching in a distributed way. Through the use of distributed directories we obtain a more scalable solution that, in addition, has a more flexible and energy-efficient tag allocation method.
We also show that applications make different uses of the cache and that an efficient allocation can take advantage of underused resources. We propose Elastic Cooperative Caching (ElasticCC), an adaptive cache organization able to redistribute cache resources dynamically depending on application requirements. One of the most important contributions of this technique is that adaptivity is fully managed by hardware and that all repartitioning mechanisms are based on distributed structures, allowing better scalability. ElasticCC is able not only to repartition cache sizes according to application requirements, but also to adapt dynamically to the different execution phases of each thread. Our experimental evaluation also shows that the cache partitioning provided by ElasticCC is efficient and almost matches the off-chip miss rate of a configuration with double the cache space. Finally, we focus on the behavior of DRAM memories and memory controllers in chip multiprocessors. Although traditional memory schedulers work well for uniprocessors, we show that the different access patterns of CMPs call for a redesign of some parts of DRAM memories. Several organizations exist for multiprocessor DRAM schedulers; however, all of them must trade off memory throughput against fairness. We propose Thread Row Buffers (TRBs), an extended storage area in DRAM memories able to store a data row for each thread. This mechanism enables fair memory access scheduling without hurting memory throughput. Overall, in this thesis we present new organizations for the memory hierarchy of chip multiprocessors which focus on the scalability of the proposed structures and their adaptivity to application behavior. Results show that the presented techniques provide better performance and energy efficiency than existing state-of-the-art solutions

    Citadel: Enclaves with Strong Microarchitectural Isolation and Secure Shared Memory on a Speculative Out-of-Order Processor

    We present Citadel, to our knowledge, the first enclave platform with strong microarchitectural isolation to run realistic secure programs on a speculative out-of-order multicore processor. First, we develop a new hardware mechanism to enable secure shared memory while defending against transient execution attacks by blocking speculative accesses to shared memory. Then, we develop an efficient dynamic cache partitioning scheme, improving both enclaves' and unprotected processes' performance. We conduct an in-depth security analysis and a performance evaluation of our new mechanisms. Finally, we build the hardware and software infrastructure required to run our secure enclaves. Our multicore processor runs on an FPGA and boots untrusted Linux from which users can securely launch and interact with enclaves. We open-source our end-to-end hardware and software infrastructure, hoping to spark more research and bridge the gap between conceptual proposals and FPGA prototypes

    Beyond Umpire and Arbiter: Courts as Facilitators of Intergovernmental Dialogue in Division of Powers Cases in Canada

    The courts in Canada have often been cast, by both courts and legal scholars, as 'umpires' or 'arbiters' of the federal-provincial division of powers - umpires or arbiters that have the exclusive, or at least decisive, authority to clarify and enforce, and resolve disputes about, 'who does what' in the federal system. However, the image conveyed by these metaphors underestimates the role that the federal and provincial political branches play in the federal system, by working out their own solutions, in the intergovernmental arena, both directly and indirectly, where questions and disputes arise about how jurisdiction is and should be allocated. The image conveyed by the umpire or arbiter metaphors also sits uncomfortably with the facilitative role that the Supreme Court of Canada has carved out for itself in its recent division of powers decisions, a role that casts the courts as facilitators of these instances of intergovernmental dialogue. This doctoral dissertation challenges, and moves beyond, the umpire and arbiter metaphors. It examines the political safeguards available to the provinces in Canada to prevent, or limit, perceived federal encroachments on provincial jurisdiction, in the process highlighting the role that the political branches play in Canada in working out their own allocations of jurisdiction, outside of the courts. It describes, and critically evaluates, the facilitative role carved out by the Court in its recent division of powers decisions, identifying various reasons to be skeptical of a facilitative role that casts the courts as facilitators of intergovernmental dialogue. Finally, and with an eye to future research, it briefly outlines an alternative facilitative role that focuses on facilitating deliberation about the division of powers implications of particular initiatives, arguing that it would be premature to dismiss facilitative approaches to judicial review altogether