Network-Compute Co-Design for Distributed In-Memory Computing
The booming popularity of online services is rapidly raising the demands placed on modern datacenters. To cope with the data deluge, growing user bases, and tight quality-of-service constraints, service providers deploy massive datacenters with tens to hundreds of thousands of servers, keeping petabytes of latency-critical data memory resident. This data distribution and the multi-tiered nature of the software behind feature-rich services result in frequent inter-server communication and remote memory access over the network. Hence, networking takes center stage in datacenters.
In response to growing internal datacenter network traffic, networking technology is rapidly evolving. Lean user-level protocols, like RDMA, and high-performance fabrics have started making their appearance, dramatically reducing datacenter-wide network latency and offering unprecedented per-server bandwidth. At the same time, the end of Dennard scaling is grinding processor performance improvements to a halt. The net result is a growing mismatch between the per-server network and compute capabilities: it will soon be difficult for a server processor to utilize all of its available network bandwidth.
Restoring balance between network and compute capabilities requires tighter co-design of the two. The network interface (NI) is of particular interest, as it lies on the boundary of network and compute. In this thesis, we focus on the design of an NI for a lightweight RDMA-like protocol and its full integration with modern manycore server processors. The NI capabilities scale with both the increasing network bandwidth and the growing number of cores on modern server processors.
Leveraging our architecture's integrated NI logic, we introduce new functionality at the network endpoints that yields performance improvements for distributed systems. Such additions include new network operations with stronger semantics tailored to common application requirements and integrated logic for balancing network load across a modern processor's multiple cores. We make the case that exposing richer, end-to-end semantics to the NI is a unique enabler for optimizations that can reduce software complexity and remove significant load from the processor, contributing towards maintaining balance between the two valuable resources of network and compute. Overall, network-compute co-design is an approach that addresses challenges associated with the emerging technological mismatch of compute and networking capabilities, yielding significant performance improvements for distributed memory systems.
Design Guidelines for High-Performance SCM Hierarchies
With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM. Nevertheless, the latency-sensitive nature of memory-resident services makes seamless integration of SCM in servers questionable. In this paper, we ask the question of how best to introduce SCM for such servers to improve overall performance/cost over existing DRAM-only architectures. We first show that even with the most optimistic latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the performance of an SCM-mostly memory system competitive. The high degree of spatial locality that memory-resident services exhibit not only simplifies the DRAM cache's design as page-based, but also enables the amortization of increased SCM access latencies and the mitigation of SCM's read/write latency disparity.
We identify the set of memory hierarchy design parameters that plays a key role in the performance and cost of a memory system combining an SCM technology and a 3D stacked DRAM cache. We then introduce a methodology to drive provisioning for each of these design parameters under a target performance/cost goal. Finally, we use our methodology to derive concrete results for specific SCM technologies. With PCM as a case study, we show that a two bits/cell technology hits the performance/cost sweet spot, reducing the memory subsystem cost by 40% while keeping performance within 3% of the best performing DRAM-only system, whereas single-level and triple-level cell organizations are impractical for use as memory replacements. Comment: Published at MEMSYS'1
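The amortization argument above can be illustrated with a back-of-the-envelope latency model. This is a minimal sketch with made-up latency numbers and hit rates, not the paper's methodology or measured values:

```python
# Toy latency model for an SCM-mostly memory system fronted by a page-based
# 3D-stacked DRAM cache. All latencies (in ns) and hit rates below are
# illustrative assumptions, not device or paper parameters.

def avg_access_latency(hit_rate, dram_ns=50, scm_read_ns=150):
    """Average load latency: hits served by the DRAM cache, misses pay the
    cache lookup plus the SCM access."""
    return hit_rate * dram_ns + (1 - hit_rate) * (dram_ns + scm_read_ns)

# High spatial locality makes page-granularity caching effective, so high
# hit rates amortize SCM's longer access latency.
for hr in (0.5, 0.9, 0.99):
    print(f"hit rate {hr:.0%}: {avg_access_latency(hr):.1f} ns")
```

With a high hit rate, the average latency approaches DRAM latency even when the backing SCM is several times slower, which is the intuition behind the "competitive performance" claim.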
Mitigating Load Imbalance in Distributed Data Serving with Rack-Scale Memory Pooling
To provide low-latency and high-throughput guarantees, most large key-value stores keep the data in the memory of many servers. Despite the natural parallelism across lookups, the load imbalance, introduced by heavy skew in the popularity distribution of keys, limits performance. To avoid violating tail latency service-level objectives, systems tend to keep server utilization low and organize the data in micro-shards, which provides units of migration and replication for the purpose of load balancing. These techniques reduce the skew but incur additional monitoring, data replication, and consistency maintenance overheads. In this work, we introduce RackOut, a memory pooling technique that leverages the one-sided remote read primitive of emerging rack-scale systems to mitigate load imbalance while respecting service-level objectives. In RackOut, the data are aggregated at rack-scale granularity, with all of the participating servers in the rack jointly servicing all of the rack's micro-shards. We develop a queuing model to evaluate the impact of RackOut at the datacenter scale. In addition, we implement a RackOut proof-of-concept key-value store, evaluate it on two experimental platforms based on RDMA and Scale-Out NUMA, and use these results to validate the model. We devise two distinct approaches to load balancing within a RackOut unit, one based on random selection of nodes (RackOut_static) and another based on an adaptive load balancing mechanism (RackOut_adaptive). Our results show that RackOut_static increases throughput by up to 6x for RDMA and 8.6x for Scale-Out NUMA compared to a scale-out deployment, while respecting tight tail latency service-level objectives. RackOut_adaptive improves throughput over RackOut_static by 30% for workloads with 20% writes.
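The pooling effect can be sketched with a small simulation. This is not the paper's queuing model; the parameters (Zipf(1) popularity, 64 servers, racks of 8) are illustrative assumptions, and only the RackOut_static policy (random node selection within a rack) is modeled:

```python
import random
from collections import Counter

# Toy simulation of skew-induced load imbalance. Keys follow a Zipf-like
# popularity distribution; we compare the hottest server's load when each
# server serves only its own micro-shards (scale-out) versus when any node
# in the owning rack may serve a request, as in RackOut_static.

def simulate(num_servers=64, rack_size=8, num_keys=10_000,
             requests=100_000, seed=1):
    rng = random.Random(seed)
    weights = [1.0 / (k + 1) for k in range(num_keys)]   # Zipf(1) popularity
    keys = rng.choices(range(num_keys), weights=weights, k=requests)
    scale_out = Counter(k % num_servers for k in keys)   # owner serves alone
    rack_out = Counter(                                  # random rack member
        (k % num_servers) // rack_size * rack_size + rng.randrange(rack_size)
        for k in keys)
    ideal = requests / num_servers
    return (max(scale_out.values()) / ideal,
            max(rack_out.values()) / ideal)

imb_scale_out, imb_rack_out = simulate()
print(f"hottest server vs ideal: scale-out {imb_scale_out:.1f}x, "
      f"rack pooling {imb_rack_out:.1f}x")
```

Spreading the hottest micro-shard's reads across the rack's nodes brings the most loaded server much closer to the ideal uniform load, which is why tail latency objectives can be met at higher utilization.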
An Analysis of Load Imbalance in Scale-out Data Serving
Despite the natural parallelism across lookups, performance of distributed key-value stores is often limited due to load imbalance induced by heavy skew in the popularity distribution of the dataset. To avoid violating service-level objectives expressed in terms of tail latency, systems tend to keep server utilization low and organize the data in micro-shards, which in turn provides units of migration and replication for the purpose of load balancing. These techniques reduce the skew, but incur additional monitoring, data replication, and consistency maintenance overheads. This work shows that the trend towards extreme scale-out will further exacerbate the skew-induced load imbalance, and hence the overhead of migration and replication.
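Why scale-out exacerbates the imbalance admits a back-of-the-envelope argument. Assuming a Zipf(1) popularity distribution over K keys (an assumption for illustration, not the paper's workload), the hottest key draws a fixed fraction of all requests, while the ideal per-server share shrinks as 1/N:

```python
# With Zipf(1) popularity over K keys, the hottest key receives a fraction
# p ~= 1/H_K of all requests (H_K: the K-th harmonic number). The server
# holding that key is loaded at least p / (1/N) = p*N times the ideal,
# so the relative imbalance grows linearly with the number of servers N.

K = 10_000
H_K = sum(1.0 / k for k in range(1, K + 1))   # K-th harmonic number
p_hottest = 1.0 / H_K                         # request share of hottest key
for n in (16, 64, 256, 1024):
    print(f"{n:5d} servers: hottest server >= {p_hottest * n:5.1f}x ideal load")
```

The hottest server's load relative to the ideal grows in proportion to the server count, which is the sense in which extreme scale-out makes skew worse.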
SABRes: Atomic Object Reads for In-Memory Rack-Scale Computing
Modern in-memory services rely on large distributed object stores to achieve the high scalability essential to service thousands of requests concurrently. The independent and unpredictable nature of incoming requests results in random accesses to the object store, triggering frequent remote memory accesses. State-of-the-art distributed memory frameworks leverage the one-sided operations offered by RDMA technology to mitigate the traditionally high cost of remote memory access. Unfortunately, the limited semantics of RDMA one-sided operations bound remote memory access atomicity to a single cache block; therefore, atomic remote object access relies on software mechanisms. Emerging highly integrated rack-scale systems that reduce the latency of one-sided operations to a small multiple of DRAM latency expose the overhead of these software mechanisms as a major latency contributor. This technology-triggered paradigm shift calls for new one-sided operations with stronger semantics. We take a step in that direction by proposing SABRes, a new one-sided operation that provides atomic remote object reads in hardware. We then present LightSABRes, a lightweight hardware accelerator for SABRes that removes all atomicity-associated software overheads. Compared to a state-of-the-art software atomicity mechanism, LightSABRes improves the throughput of a microbenchmark atomically accessing 128B-8KB objects from remote memory by 15-97%, and the throughput of a modern in-memory distributed object store by 30-60%.
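The kind of software atomicity mechanism that SABRes moves into hardware can be sketched as a seqlock-style versioned read. The class and layout below are illustrative (real systems typically embed per-cache-line versions inside the object so that a single one-sided read fetches data and versions together); it is not the paper's mechanism:

```python
# Sketch of a seqlock-style software atomicity scheme for one-sided reads:
# the reader retries whenever it may have observed a concurrent write.

class VersionedObject:
    def __init__(self, payload):
        self.version = 0           # even: stable, odd: write in flight
        self.payload = payload

    def write(self, payload):
        self.version += 1          # mark write in progress (version odd)
        self.payload = payload
        self.version += 1          # publish the new value (version even)

def atomic_read(obj):
    """Retry until a read observes no concurrent write."""
    while True:
        v_before = obj.version
        data = obj.payload         # stands in for the one-sided remote read
        if v_before % 2 == 0 and obj.version == v_before:
            return data            # version stable and even: consistent copy
```

The version checks and retries are pure overhead on the critical path of every remote read; once one-sided latency approaches DRAM latency, that overhead dominates, which motivates providing the atomic read in hardware instead.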
The Mondrian Data Engine
The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling. As the performance density of traditional CPU-centric architectures stagnates, advancing compute capabilities necessitates novel architectural approaches. Near-memory processing (NMP) architectures are reemerging as promising candidates to improve computing efficiency through tight coupling of logic and memory. NMP architectures are especially fitting for data analytics, as they provide immense bandwidth to memory-resident data and dramatically reduce data movement, the main source of energy consumption. Modern data analytics operators are optimized for CPU execution and hence rely on large caches and employ random memory accesses. In the context of NMP, such random accesses result in wasteful DRAM row buffer activations that account for a significant fraction of the total memory access energy. In addition, utilizing NMP's ample bandwidth with fine-grained random accesses requires complex hardware that cannot be accommodated under NMP's tight area and power constraints. Our thesis is that efficient NMP calls for an algorithm-hardware co-design that favors algorithms with sequential accesses to enable simple hardware that accesses memory in streams. We introduce an instance of such a co-designed NMP architecture for data analytics, the Mondrian Data Engine. Compared to a CPU-centric and a baseline NMP system, the Mondrian Data Engine improves the performance of basic data analytics operators by up to 49x and 5x, and efficiency by up to 28x and 5x, respectively.
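The row-buffer argument behind the co-design thesis can be illustrated with a toy energy model. The row size and per-operation energy constants below are illustrative assumptions, not device parameters from the paper:

```python
import random

# Toy DRAM row-buffer energy model: streaming (sequential) accesses mostly
# hit the currently open row, while random accesses pay a row activation
# almost every time.

ROW_BYTES = 8192        # bytes per DRAM row (assumption)
E_ACTIVATE = 20.0       # energy units per row activation (assumption)
E_COLUMN = 1.0          # energy units per column access (assumption)

def access_energy(addresses):
    open_row, energy = None, 0.0
    for addr in addresses:
        row = addr // ROW_BYTES
        if row != open_row:        # row-buffer miss: activate the row
            energy += E_ACTIVATE
            open_row = row
        energy += E_COLUMN         # transfer from the open row
    return energy

rng = random.Random(0)
seq = list(range(0, 64 * 1024, 64))                         # 64B-stride stream
rnd = rng.sample(range(0, 64 * 1024 * 1024, 64), len(seq))  # random 64B lines
print(f"sequential: {access_energy(seq):.0f}, random: {access_energy(rnd):.0f}")
```

The same number of accesses costs an order of magnitude more energy when nearly every access opens a new row, which is why operators restructured around sequential streams suit NMP's constraints.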
RPCValet: NI-Driven Tail-Aware Balancing of μs-Scale RPCs
Modern online services come with stringent quality requirements in terms of response time tail latency. Because of their decomposition into fine-grained communicating software layers, a single user request fans out into a plethora of short, μs-scale RPCs, aggravating the need for faster inter-server communication. In reaction to that need, we are witnessing a technological transition characterized by the emergence of hardware-terminated user-level protocols (e.g., InfiniBand/RDMA) and new architectures with fully integrated Network Interfaces (NIs). Such architectures offer a unique opportunity for a new NI-driven approach to balancing RPCs among the cores of manycore server CPUs, yielding major tail latency improvements for μs-scale RPCs. We introduce RPCValet, an NI-driven RPC load-balancing design for architectures with hardware-terminated protocols and integrated NIs, that delivers near-optimal tail latency. RPCValet's RPC dispatch decisions emulate the theoretically optimal single-queue system, without incurring synchronization overheads currently associated with single-queue implementations. Our design improves throughput under tight tail latency goals by up to 1.4x, and reduces tail latency before saturation by up to 4x for RPCs with μs-scale service times, as compared to current systems with hardware support for RPC load distribution. RPCValet performs within 15% of the theoretically optimal single-queue system.
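The gap between single-queue dispatch and static per-core queues can be reproduced with a small simulation. This is a sketch with illustrative parameters (16 cores at 75% load, exponential service times), not the paper's evaluation setup:

```python
import random

# Toy simulation contrasting two dispatch policies for incoming RPCs:
# a single queue feeding the next idle core (the ideal that RPCValet
# emulates in the NI) versus static random assignment to per-core queues.

def p99_latency(single_queue, cores=16, arrival_rate=12.0,
                service_mean=1.0, n=50_000, seed=0):
    rng = random.Random(seed)
    now = 0.0
    free_at = [0.0] * cores              # time at which each core goes idle
    lat = []
    for _ in range(n):
        now += rng.expovariate(arrival_rate)        # Poisson arrivals
        svc = rng.expovariate(1.0 / service_mean)   # exponential service
        if single_queue:
            c = min(range(cores), key=free_at.__getitem__)  # next idle core
        else:
            c = rng.randrange(cores)     # static per-core queue assignment
        start = max(now, free_at[c])
        free_at[c] = start + svc
        lat.append(free_at[c] - now)     # queueing delay + service time
    lat.sort()
    return lat[int(0.99 * n)]            # 99th-percentile latency

print(f"p99 single-queue: {p99_latency(True):.2f}, "
      f"p99 per-core queues: {p99_latency(False):.2f}")
```

With μs-scale service times, a request parked behind a temporarily busy core dominates the tail under static assignment, while the single-queue policy sends it to an idle core; the challenge RPCValet addresses is realizing this policy without the synchronization cost of a shared software queue.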
The dollar exchange rates in the covid-19 era : evidence from 5 currencies
Purpose: In this paper, through a novel Bayesian specification, we test whether exchange rates are affected by the current crisis caused by the Covid-19 spread. Design/Methodology/Approach: We set out a novel Bayesian vector autoregressive model and compare it in terms of forecasting ability with the existing literature's econometric models. Findings: Based on our findings, the novel Bayesian model proposed in the present paper is better in terms of forecasting ability than the econometric models and, more importantly, it can unveil the impact of the Covid-19 spread on the exchange rates, while the econometric models failed to shed light on this relationship. Practical Implications: Covid-19 has affected the overall economic system in many ways, leading to its disorganization. Such an impact is highlighted by the present paper, examining the exchange rates. Originality/Value: The Bayesian framework proposed in the present paper has novel technical components and can unveil hidden effects of an exogenous variable on a system of endogenous variables that the classical econometric approaches fail to unveil. Peer reviewed.
Atomic object reads for in-memory rack-scale computing
A distributed memory system including a plurality of chips; a plurality of nodes that are distributed across the plurality of chips such that each node is comprised within a chip, each node including a dedicated local memory and a processor core, and each local memory configured to be accessible over network communication; and a network interface for each node, the network interface configured such that a corresponding network interface of each node is integrated in a coherence domain of the chip of the corresponding node, wherein each of the network interfaces is configured to support a one-sided operation, the network interface directly reading or writing in the dedicated local memory of the corresponding node without involving a processor core, and wherein the one-sided operation is configured such that the processor core of a corresponding node uses a protocol to directly inject a remote memory access read or write request to the network interface of the node, the remote memory access request allowing an arbitrarily long region of a remote node's memory to be read or written.