LHView: Location Aware Hybrid Partial View
The rise of the Cloud creates enormous business opportunities for companies to provide
global services, which requires applications supporting the operation of those services
to scale while minimizing maintenance costs, either due to unnecessary allocation of
resources or due to excessive human supervision and administration. Solutions designed
to support such systems have tackled fundamental challenges from individual component
failure to transient network partitions. A fundamental aspect that all scalable large
systems have to deal with is the membership of the system, i.e., tracking the active components
that compose the system. Most systems rely on membership management protocols
that operate at the application level, many times exposing the interface of a logical overlay
network, that should guarantee high scalability, efficiency, and robustness.
Although these protocols are capable of repairing the overlay in the face of large numbers
of individual component faults, when scaling to global settings (i.e., geo-distributed
scenarios), this robustness is a double-edged sword because it is extremely complex for
a node in a system to distinguish between a set of simultaneous node failures and a
(transient) network partition. Thus the occurrence of a network partition creates isolated
subsets of nodes incapable of reconnecting even after the recovery from the partition.
This work addresses these challenges by proposing a novel datacenter-aware membership
protocol to tolerate network partitions by applying existing overlay management techniques
and classification techniques that may allow the system to efficiently cope with
such events without compromising the remaining properties of the overlay network. Furthermore,
we strive to achieve these goals with a solution that requires minimal human
intervention.
Network-Compute Co-Design for Distributed In-Memory Computing
The booming popularity of online services is rapidly raising the demands for modern datacenters. To cope with the data deluge, growing user bases, and tight quality of service constraints, service providers deploy massive datacenters with tens to hundreds of thousands of servers, keeping petabytes of latency-critical data memory resident. Such data distribution and the multi-tiered nature of the software used by feature-rich services result in frequent inter-server communication and remote memory access over the network. Hence, networking takes center stage in datacenters.
In response to growing internal datacenter network traffic, networking technology is rapidly evolving. Lean user-level protocols, like RDMA, and high-performance fabrics have started making their appearance, dramatically reducing datacenter-wide network latency and offering unprecedented per-server bandwidth. At the same time, the end of Dennard scaling is grinding processor performance improvements to a halt. The net result is a growing mismatch between the per-server network and compute capabilities: it will soon be difficult for a server processor to utilize all of its available network bandwidth.
Restoring balance between network and compute capabilities requires tighter co-design of the two. The network interface (NI) is of particular interest, as it lies on the boundary of network and compute. In this thesis, we focus on the design of an NI for a lightweight RDMA-like protocol and its full integration with modern manycore server processors. The NI capabilities scale with both the increasing network bandwidth and the growing number of cores on modern server processors.
Leveraging our architecture's integrated NI logic, we introduce new functionality at the network endpoints that yields performance improvements for distributed systems. Such additions include new network operations with stronger semantics tailored to common application requirements and integrated logic for balancing network load across a modern processor's multiple cores. We make the case that exposing richer, end-to-end semantics to the NI is a unique enabler for optimizations that can reduce software complexity and remove significant load from the processor, contributing towards maintaining balance between the two valuable resources of network and compute. Overall, network-compute co-design is an approach that addresses challenges associated with the emerging technological mismatch of compute and networking capabilities, yielding significant performance improvements for distributed memory systems.
Replicating Persistent Memory Key-Value Stores with Efficient RDMA Abstraction
Combining persistent memory (PM) with RDMA is a promising approach to
performant replicated distributed key-value stores (KVSs). However, existing
replication approaches do not work well when applied to PM KVSs: 1) Using RPC
induces software queueing and execution at backups, increasing request latency;
2) Using one-sided RDMA WRITE causes many streams of small PM writes, leading
to severe device-level write amplification (DLWA) on PM. In this paper, we
propose Rowan, an efficient RDMA abstraction to handle replication writes in PM
KVSs; it aggregates concurrent remote writes from different servers, and lands
these writes to PM in a sequential (thus low DLWA) and one-sided (thus low
latency) manner. We realize Rowan with off-the-shelf RDMA NICs. Further, we
build Rowan-KV, a log-structured PM KVS using Rowan for replication. Evaluation
shows that under write-intensive workloads, compared with PM KVSs using RPC and
RDMA WRITE for replication, Rowan-KV boosts throughput by 1.22X and 1.39X as
well as lowers median PUT latency by 1.77X and 2.11X, respectively, while
largely eliminating DLWA. Comment: Accepted to OSDI 202
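The contrast between many small write streams and a single sequential log can be illustrated with a toy model. This is only a sketch and not Rowan's implementation: the 256-byte device block, the 4-block write-combining buffer, and the names `media_bytes` and `dlwa` are assumptions introduced for illustration.

```python
from collections import OrderedDict

BLOCK = 256       # assumed internal write granularity of the PM device (bytes)
BUF_BLOCKS = 4    # assumed size of the device's write-combining buffer (blocks)

def media_bytes(offsets, size, block=BLOCK, buf_blocks=BUF_BLOCKS):
    """Bytes written to media for a sequence of `size`-byte writes.

    Toy model: an LRU buffer of `buf_blocks` blocks combines writes to the
    same block; every evicted (or finally drained) block costs one full
    block write on the media.
    """
    buf = OrderedDict()
    flushed = 0
    for off in offsets:
        for b in range(off // block, (off + size - 1) // block + 1):
            if b in buf:
                buf.move_to_end(b)          # combined with a buffered block
            else:
                if len(buf) == buf_blocks:
                    buf.popitem(last=False)  # evict least recently used block
                    flushed += 1
                buf[b] = True
    return (flushed + len(buf)) * block      # drain what is left in the buffer

def dlwa(offsets, size):
    """Device-level write amplification: media bytes / requested bytes."""
    return media_bytes(offsets, size) / (len(offsets) * size)

SIZE, STREAMS, PER_STREAM, REGION = 64, 16, 8, 4096

# 16 writers, each appending to its own region, arriving round-robin.
interleaved = [s * REGION + i * SIZE
               for i in range(PER_STREAM) for s in range(STREAMS)]
# The same 128 writes landed back to back in one shared log.
sequential = [k * SIZE for k in range(STREAMS * PER_STREAM)]

print(dlwa(interleaved, SIZE), dlwa(sequential, SIZE))
```

Under these assumptions the interleaved streams thrash the small buffer, so each 64-byte write costs a full block, while the sequential log fills each block before it is flushed.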
MaxMem: Colocation and Performance for Big Data Applications on Tiered Main Memory Servers
We present MaxMem, a tiered main memory management system that aims to
maximize Big Data application colocation and performance. MaxMem uses an
application-agnostic and lightweight memory occupancy control mechanism based
on fast memory miss ratios to provide application QoS under increasing
colocation. By relying on memory access sampling and binning to quickly
identify per-process memory heat gradients, MaxMem maximizes performance for
many applications sharing tiered main memory simultaneously. MaxMem is designed
as a user-space memory manager to be easily modifiable and extensible, without
complex kernel code development. On a system with tiered main memory consisting
of DRAM and Intel Optane persistent memory modules, our evaluation confirms
that MaxMem provides 11% and 38% better throughput and up to 80% and an order
of magnitude lower 99th percentile latency than HeMem and Linux AutoNUMA,
respectively, with a Big Data key-value store in dynamic colocation scenarios. Comment: 12 pages, 10 figures
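As a rough illustration of sampling and binning, the sketch below groups pages into heat bins from sampled accesses and fills fast memory from the hottest bin down. It is not MaxMem's algorithm: the power-of-two bin boundaries and the names `heat_bins` and `place_in_fast_memory` are hypothetical, and the miss-ratio-based occupancy control described in the abstract is not modeled.

```python
from collections import Counter

def heat_bins(samples, n_bins=4):
    """Map each sampled page id to a heat bin (0 = coldest).

    Assumption: bins are power-of-two ranges of the sample count,
    i.e. bin = floor(log2(count)), capped at n_bins - 1.
    """
    counts = Counter(samples)
    return {page: min(n_bins - 1, c.bit_length() - 1)
            for page, c in counts.items()}

def place_in_fast_memory(bins, fast_capacity):
    """Pick up to `fast_capacity` pages, hottest bin first (ties by page id)."""
    hottest_first = sorted(bins, key=lambda page: (-bins[page], page))
    return set(hottest_first[:fast_capacity])

# Sampled accesses: page 1 is hot, pages 2 and 4 are warm, page 3 is cold.
samples = [1] * 8 + [2] * 3 + [3] * 1 + [4] * 2
bins = heat_bins(samples)
fast = place_in_fast_memory(bins, fast_capacity=2)
print(bins, fast)
```

Coarse bins keep the bookkeeping cheap: pages only need to be moved when they cross a bin boundary, not on every change in their sample count.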
Disaggregating and Consolidating Network Functionalities
Resource disaggregation has gained huge popularity in recent years. Existing
works demonstrate how to disaggregate compute, memory, and storage resources.
We, for the first time, demonstrate how to disaggregate network resources by
proposing a new distributed hardware framework called SuperNIC. Each SuperNIC
connects a small set of endpoints and consolidates network functionalities for
these endpoints. We prototype SuperNIC on FPGA and demonstrate its
performance and cost benefits with real network functions and customized
disaggregated applications.
Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries
Graph processing has become an important part of multiple areas of computer
science, such as machine learning, computational sciences, medical
applications, social network analysis, and many others. Numerous graphs such as
web or social networks may contain up to trillions of edges. Often, these
graphs are also dynamic (their structure changes over time) and have
domain-specific rich data associated with vertices and edges. Graph database
systems such as Neo4j enable storing, processing, and analyzing such large,
evolving, and rich datasets. Due to the sheer size of such datasets, combined
with the irregular nature of graph processing, these systems face unique design
challenges. To facilitate the understanding of this emerging domain, we present
the first survey and taxonomy of graph database systems. We focus on
identifying and analyzing fundamental categories of these systems (e.g., triple
stores, tuple stores, native graph database systems, or object-oriented
systems), the associated graph models (e.g., RDF or Labeled Property Graph),
data organization techniques (e.g., storing graph data in indexing structures
or dividing data into records), and different aspects of data distribution and
query execution (e.g., support for sharding and ACID). 51 graph database
systems are presented and compared, including Neo4j, OrientDB, and Virtuoso. We
outline graph database queries and relationships with associated domains (NoSQL
stores, graph streaming, and dynamic graph algorithms). Finally, we describe
research and engineering challenges to outline the future of graph databases.