Understanding PCIe performance for end host networking
In recent years, spurred on by the development and availability of programmable NICs, end hosts have increasingly become the enforcement point for core network functions such as load balancing, congestion control, and application-specific network offloads. However, implementing custom designs on programmable NICs is not easy: many potential bottlenecks can impact performance.
This paper focuses on the performance implications of PCIe, the de facto I/O interconnect in contemporary servers, when interacting with the host architecture and device drivers. We present a theoretical model for PCIe and pcie-bench, an open-source suite that allows developers to gain an accurate and deep understanding of the PCIe substrate. Using pcie-bench, we characterize the PCIe subsystem in modern servers. We highlight surprising differences in PCIe implementations, evaluate the undesirable impact of PCIe features such as IOMMUs, and show the practical limits for common network cards operating at 40Gb/s and beyond. Furthermore, through pcie-bench we gained insights which guided software and future hardware architectures for both commercial and research-oriented network cards and DMA engines.
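To make the flavor of such a model concrete, the sketch below estimates effective DMA throughput once per-TLP overheads are accounted for. It is a first-order illustration only, assuming a PCIe Gen3 x8 link, 128b/130b encoding, a 256-byte MaxPayloadSize, and roughly 24 bytes of framing and header overhead per TLP; the constants and the actual model in pcie-bench may differ.

```python
# First-order model of effective PCIe throughput for DMA transfers, in the
# spirit of the pcie-bench model. All constants are illustrative assumptions,
# not values taken from the paper.

GEN3_GTS_PER_LANE = 8e9          # Gen3 raw signalling rate: 8 GT/s per lane
ENCODING = 128 / 130             # 128b/130b line encoding overhead
LANES = 8                        # a typical x8 NIC slot
MPS = 256                        # assumed MaxPayloadSize in bytes
TLP_OVERHEAD = 24                # assumed per-TLP bytes: PHY framing + DLL
                                 # header/CRC + 3-DW TLP header

def effective_bw(transfer_bytes: int) -> float:
    """Goodput in Gb/s for a DMA transfer split into MPS-sized TLPs."""
    raw_bw = GEN3_GTS_PER_LANE * LANES * ENCODING      # bits/s on the wire
    n_tlps = -(-transfer_bytes // MPS)                 # ceiling division
    wire_bytes = transfer_bytes + n_tlps * TLP_OVERHEAD
    return raw_bw * transfer_bytes / wire_bytes / 1e9

for size in (64, 256, 1500, 4096):
    print(f"{size:5d} B payload -> {effective_bw(size):5.1f} Gb/s")
```

Even this crude model shows why 40Gb/s NICs leave little headroom: small transfers such as descriptor fetches pay the per-TLP overhead in full, which is exactly the regime pcie-bench is designed to measure.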
I/O Is Faster Than the CPU - Let's Partition Resources and Eliminate (Most) OS Abstractions
RPCValet: NI-Driven Tail-Aware Balancing of µs-Scale RPCs
Modern online services come with stringent quality requirements in terms of response time tail latency. Because of their decomposition into fine-grained communicating software layers, a single user request fans out into a plethora of short, μs-scale RPCs, aggravating the need for faster inter-server communication. In reaction to that need, we are witnessing a technological transition characterized by the emergence of hardware-terminated user-level protocols (e.g., InfiniBand/RDMA) and new architectures with fully integrated Network Interfaces (NIs). Such architectures offer a unique opportunity for a new NI-driven approach to balancing RPCs among the cores of manycore server CPUs, yielding major tail latency improvements for μs-scale RPCs. We introduce RPCValet, an NI-driven RPC load-balancing design for architectures with hardware-terminated protocols and integrated NIs, that delivers near-optimal tail latency. RPCValet's RPC dispatch decisions emulate the theoretically optimal single-queue system, without incurring the synchronization overheads currently associated with single-queue implementations. Our design improves throughput under tight tail latency goals by up to 1.4x, and reduces tail latency before saturation by up to 4x for RPCs with μs-scale service times, as compared to current systems with hardware support for RPC load distribution. RPCValet performs within 15% of the theoretically optimal single-queue system.
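The queueing-theoretic argument behind this design can be illustrated with a toy simulation (this is not RPCValet's code): a single shared queue drained by all cores has a far better tail than random dispatch to per-core queues, because no request waits behind a backlog while other cores sit idle. All parameters below are illustrative.

```python
# Toy comparison of a single shared queue vs. random dispatch to per-core
# queues, under Poisson arrivals and exponential us-scale service times.
# Only an illustration of the queueing argument, not RPCValet's design.
import random

def simulate(n_cores=16, n_reqs=200_000, load=0.7, single_queue=True):
    mean_svc = 1.0                                # 1 us mean service time
    mean_iat = mean_svc / (n_cores * load)        # system-wide arrival rate
    free_at = [0.0] * n_cores                     # when each core frees up
    t, lat = 0.0, []
    for _ in range(n_reqs):
        t += random.expovariate(1.0 / mean_iat)
        svc = random.expovariate(1.0 / mean_svc)
        if single_queue:
            # Central FCFS queue: the next request runs on whichever core
            # becomes free first.
            c = min(range(n_cores), key=lambda i: free_at[i])
        else:
            # Random per-core dispatch: each core is an independent queue.
            c = random.randrange(n_cores)
        start = max(t, free_at[c])
        free_at[c] = start + svc
        lat.append(free_at[c] - t)                # queueing + service time
    lat.sort()
    return lat[int(0.99 * len(lat))]              # 99th-percentile latency

random.seed(42)
print("99p, single queue   :", round(simulate(single_queue=True), 2), "us")
print("99p, per-core queues:", round(simulate(single_queue=False), 2), "us")
```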
Hermes: a Fast, Fault-Tolerant and Linearizable Replication Protocol
Today's datacenter applications are underpinned by datastores that are responsible for providing availability, consistency, and performance. For high availability in the presence of failures, these datastores replicate data across several nodes. This is accomplished with the help of a reliable replication protocol that is responsible for keeping the replicas strongly consistent even when faults occur. Strong consistency is preferred over weaker consistency models, which cannot guarantee intuitive behavior for clients. Furthermore, to accommodate high demand at real-time latencies, datastores must deliver high throughput and low latency.

This work introduces Hermes, a broadcast-based reliable replication protocol for in-memory datastores that provides both high throughput and low latency by enabling local reads and fully concurrent fast writes at all replicas. Hermes couples logical timestamps with cache-coherence-inspired invalidations to guarantee linearizability, avoid write serialization at a centralized ordering point, resolve write conflicts locally at each replica (hence ensuring that writes never abort), and provide fault tolerance via replayable writes. Our implementation of Hermes over an RDMA-enabled reliable datastore with five replicas shows that Hermes consistently achieves higher throughput than state-of-the-art RDMA-based reliable protocols (ZAB and CRAQ) across all write ratios while also significantly reducing tail latency. At 5% writes, the tail latency of Hermes is 3.6x lower than that of CRAQ and ZAB.
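A minimal sketch of the write path the abstract describes is below: writes carry a logical (version, node_id) timestamp, broadcast invalidations, and validate once all replicas have acknowledged, with conflicts resolved locally by taking the higher timestamp. This is our reading of the abstract rather than the paper's code; the RDMA transport, membership changes, and the failure/replay machinery are all omitted.

```python
# Hermes-style writes, sketched from the abstract: Invalidate, ack, Validate,
# with local conflict resolution via lexicographic (version, node_id)
# timestamps so that concurrent writes never abort.
from dataclasses import dataclass, field

@dataclass
class Replica:
    node_id: int
    store: dict = field(default_factory=dict)     # key -> (value, ts, state)

    def read(self, key):
        value, ts, state = self.store[key]
        assert state == "VALID"                   # local reads: Valid keys only
        return value

    def on_inval(self, key, value, ts):
        # Local conflict resolution: the higher (version, node_id) timestamp
        # wins, so every replica converges without aborting any write.
        _, cur_ts, _ = self.store.get(key, (None, (0, 0), "VALID"))
        if ts > cur_ts:
            self.store[key] = (value, ts, "INVALID")

    def on_val(self, key, ts):
        value, cur_ts, _ = self.store[key]
        if ts == cur_ts:                          # ignore stale validations
            self.store[key] = (value, ts, "VALID")

def write(replicas, coordinator, key, value):
    # Any replica may coordinate a write: bump its version, broadcast
    # Invalidations, then Validate once every replica has acked.
    _, (ver, _), _ = coordinator.store.get(key, (None, (0, 0), "VALID"))
    ts = (ver + 1, coordinator.node_id)
    for r in replicas:
        r.on_inval(key, value, ts)                # stands in for INV + ACK
    for r in replicas:
        r.on_val(key, ts)                         # stands in for VAL broadcast

replicas = [Replica(i) for i in range(5)]
write(replicas, replicas[0], "x", 1)
write(replicas, replicas[3], "x", 2)              # a different coordinator
print(replicas[2].read("x"))                      # -> 2, readable locally
```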
PaLM: Scaling Language Modeling with Pathways
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion-parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the fine-tuned state of the art on a suite of multi-step reasoning tasks and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
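For readers unfamiliar with the term, "few-shot learning" here means conditioning the model on a handful of in-context exemplars at inference time, with no gradient updates. The snippet below is a generic illustration of the pattern; the exemplars are invented and are not drawn from PaLM's evaluation suite.

```python
# Few-shot prompting: task exemplars are placed in the prompt and the model
# is conditioned on them at inference time, with no fine-tuning. The
# exemplars below are made up for illustration only.
few_shot_prompt = """\
Q: I have 3 apples and buy 2 more, then eat 1. How many apples do I have?
A: 3 + 2 = 5 apples, and eating 1 leaves 4. The answer is 4.

Q: A book costs $12 and I pay with a $20 bill. How much change do I get?
A: 20 - 12 = 8. The answer is $8.

Q: A train travels 60 miles in 1.5 hours. What is its average speed?
A:"""

# A model completes the prompt in the demonstrated style, e.g.
# "60 / 1.5 = 40. The answer is 40 miles per hour."
print(few_shot_prompt)
```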
PaLM 2 Technical Report
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English, multilingual, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities, exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities.

When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.
Practical Batch-Updatable External Hashing with Sorting
This paper presents a practical external hashing scheme that supports fast lookup (7 microseconds) for large datasets (millions to billions of items) with a small memory footprint (2.5 bits/item) and fast index construction (151K items/s for 1-KiB key-value pairs). Our scheme combines three key techniques: (1) a new index data structure (Entropy-Coded Tries); (2) the use of sorting as the main data manipulation method; and (3) support for incremental index construction for dynamic datasets. We evaluate our scheme by building an external dictionary on flash-based drives and demonstrate our scheme's high performance, compactness, and practicality.
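The sketch below illustrates the second ingredient, sorting as the main data-manipulation method, in a simplified form: hash each key, sort the (hash, offset) pairs, and serve lookups by binary search, with batch updates handled by sorting the new batch and merging runs. This is our simplified reading of the idea, not the paper's implementation; the Entropy-Coded Tries, which compress the index to about 2.5 bits per item, are replaced here by full 64-bit hashes.

```python
# Sorting-based external index sketch: construction and batch updates are
# sequential sorts/merges rather than random writes. Illustrative only; the
# paper's Entropy-Coded Trie compression is intentionally omitted.
import hashlib, bisect

def h64(key: bytes) -> int:
    """64-bit hash of a key (BLAKE2b truncated to 8 bytes)."""
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")

class SortedHashIndex:
    def __init__(self, items):
        # items: iterable of (key, offset of the key-value pair on flash)
        self.run = sorted((h64(k), off) for k, off in items)

    def merge_batch(self, items):
        # Batch update: sort the new batch, then merge with the existing
        # run -- sequential I/O only, no in-place random writes.
        new_run = sorted((h64(k), off) for k, off in items)
        self.run = sorted(self.run + new_run)     # stand-in for a 2-way merge

    def lookup(self, key: bytes):
        target = h64(key)
        i = bisect.bisect_left(self.run, (target, 0))
        if i < len(self.run) and self.run[i][0] == target:
            return self.run[i][1]                 # offset to read from flash
        return None                               # (hash collisions ignored)

idx = SortedHashIndex([(b"alpha", 0), (b"beta", 1024)])
idx.merge_batch([(b"gamma", 2048)])
print(idx.lookup(b"beta"))                        # -> 1024
```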