Understanding PCIe performance for end host networking
In recent years, spurred on by the development and availability of programmable NICs, end hosts have increasingly become the enforcement point for core network functions such as load balancing, congestion control, and application-specific network offloads. However, implementing custom designs on programmable NICs is not easy: many potential bottlenecks can impact performance.
This paper focuses on the performance implications of PCIe, the de facto I/O interconnect in contemporary servers, when interacting with the host architecture and device drivers. We present a theoretical model for PCIe and pcie-bench, an open-source suite that allows developers to gain an accurate and deep understanding of the PCIe substrate. Using pcie-bench, we characterize the PCIe subsystem in modern servers. We highlight surprising differences in PCIe implementations, evaluate the undesirable impact of PCIe features such as IOMMUs, and show the practical limits for common network cards operating at 40Gb/s and beyond. Furthermore, through pcie-bench we gained insights which guided software and future hardware architectures for both commercial and research-oriented network cards and DMA engines.
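To make the flavor of such a model concrete, the sketch below estimates effective DMA throughput once per-TLP overheads are accounted for. It is a first-order illustration only, assuming a PCIe Gen3 x8 link, 128b/130b encoding, a 256-byte MaxPayloadSize, and roughly 24 bytes of framing and header overhead per TLP; the constants and the actual model in pcie-bench may differ.

```python
# First-order model of effective PCIe throughput for DMA transfers, in the
# spirit of the pcie-bench model. All constants are illustrative assumptions,
# not values taken from the paper.

GEN3_GTS_PER_LANE = 8e9          # Gen3 raw signalling rate: 8 GT/s per lane
ENCODING = 128 / 130             # 128b/130b line encoding overhead
LANES = 8                        # a typical x8 NIC slot
MPS = 256                        # assumed MaxPayloadSize in bytes
TLP_OVERHEAD = 24                # assumed per-TLP bytes: PHY framing + DLL
                                 # header/CRC + 3-DW TLP header

def effective_bw(transfer_bytes: int) -> float:
    """Goodput in Gb/s for a DMA transfer split into MPS-sized TLPs."""
    raw_bw = GEN3_GTS_PER_LANE * LANES * ENCODING      # bits/s on the wire
    n_tlps = -(-transfer_bytes // MPS)                 # ceiling division
    wire_bytes = transfer_bytes + n_tlps * TLP_OVERHEAD
    return raw_bw * transfer_bytes / wire_bytes / 1e9

for size in (64, 256, 1500, 4096):
    print(f"{size:5d} B payload -> {effective_bw(size):5.1f} Gb/s")
```

Even this crude model shows why 40Gb/s NICs leave little headroom: small transfers such as descriptor fetches pay the per-TLP overhead in full, which is exactly the regime pcie-bench is designed to measure.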
I/O Is Faster Than the CPU - Let's Partition Resources and Eliminate (Most) OS Abstractions
RPCValet: NI-Driven Tail-Aware Balancing of µs-Scale RPCs
Modern online services come with stringent quality requirements in terms of response time tail latency. Because of their decomposition into fine-grained communicating software layers, a single user request fans out into a plethora of short, μs-scale RPCs, aggravating the need for faster inter-server communication. In reaction to that need, we are witnessing a technological transition characterized by the emergence of hardware-terminated user-level protocols (e.g., InfiniBand/RDMA) and new architectures with fully integrated Network Interfaces (NIs). Such architectures offer a unique opportunity for a new NI-driven approach to balancing RPCs among the cores of manycore server CPUs, yielding major tail latency improvements for μs-scale RPCs. We introduce RPCValet, an NI-driven RPC load-balancing design for architectures with hardware-terminated protocols and integrated NIs, that delivers near-optimal tail latency. RPCValet's RPC dispatch decisions emulate the theoretically optimal single-queue system, without incurring the synchronization overheads currently associated with single-queue implementations. Our design improves throughput under tight tail latency goals by up to 1.4x, and reduces tail latency before saturation by up to 4x for RPCs with μs-scale service times, as compared to current systems with hardware support for RPC load distribution. RPCValet performs within 15% of the theoretically optimal single-queue system.
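The queueing-theoretic argument behind this design can be illustrated with a toy simulation (this is not RPCValet's code): a single shared queue drained by all cores has a far better tail than random dispatch to per-core queues, because no request waits behind a backlog while other cores sit idle. All parameters below are illustrative.

```python
# Toy comparison of a single shared queue vs. random dispatch to per-core
# queues, under Poisson arrivals and exponential us-scale service times.
# Only an illustration of the queueing argument, not RPCValet's design.
import random

def simulate(n_cores=16, n_reqs=200_000, load=0.7, single_queue=True):
    mean_svc = 1.0                                # 1 us mean service time
    mean_iat = mean_svc / (n_cores * load)        # system-wide arrival rate
    free_at = [0.0] * n_cores                     # when each core frees up
    t, lat = 0.0, []
    for _ in range(n_reqs):
        t += random.expovariate(1.0 / mean_iat)
        svc = random.expovariate(1.0 / mean_svc)
        if single_queue:
            # Central FCFS queue: the next request runs on whichever core
            # becomes free first.
            c = min(range(n_cores), key=lambda i: free_at[i])
        else:
            # Random per-core dispatch: each core is an independent queue.
            c = random.randrange(n_cores)
        start = max(t, free_at[c])
        free_at[c] = start + svc
        lat.append(free_at[c] - t)                # queueing + service time
    lat.sort()
    return lat[int(0.99 * len(lat))]              # 99th-percentile latency

random.seed(42)
print("99p, single queue   :", round(simulate(single_queue=True), 2), "us")
print("99p, per-core queues:", round(simulate(single_queue=False), 2), "us")
```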
Hermes: a Fast, Fault-Tolerant and Linearizable Replication Protocol
Today's datacenter applications are underpinned by datastores that are responsible for providing availability, consistency, and performance. For high availability in the presence of failures, these datastores replicate data across several nodes. This is accomplished with the help of a reliable replication protocol that is responsible for keeping the replicas strongly consistent even when faults occur. Strong consistency is preferred over weaker consistency models, which cannot guarantee intuitive behavior for clients. Furthermore, to accommodate high demand at real-time latencies, datastores must deliver high throughput and low latency.

This work introduces Hermes, a broadcast-based reliable replication protocol for in-memory datastores that provides both high throughput and low latency by enabling local reads and fully concurrent fast writes at all replicas. Hermes couples logical timestamps with cache-coherence-inspired invalidations to guarantee linearizability, avoid write serialization at a centralized ordering point, resolve write conflicts locally at each replica (hence ensuring that writes never abort), and provide fault tolerance via replayable writes. Our implementation of Hermes over an RDMA-enabled reliable datastore with five replicas shows that Hermes consistently achieves higher throughput than state-of-the-art RDMA-based reliable protocols (ZAB and CRAQ) across all write ratios while also significantly reducing tail latency. At 5% writes, the tail latency of Hermes is 3.6x lower than that of CRAQ and ZAB.
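A minimal sketch of the write path the abstract describes is below: writes carry a logical (version, node_id) timestamp, broadcast invalidations, and validate once all replicas have acknowledged, with conflicts resolved locally by taking the higher timestamp. This is our reading of the abstract rather than the paper's code; the RDMA transport, membership changes, and the failure/replay machinery are all omitted.

```python
# Hermes-style writes, sketched from the abstract: Invalidate, ack, Validate,
# with local conflict resolution via lexicographic (version, node_id)
# timestamps so that concurrent writes never abort.
from dataclasses import dataclass, field

@dataclass
class Replica:
    node_id: int
    store: dict = field(default_factory=dict)     # key -> (value, ts, state)

    def read(self, key):
        value, ts, state = self.store[key]
        assert state == "VALID"                   # local reads: Valid keys only
        return value

    def on_inval(self, key, value, ts):
        # Local conflict resolution: the higher (version, node_id) timestamp
        # wins, so every replica converges without aborting any write.
        _, cur_ts, _ = self.store.get(key, (None, (0, 0), "VALID"))
        if ts > cur_ts:
            self.store[key] = (value, ts, "INVALID")

    def on_val(self, key, ts):
        value, cur_ts, _ = self.store[key]
        if ts == cur_ts:                          # ignore stale validations
            self.store[key] = (value, ts, "VALID")

def write(replicas, coordinator, key, value):
    # Any replica may coordinate a write: bump its version, broadcast
    # Invalidations, then Validate once every replica has acked.
    _, (ver, _), _ = coordinator.store.get(key, (None, (0, 0), "VALID"))
    ts = (ver + 1, coordinator.node_id)
    for r in replicas:
        r.on_inval(key, value, ts)                # stands in for INV + ACK
    for r in replicas:
        r.on_val(key, ts)                         # stands in for VAL broadcast

replicas = [Replica(i) for i in range(5)]
write(replicas, replicas[0], "x", 1)
write(replicas, replicas[3], "x", 2)              # a different coordinator
print(replicas[2].read("x"))                      # -> 2, readable locally
```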
PaLM: Scaling Language Modeling with Pathways
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion-parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the fine-tuned state of the art on a suite of multi-step reasoning tasks and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
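For readers unfamiliar with the term, "few-shot learning" here means conditioning the model on a handful of in-context exemplars at inference time, with no gradient updates. The snippet below is a generic illustration of the pattern; the exemplars are invented and are not drawn from PaLM's evaluation suite.

```python
# Few-shot prompting: task exemplars are placed in the prompt and the model
# is conditioned on them at inference time, with no fine-tuning. The
# exemplars below are made up for illustration only.
few_shot_prompt = """\
Q: I have 3 apples and buy 2 more, then eat 1. How many apples do I have?
A: 3 + 2 = 5 apples, and eating 1 leaves 4. The answer is 4.

Q: A book costs $12 and I pay with a $20 bill. How much change do I get?
A: 20 - 12 = 8. The answer is $8.

Q: A train travels 60 miles in 1.5 hours. What is its average speed?
A:"""

# A model completes the prompt in the demonstrated style, e.g.
# "60 / 1.5 = 40. The answer is 40 miles per hour."
print(few_shot_prompt)
```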
PaLM 2 Technical Report
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English, multilingual, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities, exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities.

When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.
Practical Batch-Updatable External Hashing with Sorting
This paper presents a practical external hashing scheme that supports fast lookup (7 microseconds) for large datasets (millions to billions of items) with a small memory footprint (2.5 bits/item) and fast index construction (151K items/s for 1-KiB key-value pairs). Our scheme combines three key techniques: (1) a new index data structure (Entropy-Coded Tries); (2) the use of sorting as the main data manipulation method; and (3) support for incremental index construction for dynamic datasets. We evaluate our scheme by building an external dictionary on flash-based drives and demonstrate our scheme's high performance, compactness, and practicality.
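The sketch below illustrates the second ingredient, sorting as the main data-manipulation method, in a simplified form: hash each key, sort the (hash, offset) pairs, and serve lookups by binary search, with batch updates handled by sorting the new batch and merging runs. This is our simplified reading of the idea, not the paper's implementation; the Entropy-Coded Tries, which compress the index to about 2.5 bits per item, are replaced here by full 64-bit hashes.

```python
# Sorting-based external index sketch: construction and batch updates are
# sequential sorts/merges rather than random writes. Illustrative only; the
# paper's Entropy-Coded Trie compression is intentionally omitted.
import hashlib, bisect

def h64(key: bytes) -> int:
    """64-bit hash of a key (BLAKE2b truncated to 8 bytes)."""
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")

class SortedHashIndex:
    def __init__(self, items):
        # items: iterable of (key, offset of the key-value pair on flash)
        self.run = sorted((h64(k), off) for k, off in items)

    def merge_batch(self, items):
        # Batch update: sort the new batch, then merge with the existing
        # run -- sequential I/O only, no in-place random writes.
        new_run = sorted((h64(k), off) for k, off in items)
        self.run = sorted(self.run + new_run)     # stand-in for a 2-way merge

    def lookup(self, key: bytes):
        target = h64(key)
        i = bisect.bisect_left(self.run, (target, 0))
        if i < len(self.run) and self.run[i][0] == target:
            return self.run[i][1]                 # offset to read from flash
        return None                               # (hash collisions ignored)

idx = SortedHashIndex([(b"alpha", 0), (b"beta", 1024)])
idx.merge_batch([(b"gamma", 2048)])
print(idx.lookup(b"beta"))                        # -> 1024
```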