
3 research outputs found

Recommended from our members

Eloquent: A More Robust Transmission Scheme for LLM Token Streaming

Authors: Cheng Yihua, Du Kuntai, Jiang Junchen, Li Hanchen, Liu Yuhan, Ray Siddhant
Publication date: 13/08/2024

To render each generated token in real-time for users, the Large Language Model (LLM) server generates tokens one by one and streams each token (or group of a few tokens) through the network to the user right after generation, which we refer to as LLM token streaming. However, under unstable network conditions, the LLM token streaming experience could suffer greatly from stalls since one packet loss could block the rendering of later tokens even if the packets containing them arrive on time. With a measurement study, we show that current applications suffer from increased stalls under unstable networks. For this emerging token streaming problem in LLM Chatbots that differs from previous multimedia and text applications, we propose a novel transmission scheme, called Eloquent, which puts newly generated tokens as well as currently unacknowledged tokens in the next outgoing packet. This ensures that each packet contains some new tokens and, in the meantime, is independently rendered when received, avoiding the aforementioned stalls caused by missing packets. Through simulation under various networks, we show Eloquent reduces stall ratio (proportion of token rendering wait time) by 71.0% compared to the retransmission method commonly used by real chatbot applications and by 31.6% compared to the baseline packet duplication scheme. By tailoring Eloquent to fit the token-by-token generation of LLM, we enable the Chatbots to respond like an eloquent speaker for users to better enjoy pervasive AI.
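The packing idea in the abstract above (each outgoing packet carries the new tokens plus every token not yet acknowledged) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Packet`, `Sender`, and `Receiver` names, the cumulative-acknowledgment convention, and the in-memory "network" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    first_index: int    # position of tokens[0] within the full response
    tokens: list        # all unacknowledged tokens, oldest first


class Sender:
    """Hypothetical sender: resends every unacknowledged token in each packet."""

    def __init__(self):
        self.base = 0       # index of the oldest unacknowledged token
        self.unacked = []   # tokens sent but not yet acknowledged

    def make_packet(self, new_tokens):
        # New tokens join the unacknowledged window; the packet carries the whole window.
        self.unacked.extend(new_tokens)
        return Packet(self.base, list(self.unacked))

    def on_ack(self, next_expected):
        # Cumulative ack (an assumption): the receiver holds all tokens before next_expected.
        drop = min(max(0, next_expected - self.base), len(self.unacked))
        del self.unacked[:drop]
        self.base += drop


class Receiver:
    """Hypothetical receiver: renders whatever part of a packet is new to it."""

    def __init__(self):
        self.rendered = []

    def on_packet(self, pkt):
        have = len(self.rendered)
        if pkt.first_index <= have:
            # Skip tokens already rendered, append the rest.
            self.rendered.extend(pkt.tokens[have - pkt.first_index:])
        return len(self.rendered)   # value to send back as the cumulative ack


sender, receiver = Sender(), Receiver()
p1 = sender.make_packet(["Hello"])
p2 = sender.make_packet([",", " world"])
# Suppose p1 is lost in transit: p2 alone is enough to render everything so far.
ack = receiver.on_packet(p2)
sender.on_ack(ack)
p3 = sender.make_packet(["!"])
receiver.on_packet(p3)
receiver.on_packet(p1)   # a late duplicate adds nothing
text = "".join(receiver.rendered)
```

In this sketch the loss of `p1` causes no stall, because `p2` repeats the unacknowledged token it carried. The cost is redundant bytes on the wire, which shrink again as acknowledgments arrive.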

Knowledge UChicago


CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Authors: Ananthanarayanan Ganesh, Cheng Yihua, Du Kuntai, Hoffmann Henry, Holtzman Ari, Huang Yuyang, Jiang Junchen, Li Hanchen, Liu Yuhan, Lu Shan, Maire Michael, Ray Siddhant, Yao Jiayi, Zhang Qizheng
Publication date: 11/09/2024

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause high extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder, leveraging KV cache's distributional properties to encode a KV cache into more compact bitstream representations with negligible decoding overhead, to save bandwidth usage. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, in order to maintain low context-loading delay and high generation quality. We test CacheGen on popular LLMs and datasets. Compared to the recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3x and the total delay in fetching and processing contexts by 3.2-3.7x with negligible impact on the LLM response quality. Our code is at: https://github.com/UChi-JCL/CacheGen.
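The second idea in the abstract above, adapting the compression level of each part of the KV cache to the available bandwidth, might look roughly like the sketch below. It is an illustration under stated assumptions, not CacheGen's actual algorithm or API: the function names, the even split of the remaining delay budget across chunks, and the example sizes and bandwidths are all hypothetical.

```python
def pick_level(sizes_bytes, bandwidth_bps, budget_s):
    """Return the least-compressed level whose transfer fits the time budget.

    sizes_bytes is ordered from level 0 (least compressed, best quality)
    to the last level (most compressed). Falls back to the last level.
    """
    for level, size in enumerate(sizes_bytes):
        if size * 8 / bandwidth_bps <= budget_s:
            return level
    return len(sizes_bytes) - 1


def plan_stream(chunks, bandwidths_bps, deadline_s):
    """Choose one compression level per KV cache chunk.

    chunks[i] lists the encoded sizes of chunk i at each level, and
    bandwidths_bps[i] is the bandwidth assumed while chunk i is sent.
    The remaining delay budget is split evenly over the remaining chunks
    (a simplifying assumption made for this sketch).
    """
    levels, elapsed = [], 0.0
    for i, (sizes, bw) in enumerate(zip(chunks, bandwidths_bps)):
        budget = max(deadline_s - elapsed, 0.0) / (len(chunks) - i)
        level = pick_level(sizes, bw, budget)
        elapsed += sizes[level] * 8 / bw
        levels.append(level)
    return levels, elapsed


# Three chunks, each available at 4 MB, 2 MB, or 1 MB (hypothetical sizes).
chunks = [[4_000_000, 2_000_000, 1_000_000]] * 3
# Bandwidth drops from 32 Mbps to 8 Mbps for the middle chunk, then recovers.
levels, total_delay = plan_stream(chunks, [32_000_000, 8_000_000, 32_000_000], 3.0)
```

With these numbers, the middle chunk is sent at the most compressed level while the other two keep the highest quality, and the total delay stays within the 3-second budget.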

Knowledge UChicago

ffmpeg 5.0 binary file

Author: Kuntai Du
Publication date: 11/02/2022

ffmpeg 5.0 release from https://johnvansickle.com/ffmpeg/ archived for artifact evaluation of the MLSys paper "AccMPEG: Optimizing Video Encoding for Accurate Video Analytics".
