

IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency

Author: Doyle Joseph
Ghafouri Saeid
Jamshidi Pooyan
Lorido-Botran Tania
Razavi Kamran
Salmani Mehran
Sanaee Alireza
Wang Lin
Publication date: 24/08/2023

Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in ML production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of accuracy and cost in inference pipelines, providers frequently opt to consider only one of them. However, the challenge lies in reconciling accuracy and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep-learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency SLAs using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Extensive experiments on a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves normalized accuracy by up to 35% with a minimal cost increase of less than 5%.
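
The kind of decision the abstract describes (one model variant, batch size, and replica count per pipeline stage, subject to a latency SLA) can be illustrated with a small sketch. This is not IPA's implementation: it swaps the Integer Programming solver for exhaustive search, and the stage names, profiles, latency model, and the weights `alpha` and `beta` are all hypothetical.

```python
from itertools import product

# Hypothetical profiles: variant -> (accuracy, batch-1 latency in seconds, cores per replica).
STAGES = {
    "detector": {"small": (0.60, 0.020, 1), "large": (0.75, 0.050, 2)},
    "classifier": {"small": (0.70, 0.010, 1), "large": (0.85, 0.030, 2)},
}
BATCH_SIZES = (1, 2, 4)
MAX_REPLICAS = 4


def stage_options(stage):
    """Yield every (variant, batch size, replica count) choice for one stage."""
    for variant, (accuracy, base_latency, cores) in STAGES[stage].items():
        for batch, replicas in product(BATCH_SIZES, range(1, MAX_REPLICAS + 1)):
            # Assumed latency model: batching adds half the base latency per extra request.
            latency = base_latency * (0.5 + 0.5 * batch)
            yield {
                "stage": stage, "variant": variant, "batch": batch, "replicas": replicas,
                "accuracy": accuracy, "latency": latency,
                "throughput": replicas * batch / latency,  # requests per second
                "cost": cores * replicas,
            }


def choose_configuration(rate, sla, alpha=1.0, beta=0.05):
    """Exhaustive search standing in for an integer program (illustration only)."""
    best, best_score = None, float("-inf")
    for combo in product(*(list(stage_options(s)) for s in STAGES)):
        if any(o["throughput"] < rate for o in combo):
            continue  # some stage cannot keep up with the arrival rate
        if sum(o["latency"] for o in combo) > sla:
            continue  # end-to-end latency would exceed the SLA
        score = alpha * sum(o["accuracy"] for o in combo) - beta * sum(o["cost"] for o in combo)
        if score > best_score:
            best, best_score = combo, score
    return best  # None when no configuration is feasible
```

Raising `beta` relative to `alpha` pushes the search toward cheaper configurations, which is one simple way to express the accuracy and cost trade-off; queueing delay is ignored here for brevity.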

arXiv.org e-Print Archive

Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems

Author: Doyle Joseph
Ghafouri Saeid
Jamshidi Pooyan
Mühlhäuser Max
Razavi Kamran
Salmani Mehran
Sanaee Alireza
Sharifi Mohsen
Publication venue: ACM
Publication date: 24/04/2023

The use of machine learning (ML) inference for various applications is growing drastically. ML inference services engage with users directly, requiring fast and accurate responses. Moreover, these services face dynamic workloads of requests, imposing changes in their computing resources. Failing to right-size computing resources results in either latency service level objective (SLO) violations or wasted computing resources. Adapting to dynamic workloads while considering all the pillars of accuracy, latency, and resource cost is challenging. In response to these challenges, we propose InfAdapter, which proactively selects a set of ML model variants with their resource allocations to meet the latency SLO while maximizing an objective function composed of accuracy and cost. InfAdapter reduces SLO violations and costs by up to 65% and 33%, respectively, compared to a popular industry autoscaler (Kubernetes Vertical Pod Autoscaler).
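
As a rough illustration of selecting a set of variants together with their resource allocations, the sketch below enumerates CPU-core assignments for a predicted load. It is not InfAdapter's algorithm: the variant names, accuracies, per-core throughput figures (assumed to be measured at the latency SLO), and the cost weight `beta` are hypothetical.

```python
from itertools import product

# Hypothetical profiles: variant -> (accuracy, requests/s one core sustains within the latency SLO).
VARIANTS = {"small": (0.70, 40.0), "medium": (0.76, 15.0), "large": (0.78, 6.0)}
CORE_CHOICES = (0, 1, 2, 4)  # 0 means the variant is not deployed


def select_variants(predicted_rps, budget_cores, beta=0.01):
    """Pick cores per variant maximizing average accuracy minus beta * total cores."""
    names = list(VARIANTS)
    best, best_score = None, float("-inf")
    for cores in product(CORE_CHOICES, repeat=len(names)):
        total = sum(cores)
        if total == 0 or total > budget_cores:
            continue
        capacity = {n: c * VARIANTS[n][1] for n, c in zip(names, cores)}
        if sum(capacity.values()) < predicted_rps:
            continue  # the predicted load would violate the SLO
        # Route load to the most accurate variants first, up to their capacity.
        remaining, weighted = predicted_rps, 0.0
        for n in sorted(names, key=lambda n: -VARIANTS[n][0]):
            served = min(remaining, capacity[n])
            weighted += served * VARIANTS[n][0]
            remaining -= served
        score = weighted / predicted_rps - beta * total
        if score > best_score:
            best, best_score = dict(zip(names, cores)), score
    return best  # None when the budget cannot serve the predicted load
```

Calling `select_variants` again whenever the load forecast changes gives a simple proactive adaptation loop under these assumptions.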

arXiv.org e-Print Archive

TUbiblio

IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency

Author: Doyle Joseph
Ghafouri Saeid
Jamshidi Pooyan
Lorido-Botran Tania
Razavi Kamran
Salmani Mehran
Sanaee Alireza
Wang Lin
Publication venue: arXiv
Publication date: 24/08/2023

TUbiblio

[Solution] IPA: Inference Pipeline Adaptation to achieve high accuracy and cost-efficiency

Author: Lorido-Botran Tania
Doyle Joseph
Ghafouri Saeid
Jamshidi Pooyan
Razavi Kamran
Salmani Mehran
Sanaee Alireza
Wang Lin
Publication venue: eScholarship, University of California
Publication date: 01/01/2024

Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to consider only one of them. However, the challenge lies in reconciling latency, accuracy, and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs) using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Navigating a wider variety of configurations allows IPA to achieve better trade-offs between cost and accuracy objectives compared to existing methods. Extensive experiments in a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase. The code and data for replications are available at https://github.com/reconfigurable-ml-pipeline/ipa

TUbiblio

eScholarship - University of California