INFaaS: A Model-less and Managed Inference Serving System
Despite existing work in machine learning inference serving, ease-of-use and
cost efficiency remain challenges at large scales. Developers must manually
search through thousands of model-variants -- versions of already-trained
models that differ in hardware, resource footprints, latencies, costs, and
accuracies -- to meet the diverse application requirements. Since requirements,
query load, and applications themselves evolve over time, these decisions need
to be made dynamically for each inference query to avoid excessive costs
through naive autoscaling. To avoid navigating through the large and complex
trade-off space of model-variants, developers often fix a variant across
queries, and replicate it when load increases. However, given the diversity
across variants and hardware platforms in the cloud, a lack of understanding of
the trade-off space can incur significant costs to developers.
This paper introduces INFaaS, a managed and model-less system for distributed
inference serving, where developers simply specify the performance and accuracy
requirements for their applications without having to choose a specific
model-variant for each query. INFaaS generates model-variants, and efficiently
navigates the large trade-off space of model-variants on behalf of developers
to meet application-specific objectives: (a) for each query, it selects a
model, hardware architecture, and model optimizations, (b) it combines VM-level
horizontal autoscaling with model-level autoscaling, where multiple, different
model-variants are used to serve queries within each machine. By leveraging
diverse variants and sharing hardware resources across models, INFaaS achieves
1.3x higher throughput, violates latency objectives 1.6x less often, and saves
up to 21.6x in cost (8.5x on average) compared to state-of-the-art inference
serving systems on AWS EC2.
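The per-query decision described above can be pictured as a constrained cheapest-variant search: among profiled model-variants, pick the cheapest one that satisfies the query's latency and accuracy objectives. A minimal sketch with invented variant names and profiled numbers (not INFaaS's actual API):

```python
# Hypothetical sketch of per-query model-variant selection: choose the
# cheapest registered variant that meets both the latency and accuracy
# objectives of the query. All names and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    hardware: str        # e.g. "cpu", "gpu"
    latency_ms: float    # profiled tail latency
    accuracy: float      # measured accuracy of this variant
    cost_per_1k: float   # dollars per 1000 queries

def select_variant(variants, max_latency_ms, min_accuracy):
    """Return the cheapest variant satisfying both objectives, or None."""
    feasible = [v for v in variants
                if v.latency_ms <= max_latency_ms and v.accuracy >= min_accuracy]
    return min(feasible, key=lambda v: v.cost_per_1k) if feasible else None

variants = [
    Variant("resnet50-fp32-cpu", "cpu", 120.0, 0.76, 0.08),
    Variant("resnet50-trt-gpu", "gpu", 8.0, 0.76, 0.40),
    Variant("resnet50-int8-cpu", "cpu", 45.0, 0.74, 0.05),
]
# The int8 CPU variant is cheapest but misses the accuracy target here,
# so the GPU variant is chosen despite its higher cost.
print(select_variant(variants, max_latency_ms=50, min_accuracy=0.75))
```

In a real system the feasibility check would also account for current load and the cost of loading a variant that is not yet resident, which is part of what makes the trade-off space hard to navigate by hand.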
MOSEL: Inference Serving Using Dynamic Modality Selection
Rapid advancements over the years have helped machine learning models reach
previously hard-to-achieve goals, sometimes even exceeding human capabilities.
However, to attain the desired accuracy, the model sizes and in turn their
computational requirements have increased drastically. Thus, serving
predictions from these models to meet any target latency and cost requirements
of applications remains a key challenge, despite recent work in building
inference-serving systems as well as algorithmic approaches that dynamically
adapt models based on inputs. In this paper, we introduce a form of dynamism,
modality selection, where we adaptively choose modalities from inference inputs
while maintaining the model quality. We introduce MOSEL, an automated inference
serving system for multi-modal ML models that carefully picks input modalities
per request based on user-defined performance and accuracy requirements. MOSEL
exploits modality configurations extensively, improving system throughput by
3.6x with an accuracy guarantee and shortening job completion times by
11x.
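The modality-selection idea can be pictured as picking, per request, the profiled modality configuration with the best accuracy that still fits the request's latency budget. A toy sketch with made-up profile numbers (not MOSEL's actual policy):

```python
# Illustrative sketch of per-request modality selection for a
# multi-modal model. The (latency, accuracy) profiles below are
# invented; a real system would measure them offline.
PROFILES = {
    ("audio",):         (20.0, 0.81),
    ("video",):         (55.0, 0.86),
    ("audio", "video"): (70.0, 0.91),
}

def select_modalities(latency_budget_ms, min_accuracy):
    """Best-accuracy modality configuration within budget, or None."""
    feasible = [(cfg, acc) for cfg, (lat, acc) in PROFILES.items()
                if lat <= latency_budget_ms and acc >= min_accuracy]
    return max(feasible, key=lambda x: x[1])[0] if feasible else None

# A tight budget forces the audio-only configuration; a looser one
# admits video, which has higher accuracy.
print(select_modalities(25, 0.80))
print(select_modalities(60, 0.80))
```

Dropping a modality per request is what distinguishes this form of dynamism from model-level adaptation: the model stays fixed while the input itself is reduced.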
NextGen-Malloc: Giving Memory Allocator Its Own Room in the House
Memory allocation and management have a significant impact on the performance and energy consumption of modern applications. The authors observe that performance can vary by as much as 72% in some applications depending on which memory allocator is used, and in this paper they make a case for offloading memory allocation (and other similar management functions) from the main processing cores to other processing units to boost performance, reduce energy consumption, and customize services to specific applications or application domains.
Faster Jobs in Distributed Data Processing using Multi-Task Learning
Slow-running or straggler tasks in distributed processing frameworks [1, 2] can be 6 to 8 times slower than the median task in a job on a production cluster [3], despite existing mitigation techniques. This leads to extended job completion times, inefficient use of resources, and increased costs. Recently, proactive straggler avoidance techniques [4] have explored the use of predictive models to improve task scheduling. However, to capture node and workload variability, separate models are constructed for every node and workload, requiring the time-consuming collection of substantial training data and limiting the applicability to new nodes and workloads. In this work, we observe that predictors for similar nodes or workloads are likely to be similar and can share information, suggesting a multi-task learning (MTL) based approach. We generalize the MTL formulation of [5] to capture commonalities in arbitrary groups. Using our formulation to predict stragglers allows us to reduce job completion times by up to 59% over Wrangler [4]. This large reduction arises from a 7 percentage point increase in prediction accuracy. Further, we can get equal or better accuracy than [4] using a sixth of the training data, thus bringing the training time down from 4 hours to about 40 minutes. In addition, our formulation reduces the number of parameters by grouping our parameters into node- and workload-dependent factors. We show that, in the event of a particular task having insufficient data, this helps us generalize and achieve significant gains over a naive MTL formulation [5].
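The node- and workload-dependent factoring described above can be sketched as follows: rather than learning one weight vector per (node, workload) pair, each pair's predictor is the sum of a node factor and a workload factor, fit jointly so that pairs share statistical strength. A toy reconstruction on synthetic, noiseless data (not the paper's exact formulation):

```python
# Toy sketch of a factored multi-task linear model:
#   w[node, workload] = u[node] + v[workload]
# All (node, workload) pairs are fit jointly by one least-squares
# solve over the shared factors. Data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
d, n_nodes, n_wkls, n_samples = 3, 4, 2, 500

# Ground-truth factors used to generate the synthetic training data.
U_true = rng.normal(size=(n_nodes, d))
V_true = rng.normal(size=(n_wkls, d))

nodes = rng.integers(0, n_nodes, n_samples)
wkls = rng.integers(0, n_wkls, n_samples)
X = rng.normal(size=(n_samples, d))
y = np.einsum("sd,sd->s", X, U_true[nodes] + V_true[wkls])

# Design matrix: each sample activates its node block and its
# workload block, both filled with the sample's features.
A = np.zeros((n_samples, (n_nodes + n_wkls) * d))
for s in range(n_samples):
    A[s, nodes[s] * d:(nodes[s] + 1) * d] = X[s]
    off = n_nodes * d
    A[s, off + wkls[s] * d: off + (wkls[s] + 1) * d] = X[s]

theta, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ theta
print(float(np.max(np.abs(pred - y))))  # near zero on noiseless data
```

Note the parameter saving: (n_nodes + n_wkls) * d factors instead of n_nodes * n_wkls * d independent weights, which is also why pairs with little data can still generalize.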
Discovery of Application Workloads from Network File Traces
An understanding of application I/O access patterns is useful in several situations. First, gaining insight into what applications are doing with their data at a semantic level helps in designing efficient storage systems. Second, it helps create benchmarks that mimic realistic application behavior closely. Third, it enables autonomic systems, as the information obtained can be used to adapt the system in a closed loop. All these use cases require the ability to extract the application-level semantics of I/O operations. Methods such as modifying application code to associate I/O operations with semantic tags are intrusive. It is well known that network file system traces are an important source of information that can be obtained non-intrusively and analyzed.
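As a toy illustration of recovering application-level behavior from such traces, the sketch below classifies each file's reads as sequential or random from (path, offset, size) records; the trace format and the 80% threshold are invented for illustration, not taken from the paper:

```python
# Toy access-pattern discovery from a simplified file trace.
# A read is "sequential" if it starts exactly where the previous
# read of the same file ended.
from collections import defaultdict

def classify_access(trace):
    """trace: iterable of (path, offset, size) read records."""
    last_end = {}
    seq = defaultdict(int)
    total = defaultdict(int)
    for path, offset, size in trace:
        total[path] += 1
        if last_end.get(path) == offset:
            seq[path] += 1
        last_end[path] = offset + size
    # Label a file sequential if >= 80% of its reads follow on
    # directly from the previous one (threshold is arbitrary).
    return {p: "sequential" if seq[p] / total[p] >= 0.8 else "random"
            for p in total}
```

A real trace analysis would of course work from captured NFS packets and handle interleaved clients, but the closed-loop use case above only needs this kind of derived, semantic summary rather than the raw trace.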
Sidecars on the Central Lane: Impact of Network Proxies on Microservices
Cloud applications are moving away from the monolithic model towards
loosely-coupled microservices designs. Service meshes are widely used for
implementing microservices applications mainly because they provide a modular
architecture for modern applications by separating operational features from
application business logic. Sidecar proxies in service meshes enable this
modularity by applying security, networking, and monitoring policies on the
traffic to and from services. To implement these policies, sidecars often
execute complex chains of logic that vary across associated applications and
end up unevenly impacting the performance of the overall application. A lack
of understanding of how sidecars impact the performance of microservice-based
applications stands in the way of building performant and resource-efficient
applications. To this end, we bring sidecar proxies into focus and argue that we
need to deeply study their impact on the system performance and resource
utilization. We identify and describe challenges in characterizing sidecars,
namely the need for microarchitectural metrics and comprehensive methodologies,
and discuss research directions where such characterization will help in
building efficient service mesh infrastructure for microservice applications.
Comment: Presented at HotInfra 2023 (co-located with ISCA 2023, Orlando, FL).
