Stratum: A Serverless Framework for Lifecycle Management of Machine Learning based Data Analytics Tasks
With the proliferation of machine learning (ML) libraries and frameworks, the programming languages they use, and the accompanying operations of data loading, transformation, preparation, and mining, ML model development is becoming a daunting task. Furthermore, the plethora of cloud-based ML model development platforms, heterogeneity in hardware, the increased focus on exploiting edge computing resources for low-latency prediction serving, and an often incomplete understanding of the resources required to execute ML workflows efficiently mean that ML model deployment demands expertise to manage the lifecycle of ML workflows efficiently and at minimal cost. To address these challenges, we propose Stratum, an end-to-end serverless platform for data analytics. Stratum can deploy, schedule, and dynamically manage data ingestion tools, live streaming apps, batch analytics tools, ML-as-a-service (for inference jobs), and visualization tools across the cloud-fog-edge spectrum. This paper describes the Stratum architecture, highlighting the problems it resolves.
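To make the lifecycle-management idea concrete, here is a minimal sketch of how a Stratum-style pipeline might be declared and placed across tiers. The Stage and Tier classes, the placement rule, and the stage names are hypothetical illustrations, not the actual Stratum API:

```python
# Hypothetical sketch of declaring a cross-tier analytics pipeline and
# assigning each stage to a tier; names and placement logic are illustrative.
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    EDGE = "edge"
    FOG = "fog"
    CLOUD = "cloud"

@dataclass
class Stage:
    name: str
    kind: str                  # e.g. "ingest", "stream", "batch", "inference", "viz"
    latency_sensitive: bool

def place(stage: Stage) -> Tier:
    """Toy placement rule: latency-sensitive stages (e.g. inference) run
    near the data source; streaming runs in the fog; heavy batch jobs
    run in the cloud."""
    if stage.latency_sensitive:
        return Tier.EDGE
    if stage.kind == "stream":
        return Tier.FOG
    return Tier.CLOUD

pipeline = [
    Stage("sensor-ingest", "ingest", latency_sensitive=True),
    Stage("window-aggregate", "stream", latency_sensitive=False),
    Stage("model-inference", "inference", latency_sensitive=True),
    Stage("nightly-retrain", "batch", latency_sensitive=False),
]

for stage in pipeline:
    print(f"{stage.name} -> {place(stage).value}")
```

A real scheduler would also weigh resource availability and cost, but the tiered-placement decision is the core of managing such workflows across the cloud-fog-edge spectrum.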
FECBench: A Holistic Interference-aware Approach for Application Performance Modeling
Services hosted in multi-tenant cloud platforms often encounter performance
interference due to contention for non-partitionable resources, which in turn
causes unpredictable behavior and degradation in application performance. To
grapple with these problems and to define effective resource management
solutions for their services, providers often must expend significant efforts
and incur prohibitive costs in developing performance models of their services
under a variety of interference scenarios on different hardware. This is a hard
problem due to the wide range of possible co-located services and their
workloads, and the growing heterogeneity in the runtime platforms including the
use of fog and edge-based resources, not to mention the accidental complexity
in performing application profiling under a variety of scenarios. To address
these challenges, we present FECBench, a framework to guide providers in
building performance interference prediction models for their services without
incurring undue costs and efforts. The contributions of the paper are as
follows. First, we developed a technique to build resource stressors that can stress multiple system resources simultaneously in a controlled manner, to gain insight into how interference affects an application's performance. Second, to
overcome the need for exhaustive application profiling, FECBench intelligently
uses the design of experiments (DoE) approach to enable users to build
surrogate performance models of their services. Third, FECBench maintains an
extensible knowledge base of application combinations that create resource
stresses across the multi-dimensional resource design space. Empirical results from real-world scenarios validate the efficacy of FECBench: the predicted application performance has a median error of only 7.6% across all test cases, with 5.4% in the best case and 13.5% in the worst case.
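As a rough illustration of the DoE-driven surrogate-modeling idea, the sketch below samples a hypothetical four-knob resource-pressure space with a Latin hypercube design and fits an off-the-shelf regression model. The knob names, the synthetic run_under_pressure() stand-in for real profiling runs, and the choice of a random forest are all assumptions, not details from the paper:

```python
# Minimal sketch: Latin hypercube sampling over a resource-pressure space,
# then fitting a surrogate model instead of exhaustively profiling.
import numpy as np
from scipy.stats import qmc
from sklearn.ensemble import RandomForestRegressor

KNOBS = ["cpu", "mem_bw", "llc", "net"]   # stressor intensities in [0, 1]

def run_under_pressure(x: np.ndarray) -> float:
    """Placeholder for profiling the target service while co-located
    stressors apply pressure x; a synthetic response is used here."""
    return 10.0 + 25.0 * x[0] + 12.0 * x[2] + 6.0 * x[0] * x[1]

# A Latin hypercube covers the 4-D design space with far fewer runs than
# an exhaustive grid, which is the point of using design of experiments.
sampler = qmc.LatinHypercube(d=len(KNOBS), seed=42)
X = sampler.random(n=64)
y = np.array([run_under_pressure(x) for x in X])

surrogate = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Predict performance under an unseen interference mix.
probe = np.array([[0.8, 0.3, 0.5, 0.1]])
print(f"predicted latency: {surrogate.predict(probe)[0]:.1f}")
```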
BARISTA: Efficient and Scalable Serverless Serving System for Deep Learning Prediction Services
Pre-trained deep learning models are increasingly being used to offer a
variety of compute-intensive predictive analytics services such as fitness
tracking, and speech and image recognition. The stateless and highly parallelizable nature of deep learning models makes them well-suited to the serverless computing paradigm. However, making effective resource management decisions for these services is a hard problem due to dynamic workloads and the diverse set of available resource configurations, each with its own deployment and management costs. To address these challenges, we present a distributed and scalable
deep-learning prediction serving system called Barista and make the following
contributions. First, we present a fast and effective methodology for
forecasting workloads by identifying various trends. Second, we formulate an
optimization problem to minimize the total cost incurred while ensuring bounded
prediction latency with reasonable accuracy. Third, we propose an efficient
heuristic to identify suitable compute resource configurations. Fourth, we
propose an intelligent agent to allocate and manage the compute resources by
horizontal and vertical scaling to maintain the required prediction latency.
Finally, using representative real-world workloads for an urban transportation service, we demonstrate and validate the capabilities of Barista.
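The flavor of such a configuration heuristic can be sketched in a few lines. The VM catalog, prices, and per-type throughput bounds below are invented, and the greedy search illustrates the cost-versus-latency trade-off rather than Barista's actual algorithm:

```python
# Toy sketch: pick the cheapest fleet of VMs whose aggregate throughput
# serves the forecast demand while each request stays within the latency SLO.
from dataclasses import dataclass
import math

@dataclass
class VmType:
    name: str
    cost_per_hour: float
    throughput_rps: float      # sustainable requests/s within the latency SLO

CATALOG = [                    # invented configurations and prices
    VmType("small", 0.05, 40.0),
    VmType("medium", 0.10, 95.0),
    VmType("large", 0.20, 210.0),
]

def cheapest_fleet(forecast_rps: float) -> tuple[VmType, int, float]:
    """Choose a VM type (vertical scaling) and a replica count
    (horizontal scaling) that meet forecast demand at minimum cost."""
    best = None
    for vm in CATALOG:
        replicas = math.ceil(forecast_rps / vm.throughput_rps)
        cost = replicas * vm.cost_per_hour
        if best is None or cost < best[2]:
            best = (vm, replicas, cost)
    return best

vm, n, cost = cheapest_fleet(forecast_rps=500.0)
print(f"provision {n} x {vm.name} (${cost:.2f}/h)")
```

Re-running the heuristic as the workload forecast changes is what drives the horizontal and vertical scaling decisions the abstract describes.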