25 research outputs found

    How does it function? Characterizing long-term trends in production serverless workloads

    This paper releases and analyzes two new Huawei cloud serverless traces. The traces span a period of over 7 months with over 1.4 trillion function invocations combined. The first trace is derived from Huawei's internal workloads and contains detailed per-second statistics for 200 functions running across multiple Huawei cloud data centers. The second trace is a representative workload from Huawei's public FaaS platform. This trace contains per-minute arrival rates for over 5000 functions running in a single Huawei data center. We present the internals of a production FaaS platform by characterizing resource consumption, cold-start times, programming languages used, periodicity, per-second versus per-minute burstiness, correlations, and popularity. Our findings show that there is considerable diversity in how serverless functions behave: requests vary by up to 9 orders of magnitude across functions, with some functions executed over 1 billion times per day; scheduling time, execution time and cold-start distributions vary across 2 to 4 orders of magnitude and have very long tails; and function invocation counts demonstrate strong periodicity for many individual functions and on an aggregate level. Our analysis also highlights the need for further research in estimating resource reservations and time-series prediction to account for the huge diversity in how serverless functions behave.

    Perphon: a ML-based Agent for Workload Co-location via Performance Prediction and Resource Inference

    Cluster administrators face great pressure to improve cluster utilization through workload co-location. Guaranteeing the performance of long-running applications (LRAs), however, is far from settled, as unpredictable interference across applications is catastrophic to QoS [2]. Current solutions such as [1] usually employ sandboxed and offline profiling for different workload combinations and leverage them to predict incoming interference. However, the time complexity restricts their applicability to complex co-locations. Hence, this issue entails a new framework to harness runtime performance and mitigate the time cost with machine intelligence: i) It is desirable to explore a quantitative relationship between allocated resources and the consequent workload performance, without relying on analyzing interference derived from different workload combinations. The majority of works, however, depend on offline profiling and training, which may lead to a model aging problem. Moreover, multi-resource dimensions (e.g., LLC contention) that are not completely covered by existing works but have an impact on performance interference need to be considered [3]. ii) Workload co-location also necessitates fine-grained isolation and access control mechanisms. Once performance degradation is detected, dynamic resource adjustment will be enforced and the application will be assigned access to specific slices of each resource. Inferring a "just enough" amount of resource adjustment ensures that application performance can be secured whilst improving cluster utilization. We present Perphon, a runtime agent on a per-node basis that decouples ML-based performance prediction and resource inference from the centralized scheduler. Figure 1 outlines the proposed architecture. We initially exploit the sensitivity of applications to multiple resources to establish performance prediction. 
To achieve this, Metric Monitor aggregates application fingerprints and system-level performance metrics including CPU, memory, Last Level Cache (LLC), memory bandwidth (MBW) and number of running threads, etc. They are enabled by Intel-RDT and precisely obtained from the resource group manager. Perphon employs an Online Gradient Boost Regression Tree (OGBRT) approach to resolve the model aging problem. Res-Perf Model warms up via offline learning that relies only on a small volume of profiling in the early stage, but evolves with the arrival of workloads. Consequently, parameters will be automatically updated and synchronized among agents. Anomaly Detector can promptly pinpoint a performance degradation via LSTM time-series analysis and determine when and which applications need their resources re-allocated. Once an abnormal performance counter or load is detected, Resource Inferer conducts a gradient-ascent-based inference to work out a proper slice of resources, towards dynamically recovering the targeted performance. Upon receiving an updated re-allocation, Access Controller re-assigns a specific portion of the node resources to the affected application. Eventually, Isolation Executor enforces resource manipulation and ensures performance isolation across applications. Specifically, we use the cgroup cpuset and memory subsystems to control usage of CPU and memory while leveraging Intel-RDT technology to underpin the manipulation of LLC and MBW. For fine-granularity management, we create different groups for LRA and batch jobs when the agent starts. Our prototype integration with the Node Manager of Apache YARN shows that the throughput of a Kafka data-streaming application in Perphon is 2.0x and 1.82x that of the isolation execution schemes in native YARN and the pure cgroup cpu subsystem, respectively

    tf.data service: A Case for Disaggregating ML Input Data Processing

    Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt. For cost efficiency, it is essential to keep these accelerators highly utilized. This requires preprocessing input data at the rate at which the accelerators can ingest and perform ML computations on the data. To avoid data stalls, the host CPU and RAM required for input data processing per accelerator core used for ML computations varies across jobs. Hence, the traditional approach of processing input data on ML accelerator hosts with a fixed hardware ratio leads to either under-utilizing the accelerators or the host CPU and RAM. In this paper, we address these concerns by building a disaggregated ML data processing system. We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow. We show that disaggregating data preprocessing has three key advantages for large-scale ML training jobs. First, the service can horizontally scale-out to right-size CPU/RAM host resources for data processing in each job, saving 32x training time and 26x cost, on average. Second, the service can share ephemeral preprocessed data results across jobs, to optimize CPU usage and reduce redundant computations. Finally, the service supports coordinated reads, a technique that avoids stragglers due to different input sizes in distributed training, reducing training time by 2.2x, on average. Our design is inspired by lessons learned from deploying tf.data service in production, including relaxing data visitation guarantees without impacting model accuracy

    No Provisioned Concurrency: Fast RDMA-codesigned Remote Fork for Serverless Computing

    Serverless platforms essentially face a tradeoff between container startup time and provisioned concurrency (i.e., cached instances), which is further exacerbated by the frequent need for remote container initialization. This paper presents MITOSIS, an operating system primitive that provides fast remote fork, which exploits a deep codesign of the OS kernel with RDMA. By leveraging the fast remote read capability of RDMA and partial state transfer across serverless containers, MITOSIS bridges the performance gap between local and remote container initialization. MITOSIS is the first to fork over 10,000 new containers from one instance across multiple machines within a second, while allowing the new containers to efficiently transfer the pre-materialized states of the forked one. We have implemented MITOSIS on Linux and integrated it with FN, a popular serverless platform. Under load spikes in real-world serverless workloads, MITOSIS reduces the function tail latency by 89% with orders of magnitude lower memory usage. For serverless workflows that require state transfer, MITOSIS improves execution time by 86%. Comment: To appear in OSDI'2

    BigDL: A Distributed Deep Learning Framework for Big Data

    This paper presents BigDL (a distributed deep learning framework for Apache Spark), which has been used by a variety of users in the industry for building deep learning applications on production big data platforms. It allows deep learning applications to run on the Apache Hadoop/Spark cluster so as to directly process the production data, and as a part of the end-to-end data analysis pipeline for deployment and management. Unlike existing deep learning frameworks, BigDL implements distributed, data parallel training directly on top of the functional compute model (with copy-on-write and coarse-grained operations) of Spark. We also share real-world experience and "war stories" of users that have adopted BigDL to address their challenges (i.e., how to easily build end-to-end data analysis and deep learning pipelines for their production data). Comment: In ACM Symposium of Cloud Computing conference (SoCC) 201

    Anomaly discovery in sensor data of the digital industry using parallel computing

    The article presents the results of case studies on anomaly discovery in sensor data from various applications of the digital industry. The time series considered come from sensors installed on machine parts and metallurgical equipment, and from temperature sensors in a smart building heating control system. The anomalies discovered in such data indicate abnormal situations, failures, malfunctions, or wear of the technological equipment. In this study, an anomaly is formalized as a range discord, namely a subsequence of the time series whose distance to its nearest neighbor is not less than a threshold prespecified by an analyst. The nearest neighbor of a given subsequence is a subsequence of the series that does not overlap with it and has the minimum distance to it. Discord discovery is performed with a parallel GPU algorithm previously developed by the author. To visualize the anomalies found, the article proposes a method for building a heatmap of discords of different lengths and an algorithm for selecting the most significant discords in that heatmap regardless of their lengths
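
The range discord definition in the abstract above can be illustrated with a short brute-force sketch. This is not the author's parallel GPU algorithm; it is a sequential illustration that assumes plain Euclidean distance between subsequences (the paper may use a different or normalized distance), and the names `range_discords`, `m`, and `r` are hypothetical.

```python
import math


def range_discords(series, m, r):
    """Brute-force range discord search (illustrative only).

    Returns a list of (start_index, nn_distance) for every subsequence of
    length m whose nearest non-overlapping neighbor lies at distance >= r.
    Assumption: plain Euclidean distance between subsequences.
    """
    n = len(series) - m + 1  # number of subsequences of length m
    subs = [series[i:i + m] for i in range(n)]
    found = []
    for i in range(n):
        nn_dist = math.inf
        for j in range(n):
            # Only non-overlapping subsequences qualify as neighbors.
            if abs(i - j) >= m:
                nn_dist = min(nn_dist, math.dist(subs[i], subs[j]))
        # Skip subsequences with no non-overlapping neighbor at all.
        if nn_dist != math.inf and nn_dist >= r:
            found.append((i, nn_dist))
    return found


# A periodic signal with one injected spike at position 10.
signal = [0.0, 1.0] * 10
signal[10] = 5.0
discords = range_discords(signal, m=2, r=1.0)
```

In this toy signal only the two subsequences that cover the spike qualify as discords; every other subsequence has an identical non-overlapping neighbor at distance zero. The quadratic cost of the double loop is what makes a parallel implementation attractive for real sensor data.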

    New Directions in Cloud Programming

    Nearly twenty years after the launch of AWS, it remains difficult for most developers to harness the enormous potential of the cloud. In this paper we lay out an agenda for a new generation of cloud programming research aimed at bringing research ideas to programmers in an evolutionary fashion. Key to our approach is a separation of distributed programs into a PACT of four facets: Program semantics, Availability, Consistency and Targets of optimization. We propose to migrate developers gradually to PACT programming by lifting familiar code into our more declarative level of abstraction. We then propose a multi-stage compiler that emits human-readable code at each stage that can be hand-tuned by developers seeking more control. Our agenda raises numerous research challenges across multiple areas including language design, query optimization, transactions, distributed consistency, compilers and program synthesis

    PEO-Store: Practical and Economical Oblivious Store with Peer-to-Peer Delegation

    The growing popularity of cloud storage has brought attention to the critical need for preventing information leakage from cloud access patterns. To this end, recent efforts have extended Oblivious RAM (ORAM) to the cloud environment in the form of Oblivious Store. However, its effectiveness is limited by two factors: it is impractical because it relies on probabilistic encryption with fake accesses to obfuscate the access pattern, and the security requirements of conventional obliviousness designs hinder the cloud's interest in improving storage utilization by removing redundant data across users. Thus, we propose a practical Oblivious Store, PEO-Store, which integrates the obliviousness property into the cloud while removing redundancy without compromising security. Unlike conventional schemes, PEO-Store randomly selects a delegate for each client to communicate with the cloud, breaking the mapping link between a valid access pattern sequence and a specific client. Each client encrypts their data and shares it with selected delegates, who act as intermediaries with the cloud provider. This design leverages non-interactive zero-knowledge-based redundancy detection, discrete logarithm problem-based key sharing, and secure time-based delivery proof to protect access pattern privacy and accurately identify and remove redundancy in the cloud. The theoretical proof demonstrates that the probability of linking a valid access pattern to a specific user is negligible in our design. Experimental results show that PEO-Store outperforms state-of-the-art methods, achieving up to 3 times higher average throughput and saving 74% of storage space

    Systems Support for Trusted Execution Environments

    Cloud computing has become a default choice for data processing by both large corporations and individuals due to its economy of scale and ease of system management. However, the question of trust and trustworthy computing inside cloud environments has long been neglected in practice and is further exacerbated by the proliferation of AI and its use for processing sensitive user data. Attempts to implement mechanisms for trustworthy computing in the cloud previously remained theoretical due to the lack of hardware primitives in commodity CPUs, while a combination of Secure Boot, TPMs, and virtualization has seen only limited adoption. The situation changed in 2016, when Intel introduced the Software Guard Extensions (SGX) and its enclaves to x86 ISA CPUs: for the first time, it became possible to build trustworthy applications relying on a commonly available technology. However, Intel SGX posed challenges to practitioners, who discovered the limitations of this technology, from the limited support of legacy applications and integration of SGX enclaves into existing systems, to the performance bottlenecks on communication, startup, and memory utilization. In this thesis, our goal is to enable trustworthy computing in the cloud by relying on the imperfect SGX primitives. To this end, we develop and evaluate solutions to issues stemming from the limited systems support of Intel SGX: we investigate the mechanisms for runtime support of POSIX applications with SCONE, an efficient SGX runtime library developed with the performance limitations of SGX in mind. We further develop this topic with FFQ, which is a concurrent queue for SCONE's asynchronous system call interface. ShieldBox is our study of the interplay of kernel bypass and trusted execution technologies for NFV, which also tackles the problem of low-latency clocks inside an enclave. The last two systems, Clemmys and T-Lease, are built on the more recent SGXv2 ISA extension. 
In Clemmys, SGXv2 allows us to significantly reduce the startup time of SGX-enabled functions inside a Function-as-a-Service platform. Finally, in T-Lease we solve the problem of trusted time by introducing a trusted lease primitive for distributed systems. We evaluate all of these systems and show that they can be practically utilized in existing systems with minimal overhead, and can be combined with both legacy systems and other SGX-based solutions. In the course of the thesis, we enable trusted computing for individual applications, high-performance network functions, and a distributed computing framework, making the vision of trusted cloud computing a reality