8,958 research outputs found

    Computing server power modeling in a data center: survey,taxonomy and performance evaluation

    Full text link
    Data centers are large scale, energy-hungry infrastructure serving the increasing computational demands as the world is becoming more connected in smart cities. The emergence of advanced technologies such as cloud-based services, internet of things (IoT) and big data analytics has augmented the growth of global data centers, leading to high energy consumption. This upsurge in energy consumption of the data centers not only incurs the issue of surging high cost (operational and maintenance) but also has an adverse effect on the environment. Dynamic power management in a data center environment requires the cognizance of the correlation between the system and hardware level performance counters and the power consumption. Power consumption modeling exhibits this correlation and is crucial in designing energy-efficient optimization strategies based on resource utilization. Several works in power modeling are proposed and used in the literature. However, these power models have been evaluated using different benchmarking applications, power measurement techniques and error calculation formula on different machines. In this work, we present a taxonomy and evaluation of 24 software-based power models using a unified environment, benchmarking applications, power measurement technique and error formula, with the aim of achieving an objective comparison. We use different servers architectures to assess the impact of heterogeneity on the models' comparison. The performance analysis of these models is elaborated in the paper

    PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems

    Full text link
    Machine Learning models are often composed of pipelines of transformations. While this design allows to efficiently execute single model components at training time, prediction serving has different requirements such as low latency, high throughput and graceful performance degradation under heavy load. Current prediction serving systems consider models as black boxes, whereby prediction-time-specific optimizations are ignored in favor of ease of deployment. In this paper, we present PRETZEL, a prediction serving system introducing a novel white box architecture enabling both end-to-end and multi-model optimizations. Using production-like model pipelines, our experiments show that PRETZEL is able to introduce performance improvements over different dimensions; compared to state-of-the-art approaches PRETZEL is on average able to reduce 99th percentile latency by 5.5x while reducing memory footprint by 25x, and increasing throughput by 4.7x.Comment: 16 pages, 14 figures, 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 201

    Artificial intelligence driven anomaly detection for big data systems

    Get PDF
    The main goal of this thesis is to contribute to the research on automated performance anomaly detection and interference prediction by implementing Artificial Intelligence (AI) solutions for complex distributed systems, especially for Big Data platforms within cloud computing environments. The late detection and manual resolutions of performance anomalies and system interference in Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose AI-based methodologies for anomaly detection and interference prediction tailored to Big Data and containerized batch platforms to better analyze system performance and effectively utilize computing resources within cloud environments. Therefore, new precise and efficient performance management methods are the key to handling performance anomalies and interference impacts to improve the efficiency of data center resources. The first part of this thesis contributes to performance anomaly detection for in-memory Big Data platforms. We examine the performance of Big Data platforms and justify our choice of selecting the in-memory Apache Spark platform. An artificial neural network-driven methodology is proposed to detect and classify performance anomalies for batch workloads based on the RDD characteristics and operating system monitoring metrics. Our method is evaluated against other popular machine learning algorithms (ML), as well as against four different monitoring datasets. The results prove that our proposed method outperforms other ML methods, typically achieving 98–99% F-scores. Moreover, we prove that a random start instant, a random duration, and overlapped anomalies do not significantly impact the performance of our proposed methodology. The second contribution addresses the challenge of anomaly identification within an in-memory streaming Big Data platform by investigating agile hybrid learning techniques. We develop TRACK (neural neTwoRk Anomaly deteCtion in sparK) and TRACK-Plus, two methods to efficiently train a class of machine learning models for performance anomaly detection using a fixed number of experiments. Our model revolves around using artificial neural networks with Bayesian Optimization (BO) to find the optimal training dataset size and configuration parameters to efficiently train the anomaly detection model to achieve high accuracy. The objective is to accelerate the search process for finding the size of the training dataset, optimizing neural network configurations, and improving the performance of anomaly classification. A validation based on several datasets from a real Apache Spark Streaming system is performed, demonstrating that the proposed methodology can efficiently identify performance anomalies, near-optimal configuration parameters, and a near-optimal training dataset size while reducing the number of experiments up to 75% compared with naïve anomaly detection training. The last contribution overcomes the challenges of predicting completion time of containerized batch jobs and proactively avoiding performance interference by introducing an automated prediction solution to estimate interference among colocated batch jobs within the same computing environment. An AI-driven model is implemented to predict the interference among batch jobs before it occurs within system. Our interference detection model can alleviate and estimate the task slowdown affected by the interference. This model assists the system operators in making an accurate decision to optimize job placement. Our model is agnostic to the business logic internal to each job. Instead, it is learned from system performance data by applying artificial neural networks to establish the completion time prediction of batch jobs within the cloud environments. We compare our model with three other baseline models (queueing-theoretic model, operational analysis, and an empirical method) on historical measurements of job completion time and CPU run-queue size (i.e., the number of active threads in the system). The proposed model captures multithreading, operating system scheduling, sleeping time, and job priorities. A validation based on 4500 experiments based on the DaCapo benchmarking suite was carried out, confirming the predictive efficiency and capabilities of the proposed model by achieving up to 10% MAPE compared with the other models.Open Acces

    A Survey on Automatic Parameter Tuning for Big Data Processing Systems

    Get PDF
    Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators grapple with understanding and tuning them to achieve good performance. We investigate existing approaches on parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.Peer reviewe

    Machine Learning at Microsoft with ML .NET

    Full text link
    Machine Learning is transitioning from an art and science into a technology available to every developer. In the near future, every application on every platform will incorporate trained models to encode data-based decisions that would be impossible for developers to author. This presents a significant engineering challenge, since currently data science and modeling are largely decoupled from standard software development processes. This separation makes incorporating machine learning capabilities inside applications unnecessarily costly and difficult, and furthermore discourage developers from embracing ML in first place. In this paper we present ML .NET, a framework developed at Microsoft over the last decade in response to the challenge of making it easy to ship machine learning models in large software applications. We present its architecture, and illuminate the application demands that shaped it. Specifically, we introduce DataView, the core data abstraction of ML .NET which allows it to capture full predictive pipelines efficiently and consistently across training and inference lifecycles. We close the paper with a surprisingly favorable performance study of ML .NET compared to more recent entrants, and a discussion of some lessons learned

    Energy Efficiency Models for Scientific Applications on Supercomputers

    Get PDF
    corecore