10 research outputs found

    SLO-aware Colocation of Data Center Tasks Based on Instantaneous Processor Requirements

    Full text link
    In a cloud data center, a single physical machine simultaneously executes dozens of highly heterogeneous tasks. Such colocation results in more efficient utilization of machines, but, when tasks' requirements exceed available resources, some of the tasks might be throttled down or preempted. We analyze version 2.1 of the Google cluster trace that shows short-term (1 second) task CPU usage. Contrary to the assumptions taken by many theoretical studies, we demonstrate that the empirical distributions do not follow any single distribution. However, high percentiles of the total processor usage (summed over at least 10 tasks) can be reasonably estimated by the Gaussian distribution. We use this result for a probabilistic fit test, called the Gaussian Percentile Approximation (GPA), for standard bin-packing algorithms. To check whether a new task will fit into a machine, GPA checks whether the resulting distribution's percentile corresponding to the requested service level objective, SLO is still below the machine's capacity. In our simulation experiments, GPA resulted in colocations exceeding the machines' capacity with a frequency similar to the requested SLO.Comment: Author's version of a paper published in ACM SoCC'1

    Characterizing Service Level Objectives for Cloud Services: Motivation of Short-Term Cache Allocation Performance Modeling

    Get PDF
    Service level objectives (SLOs) stipulate performance goals for cloud applications, microservices, and infrastructure. SLOs are widely used, in part, because system managers can tailor goals to their products, companies, and workloads. Systems research intended to support strong SLOs should target realistic performance goals used by system managers in the field. Evaluations conducted with uncommon SLO goals may not translate to real systems. Some textbooks discuss the structure of SLOs but (1) they only sketch SLO goals and (2) they use outdated examples. We mined real SLOs published on the web, extracted their goals and characterized them. Many web documents discuss SLOs loosely but few provide details and reflect real settings. Systematic literature review (SLR) prunes results and reduces bias by (1) modeling expected SLO structure and (2) detecting and removing outliers. We collected 75 SLOs where response time, query percentile and reporting period were specified. We used these SLOs to confirm and refute common perceptions. For example, we found few SLOs with response time guarantees below 10 ms for 90% or more queries. This reality bolsters perceptions that single digit SLOs face fundamental research challenges.This work was funded by NSF Grants 1749501 and 1350941.No embargoAcademic Major: Computer Science and EngineeringAcademic Major: Financ

    Optimizing egalitarian performance when colocating tasks with types for cloud data center resource management

    Get PDF
    International audienceIn data centers, up to dozens of tasks are colocated on a single physical machine. Machines are used more efficiently, but the performance of the tasks deteriorates, as the colocated tasks compete for shared resources. Since the tasks are heterogeneous, the resulting performance dependencies are complex. In our previous work [26], [27] we proposed a new combinatorial optimization model that uses two parameters of a task-its size and its type-to characterize how a task influences the performance of other tasks allocated to the same machine. In this paper, we study the egalitarian optimization goal: the aim is to optimize the performance of the worst-off task. This problem generalizes the classic makespan minimization on multiple processors (P||C max). We prove that polynomially-solvable variants of P||C max are NP-hard for this generalization, and that the problem is hard to approximate when the number of types is not constant. For a constant number of types, we propose a PTAS, a fast approximation algorithm, and a series of heuristics. We simulate the algorithms on instances derived from a trace of one of Google clusters. Compared with baseline algorithms solving P||C max , our proposed algorithms aware of the types of the jobs lead to significantly better tasks' performance. The notion of type enables us to extend standard combinatorial optimization methods to handle degradation of performance caused by colocation. Types add a layer of additional complexity. However, our results-approximation algorithms and good average-case performance-show that types can be handled efficiently

    Energy-aware service provisioning in P2P-assisted cloud ecosystems

    Get PDF
    Cotutela Universitat Politècnica de Catalunya i Instituto Tecnico de LisboaEnergy has been emerged as a first-class computing resource in modern systems. The trend has primarily led to the strong focus on reducing the energy consumption of data centers, coupled with the growing awareness of the adverse impact on the environment due to data centers. This has led to a strong focus on energy management for server class systems. In this work, we intend to address the energy-aware service provisioning in P2P-assisted cloud ecosystems, leveraging economics-inspired mechanisms. Toward this goal, we addressed a number of challenges. To frame an energy aware service provisioning mechanism in the P2P-assisted cloud, first, we need to compare the energy consumption of each individual service in P2P-cloud and data centers. However, in the procedure of decreasing the energy consumption of cloud services, we may be trapped with the performance violation. Therefore, we need to formulate a performance aware energy analysis metric, conceptualized across the service provisioning stack. We leverage this metric to derive energy analysis framework. Then, we sketch a framework to analyze the energy effectiveness in P2P-cloud and data center platforms to choose the right service platform, according to the performance and energy characteristics. This framework maps energy from the hardware oblivious, top level to the particular hardware setting in the bottom layer of the stack. Afterwards, we introduce an economics-inspired mechanism to increase the energy effectiveness in the P2P-assisted cloud platform as well as moving toward a greener ICT for ICT for a greener ecosystem.La energía se ha convertido en un recurso de computación de primera clase en los sistemas modernos. La tendencia ha dado lugar principalmente a un fuerte enfoque hacia la reducción del consumo de energía de los centros de datos, así como una creciente conciencia sobre los efectos ambientales negativos, producidos por los centros de datos. Esto ha llevado a un fuerte enfoque en la gestión de energía de los sistemas de tipo servidor. En este trabajo, se pretende hacer frente a la provisión de servicios de bajo consumo energético en los ecosistemas de la nube asistida por P2P, haciendo uso de mecanismos basados en economía. Con este objetivo, hemos abordado una serie de desafíos. Para instrumentar un mecanismo de servicio de aprovisionamiento de energía consciente en la nube asistida por P2P, en primer lugar, tenemos que comparar el consumo energético de cada servicio en la nube P2P y en los centros de datos. Sin embargo, en el procedimiento de disminuir el consumo de energía de los servicios en la nube, podemos quedar atrapados en el incumplimiento del rendimiento. Por lo tanto, tenemos que formular una métrica, sobre el rendimiento energético, a través de la pila de servicio de aprovisionamiento. Nos aprovechamos de esta métrica para derivar un marco de análisis de energía. Luego, se esboza un marco para analizar la eficacia energética en la nube asistida por P2P y en la plataforma de centros de datos para elegir la plataforma de servicios adecuada, de acuerdo con las características de rendimiento y energía. Este marco mapea la energía desde el alto nivel independiente del hardware a la configuración de hardware particular en la capa inferior de la pila. Posteriormente, se introduce un mecanismo basado en economía para aumentar la eficacia energética en la plataforma en la nube asistida por P2P, así como avanzar hacia unas TIC más verdes, para las TIC en un ecosistema más verde.Postprint (published version

    Proactive Interference-aware Resource Management in Deep Learning Training Cluster

    Get PDF
    Deep Learning (DL) applications are growing at an unprecedented rate across many domains, ranging from weather prediction, map navigation to medical imaging. However, training these deep learning models in large-scale compute clusters face substantial challenges in terms of low cluster resource utilisation and high job waiting time. State-of-the-art DL cluster resource managers are needed to increase GPU utilisation and maximise throughput. While co-locating DL jobs within the same GPU has been shown to be an effective means towards achieving this, co-location subsequently incurs performance interference resulting in job slowdown. We argue that effective workload placement can minimise DL cluster interference at scheduling runtime by understanding the DL workload characteristics and their respective hardware resource consumption. However, existing DL cluster resource managers reserve isolated GPUs to perform online profiling to directly measure GPU utilisation and kernel patterns for each unique submitted job. Such a feedback-based reactive approach results in additional waiting times as well as reduced cluster resource efficiency and availability. In this thesis, we propose Horus: an interference-aware and prediction-based DL cluster resource manager. Through empirically studying a series of microbenchmarks and DL workload co-location combinations across heterogeneous GPU hardware, we demonstrate the negative effects of performance interference when colocating DL workload, and identify GPU utilisation as a general proxy metric to determine good placement decisions. From these findings, we design Horus, which in contrast to existing approaches, proactively predicts GPU utilisation of heterogeneous DL workload extrapolated from the DL model computation graph features when performing placement decisions, removing the need for online profiling and isolated reserved GPUs. By conducting empirical experimentation within a medium-scale DL cluster as well as a large-scale trace-driven simulation of a production system, we demonstrate Horus improves cluster GPU utilisation, reduces cluster makespan and waiting time, and can scale to operate within hundreds of machines

    A study of space station needs, attributes, and architectural options, volume 2, technical. Book 2: Mission implementation concepts

    Get PDF
    Space station systems characteristics and architecture are described. A manned space station operational analysis is performed to determine crew size, crew task complexity and time tables, and crew equipment to support the definition of systems and subsystems concepts. This analysis is used to select and evaluate the architectural options for development

    Safety and Reliability - Safe Societies in a Changing World

    Get PDF
    The contributions cover a wide range of methodologies and application areas for safety and reliability that contribute to safe societies in a changing world. These methodologies and applications include: - foundations of risk and reliability assessment and management - mathematical methods in reliability and safety - risk assessment - risk management - system reliability - uncertainty analysis - digitalization and big data - prognostics and system health management - occupational safety - accident and incident modeling - maintenance modeling and applications - simulation for safety and reliability analysis - dynamic risk and barrier management - organizational factors and safety culture - human factors and human reliability - resilience engineering - structural reliability - natural hazards - security - economic analysis in risk managemen
    corecore