
    Datacenter Traffic Control: Understanding Techniques and Trade-offs

    Datacenters provide cost-effective and flexible access to the scalable compute and storage resources that today's cloud computing needs. A typical datacenter comprises thousands of servers connected by a large network and is usually managed by a single operator. To provide quality access to the variety of applications and services hosted on datacenters and to maximize performance, datacenter networks must be used effectively and efficiently. Datacenter traffic is often a mix of several classes with different priorities and requirements, including user-generated interactive traffic, traffic with deadlines, and long-running traffic. To this end, custom transport protocols and traffic management techniques have been developed to improve datacenter network performance. In this tutorial paper, we review the general architecture of datacenter networks, the various topologies proposed for them, their traffic properties, and the general challenges and objectives of traffic control in datacenters. The purpose of this paper is to bring out the important characteristics of traffic control in datacenters, not to survey all existing solutions (which is virtually impossible given the massive body of existing research). We hope to provide readers with a broad view of the options and factors to consider when evaluating traffic control mechanisms. We discuss various characteristics of datacenter traffic control, including management schemes, transmission control, traffic shaping, prioritization, load balancing, multipathing, and traffic scheduling. Next, we point to several open challenges as well as new and interesting networking paradigms. Finally, we briefly review inter-datacenter networks, which connect geographically dispersed datacenters, have been receiving increasing attention recently, and pose interesting and novel research problems. Accepted for publication in IEEE Communications Surveys and Tutorials.
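    The prioritization and scheduling techniques this survey covers can be illustrated with a minimal sketch. The flow names, priorities, and sizes below are invented for illustration and are not from the paper; the sketch shows strict-priority scheduling over the three traffic classes the abstract mentions.

    ```python
    import heapq

    # Strict-priority scheduling sketch: lower number = higher priority.
    # The flow classes mirror the traffic mix described above: interactive,
    # deadline-bound, and long-running background transfers.
    flows = [
        ("background-backup", 2, 10_000),  # (name, priority, size in KB)
        ("user-search-query", 0, 4),
        ("deadline-analytics", 1, 500),
    ]

    # Heap keyed on (priority, size) so short, urgent flows are dequeued
    # first -- a common objective in datacenter traffic schedulers.
    heap = [(prio, size, name) for name, prio, size in flows]
    heapq.heapify(heap)

    order = [heapq.heappop(heap)[2] for _ in range(len(heap))]
    print(order)
    ```

    Real datacenter schedulers refine this idea with preemption, rate limiting, and per-class queues, but the ordering objective is the same: serve latency-sensitive traffic ahead of bulk transfers.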

    Quantitative methods for data-driven reliability optimization of engineered systems

    Particle accelerators, such as the Large Hadron Collider at CERN, are among the largest and most complex engineered systems to date. Future generations of particle accelerators are expected to grow in size, complexity, and cost. Among many other obstacles, this introduces unprecedented reliability challenges and requires new reliability optimization approaches. With the increasing digitalization of technical infrastructures, the rate and granularity of operational data collection are growing rapidly. These data contain valuable information for system reliability optimization, which can be extracted and processed with data-science methods and algorithms. However, many existing data-driven reliability optimization methods fail to exploit these data because they make overly simplistic assumptions about system behavior, do not consider organizational contexts for cost-effectiveness, and build on specific monitoring data that are too expensive to record. To address these limitations in realistic scenarios, a tailored methodology based on CRISP-DM (CRoss-Industry Standard Process for Data Mining) is proposed for developing data-driven reliability optimization methods. For three realistic scenarios, the developed methods use the available operational data to learn interpretable or explainable failure models from which permanent and generally applicable reliability improvements can be derived. Firstly, novel explainable deep learning methods predict future alarms accurately from a few logged alarm examples and support root-cause identification. Secondly, novel parametric reliability models incorporate expert knowledge for an improved quantification of failure behavior across a fleet of systems with heterogeneous operating conditions, and yield optimal operational strategies for novel usage scenarios. Thirdly, Bayesian models trained on data from a range of comparable systems predict field reliability accurately and reveal the influence of non-technical factors on reliability. An evaluation of the methods applied to the three scenarios confirms that the tailored CRISP-DM methodology advances the state of the art in data-driven reliability optimization and overcomes many existing limitations. However, the quality of the collected operational data remains crucial for the success of such approaches. Hence, adaptations of routine data collection procedures are suggested to enhance data quality and to increase the success rate of reliability optimization projects. With the developed methods and findings, future generations of particle accelerators can be constructed and operated cost-effectively, ensuring high levels of reliability despite growing system complexity.
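    The parametric reliability modelling mentioned above can be sketched in miniature. This is not the thesis's actual model: it assumes a simple exponential (constant-hazard) failure model and uses invented failure times, fitting the rate by maximum likelihood.

    ```python
    import math

    # Maximum-likelihood fit of an exponential failure model to logged
    # times-to-failure (hours). The data are invented for illustration.
    times_to_failure = [120.0, 340.0, 95.0, 410.0, 230.0]

    # MLE for the exponential rate: lambda_hat = n / sum(t_i)
    rate = len(times_to_failure) / sum(times_to_failure)

    def reliability(t: float) -> float:
        """Probability of surviving beyond time t: R(t) = exp(-rate * t)."""
        return math.exp(-rate * t)

    mtbf = 1.0 / rate  # for the exponential model, MTBF = 1 / lambda
    print(f"MTBF = {mtbf:.0f} h, R(100 h) = {reliability(100):.2f}")
    ```

    The thesis's models go further, e.g. by encoding expert knowledge and heterogeneous operating conditions, but the core step is the same: fit a parametric failure distribution to operational data, then read off quantities such as the survival probability.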

    CERN openlab Whitepaper on Future IT Challenges in Scientific Research

    This whitepaper describes the major IT challenges in scientific research at CERN and at several other European and international research laboratories and projects. Each challenge is exemplified by a set of concrete use cases drawn from the requirements of large-scale scientific programs. The paper is based on contributions from many researchers and IT experts of the participating laboratories, as well as input from the existing CERN openlab industrial sponsors. The views expressed in this document are those of the individual contributors and do not necessarily reflect the views of their organisations and/or affiliates.

    Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks

    The success of modern applications depends on the insights they extract from their data repositories. Data repositories for such applications already exceed exabytes and are rapidly growing, as they collect data from varied sources: web applications, mobile phones, sensors, and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks.

    A Literature Survey on Resource Management Techniques, Issues and Challenges in Cloud Computing

    Cloud computing is large-scale distributed computing that provides on-demand services to clients. Cloud clients use web browsers, mobile apps, thin clients, or terminal emulators to request and control their cloud resources at any time and from anywhere over the network. As more companies shift their data to the cloud and more people become aware of the advantages of storing data there, the growing number of cloud computing infrastructures and the large amounts of data they hold make management increasingly complex for cloud providers. We survey state-of-the-art resource management techniques for IaaS (Infrastructure as a Service) in cloud computing. We then put forward the major issues in deploying cloud infrastructure that must be addressed to avoid poor service delivery.

    FfDL: A Flexible Multi-tenant Deep Learning Platform

    Deep learning (DL) is becoming increasingly popular in several application domains and has made feasible and accurate many new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, and more. As a result, large-scale on-premise and cloud-hosted deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage, and execute DL training jobs at scale. This paper describes the design and implementation of, and our experiences with, FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility, and efficiency. We examine FfDL qualitatively, through a retrospective look at the lessons learned from building, operating, and supporting it, and quantitatively, through a detailed empirical evaluation covering the overheads introduced by the platform for various deep learning models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed (including unanticipated ones), and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced. Comment: MIDDLEWARE 201