1,118 research outputs found

    FfDL : A Flexible Multi-tenant Deep Learning Platform

    Full text link
    Deep learning (DL) is becoming increasingly popular in several application domains and has made several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, etc. feasible and accurate. As a result, large scale on-premise and cloud-hosted deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale. This paper describes the design, implementation and our experiences with FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility and efficiency. We examine FfDL qualitatively through a retrospective look at the lessons learned from building, operating, and supporting FfDL; and quantitatively through a detailed empirical evaluation of FfDL, including the overheads introduced by the platform for various deep learning models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed including unanticipated faults, and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced.Comment: MIDDLEWARE 201

    5G Multi-access Edge Computing: Security, Dependability, and Performance

    Full text link
    The main innovation of the Fifth Generation (5G) of mobile networks is the ability to provide novel services with new and stricter requirements. One of the technologies that enable the new 5G services is the Multi-access Edge Computing (MEC). MEC is a system composed of multiple devices with computing and storage capabilities that are deployed at the edge of the network, i.e., close to the end users. MEC reduces latency and enables contextual information and real-time awareness of the local environment. MEC also allows cloud offloading and the reduction of traffic congestion. Performance is not the only requirement that the new 5G services have. New mission-critical applications also require high security and dependability. These three aspects (security, dependability, and performance) are rarely addressed together. This survey fills this gap and presents 5G MEC by addressing all these three aspects. First, we overview the background knowledge on MEC by referring to the current standardization efforts. Second, we individually present each aspect by introducing the related taxonomy (important for the not expert on the aspect), the state of the art, and the challenges on 5G MEC. Finally, we discuss the challenges of jointly addressing the three aspects.Comment: 33 pages, 11 figures, 15 tables. This paper is under review at IEEE Communications Surveys & Tutorials. Copyright IEEE 202

    Resource Allocation in Networking and Computing Systems: A Security and Dependability Perspective

    Get PDF
    In recent years, there has been a trend to integrate networking and computing systems, whose management is getting increasingly complex. Resource allocation is one of the crucial aspects of managing such systems and is affected by this increased complexity. Resource allocation strategies aim to effectively maximize performance, system utilization, and profit by considering virtualization technologies, heterogeneous resources, context awareness, and other features. In such complex scenario, security and dependability are vital concerns that need to be considered in future computing and networking systems in order to provide the future advanced services, such as mission-critical applications. This paper provides a comprehensive survey of existing literature that considers security and dependability for resource allocation in computing and networking systems. The current research works are categorized by considering the allocated type of resources for different technologies, scenarios, issues, attributes, and solutions. The paper presents the research works on resource allocation that includes security and dependability, both singularly and jointly. The future research directions on resource allocation are also discussed. The paper shows how there are only a few works that, even singularly, consider security and dependability in resource allocation in the future computing and networking systems and highlights the importance of jointly considering security and dependability and the need for intelligent, adaptive and robust solutions. This paper aims to help the researchers effectively consider security and dependability in future networking and computing systems.publishedVersio

    MLOps: A Review

    Full text link
    Recently, Machine Learning (ML) has become a widely accepted method for significant progress that is rapidly evolving. Since it employs computational methods to teach machines and produce acceptable answers. The significance of the Machine Learning Operations (MLOps) methods, which can provide acceptable answers for such problems, is examined in this study. To assist in the creation of software that is simple to use, the authors research MLOps methods. To choose the best tool structure for certain projects, the authors also assess the features and operability of various MLOps methods. A total of 22 papers were assessed that attempted to apply the MLOps idea. Finally, the authors admit the scarcity of fully effective MLOps methods based on which advancements can self-regulate by limiting human engagement

    Towards Runtime Verification via Event Stream Processing in Cloud Computing Infrastructures

    Get PDF
    Software bugs in cloud management systems often cause erratic behavior, hindering detection, and recovery of failures. As a consequence, the failures are not timely detected and notified, and can silently propagate through the system. To face these issues, we propose a lightweight approach to runtime verification, for monitoring and failure detection of cloud computing systems. We performed a preliminary evaluation of the proposed approach in the OpenStack cloud management platform, an “off-the-shelf” distributed system, showing that the approach can be applied with high failure detection coverage
    • …
    corecore