1,118 research outputs found

FfDL : A Flexible Multi-tenant Deep Learning Platform

Author: Egwutuoha Ifeanyi P.
Google Inc.
Hermann Jeremy
Jia Yangqing
Kraska Tim
Pan Xinghao
Park Jun Woo
Tantawi Asser N.
Venkataraman Shivaram
Wang Chao
Xiao Wencong
Zaharia Matei
Zhang Haoyu
Zhang K.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 14/09/2019

Deep learning (DL) is becoming increasingly popular in several application domains and has made several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, etc. feasible and accurate. As a result, large scale on-premise and cloud-hosted deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale. This paper describes the design, implementation and our experiences with FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility and efficiency. We examine FfDL qualitatively through a retrospective look at the lessons learned from building, operating, and supporting FfDL; and quantitatively through a detailed empirical evaluation of FfDL, including the overheads introduced by the platform for various deep learning models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed including unanticipated faults, and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced. Comment: MIDDLEWARE 201

arXiv.org e-Print Archive

Crossref

5G Multi-access Edge Computing: Security, Dependability, and Performance

Author: Garroppo Rosario G.
Nencioni Gianfranco
Olimid Ruxandra F.
Publication date: 28/07/2021

The main innovation of the Fifth Generation (5G) of mobile networks is the ability to provide novel services with new and stricter requirements. One of the technologies that enable the new 5G services is Multi-access Edge Computing (MEC). MEC is a system composed of multiple devices with computing and storage capabilities that are deployed at the edge of the network, i.e., close to the end users. MEC reduces latency and enables contextual information and real-time awareness of the local environment. MEC also allows cloud offloading and the reduction of traffic congestion. Performance is not the only requirement that the new 5G services have. New mission-critical applications also require high security and dependability. These three aspects (security, dependability, and performance) are rarely addressed together. This survey fills this gap and presents 5G MEC by addressing all three aspects. First, we give an overview of the background knowledge on MEC by referring to the current standardization efforts. Second, we individually present each aspect by introducing the related taxonomy (important for readers who are not experts on the aspect), the state of the art, and the challenges on 5G MEC. Finally, we discuss the challenges of jointly addressing the three aspects. Comment: 33 pages, 11 figures, 15 tables. This paper is under review at IEEE Communications Surveys & Tutorials. Copyright IEEE 202

arXiv.org e-Print Archive

Resource Allocation in Networking and Computing Systems: A Security and Dependability Perspective

Author: Khan Md Muhidul Islam
Nencioni Gianfranco
Publication venue: IEEE
Publication date: 01/01/2023

In recent years, there has been a trend to integrate networking and computing systems, whose management is getting increasingly complex. Resource allocation is one of the crucial aspects of managing such systems and is affected by this increased complexity. Resource allocation strategies aim to effectively maximize performance, system utilization, and profit by considering virtualization technologies, heterogeneous resources, context awareness, and other features. In such a complex scenario, security and dependability are vital concerns that need to be considered in future computing and networking systems in order to provide future advanced services, such as mission-critical applications. This paper provides a comprehensive survey of existing literature that considers security and dependability for resource allocation in computing and networking systems. The current research works are categorized by considering the type of allocated resources for different technologies, scenarios, issues, attributes, and solutions. The paper presents the research works on resource allocation that include security and dependability, both singularly and jointly. The future research directions on resource allocation are also discussed. The paper shows that only a few works consider security and dependability, even singularly, in resource allocation in future computing and networking systems, and it highlights the importance of jointly considering security and dependability and the need for intelligent, adaptive and robust solutions. This paper aims to help researchers effectively consider security and dependability in future networking and computing systems.

UiS Brage

MLOps: A Review

Author: Kashyap Gautam Siddharth
Saxena Parag
Wazir Samar
Publication date: 19/08/2023

Recently, Machine Learning (ML) has become a widely accepted and rapidly evolving method for achieving significant progress, since it employs computational methods to teach machines and produce acceptable answers. The significance of the Machine Learning Operations (MLOps) methods, which can provide acceptable answers for such problems, is examined in this study. To assist in the creation of software that is simple to use, the authors research MLOps methods. To choose the best tool structure for certain projects, the authors also assess the features and operability of various MLOps methods. A total of 22 papers that attempted to apply the MLOps idea were assessed. Finally, the authors acknowledge the scarcity of fully effective MLOps methods based on which advancements can self-regulate by limiting human engagement.

arXiv.org e-Print Archive

Artificial Intelligence as a Service – Classification and Research Directions

Author: Bayer Calvin
Lins Sebastian
Pandl Konstantin D.
Sunyaev Ali
Teigeler Heiner
Thiebes Scott
Publication venue: Springer
Publication date: 14/07/2021

KITopen

Towards Runtime Verification via Event Stream Processing in Cloud Computing Infrastructures

Author: Cotroneo D.
De Simone L.
Liguori P.
Natella R.
Scibelli A.
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2021

Software bugs in cloud management systems often cause erratic behavior, hindering the detection and recovery of failures. As a consequence, failures are not detected and reported in a timely manner, and can silently propagate through the system. To address these issues, we propose a lightweight approach to runtime verification for monitoring and failure detection in cloud computing systems. We performed a preliminary evaluation of the proposed approach in the OpenStack cloud management platform, an “off-the-shelf” distributed system, showing that the approach can be applied with high failure detection coverage.

Archivio della ricerca - Università degli studi di Napoli Federico II