FfDL : A Flexible Multi-tenant Deep Learning Platform
Deep learning (DL) is becoming increasingly popular in several application
domains and has made several new application features involving computer
vision, speech recognition and synthesis, self-driving automobiles, drug
design, etc. feasible and accurate. As a result, large scale on-premise and
cloud-hosted deep learning platforms have become essential infrastructure in
many organizations. These systems accept, schedule, manage and execute DL
training jobs at scale.
This paper describes the design, implementation and our experiences with
FfDL, a DL platform used at IBM. We describe how our design balances
dependability with scalability, elasticity, flexibility and efficiency. We
examine FfDL qualitatively through a retrospective look at the lessons learned
from building, operating, and supporting FfDL; and quantitatively through a
detailed empirical evaluation of FfDL, including the overheads introduced by
the platform for various deep learning models, the load and performance
observed in a real case study using FfDL within our organization, the frequency
of various faults observed including unanticipated faults, and experiments
demonstrating the benefits of various scheduling policies. FfDL has been
open-sourced.
Comment: MIDDLEWARE 201
5G Multi-access Edge Computing: Security, Dependability, and Performance
The main innovation of the Fifth Generation (5G) of mobile networks is the
ability to provide novel services with new and stricter requirements. One of
the technologies that enable the new 5G services is Multi-access Edge
Computing (MEC). MEC is a system composed of multiple devices with computing
and storage capabilities that are deployed at the edge of the network, i.e.,
close to the end users. MEC reduces latency and enables contextual information
and real-time awareness of the local environment. MEC also allows cloud
offloading and the reduction of traffic congestion. Performance is not the only
requirement that the new 5G services have. New mission-critical applications
also require high security and dependability. These three aspects (security,
dependability, and performance) are rarely addressed together. This survey
fills this gap and presents 5G MEC by addressing all these three aspects.
First, we overview the background knowledge on MEC by referring to the current
standardization efforts. Second, we individually present each aspect by
introducing the related taxonomy (important for readers who are not experts in that aspect),
the state of the art, and the challenges on 5G MEC. Finally, we discuss the
challenges of jointly addressing the three aspects.
Comment: 33 pages, 11 figures, 15 tables. This paper is under review at IEEE
Communications Surveys & Tutorials. Copyright IEEE 202
Resource Allocation in Networking and Computing Systems: A Security and Dependability Perspective
In recent years, there has been a trend to integrate networking and computing systems, whose management is getting increasingly complex. Resource allocation is one of the crucial aspects of managing such systems and is affected by this increased complexity. Resource allocation strategies aim to effectively maximize performance, system utilization, and profit by considering virtualization technologies, heterogeneous resources, context awareness, and other features. In such a complex scenario, security and dependability are vital concerns that need to be considered in future computing and networking systems in order to provide future advanced services, such as mission-critical applications. This paper provides a comprehensive survey of existing literature that considers security and dependability for resource allocation in computing and networking systems. The current research works are categorized by considering the type of allocated resources for different technologies, scenarios, issues, attributes, and solutions. The paper presents the research works on resource allocation that include security and dependability, both singularly and jointly. The future research directions on resource allocation are also discussed. The paper shows that only a few works consider security and dependability, even singularly, in resource allocation for future computing and networking systems, and it highlights the importance of jointly considering security and dependability and the need for intelligent, adaptive and robust solutions. This paper aims to help researchers effectively consider security and dependability in future networking and computing systems.
MLOps: A Review
Recently, Machine Learning (ML) has become a widely accepted method for
significant progress that is rapidly evolving, since it employs computational
methods to teach machines and produce acceptable answers. The significance of
the Machine Learning Operations (MLOps) methods, which can provide acceptable
answers for such problems, is examined in this study. To assist in the creation
of software that is simple to use, the authors research MLOps methods. To
choose the best tool structure for certain projects, the authors also assess
the features and operability of various MLOps methods. A total of 22 papers
were assessed that attempted to apply the MLOps idea. Finally, the authors
admit the scarcity of fully effective MLOps methods based on which advancements
can self-regulate by limiting human engagement.
Towards Runtime Verification via Event Stream Processing in Cloud Computing Infrastructures
Software bugs in cloud management systems often cause erratic behavior, hindering the detection and recovery of failures. As a consequence, failures are not detected and reported in a timely manner, and can silently propagate through the system. To address these issues, we propose a lightweight approach to runtime verification for monitoring and failure detection in cloud computing systems. We performed a preliminary evaluation of the proposed approach in the OpenStack cloud management platform, an “off-the-shelf” distributed system, showing that the approach can be applied with high failure detection coverage.