15 research outputs found

    Big data deployment in containerized infrastructures through the interconnection of network namespaces

    Big Data applications tackle the challenge of handling large streams of data fast. Their performance depends not only on the data framework implementation and the underlying hardware, but also on the deployment scheme and its potential for fast scaling. Consequently, several efforts have focused on easing the deployment of Big Data applications, notably through containerization. This technology emerged to bring multitenancy and multiprocessing beyond clusters, providing high deployment flexibility through lightweight container images. Recent studies have focused mostly on Docker containers. This article, in contrast, is interested in the more recent Singularity containers, as they provide stronger security and support high-performance computing (HPC) environments, and can therefore let Big Data applications benefit from the specialized hardware of HPC. Singularity 2.x, however, does not isolate network resources as required by most Big Data components. Singularity 3.x allows allocating isolated network resources to each container, but interconnecting those containers requires a nontrivial amount of configuration effort. In this context, this article makes a functional contribution in the form of a deployment scheme based on the interconnection of network namespaces, through underlay and overlay networking approaches, to make Big Data applications easily deployable inside Singularity containers. We provide a detailed account of our deployment scheme under both interconnection approaches in the form of a “how-to-do-it” report, and we evaluate it by comparing three Hadoop-based Big Data applications running on a bare-metal infrastructure and in scenarios involving Singularity and Docker instances.
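    As a rough illustration of the underlay-style interconnection of network namespaces the abstract refers to, the sketch below creates a named namespace and attaches it to the host network through a macvlan sub-interface. It is a minimal sketch under stated assumptions, not the authors' exact procedure: the namespace name, interface names, and addresses are placeholders.

```python
import subprocess

def sh(cmd):
    """Run a shell command and fail loudly if it errors."""
    subprocess.run(cmd, shell=True, check=True)

# Placeholder names: "bd0" namespace, "eth0" host NIC, 10.0.0.0/24 underlay subnet.
NS, HOST_IF, ADDR = "bd0", "eth0", "10.0.0.11/24"

# 1. Create a named network namespace for the container instance.
sh(f"ip netns add {NS}")

# 2. Underlay approach: attach a macvlan sub-interface of the host NIC
#    so the namespace is directly reachable on the physical network.
sh(f"ip link add mv-{NS} link {HOST_IF} type macvlan mode bridge")
sh(f"ip link set mv-{NS} netns {NS}")

# 3. Bring the interfaces up and assign an address inside the namespace.
sh(f"ip netns exec {NS} ip addr add {ADDR} dev mv-{NS}")
sh(f"ip netns exec {NS} ip link set mv-{NS} up")
sh(f"ip netns exec {NS} ip link set lo up")

# A Singularity instance could then be launched inside this namespace, e.g.
# `ip netns exec bd0 singularity instance start hadoop.sif worker1`
# (image and instance names are illustrative).
```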

    Scanflow-K8s: agent-based framework for autonomic management and supervision of ML workflows in Kubernetes clusters

    Machine Learning (ML) projects currently rely heavily on workflows composed of reproducible steps and executed as containerized pipelines to build or deploy ML models efficiently, because of the flexibility, portability, and fast delivery they bring to the ML life-cycle. However, deployed models need to be watched and constantly managed, supervised, and debugged to guarantee their availability, validity, and robustness in unexpected situations. Therefore, containerized ML workflows would benefit from leveraging flexible and diverse autonomic capabilities. This work presents an architecture for autonomic ML workflows with abilities for multi-layered control, based on an agent-based approach that enables autonomic management and supervision of ML workflows at the application layer and the infrastructure layer (by collaborating with the orchestrator). We redesign the Scanflow ML framework to support such a multi-agent approach by using triggers, primitives, and strategies. We also implement a practical platform, called Scanflow-K8s, that enables autonomic ML workflows on Kubernetes clusters based on the Scanflow agents. The MNIST image classification and MLPerf ImageNet classification benchmarks are used as case studies to show the capabilities of Scanflow-K8s under different scenarios. The experimental results demonstrate the feasibility and effectiveness of our proposed agent approach and of the Scanflow-K8s platform for the autonomic management of ML workflows in Kubernetes clusters at multiple layers. This work was supported by Lenovo as part of the Lenovo-BSC 2020 collaboration agreement, by the Spanish Government under contract PID2019-107255GB-C22, and by the Generalitat de Catalunya under contract 2017-SGR-1414 and under grant 2020 FI-B 00257.
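    The abstract describes agents built from triggers, primitives, and strategies. A minimal sketch of such a supervision loop is shown below; the metric source, threshold, and scaling primitive are hypothetical and do not reflect the actual Scanflow-K8s API.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical building blocks mirroring the "triggers, primitives, strategies"
# vocabulary of the abstract; names and thresholds are illustrative only.

@dataclass
class Trigger:
    """Fires when an observed metric violates a condition."""
    metric: Callable[[], float]
    predicate: Callable[[float], bool]

@dataclass
class Strategy:
    """Maps a fired trigger to a corrective primitive."""
    trigger: Trigger
    primitive: Callable[[], None]

def scale_up_serving():
    # Placeholder primitive: in a real platform this would patch the
    # Kubernetes Deployment replica count through the API server.
    print("scaling model-serving replicas up")

def serving_latency_p95() -> float:
    # Placeholder metric source (e.g. a Prometheus query in practice).
    return 0.42

supervisor = Strategy(
    trigger=Trigger(metric=serving_latency_p95, predicate=lambda v: v > 0.3),
    primitive=scale_up_serving,
)

def agent_loop(strategies: List[Strategy], period_s: float = 30.0):
    """Simple monitor-analyze-plan-execute loop run by an agent."""
    while True:
        for s in strategies:
            if s.trigger.predicate(s.trigger.metric()):
                s.primitive()
        time.sleep(period_s)

if __name__ == "__main__":
    agent_loop([supervisor])
```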

    Human-in-the-loop online multi-agent approach to increase trustworthiness in ML models through trust scores and data augmentation

    Increasing a ML model's accuracy is not enough; we must also increase its trustworthiness. This is an important step towards building resilient AI systems for safety-critical applications such as automotive, finance, and healthcare. For that purpose, we propose a multi-agent system that combines both machine and human agents. In this system, a checker agent calculates a trust score for each instance (which penalizes overconfidence in predictions) using an agreement-based method and ranks the instances; an improver agent then filters the anomalous instances based on a human rule-based procedure (which is considered safe), obtains the human labels, applies geometric data augmentation, and retrains with the augmented data using transfer learning. We evaluate the system on corrupted versions of the MNIST and FashionMNIST datasets and obtain an improvement in accuracy and trust score with just a few additional labels compared to a baseline approach. This work was supported by Lenovo as part of the Lenovo-BSC 2020 collaboration agreement, by the Spanish Government under contracts PID2019-107255GB-C21 and PID2019-107255GB-C22, and by the Generalitat de Catalunya under contract 2017-SGR-1414 and under grant 2020 FI-B 00257.
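    As a rough illustration of an agreement-based trust score that penalizes overconfident predictions (not necessarily the paper's exact formula), one could score instances by how well an ensemble agrees with its own averaged prediction:

```python
import numpy as np

def agreement_trust_scores(prob_matrix: np.ndarray) -> np.ndarray:
    """
    Illustrative agreement-based trust score.

    prob_matrix: shape (n_models, n_instances, n_classes) with softmax outputs
                 from an ensemble of models.

    The score rewards instances on which the ensemble members agree and
    penalizes confident predictions that are not backed by agreement.
    """
    n_models, n_instances, _ = prob_matrix.shape

    # Per-model hard predictions and the ensemble's majority label.
    hard = prob_matrix.argmax(axis=2)                 # (n_models, n_instances)
    mean_probs = prob_matrix.mean(axis=0)             # (n_instances, n_classes)
    majority = mean_probs.argmax(axis=1)               # (n_instances,)

    # Fraction of models agreeing with the majority label.
    agreement = (hard == majority).mean(axis=0)

    # Average confidence assigned to the majority label.
    confidence = mean_probs[np.arange(n_instances), majority]

    # Overconfidence penalty: confidence in excess of agreement.
    overconfidence = np.clip(confidence - agreement, 0.0, None)

    return agreement - overconfidence                   # higher = more trusted

# Usage sketch: rank instances by ascending trust and send the lowest-scoring
# (most suspicious) ones to the human-in-the-loop improver agent.
# ranked = np.argsort(agreement_trust_scores(probs))
```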

    Scanflow: an end-to-end agent-based autonomic ML workflow manager for clusters

    Machine Learning (ML) is more than just training models; the whole life-cycle must be considered. Once deployed, a ML model needs to be constantly managed, supervised and debugged to guarantee its availability, validity and robustness in dynamic contexts. This demonstration presents an agent-based ML workflow manager called Scanflow, which enables autonomic management and supervision of the end-to-end life-cycle of ML workflows on distributed clusters. A case study on a MNIST project shows that different teams can collaborate using Scanflow within a ML project at different phases, and that the agents are effective at maintaining the model accuracy and the throughput of model serving while running in production. This work was partially supported by Lenovo as part of the Lenovo-BSC 2020 collaboration agreement, by the Spanish Government under contract PID2019-107255GB-C22, and by the Generalitat de Catalunya under contract 2017-SGR-1414 and under grant 2020 FI-B 00257.

    Analysis of a new intra-disk redundancy scheme for high-reliability RAID storage systems in the presence of unrecoverable errors

    Today’s data storage systems are increasingly adopting low-cost disk drives that have higher capacity but lower reliability, leading to more frequent rebuilds and to a higher risk of unrecoverable media errors. We propose a new XOR-based intra-disk redundancy scheme, called interleaved parity check (IPC), to enhance the reliability of RAID systems while incurring only negligible I/O performance degradation. The proposed scheme introduces an additional level of redundancy inside each disk, on top of the RAID redundancy across multiple disks. The RAID parity provides protection against disk failures, while the proposed scheme aims to protect against media-related unrecoverable errors. We develop a new model capturing the effect of correlated unrecoverable sector errors and subsequently use it to analyze the proposed scheme as well as traditional redundancy schemes based on Reed-Solomon (RS) codes and single-parity-check (SPC) codes. We derive closed-form expressions for the mean time to data loss (MTTDL) of RAID 5 and RAID 6 systems in the presence of unrecoverable errors and disk failures. We then combine these results into a comprehensive characterization of the reliability of RAID systems that incorporate the proposed IPC redundancy scheme. Our results show that, in the practical case of correlated errors, the proposed scheme provides the same reliability as the optimum, albeit more complex, RS coding scheme. Finally, the throughput performance of incorporating the intra-disk redundancy in various RAID systems is evaluated by means of event-driven simulations. A detailed description of these contributions is given in [1].
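    To illustrate the idea behind interleaved intra-disk parity, the sketch below computes one XOR parity sector per interleave inside a segment of data sectors, so that a burst of consecutive bad sectors falls into distinct parity groups and each sector remains recoverable. The segment size and interleaving depth are illustrative placeholders, not the parameters analyzed in the paper.

```python
import numpy as np

SECTOR_BYTES = 512      # illustrative sector size
DATA_SECTORS = 16       # data sectors per intra-disk segment (illustrative)
INTERLEAVES  = 4        # number of parity sectors per segment (illustrative)

def ipc_parities(segment: np.ndarray) -> np.ndarray:
    """
    Illustrative interleaved parity: the k-th parity sector is the XOR of
    every INTERLEAVES-th data sector starting at offset k.

    segment: uint8 array of shape (DATA_SECTORS, SECTOR_BYTES)
    returns: uint8 array of shape (INTERLEAVES, SECTOR_BYTES)
    """
    parities = np.zeros((INTERLEAVES, SECTOR_BYTES), dtype=np.uint8)
    for i in range(DATA_SECTORS):
        parities[i % INTERLEAVES] ^= segment[i]
    return parities

def recover_sector(segment: np.ndarray, parities: np.ndarray, lost: int) -> np.ndarray:
    """Rebuild one lost data sector from its interleave's surviving sectors."""
    rebuilt = parities[lost % INTERLEAVES].copy()
    for i in range(DATA_SECTORS):
        if i != lost and i % INTERLEAVES == lost % INTERLEAVES:
            rebuilt ^= segment[i]
    return rebuilt
```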

    Advanced Computer-Based Education on the World Wide Web

    The purpose of this paper is to show that CBE is not a new topic that appeared with the WWW, but that the WWW offers a new set of software tools that can extend the usefulness of existing CBE tools. Since the popular acceptance of the WWW, there has been a flood of "educational" material appearing on it, and there have also been people proclaiming the WWW as a remarkable educational resource. But how one defines "educational" can be a tricky and touchy subject. Some of the material is educational in the way some magazines are educational: it contains useful information in an easy-to-digest form. Some of the material is educational in the manner that a reference book is educational, where one can learn about specific topics by looking up the subject and reading about it. Some of the material is educational in the same way that a written tutorial is educational: if one starts at the beginning and works through the given example problems, then one should learn the subject. These levels continue up to material that is CBE in the classical sense. CBE, "in the classical sense," can be thought of as attempting to include one-to-one interaction between the student and the computer in much the same way the student would interact with an individual instructor. This interaction takes the form of queries and responses flowing in both directions: the student can ask for help, look up definitions, etc., by "asking" the computer, and in turn the computer can try to gauge the student's level of understanding by asking questions and posing problems to the student. At the highest level, the computer tracks all of the student's actions in order to identify the student's weaknesses and provide appropriate help. While the computer cannot replace the human instructor...

    Scanflow: A multi-graph framework for machine learning workflow management, supervision, and debugging

    Machine Learning (ML) is more than just training models; the whole workflow must be considered. Once deployed, a ML model needs to be watched and constantly supervised and debugged to guarantee its validity and robustness in unexpected situations. Debugging in ML aims to identify (and address) model weaknesses in non-trivial contexts. Several techniques have been proposed to identify different types of model weaknesses, such as bias in classification, model decay, and adversarial attacks, yet there is no generic framework that allows them to work in a collaborative, modular, portable, and iterative way and, more importantly, that is flexible enough to allow both human- and machine-driven techniques. In this paper, we propose a novel containerized directed-graph framework to support and accelerate end-to-end ML workflow management, supervision, and debugging. The framework allows defining and deploying ML workflows in containers, tracking their metadata, checking their behavior in production, and improving the models by using both learned and human-provided knowledge. We demonstrate these capabilities by integrating into the framework two hybrid systems that detect data distribution drift by identifying the samples that are far from the latent space of the original distribution, asking for human intervention, and either retraining the model or wrapping it with a filter that removes the noise of corrupted data at inference time. We test these systems on the MNIST-C, CIFAR-10-C, and FashionMNIST-C datasets, obtaining promising accuracy results with the help of human involvement. This work was partially supported by Lenovo as part of Framework Contract Lenovo-BSC 2020, by the Spanish Government under contract PID2019-107255GB, and by the Generalitat de Catalunya, Spain under contract 2017-SGR-1414 and under grant 2020 FI-B 00257.
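    As a rough sketch of the kind of latent-space drift check the abstract describes (the encoder, distance measure, and percentile threshold here are placeholders rather than the paper's detectors):

```python
import numpy as np

def fit_latent_reference(train_latents: np.ndarray, quantile: float = 0.99):
    """
    Illustrative drift detector: samples whose latent representation lies far
    from the training distribution are flagged. "Far" is simply the Euclidean
    distance to the latent centroid beyond a high quantile of the training
    distances; both choices are placeholders.
    """
    centroid = train_latents.mean(axis=0)
    dists = np.linalg.norm(train_latents - centroid, axis=1)
    threshold = np.quantile(dists, quantile)
    return centroid, threshold

def flag_drifted(latents: np.ndarray, centroid: np.ndarray, threshold: float) -> np.ndarray:
    """Boolean mask of samples to route to human review, retraining, or filtering."""
    return np.linalg.norm(latents - centroid, axis=1) > threshold

# Usage sketch: encode() is any trained encoder (e.g. an autoencoder bottleneck).
# centroid, thr = fit_latent_reference(encode(x_train))
# suspicious = flag_drifted(encode(x_production), centroid, thr)
```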