    Democratizing Production-Scale Distributed Deep Learning

    Interest in and demand for training deep neural networks have grown rapidly across a wide range of applications in both academia and industry. However, distributed training at scale remains difficult due to the complex ecosystem of tools and hardware involved. One consequence is that the responsibility of orchestrating these complex components is often left to one-off scripts and glue code customized for specific problems. To address these restrictions, we introduce Alchemist, an internal service built at Apple from the ground up for easy, fast, and scalable distributed training. We discuss its design, implementation, and examples of running different flavors of distributed training. We also present case studies of its internal adoption in the development of autonomous systems, where training times have been reduced by 10x to keep up with the ever-growing data collection.

    Aneka: A Software Platform for .NET-based Cloud Computing

    Aneka is a platform for deploying Clouds and developing applications on top of them. It provides a runtime environment and a set of APIs that allow developers to build .NET applications that run their computation on either public or private clouds. One of the key features of Aneka is its support for multiple programming models, which are ways of expressing the execution logic of applications by using specific abstractions. This is accomplished by creating a customizable and extensible service-oriented runtime environment represented by a collection of software containers connected together. By leveraging this architecture, advanced services including resource reservation, persistence, storage management, security, and performance monitoring have been implemented. On top of this infrastructure, different programming models can be plugged in to provide support for different scenarios, as demonstrated by the engineering, life science, and industry applications. Comment: 30 pages, 10 figures

    On-Demand Virtual Research Environments using Microservices

    The computational demands of scientific applications are continuously increasing. The emergence of cloud computing has enabled on-demand resource allocation. However, relying solely on infrastructure as a service does not achieve the degree of flexibility required by the scientific community. Here we present a microservice-oriented methodology, where scientific applications run in a distributed orchestration platform as software containers, referred to as on-demand virtual research environments. The methodology is vendor agnostic and we provide an open source implementation that supports the major cloud providers, offering scalable management of scientific pipelines. We demonstrate the applicability and scalability of our methodology in life science applications, but the methodology is general and can be applied to other scientific domains.

    MaRe: a MapReduce-Oriented Framework for Processing Big Data with Application Containers

    Background. Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Further, these frameworks do not have native support for application containers, which are becoming popular in scientific data processing. Results. Here we present MaRe, a programming model with an associated open-source implementation, which introduces support for application containers in MapReduce. MaRe is based on Apache Spark and Docker, the MapReduce framework and container engine that have collected the largest open source community, thus providing interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on two data-intensive applications in life science, showing ease of use and scalability. Conclusions. MaRe enables scalable data-intensive processing in life science with MapReduce and application containers. When compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.

    NSML: Meet the MLaaS platform with a real-world case study

    The boom of deep learning has led many companies and academic institutions to competitively adopt machine-learning-based approaches in their fields. However, existing machine learning frameworks fall short of supporting collaboration and management for both data and models. We propose NSML, a machine learning as a service (MLaaS) platform, to meet these demands. NSML makes it easy to launch machine learning work on an NSML cluster and provides a collaborative environment that can support development at enterprise scale. NSML users can also deploy their own commercial services on an NSML cluster. In addition, NSML furnishes convenient visualization tools that assist users in analyzing their work. To verify the usefulness and accessibility of NSML, we performed some experiments with common examples. Furthermore, we examined the collaborative advantages of NSML through three competitions with real-world use cases.

    The ISTI Rapid Response on Exploring Cloud Computing 2018

    This report describes eighteen projects that explored how commercial cloud computing services can be utilized for scientific computation at national laboratories. These demonstrations ranged from deploying proprietary software in a cloud environment to leveraging established cloud-based analytics workflows for processing scientific datasets. By and large, the projects were successful, and collectively they suggest that cloud computing can be a valuable resource for scientific computation at national laboratories.

    Multiple Workflows Scheduling in Multi-tenant Distributed Systems: A Taxonomy and Future Directions

    The workflow is a general notion representing automated processes along with the flow of data. The automation ensures that the processes are executed in order. This feature attracts users from various backgrounds to build workflows. However, the computational requirements are enormous, and investing in a dedicated infrastructure for these workflows is not always feasible. To cater to these broader needs, multi-tenant platforms for executing workflows have begun to be built. In this paper, we identify the problems and challenges in multiple workflows scheduling that pertain to these platforms. We present a detailed taxonomy of the existing solutions on scheduling and resource provisioning aspects, followed by a survey of relevant works in this area. We lay out the open problems and challenges to stimulate research on multiple workflows scheduling in multi-tenant distributed systems. Comment: Several changes have been made based on reviewers' comments after the first round of review. This is a pre-print of a paper (currently under second-round review) submitted to ACM Computing Surveys

    ECHO: An Adaptive Orchestration Platform for Hybrid Dataflows across Cloud and Edge

    The Internet of Things (IoT) is offering unprecedented observational data that are used for managing Smart City utilities. Edge and Fog gateway devices are an integral part of IoT deployments to acquire real-time data and enact controls. Recently, Edge computing has been emerging as a first-class paradigm to complement Cloud-centric analytics. But a key limitation is the lack of a platform-as-a-service for applications spanning Edge and Cloud. Here, we propose ECHO, an orchestration platform for dataflows across distributed resources. ECHO's hybrid dataflow composition can operate on diverse data models (streams, micro-batches, and files) and interface with native runtime engines like TensorFlow and Storm to execute them. It manages the application's lifecycle, including container-based deployment and a registry for state management. ECHO can schedule the dataflow on different Edge, Fog, and Cloud resources, and also perform dynamic task migration between resources. We validate the ECHO platform for executing video analytics and sensor streams for Smart Traffic and Smart Utility applications on Raspberry Pi, NVidia TX1, ARM64, and Azure Cloud VM resources, and present our results. Comment: 17 pages, 5 figures, 2 tables, submitted to ICSOC-201

    On Energy Efficiency and Performance Evaluation of SBC based Clusters: A Hadoop case study

    Energy efficiency in a data center is a challenge and has garnered researchers' interest. In this paper we address the energy efficiency issue of a small-scale data center by utilizing Single Board Computer (SBC) based clusters. A compact design layout is presented to build two clusters using 20 nodes each. Extensive testing was carried out to analyze the performance of these clusters using popular performance benchmarks for task execution time, memory/storage utilization, network throughput, and energy consumption. Further, we investigate the cost of operating SBC-based clusters by correlating energy utilization with the execution time of various benchmarks using workloads of different sizes. Results show that, although the low cost of a cluster built with ARM-based SBCs is desirable, these clusters yield comparatively low performance and energy efficiency due to limited onboard capabilities. It is possible to tweak Hadoop configuration parameters for an ARM-based SBC cluster to utilize resources efficiently. We present a discussion on the effectiveness of SBC-based clusters as a testbed for inexpensive and green cloud computing research. Comment: 12 pages. Submitted to Electronics Journal

    Hoard: A Distributed Data Caching System to Accelerate Deep Learning Training on the Cloud

    Deep Learning system architects strive to design a balanced system where the computational accelerator (FPGA, GPU, etc.) is not starved for data. Feeding training data fast enough to keep accelerator utilization high is difficult when utilizing dedicated hardware like GPUs. As accelerators are getting faster, the storage media and data buses feeding the data have not kept pace, and the ever-increasing size of training data further compounds the problem. We describe the design and implementation of a distributed caching system called Hoard that stripes the data across fast local disks of multiple GPU nodes using a distributed file system that efficiently feeds the data to ensure minimal degradation in GPU utilization due to I/O starvation. Hoard can cache the data from a central storage system before the start of the job or during the initial execution of the job, and feeds the cached data for subsequent epochs of the same job and for different invocations of jobs that share the same data requirements, e.g. hyper-parameter tuning. Hoard exposes a POSIX file system interface so that existing deep learning frameworks can take advantage of the cache without any modifications. We show that Hoard, using two NVMe disks per node and a distributed file system for caching, achieves a 2.1x speed-up over a 10Gb/s NFS central storage system on a 16 GPU (4 nodes, 4 GPUs per node) cluster for a challenging AlexNet ImageNet image classification benchmark with a 150GB input dataset. As a result of the caching, Hoard eliminates the I/O bottlenecks introduced by the shared storage and increases the utilization of the system by 2x compared to using the shared storage without the cache. Comment: 12 pages, 5 figures