1,610 research outputs found

    Infrastructure for Usable Machine Learning: The Stanford DAWN Project

    Full text link
    Despite incredible recent advances in machine learning, building machine learning applications remains prohibitively time-consuming and expensive for all but the best-trained, best-funded engineering organizations. This expense comes not from a need for new and improved statistical models but instead from a lack of systems and tools for supporting end-to-end machine learning application development, from data preparation and labeling to productionization and monitoring. In this document, we outline opportunities for infrastructure supporting usable, end-to-end machine learning applications in the context of the nascent DAWN (Data Analytics for What's Next) project at Stanford

    Study on Resource Efficiency of Distributed Graph Processing

    Full text link
    Graphs may be used to represent many different problem domains -- a concrete example is that of detecting communities in social networks, which are represented as graphs. With big data and more sophisticated applications becoming widespread in recent years, graph processing has seen an emergence of requirements pertaining data volume and volatility. This multidisciplinary study presents a review of relevant distributed graph processing systems. Herein they are presented in groups defined by common traits (distributed processing paradigm, type of graph operations, among others), with an overview of each system's strengths and weaknesses. The set of systems is then narrowed down to a set of two, upon which quantitative analysis was performed. For this quantitative comparison of systems, focus was cast on evaluating the performance of algorithms for the problem of detecting communities. To help further understand the evaluations performed, a background is provided on graph clustering

    Hadoop on HPC: Integrating Hadoop and Pilot-based Dynamic Resource Management

    Full text link
    High-performance computing platforms such as supercomputers have traditionally been designed to meet the compute demands of scientific applications. Consequently, they have been architected as producers and not consumers of data. The Apache Hadoop ecosystem has evolved to meet the requirements of data processing applications and has addressed many of the limitations of HPC platforms. There exist a class of scientific applications however, that need the collective capabilities of traditional high-performance computing environments and the Apache Hadoop ecosystem. For example, the scientific domains of bio-molecular dynamics, genomics and network science need to couple traditional computing with Hadoop/Spark based analysis. We investigate the critical question of how to present the capabilities of both computing environments to such scientific applications. Whereas this questions needs answers at multiple levels, we focus on the design of resource management middleware that might support the needs of both. We propose extensions to the Pilot-Abstraction to provide a unifying resource management layer. This is an important step that allows applications to integrate HPC stages (e.g. simulations) to data analytics. Many supercomputing centers have started to officially support Hadoop environments, either in a dedicated environment or in hybrid deployments using tools such as myHadoop. This typically involves many intrinsic, environment-specific details that need to be mastered, and often swamp conceptual issues like: How best to couple HPC and Hadoop application stages? How to explore runtime trade-offs (data localities vs. data movement)? This paper provides both conceptual understanding and practical solutions to the integrated use of HPC and Hadoop environments

    Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

    Full text link
    While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both, batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to only batch processing workloads. We compare micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual socket server. In our evaluation experiments, we have found that batch processing are stream processing workloads have similar micro-architectural characteristics and are bounded by the latency of frequent data access to DRAM. For data accesses we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve the performance by 10% on average and(ii) disabling next-line L1-D prefetchers can reduce the execution time by up-to 14\% and (iii) multiple small executors can provide up-to 36\% speedup over single large executor

    Pilot-Abstraction: A Valid Abstraction for Data-Intensive Applications on HPC, Hadoop and Cloud Infrastructures?

    Full text link
    HPC environments have traditionally been designed to meet the compute demand of scientific applications and data has only been a second order concern. With science moving toward data-driven discoveries relying more on correlations in data to form scientific hypotheses, the limitations of HPC approaches become apparent: Architectural paradigms such as the separation of storage and compute are not optimal for I/O intensive workloads (e.g. for data preparation, transformation and SQL). While there are many powerful computational and analytical libraries available on HPC (e.g. for scalable linear algebra), they generally lack the usability and variety of analytical libraries found in other environments (e.g. the Apache Hadoop ecosystem). Further, there is a lack of abstractions that unify access to increasingly heterogeneous infrastructure (HPC, Hadoop, clouds) and allow reasoning about performance trade-offs in this complex environment. At the same time, the Hadoop ecosystem is evolving rapidly and has established itself as de-facto standard for data-intensive workloads in industry and is increasingly used to tackle scientific problems. In this paper, we explore paths to interoperability between Hadoop and HPC, examine the differences and challenges, such as the different architectural paradigms and abstractions, and investigate ways to address them. We propose the extension of the Pilot-Abstraction to Hadoop to serve as interoperability layer for allocating and managing resources across different infrastructures. Further, in-memory capabilities have been deployed to enhance the performance of large-scale data analytics (e.g. iterative algorithms) for which the ability to re-use data across iterations is critical. As memory naturally fits in with the Pilot concept of retaining resources for a set of tasks, we propose the extension of the Pilot-Abstraction to in-memory resources.Comment: Submitted to HPDC 2015, 12 pages, 9 figure

    A Survey on Geographically Distributed Big-Data Processing using MapReduce

    Full text link
    Hadoop and Spark are widely used distributed processing frameworks for large-scale data processing in an efficient and fault-tolerant manner on private or public clouds. These big-data processing systems are extensively used by many industries, e.g., Google, Facebook, and Amazon, for solving a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern matching, and social network analysis. However, all these popular systems have a major drawback in terms of locally distributed computations, which prevent them in implementing geographically distributed data processing. The increasing amount of geographically distributed massive data is pushing industries and academia to rethink the current big-data processing systems. The novel frameworks, which will be beyond state-of-the-art architectures and technologies involved in the current system, are expected to process geographically distributed data at their locations without moving entire raw datasets to a single location. In this paper, we investigate and discuss challenges and requirements in designing geographically distributed data processing frameworks and protocols. We classify and study batch processing (MapReduce-based systems), stream processing (Spark-based systems), and SQL-style processing geo-distributed frameworks, models, and algorithms with their overhead issues.Comment: IEEE Transactions on Big Data; Accepted June 2017. 20 page

    Development details and computational benchmarking of DEPAM

    Full text link
    In the big data era of observational oceanography, passive acoustics datasets are becoming too high volume to be processed on local computers due to their processor and memory limitations. As a result there is a current need for our community to turn to cloud-based distributed computing. We present a scalable computing system for FFT (Fast Fourier Transform)-based features (e.g., Power Spectral Density) based on the Apache distributed frameworks Hadoop and Spark. These features are at the core of many different types of acoustic analysis where the need of processing data at scale with speed is evident, e.g. serving as long-term averaged learning representations of soundscapes to identify periods of acoustic interest. In addition to provide a complete description of our system implementation, we also performed a computational benchmark comparing our system to three other Scala-only, Matlab and Python based systems in standalone executions, and evaluated its scalability using the speed up metric. Our current results are very promising in terms of computational performance, as we show that our proposed Hadoop/Spark system performs reasonably well on a single node setup comparatively to state-of-the-art processing tools used by the PAM community, and that it could also fully leverage more intensive cluster resources with a almost-linear scalability behaviour above a certain dataset volume

    Snap ML: A Hierarchical Framework for Machine Learning

    Full text link
    We describe a new software framework for fast training of generalized linear models. The framework, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the hierarchical architecture of modern computing systems. We prove theoretically that such a hierarchical system can accelerate training in distributed environments where intra-node communication is cheaper than inter-node communication. Additionally, we provide a review of the implementation of Snap ML in terms of GPU acceleration, pipelining, communication patterns and software architecture, highlighting aspects that were critical for achieving high performance. We evaluate the performance of Snap ML in both single-node and multi-node environments, quantifying the benefit of the hierarchical scheme and the data streaming functionality, and comparing with other widely-used machine learning software frameworks. Finally, we present a logistic regression benchmark on the Criteo Terabyte Click Logs dataset and show that Snap ML achieves the same test loss an order of magnitude faster than any of the previously reported results, including those obtained using TensorFlow and scikit-learn.Comment: in Proceedings of the Thirty-Second Conference on Neural Information Processing Systems (NeurIPS 2018

    Asynchronous Complex Analytics in a Distributed Dataflow Architecture

    Full text link
    Scalable distributed dataflow systems have recently experienced widespread adoption, with commodity dataflow engines such as Hadoop and Spark, and even commodity SQL engines routinely supporting increasingly sophisticated analytics tasks (e.g., support vector machines, logistic regression, collaborative filtering). However, these systems' synchronous (often Bulk Synchronous Parallel) dataflow execution model is at odds with an increasingly important trend in the machine learning community: the use of asynchrony via shared, mutable state (i.e., data races) in convex programming tasks, which has---in a single-node context---delivered noteworthy empirical performance gains and inspired new research into asynchronous algorithms. In this work, we attempt to bridge this gap by evaluating the use of lightweight, asynchronous state transfer within a commodity dataflow engine. Specifically, we investigate the use of asynchronous sideways information passing (ASIP) that presents single-stage parallel iterators with a Volcano-like intra-operator iterator that can be used for asynchronous information passing. We port two synchronous convex programming algorithms, stochastic gradient descent and the alternating direction method of multipliers (ADMM), to use ASIPs. We evaluate an implementation of ASIPs within on Apache Spark that exhibits considerable speedups as well as a rich set of performance trade-offs in the use of these asynchronous algorithms

    Project Beehive: A Hardware/Software Co-designed Stack for Runtime and Architectural Research

    Full text link
    The end of Dennard scaling combined with stagnation in architectural and compiler optimizations makes it challenging to achieve significant performance deltas. Solutions based solely in hardware or software are no longer sufficient to maintain the pace of improvements seen during the past few decades. In hardware, the end of single-core scaling resulted in the proliferation of multi-core system architectures, however this has forced complex parallel programming techniques into the mainstream. To further exploit physical resources, systems are becoming increasingly heterogeneous with specialized computing elements and accelerators. Programming across a range of disparate architectures requires a new level of abstraction that programming languages will have to adapt to. In software, emerging complex applications, from domains such as Big Data and computer vision, run on multi-layered software stacks targeting hardware with a variety of constraints and resources. Hence, optimizing for the power-performance (and resiliency) space requires experimentation platforms that offer quick and easy prototyping of hardware/software co-designed techniques. To that end, we present Project Beehive: A Hardware/Software co-designed stack for runtime and architectural research. Project Beehive utilizes various state-of-the-art software and hardware components along with novel and extensible co-design techniques. The objective of Project Beehive is to provide a modern platform for experimentation on emerging applications, programming languages, compilers, runtimes, and low-power heterogeneous many-core architectures in a full-system co-designed manner.Comment: New version of this pape
    corecore