1,528 research outputs found

    GraphR: Accelerating Graph Processing Using ReRAM

    Full text link
    This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massive parallel analog operations with low hardware and energy cost. The analog computation is suit- able for graph processing because: 1) The algorithms are iterative and could inherently tolerate the imprecision; 2) Both probability calculation (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed in sparse matrix vector multiplication (SpMV), it can be efficiently performed by ReRAM crossbar. We show that this assumption is generally true for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engine (GE). The core graph computations are performed in sparse matrix format in GEs (ReRAM crossbars). The vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize the massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain of performing parallel operations overshadows the wastes due to sparsity. The experiment results show that GRAPHR achieves a 16.01x (up to 132.67x) speedup and a 33.82x energy saving on geometric mean compared to a CPU baseline system. Com- pared to GPU, GRAPHR achieves 1.69x to 2.19x speedup and consumes 4.77x to 8.91x less energy. GRAPHR gains a speedup of 1.16x to 4.12x, and is 3.67x to 10.96x more energy efficiency compared to PIM-based architecture.Comment: Accepted to HPCA 201

    A 64mW DNN-based Visual Navigation Engine for Autonomous Nano-Drones

    Full text link
    Fully-autonomous miniaturized robots (e.g., drones), with artificial intelligence (AI) based visual navigation capabilities are extremely challenging drivers of Internet-of-Things edge intelligence capabilities. Visual navigation based on AI approaches, such as deep neural networks (DNNs) are becoming pervasive for standard-size drones, but are considered out of reach for nanodrones with size of a few cm2{}^\mathrm{2}. In this work, we present the first (to the best of our knowledge) demonstration of a navigation engine for autonomous nano-drones capable of closed-loop end-to-end DNN-based visual navigation. To achieve this goal we developed a complete methodology for parallel execution of complex DNNs directly on-bard of resource-constrained milliwatt-scale nodes. Our system is based on GAP8, a novel parallel ultra-low-power computing platform, and a 27 g commercial, open-source CrazyFlie 2.0 nano-quadrotor. As part of our general methodology we discuss the software mapping techniques that enable the state-of-the-art deep convolutional neural network presented in [1] to be fully executed on-board within a strict 6 fps real-time constraint with no compromise in terms of flight results, while all processing is done with only 64 mW on average. Our navigation engine is flexible and can be used to span a wide performance range: at its peak performance corner it achieves 18 fps while still consuming on average just 3.5% of the power envelope of the deployed nano-aircraft.Comment: 15 pages, 13 figures, 5 tables, 2 listings, accepted for publication in the IEEE Internet of Things Journal (IEEE IOTJ

    WaveScript: A Case-Study in Applying a Distributed Stream-Processing Language

    Get PDF
    Applications that combine live data streams with embedded, parallel,and distributed processing are becoming more commonplace. WaveScriptis a domain-specific language that brings high-level, type-safe,garbage-collected programming to these domains. This is made possibleby three primary implementation techniques. First, we employ a novelevaluation strategy that uses a combination of interpretation andreification to partially evaluate programs into stream dataflowgraphs. Second, we use profile-driven compilation to enable manyoptimizations that are normally only available in the synchronous(rather than asynchronous) dataflow domain. Finally, we incorporatean extensible system for rewrite rules to capture algebraic propertiesin specific domains (such as signal processing).We have used our language to build and deploy a sensor-network for theacoustic localization of wild animals, in particular, theYellow-Bellied marmot. We evaluate WaveScript's performance on thisapplication, showing that it yields good performance on both embeddedand desktop-class machines, including distributed execution andsubstantial parallel speedups. Our language allowed us to implementthe application rapidly, while outperforming a previous Cimplementation by over 35%, using fewer than half the lines of code.We evaluate the contribution of our optimizations to this success

    Framing Apache Spark in life sciences

    Get PDF
    Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the drift to big data led researchers to face technical and infrastructural challenges for storing, sharing, and analysing them. In fact, this kind of tasks requires distributed computing systems and algorithms able to ensure efficient processing. Cutting edge distributed programming frameworks allow to implement flexible algorithms able to adapt the computation to the data over on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a very powerful HPC engine for large-scale data processing on clusters. Also thanks to specialised libraries for working with structured and relational data, it allows to support machine learning, graph-based computation, and stream processing. This review article is aimed at helping life sciences researchers to ascertain the features of Apache Spark and to assess whether it can be successfully used in their research activities

    Scalable and fault-tolerant data stream processing on multi-core architectures

    Get PDF
    With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state. While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures. Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them to a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments.Open Acces

    Gunrock: GPU Graph Analytics

    Full text link
    For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms, to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU

    Designing Automated Deployment Strategies of Face Recognition Solutions in Heterogeneous IoT Platforms

    Get PDF
    In this paper, we tackle the problem of deploying face recognition (FR) solutions in heterogeneous Internet of Things (IoT) platforms. The main challenges are the optimal deployment of deep neural networks (DNNs) in the high variety of IoT devices (e.g., robots, tablets, smartphones, etc.), the secure management of biometric data while respecting the users’ privacy, and the design of appropriate user interaction with facial verification mechanisms for all kinds of users. We analyze different approaches to solving all these challenges and propose a knowledge-driven methodology for the automated deployment of DNN-based FR solutions in IoT devices, with the secure management of biometric data, and real-time feedback for improved interaction. We provide some practical examples and experimental results with state-of-the-art DNNs for FR in Intel’s and NVIDIA’s hardware platforms as IoT devices.This work was supported by the SHAPES project, which has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement no. 857159, and in part by the Spanish Centre for the Development of Industrial Technology (CDTI) through the Project ÉGIDA—RED DE EXCELENCIA EN TECNOLOGIAS DE SEGURIDAD Y PRIVACIDAD under Grant CER20191012

    StreamCloud: An elastic and scalable data streaming system

    Get PDF
    Many applications in several domains such as telecommunications, network security, large scale sensor networks, require online processing of continuous data lows. They produce very high loads that requires aggregating the processing capacity of many nodes. Current Stream Processing Engines do not scale with the input load due to single-node bottlenecks. Additionally, they are based on static con?gurations that lead to either under or over-provisioning. In this paper, we present StreamCloud, a scalable and elastic stream processing engine for processing large data stream volumes. StreamCloud uses a novel parallelization technique that splits queries into subqueries that are allocated to independent sets of nodes in a way that minimizes the distribution overhead. Its elastic protocols exhibit low intrusiveness, enabling effective adjustment of resources to the incoming load. Elasticity is combined with dynamic load balancing to minimize the computational resources used. The paper presents the system design, implementation and a thorough evaluation of the scalability and elasticity of the fully implemented system
    • …
    corecore