Session-Based Programming for Parallel Algorithms: Expressiveness and Performance
This paper investigates session programming and typing of benchmark examples
to compare productivity, safety and performance with other communications
programming languages. Parallel algorithms are used to examine the above
aspects due to their extensive use of message passing for interaction, and
their increasing prominence in algorithmic research with the rising
availability of hardware resources such as multicore machines and clusters. We
contribute new benchmark results for SJ, an extension of Java for type-safe,
binary session programming, against MPJ Express, a Java messaging system based
on the MPI standard. In conclusion, we observe that (1) despite rich libraries
and functionality, MPI remains a low-level API, and can suffer from commonly
perceived disadvantages of explicit message passing such as deadlocks and
unexpected message types, and (2) the benefits of high-level session
abstraction, which has significant impact on program structure to improve
readability and reliability, and session type-safety can greatly facilitate the
task of communications programming whilst retaining competitive performance.
TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments
Deep neural networks (DNNs) have become core computation components within
low latency Function as a Service (FaaS) prediction pipelines: including image
recognition, object detection, natural language processing, speech synthesis,
and personalized recommendation pipelines. Cloud computing, as the de-facto
backbone of modern computing infrastructure for both enterprise and consumer
applications, has to be able to handle user-defined pipelines of diverse DNN
inference workloads while maintaining isolation and latency guarantees, and
minimizing resource waste. The current solution for guaranteeing isolation
within FaaS is suboptimal -- suffering from "cold start" latency. A major cause
of such inefficiency is the need to move large amounts of model data within and
across servers. We propose TrIMS as a novel solution to address these issues.
Our proposed solution consists of a persistent model store across the GPU, CPU,
local storage, and cloud storage hierarchy, an efficient resource management
layer that provides isolation, and a succinct set of application APIs and
container technologies for easy and transparent integration with FaaS, Deep
Learning (DL) frameworks, and user code. We demonstrate our solution by
interfacing TrIMS with the Apache MXNet framework and demonstrate up to 24x
speedup in latency for image classification models and up to 210x speedup for
large models. We achieve up to 8x system throughput improvement. Comment: In Proceedings CLOUD 201
Economic and Organizational Issues in Alaska Water Quality Management
The work upon which this report (Proj. A-029-ALAS) is based was supported by funds provided
by the United States Department of the Interior, Office of Water Resources Research, as
authorized under the Water Resources Act of 1964
Conducting a Scan of Your College Access and Success System
Explains how to design and implement an assessment of local systems' ability to improve college attainment, including needs, assets, and challenges; and how to leverage findings for stakeholder engagement, benchmarking, and strategy development
A low-power, high-performance speech recognition accelerator
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost, which mobile devices with tiny power budgets cannot afford. Hardware acceleration reduces the energy consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which is the main bottleneck in an ASR system. The proposed design includes innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in the design of these accelerators. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses, with negligible impact on area. Additionally, we introduce a novel bandwidth-saving technique that reduces off-chip memory accesses by 20 percent. Finally, we present a power-saving technique that significantly reduces the leakage power of the accelerator's scratchpad memories, providing between 8.5 and 29.2 percent reduction in total power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude, and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on a GeForce GTX 980 GPU, while reducing the energy by 123-454x. Peer Reviewed. Postprint (author's final draft