Project Beehive: A Hardware/Software Co-designed Stack for Runtime and Architectural Research
The end of Dennard scaling, combined with stagnation in architectural and
compiler optimizations, makes it challenging to achieve significant performance
improvements. Solutions based solely on hardware or software are no longer
sufficient to maintain the pace of improvements seen during the past few
decades. In hardware, the end of single-core scaling resulted in the
proliferation of multi-core system architectures; however, this has forced
complex parallel programming techniques into the mainstream. To further exploit
physical
resources, systems are becoming increasingly heterogeneous with specialized
computing elements and accelerators. Programming across a range of disparate
architectures requires a new level of abstraction that programming languages
will have to adapt to. In software, emerging complex applications, from domains
such as Big Data and computer vision, run on multi-layered software stacks
targeting hardware with a variety of constraints and resources. Hence,
optimizing for the power-performance (and resiliency) space requires
experimentation platforms that offer quick and easy prototyping of
hardware/software co-designed techniques. To that end, we present Project
Beehive: A Hardware/Software co-designed stack for runtime and architectural
research. Project Beehive utilizes various state-of-the-art software and
hardware components along with novel and extensible co-design techniques. The
objective of Project Beehive is to provide a modern platform for
experimentation on emerging applications, programming languages, compilers,
runtimes, and low-power heterogeneous many-core architectures in a full-system
co-designed manner.
Comment: New version of this paper.
Optimal processor assignment for pipeline computations
The availability of large-scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks, their precedence constraints, and their experimentally determined individual response times for different processor sizes, find an assignment of processors to tasks. Two objectives are of interest: minimal response time given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem, in which several tasks share a processor; here, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p-processor system and a series-parallel precedence graph with n constituent tasks, an O(np^2) algorithm is provided that finds the optimal assignment for the response time optimization problem; a second algorithm finds the assignment optimizing the constrained throughput in O(np^2 log p) time. Special cases of linear, independent, and tree graphs are also considered.
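The response-time recursion behind the O(np^2) bound is easy to sketch for the simplest of the listed cases, a linear task chain. The following Python sketch is our own illustration of that dynamic-programming structure under stated assumptions (measured times supplied as a table, throughput expressed as a per-stage time bound tau); it is not the paper's algorithm for general series-parallel graphs.

```python
import math

def assign_linear_pipeline(times, p, tau):
    """Minimize pipeline response time under a throughput requirement.

    times[i][k-1] -- measured response time of task i on k processors
    p             -- total number of processors available
    tau           -- maximum allowed stage time (1 / required throughput)
    Returns (response_time, processors_per_task), or None if infeasible.
    """
    n = len(times)
    INF = math.inf
    # best[q]: minimal summed response time of the tasks processed so
    # far, using exactly q processors in total.
    best = [0.0] + [INF] * p
    picks = []
    for i in range(n):
        nxt = [INF] * (p + 1)
        pick = [0] * (p + 1)
        for q in range(1, p + 1):
            for k in range(1, q + 1):
                t = times[i][k - 1]
                if t <= tau and best[q - k] + t < nxt[q]:
                    nxt[q], pick[q] = best[q - k] + t, k
        best = nxt
        picks.append(pick)
    q = min(range(p + 1), key=lambda j: best[j])  # best feasible total
    if best[q] == INF:
        return None  # no allocation meets the throughput bound
    alloc = [0] * n
    response = best[q]
    for i in range(n - 1, -1, -1):  # backtrack the chosen allocation
        alloc[i] = picks[i][q]
        q -= alloc[i]
    return response, alloc

# toy usage: two tasks, up to 4 processors, stage time bound 3.0
print(assign_linear_pipeline([[4.0, 2.5, 2.0, 1.8],
                              [5.0, 3.0, 2.2, 1.9]], 4, 3.0))
```

The two nested loops over q and k give the O(np^2) cost per the abstract; the throughput-constrained variant only adds the feasibility test `t <= tau`.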
Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools
Deep Learning (DL) has seen immense success in the recent past, leading to
state-of-the-art results in various domains such as image recognition and
natural language processing. One of the reasons for this success is the
increasing size of DL models and the availability of vast amounts of training
data. To keep improving the performance of DL, increasing
the scalability of DL systems is necessary. In this survey, we perform a broad
and thorough investigation on challenges, techniques and tools for scalable DL
on distributed infrastructures. This incorporates infrastructures for DL,
methods for parallel DL training, multi-tenant resource scheduling and the
management of training and model data. Further, we analyze and compare 11
current open-source DL frameworks and tools and investigate which of the
techniques are commonly implemented in practice. Finally, we highlight future
research trends in DL systems that deserve further investigation.
Comment: Accepted at ACM Computing Surveys, to appear.
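As a toy illustration of the most widely used of the surveyed techniques, synchronous data-parallel training, the sketch below averages per-worker gradients before every update; the linear model, shard handling, and hyperparameters are our own assumptions for the example, not taken from the survey.

```python
import numpy as np

def grad(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def data_parallel_sgd(X, y, workers=4, lr=0.1, steps=200):
    """Synchronous data parallelism: each worker holds one shard of the
    training data, computes a local gradient against shared weights,
    and the averaged gradient (an all-reduce, simulated here with
    np.mean) is applied identically on every replica."""
    w = np.zeros(X.shape[1])
    shards = list(zip(np.array_split(X, workers), np.array_split(y, workers)))
    for _ in range(steps):
        g = np.mean([grad(w, Xi, yi) for Xi, yi in shards], axis=0)
        w -= lr * g  # all replicas apply the same averaged update
    return w

# toy usage: recover w_true = [1, -2] from synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
y = X @ np.array([1.0, -2.0])
print(data_parallel_sgd(X, y))  # approaches [1, -2]
```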
Making BREAD: Biomimetic strategies for Artificial Intelligence Now and in the Future
The Artificial Intelligence (AI) revolution foretold during the 1960s is
well underway in the second decade of the 21st century. Its period of
phenomenal growth likely lies ahead. Still, we believe, there are crucial
lessons that biology can offer that will enable a prosperous future for AI. For
machines in general, and for AIs especially, operating over extended periods
or in extreme environments will require energy usage orders of magnitude more
efficient than exists today. In many operational environments, energy sources
will be constrained. Any plans for AI devices operating in a challenging
environment must begin with the question of how they are powered, where fuel is
located, how energy is stored and made available to the machine, and how long
the machine can operate on specific energy units. Hence, the materials and
technologies that provide the needed energy represent a critical challenge
towards future use-scenarios of AI and should be integrated into their design.
Here we make four recommendations for stakeholders and especially decision
makers to facilitate a successful trajectory for this technology. First, that
scientific societies and governments coordinate Biomimetic Research for
Energy-efficient AI Designs (BREAD): a multinational initiative and a funding
strategy for investments in the future integrated design of energetics into AI.
Second, that biomimetic energetic solutions be central to design consideration
for future AI. Third, that a pre-competitive space be organized between
stakeholder partners and fourth, that a trainee pipeline be established to
ensure the human capital required for success in this area.
Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges
With the emergence of big data applications such as machine learning, speech
recognition, artificial intelligence, and DNA sequencing in recent years,
computer architecture research communities are facing an explosion in the scale
of data. To achieve high efficiency in data-intensive computing, studies of
heterogeneous accelerators that focus on the latest applications have become a
hot topic in the computer architecture domain. At present, the implementation
of heterogeneous accelerators mainly relies on heterogeneous computing units
such as Application-Specific Integrated Circuits (ASICs), Graphics Processing
Units (GPUs), and Field Programmable Gate Arrays (FPGAs). Among these typical
heterogeneous architectures, FPGA-based reconfigurable accelerators have two
merits. First, an FPGA contains a large number of reconfigurable circuits,
which satisfy the high-performance and low-power requirements of specific
applications. Second, FPGA-based reconfigurable architectures enable rapid
prototyping and offer excellent customizability and reconfigurability. A batch
of acceleration works based on FPGAs and other reconfigurable architectures has
recently emerged at top-tier computer architecture conferences. To review this
recent related work, this survey takes the latest research on reconfigurable
accelerator architectures and their algorithmic applications as its basis. We
compare active research issues and application domains, and we analyze the
advantages, disadvantages, and challenges of reconfigurable accelerators.
Finally, we discuss the likely future development of accelerator architectures,
hoping to provide a reference for computer architecture researchers.
CAVBench: A Benchmark Suite for Connected and Autonomous Vehicles
Connected and autonomous vehicles (CAVs) have recently attracted a
significant amount of attention both from researchers and industry. Numerous
studies targeting algorithms, software frameworks, and applications on the CAVs
scenario have emerged. Meanwhile, several pioneer efforts have focused on the
edge computing system and architecture design for the CAVs scenario and
provided various heterogeneous platform prototypes for CAVs. However, a
standard and comprehensive application benchmark for CAVs is missing, hindering
the study of these emerging computing systems. To address this challenging
problem, we present CAVBench, the first benchmark suite for the edge computing
system in the CAVs scenario. CAVBench comprises six typical applications
covering four dominant CAV scenarios and takes four datasets as standard
input. CAVBench provides quantitative evaluation results via application and
system perspective output metrics. We perform a series of experiments and
acquire three systemic characteristics of the applications in CAVBench. First,
the operation intensity of the applications is polarized, which explains why
heterogeneous hardware is important for a CAVs computing system. Second, all
applications in CAVBench consume high memory bandwidth, so the system should be
equipped with high bandwidth memory or leverage good memory bandwidth
management to avoid the performance degradation caused by memory bandwidth
competition. Third, some applications have worse data/instruction locality
based on the cache miss observation, so the computing system targeting these
applications should optimize the cache architecture. Last, we use CAVBench to
evaluate a typical edge computing platform and present a quantitative and
qualitative analysis of the benchmarking results.
Comment: 13 pages, The Third ACM/IEEE Symposium on Edge Computing 2018 (SEC).
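The "polarized operation intensity" finding rests on a simple ratio from roofline analysis. The snippet below is a generic illustration of that metric, with invented platform numbers rather than CAVBench measurements:

```python
def operational_intensity(flops, dram_bytes):
    """Operational intensity: useful operations per byte of memory traffic."""
    return flops / dram_bytes

def roofline_bound(intensity, peak_flops, peak_bw):
    """A kernel is memory-bound when the bandwidth roof (peak_bw * OI)
    sits below the compute roof (peak_flops), else compute-bound."""
    return "memory-bound" if peak_bw * intensity < peak_flops else "compute-bound"

# invented numbers: a kernel doing 2 GFLOPs over 8 GB of DRAM traffic, on a
# platform with 100 GFLOP/s peak compute and 25 GB/s memory bandwidth
oi = operational_intensity(2e9, 8e9)          # 0.25 FLOP/byte
print(oi, roofline_bound(oi, 100e9, 25e9))    # -> 0.25 memory-bound
```

Applications polarized toward the low end of this ratio favor high-bandwidth memory; those at the high end favor dedicated compute, which is why heterogeneous hardware suits a CAV workload mix.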
Reconfiguring the Imaging Pipeline for Computer Vision
Advancements in deep learning have ignited an explosion of research on
efficient hardware for embedded computer vision. Hardware vision acceleration,
however, does not address the cost of capturing and processing the image data
that feeds these algorithms. We examine the role of the image signal processing
(ISP) pipeline in computer vision to identify opportunities to reduce
computation and save energy. The key insight is that imaging pipelines should
be designed to be configurable: to switch between a traditional photography
mode and a low-power vision mode that produces lower-quality image data
suitable only for computer vision. We use eight computer vision algorithms and
a reversible pipeline simulation tool to study the imaging system's impact on
vision performance. For both CNN-based and classical vision algorithms, we
observe that only two ISP stages, demosaicing and gamma compression, are
critical for task performance. We propose a new image sensor design that can
compensate for skipping these stages. The sensor design features an adjustable
resolution and tunable analog-to-digital converters (ADCs). Our proposed
imaging system's vision mode disables the ISP entirely and configures the
sensor to produce subsampled, lower-precision image data. This vision mode can
save ~75% of the average energy of a baseline photography mode while having
only a small impact on vision task accuracy.
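Since demosaicing and gamma compression are the two stages the study singles out, a toy model of the proposed vision mode is easy to write down. In the numpy sketch below, subsampling stands in for skipped demosaicing and a nonlinear low-precision ADC stands in for gamma compression; the stride, bit depth, and power-law curve are our illustrative choices, not the paper's settings.

```python
import numpy as np

def vision_mode(raw, stride=2, bits=5):
    """Toy model of the low-power vision mode: bypass the ISP, subsample
    the sensor array (in lieu of demosaicing) and quantize with a
    nonlinear low-precision ADC (in lieu of gamma compression)."""
    sub = raw[::stride, ::stride]                   # adjustable resolution
    curved = np.clip(sub, 0.0, 1.0) ** (1 / 2.2)    # gamma-like ADC ramp
    levels = 2 ** bits - 1
    return np.round(curved * levels) / levels       # coarse quantization

raw = np.random.rand(8, 8)    # stand-in for linear raw sensor values
frame = vision_mode(raw)      # subsampled, lower-precision image data
```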
Parallel Programming Models for Heterogeneous Many-Cores : A Survey
Heterogeneous many-cores are now an integral part of modern computing systems,
ranging from embedded systems to supercomputers. While heterogeneous many-core
design offers the potential for energy-efficient high performance, such
potential can only be unlocked if the application programs are suitably
parallel and can be made to match the underlying heterogeneous platform. In
this article, we provide a comprehensive survey of parallel programming models
for heterogeneous many-core architectures and review compiler techniques for
improving programmability and portability. We examine various software
optimization techniques for minimizing the communication overhead between
heterogeneous computing devices. We provide a road map for a wide variety of
research areas. We conclude with a discussion of open issues in the
area and potential research directions. This article provides both an
accessible introduction to the fast-moving area of heterogeneous programming
and a detailed bibliography of its main achievements.
Comment: Accepted for publication at CCF Transactions on High Performance
Computing.
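One classic communication-minimizing technique in this space is overlapping host-device transfers with computation (double buffering). The sketch below simulates that overlap with Python threads; `transfer` and `compute` are placeholders we invented, standing in for an asynchronous copy and a device kernel launch.

```python
import threading, queue

def overlapped_offload(chunks, transfer, compute):
    """Overlap data transfer for chunk i+1 with computation on chunk i
    (double buffering), hiding communication latency behind compute."""
    staged = queue.Queue(maxsize=1)     # at most one chunk in flight
    results = []

    def mover():
        for c in chunks:
            staged.put(transfer(c))     # runs ahead of the consumer
        staged.put(None)                # sentinel: no more work

    t = threading.Thread(target=mover)
    t.start()
    while (buf := staged.get()) is not None:
        results.append(compute(buf))    # overlaps with the next transfer
    t.join()
    return results

# toy usage: "transfer" copies the chunk, "compute" squares it
print(overlapped_offload([1, 2, 3], lambda c: c, lambda b: b * b))
```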
SDC - Stacked Dilated Convolution: A Unified Descriptor Network for Dense Matching Tasks
Dense pixel matching is important for many computer vision tasks such as
disparity and flow estimation. We present a robust, unified descriptor network
that considers a large context region with high spatial variance. Our network
has a very large receptive field and avoids striding layers to maintain spatial
resolution. These properties are achieved by creating a novel neural network
layer that consists of multiple, parallel, stacked dilated convolutions (SDC).
Several of these layers are combined to form our SDC descriptor network. In our
experiments, we show that our SDC features outperform state-of-the-art feature
descriptors in terms of accuracy and robustness. In addition, we demonstrate
the superior performance of SDC in state-of-the-art stereo matching, optical
flow, and scene flow algorithms on several well-known public benchmarks.
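To make the layer concrete, here is a minimal PyTorch sketch of a stacked dilated convolution block: parallel convolutions over the same input with different dilation rates, concatenated, with no striding so spatial resolution is preserved. The channel counts, kernel size, and dilation rates are our assumptions, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SDCLayer(nn.Module):
    """Stacked dilated convolution (SDC) layer sketch: several parallel
    convolutions share the same input but use different dilation rates,
    and their outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, ch_per_branch=16, dilations=(1, 2, 3, 4), k=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, ch_per_branch, k,
                      padding=d * (k - 1) // 2,  # keep spatial resolution
                      dilation=d)
            for d in dilations
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # no striding anywhere, so resolution is preserved while the
        # mixed dilation rates enlarge the receptive field
        return self.act(torch.cat([b(x) for b in self.branches], dim=1))

# toy usage: a descriptor network formed from a few stacked SDC layers
net = nn.Sequential(SDCLayer(3), SDCLayer(64), SDCLayer(64))
feat = net(torch.randn(1, 3, 64, 64))   # -> (1, 64, 64, 64) features
```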
Sparse Matrix Multiplication On An Associative Processor
Sparse matrix multiplication is an important component of linear algebra
computations. Implementing sparse matrix multiplication on an associative
processor (AP) enables a high level of parallelism, where a row of one matrix is
multiplied in parallel with the entire second matrix, and where the execution
time of vector dot product does not depend on the vector size. Four sparse
matrix multiplication algorithms are explored in this paper, combining AP and
baseline CPU processing to various levels. They are evaluated by simulation on
a large set of sparse matrices. The computational complexity of sparse matrix
multiplication on the AP is shown to be O(nnz), where nnz is the number of
nonzero elements. The AP is found to be especially efficient in binary sparse
matrix multiplication, and it outperforms conventional solutions in power
efficiency.
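The O(nnz) flavor of the result is easy to mirror in software. The sketch below is our own illustration, not the paper's AP algorithm: it multiplies sparse matrices stored as row-indexed nonzero lists, so the work scales with the number of nonzero partial products rather than with the cube of the matrix dimension; on the AP, each row-times-matrix step would additionally run in parallel.

```python
def sparse_matmul(a, b):
    """Multiply two sparse matrices given as {row: [(col, val), ...]}
    dicts holding only nonzero entries. Only nonzeros are visited, so
    the work grows with the number of nonzero partial products."""
    c = {}
    for i, row in a.items():
        acc = {}
        for k, av in row:                  # nonzeros of row i of A
            for j, bv in b.get(k, []):     # nonzeros of row k of B
                acc[j] = acc.get(j, 0.0) + av * bv
        if acc:
            c[i] = sorted(acc.items())
    return c

A = {0: [(1, 2.0)], 2: [(0, 3.0)]}   # 3x3 matrix with two nonzeros
B = {0: [(2, 4.0)], 1: [(1, 5.0)]}
print(sparse_matmul(A, B))           # {0: [(1, 10.0)], 2: [(2, 12.0)]}
```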