Synergy: A HW/SW Framework for High Throughput CNNs on Embedded Heterogeneous SoC
Convolutional Neural Networks (CNN) have been widely deployed in diverse
application domains. There has been significant progress in accelerating both
their training and inference using high-performance GPUs, FPGAs, and custom
ASICs for datacenter-scale environments. The recent proliferation of mobile and
IoT devices has necessitated real-time, energy-efficient deep neural network
inference on embedded-class, resource-constrained platforms. In this context,
we present {\em Synergy}, an automated, hardware-software co-designed,
pipelined, high-throughput CNN inference framework on embedded heterogeneous
system-on-chip (SoC) architectures (Xilinx Zynq). {\em Synergy} leverages,
through multi-threading, all the available on-chip resources, which include
the dual-core ARM processor along with the FPGA and the NEON SIMD engines as
accelerators. Moreover, {\em Synergy} provides a unified abstraction of the
heterogeneous accelerators (FPGA and NEON) and can adapt to different network
configurations at runtime without changing the underlying hardware accelerator
architecture by balancing workload across accelerators through work-stealing.
{\em Synergy} achieves 7.3X speedup, averaged across seven CNN models, over a
well-optimized software-only solution. {\em Synergy} demonstrates substantially
better throughput and energy efficiency compared to contemporary CNN
implementations on the same SoC architecture.
Comment: 34 pages, submitted to ACM Transactions on Embedded Computing Systems
(TECS).
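As a rough illustration of the work-stealing idea described in this abstract
(a minimal sketch, not Synergy's actual scheduler; the worker names, cost model
and tile granularity below are assumptions), heterogeneous worker threads can
steal queued CNN tiles from one another when their own queues run dry:

    # Hypothetical work-stealing sketch across heterogeneous workers.
    # The "fpga"/"neon" names and sleep-based costs are illustrative only.
    import collections, random, threading, time

    WORKERS = ["fpga", "neon0", "neon1"]            # assumed accelerator pool
    queues = {w: collections.deque() for w in WORKERS}
    lock = threading.Lock()

    def run_tile(worker, tile):
        # Stand-in for dispatching one convolution tile to an accelerator.
        time.sleep(0.001 if worker == "fpga" else 0.003)

    def worker_loop(name, done):
        while not done.is_set():
            with lock:
                if queues[name]:
                    tile = queues[name].popleft()            # own work: FIFO
                else:
                    victims = [w for w in WORKERS if queues[w]]
                    if not victims:
                        continue
                    tile = queues[random.choice(victims)].pop()  # steal: LIFO
            run_tile(name, tile)

    done = threading.Event()
    for i in range(64):                    # all tiles initially queued on the FPGA
        queues["fpga"].append(("conv1", i))
    threads = [threading.Thread(target=worker_loop, args=(w, done)) for w in WORKERS]
    for t in threads:
        t.start()
    while any(queues.values()):
        time.sleep(0.01)
    done.set()
    for t in threads:
        t.join()
    print("all tiles processed")

Per the abstract, Synergy applies this principle so that a change in network
configuration only shifts the tile distribution across accelerators rather
than requiring a new hardware accelerator architecture.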
A Framework For Performance Evaluation Of ASIPS In Network-Based IDS
Nowadays, the efficient use of advanced security tools and appliances is
regarded as an important criterion for improving the security of computer
networks. Under this assumption, Intrusion Detection and Prevention Systems
(IDPS) play a key role in applying a defense-in-depth strategy. With growing
network bandwidth and an increasing number of threats, network-based IDPSes
face a performance challenge in processing the huge volume of traffic on
modern networks. A common remedy for this bottleneck is to exploit efficient
hardware architectures to improve IDPS performance. In this paper, a framework
for the analysis and performance evaluation of application-specific
instruction set processors (ASIPs) is presented for attack detection in
Network-based Intrusion Detection Systems (NIDS). By running this framework as
a security application on V850, OR1K, MIPS32, ARM7TDMI and PowerPC32
microprocessors, their performance is evaluated and analyzed. Compiler
optimization levels are employed to improve performance and, building on the
O2 optimization level, a new combination of optimization flags is presented.
The experiments show that the framework yields an 18.10% performance
improvement for pattern matching on the ARM7TDMI microprocessor.
Comment: 13 pages, 3 figures, International Journal of Network Security & Its
Applications (IJNSA), Vol.4, No.5, September 201
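As a hedged sketch of the flag-sweep methodology (the benchmark source, input
files and extra flags below are placeholders, not the combination proposed in
the paper, and a cross-compiler for the evaluated ISAs would replace gcc), one
could compare optimization configurations roughly like this:

    # Illustrative optimization-flag sweep for a pattern-matching benchmark.
    # "matcher.c", "patterns.txt" and "traffic.pcap" are hypothetical files.
    import subprocess, time

    FLAG_SETS = {
        "O0": ["-O0"],
        "O2": ["-O2"],
        "O2+extra": ["-O2", "-funroll-loops", "-finline-functions"],  # example only
    }

    results = {}
    for name, flags in FLAG_SETS.items():
        subprocess.run(["gcc", *flags, "-o", "matcher", "matcher.c"], check=True)
        start = time.perf_counter()
        subprocess.run(["./matcher", "patterns.txt", "traffic.pcap"], check=True)
        results[name] = time.perf_counter() - start

    base = results["O2"]
    for name, t in results.items():
        print(f"{name:10s} {t:.3f}s  ({100 * (base - t) / base:+.1f}% vs O2)")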
CoCoPIE: Making Mobile AI Sweet As PIE --Compression-Compilation Co-Design Goes a Long Way
Assuming hardware is the major constraint for enabling real-time mobile
intelligence, the industry has mainly dedicated its efforts to developing
specialized hardware accelerators for machine learning and inference. This
article challenges the assumption. By drawing on a recent real-time AI
optimization framework CoCoPIE, it maintains that with effective
compression-compiler co-design, it is possible to enable real-time artificial
intelligence on mainstream end devices without special hardware. CoCoPIE is a
software framework that holds numerous records on mobile AI: the first
framework that supports all main kinds of DNNs, from CNNs to RNNs, transformers,
language models, and so on; the fastest DNN pruning and acceleration framework,
up to 180X faster compared with current DNN pruning on other frameworks such as
TensorFlow-Lite; making many representative AI applications able to run in
real time on off-the-shelf mobile devices, which was previously regarded as
possible only with special hardware support; and making off-the-shelf mobile
devices outperform a number of representative ASIC and FPGA solutions in terms
of energy efficiency and/or performance.
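CoCoPIE's own compression relies on pattern-based pruning co-designed with the
compiler; as a much simpler stand-in, the sketch below applies plain
magnitude-based filter pruning to a convolution weight tensor with NumPy, just
to make the general idea of weight pruning concrete:

    # Minimal magnitude-based filter pruning sketch (NOT CoCoPIE's pattern-based
    # scheme): keep only the output filters with the largest L1 norm.
    import numpy as np

    def prune_filters(weights, keep_ratio=0.5):
        """weights: (out_channels, in_channels, kh, kw) conv kernel."""
        norms = np.abs(weights).sum(axis=(1, 2, 3))    # L1 norm per filter
        k = max(1, int(keep_ratio * weights.shape[0]))
        keep = np.sort(np.argsort(norms)[-k:])          # strongest filters
        return weights[keep], keep

    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 32, 3, 3)).astype(np.float32)
    pruned, kept = prune_filters(w, keep_ratio=0.25)
    print(pruned.shape)   # (16, 32, 3, 3): 75% of filters removed

In the compression-compilation co-design the abstract describes, the compiler
is then specialized to the resulting sparsity pattern rather than treating the
pruned model as an opaque dense network.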
DNNVM : End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators
The convolutional neural network (CNN) has become a state-of-the-art method
for several artificial intelligence domains in recent years. The increasingly
complex CNN models are both computation-bound and I/O-bound. FPGA-based
accelerators driven by custom instruction set architecture (ISA) achieve a
balance between generality and efficiency, but much room for optimization
remains. We propose the full-stack compiler DNNVM, which is an integration of
optimizers for graphs, loops and data layouts, and an assembler, a runtime
supporter and a validation environment. The DNNVM works in the context of deep
learning frameworks and transforms CNN models into a directed acyclic graph,
XGraph. Based on XGraph, we transform the optimization challenges for both the
data layout and pipeline into graph-level problems. DNNVM enumerates all
potentially profitable fusion opportunities by a heuristic subgraph isomorphism
algorithm to leverage pipeline and data layout optimizations, and searches for
the best choice of execution strategies of the whole computing graph. On the
Xilinx ZU2 @330 MHz and ZU9 @330 MHz, we achieve performance equivalent to the
state of the art on our benchmarks even with naive implementations and no
optimizations, and throughput is further improved by up to 1.26x by leveraging
the heterogeneous
optimizations in DNNVM. Finally, with ZU9 @330 MHz, we achieve state-of-the-art
performance for VGG and ResNet50. We achieve a throughput of 2.82 TOPs/s and an
energy efficiency of 123.7 GOPs/s/W for VGG. Additionally, we achieve 1.38
TOPs/s for ResNet50 and 1.41 TOPs/s for GoogLeNet.
Comment: 18 pages, 9 figures, 5 tables.
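To make the graph-level fusion idea concrete, here is a toy sketch (not
DNNVM's heuristic subgraph-isomorphism algorithm over XGraph) that scans a
small operator DAG for conv followed by relu and fuses each match into a
single node:

    # Toy operator-fusion pass: fuse conv -> relu when the conv output feeds
    # only the relu. Illustrative only; DNNVM enumerates richer fusion templates.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        name: str
        op: str
        inputs: list = field(default_factory=list)   # names of producer nodes

    def fuse_conv_relu(nodes):
        by_name = {n.name: n for n in nodes}
        consumers = {n.name: [] for n in nodes}
        for n in nodes:
            for i in n.inputs:
                if i in consumers:
                    consumers[i].append(n.name)
        fused, skip, rename = [], set(), {}
        for n in nodes:
            if n.op == "conv" and len(consumers[n.name]) == 1:
                succ = by_name[consumers[n.name][0]]
                if succ.op == "relu":
                    new = Node(n.name + "+" + succ.name, "conv_relu", list(n.inputs))
                    fused.append(new)
                    skip.update({n.name, succ.name})
                    rename[succ.name] = new.name   # relu consumers now read the fused node
                    continue
            fused.append(n)
        kept = [n for n in fused if n.name not in skip]
        for n in kept:                              # rewire inputs to fused nodes
            n.inputs = [rename.get(i, i) for i in n.inputs]
        return kept

    graph = [Node("c1", "conv", ["in"]), Node("r1", "relu", ["c1"]),
             Node("p1", "pool", ["r1"])]
    print([(n.op, n.inputs) for n in fuse_conv_relu(graph)])
    # [('conv_relu', ['in']), ('pool', ['c1+r1'])]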
In-RDBMS Hardware Acceleration of Advanced Analytics
The data revolution is fueled by advances in machine learning, databases, and
hardware design. Programmable accelerators are making their way into each of
these areas independently. As such, there is a void of solutions that enable
hardware acceleration at the intersection of these disjoint fields. This paper
sets out to be the initial step towards a unifying solution for in-Database
Acceleration of Advanced Analytics (DAnA). Deploying specialized hardware, such
as FPGAs, for in-database analytics currently requires hand-designing the
hardware and manually routing the data. Instead, DAnA automatically maps a
high-level specification of advanced analytics queries to an FPGA accelerator.
The accelerator implementation is generated for a User Defined Function (UDF),
expressed as a part of an SQL query using a Python-embedded Domain-Specific
Language (DSL). To realize an efficient in-database integration, DAnA
accelerators contain a novel hardware structure, Striders, that directly
interface with the buffer pool of the database. Striders extract, cleanse, and
process the training data tuples that are consumed by a multi-threaded FPGA
engine that executes the analytics algorithm. We integrate DAnA with PostgreSQL
to generate hardware accelerators for a range of real-world and synthetic
datasets running diverse ML algorithms. Results show that DAnA-enhanced
PostgreSQL provides, on average, 8.3x end-to-end speedup for real datasets,
with a maximum of 28.2x. Moreover, DAnA-enhanced PostgreSQL is, on average,
4.0x faster than the multi-threaded Apache MADLib running on Greenplum. DAnA
provides these benefits while hiding the complexity of hardware design from
data scientists and allowing them to express the algorithm in ≈30-60 lines of
Python.
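The abstract notes that an analytics algorithm fits in roughly 30-60 lines of
Python; the sketch below shows the kind of algorithm such a UDF would express,
written in plain NumPy rather than DAnA's actual Python-embedded DSL, whose
syntax is not given here:

    # Illustrative analytics UDF body: logistic regression trained by SGD.
    import numpy as np

    def logistic_regression_sgd(X, y, lr=0.1, epochs=20):
        """X: (n, d) training tuples, y: (n,) labels in {0, 1}."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                p = 1.0 / (1.0 + np.exp(-xi @ w))   # sigmoid prediction
                w -= lr * (p - yi) * xi              # stochastic gradient step
        return w

    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 4))
    y = (X @ np.array([1.0, -2.0, 0.5, 0.0]) > 0).astype(float)
    print(np.round(logistic_regression_sgd(X, y), 2))

In DAnA, a specification of this kind is compiled to an FPGA accelerator whose
Striders pull the training tuples directly from PostgreSQL's buffer pool,
rather than executing on the CPU as above.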
CUDAMPF++: A Proactive Resource Exhaustion Scheme for Accelerating Homologous Sequence Search on CUDA-enabled GPU
Genomic sequence alignment is an important research topic in bioinformatics
and continues to attract significant efforts. As genomic data grow
exponentially, however, most alignment methods face challenges due to their
huge computational costs. HMMER, a suite of bioinformatics tools, is widely
used for the analysis of homologous protein and nucleotide sequences with high
sensitivity, based on profile hidden Markov models (HMMs). Its latest version,
HMMER3, introduces a heuristic pipeline to accelerate the alignment process,
which is carried out on central processing units (CPUs) with the support of
streaming SIMD extensions (SSE) instructions. Few acceleration results have
since been reported based on HMMER3. In this paper, we propose a five-tiered
parallel framework, CUDAMPF++, to accelerate the most computationally intensive
stages of HMMER3's pipeline, multiple/single segment Viterbi (MSV/SSV), on a
single graphics processing unit (GPU). As an architecture-aware design, the
proposed framework aims to fully utilize hardware resources via exploiting
finer-grained parallelism (multi-sequence alignment) compared with its
predecessor (CUDAMPF). In addition, we propose a novel method that proactively
sacrifices L1 Cache Hit Ratio (CHR) to get improved performance and scalability
in return. A comprehensive evaluation shows that the proposed framework
outperforms all existing work and exhibits good consistency in performance
regardless of the variation of query models or protein sequence datasets. For
MSV (SSV) kernels, the peak performance of the CUDAMPF++ is 283.9 (471.7) GCUPS
on a single K40 GPU, and impressive speedups ranging from 1.x (1.7x) to 168.3x
(160.7x) are achieved over the CPU-based implementation (16 cores, 32 threads).
Comment: 15 pages, submitted to an academic journal.
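The MSV/SSV stages are essentially ungapped, match-state-only scoring passes
over a profile HMM. As a hedged, CPU-side reference (far from HMMER3's striped
SIMD filter or the warp-level CUDA kernels in CUDAMPF++), such a pass might be
sketched like this:

    # Simplified MSV-like ungapped match-state scoring (illustrative only;
    # the real MSV filter includes additional special states and scaling).
    import numpy as np

    def msv_like_score(match_scores, sequence):
        """match_scores: (M, A) emission scores for an M-state profile over an
        alphabet of size A; sequence: iterable of residue indices."""
        M = match_scores.shape[0]
        prev = np.full(M, -np.inf)
        best = -np.inf
        for res in sequence:
            cur = np.empty(M)
            cur[0] = match_scores[0, res]                       # start a new diagonal
            cur[1:] = np.maximum(prev[:-1], 0.0) + match_scores[1:, res]
            best = max(best, cur.max())
            prev = cur
        return best

    rng = np.random.default_rng(2)
    profile = rng.normal(-1.0, 1.0, size=(10, 4))   # toy 10-state DNA profile
    seq = rng.integers(0, 4, size=50)
    print(round(msv_like_score(profile, seq), 3))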
Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges
With the emergence of big data applications such as machine learning, speech
recognition, artificial intelligence, and DNA sequencing in recent years, the
computer architecture research community is facing an explosive growth in
data. To compute efficiently on such data-intensive workloads, studies of
heterogeneous accelerators targeting the latest applications have become a hot
topic in the computer architecture domain. At present, heterogeneous
accelerators are mainly implemented with heterogeneous computing units such as
Application-Specific Integrated Circuits (ASICs), Graphics Processing Units
(GPUs), and Field Programmable Gate Arrays (FPGAs). Among these architectures,
FPGA-based reconfigurable accelerators have two merits. First, an FPGA
contains a large number of reconfigurable circuits, which can satisfy the
requirements of high performance and low power consumption for specific
applications. Second, FPGA-based reconfigurable architectures allow prototype
systems to be built rapidly and offer excellent customizability and
reconfigurability. A growing number of acceleration works based on FPGAs and
other reconfigurable architectures have recently appeared in top-tier computer
architecture conferences. To review this recent work, this survey takes the
latest research on reconfigurable accelerator architectures and their
algorithmic applications as its basis. We compare the main research issues and
application domains, and analyze and illuminate the advantages, disadvantages,
and challenges of reconfigurable accelerators. Finally, we discuss the likely
development of accelerator architectures in the future, hoping to provide a
reference for computer architecture researchers.
The Andromeda Study: A Femto-Spacecraft Mission to Alpha Centauri
This paper discusses the physics, engineering and mission architecture
relating to a gram-sized interstellar probe propelled by a laser beam. The
objectives are to design a fly-by mission to Alpha Centauri with a total
mission duration of 50 years travelling at a cruise speed of 0.1c. Furthermore,
optical data from the target star system is to be obtained and sent back to the
Solar system. The main challenges of such a mission are presented and possible
solutions proposed. The results show that by extrapolating from currently
existing technology, such a mission would be feasible. The total mass of the
proposed spacecraft is 23 g and the space-based laser infrastructure has a beam
power output of 15 GW. Further exploration of the laser-spacecraft tradespace
and the associated technologies is necessary.
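As a back-of-the-envelope check on the quoted mission parameters (assuming a
distance of about 4.37 light-years to the Alpha Centauri system, a figure not
taken from the paper, and ignoring the acceleration phase), the cruise and
data-return times work out roughly as follows:

    # Rough mission-duration arithmetic for a 0.1c fly-by of Alpha Centauri.
    DISTANCE_LY = 4.37        # assumed distance in light-years (approx.)
    CRUISE_SPEED = 0.1        # fraction of the speed of light

    cruise_years = DISTANCE_LY / CRUISE_SPEED     # ~43.7 years in transit
    signal_return_years = DISTANCE_LY             # light-speed downlink home
    print(f"cruise: {cruise_years:.1f} yr, first data received after "
          f"{cruise_years + signal_return_years:.1f} yr")

Under these assumptions the ~43.7-year transit is consistent with the stated
50-year mission duration at a 0.1c cruise, with the returned optical data
arriving roughly 4.4 years after the encounter.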
Exploring Computation-Communication Tradeoffs in Camera Systems
Cameras are the de facto sensor. The growing demand for real-time and
low-power computer vision, coupled with trends towards high-efficiency
heterogeneous systems, has given rise to a wide range of image processing
acceleration techniques at the camera node and in the cloud. In this paper, we
characterize two novel camera systems that use acceleration techniques to push
the extremes of energy and performance scaling, and explore the
computation-communication tradeoffs in their design. The first case study
targets a camera system designed to detect and authenticate individual faces,
running solely on energy harvested from RFID readers. We design a
multi-accelerator SoC operating in the sub-mW range, and evaluate it
with real-world workloads to show performance and energy efficiency
improvements over a general purpose microprocessor. The second camera system
supports a 16-camera rig processing over 32 Gb/s of data to produce real-time
3D, 360-degree virtual reality video. We design a multi-FPGA processing pipeline
that outperforms CPU and GPU configurations by up to 10x in computation time,
producing panoramic stereo video directly from the camera rig at 30 frames per
second. We find that an early data reduction step, either before complex
processing or offloading, is the most critical optimization for in-camera
systems.
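As a minimal sketch of the early-data-reduction point (the frame size, region
of interest and downsampling factor below are assumptions, not the pipelines
of either case study), cropping to a detected region and downsampling before
offload can shrink the bytes sent off-camera by well over an order of
magnitude:

    # Toy illustration of in-camera data reduction before offloading.
    import numpy as np

    frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # one raw 1080p RGB frame
    roi = frame[300:700, 800:1200]                       # hypothetical face ROI
    reduced = roi[::2, ::2]                              # 2x spatial downsample

    print(f"raw: {frame.nbytes/1e6:.1f} MB, reduced: {reduced.nbytes/1e6:.2f} MB, "
          f"reduction: {frame.nbytes/reduced.nbytes:.0f}x")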
Quantum Computer Architecture: Towards Full-Stack Quantum Accelerators
This paper presents the definition and implementation of a quantum computer
architecture to enable the creation of a new computational device: a quantum computer
as an accelerator. In this paper, we present explicitly the idea of a quantum
accelerator which contains the full stack of the layers of an accelerator. Such
a stack starts at the highest level describing the target application of the
accelerator. The next layer abstracts the quantum logic outlining the algorithm
that is to be executed on the quantum accelerator. In our case, the logic is
expressed in the universal quantum-classical hybrid computation language
developed in the group, called OpenQL, which views the quantum processor
as a computational accelerator. The OpenQL compiler translates the program to a
common assembly language, called cQASM, which can be executed on a quantum
simulator. The cQASM represents the instruction set that can be executed by the
micro-architecture implemented in the quantum accelerator. In a subsequent
step, the compiler can convert the cQASM to generate the eQASM, which is
executable on a particular experimental device incorporating the
platform-specific parameters. This way, we are able to distinguish clearly
between the experimental research towards better qubits and the industrial and
societal
applications that need to be developed and executed on a quantum device. The
first case offers experimental physicists a full-stack experimental platform
using realistic qubits with decoherence and error rates, while the second case
offers perfect qubits to the quantum application developer, with no decoherence
or errors. We conclude the paper by explicitly
presenting three examples of full-stack quantum accelerators, for an
experimental superconducting processor, for quantum accelerated genome
sequencing and for near-term generic optimisation problems based on quantum
heuristic approaches.
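To give a flavour of the compilation target described above, the toy sketch
below emits a cQASM-style Bell-state kernel as plain text. It stands in for
what the OpenQL compiler produces and is not the OpenQL API itself, and the
exact cQASM syntax shown is only approximate:

    # Toy generator for a cQASM-style listing of a 2-qubit Bell-state kernel.
    # Approximate dialect for illustration; the cQASM emitted by OpenQL may
    # differ in details.
    def bell_state_cqasm(num_qubits=2):
        lines = ["version 1.0", f"qubits {num_qubits}", "", ".bell"]
        lines += ["    prep_z q[0]", "    prep_z q[1]",
                  "    h q[0]",                 # put qubit 0 into superposition
                  "    cnot q[0], q[1]",        # entangle qubits 0 and 1
                  "    measure q[0]", "    measure q[1]"]
        return "\n".join(lines)

    print(bell_state_cqasm())

In the full stack described in the paper, such a kernel would be produced by
the OpenQL compiler, executed on a simulator, or lowered further to eQASM for
a particular experimental device.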