2,099 research outputs found
A Review on Software Architectures for Heterogeneous Platforms
The increasing demands for computing performance have been a reality
regardless of the requirements for smaller and more energy efficient devices.
Throughout the years, the strategy adopted by industry was to increase the
robustness of a single processor by increasing its clock frequency and mounting
more transistors so more calculations could be executed. However, it is known
that the physical limits of such processors are being reached, and one way to
fulfill such increasing computing demands has been to adopt a strategy based on
heterogeneous computing, i.e., using a heterogeneous platform containing more
than one type of processor. This way, different types of tasks can be executed
by processors that are specialized in them. Heterogeneous computing, however,
poses a number of challenges to software engineering, especially in the
architecture and deployment phases. In this paper, we conduct an empirical
study that aims at discovering the state-of-the-art in software architecture
for heterogeneous computing, with focus on deployment. We conduct a systematic
mapping study that retrieved 28 studies, which were critically assessed to
obtain an overview of the research field. We identified gaps and trends that
can be used by both researchers and practitioners as guides to further
investigate the topic
FASTCUDA: Open Source FPGA Accelerator & Hardware-Software Codesign Toolset for CUDA Kernels
Using FPGAs as hardware accelerators that communicate with a central CPU is becoming a common practice in the embedded design world but there is no standard methodology and toolset to facilitate this path yet. On the other hand, languages such as CUDA and OpenCL provide standard development environments for Graphical Processing Unit (GPU) programming. FASTCUDA is a platform that provides the necessary software toolset, hardware architecture, and design methodology to efficiently adapt the CUDA approach into a new FPGA design flow. With FASTCUDA, the CUDA kernels of a CUDA-based application are partitioned into two groups with minimal user intervention: those that are compiled and executed in parallel software, and those that are synthesized and implemented in hardware. A modern low power FPGA can provide the processing power (via numerous embedded micro-CPUs) and the logic capacity for both the software and hardware implementations of the CUDA kernels. This paper describes the system requirements and the architectural decisions behind the FASTCUDA approach
A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems
Recent technological advances have greatly improved the performance and
features of embedded systems. With the number of just mobile devices now
reaching nearly equal to the population of earth, embedded systems have truly
become ubiquitous. These trends, however, have also made the task of managing
their power consumption extremely challenging. In recent years, several
techniques have been proposed to address this issue. In this paper, we survey
the techniques for managing power consumption of embedded systems. We discuss
the need of power management and provide a classification of the techniques on
several important parameters to highlight their similarities and differences.
This paper is intended to help the researchers and application-developers in
gaining insights into the working of power management techniques and designing
even more efficient high-performance embedded systems of tomorrow
An Efficient and Cost Effective FPGA Based Implementation of the Viola-Jones Face Detection Algorithm
We present an field programmable gate arrays (FPGA) based implementation of the popular Viola-Jones face detection algorithm, which is an essential building block in many applications such as video surveillance and tracking. Our implementation is a complete system level hardware design described in a hardware description language and validated on the affordable DE2-115 evaluation board. Our primary objective is to study the achievable performance with a low-end FPGA chip based implementation. In addition, we release to the public domain the entire project. We hope that this will enable other researchers to easily replicate and compare their results to ours and that it will encourage and facilitate further research and educational ideas in the areas of image processing, computer vision, and advanced digital design and FPGA prototyping
A Reconfigurable Vector Instruction Processor for Accelerating a Convection Parametrization Model on FPGAs
High Performance Computing (HPC) platforms allow scientists to model
computationally intensive algorithms. HPC clusters increasingly use
General-Purpose Graphics Processing Units (GPGPUs) as accelerators; FPGAs
provide an attractive alternative to GPGPUs for use as co-processors, but they
are still far from being mainstream due to a number of challenges faced when
using FPGA-based platforms. Our research aims to make FPGA-based high
performance computing more accessible to the scientific community. In this work
we present the results of investigating the acceleration of a particular
atmospheric model, Flexpart, on FPGAs. We focus on accelerating the most
computationally intensive kernel from this model. The key contribution of our
work is the architectural exploration we undertook to arrive at a solution that
best exploits the parallelism available in the legacy code, and is also
convenient to program, so that eventually the compilation of high-level legacy
code to our architecture can be fully automated. We present the three different
types of architecture, comparing their resource utilization and performance,
and propose that an architecture where there are a number of computational
cores, each built along the lines of a vector instruction processor, works best
in this particular scenario, and is a promising candidate for a generic
FPGA-based platform for scientific computation. We also present the results of
experiments done with various configuration parameters of the proposed
architecture, to show its utility in adapting to a range of scientific
applications.Comment: This is an extended pre-print version of work that was presented at
the international symposium on Highly Efficient Accelerators and
Reconfigurable Technologies (HEART2014), Sendai, Japan, June 911, 201
Reconfigurable computing for large-scale graph traversal algorithms
This thesis proposes a reconfigurable computing approach for supporting parallel processing in large-scale graph traversal algorithms. Our approach is based on a reconfigurable hardware architecture which exploits the capabilities of both FPGAs (Field-Programmable Gate Arrays) and a multi-bank parallel memory subsystem.
The proposed methodology to accelerate graph traversal algorithms has been applied to three case studies, revealing that application-specific hardware customisations can benefit performance. A summary of our four contributions is as follows.
First, a reconfigurable computing approach to accelerate large-scale graph traversal algorithms. We propose a reconfigurable hardware architecture which decouples computation and communication while keeping multiple memory requests in flight at any given time, taking advantage of the high bandwidth of multi-bank memory subsystems.
Second, a demonstration of the effectiveness of our approach through two case studies: the breadth-first search algorithm, and a graphlet counting algorithm from bioinformatics. Both case studies involve graph traversal, but each of them adopts a different graph data representation.
Third, a method for using on-chip memory resources in FPGAs to reduce off-chip memory accesses for accelerating graph traversal algorithms, through a case-study of the All-Pairs Shortest-Paths algorithm. This case study has been applied to process human brain network data.
Fourth, an evaluation of an approach based on instruction-set extension for FPGA design against many-core GPUs (Graphics Processing Units), based on a set of benchmarks with different memory access characteristics. It is shown that while GPUs excel at streaming applications, the proposed approach can outperform GPUs in applications with poor locality characteristics, such as graph traversal problems.Open Acces
- …