1,221 research outputs found
HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA
Heterogeneous embedded systems on chip (HESoCs) co-integrate a standard host
processor with programmable manycore accelerators (PMCAs) to combine
general-purpose computing with domain-specific, efficient processing
capabilities. While leading companies successfully advance their HESoC
products, research lags behind due to the challenges of building a prototyping
platform that unites an industry-standard host processor with an open research
PMCA architecture. In this work we introduce HERO, an FPGA-based research
platform that combines a PMCA composed of clusters of RISC-V cores, implemented
as soft cores on an FPGA fabric, with a hard ARM Cortex-A multicore host
processor. The PMCA architecture mapped on the FPGA is silicon-proven,
scalable, configurable, and fully modifiable. HERO includes a complete software
stack that consists of a heterogeneous cross-compilation toolchain with support
for OpenMP accelerator programming, a Linux driver, and runtime libraries for
both host and PMCA. HERO is designed to facilitate rapid exploration on all
software and hardware layers: run-time behavior can be accurately analyzed by
tracing events, and modifications can be validated through fully automated hard
ware and software builds and executed tests. We demonstrate the usefulness of
HERO by means of case studies from our research
From plasma to beefarm: Design experience of an FPGA-based multicore prototype
In this paper, we take a MIPS-based open-source uniprocessor soft core, Plasma, and extend it to obtain the Beefarm infrastructure for FPGA-based multiprocessor emulation, a popular research topic of the last few years both in the FPGA and the computer architecture communities. We discuss various design tradeoffs and we demonstrate superior scalability through experimental results compared to traditional software instruction set simulators. Based on our experience of designing and building a complete FPGA-based multiprocessor emulation system that supports run-time and compiler infrastructure and on the actual executions of our experiments running Software Transactional Memory (STM) benchmarks, we comment on the pros, cons and future trends of using hardware-based emulation for research.Peer ReviewedPostprint (author's final draft
Modularizing and Specifying Protocols among Threads
We identify three problems with current techniques for implementing protocols
among threads, which complicate and impair the scalability of multicore
software development: implementing synchronization, implementing coordination,
and modularizing protocols. To mend these deficiencies, we argue for the use of
domain-specific languages (DSL) based on existing models of concurrency. To
demonstrate the feasibility of this proposal, we explain how to use the model
of concurrency Reo as a high-level protocol DSL, which offers appropriate
abstractions and a natural separation of protocols and computations. We
describe a Reo-to-Java compiler and illustrate its use through examples.Comment: In Proceedings PLACES 2012, arXiv:1302.579
Embedded System Optimization of Radar Post-processing in an ARM CPU Core
Algorithms executed on the radar processor system contributes to a significant performance bottleneck of the overall radar system. One key performance concern is
the latency in target detection when dealing with hard deadline systems. Research has shown software optimization as one major contributor to radar system performance
improvements. This thesis aims at software optimizations using a manual and automatic approach and analyzing the results to make informed future decisions
while working with an ARM processor system. In order to ascertain an optimized implementation, a question put forward was whether the algorithms on the ARM
processor could work with a 6-antenna implementation without a decline in the performance. However, an answer would also help project how many additional
algorithms can still be added without performance decline.
The manual optimization was done based on the quantitative analysis of the software execution time. The manual optimization approach looked at the vectorization
strategy using the NEON vector register on the ARM CPU to reimplement the initial Constant False Alarm Rate(CFAR) Detection algorithm. An additional
optimization approach was eliminating redundant loops while going through the Range Gates and Doppler filters. In order to determine the best compiler for automatic
code optimization for the radar algorithms on the ARM processor, the GCC and Clang compilers were used to compile the initial algorithms and the optimized
implementation on the radar post-processing stage.
Analysis of the optimization results showed that it is possible to run the radar post-processing algorithms on the ARM processor at the 6-antenna implementation
without system load stress. In addition, the results show an excellent headroom margin based on the defined scenario. The result analysis further revealed that the
effect of dynamic memory allocation could not be underrated in situations where performance is a significant concern. Additional statements from the result demonstrated
that the GCC and Clang compiler has their strength and weaknesses when used in the compilation. One limiting factor to note on the optimization using the
NEON register is the sample size’s effect on the optimization implementation. Although it fits into the test samples used based on the defined scenario, there might
be varying results in varying window cell size situations that might not necessarily improve the time constraints
Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips
The trend in industry is towards heterogeneous multicore processors (HMCs),
including chips with CPUs and massively-threaded throughput-oriented processors
(MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the
cores with cache-coherent shared virtual memory (CCSVM), this is not the
communication paradigm used by any current HMC. In this paper, we present a
CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads
programming model, called xthreads, for programming this HMC. Our goal is to
evaluate the potential performance benefits of tightly coupling heterogeneous
cores with CCSVM
- …