1,221 research outputs found

    HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA

    Full text link
    Heterogeneous embedded systems on chip (HESoCs) co-integrate a standard host processor with programmable manycore accelerators (PMCAs) to combine general-purpose computing with domain-specific, efficient processing capabilities. While leading companies successfully advance their HESoC products, research lags behind due to the challenges of building a prototyping platform that unites an industry-standard host processor with an open research PMCA architecture. In this work we introduce HERO, an FPGA-based research platform that combines a PMCA composed of clusters of RISC-V cores, implemented as soft cores on an FPGA fabric, with a hard ARM Cortex-A multicore host processor. The PMCA architecture mapped on the FPGA is silicon-proven, scalable, configurable, and fully modifiable. HERO includes a complete software stack that consists of a heterogeneous cross-compilation toolchain with support for OpenMP accelerator programming, a Linux driver, and runtime libraries for both host and PMCA. HERO is designed to facilitate rapid exploration on all software and hardware layers: run-time behavior can be accurately analyzed by tracing events, and modifications can be validated through fully automated hard ware and software builds and executed tests. We demonstrate the usefulness of HERO by means of case studies from our research

    Performance Aspects of Synthesizable Computing Systems

    Get PDF

    From plasma to beefarm: Design experience of an FPGA-based multicore prototype

    Get PDF
    In this paper, we take a MIPS-based open-source uniprocessor soft core, Plasma, and extend it to obtain the Beefarm infrastructure for FPGA-based multiprocessor emulation, a popular research topic of the last few years both in the FPGA and the computer architecture communities. We discuss various design tradeoffs and we demonstrate superior scalability through experimental results compared to traditional software instruction set simulators. Based on our experience of designing and building a complete FPGA-based multiprocessor emulation system that supports run-time and compiler infrastructure and on the actual executions of our experiments running Software Transactional Memory (STM) benchmarks, we comment on the pros, cons and future trends of using hardware-based emulation for research.Peer ReviewedPostprint (author's final draft

    Modularizing and Specifying Protocols among Threads

    Full text link
    We identify three problems with current techniques for implementing protocols among threads, which complicate and impair the scalability of multicore software development: implementing synchronization, implementing coordination, and modularizing protocols. To mend these deficiencies, we argue for the use of domain-specific languages (DSL) based on existing models of concurrency. To demonstrate the feasibility of this proposal, we explain how to use the model of concurrency Reo as a high-level protocol DSL, which offers appropriate abstractions and a natural separation of protocols and computations. We describe a Reo-to-Java compiler and illustrate its use through examples.Comment: In Proceedings PLACES 2012, arXiv:1302.579

    Embedded System Optimization of Radar Post-processing in an ARM CPU Core

    Get PDF
    Algorithms executed on the radar processor system contributes to a significant performance bottleneck of the overall radar system. One key performance concern is the latency in target detection when dealing with hard deadline systems. Research has shown software optimization as one major contributor to radar system performance improvements. This thesis aims at software optimizations using a manual and automatic approach and analyzing the results to make informed future decisions while working with an ARM processor system. In order to ascertain an optimized implementation, a question put forward was whether the algorithms on the ARM processor could work with a 6-antenna implementation without a decline in the performance. However, an answer would also help project how many additional algorithms can still be added without performance decline. The manual optimization was done based on the quantitative analysis of the software execution time. The manual optimization approach looked at the vectorization strategy using the NEON vector register on the ARM CPU to reimplement the initial Constant False Alarm Rate(CFAR) Detection algorithm. An additional optimization approach was eliminating redundant loops while going through the Range Gates and Doppler filters. In order to determine the best compiler for automatic code optimization for the radar algorithms on the ARM processor, the GCC and Clang compilers were used to compile the initial algorithms and the optimized implementation on the radar post-processing stage. Analysis of the optimization results showed that it is possible to run the radar post-processing algorithms on the ARM processor at the 6-antenna implementation without system load stress. In addition, the results show an excellent headroom margin based on the defined scenario. The result analysis further revealed that the effect of dynamic memory allocation could not be underrated in situations where performance is a significant concern. Additional statements from the result demonstrated that the GCC and Clang compiler has their strength and weaknesses when used in the compilation. One limiting factor to note on the optimization using the NEON register is the sample size’s effect on the optimization implementation. Although it fits into the test samples used based on the defined scenario, there might be varying results in varying window cell size situations that might not necessarily improve the time constraints

    Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips

    Full text link
    The trend in industry is towards heterogeneous multicore processors (HMCs), including chips with CPUs and massively-threaded throughput-oriented processors (MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the cores with cache-coherent shared virtual memory (CCSVM), this is not the communication paradigm used by any current HMC. In this paper, we present a CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads programming model, called xthreads, for programming this HMC. Our goal is to evaluate the potential performance benefits of tightly coupling heterogeneous cores with CCSVM
    corecore