Search CORE

139 research outputs found

Rapid codesign of a soft vector processor and its compiler

Author: Moore SW
Naylor M
Publication venue: Conference Digest - 24th International Conference on Field Programmable Logic and Applications, FPL 2014
Publication date: 01/09/2014
Field of study

Despite a decade of activity in the development of soft vector processors for FPGAs, high-level language support remains thin. We attribute this problem to a design method in which the high-level vector programming interface is only really considered once the processor architecture has been perfected, by which point the designer may be committed to the timeconsuming development of a complicated compiler. In this paper, we present the codesign of a soft vector processor and a lightweight compiler, which together lift the level of abstraction for the programmer while allowing a rapid compiler implementation phase.We demonstrate the effectiveness of our approach on a range of applications from digital signal processing, neuroscience, and machine learning.This work is sponsored by EPSRC grant EP/G015783/1.This is the accepted manuscript version. The final version is available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6927425&tag=1. © IEEE 201

Crossref

Apollo (Cambridge)

EMVS: Embedded Multi Vector-core System

Author: Ayguadé Parra Eduard
Cristal Kestelman Adrián
Haider Amna
Hussain Tassadaq
Publication venue: 'Elsevier BV'
Publication date: 01/01/2018
Field of study

With the increase in the density and performance of digital electronics, the demand for a power-efficient high-performance computing (HPC) system has been increased for embedded applications. The existing embedded HPC systems suffer from issues like programmability, scalability, and portability. Therefore, a parameterizable and programmable high-performance processor system architecture is required to execute the embedded HPC applications. In this work, we proposed an Embedded Multi Vector-core System (EMVS) which executes the embedded application by managing the multiple vectorized tasks and their memory operations. The system is designed and ported on an Altera DE4 FPGA development board. The performance of EMVS is compared with the Heterogeneous Multi-Processing Odroid XU3, Parallela and GPU Jetson TK1 embedded systems. In contrast to the embedded systems, the results show that EMVS improves 19.28 and 10.22 times of the application and system performance respectively and consumes 10.6 times less energy.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Digital.CSIC

Hardware-software co-design for low-cost AI processing in space processors

Author: Solé i Bonet Marc
Publication venue: Universitat Politècnica de Catalunya
Publication date: 28/10/2021
Field of study

In the recent years there has been an increasing interest in artificial intelligence (AI) and machine learning (ML). The advantages of such applications are widespread across many areas and have drawn the attention of different sectors, such as aerospace. However, these applications require much more performance than the one provided by space processors. In space the environment is not ideal for high-performance cutting-edge processors, due to radiation. For this reason, radiation hardened or radiation tolerant processors are required, which use older technologies and redundant logic, reducing the available die resources that can be exploited. In order to accelerate demanding AI applications in space processors, this thesis presents SPARROW, a low-cost SIMD accelerator for AI operations. SPARROW has been designed following a hardware-software co-design approach by analyzing the requirements of common AI applications in order to improve the efficiency of the module. The design of such module does not use any existing vector extension and instead has in its portability one of the key advantages over other implementations. Furthermore, SPARROW reuses the integer register file of the processor avoiding complex managing of the data while reducing significantly the hardware cost of the module, which is specially interesting in the space domain due to the constraints in the processor area. SPARROW operates with 8-bit integer vector components in two different stages, performing parallel computations in the first and reduction operations in the second. This design is integrated within the baseline processor not requiring any additional pipeline stage nor a modification of the processor frequency. SPARROW also includes swizzling and masking capabilities for the input vectors as well as saturation to work with 8 bits without overflow. SPARROW has been integrated with the LEON3 and NOEL-V space-grade processors, both distributed by Cobham Gaisler. Since each of the baseline processors has a different architecture set, software support for SPARROW has been provided for both SPARC v8 and RISC-V ISAs, showing the portability of the design. Software support been developed using two well established compilers, LLVM and GCC allowing for a comparison of the cost of developing support for each of them. The modifications have included the SPARROW instructions in the assembly language of each architecture and with the use of inline assembly and macros allow a programming model similar to SIMD intrinsics. LEON3 and NOEL-V extended with SPARROW have been implemented on a FPGA to evaluate the performance increase provided by our proposal. In order to compare the performance with the scalar version of the processor, different AI related applications have been tested such as matrix multiplication and image filters, which are essential building blocks for convolutional neural networks. With the use of SPARROW a speed-ups of 6x and up to 15x have been achieved

UPCommons. Portal del coneixement obert de la UPC

Vector processor virtualization: distributed memory hierarchy and simultaneous multithreading

Author: Rooholamin SeyedAmin
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/2016
Field of study

Taking advantage of DLP (Data-Level Parallelism) is indispensable in most data streaming and multimedia applications. Several architectures have been proposed to improve both the performance and energy consumption for such applications. Superscalar and VLIW (Very Long Instruction Word) processors, along with SIMD (Single-Instruction Multiple-Data) and vector processor (VP) accelerators, are among the available options for designers to accomplish their desired requirements. On the other hand, these choices turn out to be large resource and energy consumers, while also not being always used efficiently due to data dependencies among instructions and limited portion of vectorizable code in single applications that deploy them. This dissertation proposes an innovative architecture for a multithreaded VP which separates the path for performing data shuffle and memory-indexed accesses from the data path for executing other vector instructions that access the memory. This separation speeds up the most common memory access operations by avoiding extra delays and unnecessary stalls. In this multilane-based VP design, each vector lane uses its own private memory to avoid any stalls during memory access instructions. More importantly, the proposed VP has an innovative multithreaded architecture which makes it highly suitable for concurrent sharing in multicore environments. To this end, the VP which is developed in VHDL and prototyped on an FPGA (Field-Programmable Gate Array), serves as a coprocessor for one or more scalar cores in various system architectures presented in the dissertation. In the first system architecture, the VP is allocated exclusively to a single scalar core. Benchmarking shows that the VP can achieve very high performance. The inclusion of distributed data shuffle engines across vector lanes has a spectacular impact on the execution time, primarily for applications like FFT (Fast-Fourier Transform) that require large amounts of data shuffling. In the second system architecture, a VP virtualization technique is presented which, when applied, enables the multithreaded VP to simultaneously execute many threads of various vector lengths. The threads compete simultaneously for the VP resources having as a goal an improved aggregate VP utilization. This approach yields high VP utilization even under low utilization for the individual threads. A vector register file (VRF) virtualization technique dynamically allocates physical vector registers to running threads. The technique is implemented for a multi-core processor embedded in an FPGA. Under the dynamic creation of threads, benchmarking demonstrates large VP speedups and drastic energy savings when compared to the first system architecture. In the last system architecture, further improvements focus on VP virtualization relying exclusively on hardware. Moreover, a pipelined data shuffle network replaces the non-pipelined shuffle engines. The VP can then take advantage of identical instruction flows that may be present in different vector applications by running in a fused instruction mode that increases its utilization. A power dissipation model is introduced as well as two optimization policies towards minimizing the consumed energy, or the product of the energy and runtime for a given application. Benchmarking shows the positive impact of these optimizations

Digital Commons @ New Jersey Institute of Technology (NJIT)

Memory controller for vector processor

Author: Ayguadé Parra Eduard
Cristal Kestelman Adrián
Hussain Tassadaq
Palomar Oscar
Unsal Osman Sabri
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018
Field of study

To manage power and memory wall affects, the HPC industry supports FPGA reconfigurable accelerators and vector processing cores for data-intensive scientific applications. FPGA based vector accelerators are used to increase the performance of high-performance application kernels. Adding more vector lanes does not affect the performance, if the processor/memory performance gap dominates. In addition if on/off-chip communication time becomes more critical than computation time, causes performance degradation. The system generates multiple delays due to application’s irregular data arrangement and complex scheduling scheme. Therefore, just like generic scalar processors, all sets of vector machine – vector supercomputers to vector microprocessors – are required to have data management and access units that improve the on/off-chip bandwidth and hide main memory latency. In this work, we propose an Advanced Programmable Vector Memory Controller (PVMC), which boosts noncontiguous vector data accesses by integrating descriptors of memory patterns, a specialized on-chip memory, a memory manager in hardware, and multiple DRAM controllers. We implemented and validated the proposed system on an Altera DE4 FPGA board. The PVMC is also integrated with ARM Cortex-A9 processor on Xilinx Zynq All-Programmable System on Chip architecture. We compare the performance of a system with vector and scalar processors without PVMC. When compared with a baseline vector system, the results show that the PVMC system transfers data sets up to 1.40x to 2.12x faster, achieves between 2.01x to 4.53x of speedup for 10 applications and consumes 2.56 to 4.04 times less energy.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Digital.CSIC

Evaluating the performance of legacy applications on emerging parallel architectures

Author: Pennycook Simon J.
Publication venue
Publication date
Field of study

The gap between a supercomputer's theoretical maximum (\peak") oatingpoint performance and that actually achieved by applications has grown wider over time. Today, a typical scientific application achieves only 5{20% of any given machine's peak processing capability, and this gap leaves room for significant improvements in execution times. This problem is most pronounced for modern \accelerator" architectures { collections of hundreds of simple, low-clocked cores capable of executing the same instruction on dozens of pieces of data simultaneously. This is a significant change from the low number of high-clocked cores found in traditional CPUs, and effective utilisation of accelerators typically requires extensive code and algorithmic changes. In many cases, the best way in which to map a parallel workload to these new architectures is unclear. The principle focus of the work presented in this thesis is the evaluation of emerging parallel architectures (specifically, modern CPUs, GPUs and Intel MIC) for two benchmark codes { the LU benchmark from the NAS Parallel Benchmark Suite and Sandia's miniMD benchmark { which exhibit complex parallel behaviours that are representative of many scientific applications. Using combinations of low-level intrinsic functions, OpenMP, CUDA and MPI, we demonstrate performance improvements of up to 7x for these workloads. We also detail a code development methodology that permits application developers to target multiple architecture types without maintaining completely separate implementations for each platform. Using OpenCL, we develop performance portable implementations of the LU and miniMD benchmarks that are faster than the original codes, and at most 2x slower than versions highly-tuned for particular hardware. Finally, we demonstrate the importance of evaluating architectures at scale (as opposed to on single nodes) through performance modelling techniques, highlighting the problems associated with strong-scaling on emerging accelerator architectures

Warwick Research Archives Portal Repository

A Survey of Recent Developments in Testability, Safety and Security of RISC-V Processors

Author: Abolfazl Sajadi
Bernd Becker
Carles Hernandez
Denis Schwachhofer
Ilia Polian
Ilya Tuzov
Jens Anders
Mahnaz Namazi Rizi
Mathias Sauer
Matteo Sonza Reorda
Nele Mentens
Nikolaos Deligiannis
Nourhan Elhamawy
Nuša Zidaric
Pablo Andreu
Riccardo Cantoro
Stefan Wagner
Steffen Becker
Tobias Faller
Todor Stefanov
Publication venue: IEEE
Publication date: 01/01/2023
Field of study

With the continued success of the open RISC-V architecture, practical deployment of RISC-V processors necessitates an in-depth consideration of their testability, safety and security aspects. This survey provides an overview of recent developments in this quickly-evolving field. We start with discussing the application of state-of-the-art functional and system-level test solutions to RISC-V processors. Then, we discuss the use of RISC-V processors for safety-related applications; to this end, we outline the essential techniques necessary to obtain safety both in the functional and in the timing domain and review recent processor designs with safety features. Finally, we survey the different aspects of security with respect to RISC-V implementations and discuss the relationship between cryptographic protocols and primitives on the one hand and the RISC-V processor architecture and hardware implementation on the other. We also comment on the role of a RISC-V processor for system security and its resilience against side-channel attacks

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Leiden University Scholary Publications

IPPro: FPGA based image processing processor

Author: Bardak Burak
Rafferty Karen
Russell Matthew
Siddiqui Fahad Manzoor
Woods Roger
Publication venue
Publication date: 01/10/2014
Field of study

Queen's University Belfast Research Portal

Crossref

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Recommended from our members

AN ARCHITECTURE EVALUATION AND IMPLEMENTATION OF A SOFT GPGPU FOR FPGAs

Author: Andryc Kevin
Publication venue: ScholarWorks@UMass Amherst
Publication date: 25/10/2018
Field of study

Embedded and mobile systems must be able to execute a variety of different types of code, often with minimal available hardware. Many embedded systems now come with a simple processor and an FPGA, but not more energy-hungry components, such as a GPGPU. In this dissertation we present FlexGrip, a soft architecture which allows for the execution of GPGPU code on an FPGA without the need to recompile the design. The architecture is optimized for FPGA implementation to effectively support the conditional and thread-based execution characteristics of GPGPU execution without FPGA design recompilation. This architecture supports direct CUDA compilation to a binary which is executable on the FPGA-based GPGPU. Our architecture is customizable, thus providing the FPGA designer with a selection of GPGPU cores which display performance versus area tradeoffs. This dissertation describes the FlexGrip architecture in detail and showcases the benefits by evaluating the design for a collection of five standard CUDA benchmarks which are compiled using standard GPGPU compilation tools. Speedups of 23x, on average, versus a MicroBlaze microprocessor are achieved for designs which take advantage of the conditional execution capabilities offered by FlexGrip. We also show FlexGrip can achieve an 80% average reduction of dynamic energy versus the MicroBlaze microprocessor. The dissertation furthers discussion by exploring application-customized versions of the soft GPGPU, thus exploiting the overlay architecture. We expand the architecture to multiple processors per GPGPU and optimizing away features which are not needed by certain classes of applications. These optimizations, which include the effective use of block RAMs and DSP blocks, are critical to the performance of FlexGrip. By implementing a 2 GPGPU design, we show speedups of 44x on average versus a MicroBlaze microprocessor. Application-customized versions of the soft GPGPU can be used to further reduce dynamic energy consumption by an average of 14%. To complete this thesis, we augmented a GPGPU cycle accurate simulator to emulate FlexGrip and evaluate different levels of cache design spaces. We show performance increases for select benchmarks, however, we also show that 64% and 45% of benchmarks exhibited performance decreases when L1D cache was enabled for the 1 SMP and 2 SMP configurations, and only one benchmark showed performance improvement when the L2 cache was enabled

ScholarWorks@UMass Amherst

Vicuna: A Timing-Predictable RISC-V Vector Coprocessor for Scalable Parallel Computation

Author: Platzer Michael
Puschner Peter
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 33rd Euromicro Conference on Real-Time Systems (ECRTS 2021)
Publication date: 01/01/2021
Field of study

Dagstuhl Research Online Publication Server