9 research outputs found
Targeting static and dynamic workloads with a reconfigurable VLIW processor
Embedded systems range from very simple devices, such as a digital watch, to highly complex systems such as smartphones. In these complex devices, an increasing number of applications need to be executed on a computing platform. Moreover, the number of applications (or programs) usually exceeds the number of processors found on such platforms. This creates the need for scheduling. Furthermore, each program exhibits different characteristics, and their interaction with the (real-life) environment leads to real-time requirements. Consequently, the set of programs, called the workload, exhibits highly dynamic behavior. Workloads can be dynamic in intensity (i.e., the number of concurrent tasks), characteristics (amount and type of parallelism), and requirements (real-time constraints, power budgets, performance). We argue that dynamic workloads require a dynamic computing platform and propose to use one that comprises the ρ-VEX reconfigurable VLIW processor. It can dynamically adapt to the workload while it is running. Adaptations can be triggered by a user, programmer, compiler, or an operating system. The latter two methods can operate fully automatically, and exploring these is one of the goals of this work. Besides dynamic workloads, a number of new classes of embedded devices run application programs that are very static but require very high throughput. Examples are the latest generations of mobile telecommunications hardware and vision-based applications (automation, surveillance, automated driving). In this case, adapting to the workload at run time is not advantageous because there are no changes to adapt to. Optimizing for these applications is possible, but must be done before the hardware platform is manufactured (during the design phase) or by making use of Field-Programmable Gate Arrays (FPGAs). This thesis explores the use of the proposed reconfigurable processor to target the full spectrum of embedded workloads. First, design-time reconfigurability is employed to optimize a hardware platform for a static, streaming image processing workload. Second, we explore the run-time reconfigurable processor for dynamic workloads, first by adapting to a single program to optimize energy efficiency, and then by adapting to a generated set of programs to optimize throughput. Third, the real-time characteristics of the processor are evaluated, and it is shown to have better schedulability compared to static processors. The VLIW architecture results in good timing predictability, which allows finding tight bounds on the worst-case execution time. Last, we show that the processor is able to assign more parallel execution resources to a static program that is added to the workload, while still guaranteeing timing safety for critical tasks.
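As a rough illustration of the run-time reconfiguration mechanism described above (a minimal sketch: the register address and the configuration encodings below are hypothetical, not the actual ρ-VEX interface), an operating system or compiler-inserted runtime could request a new core configuration with a single memory-mapped write:

    #include <stdint.h>

    /* Hypothetical memory-mapped control register of the reconfigurable core. */
    #define CORE_CFG_REG ((volatile uint32_t *)0x80000000u)

    /* Hypothetical encodings: lanes can be merged into one wide core to
     * exploit ILP, or split into several narrow cores to exploit TLP.  */
    enum core_config {
        CFG_ONE_8ISSUE  = 0x0,  /* one 8-issue core: maximum ILP   */
        CFG_TWO_4ISSUE  = 0x1,  /* two 4-issue cores               */
        CFG_FOUR_2ISSUE = 0x2   /* four 2-issue cores: maximum TLP */
    };

    /* Request a reconfiguration; the hardware applies it at a safe point. */
    static void reconfigure(enum core_config cfg)
    {
        *CORE_CFG_REG = (uint32_t)cfg;
    }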
Porting Linux to the rVEX reconfigurable VLIW softcore
This thesis describes the design and implementation of an FPGA-based hardware platform built around the rVEX VLIW softcore and the adaptation of a Linux 2.0 no_mmu kernel to run on that platform. The rVEX is a run-time reconfigurable VLIW softcore processor. It supports various configurations that allow programs to run faster or more efficiently, and it can switch between configurations while it is running. Reconfigurations are typically performed by a software program running on a different processor. We discuss the concept of using an operating system, running on the core itself, that monitors the execution of its tasks and orchestrates core reconfigurations during task switches. In addition to using statically determined optimal configurations, performance counters could be added to the core that measure how efficiently a program runs on the current core configuration. The OS could use that data to evaluate whether another configuration would be beneficial. The implementation of the hardware platform and the porting of the Linux kernel represent the first steps toward that goal. To support our Linux port, a vectored trap controller has been designed. Additionally, a debugging environment has been created by designing a hardware debug unit, implementing an RSP server program, and adding rVEX support to the GNU debugger (GDB).
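The envisioned OS policy could look roughly like the following sketch (the performance-counter registers, the threshold, and the helper functions are assumptions for illustration, not the rVEX programming model):

    #include <stdint.h>

    /* Hypothetical performance-counter registers exposed by the core. */
    #define PCNT_CYCLES  (*(volatile uint32_t *)0x80000010u)
    #define PCNT_BUNDLES (*(volatile uint32_t *)0x80000014u)

    extern void reconfigure_narrow(void);  /* assumed helper: shrink the core */
    extern void reconfigure_wide(void);    /* assumed helper: widen the core  */

    /* Called by the scheduler on every task switch: if the departing task
     * kept the issue slots poorly utilized, fall back to a narrower
     * configuration so the freed lanes can serve other tasks.            */
    void on_task_switch(void)
    {
        uint32_t cycles  = PCNT_CYCLES;
        uint32_t bundles = PCNT_BUNDLES;

        /* Fraction of cycles that issued a bundle, in Q16 fixed point. */
        uint32_t util_q16 =
            cycles ? (uint32_t)(((uint64_t)bundles << 16) / cycles) : 0;

        if (util_q16 < (1u << 14))   /* below roughly 25% utilization */
            reconfigure_narrow();
        else
            reconfigure_wide();
    }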
Energy Efficient Multistandard Decompressor ASIP
Many applications make extensive use of various forms of compression for storing and communicating data. As decompression is highly regular and repetitive, it is a suitable candidate for acceleration. Examples are offloading (de)compression to a dedicated circuit on a heterogeneous System-on-Chip, or attaching FPGAs or ASICs directly to storage so they can perform these tasks on the fly, transparently to the application. ASIC or FPGA implementations usually result in higher energy efficiency compared to CPUs. Various ASIC and FPGA accelerators have been developed, but they typically target a single algorithm. However, supporting different compression algorithms is desirable in many situations. For example, the Apache Parquet file format, popular in Big Data analytics, supports using different compression standards, even between blocks in a single file. This calls for a more flexible, software-based co-processor approach. To this end, we propose a compiler-supported Application-Specific Instruction-set Processor (ASIP) design that is able to decompress a range of lossless compression standards without FPGA reconfiguration. We perform a case study of searching a compressed database dump of the entire English Wikipedia.
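A single ASIP binary that selects the right decode routine per block could be organized as in this sketch (the routine names and signatures are placeholders, not the proposed design's API):

    #include <stddef.h>
    #include <stdint.h>

    /* Compression codecs that may appear, e.g., in Apache Parquet blocks. */
    enum codec { CODEC_SNAPPY, CODEC_GZIP, CODEC_LZ4, CODEC_COUNT };

    typedef size_t (*decompress_fn)(const uint8_t *src, size_t src_len,
                                    uint8_t *dst, size_t dst_cap);

    /* Placeholder routines; each would be compiled for the ASIP in C. */
    extern size_t snappy_decompress(const uint8_t *, size_t, uint8_t *, size_t);
    extern size_t gzip_decompress(const uint8_t *, size_t, uint8_t *, size_t);
    extern size_t lz4_decompress(const uint8_t *, size_t, uint8_t *, size_t);

    static const decompress_fn dispatch[CODEC_COUNT] = {
        [CODEC_SNAPPY] = snappy_decompress,
        [CODEC_GZIP]   = gzip_decompress,
        [CODEC_LZ4]    = lz4_decompress,
    };

    /* One binary handles whichever codec the current block uses, with no
     * FPGA reconfiguration between blocks.                               */
    size_t decompress_block(enum codec c, const uint8_t *src, size_t n,
                            uint8_t *dst, size_t cap)
    {
        return dispatch[c](src, n, dst, cap);
    }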
Improved Dynamic Cache Sharing for Communicating Threads on a Runtime-Adaptable Processor
Multi-threaded applications execute their threads on different cores with their own local caches and need to share data among the threads. Shared caches are used to avoid lengthy and costly main memory accesses. The degree of cache sharing is a balance between reducing misses and increased hit latency. Dynamic caches have been proposed to adapt this balance to the workload type. Similarly, dynamic processors aim to execute workloads as efficiently as possible by balancing between exploiting instruction-level parallelism (ILP) and thread-level parallelism (TLP). To support this, they consist of multiple processing components and caches with adaptable interconnects between them. Depending on the workload characteristics, these can be connected together to form a large core that exploits ILP, or split up to form multiple cores that can run multiple threads (exploiting TLP). In this paper, we propose a cache system that further exploits this additional connectivity of a dynamic VLIW processor by forwarding cache accesses to multiple cache blocks while the processor is running in multi-threaded ("split") mode. Additionally, only requests to global data are broadcast, while accesses to local data are kept private. This improves hit rates similarly to existing cache-sharing schemes, but reduces the penalty due to stalling the other subcores. Local accesses are recognized by distinguishing memory accesses relative to the stack frame pointer. Results show that our cache exhibits similar miss rate reductions as shared caches (up to 90% and on average 26%), and reduces the number of broadcast accesses by 21%.
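The local/global distinction can be sketched as follows (a simplified software model: the register numbering and the decoded-access structure are illustrative; in the actual design this check is performed by the cache hardware):

    #include <stdbool.h>
    #include <stdint.h>

    /* Register number assumed to hold the stack frame pointer; the real
     * rVEX register assignment may differ.                              */
    #define REG_FP 30

    /* A decoded memory access: base register plus immediate offset. */
    struct mem_access {
        uint8_t base_reg;  /* register supplying the base address */
        int32_t offset;    /* immediate displacement              */
    };

    /* Frame-pointer-relative accesses touch the thread's private stack,
     * so the local cache block can serve them without broadcasting;
     * anything else is treated as potentially shared data.             */
    static bool is_local_access(const struct mem_access *a)
    {
        return a->base_reg == REG_FP;
    }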
AEx: Automated High-Level Synthesis of Compiler Programmable Co-Processors
Modern High-Level Synthesis (HLS) tools succeed well in their engineering productivity goal, but still require toolset- and target-technology-specific modifications to the source code to guide the process towards an efficient implementation. Furthermore, their end result is a fixed-function accelerator with limited field and run-time flexibility. In this paper, we describe the status of AEx, a novel work-in-progress HLS tool developed in the FitOptiVis ECSEL JU project. AEx is based on automated exploration of architectures using a flexible and lightweight parallel co-processor template. We compare its current performance on the CHStone C-language benchmarks to the state-of-the-art FPGA HLS tool Vitis, provide ASIC implementation numbers, and identify the main remaining toolset features that are expected to further improve performance dramatically. The potential is explored with a hand-optimized case study that shows only a 1.64x performance slowdown for the programmable co-processor in comparison to the fixed-function Vitis HLS result.
FPGA Acceleration for Big Data Analytics: Challenges and Opportunities
The big data revolution has ushered in an era of ever-increasing volumes and complexity of data, requiring ever faster computational analysis. During this very same era, CPU performance growth has been stagnating, pushing the industry to either scale computation horizontally using multiple nodes in datacenters, or to scale vertically using heterogeneous components to reduce compute time. Meanwhile, networking and storage continue to provide both higher throughput and lower latency, which allows for leveraging heterogeneous components deployed in data centers around the world. Still, the integration of big data analytics frameworks with heterogeneous hardware components such as GPGPUs and FPGAs is challenging, because there is an increasing gap in the level of abstraction between analytics solutions developed with big data analytics frameworks and accelerated kernels developed for heterogeneous components. In this article, we focus on FPGA accelerators, which have seen wide-scale deployment in large cloud infrastructures. FPGAs allow the implementation of highly optimized hardware architectures, tailored exactly to an application and unburdened by the overhead associated with traditional general-purpose computer architectures. FPGAs implementing dataflow-oriented architectures with high levels of (pipeline) parallelism can provide high application throughput, often with high energy efficiency. Latency-sensitive applications can leverage FPGA accelerators by connecting directly to the physical layer of a network and performing data transformations without going through the software stacks of the host system. While these advantages of FPGA accelerators hold promise, difficulties associated with programming and integration limit their use. This article explores the existing practices in big data analytics frameworks, discusses the aforementioned gap in development abstractions, and provides some perspectives on how to address these challenges in the future.
ALMARVI Execution Platform: Heterogeneous Video Processing SoC Platform on FPGA
The proliferation of processing hardware alternatives allows developers to use various customized computing platforms to run their applications in an optimal way. However, porting application code to custom hardware requires a lot of development and porting effort. This paper describes a heterogeneous computational platform (the ALMARVI execution platform) comprising multiple communicating processors that allow easy programmability through an interface to OpenCL. The ALMARVI platform uses processing elements based on both VLIW and Transport-Triggered Architectures (ρ-VEX and TCE cores, respectively). It can be implemented on Zynq devices such as the ZedBoard, and supports OpenCL by means of the pocl (Portable OpenCL) project and our ALMAIF interface specification. This allows developers to execute kernels transparently on either type of processing element, thereby optimizing execution time with minimal design and development effort.
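From the developer's point of view, running code on the platform follows the standard OpenCL host flow, as in this minimal sketch (error handling is omitted; the kernel source and name are placeholders, and it is assumed that pocl exposes the ALMAIF cores as ordinary accelerator devices):

    #define CL_TARGET_OPENCL_VERSION 200
    #include <CL/cl.h>

    int main(void)
    {
        const char *src =
            "__kernel void add(__global int *a) { a[get_global_id(0)] += 1; }";

        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q =
            clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);

        /* Build the kernel for whichever processing element pocl selects. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "add", NULL);

        int data[64] = {0};
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof data, data, NULL);
        clSetKernelArg(k, 0, sizeof buf, &buf);

        size_t n = 64;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof data, data, 0, NULL, NULL);
        return 0;
    }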
Battling the CPU Bottleneck in Apache Parquet to Arrow Conversion Using FPGA
In the domain of big data analytics, the bottleneck of converting storage-focused file formats to in-memory data structures has shifted from the bandwidth of storage to the performance of decoding and decompression software. Two widely used formats for big data storage and in-memory data are Apache Parquet and Apache Arrow, respectively. In order to improve the speed at which data can be loaded from disk to memory, we propose an FPGA accelerator design that converts Parquet files to Arrow in-memory data structures. We describe an extensible, publicly available, free and open-source implementation of the proposed converter that supports various Parquet file configurations. The performance of the converter is measured on an AWS EC2 F1 system and on a POWER9 system using the recently released OpenCAPI interface. A single instance of the converter can reach between 6 and 12 GB/s of end-to-end throughput, and shows up to a threefold improvement over the fastest single-threaded CPU implementation. It has a low resource utilization (less than 5% for all types of FPGA resources), which allows scaling out the design to match the bandwidth of the upcoming generation of accelerator interfaces. The proposed design and implementation can be extended to support more of the many possible Parquet file configurations.
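From the host's perspective, the converter behaves like a DMA engine that consumes a Parquet file and produces Arrow buffers; the following sketch illustrates that flow with an entirely hypothetical driver API (none of these functions or the device path belong to the published implementation):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical driver handle and API for the converter; in reality the
     * design is driven via the AWS EC2 F1 and OpenCAPI host environments. */
    typedef struct converter converter_t;

    extern converter_t *conv_open(const char *fpga_device);
    extern int  conv_submit(converter_t *c, const uint8_t *parquet, size_t len);
    extern int  conv_wait(converter_t *c, uint8_t **arrow_buf, size_t *arrow_len);
    extern void conv_close(converter_t *c);

    int convert(const uint8_t *parquet, size_t len,
                uint8_t **arrow, size_t *arrow_len)
    {
        converter_t *c = conv_open("/dev/fpga0");  /* placeholder device path */
        if (!c)
            return -1;
        int rc = conv_submit(c, parquet, len);     /* DMA the file to the FPGA */
        if (rc == 0)
            rc = conv_wait(c, arrow, arrow_len);   /* receive Arrow buffers    */
        conv_close(c);
        return rc;
    }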
Frame-based Programming, Stream-Based Processing for Medical Image Processing Applications
This paper presents and evaluates an approach to deploying image and video processing pipelines that are developed in a frame-oriented fashion onto a hardware platform that is stream-oriented, such as an FPGA. First, this calls for a specialized streaming memory hierarchy and an accompanying software framework that transparently moves image segments between stages in the image processing pipeline. Second, we use softcore VLIW processors, which are targetable by a C compiler and have hardware debugging capabilities, to evaluate and debug the software before moving to a High-Level Synthesis flow. The algorithm development phase, including debugging and optimizing on the target platform, is often a very time-consuming step in the development of a new product. Our proposed platform allows both software developers and hardware designers to test iterations in a matter of seconds (compilation time) instead of hours (synthesis or circuit simulation time).
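The central idea can be sketched in a few lines: the framework hands each pipeline stage a sliding window of image lines instead of a whole frame, so a frame-oriented kernel body runs unmodified on streamed data (the window size, image width, and stage signature below are illustrative, not the framework's actual interface):

    #include <stdint.h>
    #include <string.h>

    #define WIDTH 640  /* illustrative image width            */
    #define WIN   3    /* lines visible to the stage at once  */

    /* A frame-oriented 3x3 box-filter body, written as if the whole
     * image were available.                                          */
    static uint8_t filter3x3(const uint8_t win[WIN][WIDTH], int x)
    {
        int acc = 0;
        for (int dy = 0; dy < WIN; dy++)
            for (int dx = -1; dx <= 1; dx++) {
                int xx = x + dx;
                if (xx < 0) xx = 0;
                if (xx >= WIDTH) xx = WIDTH - 1;
                acc += win[dy][xx];
            }
        return (uint8_t)(acc / 9);
    }

    /* The streaming wrapper: consume one input line, emit one output
     * line. Only WIN lines are buffered, never the whole frame.      */
    void stage_push_line(const uint8_t *in, uint8_t *out)
    {
        static uint8_t win[WIN][WIDTH];
        memmove(win[0], win[1], (WIN - 1) * WIDTH);  /* slide the window */
        memcpy(win[WIN - 1], in, WIDTH);
        for (int x = 0; x < WIDTH; x++)
            out[x] = filter3x3((const uint8_t (*)[WIDTH])win, x);
    }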