Search CORE

41 research outputs found

Revisiting the high-performance reconfigurable computing for future datacenters

Author: Asghar H
Bashir AK
Bourennane EB
Ijaz Q
Publication venue: 'MDPI AG'
Publication date: 01/04/2020
Field of study

Modern datacenters are reinforcing the computational power and energy efficiency by assimilating field programmable gate arrays (FPGAs). The sustainability of this large-scale integration depends on enabling multi-tenant FPGAs. This requisite amplifies the importance of communication architecture and virtualization method with the required features in order to meet the high-end objective. Consequently, in the last decade, academia and industry proposed several virtualization techniques and hardware architectures for addressing resource management, scheduling, adoptability, segregation, scalability, performance-overhead, availability, programmability, time-to-market, security, and mainly, multitenancy. This paper provides an extensive survey covering three important aspects-discussion on non-standard terms used in existing literature, network-on-chip evaluation choices as a mean to explore the communication architecture, and virtualization methods under latest classification. The purpose is to emphasize the importance of choosing appropriate communication architecture, virtualization technique and standard language to evolve the multi-tenant FPGAs in datacenters. None of the previous surveys encapsulated these aspects in one writing. Open problems are indicated for scientific community as well

Multidisciplinary Digital Publishing Institute

HAL-uB

E-space: Manchester Metropolitan University's Research Repository

Efficient Smart CMOS Camera Based on FPGAs Oriented to Embedded Image Processing

Author: Alfredo Gardel
Bravo
Bravo
Draper
Felipe Espinosa
Ignacio Bravo
Javier Baliñas
Jorge García
José L. Lázaro
Kölsch
Martín
Meribout
Rowe
Shi
Tierno
Publication venue: Molecular Diversity Preservation International (MDPI)
Publication date: 01/02/2011
Field of study

This article describes an image processing system based on an intelligent ad-hoc camera, whose two principle elements are a high speed 1.2 megapixel Complementary Metal Oxide Semiconductor (CMOS) sensor and a Field Programmable Gate Array (FPGA). The latter is used to control the various sensor parameter configurations and, where desired, to receive and process the images captured by the CMOS sensor. The flexibility and versatility offered by the new FPGA families makes it possible to incorporate microprocessors into these reconfigurable devices, and these are normally used for highly sequential tasks unsuitable for parallelization in hardware. For the present study, we used a Xilinx XC4VFX12 FPGA, which contains an internal Power PC (PPC) microprocessor. In turn, this contains a standalone system which manages the FPGA image processing hardware and endows the system with multiple software options for processing the images captured by the CMOS sensor. The system also incorporates an Ethernet channel for sending processed and unprocessed images from the FPGA to a remote node. Consequently, it is possible to visualize and configure system operation and captured and/or processed images remotely

Multidisciplinary Digital Publishing Institute

Crossref

Directory of Open Access Journals

PubMed Central

A Scalable Unsegmented Multiport Memory for FPGA-Based Systems

Author: Attia Osama Gamal Mohamed
Jones Phillip Harrison
Townsend Kevin R.
Zambreno Joseph
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2015
Field of study

On-chip multiport memory cores are crucial primitives for many modern high-performance reconfigurable architectures and multicore systems. Previous approaches for scaling memory cores come at the cost of operating frequency, communication overhead, and logic resources without increasing the storage capacity of the memory. In this paper, we present two approaches for designing multiport memory cores that are suitable for reconfigurable accelerators with substantial on-chip memory or complex communication. Our design approaches tackle these challenges by banking RAM blocks and utilizing interconnect networks which allows scaling without sacrificing logic resources. With banking, memory congestion is unavoidable and we evaluate our multiport memory cores under different memory access patterns to gain insights about different design trade-offs. We demonstrate our implementation with up to 256 memory ports using a Xilinx Virtex-7 FPGA. Our experimental results report high throughput memories with resource usage that scales with the number of ports

Digital Repository @ Iowa State University (ISU)

Crossref

Directory of Open Access Journals

A unified hardware/software runtime environment for FPGA-based reconfigurable computers using BORPH

Author: Brodersen R
So HKH
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2008
Field of study

Fulltext linkThis paper explores the design and implementation of BORPH, an operating system designed for FPGA-based reconfigurable computers. Hardware designs execute as normal UNIX processes under BORPH, having access to standard OS services, such as file system support. Hardware and software components of user designs may, therefore, run as communicating processes within BORPH's runtime environment. The familiar language independent UNIX kernel interface facilitates easy design reuse and rapid application development. To develop hardware designs, a Simulink-based design flow that integrates with BORPH is employed. Performances of BORPH on two on-chip systems implemented on a BEE2 platform are compared. © 2008 ACM.link_to_subscribed_fulltex

HKU Scholars Hub

Rethinking FPGA Architectures for Deep Neural Network applications

Author: Rasoulinezhad Seyedramin
Publication venue: Faculty of Engineering, School of Electrical and Information Engineering
Publication date: 01/01/2023
Field of study

The prominence of machine learning-powered solutions instituted an unprecedented trend of integration into virtually all applications with a broad range of deployment constraints from tiny embedded systems to large-scale warehouse computing machines. While recent research confirms the edges of using contemporary FPGAs to deploy or accelerate machine learning applications, especially where the latency and energy consumption are strictly limited, their pre-machine learning optimised architectures remain a barrier to the overall efficiency and performance. Realizing this shortcoming, this thesis demonstrates an architectural study aiming at solutions that enable hidden potentials in the FPGA technology, primarily for machine learning algorithms. Particularly, it shows how slight alterations to the state-of-the-art architectures could significantly enhance the FPGAs toward becoming more machine learning-friendly while maintaining the near-promised performance for the rest of the applications. Eventually, it presents a novel systematic approach to deriving new block architectures guided by designing limitations and machine learning algorithm characteristics through benchmarking. First, through three modifications to Xilinx DSP48E2 blocks, an enhanced digital signal processing (DSP) block for important computations in embedded deep neural network (DNN) accelerators is described. Then, two tiers of modifications to FPGA logic cell architecture are explained that deliver a variety of performance and utilisation benefits with only minor area overheads. Eventually, with the goal of exploring this new design space in a methodical manner, a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations is first proposed. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then suggested together with a family of new embedded blocks, called MLBlocks

Sydney eScholarship

Exploring Hardware Fault Impacts on Different Real Number Representations of the Structural Resilience of TCUs in GPUs

Author: Guerrero-Balaguera Juan-David
Limas Sierra Robert
Rodriguez Condia Josie E.
Sonza Reorda Matteo
Publication venue: MDPI
Publication date: 01/01/2024
Field of study

The most recent generations of graphics processing units (GPUs) boost the execution of convolutional operations required by machine learning applications by resorting to specialized and efficient in-chip accelerators (Tensor Core Units or TCUs) that operate on matrix multiplication tiles. Unfortunately, modern cutting-edge semiconductor technologies are increasingly prone to hardware defects, and the trend to highly stress TCUs during the execution of safety-critical and high-performance computing (HPC) applications increases the likelihood of TCUs producing different kinds of failures. In fact, the intrinsic resiliency to hardware faults of arithmetic units plays a crucial role in safety-critical applications using GPUs (e.g., in automotive, space, and autonomous robotics). Recently, new arithmetic formats have been proposed, particularly those suited to neural network execution. However, the reliability characterization of TCUs supporting different arithmetic formats was still lacking. In this work, we quantitatively assessed the impact of hardware faults in TCU structures while employing two distinct formats (floating-point and posit) and using two different configurations (16 and 32 bits) to represent real numbers. For the experimental evaluation, we resorted to an architectural description of a TCU core (PyOpenTCU) and performed 120 fault simulation campaigns, injecting around 200,000 faults per campaign and requiring around 32 days of computation. Our results demonstrate that the posit format of TCUs is less affected by faults than the floating-point one (by up to three orders of magnitude for 16 bits and up to twenty orders for 32 bits). We also identified the most sensible fault locations (i.e., those that produce the largest errors), thus paving the way to adopting smart hardening solutions

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

An integrated soft- and hard-programmable multithreaded architecture

Author: Zhong Shi
Publication venue: The University of Edinburgh
Publication date: 01/01/2007
Field of study

Edinburgh Research Archive

Combining dynamic and static scheduling in high-level synthesis

Author: Cheng Jianyi
Publication venue: Electrical and Electronic Engineering, Imperial College London
Publication date: 01/07/2023
Field of study

Field Programmable Gate Arrays (FPGAs) are starting to become mainstream devices for custom computing, particularly deployed in data centres. However, using these FPGA devices requires familiarity with digital design at a low abstraction level. In order to enable software engineers without a hardware background to design custom hardware, high-level synthesis (HLS) tools automatically transform a high-level program, for example in C/C++, into a low-level hardware description. A central task in HLS is scheduling: the allocation of operations to clock cycles. The classic approach to scheduling is static, in which each operation is mapped to a clock cycle at compile time, but recent years have seen the emergence of dynamic scheduling, in which an operation’s clock cycle is only determined at run-time. Both approaches have their merits: static scheduling can lead to simpler circuitry and more resource sharing, while dynamic scheduling can lead to faster hardware when the computation has a non-trivial control flow. This thesis proposes a scheduling approach that combines the best of both worlds. My idea is to use existing program analysis techniques in software designs, such as probabilistic analysis and formal verification, to optimize the HLS hardware. First, this thesis proposes a tool named DASS that uses a heuristic-based approach to identify the code regions in the input program that are amenable to static scheduling and synthesises them into statically scheduled components, also known as static islands, leaving the top-level hardware dynamically scheduled. Second, this thesis addresses a problem of this approach: that the analysis of static islands and their dynamically scheduled surroundings are separate, where one treats the other as black boxes. We apply static analysis including dependence analysis between static islands and their dynamically scheduled surroundings to optimize the offsets of static islands for high performance. We also apply probabilistic analysis to estimate the performance of the dynamically scheduled part and use this information to optimize the static islands for high area efficiency. Finally, this thesis addresses the problem of conservatism in using sequential control flow designs which can limit the throughput of the hardware. We show this challenge can be solved by formally proving that certain control flows can be safely parallelised for high performance. This thesis demonstrates how to use automated formal verification to find out-of-order loop pipelining solutions and multi-threading solutions from a sequential program.Open Acces

Spiral - Imperial College Digital Repository

Rapid Prototyping and Exploration Environment for Generating C-to-Hardware-Compilers

Author: Stock Florian-Wolfgang
Publication venue
Publication date: 01/03/2019
Field of study

There is today an ever-increasing demand for more computational power coupled with a desire to minimize energy requirements. Hardware accelerators currently appear to be the best solution to this problem. While general purpose computation with GPUs seem to be very successful in this area, they perform adequately only in those cases where the data access patterns and utilized algorithms fit the underlying architecture. ASICs on the other hand can yield even better results in terms of performance and energy consumption, but are very inflexible, as they are manufactured with an application specific circuitry. Field Programmable Gate Arrays (FPGAs) represent a combination of approaches: With their application specific hardware they provide high computational power while requiring, for many applications, less energy than a CPU or a GPU. On the other hand they are far more flexible than an ASIC due to their reconfigurability. The only remaining problem is the programming of the FPGAs, as they are far more difficult to program compared to regular software. To allow common software developers, who have at best very limited knowledge in hardware design, to make use of these devices, tools were developed that take a regular high level language and generate hardware from it. Among such tools, C-to-HDL compilers are a particularly wide-spread approach. These compilers attempt to translate common C code into a hardware description language from which a datapath is generated. Most of these compilers have many restrictions for the input and differ in their underlying generated micro architecture, their scheduling method, their applied optimizations, their execution model and even their target hardware. Thus, a comparison of a certain aspect alone, like their implemented scheduling method or their generated micro architecture, is almost impossible, as they differ in so many other aspects. This work provides a survey of the existing C-to-HDL compilers and presents a new approach to evaluating and exploring different micro architectures for dynamic scheduling used by such compilers. From a mathematically formulated rule set the Triad compiler generates a backend for the Scale compiler framework, which then implements a hardware generation backend with described dynamic scheduling. While more than a factor of four slower than hardware from highly optimized compilers, this environment allows easy comparison and exploration of different rule sets and the micro architecture for the dynamically scheduled datapaths generated from them. For demonstration purposes a rule set modeling the COCOMA token flow model from the COMRADE 2.0 compiler was implemented. Multiple variants of it were explored: Savings of up to 11% of the required hardware resources were possible

TUbiblio

tuprints

Unifying software and hardware of multithreaded reconfigurable applications within operating system processes

Author: Vuletic Miljan
Publication venue: Lausanne, EPFL
Publication date: 22/08/2006
Field of study

Novel reconfigurable System-on-Chip (SoC) devices offer combining software with application-specific hardware accelerators to speed up applications. However, by mixing user software and user hardware, principal programming abstractions and system-software commodities are usually lost, since hardware accelerators (1) do not have execution context —it is typically the programmer who is supposed to provide it, for each accelerator, (2) do not have virtual memory abstraction —it is again programmer who shall communicate data from user software space to user hardware, even if it is usually burdensome (or sometimes impossible!), (3) cannot invoke system services (e.g., to allocate memory, open files, communicate), and (4) are not easily portable —they depend mostly on system-level interfacing, although they logically belong to the application level. We introduce a unified Operating System (OS) process for codesigned reconfigurable applications that provides (1) unified memory abstraction for software and hardware application parts, (2) execution transfers from software to hardware and vice versa, thus enabling hardware accelerators to use systems services and callback other software and hardware functions, and (3) multithreaded execution of multiple software and hardware threads. The unified OS process ensures portability of codesigned applications, by providing standardised means of interfacing. Having just-another abstraction layer usually affects performance: we show that the runtime optimisations in the system layer supporting the unified OS process can minimise the performance loss and even outperform typical approaches. The unified OS process also fosters unrestricted automated synthesis of software to hardware, thus allowing unlimited migration of application components. We demonstrate the advantages of the unified OS process in practice, for Linux systems running on Xilinx Virtex-II Pro and Altera Excalibur reconfigurable devices

Infoscience - École polytechnique fédérale de Lausanne