30 research outputs found
The review of heterogeneous design frameworks/Platforms for digital systems embedded in FPGAs and SoCs
Systems-on-a-chip integrate specialized modules to provide well-defined functionality. In order to guarantee its efficiency, designersare careful to choose high-level electronic components. In particular,FPGAs (field-programmable gate array) have demonstrated theirability to meet the requirements of emerging technology. However,traditional design methods cannot keep up with the speed andefficiency imposed by the embedded systems industry, so severalframeworks have been developed to simplify the design process of anelectronic system, from its modeling to its physical implementation.This paper illustrates some of them and presents a comparative studybetween them. Indeed, we have selected design methods of SoC(ESP4ML and HLS4ML, OpenESP, LiteX, RubyRTL, PyMTL,SysPy, PyRTL, DSSoC) and NoC networks on OCN chip (PyOCN)and in general on FPGA (PRGA, OpenFPGA, AnyHLS, PYNQ, andPyLog).The objective of this article is to analyze each tool at several levelsand to discuss the benefit of each in the scientific community. Wewill analyze several aspects constituting the architecture and thestructure of the platforms to make a comparative study of thehardware and software design flows of digital systems.
A real-time capable dynamic partial reconfiguration system for an applicationspecific soft-core processor
Modern FPGAs (Field Programmable Gate Arrays) are becoming increasingly important when it comes to embedded system development. Within these FPGAs, soft-core processors are often used to solve a wide range of different tasks. Soft-core processors are a cost-effective and time-efficient way to realize embedded systems. When using the full potential of FPGAs, it is possible to dynamically reconfigure parts of them during run time without the need to stop the device. This feature is called dynamic partial reconfiguration (DPR). If the DPR approach is to be applied in a real-time application-specific soft-core processor, an architecture must be created that ensures strict compliance with the real-time constraint at all times. In this paper, a novel method that addresses this problem is introduced, and its realization is described. In the first step, an application-specializable soft-core processor is presented that is capable of solving problems while adhering to hard real-time deadlines. This is achieved by the full design time analyzability of the soft-core processor. Its special architecture and other necessary features are discussed. Furthermore, a method for the optimized generation of partial bitstreams for the DPR as well as its practical implementation in a tool is presented. This tool is able to minimize given bitstreams with the help of a differential frame bitmap. Experiments that realize the DPR within the soft-core framework are presented, with respect to the need for hard real-time capability. Those experiments show a significant resource reduction of about 40% compared to a functionally equivalent non-DPR design
Conceptual design and realization of a dynamic partial reconfiguration extension of an existing soft-core processor
Viele aktuelle Field Programmable Gate Arrays (FPGAs) unterstützen die Technik der partiellen Rekonfiguration (PR), durch die dynamisch zur Laufzeit ein Hardware-Design auch nur teilweise ausgetauscht werden kann. Die vorliegende Arbeit integriert PR-Funktionalität in die an der Technischen Universität Ilmenau für harte Echtzeitaufgaben mit hochpräzisen Fließkommaberechnungen entwickelte VHDL Integrated Softcore Architecture for Reconfigurable Devices (ViSARD). Zu diesem Zweck wird die arithmetisch-logische Einheit angepasst, um das Auswechseln von Fließkomma-Ausführungseinheiten zu ermöglichen. Ziele der Entwicklung des PR-Systems sind hohe Geschwindigkeit, niedrige Latenz, niedrige Ressourcenkosten und harte Echtzeitfähigkeit. Erreicht werden diese durch die Umsetzung einer eigenen Steuereinheit (partial reconfiguration controller), die partielle Bitströme aus externem RAM über einen standardmäßigen AXI-Bus lädt sowie die entsprechende Erweiterung der ViSARD. In einem Testdesign, das zwischen drei verschiedenen Konfigurationen mit je zwischen einer und drei Ausführungseinheiten wechselt, hat das entwickelte PR-System den maximal spezifierten Bitstromdurchsatz auf dem Ziel-FPGA erreicht und den Verbrauch an Lookup-Tabellen um etwa 40 % verringert.Many modern field-programmable gate arrays (FPGAs) support partial reconfiguration, which allows to dynamically replace only a part of a design at run time. In this thesis, partial reconfiguration capability is integrated with the VHDL Integrated Softcore Architecture for Reconfigurable Devices (ViSARD) developed at Technische Universität Ilmenau and conceived for hard real-time tasks requiring floating-point calculations with high precision. Specifically, its arithmetic logic unit is modified to allow exchanging floating-point arithmetic execution units. Design goals of the partial reconfiguration system are high speed, low latency, low resource overhead, and hard real-time capability. They are reached by implementing a custom partial reconfiguration controller loading partial bitstreams from external RAM over a standard AXI bus and extending the ViSARD appropriately. In a test design that switched between 3 different configurations each containing between 1 and 3 execution units, the proposed partial reconfiguration system achieved the maximum specified bitstream throughput on the target FPGA and allowed for roughly 40 % reduced look-up table usage
Consensus protocols exploiting network programmability
Services rely on replication mechanisms to be available at all time. The service demanding high availability is replicated on a set of machines called replicas. To maintain the consistency of replicas, a consensus protocol such as Paxos or Raft is used to synchronize the replicas' state. As a result, failures of a minority of replicas will not affect the service as other non-faulty replicas continue serving requests. A consensus protocol is a procedure to achieve an agreement among processors in a distributed system involving unreliable processors. Unfortunately, achieving such an agreement involves extra processing on every request, imposing a substantial performance degradation. Consequently, performance has long been a concern for consensus protocols. Although many efforts have been made to improve consensus performance, it continues to be an important problem for researchers. This dissertation presents a novel approach to improving consensus performance. Essentially, it exploits the programmability of a new breed of network devices to accelerate consensus protocols that traditionally run on commodity servers. The benefits of using programmable network devices to run consensus protocols are twofold: The network switches process packets faster than commodity servers and consensus messages travel fewer hops in the network. It means that the system throughput is increased and the latency of requests is reduced. The evaluation of our network-accelerated consensus approach shows promising results. Individual components of our FPGA- based and switch-based consensus implementations can process 10 million and 2.5 billion consensus messages per second, respectively. Our FPGA-based system as a whole delivers 4.3 times performance of a traditional software consensus implementation. The latency is also better for our system and is only one third of the latency of the software consensus implementation when both systems are under half of their maximum throughputs. In order to drive even higher performance, we apply a partition mechanism to our switch-based system, leading to 11 times better throughput and 5 times better latency. By dynamically switching between software-based and network-based implementations, our consensus systems not only improve performance but also use energy more efficiently. Encouraged by those benefits, we developed a fault-tolerant non-volatile memory system. A prototype using software memory controller demonstrated reasonable overhead over local memory access, showing great promise as scalable main memory. Our network-based consensus approach would have a great impact in data centers. It not only improves performance of replication mechanisms which relied on consensus, but also enhances performance of services built on top of those replication mechanisms. Our approach also motivates others to move new functionalities into the network, such as, key-value store and stream processing. We expect that in the near future, applications that typically run on traditional servers will be folded into networks for performance
A configurable vector processor for accelerating speech coding algorithms
The growing demand for voice-over-packer (VoIP) services and multimedia-rich
applications has made increasingly important the efficient, real-time implementation of
low-bit rates speech coders on embedded VLSI platforms. Such speech coders are
designed to substantially reduce the bandwidth requirements thus enabling dense multichannel
gateways in small form factor. This however comes at a high computational cost
which mandates the use of very high performance embedded processors.
This thesis investigates the potential acceleration of two major ITU-T speech coding
algorithms, namely G.729A and G.723.1, through their efficient implementation on a
configurable extensible vector embedded CPU architecture. New scalar and vector ISAs
were introduced which resulted in up to 80% reduction in the dynamic instruction count
of both workloads. These instructions were subsequently encapsulated into a parametric,
hybrid SISD (scalar processor)–SIMD (vector) processor. This work presents the research
and implementation of the vector datapath of this vector coprocessor which is tightly-coupled
to a Sparc-V8 compliant CPU, the optimization and simulation methodologies
employed and the use of Electronic System Level (ESL) techniques to rapidly design
SIMD datapaths
Rapid Prototyping and Exploration Environment for Generating C-to-Hardware-Compilers
There is today an ever-increasing demand for more computational power
coupled with a desire to minimize energy requirements. Hardware
accelerators currently appear to be the best solution to this problem.
While general purpose computation with GPUs seem to be
very successful in this area, they perform adequately only in those cases where the data
access patterns and utilized algorithms fit the underlying
architecture. ASICs on the other hand can yield even better results
in terms of performance and energy consumption, but are very
inflexible, as they are manufactured with an application specific
circuitry. Field Programmable Gate Arrays (FPGAs) represent a combination of approaches: With their application specific hardware they provide
high computational power while requiring, for many applications, less
energy than a CPU or a GPU. On the other hand they are far more
flexible than an ASIC due to their reconfigurability.
The only remaining problem is the programming of
the FPGAs, as they are far more difficult to program compared to
regular software. To allow common software developers, who have at
best very limited
knowledge in hardware design, to make use of these devices, tools were
developed that take a regular high level language and generate
hardware from it.
Among such tools, C-to-HDL compilers are a particularly wide-spread approach. These compilers attempt to translate common C code into a hardware description language from which a
datapath is generated. Most of these compilers have many
restrictions for the input and differ in their underlying generated
micro architecture, their scheduling method, their applied
optimizations, their execution model and even their target
hardware. Thus, a comparison of a certain aspect alone, like their
implemented scheduling method or their generated micro architecture,
is almost impossible, as they differ in so many other aspects.
This work provides a survey of the existing C-to-HDL compilers and
presents a new approach to evaluating and exploring
different micro architectures for dynamic scheduling used by such
compilers. From a mathematically formulated rule set the Triad
compiler generates a backend for the Scale compiler framework,
which then implements a hardware generation backend with described dynamic
scheduling.
While more than a factor of four slower than hardware from highly
optimized compilers, this environment allows easy comparison and
exploration of different rule sets and the micro architecture for the
dynamically scheduled datapaths generated from them. For demonstration
purposes a rule set modeling the COCOMA token flow model from
the COMRADE 2.0 compiler was implemented. Multiple variants of
it were explored: Savings of up to 11% of the required hardware
resources were possible