146 research outputs found
A proposed synthesis method for Application-Specific Instruction Set Processors
Due to rapid technological advancement in the integrated circuit era, the need for high computation
performance together with increasing complexity and manufacturing costs has raised the demand for
high-performance configurable designs; therefore, Application-Specific Instruction Set Processors
(ASIPs) are widely used in SoC design. The automated generation of software tools for ASIPs is a
commonly used technique, but automated hardware model generation is less frequently applied to
final RTL implementations. Instead, the final register-transfer level models are usually
created, at least partly, manually. This paper presents a novel approach for automated hardware model
generation for ASIPs. The new solution is based on a novel abstract ASIP model and a modeling language
(Algorithmic Microarchitecture Description Language, AMDL) optimized for this architecture model. The
proposed AMDL-based pre-synthesis method is based on a set of pre-defined VHDL implementation
schemes, which ensure the quality of the automatically generated register-transfer level models in
terms of resource requirements and operating frequency. The design framework implementing the
algorithms required by the synthesis method is also presented.
Efficient implementation of video processing algorithms on FPGA
The work contained in this portfolio thesis was carried out as part of an Engineering Doctorate (Eng.D) programme from the Institute for System Level Integration. The work was sponsored by Thales Optronics, and focuses on issues surrounding the implementation of video processing algorithms on field programmable gate arrays (FPGA).
A description is given of FPGA technology and the currently dominant methods of designing and verifying firmware. The problems of translating a description of behaviour into one of structure are discussed, and some of the latest methodologies for tackling this problem are introduced.
A number of algorithms are then looked at, including methods of contrast enhancement, deconvolution, and image fusion. Algorithms are characterised according to the nature of their execution flow, and this is used as justification for some of the design choices that are made. An efficient method of performing large two-dimensional convolutions is also described.
The portfolio also contains a discussion of an FPGA implementation of a PID control algorithm, an overview of FPGA dynamic reconfigurability, and the development of a demonstration platform for rapid deployment of video processing algorithms in FPGA hardware.
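For readers unfamiliar with the PID control algorithm mentioned in this abstract, the following is a minimal software sketch of a discrete-time PID update. It is not taken from the thesis: the class name `PID`, its fields, and the use of floating-point arithmetic are assumptions made for illustration, and an FPGA implementation would more likely use fixed-point arithmetic.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete-time PID controller (illustrative, not the thesis design)."""
    kp: float                 # proportional gain
    ki: float                 # integral gain
    kd: float                 # derivative gain
    dt: float                 # sample period in seconds
    integral: float = 0.0     # running integral of the error
    prev_error: float = 0.0   # error from the previous sample

    def update(self, setpoint: float, measurement: float) -> float:
        """Return the control output for one sample."""
        error = setpoint - measurement
        self.integral += error * self.dt                    # rectangular integration
        derivative = (error - self.prev_error) / self.dt    # backward difference
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

Calling `update` once per sample period with the current setpoint and measurement yields the control output; the three gains and `dt` would have to be tuned for the actual plant.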
Application-specific instruction set processor for speech recognition.
Cheung Man Ting. Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. Includes bibliographical references (leaves 69-71). Abstracts in English and Chinese.
Chapter 1 --- Introduction
Chapter 1.1 --- The Emergence of ASIP
Chapter 1.1.1 --- Related Work
Chapter 1.2 --- Motivation
Chapter 1.3 --- ASIP Design Methodologies
Chapter 1.4 --- Fundamentals of Speech Recognition
Chapter 1.5 --- Thesis outline
Chapter 2 --- Automatic Speech Recognition
Chapter 2.1 --- Overview of ASR system
Chapter 2.2 --- Theory of Front-end Feature Extraction
Chapter 2.3 --- Theory of HMM-based Speech Recognition
Chapter 2.3.1 --- Hidden Markov Model (HMM)
Chapter 2.3.2 --- The Typical Structure of the HMM
Chapter 2.3.3 --- Discrete HMMs and Continuous HMMs
Chapter 2.3.4 --- The Three Basic Problems for HMMs
Chapter 2.3.5 --- Probability Evaluation
Chapter 2.4 --- The Viterbi Search Engine
Chapter 2.5 --- Isolated Word Recognition (IWR)
Chapter 3 --- Design of ASIP Platform
Chapter 3.1 --- Instruction Fetch
Chapter 3.2 --- Instruction Decode
Chapter 3.3 --- Datapath
Chapter 3.4 --- Register File Systems
Chapter 3.4.1 --- Memory Hierarchy
Chapter 3.4.2 --- Register File Organization
Chapter 3.4.3 --- Special Registers
Chapter 3.4.4 --- Address Generation
Chapter 3.4.5 --- Load and Store
Chapter 4 --- Implementation of Speech Recognition on ASIP
Chapter 4.1 --- Hardware Architecture Exploration
Chapter 4.1.1 --- Floating Point and Fixed Point
Chapter 4.1.2 --- Multiplication and Accumulation
Chapter 4.1.3 --- Pipelining
Chapter 4.1.4 --- Memory Architecture
Chapter 4.1.5 --- Saturation Logic
Chapter 4.1.6 --- Specialized Addressing Modes
Chapter 4.1.7 --- Repetitive Operation
Chapter 4.2 --- Software Algorithm Implementation
Chapter 4.2.1 --- Implementation Using Base Instruction Set
Chapter 4.2.2 --- Implementation Using Refined Instruction Set
Chapter 5 --- Simulation Results
Chapter 6 --- Conclusions and Future Work
Appendices
Chapter A --- Base Instruction Set
Chapter B --- Special Registers
Chapter C --- Chip Microphotograph of ASIP
Chapter D --- The Testing Board of ASIP
Bibliography
Optimisations arithmétiques et synthèse de haut niveau
High-level synthesis (HLS) tools offer increased productivity for FPGA programming. However, because they are relatively young, they still lack many arithmetic optimizations. This thesis proposes safe arithmetic optimizations that should always be applied. Some of these optimizations are simple operator specializations that follow C semantics. Others require departing from the semantics embedded in high-level input languages, which are inherited from software programming, for an improved accuracy/cost/performance ratio. To demonstrate this claim, the sum-of-products of floating-point numbers is used as a case study. The sum is performed in a fixed-point format that is tailored to the application, according to the context in which the operator is instantiated. In some cases, there is not enough information about the input data to tailor the fixed-point accumulator. The fall-back strategy used in this thesis is to generate an accumulator covering the entire floating-point range. This thesis explores different strategies for implementing such a large accumulator, including new ones. The use of a two's complement representation instead of sign+magnitude is shown to save resources and to reduce the accumulation loop delay. Based on a tapered precision scheme and an exact accumulator, the posit number system claims to be a candidate to replace the IEEE floating-point format. A thorough analysis of posit operators is performed, using the same level of hardware optimization as state-of-the-art floating-point operators. Their cost remains much higher than that of their floating-point counterparts in terms of resource usage and performance. Finally, this thesis presents a compatibility layer for HLS tools that allows a single code base to be deployed on multiple tools. This library implements a strongly typed custom-size integer type alongside a set of optimized custom operators.
Because high-level synthesis (HLS) tools are relatively young, many arithmetic optimizations are not yet implemented in them. This thesis proposes arithmetic optimizations that exploit the specific context in which operators are instantiated. Some optimizations are simple operator specializations that respect C semantics. Others require moving away from these semantics to improve the accuracy/cost/performance trade-off. This proposal is demonstrated on sums of products of floating-point numbers. The sum is computed in a fixed-point format defined by its context. When too little information is available to define this fixed-point format, one strategy is to generate an accumulator covering the entire floating-point format. This thesis explores several implementations of such an accumulator. Using a two's complement representation reduces the critical path of the accumulation loop as well as the amount of resources used. An alternative format to floating-point numbers, called posit, proposes using a variable-precision encoding. In addition, this format is augmented with an exact accumulator. To evaluate the hardware cost of this format precisely, this thesis presents architectures of posit operators, implemented with the same degree of optimization as state-of-the-art floating-point operators. A detailed analysis shows that the cost of posit operators is nevertheless much higher than that of their floating-point equivalents. Finally, this thesis presents a compatibility layer between HLS tools, making it possible to target several tools with a single code base. This library implements a variable-size integer type with strictly typed semantics, as well as a set of optimized ad hoc operators.
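As a rough software model of the fall-back strategy described in this abstract (an accumulator wide enough to cover the entire floating-point range), the sketch below accumulates products of finite IEEE double-precision values in one wide signed fixed-point integer and rounds only once at the end. The names `to_fixed`, `exact_sum_of_products`, and `EXP_MIN_BITS` are illustrative and not taken from the thesis, and Python's unbounded integers stand in for a hardware accumulator of fixed width.

```python
from fractions import Fraction

# Every finite double is an integer multiple of 2**-1074 (the smallest subnormal).
EXP_MIN_BITS = 1074


def to_fixed(x: float) -> int:
    """Return the integer n such that x == n * 2**-1074 exactly (finite x only)."""
    num, den = x.as_integer_ratio()        # den is a power of two <= 2**1074
    return num * ((1 << EXP_MIN_BITS) // den)


def exact_sum_of_products(xs, ys) -> float:
    """Sum x*y over the pairs without intermediate rounding, then round once."""
    acc = 0                                # signed fixed point, scaled by 2**-2148
    for x, y in zip(xs, ys):
        acc += to_fixed(x) * to_fixed(y)   # exact integer product
    # Single, correctly rounded conversion back to a double.
    return float(Fraction(acc, 1 << (2 * EXP_MIN_BITS)))
```

With inputs such as `[1e20, 1.0, -1e20]` multiplied by ones, a naive floating-point loop loses the small term to rounding, whereas this model keeps it because no rounding happens until the final conversion. The signed integer plays the role of the two's complement accumulator; the width needed in hardware is a design question this sketch does not address.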
Hardware Acceleration Using Functional Languages
The goal of this thesis is to explore how the functional paradigm can be used for hardware acceleration, specifically for data-parallel tasks. The abstraction level of traditional hardware description languages such as VHDL and Verilog is no longer sufficient. Languages originally designed for software development and modeling, such as C/C++, SystemC or MATLAB, are gaining ground for description at the algorithmic or behavioral level. Functional languages cannot match imperative ones in prevalence and popularity among programmers, yet they surpass them in many properties, e.g. verifiability, the ability to capture inherent parallelism, and code compactness. FPGAs, graphics cards (GPUs) and multicore processors are often used to accelerate data-parallel computations. The practical part of this thesis extends the existing Accelerate library for GPU computing with VHDL output. Accelerate can be understood as a domain-specific language embedded in Haskell with a backend for the NVIDIA CUDA environment. The extension for high-level synthesis of circuits in VHDL presented in this thesis uses the same language and frontend.
The aim of this thesis is to research how the functional paradigm can be used for hardware acceleration with an emphasis on data-parallel tasks. The level of abstraction of the traditional hardware description languages, such as VHDL or Verilog, is becoming too low. High-level languages from the domains of software development and modeling, such as C/C++, SystemC or MATLAB, are experiencing a boom for hardware description on the algorithmic or behavioral level. Functional languages are not so commonly used, but they outperform imperative languages in verification, the ability to capture inherent parallelism and the compactness of code. Data-parallel tasks are often accelerated on FPGAs, GPUs and multicore processors. In this thesis, we use a library for general-purpose GPU programs called Accelerate and extend it to produce VHDL. Accelerate is a domain-specific language embedded into Haskell with a backend for the NVIDIA CUDA platform. We use the language and its frontend, and create a new backend for high-level synthesis of circuits in VHDL.
A High Level Synthesis Flow Using Model Driven Engineering
Intensive Signal Processing (ISP) applications handle large amounts of data and are characterized by hierarchical and data parallel tasks, which manipulate multidimensional data arrays according to complex data dependencies. Performance requirements often preclude ISP applications from being implemented purely in software and instead call for using custom and efficient hardware accelerators. A hardware accelerator is an electronic design dedicated to the execution of a specific application. Its hardware architecture can be designed for a maximal parallelization of the algorithm needed to execute its application and for optimal execution support for regular and repetitive tasks. However, the complexity of hardware accelerators makes them difficult to manipulate at low abstraction levels (in a Hardware Description Language (HDL) for instance). The description of complex ISP applications is also error prone and tedious when using tools that constrain the number of dimensions of data arrays. High Level Synthesis (HLS) seeks to simplify the design of hardware accelerators by describing applications at a high abstraction level and by generating the corresponding low level implementation. Application specification is easier at a high abstraction level since hardware designers do not need to handle all low level implementation details. HLS thus aims to achieve algorithm-architecture matching by construction, through the automated synthesis of a hardware architecture for an application specified at a high level. The automatic generation of low level implementations drastically reduces non-recurring engineering costs and the time to market compared to hand-tuned implementations in HDL. For these reasons, HLS tools have been increasingly successful among the hardware designer community. This trend is followed by the continual integration of new capabilities and functionality in the tools. Therefore, successful HLS has to support rapidly evolving technologies and be maintainable in order to capitalize on efforts. We present some design challenges faced by HLS and how model-driven engineering can meet them.
Validation and verification of the interconnection of hardware intellectual property blocks for FPGA-based packet processing systems
As networks become more versatile, the computational requirement for supporting additional
functionality increases. The increasing demands of these networks can be met by Field Programmable
Gate Arrays (FPGA), which are an increasingly popular technology for implementing packet processing
systems. The fine-grained parallelism and density of these devices can be exploited to meet the
computational requirements and implement complex systems on a single chip. However, the increasing
complexity of FPGA-based systems makes them susceptible to errors and difficult to test and debug.
To tackle the complexity of modern designs, system-level languages have been developed to provide
abstractions suited to the domain of the target system. Unfortunately, the lack of formality in
these languages can give rise to errors that are not caught until late in the design cycle. This
thesis presents three techniques for verifying and validating FPGA-based packet processing systems
described in a system-level description language. First, a type system is applied to the system
description language to detect errors before implementation. Second, system-level transaction
monitoring is used to observe high-level events on-chip following implementation. Third, the
high-level information embodied in the system description language is exploited to allow the system
to be automatically instrumented for on-chip monitoring.
This thesis demonstrates that these techniques catch errors which are undetected by traditional
verification and validation tools. The locations of faults are specified and errors are caught
earlier in the design flow, which saves time by reducing synthesis iterations.
Cryptographic key distribution in wireless sensor networks: a hardware perspective
In this work the suitability of different methods of symmetric key distribution for application in wireless sensor networks is discussed. Each method is considered in terms of its security implications for the network. It is concluded that an asymmetric scheme is the optimum choice for key distribution. In particular, Identity-Based Cryptography (IBC) is proposed as the most suitable of the various asymmetric approaches. A protocol for key distribution using an identity-based Non-Interactive Key Distribution Scheme (NIKDS) and an Identity-Based Signature (IBS) scheme is presented. The protocol was analysed on the ARM920T processor, and measurements were taken of the run time and energy of its component parts. It was found that the Tate pairing component of the NIKDS consumes significant amounts of energy, and so it should be ported to hardware. An accelerator was implemented in 65nm Complementary Metal Oxide Semiconductor (CMOS) technology, and area, timing and energy figures have been obtained for the design. Initial results indicate that a hardware implementation of IBC would meet the strict energy constraint of a wireless sensor network node.
Generic low power reconfigurable distributed arithmetic processor
Higher performance, lower cost, ever smaller integrated circuit components, and
higher packaging density of chips are ongoing goals of the microelectronics and computer
industry. As these goals are being achieved, however, power consumption and flexibility are
increasingly becoming bottlenecks that need to be addressed with new technology in Very
Large-Scale Integration (VLSI) design.
Modern systems require more energy to support the computational capability
demanded by growing requirements, and these requirements drive changes in
standards not only in audio and video broadcasting but also in communications, such as wireless
connections and network protocols. High flexibility and low power consumption pull in opposite
directions, but combining them in one system is the ultimate goal of designers.
A generic domain-specific low-power reconfigurable processor for the distributed
arithmetic algorithm is presented in this dissertation. This domain reconfigurable processor
features high efficiency in terms of area, power and delay, which approaches the
performance of an ASIC design, while retaining the flexibility of programmable platforms.
The architecture not only supports typical distributed arithmetic algorithms which can be
found in most still picture compression standards and video conferencing standards, but
also offers implementation ability for other distributed arithmetic algorithms found in
digital signal processing, telecommunication protocols and automatic control.
In this processor, a simple reconfigurable low power control unit is implemented with
good performance in area, power and timing. The generic characteristic of the architecture
makes it applicable for any small and medium size finite state machines which can be used
as control units to implement complex system behaviour and can be found in almost all
engineering disciplines. Furthermore, to map target applications efficiently onto the
proposed architecture, a new algorithm is introduced that searches for the best set of
common sharing terms and keeps the area and power consumption of the implementation
low. The software implementation of this algorithm is presented; it can be used
not only for the architecture proposed in this dissertation but also for any
implementation of adder-based distributed arithmetic algorithms. In addition, several low
power design techniques are applied in the architecture, such as an unsymmetrical design
style that includes unsymmetrical interconnection arrangement, unsymmetrical PTB selection
and unsymmetrical mapping of basic computing units. Together, these design techniques achieve
substantial power savings. It is believed that they can be extended to other
low power designs and architectures.
The processor presented in this dissertation can be used to implement complex, high
performance distributed arithmetic algorithms for communication and image processing
applications at low cost in area and power compared with traditional
methods.
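To make the term "distributed arithmetic" in this abstract concrete, the sketch below models the classic look-up-table form of the technique in software: an inner product with fixed coefficients is computed bit-serially from a precomputed table of coefficient sums, using only table reads, shifts and additions. This is a generic textbook formulation, not the adder-based variant or the reconfigurable architecture of the dissertation; the function names and the 8-bit default word length are assumptions.

```python
def build_lut(coeffs):
    """Table entry `addr` holds the sum of the coefficients whose index bit is set in addr."""
    n = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(coeffs, xs, bits=8):
    """Compute sum(c * x) for signed `bits`-bit two's complement inputs xs."""
    lut = build_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Gather bit b of every input into one table address.
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> b) & 1) << i
        term = lut[addr] << b
        # The most significant bit carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc
```

For example, `da_inner_product([3, -5, 7], [10, -4, 127])` gives the same result as the direct sum of products while never multiplying an input by a coefficient at run time. The table grows as 2 to the power of the number of coefficients, which is one reason practical designs partition the table or use adder-based forms.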