12 research outputs found

    Design synthesis for dynamically reconfigurable logic systems

    Dynamic reconfiguration of logic circuits has been a research problem for over four decades. While applications using logic reconfiguration in practical scenarios have been demonstrated, designing these systems has proved to be a difficult process demanding the skills of an experienced reconfigurable logic design expert. This thesis proposes an automatic synthesis method which relieves designers of some of the difficulties associated with designing partially dynamically reconfigurable systems. A new design abstraction model for reconfigurable systems is proposed in order to support design exploration using the presented method. Given an input behavioural model, a technology server and a set of design constraints, the method generates a reconfigurable design solution in the form of a 3D floorplan and a configuration schedule. The approach makes use of genetic algorithms. It facilitates global optimisation to accommodate the multiple design objectives common in reconfigurable system design, while making realistic estimates of configuration overheads and of the potential for resource sharing between configurations. A set of custom evolutionary operators has been developed to cope with a multiple-objective search space. Furthermore, the application of a simulation technique for verifying the results of such an automatic exploration is outlined in the thesis. The qualities of the proposed method are evaluated using a set of benchmark designs taking data from a real reconfigurable logic technology. Finally, some extensions to the proposed method and possible research directions are discussed.
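
    As a rough illustration of the kind of search the abstract describes, the following is a minimal sketch of a genetic algorithm that evolves task-to-region assignments under competing objectives (latency and reconfiguration overhead). The task set, cost model, weights and GA parameters are all invented for this sketch; they do not come from the thesis, which additionally produces a 3D floorplan and uses custom evolutionary operators.

```python
# Minimal multi-objective genetic algorithm sketch: evolve assignments of
# tasks to reconfigurable "slots" and time steps, trading latency against
# reconfiguration overhead. Tasks, cost model, weights and GA parameters are
# invented for illustration only.
import random

TASKS = ["fir", "fft", "crc", "viterbi"]     # hypothetical task-graph nodes
SLOTS = 2                                    # reconfigurable regions
RECONFIG_COST = 5.0                          # cost charged per configuration swap

def random_candidate():
    # a candidate assigns each task to a (slot, time-step) pair
    return {t: (random.randrange(SLOTS), random.randrange(len(TASKS))) for t in TASKS}

def objectives(cand):
    latency = max(step for _, step in cand.values()) + 1
    # configuration changes: distinct time steps occupied within one slot
    swaps = sum(max(0, len({step for s, step in cand.values() if s == slot}) - 1)
                for slot in range(SLOTS))
    # a (slot, step) pair can host only one task; count violations
    conflicts = len(cand) - len(set(cand.values()))
    return latency, swaps * RECONFIG_COST, conflicts

def fitness(cand, w=(1.0, 0.3, 100.0)):
    lat, overhead, conflicts = objectives(cand)
    return w[0] * lat + w[1] * overhead + w[2] * conflicts   # weighted-sum scalarisation

def mutate(cand):
    child = dict(cand)
    t = random.choice(TASKS)
    child[t] = (random.randrange(SLOTS), random.randrange(len(TASKS)))
    return child

def crossover(a, b):
    return {t: random.choice((a[t], b[t])) for t in TASKS}

def evolve(generations=200, pop_size=30):
    pop = [random_candidate() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite = pop[: pop_size // 2]
        pop = elite + [mutate(crossover(*random.sample(elite, 2)))
                       for _ in range(pop_size - len(elite))]
    return min(pop, key=fitness)

best = evolve()
print("best assignment:", best, "objectives:", objectives(best))
```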

    Using embedded hardware monitor cores in critical computer systems

    The integration of FPGA devices in many different architectures and services makes monitoring and real-time detection of errors an important concern in FPGA system design. A monitor is a tool, or a set of tools, that facilitates analytic measurements when observing a given system. The goal of these observations is usually performance analysis and optimisation, or surveillance of the system. However, System-on-Chip (SoC) based designs leave few points at which to attach external tools such as logic analysers. Thus, an embedded error-detection core that allows observation of critical system nodes (such as processor cores and buses) should reinforce the operation of the FPGA-based system in order to prevent system failures. The core should not interfere with system performance and must ensure timely detection of errors. This thesis is an investigation into how a robust hardware-monitoring module can be efficiently integrated in a target PCI board (with FPGA-based application processing features) which is part of a critical computing system. [Continues.]
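
    The following toy model illustrates, in software only, the basic job such a monitor core performs: observe transactions on a critical node and flag violations of simple invariants. The transaction fields, address windows and checks are hypothetical and are not taken from the thesis.

```python
# Software-only model of a bus monitor: watch transactions on a critical node
# and flag violations of simple invariants. The transaction format, address
# windows and checks are hypothetical, not taken from the thesis.
from dataclasses import dataclass

@dataclass
class BusTransaction:
    address: int
    data: int
    parity: int          # even-parity bit computed over the data word

VALID_RANGES = [(0x0000_0000, 0x0FFF_FFFF),   # hypothetical RAM window
                (0x4000_0000, 0x4000_FFFF)]   # hypothetical peripheral window

def even_parity(word):
    return bin(word).count("1") & 1

def check(txn):
    """Return a list of detected error conditions (empty list means OK)."""
    errors = []
    if not any(lo <= txn.address <= hi for lo, hi in VALID_RANGES):
        errors.append("address out of range")
    if even_parity(txn.data) != txn.parity:
        errors.append("data parity mismatch")
    return errors

# A corrupted transaction is flagged, a clean one passes.
print(check(BusTransaction(address=0x4000_0010, data=0b1011, parity=0)))  # ['data parity mismatch']
print(check(BusTransaction(address=0x0000_0100, data=0b1011, parity=1)))  # []
```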

    Cryptographic primitives on reconfigurable platforms

    Tsoi Kuen Hung. Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 84-92). Abstracts in English and Chinese.
    Table of contents:
    Chapter 1 Introduction (p.1)
        1.1 Motivation (p.1)
        1.2 Objectives (p.3)
        1.3 Contributions (p.3)
        1.4 Thesis Organization (p.4)
    Chapter 2 Background and Review (p.6)
        2.1 Introduction (p.6)
        2.2 Cryptographic Algorithms (p.6)
        2.3 Cryptographic Applications (p.10)
        2.4 Modern Reconfigurable Platforms (p.11)
        2.5 Review of Related Work (p.14)
            2.5.1 Montgomery Multiplier (p.14)
            2.5.2 IDEA Cipher (p.16)
            2.5.3 RC4 Key Search (p.17)
            2.5.4 Secure Random Number Generator (p.18)
        2.6 Summary (p.19)
    Chapter 3 The IDEA Cipher (p.20)
        3.1 Introduction (p.20)
        3.2 The IDEA Algorithm (p.21)
            3.2.1 Cipher Data Path (p.21)
            3.2.2 S-Box: Multiplication Modulo 2^16 + 1 (p.23)
            3.2.3 Key Schedule (p.24)
        3.3 FPGA-based IDEA Implementation (p.24)
            3.3.1 Multiplication Modulo 2^16 + 1 (p.24)
            3.3.2 Deeply Pipelined IDEA Core (p.26)
            3.3.3 Area Saving Modification (p.28)
            3.3.4 Key Block in Memory (p.28)
            3.3.5 Pipelined Key Block (p.30)
            3.3.6 Interface (p.31)
            3.3.7 Pipelined Design in CBC Mode (p.31)
        3.4 Summary (p.32)
    Chapter 4 Variable Radix Montgomery Multiplier (p.33)
        4.1 Introduction (p.33)
        4.2 RSA Algorithm (p.34)
        4.3 Montgomery Algorithm: A x B mod N (p.35)
        4.4 Systolic Array Structure (p.36)
        4.5 Radix-2^k Core (p.37)
            4.5.1 The Original Kornerup Method (Bit-Serial) (p.37)
            4.5.2 The Radix-2^k Method (p.38)
            4.5.3 Time-Space Relationship of Systolic Cells (p.38)
            4.5.4 Design Correctness (p.40)
        4.6 Implementation Details (p.40)
        4.7 Summary (p.41)
    Chapter 5 Parallel RC4 Engine (p.42)
        5.1 Introduction (p.42)
        5.2 Algorithms (p.44)
            5.2.1 RC4 (p.44)
            5.2.2 Key Search (p.46)
        5.3 System Architecture (p.47)
            5.3.1 RC4 Cell Design (p.47)
            5.3.2 Key Search (p.49)
            5.3.3 Interface (p.50)
        5.4 Implementation (p.50)
            5.4.1 RC4 cell (p.51)
            5.4.2 Floorplan (p.53)
        5.5 Summary (p.53)
    Chapter 6 Blum Blum Shub Random Number Generator (p.55)
        6.1 Introduction (p.55)
        6.2 RRNG Algorithm (p.56)
        6.3 PRNG Algorithm (p.58)
        6.4 Architectural Overview (p.59)
        6.5 Implementation (p.59)
            6.5.1 Hardware RRNG (p.60)
            6.5.2 BBS PRNG (p.61)
            6.5.3 Interface (p.66)
        6.6 Summary (p.66)
    Chapter 7 Experimental Results (p.68)
        7.1 Design Platform (p.68)
        7.2 IDEA Cipher (p.69)
            7.2.1 Size of IDEA Cipher (p.70)
            7.2.2 Performance of IDEA Cipher (p.70)
        7.3 Variable Radix Systolic Array (p.71)
        7.4 Parallel RC4 Engine (p.75)
        7.5 BBS Random Number Generator (p.76)
            7.5.1 Size (p.76)
            7.5.2 Speed (p.76)
            7.5.3 External Clock (p.77)
            7.5.4 Random Performance (p.78)
        7.6 Summary (p.78)
    Chapter 8 Conclusion (p.81)
        8.1 Future Development (p.83)
    Bibliography (p.8)
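
    One of the recurring building blocks in the contents above is multiplication modulo 2^16 + 1, the operation at the heart of IDEA. A plain software reference model of this well-known operation is sketched below for orientation; it is not the pipelined FPGA multiplier developed in the thesis.

```python
# Reference (software) model of IDEA's multiplication modulo 2^16 + 1, the
# operation listed as the "S-Box" in the table of contents above. In IDEA the
# 16-bit all-zero word stands for 2^16, so 0 never occurs as an actual operand
# value. This is a plain arithmetic model for illustration, not the pipelined
# FPGA multiplier described in the thesis.
MOD = (1 << 16) + 1   # 65537, a prime

def idea_mul(a, b):
    """Multiply two 16-bit words modulo 2^16 + 1, with 0 encoding 2^16."""
    a = a if a != 0 else 1 << 16
    b = b if b != 0 else 1 << 16
    r = (a * b) % MOD
    return 0 if r == (1 << 16) else r

assert idea_mul(1, 1) == 1
assert idea_mul(0, 1) == 0          # 2^16 * 1 = 2^16, encoded back as 0
assert idea_mul(0, 0) == 1          # 2^16 * 2^16 = 2^32 = 1 (mod 2^16 + 1)
```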

    Unifying software and hardware of multithreaded reconfigurable applications within operating system processes

    Novel reconfigurable System-on-Chip (SoC) devices make it possible to combine software with application-specific hardware accelerators to speed up applications. However, by mixing user software and user hardware, principal programming abstractions and system-software commodities are usually lost, since hardware accelerators (1) do not have an execution context: it is typically the programmer who is expected to provide one for each accelerator; (2) do not have a virtual memory abstraction: it is again the programmer who must move data from user software space to user hardware, which is usually burdensome (and sometimes impossible); (3) cannot invoke system services (e.g., to allocate memory, open files, or communicate); and (4) are not easily portable, since they depend mostly on system-level interfacing even though they logically belong to the application level. We introduce a unified Operating System (OS) process for codesigned reconfigurable applications that provides (1) a unified memory abstraction for the software and hardware parts of an application, (2) execution transfers from software to hardware and vice versa, enabling hardware accelerators to use system services and to call back other software and hardware functions, and (3) multithreaded execution of multiple software and hardware threads. The unified OS process ensures the portability of codesigned applications by providing standardised means of interfacing. Introducing yet another abstraction layer usually affects performance: we show that runtime optimisations in the system layer supporting the unified OS process can minimise the performance loss and even outperform typical approaches. The unified OS process also fosters unrestricted automated synthesis of software to hardware, allowing unlimited migration of application components. We demonstrate the advantages of the unified OS process in practice, for Linux systems running on Xilinx Virtex-II Pro and Altera Excalibur reconfigurable devices.
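
    As a loose, software-only analogy of the abstraction being described, the sketch below runs an "accelerator" as a thread inside the same process as the software, sharing one memory abstraction and transferring execution back to a software service. Everything here is simulated; it does not model the actual system layer, Virtex-II Pro, or Excalibur platforms.

```python
# Software-only analogy of the unified-process idea: a simulated "hardware
# accelerator" runs as a thread inside the same OS process as the software,
# shares the same memory abstraction, and transfers execution back to a
# software service. Nothing here touches real reconfigurable hardware.
import threading

shared_memory = {"input": list(range(8)), "output": None}   # stand-in for a unified address space

def allocate_buffer(n):
    # a "system service" invoked on behalf of the hardware part
    return [0] * n

def hw_accelerator(callbacks, done):
    data = shared_memory["input"]            # same memory abstraction as the software part
    buf = callbacks["alloc"](len(data))      # execution transfer back to software
    for i, x in enumerate(data):
        buf[i] = x * x                       # the "accelerated" kernel
    shared_memory["output"] = buf
    done.set()

done = threading.Event()
t = threading.Thread(target=hw_accelerator, args=({"alloc": allocate_buffer}, done))
t.start()
done.wait()                                  # software side blocks until the "hardware" finishes
t.join()
print(shared_memory["output"])               # [0, 1, 4, 9, 16, 25, 36, 49]
```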

    Closing the Gap between FPGA and ASIC: Balancing Flexibility and Efficiency

    Despite the many advantages of Field-Programmable Gate Arrays (FPGAs), they fail to take over the IC design market from Application-Specific Integrated Circuits (ASICs) for high-volume and even medium-volume applications, because FPGAs come with significant costs in area, delay, and power consumption. There are two main reasons that FPGAs have a large efficiency gap relative to ASICs: (1) FPGAs are extremely flexible, as they have fully programmable soft-logic blocks and routing networks, and (2) FPGAs have hard-logic blocks that are usable by only a subset of applications. In other words, current FPGAs have a heterogeneous structure comprising flexible soft-logic and efficient hard-logic blocks, which suffer from inefficiency and inflexibility, respectively. The inefficiency of the soft-logic is a challenge for any application that is mapped to FPGAs, and the lack of flexibility in the hard-logic results in wasted resources when an application cannot use the hard-logic. In this thesis, we approach the inefficiency problem of FPGAs by bridging the efficiency/flexibility gap between the hard- and soft-logic. The main goal of this thesis is to trade some efficiency of the hard-logic for flexibility, on the one hand, and to trade some flexibility of the soft-logic for efficiency, on the other. In other words, this thesis deals with two issues: (1) adding more generality to the hard-logic of FPGAs, and (2) improving the soft-logic by adapting it to the generic requirements of applications.

    In the first part of the thesis, we introduce new techniques that expand the functionality of the FPGA hard-logic. The hard-logic includes the dedicated resources that are tightly coupled with the soft-logic (i.e., adder circuitry and carry chains) as well as stand-alone ones (i.e., DSP blocks). These specialized resources are intended to accelerate critical arithmetic operations that appear in the pre-synthesis representation of applications; we introduce mapping and architectural solutions that enable both types of hard-logic to support additional arithmetic operations. We first present a mapping technique that extends the application of FPGA carry chains to carry-save arithmetic, and then, to increase the generality of the hard-logic, we introduce novel architectures; using these architectures, more applications can take advantage of the FPGA hard-logic.

    In the second part of the thesis, we improve the efficiency of the FPGA soft-logic by exploiting the circuit patterns that emerge after logic synthesis, i.e., connection and logic patterns. Using these patterns, we design new soft-logic blocks that have less flexibility, but more efficiency, than current ones. In this part, we first introduce logic chains: fixed connections that are integrated between the soft-logic blocks of FPGAs and are well suited for the long chains of logic that appear post-synthesis. Logic chains provide fast and low-cost connectivity, increase the bandwidth of the logic blocks without changing their interface with the routing network, and improve the logic density of soft-logic blocks. In addition to logic chains, and as a complementary contribution, we present a non-LUT soft-logic block that comprises simple, pre-connected cells. The structure of this logic block is inspired by the logic patterns that appear post-synthesis. The block has a complexity that is only linear in the number of inputs, it offers the potential for multiple independent outputs, and its delay is only logarithmic in the number of inputs. Although this new block is less flexible than a LUT, we show (1) that effective mapping algorithms exist, (2) that, due to its simplicity, poor utilization is less of an issue than with LUTs, and (3) that a few LUTs can still be used in extremely unfortunate cases. In summary, to bridge the gap between FPGAs and ASICs, we approach the problem from two complementary directions, which balance the flexibility and efficiency of FPGA logic blocks. However, we were able to explore only a few design points in this thesis, and future work could focus on further exploration of the design space.
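
    For readers unfamiliar with carry-save arithmetic, which the first mapping technique targets, the sketch below shows the underlying 3:2 compression in plain software. It illustrates only the arithmetic identity, not the carry-chain mapping or the architectures proposed in the thesis.

```python
# Bit-level model of carry-save addition (a 3:2 compressor per bit position):
# three operands are reduced to a sum word and a carry word with no carry
# propagation, and a single conventional addition finishes the job. This
# illustrates only the arithmetic identity, not the FPGA mapping itself.
WIDTH = 8
MASK = (1 << WIDTH) - 1

def carry_save_add(a, b, c):
    sum_word = (a ^ b ^ c) & MASK                               # per-bit sum, no propagation
    carry_word = (((a & b) | (a & c) | (b & c)) << 1) & MASK    # per-bit carries, shifted left
    return sum_word, carry_word

a, b, c = 23, 45, 67
s, cy = carry_save_add(a, b, c)
assert (s + cy) & MASK == (a + b + c) & MASK    # one final carry-propagate add recovers the result
print(s, cy, (s + cy) & MASK)
```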

    Efficient implementation of video processing algorithms on FPGA

    The work contained in this portfolio thesis was carried out as part of an Engineering Doctorate (EngD) programme from the Institute for System Level Integration. The work was sponsored by Thales Optronics and focuses on issues surrounding the implementation of video processing algorithms on field programmable gate arrays (FPGAs). A description is given of FPGA technology and the currently dominant methods of designing and verifying firmware. The problems of translating a description of behaviour into one of structure are discussed, and some of the latest methodologies for tackling this problem are introduced. A number of algorithms are then examined, including methods of contrast enhancement, deconvolution, and image fusion. Algorithms are characterised according to the nature of their execution flow, and this is used as justification for some of the design choices that are made. An efficient method of performing large two-dimensional convolutions is also described. The portfolio also contains a discussion of an FPGA implementation of a PID control algorithm, an overview of FPGA dynamic reconfigurability, and the development of a demonstration platform for rapid deployment of video processing algorithms in FPGA hardware.
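
    The portfolio mentions an FPGA implementation of a PID control algorithm; as background, a plain discrete-time PID loop is sketched below. The gains, time step and toy plant are invented for this sketch and bear no relation to the actual implementation.

```python
# Plain discrete-time PID controller driving a toy first-order plant. The
# gains, time step and plant are invented for this sketch and are unrelated
# to the FPGA implementation described in the portfolio.
def make_pid(kp, ki, kd, dt):
    state = {"integral": 0.0, "prev_error": 0.0}
    def step(setpoint, measurement):
        error = setpoint - measurement
        state["integral"] += error * dt
        derivative = (error - state["prev_error"]) / dt
        state["prev_error"] = error
        return kp * error + ki * state["integral"] + kd * derivative
    return step

pid = make_pid(kp=1.2, ki=0.5, kd=0.05, dt=0.01)
y = 0.0                                    # toy plant state, y' = u - y
for _ in range(2000):                      # 20 seconds of simulated time
    u = pid(setpoint=1.0, measurement=y)
    y += 0.01 * (u - y)                    # forward-Euler integration step
print(round(y, 3))                         # approaches the setpoint of 1.0
```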

    Recent Advances in Embedded Computing, Intelligence and Applications

    The latest proliferation of Internet of Things deployments and edge computing, combined with artificial intelligence, has led to new and exciting application scenarios in which embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately contribute to fostering the offloading of processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence. Among them are hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems.

    Parallel and Distributed Computing

    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of GPUs (Graphics Processing Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing.

    Optimal program variant generation for hybrid manycore systems

    Field Programmable Gate Arrays (FPGAs) promise to deliver superior energy efficiency in heterogeneous high-performance computing compared with multicore CPUs and GPUs. The rate of adoption is, however, hampered by the relative difficulty of programming FPGAs. High-level synthesis (HLS) tools such as Xilinx Vivado, Altera OpenCL or Intel's HLS address a large part of the programmability issue by synthesizing a Hardware Description Language representation from a high-level specification of the application, given in programming languages such as OpenCL C that are typically used to program CPUs and GPUs. Although HLS solutions make programming easier, they fail to also lighten the burden of optimization. Application developers must rely on expert knowledge to manually optimize their applications for each target device, meaning that traditional HLS solutions do not offer a solution to the issue of performance portability. This state of affairs prompted the development of compiler frameworks such as TyTra that operate at an even higher level of abstraction, one that is amenable to the use of Design Space Exploration (DSE). With DSE, the initial program specification can be seen as the starting location in a search space of correct-by-construction program transformations. In TyTra the search space is generated from the transitive closure of term-level transformations derived from type-level transformations. Compiler frameworks such as TyTra theoretically solve the issue of performance portability by providing a way to automatically generate alternative correct program variants. They suffer, however, from the very practical issue that the generated space is often too large to explore fully. As a consequence, the globally optimal solution may be overlooked. In this work we provide a novel solution to the issue of performance portability by deriving an efficient yet effective DSE strategy for the TyTra compiler framework. We make use of categorical data types to derive categorical semantics for the formal languages that describe the terms, types, cost-performance estimates and their transformations. From these we define a category of interpretations for TyTra applications, from which we derive a DSE strategy that finds the globally optimal transformation sequence in polynomial time. This is achieved by reducing the size of the generated search space. We formally state and prove a theorem for this claim and then show that the polynomial run time of our DSE strategy has practically negligible coefficients, leading to sub-second exploration times for realistic applications.
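
    To make the notion of design-space exploration over program variants concrete, the sketch below enumerates sequences of toy rewrites and scores them with an invented cost model. It is deliberately brute force; TyTra's type-driven, categorical formulation and its polynomial-time strategy are not reproduced here.

```python
# Toy design-space exploration: apply sequences of rewrites to an initial
# program "variant" and keep the variant with the best estimated cost. The
# variants, rewrites and cost model are invented; this brute-force enumeration
# does not reproduce TyTra's categorical formulation or its polynomial-time
# strategy, it only illustrates the idea of searching a variant space.
from itertools import permutations

# A variant is summarised by a few abstract attributes a cost model can score.
initial = {"unroll": 1, "pipelined": False, "streams": 1}

def unroll(v):
    return {**v, "unroll": v["unroll"] * 2}

def pipeline(v):
    return {**v, "pipelined": True}

def split_stream(v):
    return {**v, "streams": v["streams"] + 1}

REWRITES = [unroll, pipeline, split_stream]

def cost(v):
    # invented estimate: parallelism reduces cycles, but area grows with it
    cycles = 1024 / (v["unroll"] * v["streams"])
    cycles *= 0.6 if v["pipelined"] else 1.0
    area = 10 * v["unroll"] + 15 * v["streams"]
    return cycles + 0.5 * area               # scalarised cost in arbitrary units

best_variant, best_cost, best_seq = initial, cost(initial), ()
for length in range(1, len(REWRITES) + 1):
    for seq in permutations(REWRITES, length):
        v = initial
        for rewrite in seq:
            v = rewrite(v)
        if cost(v) < best_cost:
            best_variant, best_cost, best_seq = v, cost(v), tuple(r.__name__ for r in seq)

print(best_seq, best_variant, round(best_cost, 1))
```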

    Design and implementation of a real-time miniaturized embedded stereo-vision system

    The main motivation of the thesis is to develop a fully integrated, modular, small-baseline (≤3 cm), low-cost (≤CAD$600), real-time miniaturized embedded stereo-vision system that fits within 5x5 cm and consumes very low power ([email protected]). The system consists of two small-profile cameras and a dual-core embedded media processor running at 600 MHz per core. The stereo-matching engine performs sub-sampling, rectification, pre-processing using the census transform, correlation-based Sum of Hamming Distance matching using three levels of recursion, a left-right consistency (LRC) check, and post-processing. The novel post-processing algorithm removes outliers due to low-texture regions and depth discontinuities. A quantitative evaluation of the post-processing algorithm is presented, showing an average improvement of 13.61% across all regions (based on the 2006 Middlebury dataset). To further enhance the performance of the system, optimization steps are employed to achieve a speed of around 10 fps for disparity maps in the MESVS-I system and 20 fps in MESVS-II.
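
    Two ingredients named in the abstract, the census transform and Hamming-distance matching costs, are sketched below in textbook form for a single pixel. The image patches and window size are arbitrary, and the recursive Sum-of-Hamming-Distance engine, LRC check and post-processing of the thesis are not reproduced.

```python
# Textbook-style sketch of the census transform and a Hamming-distance
# matching cost for one pixel. The 5x5 patches and the 3x3 window are
# arbitrary; the right "image" is just the left one shifted by one pixel, so
# the correct disparity is 1.
left = [[10, 12, 11, 13, 14],
        [ 9, 30, 31, 29, 15],
        [ 8, 32, 40, 33, 16],
        [ 7, 28, 30, 27, 17],
        [ 6,  5,  4,  3,  2]]
right = [row[1:] + row[:1] for row in left]      # simulate a 1-pixel horizontal shift

def census(img, y, x, win=3):
    """Bit string recording which neighbours of (y, x) are darker than the centre."""
    r = win // 2
    centre = img[y][x]
    bits = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            bits = (bits << 1) | (1 if img[y + dy][x + dx] < centre else 0)
    return bits

def hamming(a, b):
    return bin(a ^ b).count("1")

# Cost of matching the centre pixel of `left` at disparities 0 and 1:
costs = {d: hamming(census(left, 2, 2), census(right, 2, 2 - d)) for d in (0, 1)}
print(costs)    # the simulated shift (d = 1) yields the lower cost
```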