
    High-Level Synthesis Software Tools for Multi-Chip Reconfigurable Computing Systems

    The article describes an original high-level synthesis toolchain that converts sequential programs into circuit configurations of specialized hardware for reconfigurable computing systems. From the original sequential program, a fully parallel form, the information graph, is constructed. The graph is then transformed into a resource-independent parallel-pipeline form, a frame structure, which can be adapted to various hardware resources. The frame structure is transformed into an information-equivalent structure that occupies a smaller hardware resource using formalized performance-reduction methods, which makes it possible to automatically obtain a rational solution for a given multi-chip reconfigurable computing system. Unlike known high-level synthesis tools, the result of the transformation is not an IP core for a computationally intensive fragment, but an automatically synchronized solution of the applied problem spanning all FPGA chips of the reconfigurable computing system. Compared with parallelizing compilers, the number of synthesis variants analyzed to reach a rational solution is significantly smaller, which is a distinctive feature of the described toolchain. The use of the high-level synthesis software is illustrated with the problem of solving a system of linear algebraic equations by the Gauss method, which contains information-interdependent computational fragments with substantially different degrees of parallelism.
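
    The Gauss-method case study mentioned above combines fragments with very different degrees of parallelism. As a point of reference, here is a minimal NumPy sketch of that kernel (plain software, not the toolchain's generated circuitry): the forward-elimination fragment exposes wide row-level parallelism, while back substitution is essentially sequential.

```python
import numpy as np

def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination without pivoting.

    Forward elimination: at each step k, the (n - k - 1) row updates are
    independent of one another, so the fragment is highly parallel.
    Back substitution: each x[i] depends on all x[j] for j > i, so the
    fragment is essentially sequential -- the disparity in parallelism
    the abstract refers to.
    """
    A = A.astype(float)
    b = b.astype(float)
    n = len(b)
    # Forward elimination (highly parallel fragment).
    for k in range(n - 1):
        factors = A[k + 1:, k] / A[k, k]              # one factor per row below k
        A[k + 1:, k:] -= np.outer(factors, A[k, k:])  # independent row updates
        b[k + 1:] -= factors * b[k]
    # Back substitution (sequential fragment).
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[4.0, 1.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])
print(gauss_solve(A, b))  # matches np.linalg.solve(A, b): [0.1 0.6]
```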

    Extending BORPH for shared memory reconfigurable computers

    We extend BORPH, an operating system designed for FPGA-based reconfigurable computers, to shared-memory reconfigurable computers. BORPH introduced the concept of a hardware process in contrast to a software process. With our extension, hardware processes can communicate with other processes through shared memory. In our system, the program of a hardware process is not just a hardware design, but also the software program running on an embedded processor in the FPGA. Our experiments show that the overhead of managing shared memory segments is acceptable, and that with independent virtual memory access, the bandwidth of repeated shared memory accesses is high.
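
    As a rough software-side analogy of the shared memory segments the extension manages (names and sizes here are illustrative, and Python's multiprocessing.shared_memory stands in for BORPH's mechanism, not its actual API), a sketch:

```python
import numpy as np
from multiprocessing import Process, shared_memory

SEG = "borph_demo"  # hypothetical segment name, for illustration only

def consumer():
    # A peer process attaches to the segment by name, much as a BORPH
    # hardware process would map a shared segment into its address space.
    shm = shared_memory.SharedMemory(name=SEG)
    data = np.ndarray((8,), dtype=np.int32, buffer=shm.buf)
    print("consumer sees:", data[:])
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(name=SEG, create=True, size=8 * 4)
    data = np.ndarray((8,), dtype=np.int32, buffer=shm.buf)
    data[:] = np.arange(8)           # producer writes into the segment
    p = Process(target=consumer)
    p.start(); p.join()
    shm.close(); shm.unlink()        # the segment management whose overhead is measured
```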

    Just In Time Assembly (JITA) - A Run Time Interpretation Approach for Achieving Productivity of Creating Custom Accelerators in FPGAs

    The reconfigurable computing community has yet to succeed in letting programmers access FPGAs through traditional software development flows. Barriers that prevent programmers from using FPGAs include: 1) the need to know hardware programming models, and 2) the need to work within vendor-specific CAD tools and hardware synthesis. This thesis presents a series of published papers that explore different aspects of a new approach developed to remove these barriers and enable programmers to compile accelerators on next-generation reconfigurable manycore architectures. The approach is entitled Just In Time Assembly (JITA) of hardware accelerators. It allows hardware accelerators to be built and run through software compilation and run-time interpretation, outside of CAD tools and without requiring each new accelerator to be synthesized. The approach advocates libraries of pre-synthesized components that can be referenced through symbolic links, much like dynamically linked software libraries. Synthesis still must occur, but it is moved out of the application programmer's software flow and into the initial coding of the programming patterns that define a Domain Specific Language (DSL). Programmers see no difference between creating software or hardware functionality when using the DSL. A new run-time interpreter assembles the individual pre-synthesized hardware accelerators that make up the accelerator functionality within a configurable tile array of partially reconfigurable slots at run time. Quantitative results compare the utilization, performance, and productivity of the approach against full-custom accelerators created through traditional CAD flows using hardware programming models and passing through synthesis.
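
    To make the run-time linking idea concrete, here is a toy software model of the assembly step, with all names and paths hypothetical; it only illustrates resolving symbolic accelerator references against pre-synthesized components and binding them to free slots, not the thesis's actual interpreter:

```python
# Toy model of run-time assembly: symbolic accelerator names are resolved
# against a library of pre-synthesized components (stand-ins for partial
# bitstreams) and bound to free tile slots. All names are hypothetical.

PRESYNTH_LIBRARY = {
    # symbol -> placeholder path of a pre-synthesized partial bitstream
    "vec_add": "lib/vec_add.partial.bit",
    "fir16":   "lib/fir16.partial.bit",
}

class TileArray:
    def __init__(self, n_slots):
        self.slots = [None] * n_slots        # partially reconfigurable slots

    def assemble(self, symbols):
        """Resolve each symbol like a dynamic linker and 'load' it into
        the next free slot -- no per-accelerator synthesis involved."""
        placement = {}
        for sym in symbols:
            bitstream = PRESYNTH_LIBRARY[sym]  # symbolic-link resolution
            slot = self.slots.index(None)      # first free slot (raises if full)
            self.slots[slot] = sym
            placement[sym] = (slot, bitstream)
        return placement

print(TileArray(4).assemble(["fir16", "vec_add"]))
# {'fir16': (0, 'lib/fir16.partial.bit'), 'vec_add': (1, 'lib/vec_add.partial.bit')}
```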

    Tradeoff Attacks on Symmetric Ciphers

    Tradeoff attacks on symmetric ciphers can be considered a generalization of exhaustive search. Their main objective is to reduce time complexity by exploiting memory: very large tables are prepared, at the cost of exhaustively searching the whole space, during a precomputation phase. In some cases, such as internal-state recovery attacks on stream ciphers, data (plaintext/ciphertext pairs) can be used to further speed up both the online and offline phases. However, how to take advantage of data in a tradeoff attack against block ciphers in the single-key recovery setting is still unknown. We briefly assess the state of the art of tradeoff attacks on symmetric ciphers, introduce some open problems, and discuss the security criterion on state sizes. We discuss the strict lower bound for the internal state size of keystream generators and propose a more practical and fair bound, along with our reasoning. Adopting our new criterion can break fresh ground in the security analysis of small keystream generators and in the design of ultra-lightweight stream ciphers with short internal states for use in low-resource devices such as IoT devices, wireless sensors, and RFID tags.
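
    For readers new to the area, the precompute-then-lookup structure the abstract builds on can be seen in a toy Hellman-style table over a tiny keyspace (the classic generic construction, not this paper's contribution; a truncated hash stands in for the cipher):

```python
import hashlib

# Toy Hellman-style time-memory tradeoff over a 16-bit keyspace.
# The offline phase walks m chains of length t and stores only their
# endpoints; the online phase trades those table hits for recomputation.

N_BITS = 16
MASK = (1 << N_BITS) - 1
t, m = 64, 64                              # toy chain length / chain count

def f(x):
    # Truncated hash acting as the one-way step (stand-in for a cipher).
    h = hashlib.sha256(x.to_bytes(4, "big")).digest()
    return int.from_bytes(h[:4], "big") & MASK

# Offline (precomputation) phase: keep (endpoint -> start) only.
table = {}
for start in range(m):
    x = start
    for _ in range(t):
        x = f(x)
    table[x] = start

def lookup(y):
    """Online phase: iterate f from y; on an endpoint hit, replay the
    chain from its start to recover a preimage of y, if covered."""
    z = y
    for i in range(t):
        if z in table:
            x = table[z]
            for _ in range(t - 1 - i):     # walk back up to the candidate
                x = f(x)
            if f(x) == y:                  # guard against false alarms
                return x
        z = f(z)
    return None                            # y not covered by this table
```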

    FPGA accelerator for gradient boosting decision trees

    A decision tree is a well-known machine learning technique. Its popularity has recently increased due to the powerful Gradient Boosting ensemble method, which gradually increases accuracy at the cost of executing a large number of decision trees. In this paper we present an accelerator designed to optimize the execution of these trees while reducing energy consumption. We have implemented it in an FPGA for embedded systems and tested it with a relevant case study: pixel classification of hyperspectral images. In our experiments with different images, the accelerator processes hyperspectral images at the same speed at which they are generated by the hyperspectral sensors. Compared to a high-performance processor running optimized software, on average our design is twice as fast and consumes 72 times less energy. Compared to an embedded processor, it is 30 times faster and consumes 23 times less energy.
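
    Tree accelerators of this kind typically store each tree as flat node arrays so that traversal is a fixed loop of compare-and-hop steps; the following sketch shows that layout and the boosted sum (an illustrative model, not the paper's exact design):

```python
import numpy as np

# Each tree is flattened into parallel node arrays, so one traversal step
# is just "compare one feature against a threshold, hop left or right".
# Leaves are marked with feature == -1 and carry their score in 'thresh'.
# Illustrative of the memory layout hardware favors, not the paper's design.

class FlatTree:
    def __init__(self, feature, thresh, left, right):
        self.feature = np.asarray(feature)               # -1 marks a leaf
        self.thresh  = np.asarray(thresh, dtype=float)   # at a leaf: the score
        self.left    = np.asarray(left)
        self.right   = np.asarray(right)

    def score(self, x):
        i = 0
        while self.feature[i] >= 0:                      # until a leaf is reached
            i = self.left[i] if x[self.feature[i]] < self.thresh[i] else self.right[i]
        return self.thresh[i]

def gbdt_predict(trees, x, lr=0.1):
    # Gradient boosting sums the (shrunken) scores of many small trees.
    return sum(lr * t.score(x) for t in trees)

# A depth-1 toy tree: node 0 splits on feature 2; nodes 1 and 2 are leaves.
t0 = FlatTree(feature=[2, -1, -1], thresh=[0.5, -1.0, 1.0],
              left=[1, 0, 0], right=[2, 0, 0])
print(gbdt_predict([t0], x=np.array([0.0, 0.0, 0.8])))   # 0.1 * 1.0
```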

    Pipelined Asynchronous High Level Synthesis for General Programs

    High-level synthesis (HLS) translates algorithms from a software programming language into hardware. We use the dataflow HLS methodology to translate programs into asynchronous circuits, implementing programs with asynchronous dataflow elements as hardware building blocks. We extend prior work on dataflow synthesis in the following respects: i) we propose Fluid to synthesize pipelined dataflow circuits for real-world programs with complex control flows, which were not supported in previous work; ii) we propose PipeLink to permit pipelined access to shared resources in the dataflow circuit. A dataflow circuit yields distributed control and an implicitly pipelined implementation, but resource sharing in the presence of pipelining is challenging in this context because there is no global scheduler. Traditional solutions impose restrictions on pipelining to guarantee mutually exclusive access to the shared resource; PipeLink removes such restrictions and can generate pipelined asynchronous dataflow circuits for shared function calls, pipelined memory accesses, and function pointers; iii) we apply several dataflow optimizations to improve the quality of the synthesized dataflow circuits; iv) we implement our system (Fluid + PipeLink) on the LLVM compiler framework, which allows us to take advantage of the optimization efforts of the compiler community; v) we compare our system with a widely used academic HLS tool and two commercial HLS tools. Compared to the commercial (academic) HLS tools, our system achieves a 12X (20X) reduction in energy, a 1.29X (1.64X) improvement in throughput, and a 1.27X (1.61X) improvement in latency, at the cost of a 2.4X (1.61X) increase in area.
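
    The dataflow style can be previewed with a small behavioral model: elements fire whenever tokens are available on their inputs, so control is distributed and several tokens can be in flight per channel. This queue-based sketch is only a software illustration, not Fluid/PipeLink output:

```python
from collections import deque

# Behavioral model of asynchronous dataflow: each element fires whenever
# tokens are present on all of its inputs, with no global scheduler.
# Pipelining is implicit -- multiple tokens can be in flight per channel.

class Channel:
    def __init__(self): self.q = deque()
    def send(self, v):  self.q.append(v)
    def ready(self):    return bool(self.q)
    def recv(self):     return self.q.popleft()

def adder(a, b, out):
    """Fire rule: consume one token from each input, emit their sum."""
    while a.ready() and b.ready():
        out.send(a.recv() + b.recv())

a, b, out = Channel(), Channel(), Channel()
for v in (1, 2, 3):    a.send(v)        # three tokens in flight on 'a'
for v in (10, 20, 30): b.send(v)
adder(a, b, out)
print(list(out.q))                      # [11, 22, 33]
```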

    Degradation in FPGAs: Monitoring, Modeling and Mitigation

    This dissertation targets transistor aging degradation and the associated thermal challenges in FPGAs, since there is an exponential relation between aging and chip temperature. The main objectives are to perform experimentation, analysis, and device-level model abstraction for modeling degradation in FPGAs; to monitor the FPGA to keep track of aging rates; and ultimately to propose an aging-aware FPGA design flow that mitigates the aging.
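
    The exponential aging-temperature relation is commonly captured with an Arrhenius acceleration factor; the snippet below evaluates it for an assumed activation energy (a generic reliability model, not a parameter set from the dissertation):

```python
import math

K_B = 8.617e-5    # Boltzmann constant, eV/K
E_A = 0.7         # assumed activation energy in eV (device-dependent)

def arrhenius_af(t_use_c, t_stress_c, ea=E_A):
    """Acceleration factor AF = exp((Ea/k) * (1/T_use - 1/T_stress)),
    with temperatures given in deg C and converted to Kelvin.
    Generic reliability model, not the dissertation's parameters."""
    t1 = t_use_c + 273.15
    t2 = t_stress_c + 273.15
    return math.exp((ea / K_B) * (1.0 / t1 - 1.0 / t2))

# Under these assumptions, a 25 C hotspot rise (60 C -> 85 C)
# accelerates aging roughly 5.5x -- hence the thermal focus.
print(round(arrhenius_af(60.0, 85.0), 2))
```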

    Recent Advances in Embedded Computing, Intelligence and Applications

    The latest proliferation of Internet of Things deployments and edge computing, combined with artificial intelligence, has led to exciting new application scenarios in which embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software, and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately foster the offloading of processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence, including hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems.

    Design of Intellectual Property-Based Hardware Blocks Integrable with Embedded RISC Processors

    The main focus of this thesis is to research methods, architectures, and implementations of hardware acceleration for a Reduced Instruction Set Computer (RISC) platform. The target platform is a single-core general-purpose embedded processor (the COFFEE core) developed by our group at Tampere University of Technology. The COFFEE core alone cannot meet the requirements of modern applications because it lacks several components, of which the Memory Management Unit (MMU) is one of the most prominent; without an MMU, COFFEE is unable to run an operating system. In the design of the MMU, we employed two additional micro Translation Lookaside Buffers (TLBs) to speed up the translation process and to minimize contention between data and instruction address translations at a unified TLB. The MMU is tightly coupled with the COFFEE RISC core through the core's Peripheral Control Block (PCB) interface. The hardware implementation, some optimization techniques, and post-synthesis results are presented as well. Another aim of this work is to prepare a reconfigurable platform for sending and receiving data packets in next-generation wireless communications. We therefore discuss a recently emerged wireless modulation technique known as Non-Contiguous Orthogonal Frequency Division Multiplexing (NC-OFDM), a promising approach to alleviating the spectrum scarcity problem. One of the primary concerns in such systems is synchronization, so we developed a reconfigurable hardware component that performs as a synchronizer. The module exploits the Partial Reconfiguration (PR) feature to reconfigure itself, and we present several architectural choices for systems with different limiting factors such as power consumption, operating frequency, and silicon area. The synchronizer can be loosely coupled via one of the available co-processor slots of the target processor, the COFFEE RISC core. In addition, to improve the versatility of the COFFEE core in industrial use cases, we developed a reconfigurable hardware component that operates on the Controller Area Network (CAN) protocol. In this first implementation we concentrate on receiving, decoding, and extracting the data segment of a CAN packet; the hardware block can also reconfigure itself on the fly to operate on different data frames. Hardware implementation issues and post-synthesis results are presented, and the CAN module is loosely coupled with the COFFEE RISC processor through one of the available co-processor blocks.
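
    The micro-TLB arrangement can be pictured with a small behavioral model: a tiny per-stream micro-TLB is probed first, and on a miss the unified TLB serves the translation and refills it. The sizes and FIFO policy below are illustrative assumptions, not the COFFEE core's RTL:

```python
from collections import OrderedDict

# Behavioral model of the two-level TLB organization described above.
# Entry counts and the FIFO eviction policy are illustrative assumptions.

class Tlb:
    def __init__(self, entries):
        self.entries = entries
        self.map = OrderedDict()             # vpn -> pfn, FIFO eviction

    def lookup(self, vpn):
        return self.map.get(vpn)

    def refill(self, vpn, pfn):
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)     # evict the oldest entry
        self.map[vpn] = pfn

def translate(vaddr, micro, unified, page_bits=12):
    vpn, offset = vaddr >> page_bits, vaddr & ((1 << page_bits) - 1)
    pfn = micro.lookup(vpn)                  # fast path: micro-TLB hit
    if pfn is None:
        pfn = unified.lookup(vpn)            # slower path: unified TLB
        if pfn is None:
            raise KeyError("TLB miss -> page-table walk / exception")
        micro.refill(vpn, pfn)               # keep the hot translation close
    return (pfn << page_bits) | offset

# One micro-TLB per stream (instruction, data) in front of a unified TLB.
itlb, dtlb, utlb = Tlb(4), Tlb(4), Tlb(32)   # hypothetical sizes
utlb.refill(0x00400, 0x12345)
print(hex(translate(0x00400ABC, dtlb, utlb)))  # 0x12345abc
```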

    Reconfigurable computing for large-scale graph traversal algorithms

    This thesis proposes a reconfigurable computing approach for supporting parallel processing in large-scale graph traversal algorithms. Our approach is based on a reconfigurable hardware architecture which exploits the capabilities of both FPGAs (Field-Programmable Gate Arrays) and a multi-bank parallel memory subsystem. The proposed methodology for accelerating graph traversal algorithms has been applied to three case studies, revealing that application-specific hardware customizations can benefit performance. Our four contributions are as follows. First, a reconfigurable computing approach to accelerate large-scale graph traversal algorithms: we propose a reconfigurable hardware architecture which decouples computation and communication while keeping multiple memory requests in flight at any given time, taking advantage of the high bandwidth of multi-bank memory subsystems. Second, a demonstration of the effectiveness of our approach through two case studies: the breadth-first search algorithm and a graphlet counting algorithm from bioinformatics; both involve graph traversal, but each adopts a different graph data representation. Third, a method for using on-chip memory resources in FPGAs to reduce off-chip memory accesses when accelerating graph traversal algorithms, through a case study of the All-Pairs Shortest-Paths algorithm applied to human brain network data. Fourth, an evaluation of an instruction-set-extension approach to FPGA design against many-core GPUs (Graphics Processing Units), based on a set of benchmarks with different memory access characteristics. It is shown that while GPUs excel at streaming applications, the proposed approach can outperform GPUs in applications with poor locality characteristics, such as graph traversal problems.
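
    The first case study's algorithm, breadth-first search, is shown below in compact CSR form (software only); the neighbor gather is the irregular, data-dependent access pattern that the proposed architecture overlaps by keeping many memory requests in flight:

```python
import numpy as np
from collections import deque

def bfs_csr(row_ptr, col_idx, source):
    """Breadth-first search over a graph in CSR (compressed sparse row) form.

    The gather col_idx[row_ptr[v]:row_ptr[v+1]] is exactly the irregular,
    data-dependent memory access the hardware overlaps by keeping many
    requests in flight; this software form issues them one at a time.
    """
    n = len(row_ptr) - 1
    dist = np.full(n, -1, dtype=np.int64)    # -1 marks an unvisited vertex
    dist[source] = 0
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for u in col_idx[row_ptr[v]:row_ptr[v + 1]]:   # gather neighbors
            if dist[u] < 0:
                dist[u] = dist[v] + 1
                frontier.append(u)
    return dist

# 4-node path graph 0-1-2-3 in CSR form.
row_ptr = np.array([0, 1, 3, 5, 6])
col_idx = np.array([1, 0, 2, 1, 3, 2])
print(bfs_csr(row_ptr, col_idx, 0))   # [0 1 2 3]
```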