86 research outputs found

    Mapping and FPGA global routing using Mean Field Annealing

    Get PDF
    Ankara : Department of Computer Engineering and Information Science and Institute of Engineering and Science, Bilkent University, 1994.Thesis (Master's) -- -Bilkent University, 1994.Includes bibliographical references leaves 71-73Haritaoğlu, İsmailM.S

    High performance communication on reconfigurable clusters

    Get PDF
    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm. This shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that the performance of our design is faster than that on CPUs and GPUs by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC-cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the us/day range with a commodity cloud

    Data routing in multicore processors using dimension increment method

    Full text link
    A Deadlock-free routing algorithm can be generated for arbitrary interconnection network using the concept of virtual channels but the virtual channels will lead to more complex algorithms and more demands of NOC resource. In this thesis, we study a Torus topology for NOC application, design its structure and propose a routing algorithm exploiting the characteristics of NOC. We have chosen a typical 16 (4 by 4) routers Torus and propose the corresponding route algorithm. In our algorithm, all the channels are assigned 4 different dimensions (n0,n1,n2 & n3). By following the dimension increment method, we break the dependent route circles, and avoid dead lock and live-lock and avoid the overhead of virtual channels. Xilinx offers two soft core processors, namely Picoblaze and Microblaze. The Picoblaze processor is 8-bit configurable processor core. These soft processor cores offer designers tremendous flexibility during the design process, allowing the designers to configure the processor to meet the needs of their systems (e.g., adding custom instructions or including/excluding particular data path coprocessors) and to quickly integrate the processor within any FPGA. Unlike single chip Microprocessor/FPGA systems using hard-core processors, soft processor cores allow designers to incorporate varying numbers of processors within a single FPGA design depending on an application’s needs.Soft processor cores implemented using FPGAs typically have higher power consumption and decreased performance compared with hard-core processors. Key features of the Picoblaze processor, as well as other soft processor cores, include the user configurable options that allow a designer to tailor the processor’s functionality to their specific design. The proposed design implements sixteen instances of a soft processor, Picoblaze, connected in a torus topology. Data is passed from one processor to another employing a routing algorithm which is based on dimension increment method. Thus we design an NOC with multiple microcontrollers and related logic, synthesize the process and test its performance in a simulation environment

    Повнофункціональна побітова потокова арифметика зі зменшеними витратами обладнання

    Get PDF
    Побітова потокова обробка у двійковій системі числення є одним з напрямів вирішення проблеми інформаційних зв’язків у цифровій техніці. Однак вона має обмежену функціональність через неможливість виконання побітового ділення в одному потоці з іншими арифметичними операціями. У монографії пропонується інформаційно-структурний підхід до повнофункціональної організації такої обробки на основі визначення оптимальної за витратами обладнання надлишкової системи числення та розробки у ній потокових методів і пристроїв зі зменшеними витратами обладнання для побітового виконання всіх арифметичних операцій

    A Reconfigurable Distributed Computing Fabric Exploiting Multilevel Parallelism

    Get PDF
    This paper presents a novel reconfigurable data flow processing architecture that promises high performance by explicitly targeting both fine- and course-grained parallelism. This architecture is based on multiple FPGAs organized in a scalable direct network that is substantially more interconnect-efficient than currently used crossbar technology. In addition, we discuss several ancillary issues and propose solutions required to support this architecture and achieve maximal performance for general-purpose applications; these include supporting IP, mapping techniques, and routing policies that enable greater flexibility for architectural evolution and code portability

    A Reconfigurable Distributed Computing Fabric Exploiting Multilevel Parallelism

    Get PDF
    This paper presents a novel reconfigurable data flow processing architecture that promises high performance by explicitly targeting both fine- and course-grained parallelism. This architecture is based on multiple FPGAs organized in a scalable direct network that is substantially more interconnectefficient than currently used crossbar technology. In addition, we discuss several ancillary issues and propose solutions required to support this architecture and achieve maximal performance for general-purpose applications; these include supporting IP, mapping techniques, and routing policies that enable greater flexibility for architectural evolution and code portability. 1

    Performance evaluation of network-on-chip interconnect architectures

    Full text link
    With a communication design style, Network-on-Chips (NoCs) have been proposed as a new Multi-Processor System-on-Chip paradigm. Simulation and functional validation are essential to assess the correctness and performance of the NoC design. In this thesis, a cycle-accurate NoC simulation system in Verilog HDL is developed to evaluate the performance of various NoC architectures. First, a library of NoC components is developed based on an existing design. Each NoC architecture to be evaluated is constructed from the library according to the topology description which specifies the network topology, network size, and routing algorithm. The network performance of four NoC architectures under uniform and three non-uniform traffic patterns is tested on ModelSim 6.4. The developed NoC simulation system provides useful resources for the future development of the FPGA-based NoC emulation system
    corecore