
    FASTCUDA: Open Source FPGA Accelerator & Hardware-Software Codesign Toolset for CUDA Kernels

    Using FPGAs as hardware accelerators that communicate with a central CPU is becoming common practice in the embedded design world, but there is no standard methodology and toolset to facilitate this path yet. On the other hand, languages such as CUDA and OpenCL provide standard development environments for Graphics Processing Unit (GPU) programming. FASTCUDA is a platform that provides the necessary software toolset, hardware architecture, and design methodology to efficiently adapt the CUDA approach into a new FPGA design flow. With FASTCUDA, the CUDA kernels of a CUDA-based application are partitioned into two groups with minimal user intervention: those that are compiled and executed in parallel software, and those that are synthesized and implemented in hardware. A modern low-power FPGA can provide the processing power (via numerous embedded micro-CPUs) and the logic capacity for both the software and hardware implementations of the CUDA kernels. This paper describes the system requirements and the architectural decisions behind the FASTCUDA approach.
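    The kind of software/hardware partitioning the abstract describes can be sketched as a toy cost model: each kernel is placed either on the embedded micro-CPUs or in synthesized logic so as to respect an FPGA area budget. The kernel names, costs, and greedy policy below are illustrative assumptions, not FASTCUDA's actual algorithm.

    ```python
    def partition_kernels(kernels, area_budget):
        """kernels: list of (name, sw_cycles, hw_area) tuples.

        Greedily moves the kernels with the largest software cost per
        unit of FPGA area into hardware until the area budget is spent;
        the rest run as parallel software on the embedded micro-CPUs.
        """
        hw, sw, used = [], [], 0
        for name, sw_cycles, hw_area in sorted(
                kernels, key=lambda k: k[1] / k[2], reverse=True):
            if used + hw_area <= area_budget:
                hw.append(name)
                used += hw_area
            else:
                sw.append(name)
        return hw, sw
    ```

    With a budget of 60 area units, a compute-heavy kernel such as a hypothetical "matmul" would be synthesized while cheaper kernels stay in software.
    
    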

    Heap Management in Hardware

    Time-critical priority queues require hardware rather than software management. The heap data structure is an efficient way to organize a priority queue. In this report we present the various design issues and evaluate the hardware complexity and timing requirements of different possible hardware organizations of a heap using one FPGA and external memory. We also describe the organization and the datapaths we simulated in Verilog. Finally, we present some more efficient techniques applicable only to ASIC implementations. (Ioannis Mavroidis, Computer Architecture and VLSI Systems Division (CARV), Institute of Computer Science (ICS), Foundation for Research & Technology - Hellas (FORTH), Heraklion, Crete, Greece. Technical Report FORTH-ICS/TR-222, July 1998. Work performed under the "ASICCOM" project and as a B.Sc. Thesis at the Univ. of Crete.)
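    For reference, the priority-queue operations whose hardware organizations the report evaluates are those of a standard array-based binary min-heap. The sketch below is a plain software version of those operations (insert and delete-min with sift-up/sift-down), not the report's hardware datapath.

    ```python
    class BinaryHeap:
        """Array-based binary min-heap: a[0] is the highest-priority key."""

        def __init__(self):
            self.a = []

        def insert(self, key):
            # Append at the next free leaf, then sift up toward the root.
            self.a.append(key)
            i = len(self.a) - 1
            while i > 0 and self.a[(i - 1) // 2] > self.a[i]:
                parent = (i - 1) // 2
                self.a[i], self.a[parent] = self.a[parent], self.a[i]
                i = parent

        def delete_min(self):
            # Move the last leaf to the root, then sift down.
            top = self.a[0]
            last = self.a.pop()
            if self.a:
                self.a[0] = last
                i, n = 0, len(self.a)
                while True:
                    child = 2 * i + 1
                    if child >= n:
                        break
                    if child + 1 < n and self.a[child + 1] < self.a[child]:
                        child += 1  # pick the smaller child
                    if self.a[i] <= self.a[child]:
                        break
                    self.a[i], self.a[child] = self.a[child], self.a[i]
                    i = child
            return top
    ```

    Both operations touch only one root-to-leaf path (O(log n) memory accesses), which is what makes pipelined hardware organizations of the heap attractive.
    
    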

    Accelerating Emulation and Providing Full Chip Observability and Controllability


    Implementation of Algorithms on Reconfigurable Logic Systems and on Systems with Multiple Embedded Processors

    The Traveling Salesman Problem (TSP) is probably the most-studied combinatorial optimization problem of all time. TSP applications range from logistics and job scheduling to computing DNA sequences, designing and testing VLSI circuits, X-ray crystallography, and many others. Many researchers, both mathematicians and computer scientists, have attacked the TSP for decades, resulting in a plethora of heuristics that offer a broad range of tradeoffs between running time and quality of solution. These heuristics are typically classified as either tour construction procedures, which gradually build a feasible tour, or tour improvement procedures, which try to optimize an existing tour by performing various tour modifications. Probably the best-known such tour modification is the 2-Opt. In this thesis we attack the 2-Opt algorithm from a novel perspective and manage to uncover previously unknown fine-grain parallelism. We propose a baseline architecture that exploits this type of parallelism and demonstrate the implementation of various versions of this architecture on an FPGA as well as on a multi-threaded GPU. Our algorithm guarantees the 2-Optimality of the final resulting tour. We evaluate our implementations and find that the FPGA implementation manages to outperform Concorde, the current state-of-the-art software implementation, for small-scale TSP problems, in both speed and quality of final results. The GPU implementation is able to handle bigger-scale TSP problems and achieve similar quality of final results, but lags behind Concorde in speed.
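    For context, the 2-Opt move the abstract refers to replaces two edges of a tour with two shorter ones by reversing the segment between them; a tour is 2-Optimal when no such move helps. The sequential sketch below shows the move itself (the thesis's contribution is a parallel architecture for it, which this does not attempt to reproduce).

    ```python
    import math

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def tour_length(pts, tour):
        return sum(dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
                   for i in range(len(tour)))

    def two_opt(pts, tour):
        # Repeatedly apply improving 2-Opt moves until none remains,
        # i.e. until the tour is 2-Optimal.
        n = len(tour)
        improved = True
        while improved:
            improved = False
            for i in range(n - 1):
                for j in range(i + 2, n):
                    if i == 0 and j == n - 1:
                        continue  # these two edges share a city
                    a, b = tour[i], tour[i + 1]
                    c, d = tour[j], tour[(j + 1) % n]
                    # Replace edges (a,b),(c,d) with (a,c),(b,d) if shorter.
                    if (dist(pts[a], pts[c]) + dist(pts[b], pts[d])
                            < dist(pts[a], pts[b]) + dist(pts[c], pts[d])):
                        tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                        improved = True
        return tour
    ```

    On a unit square with a crossed starting tour, one 2-Opt move uncrosses the two diagonal edges and yields the optimal perimeter tour of length 4.
    
    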

    Wormhole IP over (Connectionless) ATM

    ABSTRACT: In the eighties, high throughput and low latency requirements in multiprocessor interconnection networks led to wormhole routing. Today, the same techniques are applicable to routing internet packets over ATM hardware at high speed. Just like virtual channels in wormhole routing carry packets segmented into flits, a number of hardware-managed VCs in ATM can carry IP packets segmented into cells according to AAL-5; each VC is dedicated to one packet for the duration of that packet, and is afterwards reassigned to another packet, in hardware. This idea was introduced by Barnett [Barn97] and was named connectionless ATM. We modify the Barnett proposal to make it applicable to existing ATM equipment: we propose a single-input, single-output Wormhole IP Router that functions as a VP/VC translation filter between ATM subnetworks; fast IP routing lookups can be as in [GuLK98]. Based on actual internet traces, we show by simulation that a few tens of hardware-managed VCs per outgoing VP suffice for all but 10⁻⁴ or less of the packets. We analyze the hardware cost of a wormhole IP routing filter, and show that it can be built at low cost: 10 off-the-shelf chips will do for 622 Mb/s operation; using pipelining, operation is feasible even at 10 Gb/s, today.
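    The per-packet VC management the abstract describes can be modeled as a small pool: each arriving IP packet grabs a free VC on the outgoing VP for the duration of its cells and releases it with the last (AAL5 end-of-packet) cell. The event trace and pool size below are illustrative; the paper's simulations use actual internet traces.

    ```python
    def simulate_vc_pool(events, num_vcs):
        """events: sequence of ("start", pkt) or ("end", pkt) in cell order.

        Returns how many packets arrived when no hardware-managed VC
        was free on the outgoing VP.
        """
        free = list(range(num_vcs))
        vc_of = {}       # packet currently holding each assigned VC
        blocked = 0      # packets that found no free VC
        for kind, pkt in events:
            if kind == "start":
                if free:
                    vc_of[pkt] = free.pop()
                else:
                    blocked += 1
            elif kind == "end" and pkt in vc_of:
                free.append(vc_of.pop(pkt))  # VC reassignable in hardware
        return blocked
    ```

    The paper's result is that a few tens of VCs per outgoing VP keep the blocked fraction below roughly 10⁻⁴ on real traces; the toy trace below just shows the mechanism.
    
    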

    Wormhole IP over (Connectionless) ATM

    Abstract: High speed switches and routers internally operate using fixed-size cells or segments; variable-size packets are segmented and later reassembled. Connectionless ATM was proposed to quickly carry IP packets segmented into cells (AAL5) using a number of hardware-managed ATM VCs. We show that this is analogous to wormhole routing. We modify this architecture to make it applicable to existing ATM equipment: we propose a low-cost, single-input, single-output Wormhole IP Router that functions as a VP/VC translation filter between ATM subnetworks. When compared to IP routers, the proposed architecture features simpler hardware and lower latency. When compared to software-based IP-over-ATM techniques, the new architecture avoids the overheads of a large number of labels, or the delays of establishing new flows in software after the first few packets have suffered considerable latencies. We simulated a wormhole IP routing filter, showing that a few tens of hardware-managed VCs per outgoing VP usually suffice. We built and successfully tested a prototype, operating at 2 × 155 Mbps, using one FPGA and DRAM. Simple analysis shows that operation at 10 Gbps and beyond is feasible today. Index Terms: IP over ATM, connectionless ATM, wormhole routing, gigabit router, wormhole IP, routing filter.
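    The "VP/VC translation filter" role can be pictured as a per-cell header rewrite: each incoming cell's (VPI, VCI) pair is looked up in a table that the wormhole logic maintains per packet, and rewritten for the outgoing subnetwork. The cell representation and table contents below are illustrative assumptions, not the prototype's datapath.

    ```python
    def translate_cell(cell, table):
        """cell: dict with 'vpi' and 'vci' header fields (payload untouched).

        table maps incoming (vpi, vci) to the outgoing (vpi, vci) chosen
        when the packet's first cell claimed a hardware-managed VC.
        """
        new_vpi, new_vci = table[(cell["vpi"], cell["vci"])]
        return {**cell, "vpi": new_vpi, "vci": new_vci}
    ```

    Because only the cell header is rewritten and the table is small, the filter needs far less hardware than a full IP router on the cell path.
    
    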

    ECOSCALE: Reconfigurable Computing and Runtime System for Future Exascale Systems

    In order to reach exascale performance, current HPC systems need to be improved. Simple hardware scaling is not a feasible solution due to the increasing utility costs and power consumption limitations. Apart from improvements in implementation technology, what is needed is to refine the HPC application development flow as well as the system architecture of future HPC systems. ECOSCALE tackles these challenges by proposing a scalable programming environment and architecture, aiming to substantially reduce energy consumption as well as data traffic and latency. ECOSCALE introduces a novel heterogeneous energy-efficient hierarchical architecture, as well as a hybrid many-core+OpenCL programming environment and runtime system. The ECOSCALE approach is hierarchical and is expected to scale well by partitioning the physical system into multiple independent Workers (i.e. compute nodes). Workers are interconnected in a tree-like fashion and define a contiguous global address space that can be viewed either as a set of partitions in a Partitioned Global Address Space (PGAS), or as a set of nodes hierarchically interconnected via an MPI protocol. To further increase energy efficiency, as well as to provide resilience, the Workers employ reconfigurable accelerators mapped into the virtual address space utilizing a dual-stage System Memory Management Unit with coherent memory access. The architecture supports shared partitioned reconfigurable resources accessed by any Worker in a PGAS partition, as well as automated hardware synthesis of these resources from an OpenCL-based programming model.
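    The PGAS view of the contiguous global address space can be sketched as a simple address split: with equal-sized per-Worker partitions, a global address resolves to a (worker, offset) pair. The partition size and worker count below are illustrative assumptions, not ECOSCALE parameters.

    ```python
    def resolve(global_addr, num_workers, partition_size):
        """Map a global PGAS address to (owning worker, local offset),
        assuming each Worker owns one equal-sized, contiguous partition."""
        worker = global_addr // partition_size
        if worker >= num_workers:
            raise ValueError("address outside the global address space")
        return worker, global_addr % partition_size
    ```

    A load or store to a remote partition would then be forwarded up and down the tree-like Worker interconnect to the owning node.
    
    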