223 research outputs found

    APEnet+: a 3D toroidal network enabling Petaflops scale Lattice QCD simulations on commodity clusters

    Full text link
    Many scientific computations need multi-node parallelism for matching up both space (memory) and time (speed) ever-increasing requirements. The use of GPUs as accelerators introduces yet another level of complexity for the programmer and may potentially result in large overheads due to the complex memory hierarchy. Additionally, top-notch problems may easily employ more than a Petaflops of sustained computing power, requiring thousands of GPUs orchestrated with some parallel programming model. Here we describe APEnet+, the new generation of our interconnect, which scales up to tens of thousands of nodes with linear cost, thus improving the price/performance ratio on large clusters. The project target is the development of the Apelink+ host adapter featuring a low latency, high bandwidth direct network, state-of-the-art wire speeds on the links and a PCIe X8 gen2 host interface. It features hardware support for the RDMA programming model and experimental acceleration of GPU networking. A Linux kernel driver, a set of low-level RDMA APIs and an OpenMPI library driver are available, allowing for painless porting of standard applications. Finally, we give an insight of future work and intended developments

    APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters

    Full text link
    We describe herein the APElink+ board, a PCIe interconnect adapter featuring the latest advances in wire speed and interface technology plus hardware support for a RDMA programming model and experimental acceleration of GPU networking; this design allows us to build a low latency, high bandwidth PC cluster, the APEnet+ network, the new generation of our cost-effective, tens-of-thousands-scalable cluster network architecture. Some test results and characterization of data transmission of a complete testbench, based on a commercial development card mounting an Altera FPGA, are provided.Comment: 6 pages, 7 figures, proceeding of CHEP 2010, Taiwan, October 18-2

    Using an FPGA for Fast Bit Accurate SoC Simulation

    Get PDF
    In this paper we describe a sequential simulation method to simulate large parallel homo- and heterogeneous systems on a single FPGA. The method is applicable for parallel systems were lengthy cycle and bit accurate simulations are required. It is particularly designed for systems that do not fit completely on the simulation platform (i.e. FPGA). As a case study, we use a Network-on-Chip (NoC) that is simulated in SystemC and on the described FPGA simulator. This enables us to observe the NoC behavior under a large variety of traffic patterns. Compared with the SystemC simulation we achieved a factor 80-300 of speed improvement, without compromising the cycle and bit level accuracy

    Fast, Accurate and Detailed NoC Simulations

    Get PDF
    Network-on-Chip (NoC) architectures have a wide variety of parameters that can be adapted to the designer's requirements. Fast exploration of this parameter space is only possible at a high-level and several methods have been proposed. Cycle and bit accurate simulation is necessary when the actual router's RTL description needs to be evaluated and verified. However, extensive simulation of the NoC architecture with cycle and bit accuracy is prohibitively time consuming. In this paper we describe a simulation method to simulate large parallel homogeneous and heterogeneous network-on-chips on a single FPGA. The method is especially suitable for parallel systems where lengthy cycle and bit accurate simulations are required. As a case study, we use a NoC that was modelled and simulated in SystemC. We simulate the same NoC on the described FPGA simulator. This enables us to observe the NoC behavior under a large variety of traffic patterns. Compared with the SystemC simulation we achieved a speed-up of 80-300, without compromising the cycle and bit level accuracy

    High performance IPC hardware accelerator and communication network for MPSoCs

    Get PDF
    In this paper, we explain a configurable IPC module for multimedia MPSoCs, which was implemented in a MPW chip that include three ARM7 CPU cores. According to the test results for an M-JPEG and a H.264 decoder, its IPC synchronization overheads are not more than 1% when the synchronization period is about 5000 cycles.This work was supported by the IC Design Education Center (IDEC) in KAIST, and the Seoul R&BD Program

    A Two Channel Analog Front end Design AFE Design with Continuous Time Ī£-Ī” Modulator for ECG Signal

    Get PDF
    In this context, the AFE with 2-channels is described, which has high impedance for low power application of bio-medical electrical activity. The challenge in obtaining accurate recordings of biomedical signals such as EEG/ECG to study the human body in research work. This paper is to propose Multi-Vt in AFE circuit design cascaded with CT modulator. The new architecture is anticipated with two dissimilar input signals filtered from 2-channel to one modulator. In this methodology, the amplifier is low powered multi-VT Analog Front-End which consumes less power by applying dual threshold voltage. Type -I category 2 channel signals of the first mode: 50 and 150 Hz amplified from AFE are given to 2nd CT sigma-delta ADC. Depict the SNR and SNDR as 63dB and 60dB respectively, consuming the power of 11mW. The design was simulated in a 0.18 um standard UMC CMOS process at 1.8V supply. The AFE measured frequency response from 50 Hz to 360 Hz, depict the SNR and SNDR as 63dB and 60dB respectively, consuming the power of 11mW. The design was simulated in 0.18 m standard UMC CMOS process at 1.8V supply. The AFE measured frequency response from 50 Hz to 360 Hz, programmable gains from 52.6 dB to 72 dB, input referred noise of 3.5 Ī¼V in the amplifier bandwidth, NEF of 3
    • ā€¦
    corecore