22 research outputs found

    Heterogeneous Secure Multi-level Remote Acceleration Service for Low-Power Integrated Systems and Devices

    This position paper presents a novel heterogeneous CPU-GPU multi-level cloud acceleration service focused on applications running on embedded systems found in low-power devices. A runtime system performs energy and performance estimations in order to automatically select local CPU-based and GPU-based tasks that should be seamlessly executed on more powerful remote devices or cloud infrastructures. Moreover, the paper proposes, for the first time, a unified model in which almost any device or infrastructure can securely operate as an accelerated entity and/or as an accelerator serving other, less powerful devices.
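    To make the runtime decision concrete, the following is a minimal Python sketch of the kind of energy/latency comparison such a runtime could perform before offloading a task. The Estimate fields, link parameters, and the should_offload helper are illustrative assumptions, not the paper's actual interfaces.

    # Hypothetical sketch of an offload decision: the runtime estimates energy and
    # latency for running a task locally (CPU/GPU) versus remotely, and offloads
    # only when the remote option, including transfer cost, wins on both counts.
    # All names and numbers are illustrative, not the paper's actual runtime API.
    from dataclasses import dataclass

    @dataclass
    class Estimate:
        energy_mj: float   # estimated energy cost on this device, in millijoules
        latency_ms: float  # estimated completion time, in milliseconds

    def should_offload(local: Estimate, remote: Estimate,
                       transfer_bytes: int, link_mbps: float,
                       link_mw: float) -> bool:
        """Return True if remote execution (including data transfer) is cheaper."""
        transfer_ms = transfer_bytes * 8 / (link_mbps * 1000)   # time to ship the data
        transfer_mj = link_mw * transfer_ms / 1000              # energy spent on the link
        remote_total = Estimate(remote.energy_mj + transfer_mj,
                                remote.latency_ms + transfer_ms)
        return (remote_total.energy_mj < local.energy_mj and
                remote_total.latency_ms < local.latency_ms)

    # Example: a 2 MB kernel input shipped over a 50 Mb/s link drawing 300 mW.
    local = Estimate(energy_mj=120.0, latency_ms=80.0)
    remote = Estimate(energy_mj=10.0, latency_ms=15.0)   # as seen by the client
    print(should_offload(local, remote, transfer_bytes=2_000_000,
                         link_mbps=50.0, link_mw=300.0))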

    for the Vector-IRAM chip

    Vector IRAM integrates vector processing with embedded DRAM on a single chip to provide high multimedia performance at low energy cost. This report presents the design and implementation of the VIRAM Vector Register File. Our design addresses several challenges, including the need for high speed, low power consumption, a compact layout, and multiported access. Using a 0.18 μm technology and a 1.3 V supply voltage, it operates at 200 MHz, consumes an average power of 330 mW, occupies 8 mm² of area, and provides eight read and three write ports. A number of CAD tools were used, including layout tools from Cadence, extraction tools from Avant!, HSPICE, TimeMill, and PowerMill. The report emphasizes implementation issues and evaluates the performance and power consumption of our design.
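    As a rough illustration of the port constraints described above (eight element reads and three element writes serviced per cycle), the toy behavioral model below can be used to reason about stalls when requests exceed the available ports. The class and method names are invented for illustration and do not correspond to the VIRAM RTL.

    # Toy behavioral model of a multiported vector register file. Purely
    # illustrative; it is not the VIRAM design itself.
    class VectorRegisterFile:
        def __init__(self, num_regs=32, vlen=64, read_ports=8, write_ports=3):
            self.regs = [[0] * vlen for _ in range(num_regs)]
            self.read_ports = read_ports
            self.write_ports = write_ports

        def cycle(self, reads, writes):
            """Service up to read_ports reads and write_ports writes this cycle.

            reads:  list of (reg, elem) pairs; writes: list of (reg, elem, value).
            Returns (read_values, leftover_reads, leftover_writes) so a caller
            can model the stalls caused when requests exceed the ports.
            """
            served_r, spill_r = reads[:self.read_ports], reads[self.read_ports:]
            served_w, spill_w = writes[:self.write_ports], writes[self.write_ports:]
            for reg, elem, value in served_w:
                self.regs[reg][elem] = value
            values = [self.regs[reg][elem] for reg, elem in served_r]
            return values, spill_r, spill_w

    vrf = VectorRegisterFile()
    _, pending_r, _ = vrf.cycle(reads=[(0, i) for i in range(10)], writes=[(1, 0, 42)])
    print(len(pending_r))  # 2 reads spill to the next cycle: 10 requests, 8 read ports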

    Accelerating Emulation and Providing Full Chip Observability and Controllability


    Novel techniques for hardware / software partitioning and emulation

    Over the last several years, uniprocessor systems, in an effort to overcome the limits of deeper pipelining, instruction-level parallelism, and power dissipation, have evolved from one processing core to tens or hundreds of cores. At the same time, multi-chip systems and Systems on Board (SoB) have started giving way to Systems on Chip (SoC) that exploit the latest nanometer technologies. This has also caused a tremendous shift in the system development process towards embedded systems, hardware/software co-design, SoC designs, multi-core designs, and hardware accelerators. Nowadays, one of the key issues for continued performance scaling is the development of advanced CAD tools that can efficiently support the design and verification of these new platforms and the requirements of today's complex applications.

    This thesis focuses on three important aspects of the system development process: hardware/software partitioning, simulation, and verification. Since the time consumed by these tasks is usually a large fraction of the overall development time, speeding them up can significantly reduce the ever-important time to market. Hardware emulation on FPGAs has been widely used as a significantly faster and more accurate approach to the verification of complex designs than software simulation. In this approach, Hardware Simulation Accelerator and Emulator co-processor units are used to offload calculation-intensive tasks from software simulators. One of the biggest problems, however, is that the communication overhead between the software simulator, where the behavioral testbench usually runs, and the hardware emulator, where the Design Under Test (DUT) is emulated, is becoming a new critical bottleneck. Another problem is that in a hardware emulation environment it is impossible to bring a large number of internal signals outside the chip for verification purposes, so on-chip observability has become a significant issue. Finally, one more crucial issue is deciding how to partition the system components into two distinct sets: those that will be implemented in hardware and those that will run in software.

    In this thesis we analyze all of the aforementioned problems and propose novel techniques to attack them. First, we introduce a novel emulation framework that automatically transforms certain HDL parts of the testbench into synthesizable code in order to offload them from the software simulator and, more importantly, minimize the aforementioned communication overhead. In particular, we partition the testbench running on the software simulator into two sections: the testbench HDL code that communicates directly with the DUT, and the remaining C-like testbench code. The former is transformed into synthesizable code while the latter runs on a general-purpose CPU. Next, we extend this architecture by adding multiple fast scan-chain paths to the design in order to provide full circuit observability and controllability on the fly. Finally, we develop a fully automated hardware/software partitioning tool that incorporates a novel flow with new cost metrics and functions to provide fast and efficient solutions. The tool employs two separate partitioning algorithms: Simulated Annealing (SA) and a novel greedy algorithm, Grouping-Mapping Partitioning (GMP). Our experiments demonstrate that our methodologies provide cost-effective solutions for the hardware/software partitioning and emulation of large and complex systems.
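    As a rough illustration of the partitioning step, the sketch below implements a generic simulated-annealing partitioner with assumed cost terms (hardware area, software execution time, and a cut penalty for HW/SW communication). It is a textbook SA formulation under those assumptions, not the thesis's actual tool, cost functions, or the GMP algorithm.

    # Minimal simulated-annealing sketch for hardware/software partitioning with
    # illustrative cost metrics; task names and weights are invented.
    import math, random

    def cost(assign, hw_area, sw_time, edges, comm_penalty=1.0):
        area = sum(hw_area[t] for t, side in assign.items() if side == "HW")
        time = sum(sw_time[t] for t, side in assign.items() if side == "SW")
        cut = sum(1 for a, b in edges if assign[a] != assign[b])   # HW/SW boundary crossings
        return time + 0.1 * area + comm_penalty * cut

    def partition(tasks, hw_area, sw_time, edges, temp=10.0, cooling=0.995, steps=2000):
        assign = {t: random.choice(["HW", "SW"]) for t in tasks}
        cur = cost(assign, hw_area, sw_time, edges)
        best, best_cost = dict(assign), cur
        for _ in range(steps):
            t = random.choice(tasks)
            assign[t] = "SW" if assign[t] == "HW" else "HW"        # propose: flip one task
            new = cost(assign, hw_area, sw_time, edges)
            if new <= cur or random.random() < math.exp((cur - new) / temp):
                cur = new                                          # accept the move
                if cur < best_cost:
                    best, best_cost = dict(assign), cur
            else:
                assign[t] = "SW" if assign[t] == "HW" else "HW"    # reject: undo flip
            temp *= cooling
        return best, best_cost

    tasks = ["fft", "filter", "ctrl", "io"]
    hw_area = {"fft": 5, "filter": 3, "ctrl": 1, "io": 2}
    sw_time = {"fft": 9, "filter": 4, "ctrl": 1, "io": 6}
    edges = [("fft", "filter"), ("filter", "ctrl"), ("ctrl", "io")]
    print(partition(tasks, hw_area, sw_time, edges))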

    Wormhole IP over (Connectionless) ATM

    In the eighties, high throughput and low latency requirements in multiprocessor interconnection networks led to wormhole routing. Today, the same techniques are applicable to routing Internet packets over ATM hardware at high speed. Just as virtual channels in wormhole routing carry packets segmented into flits, a number of hardware-managed VCs in ATM can carry IP packets segmented into cells according to AAL-5; each VC is dedicated to one packet for the duration of that packet, and is afterwards reassigned to another packet, in hardware. This idea was introduced by Barnett [Barn97] and was named connectionless ATM. We modify the Barnett proposal to make it applicable to existing ATM equipment: we propose a single-input, single-output Wormhole IP Router that functions as a VP/VC translation filter between ATM subnetworks; fast IP routing lookups can be performed as in [GuLK98]. Based on actual Internet traces, we show by simulation that a few tens of hardware-managed VCs per outgoing VP suffice for all but 10⁻⁴ or fewer of the packets. We analyze the hardware cost of a wormhole IP routing filter and show that it can be built at low cost: 10 off-the-shelf chips suffice for 622 Mb/s operation; using pipelining, operation is feasible even at 10 Gb/s today.
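    The following Python sketch models the hardware-managed VC pool described above: a VC is claimed by a packet at its first cell and released at its AAL5 end-of-packet cell, then becomes available for reassignment. Class and field names are invented for illustration; the real router performs this management in hardware.

    # Illustrative model of connectionless-ATM style VC management per outgoing VP.
    class OutgoingVP:
        def __init__(self, num_vcs=32):
            self.free_vcs = list(range(num_vcs))  # VC identifiers not carrying a packet
            self.active = {}                      # packet_id -> VC it currently occupies
            self.blocked = 0                      # packets that found no free VC

        def first_cell(self, packet_id):
            """Claim a VC when the first cell of a new packet arrives."""
            if not self.free_vcs:
                self.blocked += 1
                return None
            vc = self.free_vcs.pop()
            self.active[packet_id] = vc
            return vc

        def cell(self, packet_id):
            """Subsequent cells of the packet reuse the same VC."""
            return self.active.get(packet_id)

        def last_cell(self, packet_id):
            """AAL5 end-of-packet: the VC becomes free for another packet."""
            vc = self.active.pop(packet_id)
            self.free_vcs.append(vc)
            return vc

    vp = OutgoingVP(num_vcs=2)
    vp.first_cell("pkt-A"); vp.first_cell("pkt-B")
    print(vp.first_cell("pkt-C"))   # None: both VCs busy, the third packet is blocked
    vp.last_cell("pkt-A")
    print(vp.first_cell("pkt-C"))   # the VC freed by pkt-A is reassigned to pkt-C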

    Multilingual Extensions to DIENST

    Digital libraries enable on-line access to information and provide advanced methods for material search, retrieval, and presentation. In order to support collections of documents written in several languages, and to increase the applicability of digital libraries in non-English-speaking countries, a multilingual digital library design is necessary that supports the native languages of the users. Issues that must be taken into account in a multilingual design include limitations on the concurrent use of more than one character set and the availability (or lack thereof) of metadata in languages other than English. Furthermore, the desired display language of each piece of information depends on the languages that each individual user can understand, the languages in which the documents and their metadata are available, and the locally available resources (fonts). DIENST is a digital library search tool developed at Cornell University. This report describes our work on extending DIENST to...
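    A minimal sketch of the display-language decision discussed above: for each piece of metadata, pick the first user-preferred language that is both available and renderable with the locally installed fonts, falling back to English and then to whatever is available. The data layout and function name are hypothetical, not DIENST's actual metadata schema.

    # Hedged sketch of per-field display-language selection.
    def pick_language(user_langs, available, installed_fonts):
        """user_langs: preference-ordered list, e.g. ["el", "en"].
        available: dict language -> field value; installed_fonts: set of
        languages the client can render."""
        for lang in user_langs:
            if lang in available and lang in installed_fonts:
                return lang, available[lang]
        if "en" in available:                     # fall back to English metadata
            return "en", available["en"]
        lang = next(iter(available))              # last resort: anything available
        return lang, available[lang]

    title = {"el": "Ψηφιακές Βιβλιοθήκες", "en": "Digital Libraries"}
    print(pick_language(["el", "en"], title, installed_fonts={"en"}))
    # -> ('en', 'Digital Libraries'): Greek metadata exists but no Greek font is installed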

    Wormhole IP over (Connectionless) ATM

    High-speed switches and routers internally operate using fixed-size cells or segments; variable-size packets are segmented and later reassembled. Connectionless ATM was proposed to quickly carry IP packets segmented into cells (AAL5) using a number of hardware-managed ATM VCs. We show that this is analogous to wormhole routing. We modify this architecture to make it applicable to existing ATM equipment: we propose a low-cost, single-input, single-output Wormhole IP Router that functions as a VP/VC translation filter between ATM subnetworks. When compared to IP routers, the proposed architecture features simpler hardware and lower latency. When compared to software-based IP-over-ATM techniques, the new architecture avoids the overheads of a large number of labels, as well as the delays of establishing new flows in software after the first few packets have suffered considerable latencies. We simulated a wormhole IP routing filter, showing that a few tens of hardware-managed VCs per outgoing VP usually suffice. We built and successfully tested a prototype, operating at 2 × 155 Mbps, using one FPGA and DRAM. Simple analysis shows that operation at 10 Gbps and beyond is feasible today. Index Terms: IP over ATM, connectionless ATM, wormhole routing, gigabit router, wormhole IP, routing filter.
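    Complementing the VC-pool sketch earlier, the following outlines the per-cell behaviour of such a VP/VC translation filter: route lookup and VC binding on a packet's first cell, translation of subsequent cells through that binding, and VC recycling at the AAL5 end-of-packet flag. The cell fields, the toy exact-match route table, and the helper are assumptions for illustration, not the prototype's hardware design.

    # Sketch of per-cell processing in a VP/VC translation filter.
    def route_cell(cell, bindings, free_vcs, route_table):
        """cell: dict with 'vp', 'vc', 'eop' (AAL5 end-of-packet) and, on the first
        cell of a packet, 'dst_ip'. Returns the (out_vp, out_vc) the cell leaves on."""
        key = (cell["vp"], cell["vc"])
        if key not in bindings:                       # first cell of a new packet
            out_vp = route_table[cell["dst_ip"]]      # stands in for a fast IP route lookup
            out_vc = free_vcs[out_vp].pop()           # claim a hardware-managed VC
            bindings[key] = (out_vp, out_vc)
        out = bindings[key]
        if cell["eop"]:                               # last cell: recycle the VC
            out_vp, out_vc = bindings.pop(key)
            free_vcs[out_vp].append(out_vc)
        return out

    route_table = {"10.1.2.3": 1}                     # toy exact-match table
    free_vcs = {1: [7, 8]}
    bindings = {}
    cells = [{"vp": 0, "vc": 3, "dst_ip": "10.1.2.3", "eop": False},
             {"vp": 0, "vc": 3, "eop": True}]
    print([route_cell(c, bindings, free_vcs, route_table) for c in cells])
    # -> [(1, 8), (1, 8)]: both cells of the packet leave on the same outgoing VP/VC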

    ECOSCALE: Reconfigurable Computing and Runtime System for Future Exascale Systems

    In order to reach exascale performance, current HPC systems need to be improved. Simple hardware scaling is not a feasible solution due to increasing utility costs and power consumption limitations. Apart from improvements in implementation technology, what is needed is to refine the HPC application development flow as well as the system architecture of future HPC systems. ECOSCALE tackles these challenges by proposing a scalable programming environment and architecture, aiming to substantially reduce energy consumption as well as data traffic and latency. ECOSCALE introduces a novel heterogeneous, energy-efficient, hierarchical architecture, as well as a hybrid many-core + OpenCL programming environment and runtime system. The ECOSCALE approach is hierarchical and is expected to scale well by partitioning the physical system into multiple independent Workers (i.e., compute nodes). Workers are interconnected in a tree-like fashion and define a contiguous global address space that can be viewed either as a set of partitions in a Partitioned Global Address Space (PGAS), or as a set of nodes hierarchically interconnected via an MPI protocol. To further increase energy efficiency, as well as to provide resilience, the Workers employ reconfigurable accelerators mapped into the virtual address space, utilizing a dual-stage System Memory Management Unit with coherent memory access. The architecture supports shared partitioned reconfigurable resources accessed by any Worker in a PGAS partition, as well as automated hardware synthesis of these resources from an OpenCL-based programming model.
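    As a small illustration of the PGAS view described above, the sketch below splits a global address into a Worker identifier and a local offset, so that any Worker in the hierarchy can name memory held by any other. The bit widths and the flat Worker numbering are assumptions for illustration, not the ECOSCALE address map.

    # Hedged sketch of partitioned-global-address-space addressing.
    LOCAL_BITS = 32                      # assumed size of each Worker's partition

    def to_global(worker_id, local_offset):
        """Pack an owning Worker and an offset within its partition into one address."""
        return (worker_id << LOCAL_BITS) | local_offset

    def from_global(global_addr):
        """Recover the owning Worker and the local offset from a global address."""
        return global_addr >> LOCAL_BITS, global_addr & ((1 << LOCAL_BITS) - 1)

    addr = to_global(worker_id=5, local_offset=0x1000)
    print(from_global(addr))             # -> (5, 4096): owning Worker and local offset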