Content addressable memory project

Author: Hall J. Storrs
Levy Saul
Miyake Keith M.
Smith Donald E.

A parameterized version of the tree processor was designed and tested by simulation. The leaf processor design is 90 percent complete. We expect to complete and test a combination of tree and leaf cell designs in the next period. Work is proceeding on algorithms for the content addressable memory (CAM), and once the design is complete we will begin simulating algorithms for large problems. The following topics are covered: (1) the practical implementation of content addressable memory; (2) design of a LEAF cell for the Rutgers CAM architecture; (3) a circuit design tool user's manual; and (4) design and analysis of efficient hierarchical interconnection networks.

NASA Technical Reports Server

Rutgers CAM2000 chip architecture

Author: Hall J. Storrs
Miyake Keith
Smith Donald E.

This report describes the architecture and instruction set of the Rutgers CAM2000 memory chip. The CAM2000 combines features of Associative Processing (AP), Content Addressable Memory (CAM), and Dynamic Random Access Memory (DRAM) in a single chip package that is not only DRAM compatible but also capable of applying simple massively parallel operations to memory. This document is continually updated to reflect the current state of the CAM2000 architecture and instruction set.

NASA Technical Reports Server
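
The CAM2000 abstract above describes memory that is DRAM compatible yet can be searched by content and updated in parallel. A minimal Python sketch of that idea follows. It is a toy software model, not the CAM2000 instruction set: the class `ToyCAM` and its methods `read`, `match`, and `add_to_matches` are hypothetical names chosen for illustration, and the loops stand in for what hardware would do on all words at once.

```python
from typing import List


class ToyCAM:
    """Toy model of a content addressable memory (illustrative only)."""

    def __init__(self, words: List[int], width: int = 8) -> None:
        self.width = width
        self.all_bits = (1 << width) - 1
        # Store each word truncated to the configured bit width.
        self.words = [w & self.all_bits for w in words]

    def read(self, address: int) -> int:
        """Ordinary RAM-style access: look up a word by its address."""
        return self.words[address]

    def match(self, key: int, mask: int) -> List[int]:
        """Content lookup: addresses whose bits agree with key wherever mask is 1."""
        return [a for a, w in enumerate(self.words) if (w ^ key) & mask == 0]

    def add_to_matches(self, key: int, mask: int, delta: int) -> int:
        """Apply one simple operation to every matching word; return the hit count."""
        hits = self.match(key, mask)
        for a in hits:
            self.words[a] = (self.words[a] + delta) & self.all_bits
        return len(hits)


cam = ToyCAM([0x12, 0x34, 0x16, 0xF2])
hits = cam.match(key=0x02, mask=0x0F)  # words whose low nibble is 2
updated = cam.add_to_matches(key=0x02, mask=0x0F, delta=1)
```

Here `read` plays the role of the conventional address-based path, while `match` and `add_to_matches` illustrate the associative path, in which a key and mask select words and a single operation is applied to every selected word.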

Strong scaling of general-purpose molecular dynamics simulations on GPUs

Author: Anderson Joshua A.
Glaser Jens
Glotzer Sharon C.
Lui Pak
Millan Jaime A.
Morse David C.
Nguyen Trung Dac
Spiga Filippo
Publication venue: 'Elsevier BV'
Publication date: 10/12/2014

We describe a highly optimized implementation of MPI domain decomposition in a GPU-enabled, general-purpose molecular dynamics code, HOOMD-blue (Anderson and Glotzer, arXiv:1308.5587). Our approach is inspired by a traditional CPU-based code, LAMMPS (Plimpton, J. Comp. Phys. 117, 1995), but is implemented within a code that was designed for execution on GPUs from the start (Anderson et al., J. Comp. Phys. 227, 2008). The software supports short-ranged pair force and bond force fields and achieves optimal GPU performance using an autotuning algorithm. We are able to demonstrate equivalent or superior scaling on up to 3,375 GPUs in Lennard-Jones and dissipative particle dynamics (DPD) simulations of up to 108 million particles. GPUDirect RDMA capabilities in recent GPU generations provide better performance in full double precision calculations. For a representative polymer physics application, HOOMD-blue 1.0 provides an effective GPU vs. CPU node speed-up of 12.5x. Comment: 30 pages, 14 figures

arXiv.org e-Print Archive

CiteSeerX
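
The domain decomposition mentioned in the HOOMD-blue abstract above can be illustrated with a short Python sketch. It assumes a cubic box of side `box` split into a regular `grid` of subdomains, one per rank, and a short-ranged force cutoff `r_cut`. The functions `owner_rank` and `is_near_boundary` are hypothetical helpers written for this illustration; they are not HOOMD-blue or LAMMPS API, and real codes perform the exchange with MPI and on the GPU.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Grid = Tuple[int, int, int]


def owner_rank(pos: Vec3, box: float, grid: Grid) -> int:
    """Map a position in [0, box)^3 to the rank that owns its subdomain."""
    cell = []
    for x, n in zip(pos, grid):
        i = int(x / box * n)
        cell.append(min(i, n - 1))  # guard against rounding at the upper edge
    i, j, k = cell
    # Row-major ordering of the subdomain grid.
    return (i * grid[1] + j) * grid[2] + k


def is_near_boundary(pos: Vec3, box: float, grid: Grid, r_cut: float) -> bool:
    """True if the particle lies within r_cut of any face of its subdomain.

    Such particles must be copied to neighboring ranks as ghost particles
    so that short-ranged pair forces across the boundary can be computed.
    """
    for x, n in zip(pos, grid):
        width = box / n
        offset = x % width
        if offset < r_cut or width - offset < r_cut:
            return True
    return False


grid = (2, 2, 2)
rank_a = owner_rank((1.0, 1.0, 1.0), 10.0, grid)
rank_b = owner_rank((6.0, 1.0, 1.0), 10.0, grid)
ghost = is_near_boundary((4.5, 2.5, 2.5), 10.0, grid, r_cut=1.0)
```

Each rank integrates only the particles it owns, and only particles flagged by a test like `is_near_boundary` need to be communicated each step, which is why short-ranged force fields suit this scheme.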

PLACES'10: The 3rd Workshop on Programming Language Approaches to Concurrency and Communication-Centric Software

Author: Honda Kohei
Mycroft Alan
Publication date: 30/12/2013

Paphos, Cyprus. March 201

Queen Mary Research Online

Programmable built-in self-testing of embedded RAM clusters in system-on-chip architectures

Author: Benso Alfredo
Di Carlo Stefano
Di Natale Giorgio
Lobetti Bodoni M.
Prinetto Paolo Ernesto
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Multiport memories are widely used as embedded cores in all communication system-on-chip devices. Due to their high complexity and very low accessibility, built-in self-test (BIST) is the most common solution implemented to test the different memories embedded in the system. This article presents a programmable BIST architecture based on a single microprogrammable BIST processor and a set of memory wrappers designed to simplify the test of a system containing a large number of distributed multiport memories of different sizes (number of bits, number of words), access protocols (asynchronous, synchronous), and timing.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

HARDWARE DESIGN OF MESSAGE PASSING ARCHITECTURE ON HETEROGENEOUS SYSTEM

Author: Gao Shanyuan
NC DOCKS at The University of North Carolina at Charlotte
Publication date: 01/01/2013

Heterogeneous multi/many-core chips are commonly used in today’s top tier supercomputers. Similar heterogeneous processing elements, or computation accelerators, are commonly found in FPGA systems. Within both multi/many-core chips and FPGA systems, the on-chip network plays a critical role by connecting these processing elements together. However, the on-chip network is commonly used for point-to-point communication between on-chip components and the memory interface. As the system scales up with more nodes, traditional programming methods, such as MPI, cannot effectively use the on-chip and off-chip networks, which can make communication the performance bottleneck. This research proposes an MPI-like Message Passing Engine (MPE) as part of the on-chip network, providing point-to-point and collective communication primitives in hardware. On one hand, the MPE improves communication performance by offloading the communication workload from the general processing elements. On the other hand, the MPE provides a direct interface to the heterogeneous processing elements, which can eliminate the data path through the OS and libraries. Detailed experimental results show that the MPE can significantly reduce communication time and improve overall performance, especially for heterogeneous computing systems, because of its tight coupling with the network. Additionally, a hybrid “MPI+X” computing system is tested, and the results show that the MPE can effectively offload communication and let the processing elements play to their strengths in computation.

The University of North Carolina at Greensboro