Search CORE

956 research outputs found

The "MIND" Scalable PIM Architecture

Author: Brodowicz Maciej
Sterling Thomas
Publication venue
Publication date: 01/01/2005
Field of study

MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture

Caltech Authors

Hierarchical multithreading: programming model and system software

Author: Gao Guang R.
Hereld Mark
Sterling Thomas
Stevens Rick
Zhu Weirong
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006
Field of study

This paper addresses the underlying sources of performance degradation (e.g. latency, overhead, and starvation) and the difficulties of programmer productivity (e.g. explicit locality management and scheduling, performance tuning, fragmented memory, and synchronous global barriers) to dramatically enhance the broad effectiveness of parallel processing for high end computing. We are developing a hierarchical threaded virtual machine (HTVM) that defines a dynamic, multithreaded execution model and programming model, providing an architecture abstraction for HEC system software and tools development. We are working on a prototype language, LITL-X (pronounced "little-X") for latency intrinsic-tolerant language, which provides the application programmers with a powerful set of semantic constructs to organize parallel computations in a way that hides/manages latency and limits the effects of overhead. This is quite different from locality management, although the intent of both strategies is to minimize the effect of latency on the efficiency of computation. We work on a dynamic compilation and runtime model to achieve efficient LITL-X program execution. Several adaptive optimizations were studied. A methodology of incorporating domain-specific knowledge in program optimization was studied. Finally, we plan to implement our method in an experimental testbed for a HEC architecture and perform a qualitative and quantitative evaluation on selected applications

Crossref

Caltech Authors

Massively Parallel Computing at the Large Hadron Collider up to the HL-LHC

Author: Halyo Valerie
Lujan Paul
Publication venue: 'IOP Publishing'
Publication date: 12/05/2015
Field of study

As the Large Hadron Collider (LHC) continues its upward progression in energy and luminosity towards the planned High-Luminosity LHC (HL-LHC) in 2025, the challenges of the experiments in processing increasingly complex events will also continue to increase. Improvements in computing technologies and algorithms will be a key part of the advances necessary to meet this challenge. Parallel computing techniques, especially those using massively parallel computing (MPC), promise to be a significant part of this effort. In these proceedings, we discuss these algorithms in the specific context of a particularly important problem: the reconstruction of charged particle tracks in the trigger algorithms in an experiment, in which high computing performance is critical for executing the track reconstruction in the available time. We discuss some areas where parallel computing has already shown benefits to the LHC experiments, and also demonstrate how a MPC-based trigger at the CMS experiment could not only improve performance, but also extend the reach of the CMS trigger system to capture events which are currently not practical to reconstruct at the trigger level.Comment: 14 pages, 6 figures. Proceedings of 2nd International Summer School on Intelligent Signal Processing for Frontier Research and Industry (INFIERI2014), to appear in JINST. Revised version in response to referee comment

arXiv.org e-Print Archive

CERN Document Server

Evaluation of OpenMP for the Cyclops multithreaded architecture

Author: Almasi George
Ayguadé Parra Eduard
Cascaval Calin
Castaños José G.
Labarta Mancho Jesús José
Martorell Bofill Xavier
Martínez Francisco
Moreira José E.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/06/2003
Field of study

Multithreaded architectures have the potential of tolerating large memory and functional unit latencies and increase resource utilization. The Blue Gene/Cyclops architecture, being developed at the IBM T. J. Watson Research Center, is one such systems that offers massive intra-chip parallelism. Although the BG/C architecture was initially designed to execute specific applications, we believe that it can be effectively used on a broad range of parallel numerical applications. Programming such applications for this unconventional design requires a significant porting effort when using the basic built-in mechanisms for thread management and synchronization. In this paper, we describe the implementation of an OpenMP environment for parallelizing applications, currently under development at the CEPBA-IBM Research Institute, targeting BG/C. The environment is evaluated with a set of simple numerical kernels and a subset of the NAS OpenMP benchmarks. We identify issues that were not initially considered in the design of the BG/C architecture to support a programming model such as OpenMP. We also evaluate features currently offered by the BG/C architecture that should be considered in the implementation of an efficient OpenMP layer for massive intra-chip parallel architectures.Peer ReviewedPostprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC

Scalable multithreading in a low latency myrinet cluster

Author: Alves Albano
Exposto José
Pina António
Rufino José
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 01/01/2002
Field of study

In this paper we present some implementation details of a programming model – pCoR – that combines primitives to launch remote processes and threads with communication over Myrinet.B asically, we present the efforts we have made to achieve high performance communication among threads of parallel/distributed applications. The expected advantages of multiple threads launched across a low latency cluster of SMP workstations are emphasized with a graphical application that manages huge maps consisting of several JPEG images

Biblioteca Digital do IPB

Embedded Parallel Systolic Architecture For Multi-Filtering Techniques Using FPGA.

Author: Arshad M. R.
H. Salih Muataz
Publication venue
Publication date: 01/01/2010
Field of study

Computing systems typically suffer from delay in data processing

Crossref

Repository@USM

Secure Virtualization and Multicore Platforms State-of-the-Art report

Author: Douglas Heradon
Gehrmann Christian
Publication venue: Swedish Institute of Computer Science
Publication date: 01/01/2009
Field of study

SVaM

CiteSeerX

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive

Simulation of Recognizer P Systems by Using Manycore GPUs

Author: Cecilia José M.
García José M.
Guerrero Ginés D.
Martínez del Amor Miguel Ángel
Pérez Hurtado de Mendoza Ignacio
Pérez Jiménez Mario de Jesús
Publication venue: Fénix Editora
Publication date: 01/01/2009
Field of study

Software development for cellular computing is growing up yielding new applications. In this paper, we describe a simulator for the class of recognizer P systems with active membranes, which exploits the massively parallel nature of the P systems computations by using a massively parallel computer architecture, such as Compute Unified Device Architecture (CUDA) from Nvidia, to obtain better performance in the simulations. We illustrate it by giving a solution to the N-Queens problem as an example.Ministerio de Educación y Ciencia TIN2006–13425Junta de Andalucía P08–TIC0420

idUS. Depósito de Investigación Universidad de Sevilla