956 research outputs found
The "MIND" Scalable PIM Architecture
MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a
Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on
each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND
architecture
Hierarchical multithreading: programming model and system software
This paper addresses the underlying sources of performance degradation (e.g. latency, overhead, and starvation) and the difficulties of programmer productivity (e.g. explicit locality management and scheduling, performance tuning, fragmented memory, and synchronous global barriers) to dramatically enhance the broad effectiveness of parallel processing for high end computing. We are developing a hierarchical threaded virtual machine (HTVM) that defines a dynamic, multithreaded execution model and programming model, providing an architecture abstraction for HEC system software and tools development. We are working on a prototype language, LITL-X (pronounced "little-X") for latency intrinsic-tolerant language, which provides the application programmers with a powerful set of semantic constructs to organize parallel computations in a way that hides/manages latency and limits the effects of overhead. This is quite different from locality management, although the intent of both strategies is to minimize the effect of latency on the efficiency of computation. We work on a dynamic compilation and runtime model to achieve efficient LITL-X program execution. Several adaptive optimizations were studied. A methodology of incorporating domain-specific knowledge in program optimization was studied. Finally, we plan to implement our method in an experimental testbed for a HEC architecture and perform a qualitative and quantitative evaluation on selected applications
Massively Parallel Computing at the Large Hadron Collider up to the HL-LHC
As the Large Hadron Collider (LHC) continues its upward progression in energy
and luminosity towards the planned High-Luminosity LHC (HL-LHC) in 2025, the
challenges of the experiments in processing increasingly complex events will
also continue to increase. Improvements in computing technologies and
algorithms will be a key part of the advances necessary to meet this challenge.
Parallel computing techniques, especially those using massively parallel
computing (MPC), promise to be a significant part of this effort. In these
proceedings, we discuss these algorithms in the specific context of a
particularly important problem: the reconstruction of charged particle tracks
in the trigger algorithms in an experiment, in which high computing performance
is critical for executing the track reconstruction in the available time. We
discuss some areas where parallel computing has already shown benefits to the
LHC experiments, and also demonstrate how a MPC-based trigger at the CMS
experiment could not only improve performance, but also extend the reach of the
CMS trigger system to capture events which are currently not practical to
reconstruct at the trigger level.Comment: 14 pages, 6 figures. Proceedings of 2nd International Summer School
on Intelligent Signal Processing for Frontier Research and Industry
(INFIERI2014), to appear in JINST. Revised version in response to referee
comment
Evaluation of OpenMP for the Cyclops multithreaded architecture
Multithreaded architectures have the potential of tolerating large memory and functional unit latencies and increase resource utilization. The Blue Gene/Cyclops architecture, being developed at the IBM T. J. Watson Research Center, is one such systems that offers massive intra-chip parallelism. Although the BG/C architecture was initially designed to execute specific applications, we believe that it can be effectively used on a broad range of parallel numerical applications. Programming such applications for this unconventional design requires a significant porting effort when using the basic built-in mechanisms for thread management and synchronization. In this paper, we describe the implementation of an OpenMP environment for parallelizing applications, currently under development at the CEPBA-IBM Research Institute, targeting BG/C. The environment is evaluated with a set of simple numerical kernels and a subset of the NAS OpenMP benchmarks. We identify issues that were not initially considered in the design of the BG/C architecture to support a programming model such as OpenMP. We also evaluate features currently offered by the BG/C architecture that should be considered in the implementation of an efficient OpenMP layer for massive intra-chip parallel architectures.Peer ReviewedPostprint (author's final draft
Scalable multithreading in a low latency myrinet cluster
In this paper we present some implementation details of a programming model – pCoR – that combines primitives to launch remote processes and threads with communication over Myrinet.B asically, we present the efforts we have made to achieve high performance communication among threads of parallel/distributed applications. The expected advantages of multiple threads launched across a low latency cluster of SMP workstations are emphasized with a graphical application that manages huge maps consisting of several JPEG images
Embedded Parallel Systolic Architecture For Multi-Filtering Techniques Using FPGA.
Computing systems typically suffer from delay in data processing
Simulation of Recognizer P Systems by Using Manycore GPUs
Software development for cellular computing is growing up yielding new
applications. In this paper, we describe a simulator for the class of recognizer P systems
with active membranes, which exploits the massively parallel nature of the P systems
computations by using a massively parallel computer architecture, such as Compute
Unified Device Architecture (CUDA) from Nvidia, to obtain better performance in the
simulations. We illustrate it by giving a solution to the N-Queens problem as an example.Ministerio de Educación y Ciencia TIN2006–13425Junta de Andalucía P08–TIC0420
- …