8,067 research outputs found
Performance Analysis of a Novel GPU Computation-to-core Mapping Scheme for Robust Facet Image Modeling
Though the GPGPU concept is well-known
in image processing, much more work remains to be done
to fully exploit GPUs as an alternative computation
engine. This paper investigates the computation-to-core
mapping strategies to probe the efficiency and scalability
of the robust facet image modeling algorithm on GPUs.
Our fine-grained computation-to-core mapping scheme
shows a significant performance gain over the standard
pixel-wise mapping scheme. With in-depth performance
comparisons across the two different mapping schemes,
we analyze the impact of the level of parallelism on
the GPU computation and suggest two principles for
optimizing future image processing applications on the
GPU platform
Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 4: FTMP executive summary
The FTMP architecture is a high reliability computer concept modeled after a homogeneous multiprocessor architecture. Elements of the FTMP are operated in tight synchronism with one another and hardware fault-detection and fault-masking is provided which is transparent to the software. Operating system design and user software design is thus greatly simplified. Performance of the FTMP is also comparable to that of a simplex equivalent due to the efficiency of fault handling hardware. The FTMP project constructed an engineering module of the FTMP, programmed the machine and extensively tested the architecture through fault injection and other stress testing. This testing confirmed the soundness of the FTMP concepts
Simulating spin models on GPU
Over the last couple of years it has been realized that the vast
computational power of graphics processing units (GPUs) could be harvested for
purposes other than the video game industry. This power, which at least
nominally exceeds that of current CPUs by large factors, results from the
relative simplicity of the GPU architectures as compared to CPUs, combined with
a large number of parallel processing units on a single chip. To benefit from
this setup for general computing purposes, the problems at hand need to be
prepared in a way to profit from the inherent parallelism and hierarchical
structure of memory accesses. In this contribution I discuss the performance
potential for simulating spin models, such as the Ising model, on GPU as
compared to conventional simulations on CPU.Comment: 5 pages, 4 figures, elsarticl
Architecture Design Space Exploration for Streaming Applications Through Timing Analysis
In this paper we compare the maximum achievable throughput of different memory organisations of the processing elements that constitute a multiprocessor system on chip. This is done by modelling the mapping of a task with input and output channels on a processing element as a homogeneous synchronous dataflow graph, and use maximum cycle mean analysis to derive the throughput. In a HiperLAN2 case study we show how these techniques can be used to derive the required clock frequency and communication latencies in order to meet the application's throughput requirement on a multiprocessor system on chip that has one of the investigated memory organisations
Access to vectors in multi-module memories
The poor bandwidth obtained from memory when conflicts arise in the modules or in the interconnection network degrades the performance of computers. Address transformation schemes, such as interleaving, skewing and linear transformations, have been proposed to achieve conflict-free access for streams with constant stride. However, this is achieved only for some strides. In this paper, we summarize a mechanism to request the elements in an out-of-order way which allows to achieve
conflict-free access for a larger number of strides. We study the cases of a single vector processor and of a vector multiprocessor system. For this latter case, we propose a synchronous mode of accessing memory that can be applied in SIMD machines or in MIMD systems with decoupled access and execution.Peer ReviewedPostprint (published version
System configuration and executive requirements specifications for reusable shuttle and space station/base
System configuration and executive requirements specifications for reusable shuttle and space station/bas
Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics
Graphics Processing Units (GPUs) are having a transformational effect on
numerical lattice quantum chromodynamics (LQCD) calculations of importance in
nuclear and particle physics. The QUDA library provides a package of mixed
precision sparse matrix linear solvers for LQCD applications, supporting single
GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This
library, interfaced to the QDP++/Chroma framework for LQCD calculations, is
currently in production use on the "9g" cluster at the Jefferson Laboratory,
enabling unprecedented price/performance for a range of problems in LQCD.
Nevertheless, memory constraints on current GPU devices limit the problem sizes
that can be tackled. In this contribution we describe the parallelization of
the QUDA library onto multiple GPUs using MPI, including strategies for the
overlapping of communication and computation. We report on both weak and strong
scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in
excess of 4 Tflops.Comment: 11 pages, 7 figures, to appear in the Proceedings of Supercomputing
2010 (submitted April 12, 2010
Memory performance of and-parallel prolog on shared-memory architectures
The goal of the RAP-WAM AND-parallel Prolog abstract architecture is to provide inference speeds significantly
beyond those of sequential systems, while supporting Prolog semantics and preserving sequential performance and storage efficiency. This paper presents simulation results supporting these claims with special emphasis on memory performance on a two-level sharedmemory multiprocessor organization. Several solutions to the cache coherency problem are analyzed. It is shown that RAP-WAM offers good locality and storage efficiency and that it can effectively take advantage of broadcast caches. It is argued that speeds in excess of 2 ML IPS on real applications exhibiting medium parallelism can be attained with current technology
Shared versus distributed memory multiprocessors
The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation intensive tasks and considers research and implementation trends. It appears that the two types of systems will likely converge to a common form for large scale multiprocessors
A general graphical user interface for automatic reliability modeling
Reported here is a general Graphical User Interface (GUI) for automatic reliability modeling of Processor Memory Switch (PMS) structures using a Markov model. This GUI is based on a hierarchy of windows. One window has graphical editing capabilities for specifying the system's communication structure, hierarchy, reconfiguration capabilities, and requirements. Other windows have field texts, popup menus, and buttons for specifying parameters and selecting actions. An example application of the GUI is given
- …