Shared versus distributed memory multiprocessors
The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation-intensive tasks, and research and implementation trends are considered. It appears that the two types of systems will likely converge to a common form for large-scale multiprocessors.
Memory performance of AND-parallel Prolog on shared-memory architectures
The goal of the RAP-WAM AND-parallel Prolog abstract architecture is to provide inference speeds significantly beyond those of sequential systems, while supporting Prolog semantics and preserving sequential performance and storage efficiency. This paper presents simulation results supporting these claims, with special emphasis on memory performance on a two-level shared-memory multiprocessor organization. Several solutions to the cache coherency problem are analyzed. It is shown that RAP-WAM offers good locality and storage efficiency and that it can effectively take advantage of broadcast caches. It is argued that speeds in excess of 2 MLIPS on real applications exhibiting medium parallelism can be attained with current technology.
GPU acceleration of brain image processing
In recent years, GPUs have been shown to offer high computational power
for solving certain problems.
At the same time, there are fields that cannot fully benefit
from the improvements achieved by researchers, mainly because
application execution times become extremely
long. This is the case, for example, of image registration in medicine.
Although speedups have been achieved for image registration,
its use in clinical practice is still limited. Among other reasons, this is due
to the performance achieved.
The aim of this project is therefore to improve the
execution times of an application dedicated to medical image registration,
in order to help alleviate this problem.
An exploration of CUDA and CBEA for a gravitational wave data-analysis application (Einstein@Home)
We present a detailed approach for making use of two new computer hardware
architectures, CBEA and CUDA, for accelerating a scientific data-analysis
application (Einstein@Home). Our results suggest that both architectures
suit the application quite well and that the performance achievable within the same
software development time frame is nearly identical. Comment: Accepted for publication in International Conference on Parallel
Processing and Applied Mathematics (PPAM 2009)
CoreTSAR: Task Scheduling for Accelerator-aware Runtimes
Heterogeneous supercomputers that incorporate computational accelerators
such as GPUs are increasingly popular due to their high
peak performance, energy efficiency and comparatively low cost.
Unfortunately, the programming models and frameworks designed
to extract performance from all computational units still lack the
flexibility of their CPU-only counterparts. Accelerated OpenMP
improves this situation by supporting natural migration of OpenMP
code from CPUs to a GPU. However, these implementations currently
lose one of OpenMP’s best features, its flexibility: typical
OpenMP applications can run on any number of CPUs. GPU implementations
do not transparently employ multiple GPUs on a node
or a mix of GPUs and CPUs. To address these shortcomings, we
present CoreTSAR, our runtime library for dynamically scheduling
tasks across heterogeneous resources, and propose straightforward
extensions that incorporate this functionality into Accelerated
OpenMP. We show that our approach can provide nearly linear
speedup to four GPUs over only using CPUs or one GPU while
increasing the overall flexibility of Accelerated OpenMP.
Investigation of LSTM Based Prediction for Dynamic Energy Management in Chip Multiprocessors
In this paper, we investigate the effectiveness of using long short-term memory (LSTM) instead of Kalman filtering for prediction when constructing dynamic energy management (DEM) algorithms in chip multiprocessors (CMPs). Either prediction method is used to estimate the workload of each processor core in the next control period. These estimates are then used to select a voltage-frequency (VF) pair for each core of the CMP during the next control period as part of a dynamic voltage and frequency scaling (DVFS) technique. The objective of the DVFS technique is to reduce energy consumption under performance constraints that are set by the user. We conduct our investigation using a custom Sniper system simulation framework. Simulation results for 16- and 64-core network-on-chip based CMP architectures, using several benchmarks, demonstrate that the LSTM is slightly better than Kalman filtering.
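The per-core VF selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the VF table, the helper names `select_vf_pair` and `plan_next_period`, and the modelling of the performance constraint as "retire the predicted cycles within one control period" are all assumptions made for illustration. The workload predictions are taken as given; in the paper they would come from an LSTM or a Kalman filter.

```python
from typing import List, Tuple

# Hypothetical voltage-frequency table: (volts, GHz), sorted by ascending frequency.
VF_PAIRS: List[Tuple[float, float]] = [(0.8, 1.0), (0.9, 1.5), (1.0, 2.0), (1.1, 2.5)]


def select_vf_pair(predicted_cycles: float, period_s: float,
                   vf_pairs: List[Tuple[float, float]] = VF_PAIRS) -> Tuple[float, float]:
    """Pick the lowest-frequency pair able to retire the predicted workload
    (in cycles) within one control period; fall back to the fastest pair."""
    for volts, ghz in vf_pairs:
        if ghz * 1e9 * period_s >= predicted_cycles:
            return (volts, ghz)
    return vf_pairs[-1]


def plan_next_period(predictions: List[float], period_s: float) -> List[Tuple[float, float]]:
    """Choose one VF pair per core from that core's predicted workload."""
    return [select_vf_pair(cycles, period_s) for cycles in predictions]


if __name__ == "__main__":
    # Assumed per-core predictions (cycles) for a 1 ms control period.
    print(plan_next_period([0.5e6, 1.2e6, 2.2e6, 9e6], period_s=1e-3))
```

Choosing the slowest pair that still meets the constraint reflects the usual DVFS intuition that lower voltage and frequency reduce energy; a real controller would also have to account for prediction error and VF switching overhead.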