Quantifying the benefits of SPECint distant parallelism in simultaneous multithreading architectures
We exploit the existence of distant parallelism that future compilers could detect and characterise its performance under simultaneous multithreading architectures. By distant parallelism we mean parallelism that cannot be captured by the processor instruction window and that can produce threads suitable for parallel execution in a multithreaded processor. We show that distant parallelism can make wider-issue processors feasible by supplying more instructions from the distant threads, thus making better use of the processor's resources when speeding up a single integer application. We also investigate the necessity of out-of-order processors in the presence of multiple threads of the same program. Note that the benefits described are entirely orthogonal to any other architectural technique targeting a single thread.
Peer Reviewed. Postprint (published version)
Autonomic management of multiple non-functional concerns in behavioural skeletons
We introduce and address the problem of concurrent autonomic management of
different non-functional concerns in parallel applications built as a
hierarchical composition of behavioural skeletons. We first define the problems
arising when multiple concerns are dealt with by independent managers, then we
propose a methodology supporting coordinated management, and finally we discuss
how autonomic management of multiple concerns may be implemented in a typical
use case. The paper concludes with an outline of the challenges involved in
realizing the proposed methodology on distributed target architectures such as
clusters and grids. Being based on the behavioural skeleton concept proposed in
the CoreGRID GCM, it is anticipated that the methodology will be readily
integrated into the current reference implementation of GCM based on Java
ProActive and running on top of major grid middleware systems.
Comment: 20 pages + cover pag
Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Asymmetric multicore processors (AMPs) have recently emerged as an appealing
technology for severely energy-constrained environments, especially in mobile
appliances where heterogeneity in applications is mainstream. In addition,
given the growing interest in low-power high performance computing, this type
of architecture is also being investigated as a means to improve the
throughput-per-Watt of complex scientific applications.
In this paper, we design and embed several architecture-aware optimizations
into a multi-threaded general matrix multiplication (gemm), a key operation of
the BLAS, in order to obtain a high performance implementation for ARM
big.LITTLE AMPs. Our solution is based on the reference implementation of gemm
in the BLIS library, and integrates a cache-aware configuration as well as
asymmetric--static and dynamic scheduling strategies that carefully tune and
distribute the operation's micro-kernels among the big and LITTLE cores of the
target processor. The experimental results on a Samsung Exynos 5422, a
system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the
big.LITTLE model, show that our cache-aware versions of gemm with asymmetric
scheduling attain important gains in performance with respect to their
architecture-oblivious counterparts while exploiting all the resources of the
AMP to deliver considerable energy efficiency.
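The idea of an asymmetric static schedule can be illustrated with a minimal sketch. It is not the BLIS implementation described in the abstract; the function names, the core counts, and the `ratio` parameter (an assumed relative throughput of one big core versus one LITTLE core) are hypothetical and chosen only for illustration. The sketch gives the big cluster a share of the loop iterations proportional to its assumed aggregate speed, then divides each cluster's share evenly among its cores.

```python
def asymmetric_static_split(n_iters, big_cores, little_cores, ratio):
    """Split [0, n_iters) between a big and a LITTLE cluster.

    ratio is an ASSUMED speed of one big core relative to one LITTLE core;
    in practice it would have to be measured on the target machine.
    """
    big_weight = big_cores * ratio
    total_weight = big_weight + little_cores
    big_share = round(n_iters * big_weight / total_weight)
    return (0, big_share), (big_share, n_iters)


def split_evenly(start, stop, workers):
    """Divide [start, stop) into near-equal contiguous chunks, one per worker."""
    base, extra = divmod(stop - start, workers)
    chunks = []
    lo = start
    for w in range(workers):
        hi = lo + base + (1 if w < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


# Hypothetical configuration: 4 big cores, 4 LITTLE cores, big core assumed 3x faster.
big_range, little_range = asymmetric_static_split(1000, 4, 4, 3.0)
big_chunks = split_evenly(*big_range, 4)
little_chunks = split_evenly(*little_range, 4)
print(big_range, little_range)
```

A dynamic strategy would instead let each core fetch the next chunk when it becomes idle, which avoids having to fix `ratio` in advance at the cost of some coordination between cores.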
Taskgraph: A Low Contention OpenMP Tasking Framework
OpenMP is the de facto standard for shared memory systems in High-Performance
Computing (HPC). It includes a task-based model that offers a high level of
abstraction to effectively exploit highly dynamic structured and unstructured
parallelism in an easy and flexible way. Unfortunately, the run-time overheads
introduced to manage tasks are (very) high in most common OpenMP frameworks
(e.g., GCC, LLVM), which defeats the potential benefits of the tasking model,
and makes it suitable for coarse-grained tasks only. This paper presents
taskgraph, a framework that uses a task dependency graph (TDG) to represent a
region of code implemented with OpenMP tasks in order to reduce the run-time
overheads associated with the management of tasks, i.e., contention and
parallel orchestration, including task creation and synchronization. The TDG
avoids the overheads related to the resolution of task dependencies and greatly
reduces those deriving from the accesses to shared resources. Moreover, the
taskgraph framework introduces the record-and-replay execution model into
OpenMP, which accelerates the taskgraph region from its second execution
onward. Overall, the multiple optimizations presented in this paper allow
exploiting fine-grained OpenMP tasks to cope with the trend in current
applications toward massive on-node parallelism and fine-grained, dynamic
scheduling paradigms. The framework is implemented on LLVM 15.0. Results show
that the taskgraph implementation outperforms the vanilla OpenMP system in
terms of performance and scalability, for both structured and unstructured
parallelism, and for both coarse- and fine-grained tasks. Furthermore, the
proposed framework considerably reduces the performance gap between the task
and the thread models of OpenMP.
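The record-and-replay idea can be sketched in a few lines. This is a toy model, not the LLVM-based framework from the abstract: the `TaskGraph` class and its `record` and `replay` methods are hypothetical names, and tasks run sequentially here, whereas a real runtime would hand ready tasks to worker threads. Dependences are resolved once, when tasks are recorded with their input and output variables; each replay then only walks the stored graph.

```python
from collections import defaultdict, deque


class TaskGraph:
    """Toy task dependency graph (TDG) with record-and-replay."""

    def __init__(self):
        self.tasks = []                    # recorded callables, indexed by task id
        self.succ = defaultdict(list)      # task id -> successor task ids
        self.n_preds = []                  # number of predecessors per task
        self._last_writer = {}             # variable -> id of the last writing task
        self._readers = defaultdict(list)  # variable -> readers since the last write

    def record(self, fn, ins=(), outs=()):
        """Register a task and resolve its dependences once."""
        tid = len(self.tasks)
        self.tasks.append(fn)
        preds = set()
        for v in ins:                      # read-after-write
            if v in self._last_writer:
                preds.add(self._last_writer[v])
        for v in outs:                     # write-after-write and write-after-read
            if v in self._last_writer:
                preds.add(self._last_writer[v])
            preds.update(self._readers[v])
        for p in preds:
            self.succ[p].append(tid)
        self.n_preds.append(len(preds))
        for v in ins:
            self._readers[v].append(tid)
        for v in outs:
            self._last_writer[v] = tid
            self._readers[v] = []
        return tid

    def replay(self):
        """Execute the recorded graph without re-resolving dependences."""
        pending = list(self.n_preds)
        ready = deque(t for t, n in enumerate(pending) if n == 0)
        order = []
        while ready:
            t = ready.popleft()
            self.tasks[t]()
            order.append(t)
            for s in self.succ[t]:
                pending[s] -= 1
                if pending[s] == 0:
                    ready.append(s)
        return order


data = {"a": 0, "b": 0, "c": 0}
g = TaskGraph()
g.record(lambda: data.update(a=1), outs=["a"])
g.record(lambda: data.update(b=2), outs=["b"])
g.record(lambda: data.update(c=data["a"] + data["b"]), ins=["a", "b"], outs=["c"])

first = g.replay()
second = g.replay()   # reuses the stored graph; no dependence resolution
print(first, second, data["c"])
```

In this sketch the cost of matching inputs against outputs is paid only in `record`, which is the part a record-and-replay scheme aims to amortize over repeated executions of the same region.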
Programmability and Performance of Parallel ECS-based Simulation of Multi-Agent Exploration Models
While the traditional objective of parallel/distributed simulation techniques has mainly been to improve performance and make very large models tractable, more recent research trends have targeted complementary aspects, such as "ease of programming". Along this line, a recent proposal called Event and Cross State (ECS) synchronization relaxes the traditional programming rules of Parallel Discrete Event Simulation (PDES) systems, where the application code processing a specific event is only allowed to access the state (namely the memory image) of the target simulation object. With ECS, the programmer is allowed to write ANSI-C event handlers capable of accessing (in either read or write mode) the state of any simulation object included in the simulation model. Correct concurrent execution of events, e.g., on top of multi-core machines, is guaranteed by ECS with no intervention by the programmer, who is in practice exposed to a sequential-style programming model where events are processed one at a time and can access the current memory image of the whole simulation model, namely the collection of the states of all involved objects. This can strongly simplify the development of specific models, e.g., by avoiding the need to pass state information across concurrent objects in the form of events. In this article we investigate both the programmability and the performance aspects of developing and supporting a multi-agent exploration model on top of the ROOT-Sim PDES platform, which supports ECS.
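The sequential-style programming model that the abstract describes can be illustrated with a small sketch. It shows only the programmer's view, not the ECS synchronization mechanism or the ROOT-Sim API (which is ANSI-C based): the `run` loop, the handler names, and the state layout are all hypothetical. Events are processed one at a time in timestamp order, and a handler may read the state of objects other than its target.

```python
import heapq


def on_move(ts, target, payload, states):
    """Move the target agent to a new cell and mark that cell as known."""
    me = states[target]
    me["pos"] = payload
    me["known"].add(payload)


def on_meet(ts, target, payload, states):
    """Merge in the maps of co-located agents by reading their state directly."""
    me = states[target]
    for other_id, other in states.items():
        if other_id != target and other["pos"] == me["pos"]:
            me["known"] |= other["known"]   # cross-state read, no message needed


HANDLERS = {"move": on_move, "meet": on_meet}


def run(events, states):
    """Process (timestamp, target, kind, payload) events in timestamp order."""
    queue = list(events)
    heapq.heapify(queue)
    while queue:
        ts, target, kind, payload = heapq.heappop(queue)
        HANDLERS[kind](ts, target, payload, states)


states = {
    "a1": {"pos": (0, 0), "known": {(0, 0)}},
    "a2": {"pos": (1, 0), "known": {(1, 0)}},
}
run([(2.0, "a1", "meet", None), (1.0, "a2", "move", (0, 0))], states)
print(sorted(states["a1"]["known"]))
```

Under a traditional PDES discipline, the map exchange in `on_meet` would have to be expressed as explicit events carrying state between the two agents.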
A parallel algorithm for global routing
A Parallel Hierarchical algorithm for Global Routing (PHIGURE) is presented. The router is based on the work of Burstein and Pelavin, but has many extensions for general global routing and parallel execution. The main features of the algorithm include a structured hierarchical decomposition into separate, independent tasks that are suitable for parallel execution, and an adaptive simplex solution for adding feedthroughs and adjusting channel heights for row-based layout. Alternative decomposition methods and the various levels of parallelism available in the algorithm are examined closely. The algorithm is described and results are presented for a shared-memory multiprocessor implementation.