Object oriented execution model (OOM)
This paper considers implementing the object-oriented programming model directly in hardware as a basis for exploiting object-level parallelism, speculation and heterogeneous computing. Towards this goal, we present a new execution model, the Object Oriented execution Model (OOM), that implements the OO programming model. All OOM hardware structures are objects, and the OOM instruction set operates directly on objects while hiding other complex hardware structures. OOM maintains all high-level programming language information until execution time. This enables efficient extraction of the available parallelism in serial OO code at execution time with minimal compiler support. Our results show that OOM utilizes the available parallelism better than the OoO (Out-of-Order) model. Peer reviewed. Postprint (published version).
The Potential of Synergistic Static, Dynamic and Speculative Loop Nest Optimizations for Automatic Parallelization
Research in automatic parallelization of loop-centric programs started with
static analysis, then broadened its arsenal to include dynamic
inspection-execution and speculative execution, the best results involving
hybrid static-dynamic schemes. Beyond the detection of parallelism in a
sequential program, scalable parallelization on many-core processors involves
hard and interesting parallelism adaptation and mapping challenges. These
challenges include tailoring data locality to the memory hierarchy, structuring
independent tasks hierarchically to exploit multiple levels of parallelism,
tuning the synchronization grain, balancing the execution load, decoupling the
execution into thread-level pipelines, and leveraging heterogeneous hardware
with specialized accelerators. The polyhedral framework makes it possible to model,
construct and apply very complex loop nest transformations addressing most of
the parallelism adaptation and mapping challenges. But apart from
hardware-specific, back-end oriented transformations (if-conversion, trace
scheduling, value prediction), loop nest optimization has essentially ignored
dynamic and speculative techniques. Research in polyhedral compilation recently
reached a significant milestone towards the support of dynamic, data-dependent
control flow. This opens a large avenue for blending dynamic analyses and
speculative techniques with advanced loop nest optimizations. Selecting
real-world examples from SPEC benchmarks and numerical kernels, we make a case
for the design of synergistic static, dynamic and speculative loop
transformation techniques. We also sketch the embedding of dynamic information,
including speculative assumptions, in the heart of affine transformation search
spaces.
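To make the dynamic inspection-execution idea in the abstract above concrete, here is a minimal Python sketch. It assumes a loop of the form `a[write_idx[i]] = a[read_idx[i]] + 1`; the function names and the thread-pool executor are illustrative choices, not taken from the paper. The inspector scans the index arrays at run time, and the executor takes the parallel path only when no iteration touches a location written by another.

```python
from concurrent.futures import ThreadPoolExecutor


def inspect_independent(write_idx, read_idx):
    """Inspector: True if no iteration touches a location written by another one."""
    writers = {}
    for i, w in enumerate(write_idx):
        if w in writers:               # two iterations write the same element
            return False
        writers[w] = i
    for i, r in enumerate(read_idx):
        j = writers.get(r)
        if j is not None and j != i:   # iteration i reads what iteration j writes
            return False
    return True


def run_loop(a, write_idx, read_idx, workers=4):
    """Executor for: a[write_idx[i]] = a[read_idx[i]] + 1, for each i in order."""
    def body(i):
        a[write_idx[i]] = a[read_idx[i]] + 1

    n = len(write_idx)
    if inspect_independent(write_idx, read_idx):
        # No cross-iteration dependence was found, so any order is safe.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(body, range(n)))
        return "parallel"
    # Fall back to the original sequential order.
    for i in range(n):
        body(i)
    return "sequential"
```

A speculative variant would instead start the parallel path immediately and roll back on a detected conflict; the sketch shows only the simpler inspect-then-execute form.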
Identifying, Quantifying, Extracting and Enhancing Implicit Parallelism
The shift of the microprocessor industry towards multicore architectures has
placed a huge burden on the programmers by requiring explicit parallelization
for performance. Implicit Parallelization is an alternative that could ease the
burden on programmers by parallelizing applications "under the covers" while
maintaining sequential semantics externally. This thesis develops a novel
approach for thinking about parallelism, by casting the problem of
parallelization in terms of instruction criticality. Using this approach,
parallelism in a program region is readily identified when certain conditions
about fetch-criticality are satisfied by the region. The thesis formalizes this
approach by developing a criticality-driven model of task-based
parallelization. The model can accurately predict the parallelism that would be
exposed by potential task choices by capturing a wide set of sources of
parallelism as well as costs to parallelization.
The criticality-driven model enables the development of two key components for
Implicit Parallelization: a task selection policy, and a bottleneck analysis
tool. The task selection policy can partition a single-threaded program into
tasks that will profitably execute concurrently on a multicore architecture in
spite of the costs associated with enforcing data-dependences and with
task-related actions. The bottleneck analysis tool gives feedback to the
programmers about data-dependences that limit parallelism. In particular, there
are several "accidental dependences" that can be easily removed with large
improvements in parallelism. These tools combine into a systematic methodology
for performance tuning in Implicit Parallelization. Finally, armed with the
criticality-driven model, the thesis revisits several architectural design
decisions, and finds several encouraging ways forward to increase the scope of
Implicit Parallelization. Unpublished; not peer reviewed.
Optimize parallel numerical applications for climate modelling
This project evaluates the possible benefits of implementing shared-memory parallelism in the most recent version of the NEMO model, which currently uses only distributed-memory parallelism with MPI. Hybrid parallelizations, which exploit both distributed and shared memory, are generally more efficient. With the release of NEMO 4.2, the latest version, which improves scalability, we want to evaluate the performance of OpenMP for implementing hybrid parallelism, with the aims of improving the model's scalability and preparing it for new cluster architectures, which are tending towards more cores per node.
Programming MPSoC platforms: Road works ahead
This paper summarizes a special session on multicore/multi-processor system-on-chip (MPSoC) programming challenges. The current trend towards MPSoC platforms in most computing domains does not only mean a radical change in computer architecture. Even more important from a SW developer's viewpoint, the classical sequential von Neumann programming model needs to be overcome at the same time. Efficient utilization of MPSoC HW resources demands radically new models and corresponding SW development tools, capable of exploiting the available parallelism and guaranteeing bug-free parallel SW. While several standards are established in the high-performance computing domain (e.g. OpenMP), it is clear that more innovations are required for successful deployment of heterogeneous embedded MPSoC. On the other hand, at least for the coming years, the freedom for disruptive programming technologies is limited by the huge amount of certified sequential code, which demands a more pragmatic, gradual tool and code replacement strategy.
Evaluating the Impact of SDC on the GMRES Iterative Solver
Increasing parallelism and transistor density, along with increasingly
tighter energy and peak power constraints, may force exposure of occasionally
incorrect computation or storage to application codes. Silent data corruption
(SDC) will likely be infrequent, yet one SDC suffices to make numerical
algorithms like iterative linear solvers cease progress towards the correct
answer. Thus, we focus on resilience of the iterative linear solver GMRES to a
single transient SDC. We derive inexpensive checks to detect the effects of an
SDC in GMRES that work for a more general SDC model than presuming a bit flip.
Our experiments show that when GMRES is used as the inner solver of an
inner-outer iteration, it can "run through" SDC of almost any magnitude in the
computationally intensive orthogonalization phase. That is, it gets the right
answer using faulty data without any required roll back. Those SDCs which it
cannot run through are caught by our detection scheme.
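The abstract above does not spell out its checks, so the following Python sketch shows only one plausible form of an inexpensive check; it is an assumption, not necessarily the authors' scheme. In the Arnoldi process underlying GMRES, with unit-norm basis vectors and exact arithmetic, every upper Hessenberg entry satisfies |h_ij| <= ||A||_2 <= ||A||_F, so an entry above the Frobenius norm signals corrupted data. The `fault` argument is a hypothetical hook that overwrites one entry to mimic a single transient SDC.

```python
import math


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def arnoldi_with_check(A, b, steps, fault=None):
    """Arnoldi (modified Gram-Schmidt) that flags entries with |h_ij| > ||A||_F.

    fault: optional tuple (i, j, value), a hypothetical injection hook that
    replaces h_ij by value. Returns the list of flagged (i, j) positions.
    """
    bound = math.sqrt(sum(a * a for row in A for a in row))  # Frobenius norm
    tol = bound * (1.0 + 1e-12)                              # slack for rounding
    beta = math.sqrt(dot(b, b))
    Q = [[x / beta for x in b]]
    flagged = []
    for j in range(steps):
        w = matvec(A, Q[j])
        for i in range(j + 1):
            h = dot(Q[i], w)
            if fault is not None and fault[:2] == (i, j):
                h = fault[2]                   # inject the simulated SDC
            if abs(h) > tol:
                flagged.append((i, j))
            w = [wk - h * qk for wk, qk in zip(w, Q[i])]
        h_next = math.sqrt(dot(w, w))          # subdiagonal entry h_{j+1,j}
        if h_next > tol:
            flagged.append((j + 1, j))
        if h_next == 0.0:                      # breakdown: Krylov space exhausted
            break
        Q.append([wk / h_next for wk in w])
    return flagged
```

The check costs one comparison per Hessenberg entry plus a single norm computed up front. It can only catch corruptions large enough to exceed the bound, which is consistent with the abstract's point that smaller errors may be "run through".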
Towards high-level execution primitives for and-parallelism: preliminary results
Most implementations of parallel logic programming rely on complex low-level machinery which is arguably difficult to implement and modify. We explore an alternative approach aimed at taming that complexity by raising core parts of the implementation to the source language level for the particular case of and-parallelism. Therefore, we handle a significant portion of the parallel implementation mechanism at the Prolog level with the help of a comparatively small number of concurrency-related primitives which take care of lower-level tasks such as locking, thread management, stack set management, etc. The approach does not eliminate altogether modifications to the abstract machine, but it does greatly simplify them and it also facilitates experimenting with different alternatives. We show how this approach allows implementing both restricted and unrestricted (i.e., non fork-join) parallelism. Preliminary experiments show that the amount of performance sacrificed is reasonable, although granularity control is required in some cases. Also, we observe that the availability of unrestricted parallelism contributes to better observed speedups.
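The difference between restricted (fork-join) and unrestricted parallelism described above can be sketched loosely in Python with futures. This is an analogy rather than the paper's Prolog implementation: `publish` and `wait` are hypothetical stand-ins for the kind of small concurrency primitives the abstract mentions, and the thread pool is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

_POOL = ThreadPoolExecutor(max_workers=4)


def publish(goal, *args):
    """Hypothetical primitive: make a goal available for parallel execution."""
    return _POOL.submit(goal, *args)


def wait(handle):
    """Hypothetical primitive: block until a published goal has finished."""
    return handle.result()


def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def fork_join():
    """Restricted form: publish both goals, then join both before continuing."""
    ha = publish(fib, 10)
    hb = publish(fib, 12)
    return wait(ha), wait(hb)


def unrestricted():
    """Non fork-join form: goal c needs only a, so it starts before b is joined."""
    ha = publish(fib, 10)
    hb = publish(fib, 12)
    a = wait(ha)                        # wait for a alone
    hc = publish(lambda: a + 1)         # c overlaps with the still-running b
    return a, wait(hb), wait(hc)
```

Separating the point where a goal is published from the point where its result is awaited is what lets dependent work start early, which is the property the abstract credits for the better observed speedups.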