24,899 research outputs found
Memory performance of and-parallel prolog on shared-memory architectures
The goal of the RAP-WAM AND-parallel Prolog abstract architecture is to provide inference speeds significantly
beyond those of sequential systems, while supporting Prolog semantics and preserving sequential performance and storage efficiency. This paper presents simulation results supporting these claims with special emphasis on memory performance on a two-level sharedmemory multiprocessor organization. Several solutions to the cache coherency problem are analyzed. It is shown that RAP-WAM offers good locality and storage efficiency and that it can effectively take advantage of broadcast caches. It is argued that speeds in excess of 2 ML IPS on real applications exhibiting medium parallelism can be attained with current technology
A hybrid MPI-OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence
A hybrid scheme that utilizes MPI for distributed memory parallelism and
OpenMP for shared memory parallelism is presented. The work is motivated by the
desire to achieve exceptionally high Reynolds numbers in pseudospectral
computations of fluid turbulence on emerging petascale, high core-count,
massively parallel processing systems. The hybrid implementation derives from
and augments a well-tested scalable MPI-parallelized pseudospectral code. The
hybrid paradigm leads to a new picture for the domain decomposition of the
pseudospectral grids, which is helpful in understanding, among other things,
the 3D transpose of the global data that is necessary for the parallel fast
Fourier transforms that are the central component of the numerical
discretizations. Details of the hybrid implementation are provided, and
performance tests illustrate the utility of the method. It is shown that the
hybrid scheme achieves near ideal scalability up to ~20000 compute cores with a
maximum mean efficiency of 83%. Data are presented that demonstrate how to
choose the optimal number of MPI processes and OpenMP threads in order to
optimize code performance on two different platforms.Comment: Submitted to Parallel Computin
Exploring the Performance Benefit of Hybrid Memory System on HPC Environments
Hardware accelerators have become a de-facto standard to achieve high
performance on current supercomputers and there are indications that this trend
will increase in the future. Modern accelerators feature high-bandwidth memory
next to the computing cores. For example, the Intel Knights Landing (KNL)
processor is equipped with 16 GB of high-bandwidth memory (HBM) that works
together with conventional DRAM memory. Theoretically, HBM can provide 5x
higher bandwidth than conventional DRAM. However, many factors impact the
effective performance achieved by applications, including the application
memory access pattern, the problem size, the threading level and the actual
memory configuration. In this paper, we analyze the Intel KNL system and
quantify the impact of the most important factors on the application
performance by using a set of applications that are representative of
scientific and data-analytics workloads. Our results show that applications
with regular memory access benefit from MCDRAM, achieving up to 3x performance
when compared to the performance obtained using only DRAM. On the contrary,
applications with random memory access pattern are latency-bound and may suffer
from performance degradation when using only MCDRAM. For those applications,
the use of additional hardware threads may help hide latency and achieve higher
aggregated bandwidth when using HBM
Hybrid static/dynamic scheduling for already optimized dense matrix factorization
We present the use of a hybrid static/dynamic scheduling strategy of the task
dependency graph for direct methods used in dense numerical linear algebra.
This strategy provides a balance of data locality, load balance, and low
dequeue overhead. We show that the usage of this scheduling in communication
avoiding dense factorization leads to significant performance gains. On a 48
core AMD Opteron NUMA machine, our experiments show that we can achieve up to
64% improvement over a version of CALU that uses fully dynamic scheduling, and
up to 30% improvement over the version of CALU that uses fully static
scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic
scheduling approach is up to 8% faster than the version of CALU that uses a
fully static scheduling or fully dynamic scheduling. Our algorithm leads to
speedups over the corresponding routines for computing LU factorization in well
known libraries. On the 48 core AMD NUMA machine, our best implementation is up
to 110% faster than MKL, while on the 16 core Intel Xeon machine, it is up to
82% faster than MKL. Our approach also shows significant speedups compared with
PLASMA on both of these systems
Exploring Application Performance on Emerging Hybrid-Memory Supercomputers
Next-generation supercomputers will feature more hierarchical and
heterogeneous memory systems with different memory technologies working
side-by-side. A critical question is whether at large scale existing HPC
applications and emerging data-analytics workloads will have performance
improvement or degradation on these systems. We propose a systematic and fair
methodology to identify the trend of application performance on emerging
hybrid-memory systems. We model the memory system of next-generation
supercomputers as a combination of "fast" and "slow" memories. We then analyze
performance and dynamic execution characteristics of a variety of workloads,
from traditional scientific applications to emerging data analytics to compare
traditional and hybrid-memory systems. Our results show that data analytics
applications can clearly benefit from the new system design, especially at
large scale. Moreover, hybrid-memory systems do not penalize traditional
scientific applications, which may also show performance improvement.Comment: 18th International Conference on High Performance Computing and
Communications, IEEE, 201
Basis Token Consistency: A Practical Mechanism for Strong Web Cache Consistency
With web caching and cache-related services like CDNs and edge services playing an increasingly significant role in the modern internet, the problem of the weak consistency and coherence provisions in current web protocols is becoming increasingly significant and drawing the attention of the standards community [LCD01]. Toward this end, we present definitions of consistency and coherence for web-like environments, that is, distributed client-server information systems where the semantics of interactions with resource are more general than the read/write operations found in memory hierarchies and distributed file systems. We then present a brief review of proposed mechanisms which strengthen the consistency of caches in the web, focusing upon their conceptual contributions and their weaknesses in real-world practice. These insights motivate a new mechanism, which we call "Basis Token Consistency" or BTC; when implemented at the server, this mechanism allows any client (independent of the presence and conformity of any intermediaries) to maintain a self-consistent view of the server's state. This is accomplished by annotating responses with additional per-resource application information which allows client caches to recognize the obsolescence of currently cached entities and identify responses from other caches which are already stale in light of what has already been seen. The mechanism requires no deviation from the existing client-server communication model, and does not require servers to maintain any additional per-client state. We discuss how our mechanism could be integrated into a fragment-assembling Content Management System (CMS), and present a simulation-driven performance comparison between the BTC algorithm and the use of the Time-To-Live (TTL) heuristic.National Science Foundation (ANI-9986397, ANI-0095988
- …