Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study
Analytic, first-principles performance modeling of distributed-memory
applications is difficult due to a wide spectrum of random disturbances caused
by the application and the system. These disturbances (commonly called "noise")
destroy the assumptions of regularity that one usually employs when
constructing simple analytic models. Despite numerous efforts to quantify,
categorize, and reduce such effects, a comprehensive quantitative understanding
of their performance impact is not available, especially for long delays that
have global consequences for the parallel application. In this work, we
investigate various traces collected from synthetic benchmarks that mimic real
applications on simulated and real message-passing systems in order to pinpoint
the mechanisms behind delay propagation. We analyze the dependence of the
propagation speed of idle waves emanating from injected delays with respect to
the execution and communication properties of the application, study how such
delays decay under increased noise levels, and how they interact with each
other. We also show how fine-grained noise can make a system immune against the
adverse effects of propagating idle waves. Our results contribute to a better
understanding of the collective phenomena that manifest themselves in
distributed-memory parallel applications.
Comment: 10 pages, 9 figures; title change
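As a rough illustration of how such an idle wave travels (a toy model with made-up
parameters, not the benchmarks or traces used in the paper), the following sketch
assumes a bulk-synchronous code with blocking nearest-neighbor exchange on a ring:
a rank cannot start an iteration before it and both neighbors have finished the
previous one, so a one-off delay injected on a single rank shows up as idle time
on its neighbors in the next iteration and moves outward by one rank per iteration.

    # Toy model of idle-wave propagation in a bulk-synchronous nearest-neighbor code.
    # Illustrative sketch only: rank count, work time, and the injected delay are
    # arbitrary assumptions, not parameters from the paper.

    P = 16            # number of MPI-like ranks (assumption)
    ITER = 12         # iterations to simulate
    WORK = 1.0        # nominal compute time per iteration
    DELAY = 4.0       # one-off delay injected on one rank at iteration 0

    finish = [0.0] * P                # finish time of the previous iteration per rank

    for k in range(ITER):
        new_finish = [0.0] * P
        idle = [0.0] * P
        for r in range(P):
            # A rank may start iteration k only after it and both ring neighbors
            # have finished iteration k-1 (blocking nearest-neighbor exchange).
            ready = max(finish[r], finish[(r - 1) % P], finish[(r + 1) % P])
            idle[r] = ready - finish[r]          # time spent waiting for neighbors
            extra = DELAY if (k == 0 and r == P // 2) else 0.0
            new_finish[r] = ready + WORK + extra
        print(f"iter {k:2d} idle:", " ".join(f"{x:4.1f}" for x in idle))
        finish = new_finish

In this noiseless toy model the wave travels undamped; the decay studied in the
paper only appears once background noise is added.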
Parallelizing Windowed Stream Joins in a Shared-Nothing Cluster
The availability of a large number of processing nodes in a parallel and
distributed computing environment enables sophisticated real-time processing
over high-speed data streams, as required by many emerging applications.
Sliding-window stream joins are among the most important operators in a stream
processing system. In this paper, we consider the issue of parallelizing a
sliding-window stream join operator over a shared-nothing cluster. We propose a
framework, based on a fixed or predefined communication pattern, to distribute
the join processing load over the shared-nothing cluster. We consider various
overheads that arise while scaling over a large number of nodes, and propose
solution methodologies to cope with them. We implement the algorithm over a
cluster using a message-passing system, and present experimental results
showing the effectiveness of the join processing algorithm.
Comment: 11 pages
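As a minimal sketch of the partitioning idea behind such a framework (illustrative
only; the node count, window length, and routing scheme below are assumptions, not
the paper's algorithm), tuples from both streams can be hash-partitioned on the
join key so that matching tuples meet on the same node and are joined against that
node's sliding windows:

    import collections

    NUM_NODES = 4                      # assumed cluster size for the sketch
    WINDOW = 10.0                      # sliding-window length in seconds (assumption)

    # Per-node state: one window (deque of (timestamp, key, payload)) per stream.
    windows = [{"R": collections.deque(), "S": collections.deque()}
               for _ in range(NUM_NODES)]

    def route(key):
        """Hash-partition on the join key so R- and S-tuples with equal keys
        land on the same node (content-sensitive routing)."""
        return hash(key) % NUM_NODES

    def expire(window, now):
        # Drop tuples that have fallen out of the sliding window.
        while window and now - window[0][0] > WINDOW:
            window.popleft()

    def process(stream, ts, key, payload):
        node = route(key)
        own = windows[node][stream]
        other = windows[node]["S" if stream == "R" else "R"]
        expire(own, ts); expire(other, ts)
        matches = [(payload, p) for (t, k, p) in other if k == key]
        own.append((ts, key, payload))
        return node, matches

    # Usage: interleave tuples from the two streams R and S.
    print(process("R", 1.0, "user42", "r1"))   # no match yet
    print(process("S", 2.0, "user42", "s1"))   # joins with r1 on the same node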
A Non-blocking Buddy System for Scalable Memory Allocation on Multi-core Machines
Common implementations of core memory-allocation components handle concurrent allocation/release requests by synchronizing threads via spin-locks. This approach does not scale well to large thread counts, a problem that has been addressed in the literature by introducing layered allocation services or by replicating the core allocators, the bottom-most ones within the layered architecture. Both these solutions tend to reduce the pressure of actual concurrent accesses to each individual core allocator. In this article we explore an alternative approach to scalable memory allocation/release, which can still be combined with those literature proposals. We present a fully non-blocking buddy system that allows threads to proceed in parallel and commit their allocations/releases unless a conflict materializes while handling its metadata. Beyond improving scalability and performance, it is resilient to performance degradation in the face of concurrent accesses, independently of the current level of fragmentation of the handled memory blocks.
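For background on the metadata such an allocator manipulates (a generic buddy-system
sketch, not the paper's non-blocking implementation), memory is viewed as a binary
tree of blocks whose sizes halve at each level, and allocation/release reduce to
simple index arithmetic on that tree; the paper's contribution, only hinted at in
the comments below, is to update this per-block state with atomic compare-and-swap
operations instead of spin-locks:

    # Generic buddy-system index arithmetic (illustrative sketch; block sizes and
    # tree depth are assumptions).

    MIN_BLOCK = 4096        # smallest allocatable block in bytes (assumption)
    LEVELS = 10             # tree depth => largest block = MIN_BLOCK * 2**(LEVELS-1)

    def level_for(size):
        """Smallest-block level that fits the request (level 0 = largest blocks)."""
        lvl, block = LEVELS - 1, MIN_BLOCK
        while block < size and lvl > 0:
            block *= 2
            lvl -= 1
        return lvl

    def buddy(index):
        """Sibling block that can be coalesced with `index` on release."""
        return index ^ 1

    def parent(index):
        """Block one level up, obtained by coalescing index with its buddy."""
        return index // 2

    # With 1-based heap-style numbering, level L spans indices 2**L .. 2**(L+1)-1.
    # A non-blocking allocator would claim a free index with an atomic
    # compare-and-swap on the block's status word and retry on conflict,
    # rather than taking a lock around the whole operation.
    idx = 2 ** level_for(6000)          # first candidate index at the chosen level
    print(level_for(6000), idx, buddy(idx), parent(idx))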
LIKWID Monitoring Stack: A flexible framework enabling job specific performance monitoring for the masses
System monitoring is an established tool to measure the utilization and
health of HPC systems. Usually, system monitoring infrastructures make no
connection to job information and do not utilize hardware performance
monitoring (HPM) data. To increase the efficient use of HPC systems, automatic
and continuous performance monitoring of jobs is an essential component. It can
help to identify pathological cases, provide instant performance feedback to
the users, offer initial data to judge the optimization potential of
applications, and help to build a statistical foundation about
application-specific system usage. The LIKWID monitoring stack is a modular
framework built on top of the LIKWID tools library. It aims at enabling
job-specific performance monitoring using HPM data, system metrics, and
application-level data for small to medium-sized commodity clusters. Moreover,
it is designed to integrate into existing monitoring infrastructures to speed
up the change from pure system monitoring to job-aware monitoring.
Comment: 4 pages, 4 figures. Accepted for HPCMASPA 2017, the Workshop on
Monitoring and Analysis for High Performance Computing Systems Plus
Applications, held in conjunction with IEEE Cluster 2017, Honolulu, HI,
September 5, 2017
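To make the idea of job-aware monitoring concrete, here is a minimal collector
sketch (the metric source, job-ID handling, and output format are assumptions for
illustration, not the LIKWID monitoring stack's actual interfaces); the essential
point is that every sample is tagged with the job ID so system-level data can
later be aggregated per job:

    import json, socket, time

    def read_cpu_busy_fraction(interval=0.5):
        """Rough CPU utilization from /proc/stat (Linux); stands in for the
        hardware-performance-monitoring data a real stack would collect."""
        def snapshot():
            with open("/proc/stat") as f:
                fields = [int(x) for x in f.readline().split()[1:]]
            return fields[3], sum(fields)          # (idle ticks, total ticks)
        i0, t0 = snapshot(); time.sleep(interval); i1, t1 = snapshot()
        return 1.0 - (i1 - i0) / max(1, (t1 - t0))

    def collect_sample(job_id):
        """One job-tagged monitoring sample; in a real deployment this would be
        pushed to a time-series backend instead of printed."""
        return {
            "host": socket.gethostname(),
            "job_id": job_id,                 # ties the sample to the batch job
            "timestamp": time.time(),
            "cpu_busy": read_cpu_busy_fraction(),
        }

    if __name__ == "__main__":
        # The job ID would normally come from the batch system's environment.
        print(json.dumps(collect_sample(job_id="12345"), indent=2))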
Cluster algebras arising from cluster tubes
We study the cluster algebras arising from cluster tubes with rank bigger
than . Cluster tubes are Calabi-Yau triangulated categories which contain no
cluster-tilting objects, but do contain maximal rigid objects. Fix a certain
maximal rigid object in the cluster tube of rank . For any indecomposable
rigid object in , we define an analogue of the Caldero-Chapoton formula (or
Palu's cluster character formula) by using the geometric information of .
We show that these characters satisfy the mutation formula when the objects
form an exchange pair, and that the construction gives a bijection from the
set of indecomposable rigid objects in to the set of cluster variables of
the cluster algebra of type , which induces a bijection between the set of
basic maximal rigid objects in and the set of clusters. This strengthens a
surprising result proved recently by Buan-Marsh-Vatne that the combinatorics
of maximal rigid objects in the cluster tube encode the combinatorics of the
cluster algebra of type , since the combinatorics of cluster algebras of
type or of type are the same by a result of Fomin and Zelevinsky. As a
consequence, we give a categorification of cluster algebras of type .
Comment: 21 pages; title changed, proof of the main theorem in Section 3
rewritten, Section 5 added; final version to appear in J. London Math. Soc.
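For orientation (not taken from the paper), the classical Caldero-Chapoton
character that this construction generalizes assigns to a module M with dimension
vector d = (d_i) over the path algebra of an acyclic quiver the Laurent polynomial

\[
X_M \;=\; \frac{1}{\prod_i x_i^{d_i}}
\sum_{\underline{e}} \chi\bigl(\mathrm{Gr}_{\underline{e}}(M)\bigr)
\prod_i x_i^{\,\sum_{j \to i} e_j \;+\; \sum_{i \to j} (d_j - e_j)},
\]

where $\mathrm{Gr}_{\underline{e}}(M)$ is the quiver Grassmannian of submodules of
$M$ with dimension vector $\underline{e}$ and $\chi$ denotes the Euler
characteristic. The paper's analogue replaces this module-theoretic input with
geometric information attached to rigid objects in the cluster tube, where no
cluster-tilting object is available.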
Kang-Redner Anomaly in Cluster-Cluster Aggregation
The large-time, small-mass asymptotic behavior of the average mass
distribution $\pb$ is studied in a -dimensional system of diffusing
aggregating particles for . By means of both a renormalization group
computation as well as a direct re-summation of leading terms in the
small reaction-rate expansion of the average mass distribution, it is shown
that $\pb \sim \frac{1}{t^d}\left(\frac{m^{1/d}}{\sqrt{t}}\right)^{e_{KR}}$
for , where and . In two dimensions, it is shown that
$\pb \sim \frac{\ln(m)\,\ln(t)}{t^2}$ for . Numerical simulations in two
dimensions supporting the analytical results are also presented.
Comment: 11 pages, 6 figures, RevTeX
Globular Cluster Formation in the Virgo Cluster
Metal-poor globular clusters (MPGCs) are a unique probe of the early
universe, in particular the reionization era. Systems of globular clusters in
galaxy clusters are particularly interesting as it is in the progenitors of
galaxy clusters that the earliest reionizing sources first formed. Although the
exact physical origin of globular clusters is still debated, it is generally
accepted that globular clusters form in early, rare dark matter peaks (Moore et
al. 2006; Boley et al. 2009). We provide a fully numerical analysis of the
Virgo cluster globular cluster system by identifying the present-day globular
cluster system with exactly such early, rare dark matter peaks. A popular
hypothesis is that the observed truncation of blue, metal-poor globular
cluster formation is due to reionization (Spitler et al. 2012; Boley et al.
2009; Brodie & Strader 2006); adopting this view, constraining the formation
epoch of MPGCs provides a complementary constraint on the epoch of
reionization. By analyzing both the line-of-sight velocity dispersion and the
surface density distribution of the present-day distribution, we are able to
constrain the redshift and mass of the dark matter peaks. We find and quantify
a dependence of these quantities on the chosen line of sight, whose strength
varies with redshift, and, coupled with star formation efficiency arguments,
find a best-fitting formation mass and redshift of and . We predict
intracluster MPGCs in the Virgo cluster. Our results confirm the techniques
pioneered by Moore et al. (2006) when applied to the Virgo cluster and extend
and refine the analytic results of Spitler et al. (2012) numerically.
Comment: 13 pages, 13 figures, submitted to MNRAS
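For concreteness, the line-of-sight velocity dispersion referred to above is simply
the spread of velocities projected onto a chosen viewing direction; a minimal
sketch with mock particle velocities (illustrative values only, not the paper's
data) is:

    import numpy as np

    rng = np.random.default_rng(0)
    velocities = rng.normal(scale=500.0, size=(1000, 3))   # mock 3D velocities, km/s

    def los_velocity_dispersion(velocities, line_of_sight):
        """Standard deviation of velocities projected onto the unit line-of-sight
        vector; varying `line_of_sight` probes the orientation dependence the
        abstract refers to."""
        n = np.asarray(line_of_sight, dtype=float)
        n /= np.linalg.norm(n)
        v_los = velocities @ n
        return v_los.std()

    print(los_velocity_dispersion(velocities, [0, 0, 1]))   # z-axis viewing direction
    print(los_velocity_dispersion(velocities, [1, 1, 0]))   # a different sight line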
Towards a Mini-App for Smoothed Particle Hydrodynamics at Exascale
The smoothed particle hydrodynamics (SPH) technique is a purely Lagrangian
method, used in numerical simulations of fluids in astrophysics and
computational fluid dynamics, among many other fields. SPH simulations with
detailed physics represent computationally-demanding calculations. The
parallelization of SPH codes is not trivial due to the absence of a structured
grid. Additionally, the performance of the SPH codes can be, in general,
adversely impacted by several factors, such as multiple time-stepping,
long-range interactions, and/or boundary conditions. This work presents
insights into the current performance and functionalities of three SPH codes:
SPHYNX, ChaNGa, and SPH-flow. These codes are the starting point of an
interdisciplinary co-design project, SPH-EXA, for the development of an
Exascale-ready SPH mini-app. To gain such insights, a rotating square patch
test was implemented as a common test simulation for the three SPH codes and
analyzed on two modern HPC systems. Furthermore, to stress the differences with
the codes stemming from the astrophysics community (SPHYNX and ChaNGa), an
additional test case, the Evrard collapse, has also been carried out. This work
distills the common basic SPH features of the three codes for the purpose
of consolidating them into a pure-SPH, Exascale-ready, optimized mini-app.
Moreover, the outcome of this effort serves as direct feedback to the parent
codes, helping to improve their performance and overall scalability.
Comment: 18 pages, 4 figures, 5 tables, 2018 IEEE International Conference on
Cluster Computing proceedings for WRAp1
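To illustrate the basic SPH operation that all three codes share (a generic
textbook sketch with a standard cubic-spline kernel, not code taken from SPHYNX,
ChaNGa, or SPH-flow), particle densities are obtained by summing neighbor masses
weighted by a smoothing kernel:

    import numpy as np

    def cubic_spline_W(r, h):
        """Standard 3D cubic-spline smoothing kernel with support 2h."""
        sigma = 1.0 / (np.pi * h**3)
        q = r / h
        w = np.where(q < 1.0, 1.0 - 1.5*q**2 + 0.75*q**3,
            np.where(q < 2.0, 0.25*(2.0 - q)**3, 0.0))
        return sigma * w

    def sph_density(positions, masses, h):
        """Brute-force density summation rho_i = sum_j m_j W(|r_i - r_j|, h).
        Real codes use neighbor search (trees/cells) instead of the O(N^2) loop."""
        diff = positions[:, None, :] - positions[None, :, :]
        r = np.linalg.norm(diff, axis=-1)
        return (masses[None, :] * cubic_spline_W(r, h)).sum(axis=1)

    # Tiny usage example with random particles in a unit box.
    rng = np.random.default_rng(1)
    pos = rng.random((200, 3))
    m = np.full(200, 1.0 / 200)
    print(sph_density(pos, m, h=0.1)[:5])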
A runtime heuristic to selectively replicate tasks for application-specific reliability targets
In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification or recompilation of the OS, compiler, or application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, the results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.
This work was supported by an FI-DGR 2013 scholarship and the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-Blanc 2 project (www.montblanc-project.eu), grant agreement no. 610402, and in part by the European Union (FEDER funds) under contract TIN2015-65316-P.
Peer reviewed. Postprint (author's final draft).
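The general shape of such a selective-replication decision can be sketched as
follows (a simple greedy heuristic with made-up failure probabilities, not the
actual App_FIT selection logic): tasks are replicated in decreasing order of
failure probability until the estimated application-level reliability reaches
the target.

    import math

    def select_tasks_to_replicate(task_fail_probs, target_reliability):
        """Greedy sketch: replicate the most failure-prone tasks first until the
        estimated probability that every task (or its replica) succeeds reaches
        the target. Assumes independent failures and that a replicated task fails
        only if both copies fail (p -> p**2)."""
        probs = dict(task_fail_probs)
        replicated = set()

        def app_reliability():
            return math.prod(1.0 - p for p in probs.values())

        for task in sorted(probs, key=probs.get, reverse=True):
            if app_reliability() >= target_reliability:
                break
            probs[task] = probs[task] ** 2      # model the effect of replication
            replicated.add(task)
        return replicated, app_reliability()

    # Usage with made-up per-task failure probabilities.
    tasks = {f"t{i}": p for i, p in enumerate([1e-3, 5e-4, 2e-3, 1e-4, 8e-4])}
    chosen, achieved = select_tasks_to_replicate(tasks, target_reliability=0.999)
    print(sorted(chosen), round(achieved, 6))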
