6,549 research outputs found
Simulating whole supercomputer applications
Architecture simulation tools are extremely useful not only to predict the performance of future system designs, but also to analyze and improve the performance of software running on well know architectures. However, since power and complexity issues stopped the progress of single-thread performance, simulation speed no longer scales with technology: systems get larger and faster, but simulators do not get any faster. Detailed simulation of full-scale applications running on large clusters with hundreds or thousands of processors is not feasible.
In this paper we present a methodology that allows detailed simulation of large-scale MPI applications running on systems with thousands of processors with low resource cost. Our methodology allows detailed processor simulation, from the memory and cache hierarchy down to the functional units and the pipeline structure. This feature enables software performance analysis beyond what performance counters would allow. In addition, it enables performance prediction targeting non-existent architectures and systems, that is, systems for which no performance data can be used as a reference.
For example, detailed analysis of the weather forecasting application WRF reveals that it is highly optimized for cache locality, and is strongly compute bound, with faster functional units having the greatest impact on its performance. Also, analysis of next-generation CMP clusters show that performance may start to decline beyond 8 processors per chip due to shared resource contention, regardless of the benefits of through-memory communication.Postprint (published version
Performance analysis of direct N-body algorithms for astrophysical simulations on distributed systems
We discuss the performance of direct summation codes used in the simulation
of astrophysical stellar systems on highly distributed architectures. These
codes compute the gravitational interaction among stars in an exact way and
have an O(N^2) scaling with the number of particles. They can be applied to a
variety of astrophysical problems, like the evolution of star clusters, the
dynamics of black holes, the formation of planetary systems, and cosmological
simulations. The simulation of realistic star clusters with sufficiently high
accuracy cannot be performed on a single workstation but may be possible on
parallel computers or grids. We have implemented two parallel schemes for a
direct N-body code and we study their performance on general purpose parallel
computers and large computational grids. We present the results of timing
analyzes conducted on the different architectures and compare them with the
predictions from theoretical models. We conclude that the simulation of star
clusters with up to a million particles will be possible on large distributed
computers in the next decade. Simulating entire galaxies however will in
addition require new hybrid methods to speedup the calculation.Comment: 22 pages, 8 figures, accepted for publication in Parallel Computin
Application of Supercomputer Technologies for Simulation of Socio-Economic Systems
To date, an extensive experience has been accumulated in investigation of problems related to quality, assessment of management systems, modeling of economic system sustainability. The studies performed have created a basis for formation of a new research area — Economics of Quality. Its tools allow to use opportunities of model simulation for construction of the mathematical models adequately reflecting the role of quality in natural, technical, social regularities of functioning of the complex socioeconomic systems. Extensive application and development of models, and also system modeling with use of supercomputer technologies, on our deep belief, will bring the conducted researches of social and economic systems to essentially new level. Moreover, the current scientific research makes a significant contribution to model simulation of multi-agent social systems and that isn’t less important, it belongs to the priority areas in development of science and technology in our country. This article is devoted to the questions of supercomputer technologies application in public sciences, first of all, — regarding technical realization of the large-scale agent-focused models (AFM). The essence of this tool is that owing to increase in power of computers it became possible to describe the behavior of many separate fragments of a difficult system, as social and economic systems represent. The article also deals with the experience of foreign scientists and practicians in launching the AFM on supercomputers, and also the example of AFM developed in CEMI RAS, stages and methods of effective calculating kernel display of multi-agent system on architecture of a modern supercomputer will be analyzed. The experiments on the basis of model simulation on forecasting the population of St. Petersburg according to three scenarios as one of the major factors influencing the development of social and economic system and quality of life of the population are presented in the conclusion
Idle Period Propagation in Message-Passing Applications
Idle periods on different processes of Message Passing applications are
unavoidable. While the origin of idle periods on a single process is well
understood as the effect of system and architectural random delays, yet it is
unclear how these idle periods propagate from one process to another. It is
important to understand idle period propagation in Message Passing applications
as it allows application developers to design communication patterns avoiding
idle period propagation and the consequent performance degradation in their
applications. To understand idle period propagation, we introduce a methodology
to trace idle periods when a process is waiting for data from a remote delayed
process in MPI applications. We apply this technique in an MPI application that
solves the heat equation to study idle period propagation on three different
systems. We confirm that idle periods move between processes in the form of
waves and that there are different stages in idle period propagation. Our
methodology enables us to identify a self-synchronization phenomenon that
occurs on two systems where some processes run slower than the other processes.Comment: 18th International Conference on High Performance Computing and
Communications, IEEE, 201
The Brain on Low Power Architectures - Efficient Simulation of Cortical Slow Waves and Asynchronous States
Efficient brain simulation is a scientific grand challenge, a
parallel/distributed coding challenge and a source of requirements and
suggestions for future computing architectures. Indeed, the human brain
includes about 10^15 synapses and 10^11 neurons activated at a mean rate of
several Hz. Full brain simulation poses Exascale challenges even if simulated
at the highest abstraction level. The WaveScalES experiment in the Human Brain
Project (HBP) has the goal of matching experimental measures and simulations of
slow waves during deep-sleep and anesthesia and the transition to other brain
states. The focus is the development of dedicated large-scale
parallel/distributed simulation technologies. The ExaNeSt project designs an
ARM-based, low-power HPC architecture scalable to million of cores, developing
a dedicated scalable interconnect system, and SWA/AW simulations are included
among the driving benchmarks. At the joint between both projects is the INFN
proprietary Distributed and Plastic Spiking Neural Networks (DPSNN) simulation
engine. DPSNN can be configured to stress either the networking or the
computation features available on the execution platforms. The simulation
stresses the networking component when the neural net - composed by a
relatively low number of neurons, each one projecting thousands of synapses -
is distributed over a large number of hardware cores. When growing the number
of neurons per core, the computation starts to be the dominating component for
short range connections. This paper reports about preliminary performance
results obtained on an ARM-based HPC prototype developed in the framework of
the ExaNeSt project. Furthermore, a comparison is given of instantaneous power,
total energy consumption, execution time and energetic cost per synaptic event
of SWA/AW DPSNN simulations when executed on either ARM- or Intel-based server
platforms
- …