105,989 research outputs found
Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft
Analytic performance models are essential for understanding the performance
characteristics of loop kernels, which consume a major part of CPU cycles in
computational science. Starting from a validated performance model one can
infer the relevant hardware bottlenecks and promising optimization
opportunities. Unfortunately, analytic performance modeling is often tedious
even for experienced developers since it requires in-depth knowledge about the
hardware and how it interacts with the software. We present the "Kerncraft"
tool, which eases the construction of analytic performance models for streaming
kernels and stencil loop nests. Starting from the loop source code, the problem
size, and a description of the underlying hardware, Kerncraft can ideally
predict the single-core performance and scaling behavior of loops on multicore
processors using the Roofline or the Execution-Cache-Memory (ECM) model. We
describe the operating principles of Kerncraft with its capabilities and
limitations, and we show how it may be used to quickly gain insights by
accelerated analytic modeling.Comment: 11 pages, 4 figures, 8 listing
Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels
Achieving optimal program performance requires deep insight into the
interaction between hardware and software. For software developers without an
in-depth background in computer architecture, understanding and fully utilizing
modern architectures is close to impossible. Analytic loop performance modeling
is a useful way to understand the relevant bottlenecks of code execution based
on simple machine models. The Roofline Model and the Execution-Cache-Memory
(ECM) model are proven approaches to performance modeling of loop nests. In
comparison to the Roofline model, the ECM model can also describes the
single-core performance and saturation behavior on a multicore chip. We give an
introduction to the Roofline and ECM models, and to stencil performance
modeling using layer conditions (LC). We then present Kerncraft, a tool that
can automatically construct Roofline and ECM models for loop nests by
performing the required code, data transfer, and LC analysis. The layer
condition analysis allows to predict optimal spatial blocking factors for loop
nests. Together with the models it enables an ab-initio estimate of the
potential benefits of loop blocking optimizations and of useful block sizes. In
cases where LC analysis is not easily possible, Kerncraft supports a cache
simulator as a fallback option. Using a 25-point long-range stencil we
demonstrate the usefulness and predictive power of the Kerncraft tool.Comment: 22 pages, 5 figure
ADF95: Tool for automatic differentiation of a FORTRAN code designed for large numbers of independent variables
ADF95 is a tool to automatically calculate numerical first derivatives for
any mathematical expression as a function of user defined independent
variables. Accuracy of derivatives is achieved within machine precision. ADF95
may be applied to any FORTRAN 77/90/95 conforming code and requires minimal
changes by the user. It provides a new derived data type that holds the value
and derivatives and applies forward differencing by overloading all FORTRAN
operators and intrinsic functions. An efficient indexing technique leads to a
reduced memory usage and a substantially increased performance gain over other
available tools with operator overloading. This gain is especially pronounced
for sparse systems with large number of independent variables. A wide class of
numerical simulations, e.g., those employing implicit solvers, can profit from
ADF95.Comment: 24 pages, 2 figures, 4 tables, accepted in Computer Physics
Communication
Transparently Mixing Undo Logs and Software Reversibility for State Recovery in Optimistic PDES
The rollback operation is a fundamental building block to support the correct execution of a speculative Time Warp-based Parallel Discrete Event Simulation. In the literature, several solutions to reduce the execution cost of this operation have been proposed, either based on the creation of a checkpoint of previous simulation state images, or on the execution of negative copies of simulation events which are able to undo the updates on the state. In this paper, we explore the practical design and implementation of a state recoverability technique which allows to restore a previous simulation state either relying on checkpointing or on the reverse execution of the state updates occurred while processing events in forward mode. Differently from other proposals, we address the issue of executing backward updates in a fully-transparent and event granularity-independent way, by relying on static software instrumentation (targeting the x86 architecture and Linux systems) to generate at runtime reverse update code blocks (not to be confused with reverse events, proper of the reverse computing approach). These are able to undo the effects of a forward execution while minimizing the cost of the undo operation. We also present experimental results related to our implementation, which is released as free software and fully integrated into the open source ROOT-Sim (ROme OpTimistic Simulator) package. The experimental data support the viability and effectiveness of our proposal
Matching bias in syllogistic reasoning: Evidence for a dual-process account from response times and confidence ratings
We examined matching bias in syllogistic reasoning by analysing response times, confidence ratings, and individual differences. Robertsâ (2005) ânegations paradigmâ was used to generate conflict between the surface features of problems and the logical status of conclusions. The experiment replicated matching bias effects in conclusion evaluation (Stupple & Waterhouse, 2009), revealing increased processing times for matching/logic âconflict problemsâ. Results paralleled chronometric evidence from the belief bias paradigm indicating that logic/belief conflict problems take longer to process than non-conflict problems (Stupple, Ball, Evans, & Kamal-Smith, 2011). Individualsâ response times for conflict problems also showed patterns of association with the degree of overall normative responding. Acceptance rates, response times, metacognitive confidence judgements, and individual differences all converged in supporting dual-process theory. This is noteworthy because dual-process predictions about heuristic/analytic conflict in syllogistic reasoning generalised from the belief bias paradigm to a situation where matching features of conclusions, rather than beliefs, were set in opposition to logic
Tuning the Level of Concurrency in Software Transactional Memory: An Overview of Recent Analytical, Machine Learning and Mixed Approaches
Synchronization transparency offered by Software Transactional Memory (STM) must not come at the expense of run-time efficiency, thus demanding from the STM-designer the inclusion of mechanisms properly oriented to performance and other quality indexes. Particularly, one core issue to cope with in STM is related to exploiting parallelism while also avoiding thrashing phenomena due to excessive transaction rollbacks, caused by excessively high levels of contention on logical resources, namely concurrently accessed data portions. A means to address run-time efficiency consists in dynamically determining the best-suited level of concurrency (number of threads) to be employed for running the application (or specific application phases) on top of the STM layer. For too low levels of concurrency, parallelism can be hampered. Conversely, over-dimensioning the concurrency level may give rise to the aforementioned thrashing phenomena caused by excessive data contentionâan aspect which has reflections also on the side of reduced energy-efficiency. In this chapter we overview a set of recent techniques aimed at building âapplication-specificâ performance models that can be exploited to dynamically tune the level of concurrency to the best-suited value. Although they share some base concepts while modeling the system performance vs the degree of concurrency, these techniques rely on disparate methods, such as machine learning or analytic methods (or combinations of the two), and achieve different tradeoffs in terms of the relation between the precision of the performance model and the latency for model instantiation. Implications of the different tradeoffs in real-life scenarios are also discussed
- âŠ