3,979 research outputs found
SPAM : a multiprocessor execution driven simulation Kernel
Trace driven simulation is a well known technique for performance evaluation of single processor computers. However, trace driven simulation introduces distortions when used to simulate multiprocessor architectures. Execution driven simulation is the only technique that gives accurate simulation results for multiprocessor architectures though it is difficult to implement. Thisd paper presents SPAM, a simulation kernel that simplifies the construction of execution driven simulators for shared memory multiprocessors. The kernel provides a tracing tool and a set of primitives which allow the execution, tracing and simulation of shared memory parallel applications on a single processor computer. The performance of the kernel allows the simulation of real sized parallel applications in a reasonnable time
Adaptive runtime-assisted block prefetching on chip-multiprocessors
Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well studied technique used to alleviate this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Among software controlled prefetching we find a wide variety of schemes, including runtime-directed prefetching and more specifically runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speed up of 32 and 10 % on average compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction of up to 18 and 3 % on average in energy-to-solution.Peer ReviewedPostprint (author's final draft
Performance Debugging and Tuning using an Instruction-Set Simulator
Instruction-set simulators allow programmers a detailed level of insight into,
and control over, the execution of a program, including parallel programs and
operating systems. In principle, instruction set simulation can model any
target computer and gather any statistic. Furthermore, such simulators are
usually portable, independent of compiler tools, and deterministic-allowing
bugs to be recreated or measurements repeated. Though often viewed as being
too slow for use as a general programming tool, in the last several years
their performance has improved considerably.
We describe SIMICS, an instruction set simulator of SPARC-based
multiprocessors developed at SICS, in its rĂ´le as a general programming tool.
We discuss some of the benefits of using a tool such as SIMICS to support
various tasks in software engineering, including debugging, testing, analysis,
and performance tuning. We present in some detail two test cases, where we've
used SimICS to support analysis and performance tuning of two applications,
Penny and EQNTOTT. This work resulted in improved parallelism in, and
understanding of, Penny, as well as a performance improvement for EQNTOTT of
over a magnitude. We also present some early work on analyzing SPARC/Linux,
demonstrating the ability of tools like SimICS to analyze operating systems
Cycle Accurate Energy and Throughput Estimation for Data Cache
Resource optimization in energy constrained real-time adaptive embedded systems highly depends on accurate energy and throughput estimates of processor peripherals. Such applications require lightweight, accurate mathematical models to profile energy and timing requirements on the go. This paper presents enhanced mathematical models for data cache energy and throughput estimation. The energy and throughput models were found to be within 95% accuracy of per instruction energy model of a processor, and a full system simulator?s timing model respectively. Furthermore, the possible application of these models in various scenarios is discussed in this paper
Estimating the Potential Speedup of Computer Vision Applications on Embedded Multiprocessors
Computer vision applications constitute one of the key drivers for embedded
multicore architectures. Although the number of available cores is increasing
in new architectures, designing an application to maximize the utilization of
the platform is still a challenge. In this sense, parallel performance
prediction tools can aid developers in understanding the characteristics of an
application and finding the most adequate parallelization strategy. In this
work, we present a method for early parallel performance estimation on embedded
multiprocessors from sequential application traces. We describe its
implementation in Parana, a fast trace-driven simulator targeting OpenMP
applications on the STMicroelectronics' STxP70 Application-Specific
Multiprocessor (ASMP). Results for the FAST key point detector application show
an error margin of less than 10% compared to the reference cycle-approximate
simulator, with lower modeling effort and up to 20x faster execution time.Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and
Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241
Formal Modelling, Testing and Verification of HSA Memory Models using Event-B
The HSA Foundation has produced the HSA Platform System Architecture
Specification that goes a long way towards addressing the need for a clear and
consistent method for specifying weakly consistent memory. HSA is specified in
a natural language which makes it open to multiple ambiguous interpretations
and could render bugs in implementations of it in hardware and software. In
this paper we present a formal model of HSA which can be used in the
development and verification of both concurrent software applications as well
as in the development and verification of the HSA-compliant platform itself. We
use the Event-B language to build a provably correct hierarchy of models from
the most abstract to a detailed refinement of HSA close to implementation
level. Our memory models are general in that they represent an arbitrary number
of masters, programs and instruction interleavings. We reason about such
general models using refinements. Using Rodin tool we are able to model and
verify an entire hierarchy of models using proofs to establish that each
refinement is correct. We define an automated validation method that allows us
to test baseline compliance of the model against a suite of published HSA
litmus tests. Once we complete model validation we develop a coverage driven
method to extract a richer set of tests from the Event-B model and a user
specified coverage model. These tests are used for extensive regression testing
of hardware and software systems. Our method of refinement based formal
modelling, baseline compliance testing of the model and coverage driven test
extraction using the single language of Event-B is a new way to address a key
challenge facing the design and verification of multi-core systems.Comment: 9 pages, 10 figure
Principles for problem aggregation and assignment in medium scale multiprocessors
One of the most important issues in parallel processing is the mapping of workload to processors. This paper considers a large class of problems having a high degree of potential fine grained parallelism, and execution requirements that are either not predictable, or are too costly to predict. The main issues in mapping such a problem onto medium scale multiprocessors are those of aggregation and assignment. We study a method of parameterized aggregation that makes few assumptions about the workload. The mapping of aggregate units of work onto processors is uniform, and exploits locality of workload intensity to balance the unknown workload. In general, a finer aggregate granularity leads to a better balance at the price of increased communication/synchronization costs; the aggregation parameters can be adjusted to find a reasonable granularity. The effectiveness of this scheme is demonstrated on three model problems: an adaptive one-dimensional fluid dynamics problem with message passing, a sparse triangular linear system solver on both a shared memory and a message-passing machine, and a two-dimensional time-driven battlefield simulation employing message passing. Using the model problems, the tradeoffs are studied between balanced workload and the communication/synchronization costs. Finally, an analytical model is used to explain why the method balances workload and minimizes the variance in system behavior
Parallelization of cycle-based logic simulation
Verification of digital circuits by Cycle-based simulation can be performed in parallel. The parallel implementation requires two phases: the compilation phase, that sets up the data needed for the
execution of the simulation, and the simulation phase, that consists in executing the parallel simulation of the considered circuit for a certain number of cycles. During the early phase of design, compilation phase has to be repeated each time a bug is found. Thus, if the time of the compilation phase is too high, the advantages stemming from the parallel approach may be lost. In this work we propose an
effective version of the compilation phase and compute the corresponding execution time. We also analyze the percentage of execution time required by the different steps of the compilation phase for
a set of literature benchmarks. Further, we implemented the simulation phase exploiting the GPU architecture, and we computed the execution times for a set of benchmarks obtaining values comparable
with literature ones. Finally, we implemented the sequential version of the Cycle-based simulation in such a way that the execution time is optimized. We used the sequential values to compute the speedup
of the parallel version for the considered set of benchmarks
- …