2,621 research outputs found

    Performance Debugging and Tuning using an Instruction-Set Simulator

    Instruction-set simulators allow programmers a detailed level of insight into, and control over, the execution of a program, including parallel programs and operating systems. In principle, instruction-set simulation can model any target computer and gather any statistic. Furthermore, such simulators are usually portable, independent of compiler tools, and deterministic, allowing bugs to be recreated or measurements repeated. Though often viewed as being too slow for use as a general programming tool, in the last several years their performance has improved considerably. We describe SimICS, an instruction-set simulator of SPARC-based multiprocessors developed at SICS, in its rôle as a general programming tool. We discuss some of the benefits of using a tool such as SimICS to support various tasks in software engineering, including debugging, testing, analysis, and performance tuning. We present in some detail two test cases, where we've used SimICS to support analysis and performance tuning of two applications, Penny and EQNTOTT. This work resulted in improved parallelism in, and understanding of, Penny, as well as a performance improvement for EQNTOTT of over an order of magnitude. We also present some early work on analyzing SPARC/Linux, demonstrating the ability of tools like SimICS to analyze operating systems.

    Assessing load-sharing within optimistic simulation platforms

    The advent of multi-core machines has led to the need for revising the architecture of modern simulation platforms. One recent proposal we made attempted to explore the viability of load-sharing for optimistic simulators run on top of these types of machines. In this article, we provide an extensive experimental study assessing the effects on run-time dynamics of a load-sharing architecture that has been implemented within the ROOT-Sim package, an open source simulation platform adhering to the optimistic synchronization paradigm. This experimental study is essentially aimed at evaluating possible sources of overhead when supporting load-sharing. It has been based on differentiated workloads allowing us to generate different execution profiles in terms of, e.g., granularity/locality of the simulation events. © 2012 IEEE

    Simulation models of shared-memory multiprocessor systems


    Experimental research of a shared memory subsystem with limited queue length for specialized reconfigurable multiprocessor systems

    Recently, reconfigurable systems based on field programmable logic devices (FPLDs) have been widely used in high-performance computing. The paper discusses issues related to the experimental research of a shared memory subsystem with a limited queue length in specialized reconfigurable multiprocessor systems, using the developed mathematical modelling method. The paper presents the results of the method proposed by the authors for modelling multiprocessor systems based on open queuing networks with limited queue lengths. Based on these conditions, as well as the architectural features of the investigated processor-memory subsystem, expressions are derived to estimate the exchange time and the resulting delays at each exchange stage. The research focused mainly on the effect of increasing the number of processor nodes in the processor-memory subsystem. The data obtained showed that growth in the number of processors significantly affects the exchange time, creating a significant load on the common bus and increasing delays at the stages where requests are transferred from the processor to the memory. At the same time, the inadequate behaviour of the experimental results and the inaccuracy of their values when using the basic modelling method are clearly visible in the obtained graphs. Computational experiments were carried out to calculate the probabilistic-temporal characteristics of the processor-memory subsystem using the developed mathematical modelling methods. Based on the experimental results, it was determined that the delays occurring in the subsystem's nodes and the time of exchange between the processor and memory modules depend on the query parameters and the processor-memory subsystem's architectural characteristics.

    Modelling Heterogeneous DSP–FPGA Based System Partitioning with Extensions to the Spinach Simulation Environment

    In this paper we present system-on-a-chip extensions to the Spinach simulation environment for rapidly prototyping heterogeneous DSP/FPGA based architectures, specifically in the embedded domain. This infrastructure has been successfully used to model systems varying from multiprocessor Gigabit Ethernet controllers to Texas Instruments C6x series DSP based systems with tightly coupled FPGA based coprocessors for computational offloading. As an illustrative example of this toolset's functionality, we investigate workload partitioning in heterogeneous DSP/FPGA based embedded environments. Specifically, we focus on computational offloading of matrix multiplication kernels across DSP/FPGA based embedded architectures.

    Efficient parallel architecture for highly coupled real-time linear system applications

    A systematic procedure is developed for exploiting the parallel constructs of computation in a highly coupled, linear system application. An overall top-down design approach is adopted. Differential equations governing the application under consideration are partitioned into subtasks on the basis of a data flow analysis. The interconnected task units constitute a task graph which has to be computed in every update interval. Multiprocessing concepts utilizing parallel integration algorithms are then applied for efficient task graph execution. A simple scheduling routine is developed to handle task allocation in multiprocessor mode. Results of simulation and scheduling are compared on the basis of standard performance indices. Processor timing diagrams are developed on the basis of program output for an optimal set of processors. Basic architectural attributes for implementing the system are discussed together with suggestions for processing element design. Emphasis is placed on flexible architectures capable of accommodating widely varying application specifics.