428 research outputs found

    Reproducible simulation of multi-threaded workloads for architecture design exploration

    Full text link
    As multiprocessors become mainstream, techniques to ad-dress efficient simulation of multi-threaded workloads are needed. Multi-threaded simulation presents a new challenge: non-determinism across simulations for different architecture configurations. If the execution paths between two simulation runs of the same benchmark with the same input are too different, the simulation results cannot be used to compare the configurations. In this paper we focus on a simulation technique to efficiently collect simulation checkpoints for multi-threaded workloads, and to compare simulation runs addressing this non-determinism problem. We focus on user-level simulation of multi-threaded workloads for multiprocessor architectures. We present an approach, based on binary instrumentation, to collect checkpoints for simulation. Our checkpoints allow reproducible execution of the samples across different ar-chitecture configurations by controlling the sources of non-determinism during simulation. This results in stalls that would not naturally occur in execution. We propose techniques that allow us to accurately compare performance across architec-ture configurations in the presence of these stalls. I

    Holistic debugging - enabling instruction set simulation for software quality assurance

    Get PDF
    We present holistic debugging, a novel method for observing execution of complex and distributed software. It builds on an instruction set simulator, which provides reproducible experiments and non-intrusive probing of state in a distributed system. Instruction set simulators, however, only provide low-level information, so a holistic debugger contains a translation framework that maps this information to higher abstraction level observation tools, such as source code debuggers. We have created Nornir, a proof-of-concept holistic debugger, built on the simulator Simics. For each observed process in the simulated system, Nornir creates an abstraction translation stack, with virtual machine translators that map machine-level storage contents (e.g. physical memory, registers) provided by Simics, to application-level data (e.g. virtual memory contents) by parsing the data structures of operating systems and virtual machines. Nornir includes a modified version of the GNU debugger (GDB), which supports non-intrusive symbolic debugging of distributed applications. Nornir's main interface is a debugger shepherd, a programmable interface that controls multiple debuggers, and allows users to coherently inspect the entire state of heterogeneous, distributed applications. It provides a robust observation platform for construction of new observation tools

    RPPM : Rapid Performance Prediction of Multithreaded workloads on multicore processors

    Get PDF
    Analytical performance modeling is a useful complement to detailed cycle-level simulation to quickly explore the design space in an early design stage. Mechanistic analytical modeling is particularly interesting as it provides deep insight and does not require expensive offline profiling as empirical modeling. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors. This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance

    VIoLET: A Large-scale Virtual Environment for Internet of Things

    Full text link
    IoT deployments have been growing manifold, encompassing sensors, networks, edge, fog and cloud resources. Despite the intense interest from researchers and practitioners, most do not have access to large-scale IoT testbeds for validation. Simulation environments that allow analytical modeling are a poor substitute for evaluating software platforms or application workloads in realistic computing environments. Here, we propose VIoLET, a virtual environment for defining and launching large-scale IoT deployments within cloud VMs. It offers a declarative model to specify container-based compute resources that match the performance of the native edge, fog and cloud devices using Docker. These can be inter-connected by complex topologies on which private/public networks, and bandwidth and latency rules are enforced. Users can configure synthetic sensors for data generation on these devices as well. We validate VIoLET for deployments with > 400 devices and > 1500 device-cores, and show that the virtual IoT environment closely matches the expected compute and network performance at modest costs. This fills an important gap between IoT simulators and real deployments.Comment: To appear in the Proceedings of the 24TH International European Conference On Parallel and Distributed Computing (EURO-PAR), August 27-31, 2018, Turin, Italy, europar2018.org. Selected as a Distinguished Paper for presentation at the Plenary Session of the conferenc

    Pac-Sim: Simulation of Multi-threaded Workloads using Intelligent, Live Sampling

    Full text link
    High-performance, multi-core processors are the key to accelerating workloads in several application domains. To continue to scale performance at the limit of Moore's Law and Dennard scaling, software and hardware designers have turned to dynamic solutions that adapt to the needs of applications in a transparent, automatic way. For example, modern hardware improves its performance and power efficiency by changing the hardware configuration, like the frequency and voltage of cores, according to a number of parameters such as the technology used, the workload running, etc. With this level of dynamism, it is essential to simulate next-generation multi-core processors in a way that can both respond to system changes and accurately determine system performance metrics. Currently, no sampled simulation platform can achieve these goals of dynamic, fast, and accurate simulation of multi-threaded workloads. In this work, we propose a solution that allows for fast, accurate simulation in the presence of both hardware and software dynamism. To accomplish this goal, we present Pac-Sim, a novel sampled simulation methodology for fast, accurate sampled simulation that requires no upfront analysis of the workload. With our proposed methodology, it is now possible to simulate long-running dynamically scheduled multi-threaded programs with significant simulation speedups even in the presence of dynamic hardware events. We evaluate Pac-Sim using the multi-threaded SPEC CPU2017, NPB, and PARSEC benchmarks with both static and dynamic thread scheduling. The experimental results show that Pac-Sim achieves a very low sampling error of 1.63% and 3.81% on average for statically and dynamically scheduled benchmarks, respectively. Pac-Sim also demonstrates significant simulation speedups as high as 523.5Ă—\times (210.3Ă—\times on average) for the train input set of SPEC CPU2017.Comment: 14 pages, 14 figure

    A Subset of the CERN Virtual Machine File System: Fast Delivering of Complex Software Stacks for Supercomputing Resources

    Full text link
    Delivering a reproducible environment along with complex and up-to-date software stacks on thousands of distributed and heterogeneous worker nodes is a critical task. The CernVM-File System (CVMFS) has been designed to help various communities to deploy software on worldwide distributed computing infrastructures by decoupling the software from the Operating System. However, the installation of this file system depends on a collaboration with system administrators of the remote resources and an HTTP connectivity to fetch dependencies from external sources. Supercomputers, which offer tremendous computing power, generally have more restrictive policies than grid sites and do not easily provide the mandatory conditions to exploit CVMFS. Different solutions have been developed to tackle the issue, but they are often specific to a scientific community and do not deal with the problem in its globality. In this paper, we provide a generic utility to assist any community in the installation of complex software dependencies on supercomputers with no external connectivity. The approach consists in capturing dependencies of applications of interests, building a subset of dependencies, testing it in a given environment, and deploying it to a remote computing resource. We experiment this proposal with a real use case by exporting Gauss-a Monte-Carlo simulation program from the LHCb experiment-on Mare Nostrum, one of the top supercomputers of the world. We provide steps to encapsulate the minimum required files and deliver a light and easy-to-update subset of CVMFS: 12.4 Gigabytes instead of 5.2 Terabytes for the whole LHCb repository

    Accelerated Modeling of Near and Far-Field Diffraction for Coronagraphic Optical Systems

    Full text link
    Accurately predicting the performance of coronagraphs and tolerancing optical surfaces for high-contrast imaging requires a detailed accounting of diffraction effects. Unlike simple Fraunhofer diffraction modeling, near and far-field diffraction effects, such as the Talbot effect, are captured by plane-to-plane propagation using Fresnel and angular spectrum propagation. This approach requires a sequence of computationally intensive Fourier transforms and quadratic phase functions, which limit the design and aberration sensitivity parameter space which can be explored at high-fidelity in the course of coronagraph design. This study presents the results of optimizing the multi-surface propagation module of the open source Physical Optics Propagation in PYthon (POPPY) package. This optimization was performed by implementing and benchmarking Fourier transforms and array operations on graphics processing units, as well as optimizing multithreaded numerical calculations using the NumExpr python library where appropriate, to speed the end-to-end simulation of observatory and coronagraph optical systems. Using realistic systems, this study demonstrates a greater than five-fold decrease in wall-clock runtime over POPPY's previous implementation and describes opportunities for further improvements in diffraction modeling performance.Comment: Presented at SPIE ASTI 2018, Austin Texas. 11 pages, 6 figure

    Doctor of Philosophy

    Get PDF
    dissertationWith the explosion of chip transistor counts, the semiconductor industry has struggled with ways to continue scaling computing performance in line with historical trends. In recent years, the de facto solution to utilize excess transistors has been to increase the size of the on-chip data cache, allowing fast access to an increased portion of main memory. These large caches allowed the continued scaling of single thread performance, which had not yet reached the limit of instruction level parallelism (ILP). As we approach the potential limits of parallelism within a single threaded application, new approaches such as chip multiprocessors (CMP) have become popular for scaling performance utilizing thread level parallelism (TLP). This dissertation identifies the operating system as a ubiquitous area where single threaded performance and multithreaded performance have often been ignored by computer architects. We propose that novel hardware and OS co-design has the potential to significantly improve current chip multiprocessor designs, enabling increased performance and improved power efficiency. We show that the operating system contributes a nontrivial overhead to even the most computationally intense workloads and that this OS contribution grows to a significant fraction of total instructions when executing several common applications found in the datacenter. We demonstrate that architectural improvements have had little to no effect on the performance of the OS over the last 15 years, leaving ample room for improvements. We specifically consider three potential solutions to improve OS execution on modern processors. First, we consider the potential of a separate operating system processor (OSP) operating concurrently with general purpose processors (GPP) in a chip multiprocessor organization, with several specialized structures acting as efficient conduits between these processors. Second, we consider the potential of segregating existing caching structures to decrease cache interference between the OS and application. Third, we propose that there are components within the OS itself that should be refactored to be both multithreaded and cache topology aware, which in turn, improves the performance and scalability of many-threaded applications
    • …
    corecore