
    Coz: Finding Code that Counts with Causal Profiling

    Improving performance is a central concern for software developers. To locate optimization opportunities, developers rely on software profilers. However, these profilers only report where programs spend their time: optimizing that code may have no impact on performance. Past profilers thus both waste developer time and make it difficult for them to uncover significant optimization opportunities. This paper introduces causal profiling. Unlike past profiling approaches, causal profiling indicates exactly where programmers should focus their optimization efforts, and quantifies their potential impact. Causal profiling works by running performance experiments during program execution. Each experiment calculates the impact of a potential optimization by virtually speeding up code: inserting pauses that slow down all other code running concurrently. The key insight is that this slowdown has the same relative effect as running the selected code faster, thus "virtually" speeding it up. We present Coz, a causal profiler, which we evaluate on a range of highly tuned applications: Memcached, SQLite, and the PARSEC benchmark suite. Coz identifies previously unknown optimization opportunities that are both significant and targeted. Guided by Coz, we improve the performance of Memcached by 9%, SQLite by 25%, and accelerate six PARSEC applications by as much as 68%; in most cases, these optimizations involve modifying under 10 lines of code.
    Comment: Published at SOSP 2015 (Best Paper Award)
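
    To make the workflow concrete, here is a minimal C++ sketch of how a program is annotated for Coz using its documented annotation macros (coz.h, COZ_PROGRESS); the worker loop itself is an illustrative stand-in. A progress point tells Coz what "work completed" means, and each virtual speedup is measured against the resulting throughput.

        // Minimal sketch: annotate throughput with a Coz progress point.
        // coz.h and COZ_PROGRESS are Coz's public annotation API; the
        // worker below is an illustrative stand-in for real work.
        #include <coz.h>
        #include <thread>
        #include <vector>

        void worker(std::vector<int>& data, size_t begin, size_t end) {
          for (size_t i = begin; i < end; ++i) {
            data[i] *= 2;     // stand-in for real per-item work
            COZ_PROGRESS;     // Coz measures throughput at this point
          }
        }

        int main() {
          std::vector<int> data(1 << 20, 1);
          std::thread t1(worker, std::ref(data), 0, data.size() / 2);
          std::thread t2(worker, std::ref(data), data.size() / 2, data.size());
          t1.join();
          t2.join();
        }

    Building with debug information (-g) and running the binary under "coz run --- ./app" then produces a causal profile relating each source line's virtual speedup to end-to-end throughput.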

    Parallel Construction of Wavelet Trees on Multicore Architectures

    The wavelet tree has become a very useful data structure to efficiently represent and query large volumes of data in many different domains, from bioinformatics to geographic information systems. One problem with wavelet trees is their construction time. In this paper, we introduce two algorithms that reduce the time complexity of a wavelet tree's construction by taking advantage of nowadays-ubiquitous multicore machines. Our first algorithm constructs all the levels of the wavelet tree in parallel in O(n) time and O(n lg σ + σ lg n) bits of working space, where n is the size of the input sequence and σ is the size of the alphabet. Our second algorithm constructs the wavelet tree in a domain-decomposition fashion, using our first algorithm in each segment, reaching O(lg n) time and O(n lg σ + pσ lg n / lg σ) bits of extra space, where p is the number of available cores. Both algorithms are practical and report good speedups for large real datasets.
    Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941
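
    A minimal C++ sketch of the per-level idea behind the first algorithm (an illustration of the technique, not the authors' implementation): each of the H = ⌈lg σ⌉ levels can be built independently, because position i of the level-l bitvector is bit (H-1-l) of the i-th symbol once the input is stably sorted by its top l bits, and a counting sort recovers that order in O(n + 2^l) per level.

        // Per-level parallel wavelet tree construction (illustrative sketch).
        // Assumes symbols are drawn from {0, ..., sigma-1} packed into
        // H = ceil(lg sigma) bits, with H < 32.
        #include <cstdint>
        #include <thread>
        #include <vector>

        std::vector<std::vector<uint8_t>>
        build_levels(const std::vector<uint32_t>& s, unsigned H) {
          std::vector<std::vector<uint8_t>> levels(
              H, std::vector<uint8_t>(s.size()));
          std::vector<std::thread> threads;
          for (unsigned l = 0; l < H; ++l) {
            threads.emplace_back([&, l] {
              // Counting sort: starting offset of each top-l-bit prefix
              // in the stably sorted order.
              std::vector<size_t> offset((size_t{1} << l) + 1, 0);
              for (uint32_t c : s) offset[(l ? c >> (H - l) : 0) + 1]++;
              for (size_t p = 1; p < offset.size(); ++p)
                offset[p] += offset[p - 1];
              // Scatter bit (H-1-l) of each symbol to its level-l position.
              for (uint32_t c : s)
                levels[l][offset[l ? c >> (H - l) : 0]++] =
                    (c >> (H - 1 - l)) & 1;
            });
          }
          for (auto& t : threads) t.join();
          return levels;
        }

    Each thread writes only its own level, so no synchronization is needed; the per-level count arrays (up to σ/2 counters of lg n bits at the deepest level) account for the σ lg n term in the working space.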

    Building real-time embedded applications on QduinoMC: a web-connected 3D printer case study

    Single Board Computers (SBCs) are now emerging with multiple cores, ADCs, GPIOs, PWM channels, integrated graphics, and several serial bus interfaces. The low power consumption, small form factor, and ability of SBCs to interface with sensors and actuators make them ideal for embedded and real-time applications. However, most SBCs run non-real-time operating systems based on Linux and Windows, and do not provide a user-friendly API for application development. This paper presents QduinoMC, a multicore extension to the popular Arduino programming environment, which runs on the Quest real-time operating system. QduinoMC is an extension of our earlier single-core, real-time, multithreaded Qduino API. We show the utility of QduinoMC by applying it to a specific application: a web-connected 3D printer. This differs from existing 3D printers, which run relatively simple firmware and lack operating system support to spool multiple jobs or interoperate with other devices (e.g., in a print farm). We show how QduinoMC empowers devices with the capabilities to run new services without impacting their timing guarantees. While it is possible to modify existing operating systems to provide suitable timing guarantees, the effort to do so is cumbersome and does not provide the ease of programming afforded by QduinoMC.
    http://www.cs.bu.edu/fac/richwest/papers/rtas_2017.pdf
    Accepted manuscript
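
    To illustrate the programming model, here is a hypothetical Arduino-style sketch of what a multicore, time-budgeted loop API could look like. The registerLoop function, its (core, budget, period) parameters, and the pin assignments are illustrative assumptions for this sketch, not the actual QduinoMC interface.

        // Hypothetical sketch of a multicore Arduino-style API.
        // registerLoop and its parameters are illustrative assumptions,
        // NOT the actual QduinoMC interface.
        #include <Arduino.h>

        // Assumed to be provided by the hypothetical runtime: run `body`
        // repeatedly on `core`, granted `budget_us` of CPU time every
        // `period_us` (a VCPU-style budget/period reservation).
        void registerLoop(void (*body)(), int core,
                          unsigned budget_us, unsigned period_us);

        void stepperLoop() {            // time-critical motor control
          digitalWrite(13, HIGH);       // stand-in for a stepper pulse
          digitalWrite(13, LOW);
        }

        void webLoop() {                // best-effort job-spooling service
          Serial.println("serving print-job status");
        }

        void setup() {
          Serial.begin(115200);
          pinMode(13, OUTPUT);
          registerLoop(stepperLoop, /*core=*/0, /*budget_us=*/200,  /*period_us=*/1000);
          registerLoop(webLoop,     /*core=*/1, /*budget_us=*/5000, /*period_us=*/100000);
        }

        void loop() {}                  // unused: work runs in registered loops

    The point of the budget/period reservations is that the best-effort web service can never steal cycles reserved for the stepper loop, which is how a printer can take on new services while keeping its timing guarantees.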

    Thread-spawning schemes for speculative multithreading

    Speculative multithreading has been recently proposed to boost performance by exploiting thread-level parallelism in applications that are difficult to parallelize. The performance of these processors heavily depends on the partitioning policy used to split the program into threads. Previous work uses heuristics to spawn speculative threads based on easily detectable program constructs such as loops or subroutines. In this work we propose a profile-based mechanism to divide programs into threads by searching for those parts of the code that have certain features that could benefit from potential thread-level parallelism. Our profile-based spawning scheme is evaluated on a Clustered Speculative Multithreaded Processor and shows large performance benefits, outperforming traditional heuristics by almost 20%. When a realistic value predictor and an 8-cycle thread initialization penalty are considered, the performance difference between them is maintained. The speed-up over single-threaded execution is higher than 5x for a 16-thread-unit processor and close to 2x for a 4-thread-unit processor.
    Peer Reviewed. Postprint (published version)
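
    A hypothetical C++ sketch of profile-driven spawn-pair selection (the scoring below is illustrative, not the paper's exact mechanism): candidate (spawn, target) pairs of program points are filtered and ranked from profile data, favoring pairs that are reached often, are far enough apart to hide the thread-initialization penalty, and are control quasi-independent (the target almost always executes after the spawn).

        // Illustrative profile-based spawn-pair selection; the thresholds
        // kMinDistance and kMinReachProb are assumed tuning knobs.
        #include <algorithm>
        #include <cstdint>
        #include <vector>

        struct CandidatePair {
          uint32_t spawn_pc, target_pc;
          uint64_t times_reached;      // how often spawn_pc executes
          uint64_t times_target_hit;   // how often target_pc follows it
          double   avg_distance;       // dynamic instructions between the two
        };

        constexpr double kMinDistance  = 32.0;   // must cover thread-init cost
        constexpr double kMinReachProb = 0.95;   // control quasi-independence

        std::vector<CandidatePair>
        select_spawn_pairs(const std::vector<CandidatePair>& candidates) {
          std::vector<CandidatePair> out;
          for (const auto& p : candidates) {
            double reach =
                double(p.times_target_hit) / double(p.times_reached);
            if (p.avg_distance >= kMinDistance && reach >= kMinReachProb)
              out.push_back(p);
          }
          // Prefer pairs exposing the most parallelism: coverage * distance.
          std::sort(out.begin(), out.end(), [](const auto& a, const auto& b) {
            return a.times_reached * a.avg_distance >
                   b.times_reached * b.avg_distance;
          });
          return out;
        }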

    GeantV: Results from the prototype of concurrent vector particle transport simulation in HEP

    Full detector simulation was among the largest CPU consumers in all CERN experiment software stacks for the first two runs of the Large Hadron Collider (LHC). In the early 2010s, the projections were that simulation demands would scale linearly with luminosity increases, compensated only partially by growth in computing resources. Extending fast simulation approaches to more use cases, covering a larger fraction of the simulation budget, is only part of the solution, due to intrinsic precision limitations. The remainder corresponds to speeding up the simulation software by several factors, which is out of reach using simple optimizations on the current code base. In this context, the GeantV R&D project was launched, aiming to redesign the legacy particle transport codes so that they benefit from fine-grained parallelism features such as vectorization, as well as from increased code and data locality. This paper presents in detail the results and achievements of this R&D, as well as the conclusions and lessons learnt from the beta prototype.
    Comment: 34 pages, 26 figures, 24 tables
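
    A minimal C++ sketch of the track-basketizing idea behind this kind of fine-grained parallelism (an illustration of the technique, not GeantV code): tracks are regrouped by geometry volume into baskets stored in structure-of-arrays form, so that each stepping kernel runs a homogeneous, branch-free loop over contiguous data that the compiler can vectorize.

        // Illustrative basketized transport: the Track/Basket layouts and
        // the propagate kernel are assumptions for this sketch.
        #include <cstddef>
        #include <unordered_map>
        #include <vector>

        struct Basket {                  // SoA layout: one array per field
          std::vector<double> x, y, z;   // track positions
          std::vector<double> step;      // proposed step lengths
        };

        void propagate(Basket& b) {
          // One homogeneous kernel per basket: same volume, same physics,
          // so the loop body is branch-free and vectorizable.
          for (size_t i = 0; i < b.x.size(); ++i) {
            b.x[i] += b.step[i];         // stand-in for real propagation
          }
        }

        struct Track { int volume; double x, y, z, step; };

        void transport(const std::vector<Track>& tracks) {
          std::unordered_map<int, Basket> baskets;  // regroup by volume
          for (const Track& t : tracks) {
            Basket& b = baskets[t.volume];
            b.x.push_back(t.x); b.y.push_back(t.y); b.z.push_back(t.z);
            b.step.push_back(t.step);
          }
          for (auto& [volume, basket] : baskets) propagate(basket);
        }

    The regrouping step trades copying cost for data locality and SIMD efficiency inside each kernel, which is the central bet of the vectorized-transport design.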

    Distributed data cache designs for clustered VLIW processors

    Wire delays are a major concern for current and forthcoming processors. One approach to deal with this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the L1 data cache typically remains centralized, in what we call partially distributed architectures. However, as technology evolves, the relative latency of such a centralized cache will increase, leading to an important impact on performance. In this paper, we propose partitioning the L1 data cache among clusters for clustered VLIW processors. We refer to this kind of design as fully distributed processors. In particular, we propose and evaluate three different configurations: a snoop-based cache coherence scheme, a word-interleaved cache, and flexible L0-buffers managed by the compiler. For each alternative, instruction scheduling techniques targeted to cyclic code are developed. Results for the MediaBench suite show that the performance of such fully distributed architectures is always better than the performance of a partially distributed one with the same amount of resources. In addition, the key aspects of each fully distributed configuration are explored.
    Peer Reviewed. Postprint (published version)
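
    A minimal C++ sketch of the word-interleaved mapping idea (illustrative, not the paper's exact design): consecutive memory words are assigned round-robin to consecutive clusters, so each cluster's local bank holds every N-th word and the home cluster of an address is a cheap bit extraction the compiler can reason about when scheduling loads.

        // Illustrative word-interleaved address mapping; the cluster count
        // and word size are assumed parameters.
        #include <cstdint>

        constexpr unsigned kNumClusters = 4;   // assumed 4-cluster VLIW
        constexpr unsigned kWordBytes   = 4;   // assumed 32-bit words

        // Cluster whose local cache bank owns the word containing `addr`.
        inline unsigned home_cluster(uint32_t addr) {
          return (addr / kWordBytes) % kNumClusters;
        }

        // Position of that word within its bank (each bank stores every
        // kNumClusters-th word of memory).
        inline uint32_t bank_word_index(uint32_t addr) {
          return (addr / kWordBytes) / kNumClusters;
        }

    With strided array accesses the compiler can often prove which cluster a load hits, scheduling it locally; remote words need an inter-cluster transfer, which is what the cyclic-code scheduling techniques try to hide.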