    Coz: Finding Code that Counts with Causal Profiling

    Improving performance is a central concern for software developers. To locate optimization opportunities, developers rely on software profilers. However, these profilers only report where programs spent their time: optimizing that code may have no impact on performance. Past profilers thus both waste developer time and make it difficult for them to uncover significant optimization opportunities. This paper introduces causal profiling. Unlike past profiling approaches, causal profiling indicates exactly where programmers should focus their optimization efforts, and quantifies their potential impact. Causal profiling works by running performance experiments during program execution. Each experiment calculates the impact of any potential optimization by virtually speeding up code: inserting pauses that slow down all other code running concurrently. The key insight is that this slowdown has the same relative effect as running that line faster, thus "virtually" speeding it up. We present Coz, a causal profiler, which we evaluate on a range of highly-tuned applications: Memcached, SQLite, and the PARSEC benchmark suite. Coz identifies previously unknown optimization opportunities that are both significant and targeted. Guided by Coz, we improve the performance of Memcached by 9%, SQLite by 25%, and accelerate six PARSEC applications by as much as 68%; in most cases, these optimizations involve modifying under 10 lines of code. Comment: Published at SOSP 2015 (Best Paper Award).
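    The equivalence that causal profiling relies on can be checked with a little arithmetic. The toy Python sketch below (illustrative numbers of our own, ignoring the concurrency and progress-point details that Coz actually handles) shows that slowing all other code down by a factor k and rescaling time by k has the same relative effect as running the selected line k times faster:

        # Toy check of the "virtual speedup" equivalence (illustrative numbers only).
        t_line = 2.0    # time attributed to the line being virtually sped up (s)
        t_other = 8.0   # time attributed to all other concurrently running code (s)
        k = 1.25        # hypothetical speedup factor for the selected line

        # Real speedup: the line itself runs k times faster.
        actual = (t_line / k + t_other) / (t_line + t_other)

        # Virtual speedup: leave the line alone, slow everything else down by k,
        # then compare against a baseline that is scaled by the same factor.
        virtual = (t_line + t_other * k) / ((t_line + t_other) * k)

        print(f"{actual:.4f} {virtual:.4f}")  # 0.9600 0.9600: the relative effect is identical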

    Analysis of a benchmark suite to evaluate mixed numeric and symbolic processing

    The suite of programs that formed the benchmark for a proposed advanced computer is described and analyzed. The features of the processor and its operating system that are tested by the benchmark are discussed. The computer codes and the supporting data for the analysis are given as appendices.

    Measurement, modeling, and adjustment of the 10.4-m-diameter Leighton telescopes

    The design of the Leighton telescopes and the unique techniques used in their fabrication make these telescopes particularly amenable to precise modeling and measurement of their performance. The surface is essentially a continuous membrane supported at 99 uniformly distributed nodes by a pin joint triangular grid space frame. This structure can be accurately modeled and the surface can be adjusted using low-resolution maps. Holographic measurements of the surface figure of these telescopes at the Caltech Submillimeter Observatory (CSO) and the Owens Valley Radio Observatory (OVRO) have been made over several epochs with a repeatability of 5-10 micrometers over the zenith angle range from 15 to 75 degrees. The measurements are consistent with the calculated gravitational distortions. Several different surface setting strategies are evaluated and the 'second-order deviation from homology,' Hd, is introduced as a measure of the gravitational degradation that can be expected for an optimally adjusted surface. Hd is defined as half of the RMS difference between the deviations from homology for the telescope pointed at the extremes of its intended sky coverage range. This parameter can be used to compare the expected performance of many different types of telescopes, including off-axis reflectors and slant-axis or polar mounts as well as standard alt-az designs. Subtle asymmetries in a telescope's structure are shown to dramatically affect its performance. The RMS surface error of the Leighton telescope is improved by more than a factor of two when optimized over the positive zenith angle quadrant compared to optimization over the negative quadrant. A global surface optimization algorithm is developed to take advantage of the long term stability and understanding of the Leighton telescopes. It significantly improves the operational performance of the telescope over that obtained using a simple 'rigging angle' adjustment. The surface errors for the CSO are now less than 22 micrometers RMS over most of the zenith angle range and the aperture efficiency at 810 GHz exceeds 33%. This illustrates the usefulness of the global surface optimization procedure.
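    The Hd metric described above can be written compactly. A minimal LaTeX rendering of that prose definition (the symbols d_1, d_2, x_i, and N are our notation, not the paper's):

        % d_1(x_i), d_2(x_i): deviations from homology at surface point x_i for the
        % telescope pointed at the two extremes of its intended sky coverage range;
        % N: number of surface points over which the RMS is taken.
        \[
        H_d = \frac{1}{2}\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[d_1(x_i) - d_2(x_i)\right]^2}
        \]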

    Stellar Intensity Interferometry: Astrophysical targets for sub-milliarcsecond imaging

    Intensity interferometry permits very long optical baselines and the observation of sub-milliarcsecond structures. Using planned kilometric arrays of air Cherenkov telescopes at short wavelengths, intensity interferometry may increase the spatial resolution achieved in optical astronomy by an order of magnitude, inviting detailed studies of the shapes of rapidly rotating hot stars with structures in their circumstellar disks and winds, or mapping out patterns of nonradial pulsations across stellar surfaces. Signal-to-noise in intensity interferometry favors high-temperature sources and emission-line structures, and is independent of the optical passband, be it a single spectral line or the broad spectral continuum. Prime candidate sources have been identified among classes of bright and hot stars. Observations are simulated for telescope configurations envisioned for large Cherenkov facilities, synthesizing numerous optical baselines in software, confirming that resolutions of tens of microarcseconds are feasible for numerous astrophysical targets. Comment: 12 pages, 4 figures; presented at the SPIE conference "Optical and Infrared Interferometry II", San Diego, CA, USA (June 2010).
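    The passband-independence claim follows from the standard signal-to-noise expression for a two-telescope intensity correlator. A sketch of that textbook (Hanbury Brown-Twiss) relation in our own notation, not quoted from this paper:

        % A: aperture area of each telescope; \alpha: quantum efficiency;
        % n: spectral photon flux density (photons m^{-2} s^{-1} Hz^{-1});
        % |\gamma_{12}|^2: squared degree of coherence of the source;
        % \Delta f: electronic bandwidth; T: total observing time.
        \[
        \left(\frac{S}{N}\right)_{\mathrm{RMS}} = A\,\alpha\,n\,\lvert\gamma_{12}\rvert^{2}\sqrt{\tfrac{1}{2}\,\Delta f\,T}
        \]
        % Because n is defined per unit optical frequency, the result does not depend
        % on the width of the optical passband, and for thermal sources n rises with
        % brightness temperature, which is why hot stars are favored.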

    Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors

    This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs). Our work is motivated by the large asymmetry in cache set usage. CE decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at a group granularity (each group comprising several cache sets) and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed at the cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5%, and by as much as 28.5%, for the benchmark programs we examined. Furthermore, the evaluations show that CE also outperforms related CMP cache designs.
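    The placement policy described in the abstract is easy to illustrate. Below is a minimal Python sketch of the idea (the group count, data structures, and decay policy are our assumptions for illustration, not the paper's implementation):

        from collections import defaultdict

        NUM_GROUPS = 16                    # assumed number of cache-set groups

        pressure = defaultdict(int)        # group id -> accesses in current epoch
        placement = {}                     # block address -> group id

        def record_access(group_id):
            # Accumulate temporal pressure for the group that serviced an access.
            pressure[group_id] += 1

        def place_block(block_addr):
            # Map an incoming block to the group currently under the least pressure.
            target = min(range(NUM_GROUPS), key=lambda g: pressure[g])
            placement[block_addr] = target
            return target

        def end_epoch():
            # Periodically decay the collected statistics (a hard reset also works).
            for g in list(pressure):
                pressure[g] //= 2

        # Example: heavy traffic to group 3 steers a new block elsewhere.
        for _ in range(3):
            record_access(3)
        print(place_block(0xDEADBEEF))     # prints 0, a least-pressured group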

    Stellar intensity interferometry: Optimizing air Cherenkov telescope array layouts

    Kilometric-scale optical imagers seem feasible to realize by intensity interferometry, using telescopes primarily erected for measuring Cherenkov light induced by gamma rays. Planned arrays envision 50-100 telescopes, distributed over some 1-4 km^2. Although array layouts and telescope sizes will primarily be chosen for gamma-ray observations, their interferometric performance may also be optimized. Observations of stellar objects were numerically simulated for different array geometries, yielding signal-to-noise ratios for different Fourier components of the source images in the interferometric (u,v)-plane. Simulations were made for layouts actually proposed for future Cherenkov telescope arrays, and for subsets with only a fraction of the telescopes. All large arrays provide dense sampling of the (u,v)-plane due to the sheer number of telescopes, irrespective of their geographic orientation or stellar coordinates. However, for improved coverage of the (u,v)-plane and a wider variety of baselines (enabling better image reconstruction), an exact east-west grid should be avoided for the numerous smaller telescopes, and repetitive geometric patterns avoided for the few large ones. Sparse arrays become severely limited by a lack of short baselines, and to cover astrophysically relevant dimensions between 0.1-3 milliarcseconds in visible wavelengths, baselines between pairs of telescopes should cover the whole interval 30-2000 m. Comment: 12 pages, 10 figures; presented at the SPIE conference "Optical and Infrared Interferometry II", San Diego, CA, USA (June 2010).
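    The baseline bookkeeping behind such layout studies is straightforward to sketch. The short Python fragment below (telescope coordinates and wavelength are invented for illustration; the paper's simulations also account for Earth rotation and source coordinates) enumerates the pairwise baselines of a candidate layout and the corresponding zenith-source (u,v) samples:

        import itertools
        import math

        wavelength_m = 400e-9                    # assumed observing wavelength
        telescopes = [(0, 0), (120, 35), (310, -80), (95, 240), (510, 400)]  # metres

        def uv_samples(positions, lam):
            # One (u, v) point per telescope pair, in units of wavelengths, for a
            # source at the zenith (no Earth-rotation synthesis in this toy version).
            for (x1, y1), (x2, y2) in itertools.combinations(positions, 2):
                yield ((x2 - x1) / lam, (y2 - y1) / lam)

        baselines = [math.hypot(x2 - x1, y2 - y1)
                     for (x1, y1), (x2, y2) in itertools.combinations(telescopes, 2)]
        print(len(baselines), "baselines spanning",
              f"{min(baselines):.0f}-{max(baselines):.0f} m")
        print(len(list(uv_samples(telescopes, wavelength_m))), "(u,v) samples")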

    Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential

    Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. The understanding and characterization of data locality properties of computations is critical in order to guide efforts to enhance data locality. Reuse distance analysis of memory address traces is a valuable tool to perform data locality characterization of programs. A single reuse distance analysis can be used to estimate the number of cache misses in a fully associative LRU cache of any size, thereby providing estimates on the minimum bandwidth requirements at different levels of the memory hierarchy to avoid being bandwidth bound. However, such an analysis only holds for the particular execution order that produced the trace. It cannot estimate potential improvement in data locality through dependence preserving transformations that change the execution schedule of the operations in the computation. In this article, we develop a novel dynamic analysis approach to characterize the inherent locality properties of a computation and thereby assess the potential for data locality enhancement via dependence preserving transformations. The execution trace of a code is analyzed to extract a computational directed acyclic graph (CDAG) of the data dependences. The CDAG is then partitioned into convex subsets, and the convex partitioning is used to reorder the operations in the execution trace to enhance data locality. The approach enables us to go beyond reuse distance analysis of a single specific order of execution of the operations of a computation in characterization of its data locality properties. It can serve a valuable role in identifying promising code regions for manual transformation, as well as assessing the effectiveness of compiler transformations for data locality enhancement. We demonstrate the effectiveness of the approach using a number of benchmarks, including case studies where the potential shown by the analysis is exploited to achieve lower data movement costs and better performance. Comment: Transactions on Architecture and Code Optimization (2014).
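    Reuse distance analysis itself is compact enough to sketch. The following Python fragment (a straightforward O(n^2) implementation of the textbook definition, written by us and unrelated to the paper's tooling) computes, for each access in a trace, the number of distinct addresses touched since the previous access to the same address; an access misses in a fully associative LRU cache of C blocks exactly when that distance is undefined or at least C:

        def reuse_distances(trace):
            # Yield the reuse distance of each access; None marks a cold miss.
            last_use = {}                      # address -> index of previous access
            for i, addr in enumerate(trace):
                if addr in last_use:
                    # Distinct addresses seen strictly between the two uses.
                    yield len(set(trace[last_use[addr] + 1 : i]))
                else:
                    yield None                 # first touch: compulsory miss
                last_use[addr] = i

        trace = ["a", "b", "c", "a", "b", "d", "a"]
        print(list(reuse_distances(trace)))    # [None, None, None, 2, 2, None, 2]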