Coz: Finding Code that Counts with Causal Profiling
Improving performance is a central concern for software developers. To locate
optimization opportunities, developers rely on software profilers. However,
these profilers only report where programs spent their time: optimizing that
code may have no impact on performance. Past profilers thus both waste
developer time and make it difficult for them to uncover significant
optimization opportunities.
This paper introduces causal profiling. Unlike past profiling approaches,
causal profiling indicates exactly where programmers should focus their
optimization efforts, and quantifies their potential impact. Causal profiling
works by running performance experiments during program execution. Each
experiment calculates the impact of any potential optimization by virtually
speeding up code: inserting pauses that slow down all other code running
concurrently. The key insight is that this slowdown has the same relative
effect as running that line faster, thus "virtually" speeding it up.
We present Coz, a causal profiler, which we evaluate on a range of
highly-tuned applications: Memcached, SQLite, and the PARSEC benchmark suite.
Coz identifies previously unknown optimization opportunities that are both
significant and targeted. Guided by Coz, we improve the performance of
Memcached by 9%, SQLite by 25%, and accelerate six PARSEC applications by as
much as 68%; in most cases, these optimizations involve modifying under 10
lines of code.
Comment: Published at SOSP 2015 (Best Paper Award)
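The virtual-speedup idea described in this abstract can be illustrated with a toy model (a minimal sketch of the causal-profiling arithmetic, not Coz's actual runtime machinery; the two-thread model and all values are hypothetical). With two parallel threads, end-to-end time is the longer thread; pausing the *other* thread whenever the target line runs, then subtracting the injected pause from the measurement, predicts the same runtime as genuinely speeding the line up:

```python
def real_speedup_runtime(a, b, d):
    """End-to-end time of two parallel threads (durations a and b)
    if thread A's hot line genuinely ran d time units faster."""
    return max(a - d, b)

def virtual_speedup_runtime(a, b, d):
    """Causal-profiling estimate: instead of speeding A up, pause the
    other thread by d whenever the line runs, measure the inflated
    runtime, then subtract the total injected pause."""
    measured = max(a, b + d)  # B is delayed by the injected pauses
    return measured - d       # compensate for the pauses we added

# Hypothetical thread durations and a 3-unit virtual speedup:
a, b, d = 10.0, 8.0, 3.0
assert virtual_speedup_runtime(a, b, d) == real_speedup_runtime(a, b, d)
```

The equivalence holds whether or not thread A stays on the critical path, which is why slowing everything else down has the same relative effect as speeding the target line up.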
Analysis of a benchmark suite to evaluate mixed numeric and symbolic processing
The suite of programs that formed the benchmark for a proposed advanced computer is described and analyzed. The features of the processor and its operating system that are tested by the benchmark are discussed. The computer codes and the supporting data for the analysis are given as appendices.
Measurement, modeling, and adjustment of the 10.4-m-diameter Leighton telescopes
The design of the Leighton telescopes and the unique techniques used in their fabrication make these telescopes particularly amenable to precise modeling and measurement of their performance. The surface is essentially a continuous membrane supported at 99 uniformly distributed nodes by a pin-joint triangular grid space frame. This structure can be accurately modeled, and the surface can be adjusted using low-resolution maps. Holographic measurements of the surface figure of these telescopes at the Caltech Submillimeter Observatory (CSO) and the Owens Valley Radio Observatory (OVRO) have been made over several epochs with a repeatability of 5-10 micrometers over the zenith angle range from 15 to 75 degrees. The measurements are consistent with the calculated gravitational distortions. Several different surface setting strategies are evaluated, and the 'second-order deviation from homology,' Hd, is introduced as a measure of the gravitational degradation that can be expected for an optimally adjusted surface. Hd is defined as half of the RMS difference between the deviations from homology for the telescope pointed at the extremes of its intended sky coverage range. This parameter can be used to compare the expected performance of many different types of telescopes, including off-axis reflectors and slant-axis or polar mounts as well as standard alt-az designs. Subtle asymmetries in a telescope's structure are shown to dramatically affect its performance. The RMS surface error of the Leighton telescope is improved by more than a factor of two when optimized over the positive zenith angle quadrant compared to optimization over the negative quadrant. A global surface optimization algorithm is developed to take advantage of the long-term stability and understanding of the Leighton telescopes. It significantly improves the operational performance of the telescope over that obtained using a simple 'rigging angle' adjustment.
The surface errors for the CSO are now less than 22 micrometers RMS over most of the zenith angle range, and the aperture efficiency at 810 GHz exceeds 33%. This illustrates the usefulness of the global surface optimization procedure.
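The Hd figure of merit defined in the abstract above can be sketched in a few lines (a minimal illustration of the stated definition; the per-node deviation values below are hypothetical, not measured CSO/OVRO data):

```python
import math

def rms(values):
    """Root-mean-square of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def hd(dev_zmin, dev_zmax):
    """Second-order deviation from homology, Hd: half the RMS of the
    pointwise difference between the per-node deviations from homology
    at the two extremes of the intended sky coverage range.
    dev_zmin / dev_zmax: deviations at each support node (same units,
    e.g. micrometers)."""
    diff = [b - a for a, b in zip(dev_zmin, dev_zmax)]
    return 0.5 * rms(diff)

# Hypothetical per-node deviations at the two zenith-angle extremes:
low = [4.0, -2.0, 1.0, 0.0]
high = [6.0, 2.0, -1.0, 2.0]
print(hd(low, high))
```

Because Hd depends only on the *difference* of the two deviation patterns, a surface that deforms identically at both extremes scores zero, matching its role as a measure of unavoidable gravitational degradation for an optimally adjusted surface.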
Stellar Intensity Interferometry: Astrophysical targets for sub-milliarcsecond imaging
Intensity interferometry permits very long optical baselines and the
observation of sub-milliarcsecond structures. Using planned kilometric arrays
of air Cherenkov telescopes at short wavelengths, intensity interferometry may
increase the spatial resolution achieved in optical astronomy by an order of
magnitude, inviting detailed studies of the shapes of rapidly rotating hot
stars with structures in their circumstellar disks and winds, or mapping out
patterns of nonradial pulsations across stellar surfaces. Signal-to-noise in
intensity interferometry favors high-temperature sources and emission-line
structures, and is independent of the optical passband, be it a single spectral
line or the broad spectral continuum. Prime candidate sources have been
identified among classes of bright and hot stars. Observations are simulated
for telescope configurations envisioned for large Cherenkov facilities,
synthesizing numerous optical baselines in software, confirming that
resolutions of tens of microarcseconds are feasible for numerous astrophysical
targets.
Comment: 12 pages, 4 figures; presented at the SPIE conference "Optical and Infrared Interferometry II", San Diego, CA, USA (June 2010)
Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors
This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs). Our work is motivated by the large asymmetry in cache set usage. CE decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at the granularity of groups of cache sets and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed at the cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5% and by as much as 28.5% for the benchmark programs we examined. Furthermore, the evaluations show that CE also outperforms related CMP cache designs.
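The pressure-guided placement policy the abstract describes can be sketched as follows (an illustrative toy, assuming a simple per-group access counter; the class and method names are hypothetical, not the paper's hardware structures):

```python
class PressureTable:
    """Toy model of per-group temporal-pressure counters, as collected
    at the last-level cache and recorded at the memory controller."""

    def __init__(self, num_groups):
        self.pressure = [0] * num_groups

    def record_access(self, group):
        """Accumulate pressure observed at a cache group."""
        self.pressure[group] += 1

    def place(self):
        """Choose the group with minimum observed pressure for an
        incoming block, spreading load away from hot groups."""
        return min(range(len(self.pressure)), key=lambda g: self.pressure[g])

table = PressureTable(4)
for g in [0, 0, 1, 2, 0, 1]:
    table.record_access(g)
print(table.place())  # → 3: the only group with no recorded pressure
```

The point of the decoupling is visible here: the incoming block's address plays no role in the choice, only the recorded pressure does, so blocks migrate away from heavily contended sets.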
Stellar intensity interferometry: Optimizing air Cherenkov telescope array layouts
Kilometric-scale optical imagers seem feasible to realize by intensity
interferometry, using telescopes primarily erected for measuring Cherenkov
light induced by gamma rays. Planned arrays envision 50--100 telescopes,
distributed over some 1--4 km. Although array layouts and telescope sizes
will primarily be chosen for gamma-ray observations, their interferometric
performance may also be optimized. Observations of stellar objects were numerically
simulated for different array geometries, yielding signal-to-noise ratios for
different Fourier components of the source images in the interferometric
(u,v)-plane. Simulations were made for layouts actually proposed for future
Cherenkov telescope arrays, and for subsets with only a fraction of the
telescopes. All large arrays provide dense sampling of the (u,v)-plane due to
the sheer number of telescopes, irrespective of their geographic orientation or
stellar coordinates. However, for improved coverage of the (u,v)-plane and a
wider variety of baselines (enabling better image reconstruction), an exact
east-west grid should be avoided for the numerous smaller telescopes, and
repetitive geometric patterns avoided for the few large ones. Sparse arrays
become severely limited by a lack of short baselines, and to cover
astrophysically relevant dimensions between 0.1--3 milliarcseconds in visible
wavelengths, baselines between pairs of telescopes should cover the whole
interval 30--2000 m.
Comment: 12 pages, 10 figures; presented at the SPIE conference "Optical and Infrared Interferometry II", San Diego, CA, USA (June 2010)
Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential
Emerging computer architectures will feature drastically decreased bytes/flop
(ratio of memory bandwidth to peak processing rate) as highlighted by recent
studies on Exascale architectural trends. Further, flops are getting cheaper
while the energy cost of data movement is increasingly dominant. The
understanding and characterization of data locality properties of computations
is critical in order to guide efforts to enhance data locality. Reuse distance
analysis of memory address traces is a valuable tool to perform data locality
characterization of programs. A single reuse distance analysis can be used to
estimate the number of cache misses in a fully associative LRU cache of any
size, thereby providing estimates on the minimum bandwidth requirements at
different levels of the memory hierarchy to avoid being bandwidth bound.
However, such an analysis only holds for the particular execution order that
produced the trace. It cannot estimate potential improvement in data locality
through dependence preserving transformations that change the execution
schedule of the operations in the computation. In this article, we develop a
novel dynamic analysis approach to characterize the inherent locality
properties of a computation and thereby assess the potential for data locality
enhancement via dependence preserving transformations. The execution trace of a
code is analyzed to extract a computational directed acyclic graph (CDAG) of
the data dependences. The CDAG is then partitioned into convex subsets, and the
convex partitioning is used to reorder the operations in the execution trace to
enhance data locality. The approach enables us to go beyond reuse distance
analysis of a single specific order of execution of the operations of a
computation in characterization of its data locality properties. It can serve a
valuable role in identifying promising code regions for manual transformation,
as well as assessing the effectiveness of compiler transformations for data
locality enhancement. We demonstrate the effectiveness of the approach using a
number of benchmarks, including case studies where the potential shown by the
analysis is exploited to achieve lower data movement costs and better
performance.
Comment: Transactions on Architecture and Code Optimization (2014)
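The reuse distance analysis this abstract builds on can be sketched directly (a minimal, quadratic-time illustration of the classical definition, not the paper's CDAG machinery; the trace below is hypothetical):

```python
def reuse_distances(trace):
    """Reuse distance of each access: the number of distinct addresses
    touched since the previous access to the same address, or None for
    a first-time (compulsory) access. O(n^2) for clarity."""
    last_seen = {}
    dists = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            dists.append(len(set(trace[last_seen[addr] + 1:i])))
        else:
            dists.append(None)
        last_seen[addr] = i
    return dists

def lru_misses(trace, capacity):
    """Misses in a fully associative LRU cache of the given capacity:
    compulsory misses plus every access whose reuse distance is at
    least the capacity. One pass over the distances covers any size."""
    return sum(1 for d in reuse_distances(trace)
               if d is None or d >= capacity)

trace = ["a", "b", "c", "a", "b", "c"]
print(lru_misses(trace, 3))  # → 3: only the compulsory misses
print(lru_misses(trace, 2))  # → 6: every reuse is evicted first
```

This also shows the limitation the paper targets: the distances are a property of this particular access order, so any locality gained by a dependence-preserving reordering of the trace is invisible to the analysis.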