
    Physical Representation-based Predicate Optimization for a Visual Analytics Database

    Querying the content of images, video, and other non-textual data sources requires expensive content extraction methods. Modern extraction techniques are based on deep convolutional neural networks (CNNs) and can classify objects within images with astounding accuracy. Unfortunately, these methods are slow: processing a single image can take about 10 milliseconds on modern GPU-based hardware. As massive video libraries become ubiquitous, running a content-based query over millions of video frames is prohibitive. One promising approach to reduce the runtime cost of queries of visual content is to use a hierarchical model, such as a cascade, where simple cases are handled by an inexpensive classifier. Prior work has sought to design cascades that optimize the computational cost of inference by, for example, using smaller CNNs. However, we observe that there are critical factors besides the inference time that dramatically impact the overall query time. Notably, by treating the physical representation of the input image as part of our query optimization (that is, by including image transforms, such as resolution scaling or color-depth reduction, within the cascade), we can optimize data handling costs and enable drastically more efficient classifier cascades. In this paper, we propose Tahoma, which generates and evaluates many potential classifier cascades that jointly optimize the CNN architecture and input data representation. Our experiments on a subset of ImageNet show that Tahoma's input transformations speed up cascades by up to 35 times. We also find up to a 98x speedup over the ResNet50 classifier with no loss in accuracy, and a 280x speedup if some accuracy is sacrificed. (Camera-ready version of the paper, in Proceedings of the 35th IEEE International Conference on Data Engineering, ICDE 2019.)
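
    A minimal sketch of the cascade idea in Python: a cheap classifier runs on a reduced-resolution representation, and the expensive reference model is invoked only for low-confidence inputs. The models, downscaling factor, and threshold below are illustrative assumptions, not Tahoma's actual configurations.

        def downscale(image, factor=4):
            # Physical-representation transform: subsample to cut data volume.
            return image[::factor, ::factor]

        def cascade_classify(image, cheap_model, full_model, threshold=0.9):
            # Stage 1: a small CNN on the cheap, low-resolution representation.
            # cheap_model and full_model are assumed callables that return an
            # array of class probabilities (stand-ins for CNNs of two sizes).
            probs = cheap_model(downscale(image))
            if probs.max() >= threshold:
                return int(probs.argmax())      # confident: skip the big CNN
            # Stage 2: fall back to the full-resolution reference classifier.
            return int(full_model(image).argmax())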

    Computational Sprinting

    Although transistor density continues to increase, voltage scaling has stalled, and thus power density is increasing each technology generation. Particularly in mobile devices, which have limited cooling options, these trends lead to a utilization wall in which sustained chip performance is limited primarily by power rather than area. However, many mobile applications do not demand sustained performance; rather, they comprise short bursts of computation in response to sporadic user activity. To improve responsiveness for such applications, this paper explores activating otherwise powered-down cores for sub-second bursts of intense parallel computation. The approach exploits the concept of computational sprinting, in which a chip temporarily exceeds its sustainable thermal power budget to provide instantaneous throughput, after which the chip must return to nominal operation to cool down. To demonstrate the feasibility of this approach, we analyze the thermal and electrical characteristics of a smartphone-like system that nominally operates a single core (~1W peak) but can sprint with up to 16 cores for hundreds of milliseconds. We describe a thermal design that incorporates phase-change materials to provide the thermal capacitance needed to enable such sprints. We analyze image recognition kernels to show that parallel sprinting has the potential to achieve the task response time of a 16W chip within the thermal constraints of a 1W mobile platform.
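
    A back-of-the-envelope version of the sprint budget, assuming a simple lumped thermal model; the capacitance and temperature headroom below are illustrative numbers, not the paper's measurements.

        # Sprint duration from thermal capacitance: power in excess of the
        # sustainable level is absorbed by the chip's (and phase-change
        # material's) thermal capacitance until the temperature ceiling.
        def sprint_duration_s(p_sprint_w, p_sustain_w, c_thermal_j_per_k, headroom_k):
            excess_w = p_sprint_w - p_sustain_w
            return c_thermal_j_per_k * headroom_k / excess_w

        # Example: a 16 W sprint on a 1 W-sustainable platform with 0.5 J/K of
        # effective capacitance and 10 K of headroom sprints for ~333 ms.
        print(sprint_duration_s(16.0, 1.0, 0.5, 10.0))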

    TurboSMARTS: Accurate microarchitecture simulation sampling in minutes

    Recent research proposes accelerating processor microarchitecture simulation through statistical sampling. Prior simulation sampling approaches construct accurate model state for each measurement by continuously warming large microarchitectural structures (e.g., caches and the branch predictor) while emulating the billions of instructions between measurements. This approach, called functional warming, occupies hours of runtime, while the detailed simulation that is actually measured requires mere minutes. To eliminate the functional warming bottleneck, we propose TurboSMARTS, a simulation framework that stores functionally-warmed state in a library of small, reusable checkpoints. TurboSMARTS enables the creation of the thousands of checkpoints necessary for accurate sampling by storing only the subset of warmed state accessed during simulation of each brief execution window. TurboSMARTS matches the accuracy of prior simulation sampling techniques (i.e., ±3% error with 99.7% confidence) while estimating the performance of an 8-way out-of-order superscalar processor running SPEC CPU2000 in 91 seconds per benchmark, on average, using a 12 GB checkpoint library.
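
    A sketch of what a checkpoint might record, under the abstract's key assumption that only the state actually touched during a brief execution window needs to be stored; the class layout and tracing interface are hypothetical.

        from dataclasses import dataclass, field

        @dataclass
        class Checkpoint:
            start_pc: int
            arch_state: dict                                    # registers + touched pages
            cache_lines: dict = field(default_factory=dict)     # warmed tag/data entries
            branch_entries: dict = field(default_factory=dict)  # warmed predictor rows

        def make_checkpoint(window, tracer):
            # tracer is an assumed functional-warming run that records only the
            # state this short window actually reads; that subset is all we store.
            touched = tracer.run(window)
            return Checkpoint(window.start_pc, touched.arch_state,
                              touched.cache_lines, touched.branch_entries)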

    An Evaluation of Stratified Sampling of Microarchitecture Simulations

    Recent research advocates applying sampling to accelerate microarchitecture simulation. Simple random sampling offers accurate performance estimates (with a high quantifiable confidence) by taking a large number (e.g., 10,000) of short performance measurements over the full length of a benchmark. Simple random sampling does not exploit the often repetitive behaviors of benchmarks, and so collects many redundant measurements. By identifying repetitive behaviors, we can apply stratified random sampling to achieve the same confidence as simple random sampling with far fewer measurements. Our oracle limit study of optimal stratified sampling of SPEC2K benchmarks demonstrates an opportunity to reduce the required number of measurements by 43x over simple random sampling. Using our oracle results as a basis for comparison, we evaluate two practical approaches for selecting strata, program phase detection and IPC profiling. Program phase detection is attractive because it is microarchitecture-independent, while IPC profiling directly minimizes stratum variance, thereby minimizing sample size. Unfortunately, our results indicate that: (1) program phase stratification falls far short of the optimal opportunity, (2) IPC profiling requires expensive microarchitecture-specific analysis, and (3) both methods require large sampling unit sizes to make strata selection feasible, offsetting reductions in sample size. We conclude that, without better stratification approaches, stratified sampling does not provide a clear advantage over simple random sampling.
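
    The variance argument behind stratification, worked with made-up numbers: simple random sampling pays for within-stratum plus between-stratum variance, while stratified sampling under Neyman allocation needs a sample size proportional only to the square of the weighted per-stratum standard deviations.

        import numpy as np

        # Two illustrative program phases (strata) with different IPC behavior.
        w     = np.array([0.7, 0.3])    # fraction of execution per stratum
        mu    = np.array([1.0, 2.0])    # per-stratum mean IPC
        sigma = np.array([0.05, 0.40])  # per-stratum IPC standard deviation

        # Simple random sampling sees within- plus between-stratum variance.
        mu_all  = np.sum(w * mu)
        var_srs = np.sum(w * (sigma**2 + (mu - mu_all)**2))

        # Neyman allocation needs n proportional to (sum of w * sigma)^2.
        var_strat = np.sum(w * sigma)**2

        print(var_srs / var_strat)      # ~10.8x fewer measurements here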

    Sonic Millip3De with Dynamic Receive Focusing and Apodization Optimization

    3D ultrasound is becoming common for noninvasive medical imaging because of its accuracy, safety, and ease of use. However, the extreme computational requirements (and associated power requirements) of image formation for a large 3D system have, to date, precluded hand-held 3D-capable devices. Sonic Millip3De is a recently proposed hardware design that leverages modern computer architecture techniques, such as 3D die stacking, massive parallelism, and streaming data flow, to enable high-resolution synthetic aperture 3D ultrasound imaging in a single, low-power chip. In this paper, we enhance Sonic Millip3De with a new virtual source firing sequence and a dynamic receive focusing scheme to optimize receive apertures in multiple depth focal zones. These enhancements further reduce power requirements while maintaining image quality over a large depth range. We present image quality analysis using Field II simulations of cysts in tissue at varying depths to show that our methods do not degrade contrast-to-noise ratio (CNR) relative to an ideal system with no power constraints. Then, using an RTL-level design for an industrial 45nm ASIC process, we demonstrate 3D synthetic aperture imaging with a 120x88 transducer array within a 15W full-system power budget (400x less than a conventional DSP solution). We project that continued semiconductor scaling will enable a sub-5W power budget in 16nm technology.
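
    The receive-side computation being accelerated is, at its core, delay-and-sum beamforming with per-channel apodization; recomputing the delays for every image point is what makes the focusing dynamic. A textbook sketch for a single virtual source (not the paper's streaming hardware pipeline):

        import numpy as np

        def delay_and_sum(rf, elem_pos, src_pos, point, fs, c=1540.0, apod=None):
            # rf: (channels, samples) echo data; positions in meters;
            # fs: sample rate in Hz; c: speed of sound in tissue (m/s).
            apod = np.ones(rf.shape[0]) if apod is None else apod
            t_tx = np.linalg.norm(point - src_pos) / c           # source -> point
            value = 0.0
            for ch in range(rf.shape[0]):
                t_rx = np.linalg.norm(point - elem_pos[ch]) / c  # point -> element
                idx = int(round((t_tx + t_rx) * fs))             # round-trip delay
                if idx < rf.shape[1]:
                    value += apod[ch] * rf[ch, idx]              # apodized sum
            return value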

    Simulation sampling with live-points

    Current simulation-sampling techniques construct accurate model state for each measurement by continuously warming large microarchitectural structures (e.g., caches and the branch predictor) while functionally simulating the billions of instructions between measurements. This approach, called functional warming, is the main performance bottleneck of simulation sampling and requires hours of runtime, while the detailed simulation of the sample requires only minutes. Existing simulators can avoid functional simulation by jumping directly to particular instruction stream locations with architectural state checkpoints. To replace functional warming, these checkpoints must additionally provide microarchitectural model state that is accurate and reusable across experiments while meeting tight storage constraints. In this paper, we present a simulation-sampling framework that replaces functional warming with live-points without sacrificing accuracy. A live-point stores the bare minimum of functionally-warmed state needed for accurate simulation of a limited execution window while placing minimal restrictions on microarchitectural configuration. Live-points can be processed in random rather than program order, allowing simulation results and their statistical confidence to be reported while simulations are in progress. Our framework matches the accuracy of prior simulation-sampling techniques (i.e., ±3% error with 99.7% confidence), while estimating the performance of an 8-way out-of-order superscalar processor running SPEC CPU2000 in 91 seconds per benchmark, on average, using a 12 GB live-point library.
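
    A sketch of the random-order property: because live-points are independent, they can be processed in any order while a running mean and confidence interval are reported, stopping early once the target is met. The simulate() callback and the 3-sigma stopping rule are illustrative.

        import math, random

        def run_sample(live_points, simulate, rel_target=0.03):
            random.shuffle(live_points)        # random, not program, order
            n, mean, m2 = 0, 0.0, 0.0
            for lp in live_points:
                x = simulate(lp)               # e.g., CPI of one short window
                n += 1
                delta = x - mean               # Welford's running statistics
                mean += delta / n
                m2 += delta * (x - mean)
                if n >= 30:
                    half = 3.0 * math.sqrt(m2 / (n - 1) / n)   # ~99.7% CI
                    if half <= rel_target * mean:
                        break                  # confidence target met early
            return mean, n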

    Making Address-Correlated Prefetching Practical

    Despite a decade of research demonstrating its efficacy, address-correlated prefetching has never been implemented in a shipping processor because it requires megabytes of metadata, too large to store practically on chip. New storage-, latency-, and bandwidth-efficient mechanisms for storing metadata off chip yield a practical design that achieves 90 percent of the performance potential of idealized on-chip metadata storage.
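
    For reference, the underlying mechanism is pairwise address correlation: record which miss follows which, then replay the learned chain as prefetches. A dictionary-based sketch; the design's actual contribution is storing this table off chip in a compact, bandwidth-friendly layout, which the sketch ignores.

        class AddressCorrelator:
            def __init__(self, depth=4):
                self.last_miss = None
                self.next_addr = {}      # miss address -> most recent successor
                self.depth = depth

            def on_miss(self, addr):
                # Learn the (previous miss -> current miss) correlation pair.
                if self.last_miss is not None:
                    self.next_addr[self.last_miss] = addr
                self.last_miss = addr
                # Predict by walking the successor chain a few steps ahead.
                prefetches, cur = [], addr
                for _ in range(self.depth):
                    cur = self.next_addr.get(cur)
                    if cur is None:
                        break
                    prefetches.append(cur)
                return prefetches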

    Temporal Streaming of Shared Memory

    Coherent read misses in shared-memory multiprocessors account for a substantial fraction of execution time in many important scientific and commercial workloads. We propose Temporal Streaming to eliminate coherent read misses by streaming data to a processor in advance of the corresponding memory accesses. Temporal streaming dynamically identifies address sequences to be streamed by exploiting two common phenomena in shared-memory access patterns: (1) temporal address correlation: groups of shared addresses tend to be accessed together and in the same order, and (2) temporal stream locality: recently-accessed address streams are likely to recur. We present a practical design for temporal streaming. We evaluate our design using a combination of trace-driven and cycle-accurate full-system simulation of a cache-coherent distributed shared-memory system. We show that temporal streaming can eliminate 98% of coherent read misses in scientific applications, and between 43% and 60% in database and web server workloads. Our design yields speedups of 1.07 to 3.29 in scientific applications, and 1.06 to 1.21 in commercial workloads.
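
    A sketch of both phenomena in a few lines: keep a global log of coherent read-miss addresses, and on a new miss, replay the addresses that followed the previous occurrence of the same address. Structure names and the window length are illustrative, not the paper's hardware design.

        class TemporalStreamer:
            def __init__(self, window=8):
                self.history = []     # global log of miss addresses, in order
                self.last_pos = {}    # address -> index of its last occurrence
                self.window = window

            def on_coherent_read_miss(self, addr):
                stream = []
                pos = self.last_pos.get(addr)
                if pos is not None:
                    # Replay the addresses that followed addr last time (stream
                    # locality); they tend to recur in order (address correlation).
                    stream = self.history[pos + 1 : pos + 1 + self.window]
                self.last_pos[addr] = len(self.history)
                self.history.append(addr)
                return stream         # candidates to stream ahead of the accesses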

    Memory coherence activity prediction in commercial workloads

    Recent research indicates that prediction-based coherence optimizations offer substantial performance improvements for scientific applications in distributed shared memory multiprocessors. Important commercial applications also show sensitivity to coherence latency, which will become more acute in the future as technology scales. It is therefore important to investigate prediction of memory coherence activity in the context of commercial workloads. This paper studies a trace-based Downgrade Predictor (DGP) for predicting last stores to shared cache blocks, and a pattern-based Consumer Set Predictor (CSP) for predicting subsequent readers. We evaluate this class of predictors for the first time on commercial applications and demonstrate that our DGP correctly predicts 47%-76% of last stores. Memory sharing patterns in commercial workloads are inherently non-repetitive; hence, CSP cannot attain high coverage. We perform an opportunity study of a DGP enhanced through competitive underlying predictors, and demonstrate, in both commercial and scientific applications, the potential to increase coverage by up to 14%.
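
    A simplified picture of a trace-based DGP: hash the sequence of store program counters that wrote a block into a signature, and learn which signatures historically marked the block's last store before a downgrade. The signature encoding and table layout here are assumptions for illustration, not the paper's predictor.

        class DowngradePredictor:
            def __init__(self):
                self.trace = {}       # block -> running signature of its store PCs
                self.was_last = {}    # signature -> did writes end here before?

            def on_store(self, block, pc):
                sig = hash((self.trace.get(block, 0), pc)) & 0xFFFF
                self.trace[block] = sig
                # Predict a downgrade if this signature ended the writes before.
                return self.was_last.get(sig, False)

            def on_downgrade(self, block):
                # Train: the block's signature at downgrade marked its last store.
                sig = self.trace.pop(block, None)
                if sig is not None:
                    self.was_last[sig] = True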