1,120 research outputs found

    Using Intelligent Prefetching to Reduce the Energy Consumption of a Large-scale Storage System

    Get PDF
    Many high performance large-scale storage systems will experience significant workload increases as their user base and content availability grow over time. The U.S. Geological Survey (USGS) Earth Resources Observation and Science (EROS) center hosts one such system that has recently undergone a period of rapid growth as its user population grew nearly 400% in just about three years. When administrators of these massive storage systems face the challenge of meeting the demands of an ever increasing number of requests, the easiest solution is to integrate more advanced hardware to existing systems. However, additional investment in hardware may significantly increase the system cost as well as daily power consumption. In this paper, we present evidence that well-selected software level optimization is capable of achieving comparable levels of performance without the cost and power consumption overhead caused by physically expanding the system. Specifically, we develop intelligent prefetching algorithms that are suitable for the unique workloads and user behaviors of the world\u27s largest satellite images distribution system managed by USGS EROS. Our experimental results, derived from real-world traces with over five million requests sent by users around the globe, show that the EROS hybrid storage system could maintain the same performance with over 30% of energy savings by utilizing our proposed prefetching algorithms, compared to the alternative solution of doubling the size of the current FTP server farm

    An ultra low-power hardware accelerator for automatic speech recognition

    Get PDF
    Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost which is not affordable for the tiny power budget of mobile devices. Hardware acceleration can reduce power consumption of ASR systems, while delivering high-performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, that represents the main bottleneck in an ASR system. The proposed design includes innovative techniques to improve the memory subsystem, since memory is identified as the main bottleneck for performance and power in the design of these accelerators. We propose a prefetching scheme tailored to the needs of an ASR system that hides main memory latency for a large fraction of the memory accesses with a negligible impact on area. In addition, we introduce a novel bandwidth saving technique that removes 20% of the off-chip memory accesses issued during the Viterbi search. The proposed design outperforms software implementations running on the CPU by orders of magnitude and achieves 1.7x speedup over a highly optimized CUDA implementation running on a high-end Geforce GTX 980 GPU, while reducing by two orders of magnitude (287x) the energy required to convert the speech into text.Peer ReviewedPostprint (author's final draft

    A low-power, high-performance speech recognition accelerator

    Get PDF
    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at high energy cost, not being affordable for the tiny power-budgeted mobile devices. Hardware acceleration reduces energy-consumption of ASR systems, while delivering high-performance. In this paper, we present an accelerator for largevocabulary, speaker-independent, continuous speech-recognition. It focuses on the Viterbi search algorithm representing the main bottleneck in an ASR system. The proposed design consists of innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in these accelerators' design. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses, negligibly impacting area. Additionally, we introduce a novel bandwidth-saving technique that removes off-chip memory accesses by 20 percent. Finally, we present a power saving technique that significantly reduces the leakage power of the accelerators scratchpad memories, providing between 8.5 and 29.2 percent reduction in entire power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude, and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on Geforce-GTX-980 GPU, while reducing the energy by 123-454x.Peer ReviewedPostprint (author's final draft

    Architecture for Cooperative Prefetching in P2P Video-on- Demand System

    Full text link
    Most P2P VoD schemes focused on service architectures and overlays optimization without considering segments rarity and the performance of prefetching strategies. As a result, they cannot better support VCRoriented service in heterogeneous environment having clients using free VCR controls. Despite the remarkable popularity in VoD systems, there exist no prior work that studies the performance gap between different prefetching strategies. In this paper, we analyze and understand the performance of different prefetching strategies. Our analytical characterization brings us not only a better understanding of several fundamental tradeoffs in prefetching strategies, but also important insights on the design of P2P VoD system. On the basis of this analysis, we finally proposed a cooperative prefetching strategy called "cooching". In this strategy, the requested segments in VCR interactivities are prefetched into session beforehand using the information collected through gossips. We evaluate our strategy through extensive simulations. The results indicate that the proposed strategy outperforms the existing prefetching mechanisms.Comment: 13 Pages, IJCN

    Hierarchical clustered register file organization for VLIW processors

    Get PDF
    Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the communication is done locally. In this paper, we propose a novel register file organization for VLIW cores that combines clustering with a hierarchical register file organization. Functional units are organized in clusters, each one with a local first level register file. The local register files are connected to a global second level register file, which provides access to memory. All intercluster communications are done through the second level register file. This paper also proposes MIRS-HC, a novel modulo scheduling technique that simultaneously performs instruction scheduling, cluster selection, inserts communication operations, performs register allocation and spill insertion for the proposed organization. The results show that although more cycles are required to execute applications, the execution time is reduced due to a shorter cycle time. In addition, the combination of clustering and hierarchy provides a larger design exploration space that trades-off performance and technology requirements.Peer ReviewedPostprint (published version
    • …
    corecore