382 research outputs found

    An ultra low-power hardware accelerator for automatic speech recognition

    Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost that the tiny power budget of mobile devices cannot afford. Hardware acceleration can reduce the power consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which represents the main bottleneck in an ASR system. The proposed design includes innovative techniques to improve the memory subsystem, since memory is identified as the main bottleneck for performance and power in the design of these accelerators. We propose a prefetching scheme tailored to the needs of an ASR system that hides main memory latency for a large fraction of the memory accesses with a negligible impact on area. In addition, we introduce a novel bandwidth-saving technique that removes 20% of the off-chip memory accesses issued during the Viterbi search. The proposed design outperforms software implementations running on the CPU by orders of magnitude and achieves a 1.7x speedup over a highly optimized CUDA implementation running on a high-end GeForce GTX 980 GPU, while reducing the energy required to convert the speech into text by two orders of magnitude (287x). Peer Reviewed. Postprint (author's final draft)
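For readers unfamiliar with the Viterbi search named in this abstract, the following is a minimal textbook sketch in Python, not the accelerator's design. It decodes a small hidden Markov model in the log domain. The function names `viterbi` and `to_log`, the two-state model, and all probabilities are hypothetical and chosen only for illustration; production ASR decoders typically search far larger graphs with pruning. The sketch assumes a non-empty observation sequence.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (log-probability, state path) of the most likely state sequence."""
    # best[s] holds the score and path of the best sequence ending in state s.
    best = {
        s: (log_start[s] + log_emit[s][observations[0]], [s]) for s in states
    }
    for obs in observations[1:]:
        nxt = {}
        for s in states:
            # Choose the predecessor that maximizes the score of reaching s.
            prev = max(states, key=lambda p: best[p][0] + log_trans[p][s])
            score = best[prev][0] + log_trans[prev][s] + log_emit[s][obs]
            nxt[s] = (score, best[prev][1] + [s])
        best = nxt
    final = max(states, key=lambda s: best[s][0])
    return best[final]


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {
        k: to_log(v) if isinstance(v, dict) else math.log(v)
        for k, v in table.items()
    }


# Hypothetical toy model: two hidden states, three observation symbols.
states = ["A", "B"]
start = to_log({"A": 0.6, "B": 0.4})
trans = to_log({"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}})
emit = to_log({
    "A": {"x": 0.5, "y": 0.4, "z": 0.1},
    "B": {"x": 0.1, "y": 0.3, "z": 0.6},
})

log_prob, path = viterbi(["x", "y", "z"], states, start, trans, emit)
print(path, math.exp(log_prob))
```

Every time step reads the scores of all predecessor states, which gives some intuition for why memory traffic, rather than arithmetic, tends to dominate this kind of search.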

    A low-power, high-performance speech recognition accelerator

    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
    Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost that power-constrained mobile devices cannot afford. Hardware acceleration reduces the energy consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which represents the main bottleneck in an ASR system. The proposed design includes innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in the design of these accelerators. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses with negligible impact on area. Additionally, we introduce a novel bandwidth-saving technique that reduces off-chip memory accesses by 20 percent. Finally, we present a power-saving technique that significantly reduces the leakage power of the accelerator's scratchpad memories, providing between 8.5 and 29.2 percent reduction in total power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude, and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on a GeForce GTX 980 GPU, while reducing the energy by 123-454x. Peer Reviewed. Postprint (author's final draft)

    Dynamic Hardware Resource Management for Efficient Throughput Processing.

    High performance computing is evolving at a rapid pace, with throughput-oriented processors such as graphics processing units (GPUs) replacing traditional processors as the computational workhorse. Their adoption has increased tremendously because they provide high peak performance and energy efficiency while maintaining a friendly programming interface. Furthermore, many existing desktop, laptop, tablet, and smartphone systems support accelerating non-graphics, data-parallel workloads on their GPUs. However, the multitude of systems that use GPUs as accelerators run different genres of data-parallel applications, which have significantly contrasting runtime characteristics. GPUs use thousands of identical threads to efficiently exploit the on-chip hardware resources. Therefore, if one thread uses a resource (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource. This contention for the bottleneck resource will eventually saturate the performance of the GPU, leaving other resources underutilized at the same time. Traditional policies for managing the massive hardware resources work adequately on well-designed, traditional scientific-style applications. However, these static policies, which are oblivious to the application's resource requirements, are not efficient for the large spectrum of data-parallel workloads with varying resource requirements. Therefore, several standard hardware policies, such as using maximum concurrency, a fixed operational frequency, and round-robin style scheduling, are not efficient for modern GPU applications. This thesis defines dynamic hardware resource management mechanisms which improve the efficiency of the GPU by regulating the hardware resources at runtime. The first step in successfully achieving this goal is to make the hardware aware of the application's characteristics at runtime through novel counters and indicators. 
After this detection, dynamic hardware modulation provides opportunities for increased performance, improved energy consumption, or both, leading to efficient execution. The key mechanisms for modulating the hardware at runtime are dynamic frequency regulation, managing the amount of concurrency, managing the order of execution among different threads, and increasing cache utilization. The resulting increase in efficiency will improve the energy consumption of systems that use GPUs while maintaining or improving their performance. PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113356/1/asethia_1.pd

    Compiling vector pascal to the XeonPhi

    The XeonPhi is a highly parallel x86 architecture chip made by Intel. It has a number of novel features which make it a particularly challenging target for the compiler writer. This paper describes the techniques used to port the Glasgow Vector Pascal Compiler to this architecture and assesses its performance by comparing the XeonPhi with 3 other machines running the same algorithms.

    Audio on the GPU: Real-Time Time Domain Audio Convolution on Graphics Cards

    The architecture of CPUs has shifted in recent years from increased speed to more cores on the chip. With this change, more developers are focusing on parallelism; however, many developers have not taken advantage of a common hardware component that specializes in parallel applications: the Graphics Processing Unit (GPU). By writing code to execute on GPUs, developers have been able to gain increased performance over the traditional CPU in many problem domains, including signal processing. Time domain convolution is an important component of signal processing. Currently, the fastest way to perform convolution is frequency domain multiplication. However, that approach is more complex, and inconsistencies such as missing data are difficult to resolve in the frequency domain. It has been shown that executing frequency domain multiplication on GPUs improves performance, but there is no research on time domain convolution on GPUs. This thesis provides two algorithms that implement time domain convolution on GPUs: one computes the convolution all at once, and the other is designed for real-time computation and playback of the results. The results from this thesis indicate that using the GPU significantly reduces processing time for time domain convolution.
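As a point of reference for the time domain convolution discussed in this abstract, here is a minimal CPU sketch in plain Python of the direct definition y[n] = sum over k of h[k] * x[n - k]. It is not either of the thesis's GPU algorithms; the function name `convolve_time_domain` and the sample values are hypothetical.

```python
def convolve_time_domain(signal, kernel):
    """Full linear convolution computed directly in the time domain."""
    if not signal or not kernel:
        return []
    n_out = len(signal) + len(kernel) - 1
    out = [0.0] * n_out
    for n in range(n_out):
        # Only kernel taps whose matching signal index n - k is in range.
        k_lo = max(0, n - len(signal) + 1)
        k_hi = min(n, len(kernel) - 1)
        acc = 0.0
        for k in range(k_lo, k_hi + 1):
            acc += kernel[k] * signal[n - k]
        out[n] = acc
    return out


result = convolve_time_domain([1.0, 2.0, 3.0], [0.0, 1.0, 0.5])
print(result)
```

Each output sample depends only on the inputs and not on any other output, so the outer loop could in principle be assigned one iteration per GPU thread.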

    Durability of Wireless Charging Systems Embedded Into Concrete Pavements for Electric Vehicles

    Point clouds are widely used in applications such as 3D modeling, geospatial analysis, and robotics. One of the key advantages of 3D point cloud data is that, unlike other data formats such as texture, it is independent of viewing angle, surface type, and parameterization. Since each point in a point cloud is independent of the others, point clouds are the most suitable source of data for tasks like object recognition, scene segmentation, and reconstruction. Point clouds are complex and verbose because of the numerous attributes they contain, many of which are not always necessary for rendering, which makes retrieving and parsing them a heavy task. As sensors become more precise and popular, effectively streaming, processing, and rendering the data is also becoming more challenging. In a hierarchical continuous LOD system, the previously fetched and rendered data for a region may become unavailable when revisiting it. To address this, we use a non-persistent cache based on a hash map that stores the parsed point attributes. This still has limitations: the dataset must be refetched and reprocessed if the tab or browser is closed and reopened, which persistent caching can address. On the web, persistent caching commonly involves storing data in server memory or in an intermediate caching server such as Redis. This is not suitable for point cloud data, where large amounts of parsed and processed point data must be stored, so point cloud visualization relies only on non-persistent caching. The thesis aims to contribute toward better performance and suitability of point cloud rendering on the web by reducing the number of read requests to the remote file. We achieve this by applying a client-side LRU Cache and Private File Open Space as a combination of persistent and non-persistent caching of data. We use a cloud-optimized data format, which is better suited for the web and for streaming hierarchical data structures. 
Our focus is to improve rendering performance using WebGPU by reducing access time and minimizing the amount of data loaded into the GPU. Preliminary results indicate that our approach significantly improves rendering performance and reduces network requests when compared to traditional caching methods using WebGPU.
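The LRU (least recently used) cache mentioned in this abstract can be illustrated with a short, generic Python sketch. This is not the thesis's browser-side implementation; the class name `LRUCache`, the capacity, and the keys (stand-ins for identifiers of parsed point-cloud nodes) are all hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None  # miss: the caller would fetch and parse the data
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(capacity=2)
cache.put("node-0", "parsed attributes for node 0")
cache.put("node-1", "parsed attributes for node 1")
cache.get("node-0")  # touching node-0 makes node-1 the eviction candidate
cache.put("node-2", "parsed attributes for node 2")
```

A cache like this lives only in memory, so it corresponds to the non-persistent half of the combination the abstract describes.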

    Photo Based 3D Walkthrough

    The objective of 'Photo Based 3D Walkthrough' is to understand how image-based rendering technology is used to create virtual environments and to develop a prototype system capable of providing a real-time 3D walkthrough experience using only 2D images. Photorealism has always been an aim of computer graphics in virtual environments. Traditional graphics requires a great amount of work and time to construct a detailed 3D model and scene. On top of the tedious work of constructing the 3D models and scenes, a lot of effort must be put into rendering them to enhance the level of realism. Traditional geometry-based rendering systems fall short of simulating the visual realism of a complex environment and are unable to capture and store a sampled representation of a large environment with complex lighting and visibility effects. Thus, creating a virtual walkthrough of a complex real-world environment remains one of the most challenging problems in computer graphics. Because of the various disadvantages of traditional graphics and geometry-based rendering systems, image-based rendering (IBR) has recently been introduced to overcome these problems. In this project, research will be carried out to create an IBR virtual walkthrough using only OpenGL and a C++ program, without the use of any game engine or QuickTime VR function. Normal photographs (not panoramic photographs) are used as the source material for creating the virtual scene, and the keyboard is used as the main navigation tool in the virtual environment. The quality of the virtual walkthrough prototype constructed is good, with just a little jerkiness.

    Developing a compiler for the XeonPhi (TR-2014-341)

    The XeonPhi is a highly parallel x86 architecture chip made by Intel. It has a number of novel features which make it a particularly challenging target for the compiler writer. This paper describes the techniques used to port the Glasgow Vector Pascal Compiler (VPC) to this architecture and assesses its performance by comparing the XeonPhi with 3 other machines running the same algorithms.