32 research outputs found

    Performance analysis and optimization of automatic speech recognition

    Fast and accurate Automatic Speech Recognition (ASR) is emerging as a key application for mobile devices. Delivering ASR on such devices is challenging due to the compute-intensive nature of the problem and the power constraints of embedded systems. In this paper, we provide a performance and energy characterization of Pocketsphinx, a popular toolset for ASR that targets mobile devices. We identify the computation of the Gaussian Mixture Model (GMM) as the main bottleneck, consuming more than 80 percent of the execution time. The CPI stack analysis shows that branches and main memory accesses are the main performance limiting factors for GMM computation. We propose several software-level optimizations driven by the power/performance analysis. Unlike previous proposals that trade accuracy for performance by reducing the number of Gaussians evaluated, we maintain accuracy and improve performance by effectively using the underlying CPU microarchitecture. First, we use a refactored implementation of the innermost loop of the GMM evaluation code to ameliorate the impact of branches. Second, we exploit the vector unit available on most modern CPUs to boost GMM computation, introducing a novel memory layout for storing the means and variances of the Gaussians in order to maximize the effectiveness of vectorization. Third, we compute the Gaussians for multiple frames in parallel, so means and variances can be fetched once into the on-chip caches and reused across multiple frames, significantly reducing memory bandwidth usage. We evaluate our optimizations using both hardware counters on real CPUs and simulations. Our experimental results show that the proposed optimizations provide a 2.68x speedup over the baseline Pocketsphinx decoder on a high-end Intel Skylake CPU, while achieving 61 percent energy savings. On a modern ARM Cortex-A57 mobile processor our techniques improve performance by 1.85x, while providing 59 percent energy savings without any loss in the accuracy of the ASR system.
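    As a rough illustration of the memory-layout and multi-frame batching ideas summarized above, the sketch below (our own simplified NumPy code, not Pocketsphinx's SIMD implementation; all names are illustrative) scores a batch of frames against a diagonal-covariance GMM whose means and inverse variances are stored contiguously, so they are read once and reused across the whole batch of frames:

```python
import numpy as np

def gmm_log_likelihoods(frames, means, inv_vars, log_weights):
    """Score a batch of frames against one diagonal-covariance GMM.

    frames      : (T, D)  batch of T feature frames
    means       : (M, D)  per-Gaussian means (one contiguous row per Gaussian)
    inv_vars    : (M, D)  per-Gaussian 1/variance
    log_weights : (M,)    log mixture weights with the Gaussian constants folded in
    returns     : (T,)    log-likelihood of each frame under the mixture
    """
    # (T, 1, D) - (1, M, D) -> (T, M, D): all frames against all Gaussians at once,
    # so means/inv_vars are fetched once and reused across the whole frame batch.
    diff = frames[:, None, :] - means[None, :, :]
    # Branch-free inner computation of the per-Gaussian log densities.
    log_dens = log_weights[None, :] - 0.5 * np.einsum('tmd,md,tmd->tm',
                                                      diff, inv_vars, diff)
    # Log-sum-exp over the mixture components.
    m = log_dens.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_dens - m).sum(axis=1, keepdims=True))).squeeze(1)
```

    In the optimized decoder described above, the same idea is expressed with explicit vector instructions and a branch-free inner loop rather than NumPy broadcasting.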

    Full covariance Gaussian mixture models evaluation on GPU


    Parallel implementation of Artificial Neural Network training for speech recognition

    In this paper we describe the implementation of a complete ANN training procedure using the block mode back-propagation learning algorithm for sequential patterns – such as the observation feature vectors of a speech recognition system – exploiting the high-performance SIMD architecture of the GPU using CUDA and its C-like language interface. We also compare this with the speed-up obtained by implementing the training procedure taking advantage only of the multi-thread capabilities of multi-core processors. In our implementation we take into account all the peculiar aspects of training on large-scale sequential patterns, in particular the re-segmentation of the training sentences, the block size for the feed-forward and back-propagation steps, and the transfer of huge amounts of data from host memory to the GPU card. Our approach has been tested by training acoustic models for large vocabulary speech recognition tasks, showing a six-fold reduction in the time required to train real-world, large-size networks with respect to an already optimized implementation using the Intel MKL libraries. Thanks to these optimizations and to the support of the GPU, the training time for a language with a huge set of training sentences (about one million for Italian) can be reduced from approximately a month to 5 days.
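    As a hedged sketch of what block-mode back-propagation looks like for a single hidden layer (plain NumPy here in place of the paper's CUDA kernels; the network shape and names are purely illustrative), a whole block of feature frames is pushed forward and back-propagated as dense matrix products, which is what maps well onto SIMD and GPU hardware:

```python
import numpy as np

def block_backprop_step(X, Y, W1, b1, W2, b2, lr=0.01):
    """One block-mode update of a 1-hidden-layer MLP.

    X : (block_size, n_in) block of feature frames
    Y : (block_size, n_out) one-hot targets for the block
    """
    # Forward pass over the whole block as dense matrix products.
    H = np.tanh(X @ W1 + b1)                      # hidden activations
    logits = H @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)

    # Backward pass (cross-entropy loss), again over the whole block at once.
    dlogits = (P - Y) / len(X)
    dW2 = H.T @ dlogits
    dH = (dlogits @ W2.T) * (1.0 - H ** 2)        # tanh derivative
    dW1 = X.T @ dH

    # Gradient step.
    W1 -= lr * dW1; b1 -= lr * dH.sum(axis=0)
    W2 -= lr * dW2; b2 -= lr * dlogits.sum(axis=0)
    return W1, b1, W2, b2
```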

    Improving Eye Motion Sequence Recognition Using Electrooculography Based on Context-Dependent HMM

    Eye motion-based human-machine interfaces are used to provide a means of communication for those who can move nothing but their eyes because of injury or disease. To detect eye motions, electrooculography (EOG) is used. For efficient communication, the input speed is critical. However, it is difficult for conventional EOG recognition methods to accurately recognize fast, sequentially input eye motions because adjacent eye motions influence each other. In this paper, we propose a context-dependent hidden Markov model (HMM)-based EOG modeling approach that uses separate models for identical eye motions with different contexts. Because the influence of adjacent eye motions is explicitly modeled, higher recognition accuracy is achieved. Additionally, we propose a method of user adaptation based on a user-independent EOG model to investigate the trade-off between recognition accuracy and the amount of user-dependent data required for HMM training. Experimental results show that when the proposed context-dependent HMMs are used, the character error rate (CER) is significantly reduced compared with the conventional baseline under user-dependent conditions, from 36.0% to 1.3%. Although the CER increases again to 17.3% when the context-dependent but user-independent HMMs are used, it can be reduced to 7.3% by applying the proposed user adaptation method.
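    A minimal sketch of the context-dependent modelling idea (our own illustration, not the authors' implementation): each eye motion is relabelled with its left and right neighbours, triphone-style, so that identical motions occurring in different contexts are trained as separate HMMs:

```python
def to_context_dependent(motions, boundary="sil"):
    """Expand a sequence of eye-motion labels into context-dependent units.

    Hypothetical label scheme in "left-center+right" notation: the second
    motion in ["up", "right", "up"] becomes "up-right+up", so it gets its
    own model distinct from "right" in any other context.
    """
    padded = [boundary] + list(motions) + [boundary]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

# Example: to_context_dependent(["up", "right", "up"])
# -> ["sil-up+right", "up-right+up", "right-up+sil"]
```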

    Convolutional Neural Networks and Gaussian Processes for Analysing Sensor Data (Konvoluutioneuroverkot ja Gaussiset prosessit sensoridatan analysoimiseen)

    Different sensors are constantly collecting information about us and our surroundings, such as pollution levels or heart rates. This results in long sequences of noisy time series observations, often also referred to as signals. This thesis develops machine learning methods for analysing such sensor data. The motivation behind the work is based on three real-world applications. In the first, the goal is to improve Wi-Fi networks and recognise devices causing interference from spectral data measured by a spectrum analyser. The second uses ultrasound signals propagated through different paths to localise objects inside closed containers, such as fouling inside industrial pipelines. In the third, the goal is to model a car engine and its emissions. Machine learning builds models of complex systems based on a set of observations. We develop models that are designed for analysing time series data, and we build on existing work on two different models: convolutional neural networks (CNNs) and Gaussian processes (GPs). We show that CNNs are able to automatically recognise useful patterns in both 1D and 2D signal data, even when we use a chaotic cavity to scatter waves randomly in order to increase the acoustic aperture. We show how GPs can be used when the observations can be interpreted as integrals over some region, and how we can introduce a non-negativity constraint in such cases. We also show how Gaussian process state space models can be used to learn long- and short-term effects simultaneously by training the model with different resolutions of the data. The amount of data in our case studies is limited, as the datasets have been collected manually using a limited number of sensors. This adds additional challenges to modeling, and we have used different approaches to cope with limited data. GPs as a model are well suited for small data, as they are able to naturally model uncertainties. We also show how a dataset can be collected so that it contains as much information as possible with the limited resources available in cases where we use GPs with integral observations. CNNs in general require large datasets, but we show how we can augment labeled data with unlabeled data by taking advantage of the continuity in sensor data.
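    As a small sketch of the integral-observation idea for GPs (our own simplified 1-D illustration under stated assumptions, not the thesis code): when each observation is the integral of the latent function over a region, the covariance between two observations is the double integral of the kernel over the two regions, which can be approximated numerically:

```python
import numpy as np

def rbf(x, y, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on scalar inputs (broadcasts over arrays)."""
    return variance * np.exp(-0.5 * (x - y) ** 2 / lengthscale ** 2)

def integral_kernel(region_a, region_b, n=50, **kw):
    """Covariance between two integral observations of a GP.

    Assumes 1-D regions given as (start, end). If y_A is the integral of f
    over A, then Cov(y_A, y_B) is the double integral of k(s, t) over A x B;
    here it is approximated by averaging the kernel over a grid of sample
    points and scaling by the region lengths (a crude quadrature).
    """
    a0, a1 = region_a
    b0, b1 = region_b
    s = np.linspace(a0, a1, n)
    t = np.linspace(b0, b1, n)
    K = rbf(s[:, None], t[None, :], **kw)
    return K.mean() * (a1 - a0) * (b1 - b0)
```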

    System Design for Intelligent Web Services

    The devices and software systems we interact with on a daily basis are more intelligent than ever. The computing required to deliver these experiences for end-users is hosted in Warehouse Scale Computers (WSCs), where intelligent web services are employed to process user images, speech, and text. These intelligent web services are emerging as one of the fastest growing classes of web services. Given that the expectation of users moving forward is an experience that uses intelligent web services, the demand for this type of processing is only going to increase. However, today's cloud infrastructures, tuned for traditional workloads such as Web Search and social networks, are not adequately equipped to sustain this increase in demand. This dissertation shows that applications that use intelligent web service processing on the path of a single query require orders of magnitude more computational resources than traditional Web Search. Intelligent web services use large pretrained machine learning models to process image, speech, and text based inputs and generate a prediction. As this dissertation investigates, we find that hosting intelligent web services in today's infrastructures exposes three critical problems: 1) current infrastructures are computationally inadequate to host this new class of services, 2) system designers are unaware of the bottlenecks exposed by these services and the implications on future designs, and 3) the rapid algorithmic churn of these intelligent services deprecates current designs at an even faster rate. This dissertation investigates and addresses each of these problems. After building a representative workload to show the computational resources required by an application composed of three intelligent web services, this dissertation first argues that hardware acceleration is required on the path of a query to sustain demand moving forward. We show that GPU- and FPGA-accelerated servers can improve the query latency on average by 10x and 16x. Leveraging the latency reduction, GPU- and FPGA-accelerated servers reduce the Total Cost of Ownership (TCO) by 2.6x and 1.4x, respectively. Second, we focus on Deep Neural Networks (DNNs), a state-of-the-art algorithm for intelligent web services, and design a DNN-as-a-Service infrastructure enabling application-agnostic acceleration and a single point of optimization. We identify compute bottlenecks that inform the design of a Graphics Processing Unit (GPU) based system; addressing the compute bottlenecks translates to a throughput improvement of 133x across seven DNN based applications. GPU-enabled datacenters show a TCO improvement over CPU-only designs of 4-20x. Finally, we design a runtime system based on a GPU-equipped server that improves on current systems by accounting for recent advances in intelligent web service algorithms. Specifically, we identify asynchronous processing as key to accelerating dynamically configured intelligent services. We achieve on average 7.6x throughput improvements over an optimized CPU baseline and 2.8x over the current GPU system. By thoroughly addressing these problems, we produce designs for WSCs that are equipped to handle the future demand for intelligent web services. The investigations in this thesis address significant computational bottlenecks and lead to system designs that are more efficient and cost-effective for this new class of web services.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/137055/1/jahausw_1.pd
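    The asynchronous, dynamically batched processing highlighted above can be sketched roughly as follows (a toy Python loop under our own assumptions, not the dissertation's actual runtime system): requests arriving on a queue are grouped until the batch fills or a short deadline expires, and the whole batch is handed to the accelerator in a single call, trading a little latency for much higher throughput:

```python
import queue
import time

def batching_worker(request_q, run_model, max_batch=32, max_wait_ms=5.0):
    """Toy dynamic-batching loop for serving model queries.

    Each request is assumed to be a (input, future) pair, where `future` is
    any object with a set_result() method (e.g. concurrent.futures.Future).
    `run_model` is assumed to take a list of inputs and return outputs in order.
    """
    while True:
        batch = [request_q.get()]                    # block for the first request
        deadline = time.monotonic() + max_wait_ms / 1000.0
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_q.get(timeout=remaining))
            except queue.Empty:
                break
        inputs = [x for x, _ in batch]
        outputs = run_model(inputs)                  # one accelerator call per batch
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)                      # hand each result back to its caller
```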

    Third Conference on Artificial Intelligence for Space Applications, part 1

    The application of artificial intelligence to spacecraft and aerospace systems is discussed. Expert systems, robotics, space station automation, fault diagnostics, parallel processing, knowledge representation, scheduling, man-machine interfaces, and neural nets are among the topics discussed.