
    Speech Recognition in Hindi

    This project is an attempt to reduce the gap between computers and the people of rural India by allowing them to use Hindi, the language most commonly spoken in rural areas. Speech recognition will play a significant role in bringing technology to these areas. Although many speech interfaces are already available, interfaces in local Indian languages are needed; hence, in this project we attempt to build a speech recognition system for Hindi. The project report briefly explains the basic model of a speech recognition engine and its different modules. It also describes the construction of the Hindi language dictionary, the training of the model for speech recognition, and finally the testing of the model for accuracy. The test results are provided, and the report ends with the conclusions drawn and recommended future work.

    Real-time viseme extraction

    With the advance of modern computer hardware, computer animation has advanced by leaps and bounds. What formerly took weeks of processing can now be generated on the fly. However, the actors in games often stand mute with unmoving faces, or speak only in canned phrases, because the technology for calculating their lip positions from an arbitrary sound segment has lagged behind the technology that allows the movement of those lips to be rendered in real time. Traditional speech recognition techniques require the entire utterance to be present, or at least a wide window around the segment being matched, so that higher-level structure can be used in determining which words are being spoken. This approach, while highly appropriate for recognizing the sounds present in an audio stream and mapping them to speech, is less applicable to the problem of lip-syncing in real time. This paper looks at an alternative technique that applies multivariate statistical methods to lip-sync a cartoon or model with an audio stream in real time, requiring orders of magnitude less processing power than traditional methods.
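The multivariate statistical mapping described above can be illustrated as a per-frame linear regression from audio features to mouth-shape parameters, which costs a single matrix multiply per frame at run time. This is only a sketch under assumed synthetic data; the paper's actual features and model are not specified here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-frame audio features (standing in for short-window
# spectral measures) and target mouth-shape parameters; real training
# data is assumed, not shown.
audio = rng.normal(size=(200, 12))
true_map = rng.normal(size=(12, 3))
mouth = audio @ true_map + 0.01 * rng.normal(size=(200, 3))

# Multivariate linear regression fitted offline by least squares.
W, *_ = np.linalg.lstsq(audio, mouth, rcond=None)

# At run time, each incoming frame maps to viseme parameters with one
# matrix-vector product -- far cheaper than full ASR decoding.
frame = audio[0]
predicted = frame @ W
```

The cheap per-frame evaluation is what makes this kind of mapping viable for real-time lip-sync where a full recognizer would not be.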

    Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation

    Over the past few years, speech recognition performance on tasks ranging from isolated digit recognition to conversational speech has dramatically improved. Performance on limited recognition tasks in noise-free environments is comparable to that achieved by human transcribers. This advancement in automatic speech recognition technology, along with an increase in the compute power of mobile devices, standardization of communication protocols, and the explosion in the popularity of mobile devices, has created an interest in flexible voice interfaces for mobile devices. However, speech recognition performance degrades dramatically in mobile environments, which are inherently noisy. In the recent past, a great amount of effort has been spent on the development of front ends based on advanced noise-robust approaches. The primary objective of this thesis was to analyze the performance of two advanced front ends, referred to as the QIO and MFA front ends, on a speech recognition task based on the Wall Street Journal database. Though the advanced front ends are shown to achieve a significant improvement over an industry-standard baseline front end, this improvement is not operationally significant. Further, we show that the results of this evaluation were not significantly impacted by suboptimal recognition system parameter settings. Without any front end-specific tuning, the MFA front end outperforms the QIO front end by 9.6% relative. With tuning, the relative performance gap increases to 15.8%. Finally, we also show that mismatched microphone and additive noise evaluation conditions resulted in a significant degradation in performance for both front ends.
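The "relative" improvements quoted above follow the standard convention of expressing a word-error-rate reduction as a percentage of the baseline error rate. A minimal sketch, using hypothetical WER values that are not figures from the thesis:

```python
def relative_improvement(wer_baseline: float, wer_new: float) -> float:
    """Relative word-error-rate reduction, as a percentage of the baseline."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline

# Hypothetical WERs chosen only to illustrate the convention: dropping
# from 12.5% to 11.3% absolute is a 9.6% relative improvement.
print(f"{relative_improvement(12.5, 11.3):.1f}% relative")
```

Note that a modest absolute gain can sound large in relative terms, which is why the thesis distinguishes statistical from operational significance.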

    Suomenkielinen puheentunnistus hammashuollon sovelluksissa (Finnish-language speech recognition in dental care applications)

    A significant portion of the work time of dentists and nursing staff goes into writing reports and notes. This thesis studies how automatic speech recognition could ease that workload. The primary objective was to develop and evaluate an automatic speech recognition system for dental health care that records the status of a patient's dentition as dictated by a dentist. The system accepts a restricted set of spoken commands that identify a tooth or teeth and describe their condition; the status of the teeth is stored in a database. In addition to dentition status dictation, we surveyed how well automatic speech recognition would suit the dictation of patient treatment reports. Instead of typing reports on a keyboard, a dentist could dictate them to speech recognition software that automatically transcribes them into text. The vocabulary and grammar in such a system are, in principle, unlimited, which makes it significantly harder to obtain an accurate transcription. The status commands and the report dictation language model are in Finnish. Aalto University has developed an unlimited-vocabulary speech recognizer that is particularly well suited to recognizing free-form Finnish speech, but it has previously been used mainly for research purposes. In this project we experimented with adapting the recognizer to grammar-based dictation and to real end-user environments. Nearly perfect recognition accuracy was obtained for dentition status dictation. Letter error rates for the report transcription task varied between 1.3% and 17% depending on the speaker, with no obvious explanation for such radical inter-speaker variability. The language model for report transcription was estimated from a collection of dental reports; including a corpus of literary Finnish did not improve the results.
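Letter error rates like those quoted above are conventionally computed as the Levenshtein (edit) distance between the recognized and reference character sequences, normalized by the reference length. A minimal sketch of that metric (the thesis's exact scoring tool is not specified here):

```python
def letter_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance between character sequences, divided by
    the reference length."""
    m, n = len(reference), len(hypothesis)
    # Row-by-row dynamic programming over the edit-distance table.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / m

# One substituted letter in a six-letter word gives an LER of 1/6.
print(letter_error_rate("hammas", "hamnas"))
```

Character-level scoring is the natural choice for Finnish, whose rich morphology makes word-level error rates harder to compare across speakers.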

    Robust speaker identification using artificial neural networks

    This research focuses on recognizing speakers from their speech samples. Numerous text-dependent and text-independent algorithms have been developed to recognize a speaker from his or her speech. In this thesis, we concentrate on recognizing the speaker from fixed text, i.e., the text-dependent case; the possibility of extending the method to variable text, i.e., the text-independent case, is also analyzed. Different feature extraction algorithms are employed, and their performance with artificial neural networks as a data classifier on a fixed training set is analyzed. We find a way to combine these individual feature extraction algorithms by incorporating their interdependence. The efficiency of the algorithms is determined after the input speech is classified using the back-propagation algorithm for artificial neural networks. A special case of the back-propagation algorithm that improves classification efficiency is also discussed.
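The classification stage described above can be sketched as a small feed-forward network trained with plain back-propagation. Everything below is a toy illustration: the feature vectors are synthetic stand-ins for extracted speech features, and the two-class labels play the role of two hypothetical speakers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "feature vectors" standing in for per-utterance speech
# features; real MFCC-style extraction is assumed, not shown.
X = rng.normal(size=(40, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # two hypothetical speakers
T = np.eye(2)[y]                         # one-hot targets

# One hidden layer, trained with vanilla back-propagation.
W1 = rng.normal(scale=0.5, size=(8, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 2)); b2 = np.zeros(2)

def forward(X):
    H = np.tanh(X @ W1 + b1)             # hidden activations
    Z = H @ W2 + b2
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    return H, P / P.sum(axis=1, keepdims=True)  # softmax posteriors

lr = 0.1
for _ in range(300):
    H, P = forward(X)
    dZ = (P - T) / len(X)                # softmax cross-entropy gradient
    dW2 = H.T @ dZ; db2 = dZ.sum(axis=0)
    dH = dZ @ W2.T * (1 - H ** 2)        # back-propagate through tanh
    dW1 = X.T @ dH; db1 = dH.sum(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

_, P = forward(X)
accuracy = (P.argmax(axis=1) == y).mean()
```

Combining several feature extractors, as the thesis does, would amount to concatenating their outputs into the input vector `X` before training.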

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.

    Recognizing GSM Digital Speech

    The Global System for Mobile (GSM) environment poses three main problems for automatic speech recognition (ASR) systems: noisy scenarios, source coding distortion, and transmission errors. The first has already received much attention; however, source coding distortion and transmission errors must be explicitly addressed. In this paper, we propose an alternative front end for speech recognition over GSM networks, specially conceived to be effective against source coding distortion and transmission errors. Specifically, we suggest extracting the recognition feature vectors directly from the encoded speech (i.e., the bitstream) instead of decoding it and subsequently extracting the feature vectors. This approach offers two significant advantages. First, the recognition system is affected only by the quantization distortion of the spectral envelope, so we avoid the influence of other sources of distortion introduced by the encoding-decoding process. Second, when transmission errors occur, our front end is more effective because it is not affected by errors in the bits allocated to the excitation signal. We have considered the half-rate and full-rate standard codecs and compared the proposed front end with the conventional approach on two ASR tasks, namely speaker-independent isolated digit recognition and speaker-independent continuous speech recognition. In general, our approach outperforms the conventional procedure for a variety of simulated channel conditions. Furthermore, the disparity increases as the network conditions worsen.

    Ultra low-power, high-performance accelerator for speech recognition

    Automatic Speech Recognition (ASR) is undoubtedly one of the most important and interesting applications in the current era of deep-learning deployment, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost, requiring huge memory storage and computational power, which the tiny power budget of mobile devices cannot afford. Hardware acceleration can reduce the power consumption of ASR systems and relieve their memory pressure while delivering high performance. In this thesis, we present a customized accelerator for large-vocabulary, speaker-independent, continuous speech recognition. A state-of-the-art ASR system consists of two major components: acoustic scoring using a DNN and speech-graph decoding using Viterbi search. As a first step, we focus on the Viterbi search algorithm, which represents the main bottleneck in the ASR system. The accelerator includes innovative techniques to improve the memory subsystem, the main bottleneck for performance and power, such as a prefetching scheme and a novel bandwidth-saving technique tailored to the needs of ASR. Furthermore, as the speech graph is vast, taking more than 1 GB of memory, we propose to change its representation by partitioning it into several sub-graphs and performing on-the-fly composition at Viterbi run time. This approach, together with some simple yet efficient compression techniques, results in a 31x memory footprint reduction, providing a 155x real-time speedup and orders-of-magnitude power and energy savings compared to CPUs and GPUs. As the next step, we propose a novel hardware-based ASR system that effectively integrates a DNN accelerator for pruned/quantized models with the Viterbi accelerator. We show that, when either pruning or quantizing the DNN model used for acoustic scoring, ASR accuracy is maintained but the execution time of the ASR system increases by 33%.
Although pruning and quantization improve the efficiency of the DNN, they result in a huge increase of activity in the Viterbi search, since the output scores of the pruned model are less reliable. To avoid this increase in Viterbi search workload, our system selects the N-best hypotheses at every time step, exploring only the N most likely paths. Our final solution efficiently combines the DNN and Viterbi accelerators with all their optimizations, delivering 222x real-time ASR within a small power budget of 1.26 W, a small memory footprint of 41 MB, and a peak memory bandwidth of 381 MB/s, making it amenable to low-power mobile platforms.
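The N-best pruning described above amounts to expanding all live hypotheses at each time step of the Viterbi search and then retaining only the N highest-scoring ones. A minimal sketch over a hypothetical weighted search graph (the graph, states, and log-probabilities below are invented for illustration):

```python
import heapq

# Hypothetical search graph: state -> list of (next_state, log_probability).
GRAPH = {
    "s0": [("s1", -0.2), ("s2", -1.5)],
    "s1": [("s1", -0.3), ("s3", -0.1)],
    "s2": [("s3", -0.4)],
    "s3": [],
}

def nbest_viterbi_step(active, n):
    """Expand every active hypothesis one step, keep the N best by score."""
    expanded = {}
    for state, score in active.items():
        for nxt, logp in GRAPH[state]:
            cand = score + logp
            if cand > expanded.get(nxt, float("-inf")):
                expanded[nxt] = cand  # Viterbi max over competing paths
    # Histogram pruning: retain only the N most likely hypotheses.
    return dict(heapq.nlargest(n, expanded.items(), key=lambda kv: kv[1]))

active = {"s0": 0.0}
for _ in range(2):
    active = nbest_viterbi_step(active, n=2)
print(active)
```

Because the number of live hypotheses is capped at N regardless of how noisy the acoustic scores are, the search workload stays bounded even when a pruned or quantized DNN produces less reliable scores.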