17 research outputs found

    Improving wordspotting performance with limited training data

Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. Includes bibliographical references (leaves 149-155). By Eric I-Chao Chang.

    Hierarchical duration modeling for a speech recognition system

Thesis (M.S.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997. Includes bibliographical references (p. 102-105). By Grace Chung.

    Detection and Recognition of Number Sequences in Spoken Utterances

In this paper we investigate the detection and recognition of sequences of numbers in spoken utterances. This is done in two steps: first, the entire utterance is decoded assuming that only numbers were spoken; in the second step, non-number segments (garbage) are detected based on word confidence measures. We compare this approach to conventional garbage models, and we also compare several phone-posterior-based confidence measures. The work is evaluated on the detection task (hit rate and false alarms) and on the recognition task (word accuracy within detected number sequences). The proposed method is tested on German continuous spoken utterances in which the target content (numbers) amounts to only 20%.
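As a rough illustration of the confidence-based garbage filtering described in this abstract, the following Python sketch scores each decoded word by the geometric mean of its phone posteriors and discards low-confidence words. The WordHyp structure and the 0.7 threshold are hypothetical; the paper compares several confidence measures rather than prescribing this one.

```python
# Minimal sketch of the second step described above: decoded words are marked
# as garbage when their phone-posterior-based confidence falls below a
# threshold. WordHyp and the 0.7 threshold are illustrative assumptions.
from dataclasses import dataclass
from typing import List
import math

@dataclass
class WordHyp:
    word: str                       # decoded number word, e.g. "sieben"
    phone_posteriors: List[float]   # per-frame posteriors of the aligned phones

def word_confidence(hyp: WordHyp) -> float:
    """Geometric mean of the phone posteriors, one possible confidence measure."""
    logs = [math.log(max(p, 1e-10)) for p in hyp.phone_posteriors]
    return math.exp(sum(logs) / len(logs))

def filter_garbage(hyps: List[WordHyp], threshold: float = 0.7) -> List[WordHyp]:
    """Keep only words whose confidence suggests they are genuine numbers."""
    return [h for h in hyps if word_confidence(h) >= threshold]
```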

    Classifying seismic waveforms from scratch: a case study in the alpine environment

Nowadays, an increasing amount of seismic data is collected by daily observatory routines. The basic step in successfully analyzing those data is the correct detection of various event types. However, visual scanning is a time-consuming task, and standard detection techniques such as the STA/LTA trigger still require manual control for classification. Here, we present a useful alternative. The incoming data stream is scanned automatically for events of interest. A stochastic classifier, called a hidden Markov model, is learned for each class of interest, enabling the recognition of highly variable waveforms. In contrast to other automatic techniques such as neural networks or support vector machines, the algorithm allows classification to start from scratch as soon as interesting events are identified. Neither the tedious process of collecting training samples nor a time-consuming configuration of the classifier is required. An approach originally introduced for the volcanic task force action makes it possible to learn classifier properties from a single waveform example and a few hours of background recording. Besides reducing the required workload, this also makes it possible to detect very rare events. The latter feature in particular is a milestone for the use of seismic devices in alpine warning systems. Furthermore, the system offers the opportunity to flag new signal classes that have not been defined before. We demonstrate the application of the classification system on a data set from the Swiss Seismological Survey, achieving very high recognition rates. In detail, we document all refinements of the classifier, providing a step-by-step guide for the fast set-up of a well-working classification system.
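The one-HMM-per-class recognition scheme sketched in this abstract can be illustrated with the third-party hmmlearn package: fit one model per event class on example feature sequences, then label a new segment by maximum likelihood. Feature extraction, the class names, and the model sizes below are assumptions for illustration, not details from the study.

```python
# A minimal sketch of one-HMM-per-class waveform classification. Spectral
# feature extraction from the seismic trace is assumed to happen elsewhere;
# class names and HMM sizes are illustrative.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_class_model(feature_seqs, n_states=5):
    """Fit one HMM on all example feature sequences of a class."""
    X = np.vstack(feature_seqs)              # stack sequences frame-wise
    lengths = [len(s) for s in feature_seqs] # remember sequence boundaries
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

def classify(segment, models):
    """Assign a feature sequence to the class whose HMM scores it highest."""
    return max(models, key=lambda name: models[name].score(segment))

# models = {"rockfall": train_class_model(...), "earthquake": ..., "noise": ...}
# label = classify(features_of_new_event, models)
```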

    Spoken command recognition for robotics

In this thesis, I investigate spoken command recognition technology for robotics. While high robustness is expected, the distant and noisy conditions in which the system has to operate make the task very challenging. Unlike commercial systems, which all rely on a "wake-up" word to initiate the interaction, the pipeline proposed here directly detects and recognizes commands from the continuous audio stream. In order to keep the task manageable despite low-resource conditions, I propose to focus on a limited set of commands, thus trading off flexibility of the system against robustness.

Domain and speaker adaptation strategies based on a multi-task regularization paradigm are first explored. More precisely, two different methods are proposed, both relying on a tied loss function that penalizes the distance between the outputs of several networks. The first method considers each speaker or domain as a task. A canonical task-independent network is jointly trained with task-dependent models, allowing both types of networks to improve by learning from one another. While an improvement of 3.2% on the frame error rate (FER) of the task-independent network is obtained, this only partially carries over to the phone error rate (PER), with 1.5% improvement. Similarly, a second method explores the parallel training of the canonical network with a privileged model having access to i-vectors. This method proves less effective, with only 1.2% improvement on the FER.

In order to make the developed technology more accessible, I also investigate the use of a sequence-to-sequence (S2S) architecture for command classification. An attention-based encoder-decoder model reduces the classification error by 40% relative to a strong convolutional neural network (CNN)-hidden Markov model (HMM) baseline, showing the relevance of S2S architectures in such a context.

In order to improve the flexibility of the trained system, I also explore strategies for few-shot learning, which allow the set of commands to be extended with minimal data requirements. Retraining a model on the combination of original and new commands, I achieve 40.5% accuracy on the new commands with only 10 examples each; this score goes up to 81.5% accuracy with a larger set of 100 examples per new command. An alternative strategy, based on model adaptation, achieves even better scores, 68.8% and 88.4% accuracy with 10 and 100 examples respectively, while being faster to train. This high performance is obtained at the expense of the original categories, though, on which accuracy deteriorates. These results are very promising, as the methods make it possible to extend an existing S2S model with minimal resources.

Finally, a full spoken command recognition system (named iCubrec) has been developed for the iCub platform. The pipeline relies on a voice activity detection (VAD) system to offer a fully hands-free experience. By segmenting only the regions that are likely to contain commands, the VAD module also greatly reduces the computational cost of the pipeline. Command candidates are then passed to the deep neural network (DNN)-HMM command recognition system for transcription. The VoCub dataset has been specifically gathered to train a DNN-based acoustic model for our task. Through multi-condition training with the CHiME4 dataset, an accuracy of 94.5% is reached on the VoCub test set. A filler model, complemented by a rejection mechanism based on a confidence score, is finally added to the system to reject non-command speech in a live demonstration of the system.
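The tied loss function mentioned above can be sketched in a few lines of PyTorch: a canonical network and task-dependent (here, per-speaker) networks are trained jointly, with an extra term penalizing the distance between their outputs. The layer sizes, the MSE distance, and the lambda weight are illustrative assumptions, not the thesis's exact configuration.

```python
# Minimal sketch of a tied loss for multi-task regularization: both networks
# are trained on the same frames, plus a term pulling their outputs together.
import torch
import torch.nn as nn

n_feats, n_phones, n_speakers = 40, 48, 4  # illustrative sizes
canonical = nn.Sequential(nn.Linear(n_feats, 256), nn.ReLU(), nn.Linear(256, n_phones))
speaker_nets = [nn.Sequential(nn.Linear(n_feats, 256), nn.ReLU(), nn.Linear(256, n_phones))
                for _ in range(n_speakers)]

ce = nn.CrossEntropyLoss()   # frame-level phone classification loss
tie = nn.MSELoss()           # distance between the two networks' outputs

def tied_loss(x, y, speaker_id, lam=0.1):
    """Cross-entropy for both networks plus a penalty tying their outputs."""
    out_c = canonical(x)
    out_s = speaker_nets[speaker_id](x)
    return ce(out_c, y) + ce(out_s, y) + lam * tie(out_c, out_s)

params = list(canonical.parameters()) + [p for m in speaker_nets for p in m.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)
# In the training loop: loss = tied_loss(batch_x, batch_y, spk); loss.backward(); opt.step()
```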

    Advances in Document Layout Analysis

[EN] Handwritten Text Segmentation (HTS) is a task within the Document Layout Analysis field that aims to detect and extract the different page regions of interest found in handwritten documents. HTS remains an active topic that has gained importance over the years, due to the increasing demand to provide textual access to the myriad handwritten document collections held by archives and libraries. This thesis considers HTS as a task that must be tackled in two specialized phases: detection and extraction. We see the detection phase fundamentally as a recognition problem that yields the vertical positions of each region of interest as a by-product. The extraction phase consists in calculating the best contour coordinates of the region using the position information provided by the detection phase. Our proposed detection approach allows us to attack both higher-level regions (paragraphs, diagrams, etc.) and lower-level regions such as text lines. In the case of text-line detection we model the problem to ensure that the vertical position yielded by the system approximates the fictitious line that connects the lower part of the grapheme bodies in a text line, commonly known as the baseline. One of the main contributions of this thesis is that the proposed modelling approach allows us to include prior information regarding the layout of the documents being processed. This is performed via a Vertical Layout Model (VLM). We develop a Hidden Markov Model (HMM) based framework to tackle both region detection and classification as an integrated task, and study the performance and ease of use of the proposed approach on many corpora. We review the modelling simplicity of our approach to process regions at different levels of information: text lines, paragraphs, titles, etc. We study the impact of adding deterministic and/or probabilistic prior information and restrictions via the VLM that our approach provides. Having a separate phase that accurately yields the detection position (baselines in the case of text lines) of each region greatly simplifies the problem that must be tackled during the extraction phase. In this thesis we propose to use a distance map that takes into consideration the grey-scale information in the image. This allows us to yield extraction frontiers which are equidistant to the adjacent text regions. We study how the accuracy of our approach scales proportionally to the quality of the provided detection vertical position. Our extraction approach gives near-perfect results when human-reviewed baselines are provided.
Bosch Campos, V. (2020). Advances in Document Layout Analysis [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/138397
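As a rough illustration of the extraction phase described in the abstract above, the following Python sketch traces a separating frontier between two detected baselines by following a ridge of a distance map, so that the frontier stays as far as possible from the ink of both adjacent lines. Binarizing with a fixed threshold is a simplification of my own; the thesis proposes a distance map that takes the grey-scale information itself into account.

```python
# Minimal sketch: for each column, pick the row between two detected baselines
# that is farthest from any ink pixel (a ridge of the distance map). The
# fixed ink_threshold is an illustrative simplification.
import numpy as np
from scipy.ndimage import distance_transform_edt

def separating_frontier(gray, baseline_upper, baseline_lower, ink_threshold=128):
    """baseline_upper/baseline_lower give, per column, the row of the upper
    and lower text line's baseline; returns one frontier row per column."""
    ink = gray < ink_threshold            # True where there is ink
    dist = distance_transform_edt(~ink)   # distance of each pixel to nearest ink
    frontier = []
    for col in range(gray.shape[1]):
        rows = np.arange(baseline_upper[col], baseline_lower[col])
        frontier.append(rows[np.argmax(dist[rows, col])])
    return np.array(frontier)
```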

    Subword lexical modelling for speech recognition

Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (p. 155-160). By Raymond Lau.

Searching Spontaneous Conversational Speech: Proceedings of the ACM SIGIR Workshop (SSCS2008)


    An acoustic-phonetic approach in automatic Arabic speech recognition

In a large-vocabulary speech recognition system, broad phonetic classification is used instead of detailed phonetic analysis to overcome the variability in the acoustic realisation of utterances. The broad phonetic description of a word is used as a means of lexical access, where the lexicon is structured into sets of words sharing the same broad phonetic labelling. This approach has been applied to a large-vocabulary isolated-word Arabic speech recognition system. Statistical studies have been carried out on 10,000 Arabic words (converted to phonemic form) involving different combinations of broad phonetic classes, and some particular features of the Arabic language have been exploited. The results show that vowels represent about 43% of the total number of phonemes. They also show that about 38% of the words can be uniquely represented at this level using eight broad phonetic classes. When detailed vowel identification is introduced, the percentage of uniquely specified words rises to 83%. These results suggest that a fully detailed phonetic analysis of the speech signal is perhaps unnecessary. In the adopted word recognition model, the consonants are classified into four broad phonetic classes, while the vowels are described by their phonemic form. A set of 100 words uttered by several speakers has been used to test the performance of the implemented approach. In the implemented recognition model, three procedures have been developed, namely voiced-unvoiced-silence (V-UV-S) segmentation, vowel detection and identification, and automatic detection of spectral transitions between phonemes within a word. The accuracy of both the V-UV-S and vowel recognition procedures is almost perfect. A broad phonetic segmentation procedure has been implemented, which exploits information from the three procedures mentioned above; simple phonological constraints have been used to improve the accuracy of the segmentation process. The resultant sequence of labels is used for lexical access to retrieve the word, or a small set of words sharing the same broad phonetic labelling. When more than one word candidate is retrieved, a verification procedure is used to choose the most likely one.
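As a rough illustration of the broad-phonetic lexical access described in this abstract, the following Python sketch maps consonants to broad classes while keeping vowels in detailed phonemic form, then buckets the lexicon by the resulting label sequence; words falling in singleton buckets are uniquely specified at this level. The class inventory and toy words are illustrative, not the thesis's actual inventory.

```python
# Minimal sketch of lexical access via broad phonetic labelling. The class
# inventory below is a generic illustration, not the four classes of the thesis.
from collections import defaultdict

BROAD_CLASS = {
    # consonants collapse to broad classes...
    "b": "PLOSIVE", "d": "PLOSIVE", "t": "PLOSIVE", "k": "PLOSIVE",
    "s": "FRICATIVE", "f": "FRICATIVE", "z": "FRICATIVE",
    "m": "NASAL", "n": "NASAL",
    "l": "LIQUID", "r": "LIQUID",
    # ...while vowels keep their detailed phonemic identity
    "a": "a", "i": "i", "u": "u",
}

def broad_label(phonemes):
    return tuple(BROAD_CLASS.get(p, p) for p in phonemes)

def build_lexicon(words):
    """Bucket the lexicon: words sharing a broad labelling fall in one cohort."""
    lexicon = defaultdict(list)
    for word, phonemes in words.items():
        lexicon[broad_label(phonemes)].append(word)
    return lexicon

# Toy example: words in singleton buckets are uniquely specified at this level.
words = {"kataba": list("kataba"), "darasa": list("darasa"), "salima": list("salima")}
lexicon = build_lexicon(words)
unique = sum(1 for bucket in lexicon.values() if len(bucket) == 1)
print(f"{unique}/{len(words)} words uniquely specified")
```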