
    Una estrategia de procesamiento automático del habla basada en la detección de atributos [An automatic speech processing strategy based on attribute detection]

    Get PDF
    State-of-the-art automatic speech and speaker recognition systems are often built with a pattern-matching framework that has proven to achieve low recognition error rates for a variety of resource-rich tasks, when the volume of speech and text examples for building statistical acoustic and language models is plentiful and the speaker, acoustic, and language conditions follow a rigid protocol. However, because of their "black-box", top-down approach to knowledge integration, such systems cannot easily leverage the rich set of knowledge sources already available in the literature on speech, acoustics, and language. In this paper, we present a bottom-up approach to knowledge integration, called automatic speech attribute transcription (ASAT), which is intended to be "knowledge-rich", so that new and existing knowledge sources can be verified and integrated into current spoken language systems to improve recognition accuracy and system robustness. Since the ASAT framework offers a "divide-and-conquer" strategy and a "plug-and-play" game plan, it will facilitate a cooperative speech processing community to which every researcher can contribute, with a view to improving speech processing capabilities that are currently not easily accessible to researchers in the speech science community.
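
    As a rough illustration of the bottom-up strategy sketched above, here is a minimal sketch of an ASAT-style pipeline: a bank of attribute detectors produces per-frame scores, and an evidence merger combines them into phone-level evidence. The attribute subset, the phone-to-attribute table, and the random placeholder detectors are illustrative assumptions, not the authors' actual models.

```python
import numpy as np

ATTRIBUTES = ["vowel", "fricative", "nasal", "stop", "voiced"]  # illustrative subset

def detect_attributes(frames: np.ndarray) -> dict[str, np.ndarray]:
    """Return a per-frame score in [0, 1] for each attribute detector.
    Each detector is a stand-in for a separately trained model."""
    rng = np.random.default_rng(0)  # placeholder: random "detectors"
    return {a: rng.random(len(frames)) for a in ATTRIBUTES}

# Hypothetical phone-to-attribute composition table: which attributes
# constitute each phone ("divide and conquer" at the evidence level).
PHONE_ATTRS = {"aa": ["vowel", "voiced"], "s": ["fricative"], "m": ["nasal", "voiced"]}

def merge_evidence(scores: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Combine attribute scores into per-frame phone evidence
    (here a simple geometric mean over the phone's attributes)."""
    return {
        p: np.prod([scores[a] for a in attrs], axis=0) ** (1.0 / len(attrs))
        for p, attrs in PHONE_ATTRS.items()
    }

frames = np.zeros((100, 13))  # 100 frames of 13-dim features (placeholder)
phone_evidence = merge_evidence(detect_attributes(frames))
```

    Because each detector is a separate module, a new knowledge source can be verified in isolation and then plugged into the merger, which is the "plug-and-play" property the abstract emphasizes.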

    Articulatory features for conversational speech recognition

    Get PDF

    Boosting End-to-End Multilingual Phoneme Recognition through Exploiting Universal Speech Attributes Constraints

    Full text link
    We propose a first step toward multilingual end-to-end automatic speech recognition (ASR) by integrating knowledge about the speech articulators. The key idea is to leverage a rich set of fundamental units that can be defined "universally" across all spoken languages, referred to as speech attributes, namely manner and place of articulation. Specifically, several deterministic attribute-to-phoneme mapping matrices are constructed from a predefined universal attribute inventory; these project the knowledge-rich articulatory attribute logits into output phoneme logits. The mapping imposes knowledge-based constraints that limit inconsistency with acoustic-phonetic evidence in the integrated prediction. Combined with phoneme recognition, our phone recognizer can draw on both attribute and phoneme information. The proposed joint multilingual model is evaluated through phoneme recognition. In multilingual experiments over six languages on the benchmark datasets LibriSpeech and CommonVoice, we find that our proposed solution outperforms conventional multilingual approaches with a relative improvement of 6.85% on average, and it also performs much better than the monolingual models. Further analysis demonstrates that the proposed solution eliminates phoneme predictions that are inconsistent with the attributes.
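
    A minimal sketch of the deterministic attribute-to-phoneme projection described above, assuming a toy attribute inventory and phoneme set; the 0/1 mapping matrix and its row normalization are illustrative placeholders, not the paper's actual inventory.

```python
import numpy as np

attributes = ["vowel", "fricative", "nasal", "bilabial", "alveolar", "voiced"]
phonemes = ["aa", "s", "m"]

# M[i, j] = 1 if attribute j participates in phoneme i; rows are
# L1-normalized so the projection averages the relevant attribute logits.
M = np.array([
    [1, 0, 0, 0, 0, 1],   # aa: vowel + voiced
    [0, 1, 0, 0, 1, 0],   # s:  fricative + alveolar
    [0, 0, 1, 1, 0, 1],   # m:  nasal + bilabial + voiced
], dtype=float)
M /= M.sum(axis=1, keepdims=True)

def project(attr_logits: np.ndarray) -> np.ndarray:
    """Project per-frame attribute logits (T, n_attrs) into phoneme
    logits (T, n_phones); phonemes whose constituent attributes carry
    little evidence receive low scores by construction."""
    return attr_logits @ M.T

attr_logits = np.random.randn(50, len(attributes))  # placeholder logits
phoneme_logits = project(attr_logits)  # combined with the phoneme branch downstream
```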

    Combining Evidence from Unconstrained Spoken Term Frequency Estimation for Improved Speech Retrieval

    Get PDF
    This dissertation considers the problem of information retrieval in speech. Today's speech retrieval systems generally use a large-vocabulary continuous speech recognition system to first hypothesize the words that were spoken. Because these systems have a predefined lexicon, words that fall outside the lexicon can significantly reduce search quality, as measured by Mean Average Precision (MAP). This is particularly important because these out-of-vocabulary (OOV) words are often rare and therefore good discriminators for topically relevant speech segments. The focus of this dissertation is on handling these out-of-vocabulary query words. The approach is to combine results from a word-based speech retrieval system with those from vocabulary-independent ranked utterance retrieval. The goal of ranked utterance retrieval is to rank speech utterances by the system's confidence that they contain a particular spoken word, which is accomplished by ranking the utterances by the estimated frequency of the word in the utterance. Several new approaches for estimating this frequency are considered, motivated by the disparity between reference phoneme sequences and erroneously hypothesized ones. The first method learns alternate pronunciations or degradations from actual recognition hypotheses and incorporates these variants into a new generative estimator of term frequency. A second method learns transformations of several easily computed features in a discriminative model for the same task. Both methods significantly improved ranked utterance retrieval in an experimental validation on new speech. The best of these ranked utterance retrieval methods is then combined with a word-based speech retrieval system. The combination uses a normalization learned in an additive model, which maps the retrieval status values from each system into estimated probabilities of relevance that are easily combined. With this combination, much of the MAP lost because of OOV words is recovered. Evaluated on a collection of spontaneous, conversational speech, the system recovers 57.5% of the MAP lost on short (title-only) queries and 41.3% on longer (title-plus-description) queries.
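
    A minimal sketch of the final combination step described above, assuming a per-system logistic calibration as a stand-in for the learned additive-model normalization; the parameters, scores, and mixing weight below are hypothetical.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def calibrate(rsv: np.ndarray, a: float, b: float) -> np.ndarray:
    """Map raw retrieval status values to estimated P(relevant);
    (a, b) would be fit on held-out relevance judgments."""
    return sigmoid(a * rsv + b)

def combine(p_word: np.ndarray, p_phonetic: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Mix the word-based and vocabulary-independent probabilities;
    the phonetic side carries the evidence for OOV query words."""
    return w * p_word + (1.0 - w) * p_phonetic

# Hypothetical scores for five speech segments from each system.
rsv_word = np.array([2.1, -0.3, 0.8, 1.5, -1.0])
rsv_phone = np.array([0.4, 1.9, -0.2, 0.7, 2.3])
p = combine(calibrate(rsv_word, 1.2, -0.5), calibrate(rsv_phone, 0.9, -0.1))
ranking = np.argsort(-p)  # rank segments by combined probability of relevance
```

    Calibrating each system's scores into probabilities first is what makes the two retrieval status value scales directly comparable before they are mixed.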

    Low-resource speech recognition and keyword-spotting

    Get PDF
    © Springer International Publishing AG 2017. The IARPA Babel program ran from March 2012 to November 2016. The aim of the program was to develop agile and robust speech technology that can be rapidly applied to any human language in order to provide effective search capability over large quantities of real-world data. This paper describes some of the developments in speech recognition and keyword-spotting during the lifetime of the project. Two technical areas are briefly discussed, with a focus on techniques developed at Cambridge University: the application of deep learning to low-resource speech recognition, and efficient approaches for keyword spotting. Finally, a brief analysis of the characteristics of the Babel languages and the per-language performance is presented.
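
    Since the abstract gives no implementation detail, the following is only a minimal sketch of the kind of keyword-spotting search the program targeted: scanning time-marked word hypotheses for a query term and ranking putative hits by recognizer confidence. The hypothesis format and threshold are assumptions, not the Babel systems' actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class WordHyp:
    word: str
    start: float       # seconds into the recording
    duration: float    # seconds
    confidence: float  # posterior from the recognizer

def spot_keyword(hyps: list[WordHyp], keyword: str, threshold: float = 0.3):
    """Return (start, duration, score) for each putative occurrence
    of the keyword whose confidence clears the threshold."""
    return [
        (h.start, h.duration, h.confidence)
        for h in hyps
        if h.word == keyword and h.confidence >= threshold
    ]

hyps = [WordHyp("hello", 0.0, 0.4, 0.9), WordHyp("world", 0.45, 0.5, 0.2)]
hits = spot_keyword(hyps, "hello")
```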

    Conditional Random Fields for Integrating Local Discriminative Classifiers

    Full text link