12 research outputs found

    PRESENCE: A human-inspired architecture for speech-based human-machine interaction

    Recent years have seen steady improvements in the quality and performance of speech-based human-machine interaction, driven by a significant convergence in the methods and techniques employed. However, the quantity of training data required to improve state-of-the-art systems seems to be growing exponentially, and performance appears to be asymptotic to a level that may be inadequate for many real-world applications. This suggests that there may be a fundamental flaw in the underlying architecture of contemporary systems, as well as a failure to capitalize on the combinatorial properties of human spoken language. This paper addresses these issues and presents a novel architecture for speech-based human-machine interaction inspired by recent findings in the neurobiology of living systems. Called PRESENCE ("PREdictive SENsorimotor Control and Emulation"), this new architecture blurs the distinction between the core components of a traditional spoken language dialogue system and instead focuses on a recursive hierarchical feedback control structure. Cooperative and communicative behavior emerges as a by-product of an architecture that is founded on a model of interaction in which the system has in mind the needs and intentions of the user, and the user has in mind the needs and intentions of the system.
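    The recursive feedback idea can be made concrete with a toy example. The sketch below is purely illustrative and not taken from the paper: it assumes a two-level hierarchy in which each layer predicts the next observation, corrects its own estimate, and forwards only the residual error to the layer above, which is one simplified reading of a predictive sensorimotor control loop. All class and parameter names here are invented for the example.

```python
# Illustrative sketch (not from the paper): a recursive predictive-control
# hierarchy in the spirit of PRESENCE. Each layer predicts the next
# observation, corrects its own estimate, and passes only the prediction
# error up to the layer above.

from dataclasses import dataclass
from typing import Optional


@dataclass
class ControlLayer:
    name: str
    gain: float = 0.5                      # how strongly this layer corrects itself
    parent: Optional["ControlLayer"] = None
    belief: float = 0.0                    # current estimate of the incoming signal

    def predict(self) -> float:
        """Emulate the interlocutor: predict the next observation."""
        return self.belief

    def observe(self, actual: float) -> float:
        """Compare prediction with reality, correct, and pass the error up."""
        error = actual - self.predict()
        self.belief += self.gain * error   # feedback correction at this level
        if self.parent is not None:
            # Higher layers see only the residual the lower layer could not explain.
            self.parent.observe(error)
        return error


# Toy usage: a two-level hierarchy tracking a drifting acoustic feature.
intent_layer = ControlLayer("intent", gain=0.1)
acoustic_layer = ControlLayer("acoustic", gain=0.6, parent=intent_layer)

for observation in [0.0, 0.2, 0.4, 0.8, 1.0, 1.0]:
    residual = acoustic_layer.observe(observation)
    print(f"obs={observation:.1f}  residual={residual:+.3f}  "
          f"acoustic belief={acoustic_layer.belief:.3f}")
```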

    Keyword detection for languages without training data (Detección de palabras claves en lenguajes sin datos de entrenamiento)

    We study the keyword-spotting problem for languages that lack corpora of recordings with phonetic transcriptions. The problem is central to enabling search over databases of speech recordings. Using the Boston University Radio Speech Corpus as a reference corpus, we analyze various hidden Markov model topologies and parameterizations for word detection in continuous speech. The models rely on "filler" models for non-keyword speech, with phonemes as the minimal detection units. For the experiments we use a set of 20 keywords trained on 14 minutes of transcribed data and fillers trained on 7 hours of untranscribed data. The results show that the best model achieves an average figure of merit (FOM) above 0.47, a correct detection rate of 72.1%, and 3.95 false alarms per hour per keyword. (XI Workshop Bases de Datos y Minería de Datos, Red de Universidades con Carreras de Informática, RedUNCI)
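    The figure of merit (FOM) quoted above is conventionally the keyword detection rate averaged over operating points between 1 and 10 false alarms per hour per keyword. The sketch below is a simplified, illustrative computation under that assumption; it is not the evaluation code used in the study, and the function name and toy data are invented for the example.

```python
# Hedged sketch: a simplified Figure of Merit (FOM) computation for keyword
# spotting. FOM is commonly defined as the detection rate averaged over
# operating points from 1 to 10 false alarms per hour per keyword; exact
# evaluation tools differ, so treat this as illustrative only.

from typing import List, Tuple


def figure_of_merit(detections: List[Tuple[float, bool]],
                    n_true_occurrences: int,
                    hours_of_audio: float,
                    n_keywords: int = 1) -> float:
    """detections: (score, is_correct) pairs pooled over the keyword set."""
    # Sweep the decision threshold from the highest score downward.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    hits, false_alarms = 0, 0
    rates = []                      # detection rate sampled at 1..10 FA/hour/keyword
    next_fa_target = 1
    for _, is_correct in ranked:
        if is_correct:
            hits += 1
        else:
            false_alarms += 1
            fa_per_hour = false_alarms / (hours_of_audio * n_keywords)
            while next_fa_target <= 10 and fa_per_hour >= next_fa_target:
                rates.append(hits / n_true_occurrences)
                next_fa_target += 1
    while len(rates) < 10:          # never reached 10 FA/h: pad with the final rate
        rates.append(hits / n_true_occurrences)
    return sum(rates) / 10.0


# Toy usage with made-up detection scores.
dets = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
print(f"FOM = {figure_of_merit(dets, n_true_occurrences=4, hours_of_audio=1.0):.2f}")
```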

    Human-Machine Comparison for Speech Recognition (Konuşma Tanıma için İnsan-makine Karşılaştırması)

    Speech/voice recognition by machines has been a topic of interest since the 1950s. Research that initially adopted dynamic programming methodologies now mostly uses hidden Markov models for speech recognition. Nevertheless, even the most advanced speech recognition systems make, depending on the context, 2-20 times more errors than humans. Although the basic principles behind human speech recognition are not completely understood, there are theories that attempt to explain its biological mechanisms. This paper provides a review of these theories as well as a brief history of developments in automatic speech recognition technology. Furthermore, the paper discusses some recent studies on Turkish speech recognition. The paper concludes with a comparison between human and machine speech recognition performance.

    Error Correction based on Error Signatures applied to automatic speech recognition

    Phonetic Event-based Whole-Word Modeling Approaches for Speech Recognition

    Speech is composed of basic speech sounds called phonemes, and these subword units are the foundation of most speech recognition systems. While detailed acoustic models of phones (and phone sequences) are common, most recognizers model words themselves as a simple concatenation of phonemes and do not closely model the temporal relationships between phonemes within words. Human speech production is constrained by the movement of speech articulators, and there is abundant evidence to indicate that human speech recognition is inextricably linked to the temporal patterns of speech sounds. Structures such as the hidden Markov model (HMM) have proved extremely useful and effective because they offer a convenient framework for combining acoustic modeling of phones with powerful probabilistic language models. However, this convenience masks deficiencies in temporal modeling. Additionally, robust recognition requires complex automatic speech recognition (ASR) systems and entails non-trivial computational costs. As an alternative, we extend previous work on the point process model (PPM) for keyword spotting, an approach to speech recognition expressly based on whole-word modeling of the temporal relations of phonetic events. In our research, we have investigated and advanced a number of major components of this system. First, we have considered alternative methods of determining phonetic events from phone posteriorgrams. We have introduced several parametric approaches to modeling intra-word phonetic timing distributions which allow us to cope with data sparsity issues. We have substantially improved the algorithms used to compute keyword detections, capitalizing on the sparse nature of the phonetic input, which permits the system to be scaled to large data sets. We have considered enhanced CART-based modeling of phonetic timing distributions based on related text-to-speech synthesis work. Lastly, we have developed a point process based spoken term detection system and applied it to the conversational telephone speech task of the 2006 NIST Spoken Term Detection evaluation. We demonstrate the PPM system to be competitive with state-of-the-art phonetic search systems while requiring significantly fewer computational resources.
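    One component mentioned above, deriving sparse phonetic events from frame-level phone posteriorgrams, can be illustrated with simple peak picking. The sketch below is an assumption-laden stand-in rather than the detector described in the thesis: the threshold, frame shift, and function names are invented for the example.

```python
# Illustrative sketch (assumptions labelled): one common way to turn a
# frame-level phone posteriorgram into sparse "phonetic events" for a
# point-process model is peak picking with a confidence threshold. The
# actual event detector used in the work may differ.

import numpy as np


def posteriorgram_to_events(posteriors: np.ndarray,
                            phone_labels: list,
                            frame_shift_s: float = 0.01,
                            threshold: float = 0.5):
    """posteriors: (n_frames, n_phones) array of per-frame phone posteriors.
    Returns a list of (time_in_seconds, phone_label) events at local maxima."""
    events = []
    n_frames, n_phones = posteriors.shape
    for p in range(n_phones):
        track = posteriors[:, p]
        for t in range(1, n_frames - 1):
            # A local maximum above the threshold counts as one phonetic event.
            if track[t] >= threshold and track[t] > track[t - 1] and track[t] >= track[t + 1]:
                events.append((t * frame_shift_s, phone_labels[p]))
    events.sort()
    return events


# Toy usage: 5 frames, 2 phones.
post = np.array([[0.1, 0.9],
                 [0.2, 0.7],
                 [0.8, 0.1],
                 [0.6, 0.2],
                 [0.1, 0.1]])
print(posteriorgram_to_events(post, ["AA", "B"]))
```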