153 research outputs found

    Confidence Measures for Automatic and Interactive Speech Recognition

    Full text link
    [EN] This thesis work contributes to the field of the {Automatic Speech Recognition} (ASR). And particularly to the {Interactive Speech Transcription} and {Confidence Measures} (CM) for ASR. The main goals of this thesis work can be summarised as follows: 1. To design IST methods and tools to tackle the problem of improving automatically generated transcripts. 2. To assess the designed IST methods and tools on real-life tasks of transcription in large educational repositories of video lectures. 3. To improve the reliability of the IST by improving the underlying (CM). Abstracts: The {Automatic Speech Recognition} (ASR) is a crucial task in a broad range of important applications which could not accomplished by means of manual transcription. The ASR can provide cost-effective transcripts in scenarios of increasing social impact such as the {Massive Open Online Courses} (MOOC), for which the availability of accurate enough is crucial even if they are not flawless. The transcripts enable search-ability, summarisation, recommendation, translation; they make the contents accessible to non-native speakers and users with impairments, etc. The usefulness is such that students improve their academic performance when learning from subtitled video lectures even when transcript is not perfect. Unfortunately, the current ASR technology is still far from the necessary accuracy. The imperfect transcripts resulting from ASR can be manually supervised and corrected, but the effort can be even higher than manual transcription. For the purpose of alleviating this issue, a novel {Interactive Transcription of Speech} (IST) system is presented in this thesis. This IST succeeded in reducing the effort if a small quantity of errors can be allowed; and also in improving the underlying ASR models in a cost-effective way. In other to adequate the proposed framework into real-life MOOCs, another intelligent interaction methods involving limited user effort were investigated. And also, it was introduced a new method which benefit from the user interactions to improve automatically the unsupervised parts ({Constrained Search} for ASR). The conducted research was deployed into a web-based IST platform with which it was possible to produce a massive number of semi-supervised lectures from two different well-known repositories, videoLectures.net and poliMedia. Finally, the performance of the IST and ASR systems can be easily increased by improving the computation of the {Confidence Measure} (CM) of transcribed words. As so, two contributions were developed: a new particular {Logistic Regresion} (LR) model; and the speaker adaption of the CM for cases in which it is possible, such with MOOCs.[ES] Este trabajo contribuye en el campo del {reconocimiento automático del habla} (RAH). Y en especial, en el de la {transcripción interactiva del habla} (TIH) y el de las {medidas de confianza} (MC) para RAH. Los objetivos principales son los siguientes: 1. Diseño de métodos y herramientas TIH para mejorar las transcripciones automáticas. 2. Evaluar los métodos y herramientas TIH empleando tareas de transcripción realistas extraídas de grandes repositorios de vídeos educacionales. 3. Mejorar la fiabilidad del TIH mediante la mejora de las MC. Resumen: El {reconocimiento automático del habla} (RAH) es una tarea crucial en una amplia gama de aplicaciones importantes que no podrían realizarse mediante transcripción manual. El RAH puede proporcionar transcripciones rentables en escenarios de creciente impacto social como el de los {cursos abiertos en linea masivos} (MOOC), para el que la disponibilidad de transcripciones es crucial, incluso cuando no son completamente perfectas. Las transcripciones permiten la automatización de procesos como buscar, resumir, recomendar, traducir; hacen que los contenidos sean más accesibles para hablantes no nativos y usuarios con discapacidades, etc. Incluso se ha comprobado que mejora el rendimiento de los estudiantes que aprenden de videos con subtítulos incluso cuando estos no son completamente perfectos. Desafortunadamente, la tecnología RAH actual aún está lejos de la precisión necesaria. Las transcripciones imperfectas resultantes del RAH pueden ser supervisadas y corregidas manualmente, pero el esfuerzo puede ser incluso superior al de la transcripción manual. Con el fin de aliviar este problema, esta tesis presenta un novedoso sistema de {transcripción interactiva del habla} (TIH). Este método TIH consigue reducir el esfuerzo de semi-supervisión siempre que sea aceptable una pequeña cantidad de errores; además mejora a la par los modelos RAH subyacentes. Con objeto de transportar el marco propuesto para MOOCs, también se investigaron otros métodos de interacción inteligentes que involucran esfuerzo limitado por parte del usuario. Además, se introdujo un nuevo método que aprovecha las interacciones para mejorar aún más las partes no supervisadas (ASR con {búsqueda restringida}). La investigación en TIH llevada a cabo se desplegó en una plataforma web con el que fue posible producir un número masivo de transcripciones de videos de dos conocidos repositorios, videoLectures.net y poliMedia. Por último, el rendimiento de la TIH y los sistemas de RAH se puede aumentar directamente mediante la mejora de la estimación de la {medida de confianza} (MC) de las palabras transcritas. Por este motivo se desarrollaron dos contribuciones: un nuevo modelo discriminativo {logístico} (LR); y la adaptación al locutor de la MC para los casos en que es posible, como por ejemplo en MOOCs.[CA] Aquest treball hi contribueix al camp del {reconeixment automàtic de la parla} (RAP). I en especial, al de la {transcripció interactiva de la parla} i el de {mesures de confiança} (MC) per a RAP. Els objectius principals són els següents: 1. Dissenyar mètodes i eines per a TIP per tal de millorar les transcripcions automàtiques. 2. Avaluar els mètodes i eines TIP per a tasques de transcripció realistes extretes de grans repositoris de vídeos educacionals. 3. Millorar la fiabilitat del TIP, mitjançant la millora de les MC. Resum: El {reconeixment automàtic de la parla} (RAP) és una tasca crucial per una àmplia gamma d'aplicacions importants que no es poden dur a terme per mitjà de la transcripció manual. El RAP pot proporcionar transcripcions en escenaris de creixent impacte social com els {cursos online oberts massius} (MOOC). Les transcripcions permeten automatitzar tasques com ara cercar, resumir, recomanar, traduir; a més a més, fa accessibles els continguts als parlants no nadius i els usuaris amb discapacitat, etc. Fins i tot, pot millorar el rendiment acadèmic de estudiants que aprenen de xerrades amb subtítols, encara que aquests subtítols no siguen perfectes. Malauradament, la tecnologia RAP actual encara està lluny de la precisió necessària. Les transcripcions imperfectes resultants de RAP poden ser supervisades i corregides manualment, però aquest l'esforç pot acabar sent superior a la transcripció manual. Per tal de resoldre aquest problema, en aquest treball es presenta un sistema nou per a {transcripció interactiva de la parla} (TIP). Aquest sistema TIP va ser reeixit en la reducció de l'esforç per quan es pot permetre una certa quantitat d'errors; així com també en en la millora dels models RAP subjacents. Per tal d'adequar el marc proposat per a MOOCs, també es van investigar altres mètodes d'interacció intel·ligents amb esforç d''usuari limitat. A més a més, es va introduir un nou mètode que aprofita les interaccions per tal de millorar encara més les parts no supervisades (RAP amb {cerca restringida}). La investigació en TIP duta a terme es va desplegar en una plataforma web amb la qual va ser possible produir un nombre massiu de transcripcions semi-supervisades de xerrades de repositoris ben coneguts, videoLectures.net i poliMedia. Finalment, el rendiment de la TIP i els sistemes de RAP es pot augmentar directament mitjançant la millora de l'estimació de la {Confiança Mesura} (MC) de les paraules transcrites. Per tant, es van desenvolupar dues contribucions: un nou model discriminatiu logístic (LR); i l'adaptació al locutor de la MC per casos en que és possible, per exemple amb MOOCs.Sánchez Cortina, I. (2016). Confidence Measures for Automatic and Interactive Speech Recognition [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/61473TESI

    Effect of Speech Recognition Errors on Text Understandability for People who are Deaf or Hard of Hearing

    Get PDF
    Recent advancements in the accuracy of Automated Speech Recognition (ASR) technologies have made them a potential candidate for the task of captioning. However, the presence of errors in the output may present challenges in their use in a fully automatic system. In this research, we are looking more closely into the impact of different inaccurate transcriptions from the ASR system on the understandability of captions for Deaf or Hard-of-Hearing (DHH) individuals. Through a user study with 30 DHH users, we studied the effect of the presence of an error in a text on its understandability for DHH users. We also investigated different prediction models to capture this relation accurately. Among other models, our random forest based model provided the best mean accuracy of 62.04% on the task. Further, we plan to improve this model with more data and use it to advance our investigation on ASR technologies to improve ASR based captioning for DHH users

    A detection-based pattern recognition framework and its applications

    Get PDF
    The objective of this dissertation is to present a detection-based pattern recognition framework and demonstrate its applications in automatic speech recognition and broadcast news video story segmentation. Inspired by the studies of modern cognitive psychology and real-world pattern recognition systems, a detection-based pattern recognition framework is proposed to provide an alternative solution for some complicated pattern recognition problems. The primitive features are first detected and the task-specific knowledge hierarchy is constructed level by level; then a variety of heterogeneous information sources are combined together and the high-level context is incorporated as additional information at certain stages. A detection-based framework is a â divide-and-conquerâ design paradigm for pattern recognition problems, which will decompose a conceptually difficult problem into many elementary sub-problems that can be handled directly and reliably. Some information fusion strategies will be employed to integrate the evidence from a lower level to form the evidence at a higher level. Such a fusion procedure continues until reaching the top level. Generally, a detection-based framework has many advantages: (1) more flexibility in both detector design and fusion strategies, as these two parts can be optimized separately; (2) parallel and distributed computational components in primitive feature detection. In such a component-based framework, any primitive component can be replaced by a new one while other components remain unchanged; (3) incremental information integration; (4) high level context information as additional information sources, which can be combined with bottom-up processing at any stage. This dissertation presents the basic principles, criteria, and techniques for detector design and hypothesis verification based on the statistical detection and decision theory. In addition, evidence fusion strategies were investigated in this dissertation. Several novel detection algorithms and evidence fusion methods were proposed and their effectiveness was justified in automatic speech recognition and broadcast news video segmentation system. We believe such a detection-based framework can be employed in more applications in the future.Ph.D.Committee Chair: Lee, Chin-Hui; Committee Member: Clements, Mark; Committee Member: Ghovanloo, Maysam; Committee Member: Romberg, Justin; Committee Member: Yuan, Min

    Log-linear system combination using structured support vector machines

    Get PDF
    Building high accuracy speech recognition systems with limited language resources is a highly challenging task. Although the use of multi-language data for acoustic models yields improvements, performance is often unsatisfactory with highly limited acoustic training data. In these situations, it is possible to consider using multiple well trained acoustic models and combine the system outputs together. Unfortunately, the computational cost associated with these approaches is high as multiple decoding runs are required. To address this problem, this paper examines schemes based on log-linear score combination. This has a number of advantages over standard combination schemes. Even with limited acoustic training data, it is possible to train, for example, phone-specific combination weights, allowing detailed relationships between the available well trained models to be obtained. To ensure robust parameter estimation, this paper casts log-linear score combination into a structured support vector machine (SSVM) learning task. This yields a method to train model parameters with good generalisation properties. Here the SSVM feature space is a set of scores from well-trained individual systems. The SSVM approach is compared to lattice rescoring and confusion network combination using language packs released within the IARPA Babel program

    DNN adaptation by automatic quality estimation of ASR hypotheses

    Full text link
    In this paper we propose to exploit the automatic Quality Estimation (QE) of ASR hypotheses to perform the unsupervised adaptation of a deep neural network modeling acoustic probabilities. Our hypothesis is that significant improvements can be achieved by: i)automatically transcribing the evaluation data we are currently trying to recognise, and ii) selecting from it a subset of "good quality" instances based on the word error rate (WER) scores predicted by a QE component. To validate this hypothesis, we run several experiments on the evaluation data sets released for the CHiME-3 challenge. First, we operate in oracle conditions in which manual transcriptions of the evaluation data are available, thus allowing us to compute the "true" sentence WER. In this scenario, we perform the adaptation with variable amounts of data, which are characterised by different levels of quality. Then, we move to realistic conditions in which the manual transcriptions of the evaluation data are not available. In this case, the adaptation is performed on data selected according to the WER scores "predicted" by a QE component. Our results indicate that: i) QE predictions allow us to closely approximate the adaptation results obtained in oracle conditions, and ii) the overall ASR performance based on the proposed QE-driven adaptation method is significantly better than the strong, most recent, CHiME-3 baseline.Comment: Computer Speech & Language December 201
    corecore