316 research outputs found

    Word Importance Modeling to Enhance Captions Generated by Automatic Speech Recognition for Deaf and Hard of Hearing Users

    People who are deaf or hard-of-hearing (DHH) benefit from sign-language interpreting or live captioning (with a human transcriptionist) to access spoken information. However, such services are often not legally required, affordable, or available in many settings, e.g., impromptu small-group meetings in the workplace or online video content that has not been professionally captioned. As Automatic Speech Recognition (ASR) systems improve in accuracy and speed, it is natural to investigate using these systems to assist DHH users in a variety of tasks. However, ASR systems are still imperfect, especially in realistic conversational settings, raising issues of trust in and acceptance of these systems within the DHH community. To address these challenges, our work focuses on: (1) building metrics for accurately evaluating the quality of automatic captioning systems, and (2) designing interventions for improving the usability of captions for DHH users. The first part of this dissertation describes our research on methods for identifying words that are important for understanding the meaning of a conversational turn within transcripts of spoken dialogue. Such knowledge about the relative importance of words in spoken messages can be used to evaluate ASR systems (part 2 of this dissertation) or to create new applications for DHH users of captioned video (part 3 of this dissertation). We found that models which consider both the acoustic properties of spoken words and text-based features (e.g., pre-trained word embeddings) are more effective at predicting the semantic importance of a word than models that use only one of these feature types. The second part of this dissertation describes studies of DHH users' perception of the quality of ASR-generated captions; the goal of this work was to validate the design of automatic metrics for evaluating captions in real-time applications for these users. Such a metric could facilitate comparison of ASR systems, to determine the suitability of specific systems for supporting communication for DHH users. We designed experimental studies to elicit feedback on the quality of captions from DHH users, and we developed and evaluated automatic metrics for predicting the usability of automatically generated captions for these users. We found that metrics that consider the importance of each word in a text are more effective at predicting the usability of imperfect text captions than the traditional Word Error Rate (WER) metric. The final part of this dissertation describes research on importance-based highlighting of words in captions as a way to enhance the usability of captions for DHH users. As with highlighting in static texts (e.g., textbooks or electronic documents), highlighting in captions involves changing the appearance of some text in the captions so that readers can attend to the most important information quickly. Despite the known benefits of highlighting in static texts, its usefulness in captions for DHH users remains largely unexplored. For this reason, we conducted experimental studies with DHH participants to understand the benefits of importance-based highlighting in captions and their preferences among different design configurations for it. We found that DHH users subjectively preferred highlighting in captions, and they reported higher readability and understandability scores and lower task-load scores when viewing videos with captions containing highlighting compared to videos without highlighting. Further, in partial contrast to recommendations in prior research on highlighting in static texts (which had not been based on experimental studies with DHH users), we found that DHH participants preferred boldface, word-level, non-repeating highlighting in captions
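    To make the contrast between WER and an importance-aware metric concrete, here is a minimal sketch (not the dissertation's actual metric; the function names, the bag-of-words simplification, and the default weight of 0.1 are assumptions):

        def word_error_rate(ref, hyp):
            # Standard WER: word-level Levenshtein distance / reference length.
            r, h = ref.split(), hyp.split()
            d = [[i + j if i * j == 0 else 0 for j in range(len(h) + 1)]
                 for i in range(len(r) + 1)]
            for i in range(1, len(r) + 1):
                for j in range(1, len(h) + 1):
                    d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution/match
                                  d[i - 1][j] + 1,                           # deletion
                                  d[i][j - 1] + 1)                           # insertion
            return d[-1][-1] / max(len(r), 1)

        def importance_weighted_error(ref, hyp, importance):
            # Sketch: a reference word counts as an error when it is missing
            # from the hypothesis (bag-of-words view); each error is weighted
            # by the word's importance score in [0, 1], defaulting to 0.1.
            ref_words, hyp_words = ref.split(), set(hyp.split())
            total = sum(importance.get(w, 0.1) for w in ref_words)
            missed = sum(importance.get(w, 0.1) for w in ref_words if w not in hyp_words)
            return missed / total if total else 0.0

        ref = "the meeting moved to tuesday at three"
        hyp = "the meeting moved to thursday at three"
        # WER scores this one substitution like any other error; the weighted
        # metric penalizes it heavily because "tuesday" carries the message.
        print(word_error_rate(ref, hyp))                              # ~0.143
        print(importance_weighted_error(ref, hyp, {"tuesday": 1.0}))  # 0.625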

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR
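    As a minimal illustration of that combination (a sketch, not a system from the survey; the segment granularity and plain TF-IDF weighting are assumptions), text-IR indexing can be applied directly to ASR transcripts of spoken segments:

        import math
        from collections import Counter

        def build_index(segments):
            # segments: {segment_id: ASR transcript of that audio/video segment}.
            tf = {sid: Counter(text.lower().split()) for sid, text in segments.items()}
            df = Counter()
            for counts in tf.values():
                df.update(counts.keys())
            n = len(segments)
            idf = {t: math.log(n / df[t]) for t in df}  # rarer terms weigh more
            return tf, idf

        def search(query, tf, idf):
            # Rank spoken segments by TF-IDF overlap with a text query; playback
            # would then jump to the start time of the top-ranked segment.
            terms = query.lower().split()
            scores = {sid: sum(c[t] * idf.get(t, 0.0) for t in terms)
                      for sid, c in tf.items()}
            return sorted(scores.items(), key=lambda kv: -kv[1])

        tf, idf = build_index({
            "talk1_00:00": "welcome to the lecture on speech retrieval",
            "talk1_05:00": "neural models for automatic speech recognition",
        })
        print(search("speech recognition", tf, idf)[0][0])  # talk1_05:00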

    MISPRONUNCIATION DETECTION AND DIAGNOSIS IN MANDARIN ACCENTED ENGLISH SPEECH

    This work presents the development, implementation, and evaluation of a Mispronunciation Detection and Diagnosis (MDD) system, with application to pronunciation evaluation of Mandarin-accented English speech. A comprehensive detection and diagnosis of errors in the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) was performed using the expert phonetic transcripts and an Automatic Speech Recognition (ASR) system. Articulatory features derived from the parallel kinematic data available in the EMA-MAE corpus were used to identify the most significant articulatory error patterns seen in L2 speakers during common mispronunciations. Using both acoustic and articulatory information, an ASR-based MDD system was built and evaluated across different feature combinations and Deep Neural Network (DNN) architectures. The MDD system captured mispronunciation errors with a detection accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%. The results demonstrate the advantage of using articulatory features both in revealing the significant contributors to mispronunciation and in improving the performance of MDD systems
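    The three reported figures follow standard MDD bookkeeping over phones; a sketch under common definitions from the MDD literature (the thesis's exact scoring protocol may differ):

        def mdd_metrics(items):
            # items: list of (canonical, actual, predicted) phone labels, where
            # canonical = expected phone, actual = what the speaker said (from
            # the expert transcripts), predicted = the ASR system's output.
            ta = tr = fa = fr = correct_diag = 0
            for canonical, actual, predicted in items:
                mispronounced = actual != canonical
                flagged = predicted != canonical
                if not mispronounced and not flagged:
                    ta += 1   # true acceptance
                elif not mispronounced and flagged:
                    fr += 1   # false rejection: correct speech wrongly flagged
                elif mispronounced and not flagged:
                    fa += 1   # false acceptance: mispronunciation missed
                else:
                    tr += 1   # true rejection: mispronunciation caught
                    correct_diag += predicted == actual  # diagnosis also right
            detection_acc = (ta + tr) / len(items)
            false_rejection_rate = fr / (fr + ta) if (fr + ta) else 0.0
            diagnostic_acc = correct_diag / tr if tr else 0.0
            return detection_acc, diagnostic_acc, false_rejection_rate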

    Identification of Informativeness in Text using Natural Language Stylometry

    In this age of information overload, one experiences a rapidly growing over-abundance of written text. To assist with handling this bounty, this plethora of texts is now widely used to develop and optimize statistical natural language processing (NLP) systems. Surprisingly, using more fragments of text to train these statistical NLP systems does not necessarily lead to improved performance. We hypothesize that the fragments that help the most with training are those that contain the desired information. Therefore, determining informativeness in text has become a central issue in our view of NLP. Recent developments in this field have spawned a number of solutions to identify informativeness in text. Nevertheless, a shortfall of most of these solutions is their dependency on the genre and domain of the text. In addition, most of them do not perform consistently across natural language processing problem areas. We therefore attempt to provide a more general solution to this NLP problem. This thesis takes a different approach by considering the underlying theme of a linguistic theory known as the Code Quantity Principle. This theory suggests that humans codify information in text so that readers can retrieve it more efficiently. During the codification process, humans usually vary elements of their writing ranging from characters to sentences; examples of such elements are simple words, complex words, function words, content words, and syllables. The theory suggests that these elements have reasonable discriminating strength and can play a key role in distinguishing informativeness in natural language text. In another vein, stylometry is a modern method of analyzing literary style that deals largely with the aforementioned elements of writing. With this as background, we model text using a set of stylometric attributes to characterize variations in the writing style present in it, and we explore their effectiveness in determining informativeness in text. To the best of our knowledge, this is the first use of stylometric attributes to determine informativeness in statistical NLP. In doing so, we use texts of different genres, viz., scientific papers, technical reports, emails, and newspaper articles, selected from assorted domains like agriculture, physics, and biomedical science. The variety of NLP systems that have benefited from incorporating these stylometric attributes while processing this multifarious set of texts suggests that the attributes can be regarded as an effective solution for identifying informativeness in text. Beyond the variety of text genres and domains, the potential of stylometric attributes is also explored in several NLP application areas, including biomedical relation mining, automatic keyphrase indexing, spam classification, and text summarization, where performance improvement is both important and challenging. The success of the attributes in all these areas further highlights their usefulness
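    A minimal sketch of the kind of attributes involved (the thesis's actual attribute set and function-word list are richer; everything here, including the vowel-group syllable heuristic and the 3+-syllable complexity cutoff, is an illustrative assumption):

        import re

        FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "or",
                          "is", "are", "was", "were", "that", "this", "it", "for"}

        def count_syllables(word):
            # Crude vowel-group heuristic; adequate for a feature sketch.
            return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

        def stylometric_features(sentence):
            # Character- to word-level attributes of the kind the Code Quantity
            # Principle points at: simple/complex words, function/content words.
            words = re.findall(r"[A-Za-z']+", sentence)
            n = len(words) or 1
            syllables = [count_syllables(w) for w in words]
            return {
                "function_word_ratio": sum(w.lower() in FUNCTION_WORDS for w in words) / n,
                "content_word_ratio": sum(w.lower() not in FUNCTION_WORDS for w in words) / n,
                "complex_word_ratio": sum(s >= 3 for s in syllables) / n,
                "mean_word_length": sum(len(w) for w in words) / n,
                "mean_syllables_per_word": sum(syllables) / n,
            }

        print(stylometric_features("The electromagnetic properties of the corpus were analyzed"))

    Feature vectors like these can then be fed to any standard classifier to score sentences by predicted informativeness.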

    DISCO: A Large Scale Human Annotated Corpus for Disfluency Correction in Indo-European Languages

    Disfluency correction (DC) is the process of removing disfluent elements like fillers, repetitions, and corrections from spoken utterances to create readable and interpretable text. DC is a vital post-processing step applied to Automatic Speech Recognition (ASR) outputs before subsequent processing by downstream language understanding tasks. Existing DC research has primarily focused on English due to the unavailability of large-scale open-source datasets in other languages. Towards the goal of multilingual disfluency correction, we present a high-quality human-annotated DC corpus covering four important Indo-European languages: English, Hindi, German, and French. We provide extensive analysis of the results of state-of-the-art DC models across all four languages, obtaining F1 scores of 97.55 (English), 94.29 (Hindi), 95.89 (German), and 92.97 (French). To demonstrate the benefits of DC on downstream tasks, we show that DC leads to an average increase of 5.65 BLEU points when used in conjunction with a state-of-the-art Machine Translation (MT) system. We release the code to run our experiments along with our annotated dataset. Comment: Accepted at EMNLP 2023 Findings
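    Disfluency correction is commonly cast as token-level deletion and scored with F1 over the disfluent tokens; a sketch under that common framing (the paper's exact evaluation protocol is an assumption here):

        def dc_f1(gold_labels, pred_labels):
            # Token-level F1 for disfluency tagging: label 1 marks a token
            # that should be deleted from the utterance.
            tp = sum(g == 1 and p == 1 for g, p in zip(gold_labels, pred_labels))
            fp = sum(g == 0 and p == 1 for g, p in zip(gold_labels, pred_labels))
            fn = sum(g == 1 and p == 0 for g, p in zip(gold_labels, pred_labels))
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            return (2 * precision * recall / (precision + recall)
                    if precision + recall else 0.0)

        # e.g. "i uh i want want to go" -> corrected text "i want to go"
        tokens = ["i", "uh", "i", "want", "want", "to", "go"]
        gold   = [1, 1, 0, 1, 0, 0, 0]  # filler and repetitions marked for deletion
        pred   = [1, 1, 0, 0, 0, 0, 0]  # a model that misses one repetition
        print(dc_f1(gold, pred))        # 0.8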

    Testing quality in interlingual respeaking and other methods of interlingual live subtitling

    Live subtitling (LS) has its foundations in pre-recorded subtitling for the d/Deaf and hard of hearing (SDH) and produces real-time subtitles for live events and programs. LS involves the transfer of oral into written content (intersemiotic translation) and can be carried out from and to the same language (intralingual) or from one language to another (interlingual), thereby providing accessibility for deaf audiences while also guaranteeing multilingual access to audiovisual content. Interlingual Live Subtitling (from now on referred to as ILS) is currently achieved through different methods: the focus here is placed on interlingual respeaking, one of the currently used methods of LS, also referred to in this work as speech-to-text interpreting (STTI), which has attracted growing interest over the past years, including in the Italian industry. This doctoral thesis provides an overview of the literature and research on intralingual and interlingual respeaking to date, with particular emphasis on the current situation of this practice in Italy. The aim of the research was to explore different ILS methods, highlighting their strengths and weaknesses, in an attempt to inform the industry about the potentialities and risks that different techniques can have for the final overall quality of the subtitles. To do so, five ILS workflows requiring human-machine interaction to different extents were tested and analyzed in terms of quality: not only linguistic accuracy, but also another crucial factor, the delay in the broadcast of the subtitles. Two case studies were carried out with different language pairs. The first experiment (English to Italian) tested and assessed the quality of interlingual respeaking, of simultaneous interpreting (SI) combined with intralingual respeaking, and of SI combined with Automatic Speech Recognition (ASR). The second experiment (Spanish to Italian) evaluated and compared five methods: the three just mentioned, and two more machine-centered ones, namely intralingual respeaking combined with machine translation (MT), and ASR combined with MT. Two workshops on interlingual respeaking were offered in the master's degree program in Translation and Interpreting at the University of Genova to prepare students for the experiments, aimed at testing different training modules on ILS and their effectiveness on students' learning outcomes. For the final experiments, students were assigned different roles for each tested method and produced ILS from the same source text: a video of a full original speech at a live event. The obtained outputs were analyzed using the NTR model (Romero-Fresco & Pöchhacker, 2017), and the delay was calculated for each method. Preliminary quantitative results from the NTR analyses and the delay calculations were compared with two other case studies conducted by the University of Vigo and the University of Surrey, showing that more automated and fully automated workflows are indeed faster than the others, while still presenting several important issues in translation and punctuation. Albeit on a small scale, the research also shows how urgent, and potentially easy, it could be to educate translators and interpreters in respeaking during their training, given their keen interest in the subject matter. It is hoped that the results obtained can shed light on the repercussions of using the different methods and prompt further reflection on the importance of human interaction with automatic speech recognition and machine translation systems in providing high-quality accessibility at live events. It is also hoped that the involved students' interest in this field, which was completely unknown to them prior to this research, can inform the urgency of raising students' awareness and competence acquisition in the field of live subtitling through respeaking
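    For reference, the NTR model scores accuracy by deducting severity-weighted translation (T) and recognition (R) errors from the word count N, i.e. Accuracy = (N - T - R) / N x 100, with 98% often cited as the usability threshold. A minimal sketch (the severity weights and the delay definition below are assumptions, not taken from the thesis):

        def ntr_accuracy(n_words, t_errors, r_errors):
            # NTR accuracy rate (Romero-Fresco & Pöchhacker, 2017): t_errors and
            # r_errors are severity-weighted sums (e.g. minor 0.25, major 0.50,
            # critical 1.00 -- weights assumed here for illustration).
            return (n_words - t_errors - r_errors) / n_words * 100

        def mean_delay(speech_times, subtitle_times):
            # Average seconds between a unit being spoken and its subtitle
            # appearing on screen: one simple way to operationalize delay.
            return sum(s - t for t, s in zip(speech_times, subtitle_times)) / len(speech_times)

        # 500 subtitled words with a few weighted T and R errors:
        print(ntr_accuracy(500, t_errors=4.25, r_errors=2.5))   # 98.65
        print(mean_delay([0.0, 5.0, 10.0], [4.1, 9.8, 15.2]))   # ~4.7 s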

    Exploring simplified subtitles to support spoken language understanding

    Understanding spoken language is a crucial skill we need throughout our lives. Yet it can be difficult for various reasons, especially for those who are hard-of-hearing or just learning a language. Captions or subtitles are a common means of making spoken information accessible. Verbatim transcriptions of talks or lectures are often cumbersome to read, as we generally speak faster than we read. Thus, subtitles are often edited, manually or automatically, to improve their readability. This thesis explores the automatic summarization of sentences, employing sentence compression by deletion with recurrent neural networks. We tackle the task of sentence compression from two directions: on one hand, we look at a technical solution to the problem; on the other, we take a human-centered perspective by investigating the effect of compressed subtitles on comprehension and cognitive load in a user study. The contribution is thus twofold: we present a neural network model for sentence compression and the results of a user study evaluating the concept of simplified subtitles. On the technical side, 60 different configurations of the model were tested. The best-scoring models achieved results comparable to state-of-the-art approaches. We use a Sequence to Sequence architecture together with a compression ratio parameter to control the resulting compression ratio. The best-scoring model configuration achieved a compression ratio accuracy of 42.1 %, which can serve as a baseline for future experiments in that direction. Results from the 30 participants of the user study show that shortened subtitles can be sufficient to foster comprehension but result in higher cognitive load. Based on that feedback, we gathered design suggestions to improve the usability of future implementations. Overall, this thesis provides insights on the technological side as well as from the end-user perspective, contributing to easier access to spoken language
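    To make "compression ratio" and "compression ratio accuracy" concrete, a sketch of one plausible reading (the thesis's exact definition, and the 0.05 tolerance used here, are assumptions): deletion-based compression keeps an ordered subset of the source tokens, and an output counts as a hit when its realized ratio falls close enough to the requested target.

        def compression_ratio(tokens, kept_mask):
            # Fraction of source tokens retained by deletion-based compression.
            return sum(kept_mask) / len(tokens)

        def ratio_accuracy(pairs, target, tolerance=0.05):
            # Fraction of outputs whose realized ratio is within +-tolerance
            # of the requested compression ratio parameter.
            hits = sum(abs(compression_ratio(toks, mask) - target) <= tolerance
                       for toks, mask in pairs)
            return hits / len(pairs)

        tokens = "the results of the study were quite surprising overall".split()
        mask   = [0, 1, 0, 0, 1, 1, 0, 1, 0]  # keep "results study were surprising"
        print(compression_ratio(tokens, mask))              # ~0.44
        print(ratio_accuracy([(tokens, mask)], target=0.4)) # 1.0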
    • 
