Word Importance Modeling to Enhance Captions Generated by Automatic Speech Recognition for Deaf and Hard of Hearing Users
People who are deaf or hard-of-hearing (DHH) benefit from sign-language interpreting or live-captioning (with a human transcriptionist) to access spoken information. However, such services are often not legally required, affordable, or available in many settings, e.g., impromptu small-group meetings in the workplace or online video content that has not been professionally captioned. As Automatic Speech Recognition (ASR) systems improve in accuracy and speed, it is natural to investigate the use of these systems to assist DHH users in a variety of tasks. However, ASR systems are still not perfect, especially in realistic conversational settings, raising issues of trust in and acceptance of these systems within the DHH community. To address these challenges, our work focuses on: (1) building metrics for accurately evaluating the quality of automatic captioning systems, and (2) designing interventions for improving the usability of captions for DHH users.
The first part of this dissertation describes our research on methods for identifying words that are important for understanding the meaning of a conversational turn within transcripts of spoken dialogue. Such knowledge about the relative importance of words in spoken messages can be used in evaluating ASR systems (in part 2 of this dissertation) or creating new applications for DHH users of captioned video (in part 3 of this dissertation). We found that models which consider both the acoustic properties of spoken words as well as text-based features (e.g., pre-trained word embeddings) are more effective at predicting the semantic importance of a word than models that utilize only one of these types of features.
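The finding above, that models combining acoustic properties with text-based features outperform single-modality models, can be illustrated with a toy scorer. Everything below (the feature layout, the weights, the logistic form) is an illustrative assumption, not the dissertation's trained model:

```python
import math

def word_importance_score(text_feats, acoustic_feats, w_text, w_acoustic, bias=0.0):
    """Score a word's semantic importance in (0, 1) from text-based and
    acoustic feature vectors with a logistic model. The feature layout
    and weights are illustrative assumptions, not the trained model."""
    z = bias
    z += sum(w * x for w, x in zip(w_text, text_feats))
    z += sum(w * x for w, x in zip(w_acoustic, acoustic_feats))
    return 1.0 / (1.0 + math.exp(-z))

# Toy example: a 3-dim word-embedding slice plus two acoustic cues
# (say, normalized duration and pitch range -- both made up here).
emb = [0.2, -0.1, 0.4]
acoustic = [0.8, 0.6]          # a long, prosodically prominent word
w_t = [0.5, 0.3, 0.9]
w_a = [1.2, 0.7]
score = word_importance_score(emb, acoustic, w_t, w_a)
```

Zeroing out the acoustic cues in this toy setup lowers the score, mirroring the qualitative claim that the two feature types are complementary.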
The second part of this dissertation describes studies to understand DHH users' perception of the quality of ASR-generated captions; the goal of this work was to validate the design of automatic metrics for evaluating captions in real-time applications for these users. Such a metric could facilitate comparison of various ASR systems when determining the suitability of specific ASR systems for supporting communication for DHH users. We designed experimental studies to elicit feedback on the quality of captions from DHH users, and we developed and evaluated automatic metrics for predicting the usability of automatically generated captions for these users. We found that metrics that consider the importance of each word in a text are more effective at predicting the usability of imperfect text captions than the traditional Word Error Rate (WER) metric.
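To make the contrast with WER concrete, here is a minimal sketch: standard WER via word-level edit distance, next to a toy importance-weighted error that charges each missed reference word its importance weight. The set-based alignment and the weights are invented for illustration; the dissertation's actual metrics are more sophisticated.

```python
def wer(ref, hyp):
    """Standard word error rate: edit distance over word sequences,
    normalized by reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def weighted_error(ref, hyp, importance):
    """Toy importance-weighted error: each missed reference word costs
    its importance weight rather than a flat 1. The set-based alignment
    is a shortcut, not the dissertation's metric."""
    hyp_set = set(hyp)
    missed = sum(importance.get(w, 0.5) for w in ref if w not in hyp_set)
    total = sum(importance.get(w, 0.5) for w in ref)
    return missed / total if total else 0.0

ref = "please send the budget report".split()
hyp = "please send the budget".split()      # ASR dropped "report"
importance = {"please": 0.2, "send": 0.6, "the": 0.1,
              "budget": 1.0, "report": 1.0}
plain = wer(ref, hyp)                        # one error in five words
weighted = weighted_error(ref, hyp, importance)
```

Because the dropped word ("report") carries high importance, the weighted error penalizes this hypothesis more heavily than plain WER does, which is the intuition behind importance-based caption metrics.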
The final part of this dissertation describes research on importance-based highlighting of words in captions, as a way to enhance the usability of captions for DHH users. Similar to highlighting in static texts (e.g., textbooks or electronic documents), highlighting in captions involves changing the appearance of some text in captions to enable readers to attend to the most important bits of information quickly. Despite the known benefits of highlighting in static texts, the usefulness of highlighting in captions for DHH users is largely unexplored. For this reason, we conducted experimental studies with DHH participants to understand the benefits of importance-based highlighting in captions, and their preferences among different design configurations for highlighting in captions. We found that DHH users subjectively preferred highlighting in captions, and they reported higher readability and understandability scores and lower task-load scores when viewing videos with captions containing highlighting compared to the videos without highlighting. Further, in partial contrast to recommendations in prior research on highlighting in static texts (which had not been based on experimental studies with DHH users), we found that DHH participants preferred boldface, word-level, non-repeating highlighting in captions.
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
Mispronunciation Detection and Diagnosis in Mandarin-Accented English Speech
This work presents the development, implementation, and evaluation of a Mispronunciation Detection and Diagnosis (MDD) system, with application to pronunciation evaluation of Mandarin-accented English speech. A comprehensive detection and diagnosis of errors in the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) was performed using the expert phonetic transcripts and an Automatic Speech Recognition (ASR) system. Articulatory features derived from the parallel kinematic data available in the EMA-MAE corpus were used to identify the most significant articulatory error patterns seen in L2 speakers during common mispronunciations. Using both acoustic and articulatory information, an ASR-based MDD system was built and evaluated across different feature combinations and Deep Neural Network (DNN) architectures. The MDD system captured mispronunciation errors with a detection accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%. The results demonstrate the advantage of using articulatory features in revealing the significant contributors to mispronunciation as well as in improving the performance of MDD systems.
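The detection figures reported above are standard confusion-matrix quantities. A minimal sketch of how such numbers are computed follows; the counts are invented for illustration (chosen to land near the reported rates), not taken from the EMA-MAE evaluation:

```python
def detection_metrics(tp, tn, fp, fn):
    """Mispronunciation detection as binary classification, where
    positive = 'mispronounced'. Returns (accuracy, false_rejection_rate);
    a false rejection flags a correct pronunciation as an error.
    This framing is a common convention, assumed here rather than
    taken from the thesis."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    false_rejection = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, false_rejection

# Hypothetical counts over 1000 evaluated phone segments.
acc, frr = detection_metrics(tp=412, tn=412, fp=86, fn=90)
```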
Identification of Informativeness in Text using Natural Language Stylometry
In this age of information overload, one experiences a rapidly growing over-abundance of written text. To assist with handling this bounty, this plethora of texts is now widely used to develop and optimize statistical natural language processing (NLP) systems. Surprisingly, using more fragments of text to train these statistical NLP systems may not necessarily lead to improved performance. We hypothesize that the fragments that help the most with training are those that contain the desired information. Therefore, determining informativeness in text has become a central issue in our view of NLP. Recent developments in this field have spawned a number of solutions to identify informativeness in text. Nevertheless, a shortfall of most of these solutions is their dependency on the genre and domain of the text. In addition, most of them are not effective across different natural language processing problem areas. Therefore, we attempt to provide a more general solution to this NLP problem.
This thesis takes a different approach to this problem by considering the underlying theme of a linguistic theory known as the Code Quantity Principle. This theory suggests that humans codify information in text so that readers can retrieve this information more efficiently. During the codification process, humans usually change elements of their writing ranging from characters to sentences. Examples of such elements are the use of simple words, complex words, function words, content words, syllables, and so on. This theory suggests that these elements have reasonable discriminating strength and can play a key role in distinguishing informativeness in natural language text. In another vein, Stylometry is a modern method to analyze literary style and deals largely with the aforementioned elements of writing. With this as background, we model text using a set of stylometric attributes to characterize variations in writing style present in it. We explore their effectiveness to determine informativeness in text. To the best of our knowledge, this is the first use of stylometric attributes to determine informativeness in statistical NLP. In doing so, we use texts of different genres, viz., scientific papers, technical reports, emails and newspaper articles, that are selected from assorted domains like agriculture, physics, and biomedical science. The variety of NLP systems that have benefitted from incorporating these stylometric attributes somewhere in their computational realm dealing with this set of multifarious texts suggests that these attributes can be regarded as an effective solution to identify informativeness in text. In addition to the variety of text genres and domains, the potential of stylometric attributes is also explored in some NLP application areas---including biomedical relation mining, automatic keyphrase indexing, spam classification, and text summarization---where performance improvement is both important and challenging. 
The success of the attributes in all these areas further highlights their usefulness.
DISCO: A Large Scale Human Annotated Corpus for Disfluency Correction in Indo-European Languages
Disfluency correction (DC) is the process of removing disfluent elements like fillers, repetitions and corrections from spoken utterances to create readable and interpretable text. DC is a vital post-processing step applied to Automatic Speech Recognition (ASR) outputs, before subsequent processing by downstream language understanding tasks. Existing DC research has primarily focused on English due to the unavailability of large-scale open-source datasets. Towards the goal of multilingual disfluency correction, we present a high-quality human-annotated DC corpus covering four important Indo-European languages: English, Hindi, German and French. We provide extensive analysis of results of state-of-the-art DC models across all four languages, obtaining F1 scores of 97.55 (English), 94.29 (Hindi), 95.89 (German) and 92.97 (French). To demonstrate the benefits of DC on downstream tasks, we show that DC leads to a 5.65-point increase in BLEU scores on average when used in conjunction with a state-of-the-art Machine Translation (MT) system. We release code to run our experiments along with our annotated dataset here.
Comment: Accepted at EMNLP 2023 Findings
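To give a flavor of what DC does, the sketch below strips single-token fillers and immediate word repetitions with hand-written rules. The filler list is invented for illustration; the models benchmarked on DISCO are learned sequence models, not rules like this.

```python
FILLERS = {"um", "uh", "erm", "hmm"}   # illustrative list, not the DISCO scheme

def remove_simple_disfluencies(utterance):
    """Drop filler tokens and immediate word repetitions from an
    ASR-style transcript. A rule-based toy, not a learned DC model."""
    out = []
    for word in utterance.lower().split():
        if word in FILLERS:
            continue
        if out and out[-1] == word:     # "to to go" -> "to go"
            continue
        out.append(word)
    return " ".join(out)

cleaned = remove_simple_disfluencies("um I I want to to go there")
```

Real disfluencies (restarts, self-corrections such as "on Monday, I mean Tuesday") require context-sensitive models, which is precisely why annotated corpora like DISCO matter.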
Enabling Structured Navigation of Longform Spoken Dialog with Automatic Summarization
Longform spoken dialog is a rich source of information that is present in all facets of everyday life, taking the form of podcasts, debates, and interviews; these media cover important topics ranging from healthcare and diversity to current events, economics and politics. Individuals need to digest informative content to know how to vote, how to stay safe from COVID-19, and how to increase diversity in the workplace.
Unfortunately, compared to text, spoken dialog can be challenging to consume: it is slower than reading and difficult to skim or navigate. Although an individual may be interested in a given topic, they may be unwilling to commit the time necessary to consume longform auditory media given the uncertainty as to whether such content will live up to their expectations. Clearly, there exists a need to provide access to the information spoken dialog provides in a manner through which individuals can quickly and intuitively access areas of interest without investing large amounts of time.
From Human Computer Interaction, we apply the idea of information foraging, which theorizes how people browse and navigate to satisfy an information need, to the longform spoken dialog domain. Information foraging holds that people do not browse linearly. Rather, people "forage" for information much as animals sniff around for food, scanning from area to area, constantly deciding whether to keep investigating their current area or to move on to greener pastures. This is an instance of the classic breadth vs. depth dilemma. People rely on perceived structure and information cues to make these decisions. Unfortunately, speech, either spoken or transcribed, is unstructured and lacks information cues, making it difficult for users to browse and navigate.
We create a longform spoken dialog browsing system that utilizes automatic summarization and speech modeling to structure longform dialog to present information in a manner that is both intuitive and flexible towards different user browsing needs. Leveraging summarization models to automatically and hierarchically structure spoken dialog, the system is able to distill information into increasingly salient and abstract summaries, allowing for a tiered representation that, if interested, users can progressively explore. Additionally, we address spoken dialog's own set of technical challenges to speech modeling that are not present in written text, such as disfluencies, improper punctuation, lack of annotated speech data, and inherent lack of structure.
Since summarization is a lossy compression of information, the system provides users with information cues to signal how much additional information is contained on a topic.
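The tiered representation described above amounts to a tree of increasingly abstract summaries over dialog segments. A minimal data-structure sketch follows; the names, fields, and example content are assumptions for illustration, not the authors' implementation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SummaryNode:
    """One tier of the hierarchy: a summary of a dialog span, with
    more detailed child summaries the user can drill down into."""
    summary: str
    start_sec: float             # where this span begins in the audio
    end_sec: float
    children: List["SummaryNode"] = field(default_factory=list)

    def depth(self):
        """Number of tiers below and including this node."""
        return 1 + max((c.depth() for c in self.children), default=0)

# A two-tier toy hierarchy for one hypothetical podcast segment.
root = SummaryNode(
    "Guests debate remote-work policy.", 0.0, 540.0,
    children=[
        SummaryNode("Host frames the hybrid-vs-office question.", 0.0, 120.0),
        SummaryNode("Guest argues hybrid work improves retention.", 120.0, 540.0),
    ],
)
```

In a browsing interface, a user skims the root summaries and expands `children` only for spans that look interesting, matching the foraging behavior described earlier.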
This thesis makes the following contributions:
1. We applied the HCI concept of information foraging to longform speech, enabling people to browse and navigate information in podcasts, interviews, panels, and meetings.
2. We created a system that structures longform dialog into hierarchical summaries which help users to 1) skim (browse) audio and 2) navigate and drill down into interesting sections to read full details.
3. We created a human annotated hierarchical dataset to quantitatively evaluate the effectiveness of our system's hierarchical text generation performance.
4. Lastly, we developed a suite of dialog-oriented processing optimizations to improve the user experience of summaries: enhanced readability and fluency of short summaries through better topic chunking and pronoun imputation, and reliable indication of semantic coverage within short summaries to help direct navigation towards interesting information.
We discuss future research in extending the browsing and navigation system to more challenging domains such as lectures, which contain many external references, or workplace conversations, which contain uncontextualized background information and are far less structured than podcasts and interviews.
Testing quality in interlingual respeaking and other methods of interlingual live subtitling
Live subtitling (LS) finds its foundations in pre-recorded subtitling for the d/Deaf and hard of hearing (SDH) to produce real-time subtitles for live events and programs. LS implies the transfer from oral into written content (intersemiotic translation) and can be carried out from and to the same language (intralingual), or from one language to another (interlingual) to provide full accessibility for all, therefore combining SDH with the need to guarantee multilingual access as well. Interlingual Live Subtitling (from now on referred to as ILS) in real time is currently being achieved using different methods: the focus here is placed on interlingual respeaking as one of the currently used methods of LS, also referred to in this work as speech-to-text interpreting (STTI), which has triggered growing interest in the Italian industry over the past years.
The doctoral thesis presented here intends to provide a wider picture of the literature and the research on intralingual and interlingual respeaking to date, emphasizing the current situation of this practice in Italy.
The aim of the research was to explore different ILS methods through their strengths and weaknesses, in an attempt to inform the industry of the impact that both potentialities and risks can have on the final overall quality of the subtitles when different techniques are involved in producing ILS. To do so, five ILS workflows requiring human and machine interaction to different extents were tested in terms of quality, considering not only linguistic accuracy but also another crucial factor: the delay in the broadcast of the subtitles. Two case studies were carried out with different language pairs: a first experiment (English to Italian) tested and assessed quality in interlingual respeaking on one hand, then simultaneous interpreting (SI) combined with intralingual respeaking, and SI with Automatic Speech Recognition (ASR) on the other. A second experiment (Spanish to Italian) evaluated and compared all five methods: the first three again, plus two more machine-centered ones: intralingual respeaking combined with machine translation (MT), and ASR with MT.
Two workshops in interlingual respeaking were offered in the master's degree in Translation and Interpreting at the University of Genova to prepare students for the experiments, aimed at testing different training modules on ILS and their effectiveness on students' learning outcomes. For the final experiments, students were assigned different roles for each tested method and performed the required tasks, producing ILS from the same source text: a video of a full original speech at a live event. The obtained outputs were analyzed using the NTR model (Romero-Fresco & Pöchhacker, 2017) and the delay was calculated for each method.
Preliminary quantitative results deriving from the NTR analyses and the calculation of delay were compared to two other case studies conducted by the University of Vigo and the University of Surrey, showing that more automated and fully automated workflows are, indeed, faster than the others, while they still present several important issues in translation and punctuation. Albeit on a small scale, the research also shows how urgent, and potentially easy, it could be to educate translators and interpreters in respeaking during their training phase, given their keen interest in the subject matter.
It is hoped that the results obtained can better shed light on the repercussions of the use of different methods and induce further reflection on the importance of human interaction with automatic machine systems in providing high-quality accessibility at live events. It is also hoped that the involved students' interest in this field, which was completely unknown to them prior to this research, can inform the urgency of raising students' awareness and competence acquisition in the field of live subtitling through respeaking.
Exploring simplified subtitles to support spoken language understanding
Understanding spoken language is a crucial skill we need throughout our lives. Yet, it can be difficult for various reasons, especially for those who are hard-of-hearing or just learning to speak a language. Captions or subtitles are a common means to make spoken information accessible. Verbatim transcriptions of talks or lectures are often cumbersome to read, as we generally speak faster than we read. Thus, subtitles are often edited to improve their readability, either manually or automatically.
This thesis explores the automatic summarization of sentences and employs the method of sentence compression by deletion with recurrent neural networks. We tackle the task of sentence compression from different directions. On one hand, we look at a technical solution for the problem. On the other hand, we look at the human-centered perspective by investigating the effect of compressed subtitles on comprehension and cognitive load in a user study. Thus, the contribution is twofold: We present a neural network model for sentence compression and the results of a user study evaluating the concept of simplified subtitles.
Regarding the technical aspect, 60 different configurations of the model were tested. The best-scoring models achieved results comparable to state-of-the-art approaches. We use a Sequence to Sequence architecture together with a compression-ratio parameter to control the resulting compression ratio. Thereby, a compression-ratio accuracy of 42.1% was achieved for the best-scoring model configuration, which can serve as a baseline for future experiments in this direction. Results from the 30 participants of the user study show that shortened subtitles can be sufficient to foster comprehension, but result in higher cognitive load. Based on that feedback, we gathered design suggestions to improve future implementations with respect to their usability. Overall, this thesis provides insights on the technological side as well as from the end-user perspective, contributing to easier access to spoken language.
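The compression-by-deletion task and the role of the compression-ratio parameter can be illustrated with a greedy stand-in: delete the least informative tokens until the kept fraction reaches the target. The thesis selects deletions with a recurrent Sequence to Sequence model; this heuristic, including the stopword ranking, is only an assumption-laden sketch of the task.

```python
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "is", "are", "and", "that"}

def compress_by_deletion(sentence, target_ratio=0.7):
    """Keep at most target_ratio of the words, deleting stopwords first.
    A toy ranking standing in for a learned deletion model."""
    words = sentence.split()
    target_len = max(1, int(len(words) * target_ratio))
    # Deletion candidates, least informative first (sort is stable,
    # so ties keep their original order).
    candidates = sorted(range(len(words)),
                        key=lambda i: words[i].lower() in STOPWORDS,
                        reverse=True)
    to_delete = set()
    for i in candidates:
        if len(words) - len(to_delete) <= target_len:
            break
        to_delete.add(i)
    return " ".join(w for i, w in enumerate(words) if i not in to_delete)

short = compress_by_deletion("the cat sat on the mat in the sun", 0.6)
```

The ratio parameter plays the same controlling role as in the thesis's model: lowering it forces shorter subtitles, trading completeness for reading speed.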