20 research outputs found

    Human Transcription Quality Improvement

    Full text link
    High quality transcription data is crucial for training automatic speech recognition (ASR) systems. However, the existing industry-level data collection pipelines are expensive to researchers, while the quality of crowdsourced transcription is low. In this paper, we propose a reliable method to collect speech transcriptions. We introduce two mechanisms to improve transcription quality: confidence estimation based reprocessing at labeling stage, and automatic word error correction at post-labeling stage. We collect and release LibriCrowd - a large-scale crowdsourced dataset of audio transcriptions on 100 hours of English speech. Experiment shows the Transcription WER is reduced by over 50%. We further investigate the impact of transcription error on ASR model performance and found a strong correlation. The transcription quality improvement provides over 10% relative WER reduction for ASR models. We release the dataset and code to benefit the research community.Comment: 5 pages, 3 figures, 5 tables, INTERSPEECH 202

    Optimizing Computer-Assisted Transcription Quality with Iterative User Interfaces

    Get PDF
    Computer-assisted transcription promises high-quality speech transcription at reduced costs. This is achieved by limiting human effort to transcribing parts for which automatic transcription quality is insufficient. Our goal is to improve the human transcription quality via appropriate user interface design. We focus on iterative interfaces that allow humans to solve tasks based on an initially given suggestion, in this case an automatic transcription. We conduct a user study that reveals considerable quality gains for three variations of iterative interfaces over a non-iterative from-scratch transcription interface. Our iterative interfaces included post-editing, confidence-enhanced post-editing, and a novel retyping interface. All three yielded similar quality on average, but we found that the proposed retyping interface was less sensitive to the difficulty of the segment, and superior when the automatic transcription of the segment contained relatively many errors. An analysis using mixed-effects models allows us to quantify these and other factors and draw conclusions over which interface design should be chosen in which circumstance

    Robust speaker diarization for meetings

    Get PDF
    Aquesta tesi doctoral mostra la recerca feta en l'àrea de la diarització de locutor per a sales de reunions. En la present s'estudien els algorismes i la implementació d'un sistema en diferit de segmentació i aglomerat de locutor per a grabacions de reunions a on normalment es té accés a més d'un micròfon per al processat. El bloc més important de recerca s'ha fet durant una estada al International Computer Science Institute (ICSI, Berkeley, Caligornia) per un període de dos anys.La diarització de locutor s'ha estudiat força per al domini de grabacions de ràdio i televisió. La majoria dels sistemes proposats utilitzen algun tipus d'aglomerat jeràrquic de les dades en grups acústics a on de bon principi no se sap el número de locutors òptim ni tampoc la seva identitat. Un mètode molt comunment utilitzat s'anomena "bottom-up clustering" (aglomerat de baix-a-dalt), amb el qual inicialment es defineixen molts grups acústics de dades que es van ajuntant de manera iterativa fins a obtenir el nombre òptim de grups tot i acomplint un criteri de parada. Tots aquests sistemes es basen en l'anàlisi d'un canal d'entrada individual, el qual no permet la seva aplicació directa per a reunions. A més a més, molts d'aquests algorisms necessiten entrenar models o afinar els parameters del sistema usant dades externes, el qual dificulta l'aplicabilitat d'aquests sistemes per a dades diferents de les usades per a l'adaptació.La implementació proposada en aquesta tesi es dirigeix a solventar els problemes mencionats anteriorment. Aquesta pren com a punt de partida el sistema existent al ICSI de diarització de locutor basat en l'aglomerat de "baix-a-dalt". Primer es processen els canals de grabació disponibles per a obtindre un sol canal d'audio de qualitat major, a més dínformació sobre la posició dels locutors existents. Aleshores s'implementa un sistema de detecció de veu/silenci que no requereix de cap entrenament previ, i processa els segments de veu resultant amb una versió millorada del sistema mono-canal de diarització de locutor. Aquest sistema ha estat modificat per a l'ús de l'informació de posició dels locutors (quan es tingui) i s'han adaptat i creat nous algorismes per a que el sistema obtingui tanta informació com sigui possible directament del senyal acustic, fent-lo menys depenent de les dades de desenvolupament. El sistema resultant és flexible i es pot usar en qualsevol tipus de sala de reunions pel que fa al nombre de micròfons o la seva posició. El sistema, a més, no requereix en absolute dades d´entrenament, sent més senzill adaptar-lo a diferents tipus de dades o dominis d'aplicació. Finalment, fa un pas endavant en l'ús de parametres que siguin mes robusts als canvis en les dades acústiques. Dos versions del sistema es van presentar amb resultats excel.lents a les evaluacions de RT05s i RT06s del NIST en transcripció rica per a reunions, a on aquests es van avaluar amb dades de dos subdominis diferents (conferencies i reunions). A més a més, es fan experiments utilitzant totes les dades disponibles de les evaluacions RT per a demostrar la viabilitat dels algorisms proposats en aquesta tasca.This thesis shows research performed into the topic of speaker diarization for meeting rooms. It looks into the algorithms and the implementation of an offline speaker segmentation and clustering system for a meeting recording where usually more than one microphone is available. The main research and system implementation has been done while visiting the International Computes Science Institute (ICSI, Berkeley, California) for a period of two years. Speaker diarization is a well studied topic on the domain of broadcast news recordings. Most of the proposed systems involve some sort of hierarchical clustering of the data into clusters, where the optimum number of speakers of their identities are unknown a priory. A very commonly used method is called bottom-up clustering, where multiple initial clusters are iteratively merged until the optimum number of clusters is reached, according to some stopping criterion. Such systems are based on a single channel input, not allowing a direct application for the meetings domain. Although some efforts have been done to adapt such systems to multichannel data, at the start of this thesis no effective implementation had been proposed. Furthermore, many of these speaker diarization algorithms involve some sort of models training or parameter tuning using external data, which impedes its usability with data different from what they have been adapted to.The implementation proposed in this thesis works towards solving the aforementioned problems. Taking the existing hierarchical bottom-up mono-channel speaker diarization system from ICSI, it first uses a flexible acoustic beamforming to extract speaker location information and obtain a single enhanced signal from all available microphones. It then implements a train-free speech/non-speech detection on such signal and processes the resulting speech segments with an improved version of the mono-channel speaker diarization system. Such system has been modified to use speaker location information (then available) and several algorithms have been adapted or created new to adapt the system behavior to each particular recording by obtaining information directly from the acoustics, making it less dependent on the development data.The resulting system is flexible to any meetings room layout regarding the number of microphones and their placement. It is train-free making it easy to adapt to different sorts of data and domains of application. Finally, it takes a step forward into the use of parameters that are more robust to changes in the acoustic data. Two versions of the system were submitted with excellent results in RT05s and RT06s NIST Rich Transcription evaluations for meetings, where data from two different subdomains (lectures and conferences) was evaluated. Also, experiments using the RT datasets from all meetings evaluations were used to test the different proposed algorithms proving their suitability to the task.Postprint (published version

    Multi-dialect Arabic broadcast speech recognition

    Get PDF
    Dialectal Arabic speech research suffers from the lack of labelled resources and standardised orthography. There are three main challenges in dialectal Arabic speech recognition: (i) finding labelled dialectal Arabic speech data, (ii) training robust dialectal speech recognition models from limited labelled data and (iii) evaluating speech recognition for dialects with no orthographic rules. This thesis is concerned with the following three contributions: Arabic Dialect Identification: We are mainly dealing with Arabic speech without prior knowledge of the spoken dialect. Arabic dialects could be sufficiently diverse to the extent that one can argue that they are different languages rather than dialects of the same language. We have two contributions: First, we use crowdsourcing to annotate a multi-dialectal speech corpus collected from Al Jazeera TV channel. We obtained utterance level dialect labels for 57 hours of high-quality consisting of four major varieties of dialectal Arabic (DA), comprised of Egyptian, Levantine, Gulf or Arabic peninsula, North African or Moroccan from almost 1,000 hours. Second, we build an Arabic dialect identification (ADI) system. We explored two main groups of features, namely acoustic features and linguistic features. For the linguistic features, we look at a wide range of features, addressing words, characters and phonemes. With respect to acoustic features, we look at raw features such as mel-frequency cepstral coefficients combined with shifted delta cepstra (MFCC-SDC), bottleneck features and the i-vector as a latent variable. We studied both generative and discriminative classifiers, in addition to deep learning approaches, namely deep neural network (DNN) and convolutional neural network (CNN). In our work, we propose Arabic as a five class dialect challenge comprising of the previously mentioned four dialects as well as modern standard Arabic. Arabic Speech Recognition: We introduce our effort in building Arabic automatic speech recognition (ASR) and we create an open research community to advance it. This section has two main goals: First, creating a framework for Arabic ASR that is publicly available for research. We address our effort in building two multi-genre broadcast (MGB) challenges. MGB-2 focuses on broadcast news using more than 1,200 hours of speech and 130M words of text collected from the broadcast domain. MGB-3, however, focuses on dialectal multi-genre data with limited non-orthographic speech collected from YouTube, with special attention paid to transfer learning. Second, building a robust Arabic ASR system and reporting a competitive word error rate (WER) to use it as a potential benchmark to advance the state of the art in Arabic ASR. Our overall system is a combination of five acoustic models (AM): unidirectional long short term memory (LSTM), bidirectional LSTM (BLSTM), time delay neural network (TDNN), TDNN layers along with LSTM layers (TDNN-LSTM) and finally TDNN layers followed by BLSTM layers (TDNN-BLSTM). The AM is trained using purely sequence trained neural networks lattice-free maximum mutual information (LFMMI). The generated lattices are rescored using a four-gram language model (LM) and a recurrent neural network with maximum entropy (RNNME) LM. Our official WER is 13%, which has the lowest WER reported on this task. Evaluation: The third part of the thesis addresses our effort in evaluating dialectal speech with no orthographic rules. Our methods learn from multiple transcribers and align the speech hypothesis to overcome the non-orthographic aspects. Our multi-reference WER (MR-WER) approach is similar to the BLEU score used in machine translation (MT). We have also automated this process by learning different spelling variants from Twitter data. We mine automatically from a huge collection of tweets in an unsupervised fashion to build more than 11M n-to-m lexical pairs, and we propose a new evaluation metric: dialectal WER (WERd). Finally, we tried to estimate the word error rate (e-WER) with no reference transcription using decoding and language features. We show that our word error rate estimation is robust for many scenarios with and without the decoding features

    Conversational Arabic Automatic Speech Recognition

    Get PDF
    Colloquial Arabic (CA) is the set of spoken variants of modern Arabic that exist in the form of regional dialects and are considered generally to be mother-tongues in those regions. CA has limited textual resource because it exists only as a spoken language and without a standardised written form. Normally the modern standard Arabic (MSA) writing convention is employed that has limitations in phonetically representing CA. Without phonetic dictionaries the pronunciation of CA words is ambiguous, and can only be obtained through word and/or sentence context. Moreover, CA inherits the MSA complex word structure where words can be created from attaching affixes to a word. In automatic speech recognition (ASR), commonly used approaches to model acoustic, pronunciation and word variability are language independent. However, one can observe significant differences in performance between English and CA, with the latter yielding up to three times higher error rates. This thesis investigates the main issues for the under-performance of CA ASR systems. The work focuses on two directions: first, the impact of limited lexical coverage, and insufficient training data for written CA on language modelling is investigated; second, obtaining better models for the acoustics and pronunciations by learning to transfer between written and spoken forms. Several original contributions result from each direction. Using data-driven classes from decomposed text are shown to reduce out-of-vocabulary rate. A novel colloquialisation system to import additional data is introduced; automatic diacritisation to restore the missing short vowels was found to yield good performance; and a new acoustic set for describing CA was defined. Using the proposed methods improved the ASR performance in terms of word error rate in a CA conversational telephone speech ASR task

    Keskusteluavustimen kehittäminen kuulovammaisia varten automaattista puheentunnistusta käyttäen

    Get PDF
    Understanding and participating in conversations has been reported as one of the biggest challenges hearing impaired people face in their daily lives. These communication problems have been shown to have wide-ranging negative consequences, affecting their quality of life and the opportunities available to them in education and employment. A conversational assistance application was investigated to alleviate these problems. The application uses automatic speech recognition technology to provide real-time speech-to-text transcriptions to the user, with the goal of helping deaf and hard of hearing persons in conversational situations. To validate the method and investigate its usefulness, a prototype application was developed for testing purposes using open-source software. A user test was designed and performed with test participants representing the target user group. The results indicate that the Conversation Assistant method is valid, meaning it can help the hearing impaired to follow and participate in conversational situations. Speech recognition accuracy, especially in noisy environments, was identified as the primary target for further development for increased usefulness of the application. Conversely, recognition speed was deemed to be sufficient and already surpass the transcription speed of human transcribers.Keskustelupuheen ymmärtäminen ja keskusteluihin osallistuminen on raportoitu yhdeksi suurimmista haasteista, joita kuulovammaiset kohtaavat jokapäiväisessä elämässään. Näillä viestintäongelmilla on osoitettu olevan laaja-alaisia negatiivisia vaikutuksia, jotka heijastuvat elämänlaatuun ja heikentävät kuulovammaisten yhdenvertaisia osallistumismahdollisuuksia opiskeluun ja työelämään. Työssä kehitettiin ja arvioitiin apusovellusta keskustelupuheen ymmärtämisen ja keskusteluihin osallistumisen helpottamiseksi. Sovellus käyttää automaattista puheentunnistusta reaaliaikaiseen puheen tekstittämiseen kuuroja ja huonokuuloisia varten. Menetelmän toimivuuden vahvistamiseksi ja sen hyödyllisyyden tutkimiseksi siitä kehitettiin prototyyppisovellus käyttäjätestausta varten avointa lähdekoodia hyödyntäen. Testaamista varten suunniteltiin ja toteutettiin käyttäjäkoe sovelluksen kohderyhmää edustavilla koekäyttäjillä. Saadut tulokset viittaavat siihen, että työssä esitetty Keskusteluavustin on toimiva ja hyödyllinen apuväline huonokuuloisille ja kuuroille. Puheentunnistustarkkuus erityisesti meluisissa olosuhteissa osoittautui ensisijaiseksi kehityskohteeksi apusovelluksen hyödyllisyyden lisäämiseksi. Puheentunnistuksen nopeus arvioitiin puolestaan jo riittävän nopeaksi, ylittäen selkeästi kirjoitustulkkien kirjoitusnopeuden

    Meeting decision detection: multimodal information fusion for multi-party dialogue understanding

    Get PDF
    Modern advances in multimedia and storage technologies have led to huge archives of human conversations in widely ranging areas. These archives offer a wealth of information in the organization contexts. However, retrieving and managing information in these archives is a time-consuming and labor-intensive task. Previous research applied keyword and computer vision-based methods to do this. However, spontaneous conversations, complex in the use of multimodal cues and intricate in the interactions between multiple speakers, have posed new challenges to these methods. We need new techniques that can leverage the information hidden in multiple communication modalities – including not just “what” the speakers say but also “how” they express themselves and interact with others. In responding to this need, the thesis inquires into the multimodal nature of meeting dialogues and computational means to retrieve and manage the recorded meeting information. In particular, this thesis develops the Meeting Decision Detector (MDD) to detect and track decisions, one of the most important outcomes of the meetings. The MDD involves not only the generation of extractive summaries pertaining to the decisions (“decision detection”), but also the organization of a continuous stream of meeting speech into locally coherent segments (“discourse segmentation”). This inquiry starts with a corpus analysis which constitutes a comprehensive empirical study of the decision-indicative and segment-signalling cues in the meeting corpora. These cues are uncovered from a variety of communication modalities, including the words spoken, gesture and head movements, pitch and energy level, rate of speech, pauses, and use of subjective terms. While some of the cues match the previous findings of speech segmentation, some others have not been studied before. The analysis also provides empirical grounding for computing features and integrating them into a computational model. To handle the high-dimensional multimodal feature space in the meeting domain, this thesis compares empirically feature discriminability and feature pattern finding criteria. As the different knowledge sources are expected to capture different types of features, the thesis also experiments with methods that can harness synergy between the multiple knowledge sources. The problem formalization and the modeling algorithm so far correspond to an optimal setting: an off-line, post-meeting analysis scenario. However, ultimately the MDD is expected to be operated online – right after a meeting, or when a meeting is still in progress. Thus this thesis also explores techniques that help relax the optimal setting, especially those using only features that can be generated with a higher degree of automation. Empirically motivated experiments are designed to handle the corresponding performance degradation. Finally, with the users in mind, this thesis evaluates the use of query-focused summaries in a decision debriefing task, which is common in the organization context. The decision-focused extracts (which represent compressions of 1%) is compared against the general-purpose extractive summaries (which represent compressions of 10-40%). To examine the effect of model automation on the debriefing task, this evaluation experiments with three versions of decision-focused extracts, each relaxing one manual annotation constraint. Task performance is measured in actual task effectiveness, usergenerated report quality, and user-perceived success. The users’ clicking behaviors are also recorded and analyzed to understand how the users leverage the different versions of extractive summaries to produce abstractive summaries. The analysis framework and computational means developed in this work is expected to be useful for the creation of other dialogue understanding applications, especially those that require to uncover the implicit semantics of meeting dialogues
    corecore