    Lightly supervised alignment of subtitles on multi-genre broadcasts

    This paper describes a system for aligning subtitles to audio on multi-genre broadcasts using a lightly supervised approach. Accurate alignment of subtitles plays a substantial role in the daily work of media companies and still requires considerable human effort. Here, a comprehensive approach to performing this task automatically using lightly supervised alignment is proposed. The paper explores alternatives for speech segmentation, lightly supervised speech recognition, and alignment of text streams. The proposed system uses lightly supervised decoding to improve alignment accuracy by adapting the language model to the target subtitles. The resulting system achieves the third best reported result for alignment of broadcast subtitles in the Multi-Genre Broadcast (MGB) challenge, with an F1 score of 88.8%. The system is available for research and other non-commercial purposes through webASR, the University of Sheffield's cloud-based speech technology web service. Taking an audio file and untimed subtitles as input, webASR can produce timed subtitles in multiple formats, including TTML, WebVTT and SRT.
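    The abstract does not include code, but the final alignment step it describes can be illustrated: once audio has been decoded (ideally with a language model biased towards the subtitle text, as the paper proposes), timestamps from the recogniser's output words can be transferred onto matching subtitle words by sequence alignment. The sketch below is a minimal illustration using Python's difflib; the data structures and function name are assumptions for this example, not the webASR implementation, and a real system would additionally interpolate times for unmatched words.

```python
# Minimal sketch of timestamp transfer for lightly supervised alignment.
# All names here are illustrative; this is not the webASR implementation.
from difflib import SequenceMatcher

def align_subtitles(asr_words, subtitle_words):
    """asr_words: list of (word, start_sec, end_sec) from the recogniser.
    subtitle_words: list of words from the untimed subtitles.
    Returns (subtitle_word, start_sec, end_sec) for each matched word."""
    hyp = [w.lower() for w, _, _ in asr_words]
    ref = [w.lower() for w in subtitle_words]
    timed = []
    # Matching blocks pair runs of identical words in hypothesis and subtitles.
    for block in SequenceMatcher(a=hyp, b=ref, autojunk=False).get_matching_blocks():
        for k in range(block.size):
            _, start, end = asr_words[block.a + k]
            timed.append((subtitle_words[block.b + k], start, end))
    return timed

asr = [("hello", 0.4, 0.7), ("and", 0.7, 0.8), ("welcome", 0.8, 1.3), ("back", 1.3, 1.6)]
subs = ["Hello", "and", "welcome", "back"]
print(align_subtitles(asr, subs))
```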

    A CLARIN transcription portal for interview data

    In this paper we present a first version of a transcription portal for audio files based on automatic speech recognition (ASR) in various languages. The portal is implemented in the CLARIN resources research network and intended for use by non-technical scholars. We explain the background and interdisciplinary nature of interview data, the perks and quirks of using ASR for transcribing the audio in a research context, the dos and don'ts for optimal use of the portal, and future developments foreseen. The portal is promoted in a range of workshops, but there are a number of challenges that have to be met. These challenges concern privacy issues, ASR quality, and cost, amongst others.

    The Ambient Spotlight: Queryless Desktop Search from Meeting Speech

    It has recently become possible to record any small meeting using a laptop equipped with a plug-and-play USB microphone array. We show the potential for such recordings in a personal aid that allows project managers to record their meetings and, when reviewing them afterwards through a standard calendar interface, to find relevant documents on their computer. This interface is intended to supplement or replace the textual searches that managers typically perform. The prototype, which relies on meeting speech recognition and topic segmentation, formulates and runs desktop search queries in order to present its results.
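    The abstract describes the pipeline only at a high level; the query-formulation step can be sketched as extracting salient terms from one topic segment of the meeting transcript and handing them to a desktop search engine. The sketch below is a minimal illustration under that assumption; the stopword list, scoring, and function name are invented for this example and are not the prototype's actual method.

```python
# Sketch of queryless search: derive a desktop search query from one
# topic segment of a meeting transcript. Deliberately simple scoring.
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "we",
             "is", "are", "that", "this", "it", "for", "on", "be", "before"}

def formulate_query(segment_text, max_terms=5):
    """Pick the most frequent non-stopword terms as the search query."""
    tokens = [t.strip(".,?!").lower() for t in segment_text.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS and len(t) > 2)
    return " ".join(term for term, _ in counts.most_common(max_terms))

segment = ("We need to finalise the budget spreadsheet for the pilot study "
           "and update the budget figures before the review meeting.")
print(formulate_query(segment))  # e.g. "budget finalise spreadsheet pilot study"
```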

    Exploring the possibilities of Thomson’s fourth paradigm transformation—The case for a multimodal approach to digital oral history?

    This article seeks to reorientate 'digital oral history' towards a new research paradigm, Multimodal Digital Oral History (MDOH), building upon Alistair Thomson's (Thomson, A., 2007. Four paradigm transformations in oral history. Oral History Review, 34(1): 49–70) characterization of a 'dizzying digital revolution' and paradigmatic transformation in oral history (OH). Calling for a recalibration of the current dominance of the textual transcript, and for active engagement with the oral, aural, and sonic affordances of both retro-digitized and born-digital OH (DOH) collections, we propose a re-orientation of the digital from passive to generative and self-reflexive in the human–machine study of spoken word recordings. First, we take stock of the field of DOH as currently conceived and the ways in which it has or has not answered calls for a return to the orality of the interview by digital means. Second, we address the predominant trend of working with transcriptions in the digital analysis of spoken word recordings and the tools being used by oral historians. Third, we ask about emerging possibilities (tools and experimental methodologies) for sonic analysis of spoken word collections within and beyond OH, looking to intersections with digital humanities, sociolinguistics, and sound studies. Lastly, we consider ethical questions and practicalities concomitant with data-driven methods, analyses, and technologies such as AI for the study of sonic research artefacts, reflections that dovetail with digital hermeneutics and digital tool criticism and point towards a new MDOH departure, a sub-field with the potential to inform the many fields that seek patterns in audio, audio-visual, and post-textual materials, serially and at scale.

    Deriving and Exploiting Situational Information in Speech: Investigations in a Simulated Search and Rescue Scenario

    The need for automatic recognition and understanding of speech is emerging in tasks involving the processing of large volumes of natural conversations. In application domains such as Search and Rescue, exploiting automated systems for extracting mission-critical information from speech communications has the potential to make a real difference. Spoken language understanding has commonly been approached by identifying units of meaning (such as sentences, named entities, and dialogue acts) as a basis for further discourse analysis. However, this fine-grained identification of fundamental units of meaning is sensitive to the high error rates in automatic transcription of noisy speech. This thesis demonstrates that topic segmentation and identification techniques can be employed for information extraction from spoken conversations while remaining robust to such errors. Two novel topic-based approaches are presented for extracting situational information within the search and rescue context. The first approach shows that identifying changes in the context and content of first responders' reports over time can provide an estimate of their location. The second approach presents a speech-based topological map estimation technique that is inspired, in part, by automatic mapping algorithms commonly used in robotics. The proposed approaches are evaluated on a goal-oriented conversational speech corpus, designed and collected around an abstract communication model between a first responder and a task leader during a search process. Results confirm that a highly imperfect transcription of noisy speech has limited impact on information extraction performance compared with that obtained on transcriptions of clean speech. The thesis also shows that speech recognition accuracy can benefit from rescoring initial transcription hypotheses using the derived high-level location information. A new two-pass speech decoding architecture is presented, in which the location estimate from a first decoding pass is used to dynamically adapt a general language model, which is then used to rescore the initial recognition hypotheses. This decoding strategy yields a statistically significant gain in recognition accuracy on spoken conversations in high background noise. It is concluded that the techniques developed in this thesis can be extended to other application domains that deal with large volumes of natural spoken conversations.
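    The two-pass architecture can be sketched as follows: a first decoding pass yields N-best hypotheses and a location estimate, and a second pass rescores each hypothesis with a language model score interpolated towards words associated with that location. The toy unigram interpolation below only shows the control flow; all probabilities, names, and the N-best format are invented for this example and do not reproduce the thesis's actual models.

```python
# Toy sketch of the second decoding pass: rescore first-pass N-best
# hypotheses with a location-adapted language model. Real systems would
# use full n-gram or neural LM scores; all probabilities here are invented.
import math

def rescore(nbest, general_lm, location_lm, lam=0.7):
    """nbest: list of (hypothesis_words, acoustic_logprob).
    Interpolate general and location-specific unigram LMs, then pick the
    hypothesis with the best combined acoustic + LM score."""
    def lm_logprob(words):
        total = 0.0
        for w in words:
            # Linear interpolation, with a small floor for unseen words.
            p = lam * general_lm.get(w, 1e-6) + (1 - lam) * location_lm.get(w, 1e-6)
            total += math.log(p)
        return total
    return max(nbest, key=lambda h: h[1] + lm_logprob(h[0]))

general_lm = {"room": 0.01, "smoke": 0.005, "stairwell": 0.001}
stairwell_lm = {"stairwell": 0.05, "smoke": 0.02}  # boosted by location estimate
nbest = [(["smoke", "in", "the", "stairwell"], -12.0),
         (["smoke", "in", "the", "stair", "well"], -11.5)]
print(rescore(nbest, general_lm, stairwell_lm))
```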

    Deliverable D2.7 Final Linked Media Layer and Evaluation

    This deliverable presents the evaluation of content annotation and content enrichment systems that are part of the final tool set developed within the LinkedTV consortium. The evaluations were performed on both the Linked News and Linked Culture trial content, as well as on other content annotated for this purpose. The evaluation spans three languages: German (Linked News), Dutch (Linked Culture) and English. Selected algorithms and tools were also subject to benchmarking in two international contests: MediaEval 2014 and TAC'14. Additionally, the Microposts 2015 NEEL Challenge is being organized with the support of LinkedTV.

    webASR 2 - Improved cloud based speech technology

    This paper presents the most recent developments of the webASR service (www.webasr.org), the world's first web-based fully functioning automatic speech recognition platform for scientific use. Initially released in 2008, the functionalities of webASR have recently been expanded with three main goals in mind: to facilitate access through a RESTful architecture that allows easy use through either the web interface or an API; to allow users to supply input metadata, where available, to improve system performance; and to increase the coverage of available systems beyond speech recognition. Several new systems for transcription, diarisation, lightly supervised alignment, and translation are currently available through webASR. Results on a series of well-known benchmarks (the RT'09, IWSLT'12 and MGB'15 evaluations) show that these webASR systems provide state-of-the-art performance across these tasks.
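    The abstract notes a RESTful architecture but gives no endpoint details, so the sketch below only illustrates the general shape of such a client: upload an audio file (optionally with untimed subtitle text as metadata for the alignment system) and poll for the result. Every URL path, parameter, and response field here is hypothetical; consult www.webasr.org for the actual API.

```python
# Hypothetical client for a RESTful speech service such as webASR.
# Paths, parameters and response fields are illustrative only.
import time
import requests

BASE = "https://www.webasr.org/api"  # hypothetical base URL

def submit_and_wait(audio_path, subtitles_path=None, system="align"):
    """Submit a job, poll until it finishes, and fetch the SRT result."""
    files = {"audio": open(audio_path, "rb")}
    if subtitles_path:
        # Untimed subtitles supplied as metadata for lightly supervised alignment.
        files["subtitles"] = open(subtitles_path, "rb")
    job = requests.post(f"{BASE}/jobs", files=files, data={"system": system}).json()
    while True:
        status = requests.get(f"{BASE}/jobs/{job['id']}").json()
        if status["state"] == "done":
            result = requests.get(f"{BASE}/jobs/{job['id']}/result",
                                  params={"format": "srt"})
            return result.text
        time.sleep(10)

# print(submit_and_wait("episode.wav", "episode.txt"))
```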