Automated speech and audio analysis for semantic access to multimedia
The deployment and integration of audio processing tools can enhance the semantic annotation of multimedia content and, as a consequence, improve the effectiveness of conceptual access tools. This paper gives an overview of the ways in which automatic speech and audio analysis can contribute to increased granularity of automatically extracted metadata. A number of techniques are presented, including the alignment of speech and text resources, large-vocabulary speech recognition, keyword spotting and speaker classification. The applicability of the techniques is discussed from a media-crossing perspective. Their added value and potential contribution to the content value chain are illustrated by the description of two complementary demonstrators for browsing broadcast news archives.
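As a rough illustration of the keyword-spotting idea mentioned above, the sketch below scans time-aligned ASR output for query terms to produce jump-to points for a browsing interface. The Word layout and example data are assumptions for illustration, not the paper's actual system.

```python
# Minimal keyword-spotting sketch over time-aligned ASR output.
# Hypothetical data layout: each transcript word carries start/end times.

from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float

def spot_keywords(transcript: list[Word], keywords: set[str]) -> list[Word]:
    """Return every transcript word that matches a query keyword."""
    wanted = {k.lower() for k in keywords}
    return [w for w in transcript if w.text.lower() in wanted]

# Usage: jump-to points for an archive browser.
transcript = [Word("the", 0.0, 0.2), Word("election", 0.2, 0.9), Word("results", 0.9, 1.5)]
for hit in spot_keywords(transcript, {"election"}):
    print(f"{hit.text} @ {hit.start:.1f}s")
```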
Reducing Costs in Human Assisted Speech Transcription
The only official documentation of the lawmaking process at the California Legislature consists of unedited video recordings of committee hearings, together with bill texts, votes and analyses. While the bills resulting from these hearings are clear, using the video recordings to understand how a bill was created is far too laborious for the average citizen. To increase public transparency, a service that provides easier access to the bill-creation process was needed. In response to this need, the Digital Democracy initiative was established at Cal Poly by the Honorable Sam Blakeslee, former California State Senator and founder of the Institute for Advanced Technology and Public Policy.
The Digital Democracy initiative seeks to create a web platform that organizes, generates, and indexes large amounts of information about the legislative process. To accomplish this, automatic speech recognition is performed on the video recordings of committee hearings, and the resulting text is manually improved and annotated with a web application called the Transcription Tool. Unfortunately, this process is costly and labor intensive, and it prohibits the scaling and long-term viability of the platform. Early efforts to reduce transcription costs involved the development of an improved Transcription Tool UI and systems for speaker diarization and text correction.
This thesis evaluates the effectiveness of these improvements on the human-assisted transcription process employed by the Digital Democracy initiative. To facilitate this evaluation, a pipeline for automatic transcription improvement was developed, the improvements were incorporated into the transcription process, and a controlled experiment was run to measure their effects. The results of the experiment demonstrate that the improvements reduced transcription editing costs by 16.89% while maintaining similar transcription quality.
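The 16.89% figure above is a measured reduction in editing cost. As a hedged sketch of how such a cost can be proxied, the snippet below counts word-level edit operations between an ASR draft and its human-corrected transcript; the thesis's actual cost metric may well differ.

```python
# Word-level edit distance between an ASR draft and the corrected final
# transcript, a common proxy for human editing effort. A sketch only.

def edit_ops(draft: str, final: str) -> int:
    """Minimum number of word substitutions, insertions and deletions."""
    a, b = draft.split(), final.split()
    dp = list(range(len(b) + 1))  # one rolling row of the DP table
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (wa != wb),  # substitution (free if words match)
            )
    return dp[-1]

print(edit_ops("the bill was pass", "the bill was passed"))  # 1
```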
Access to recorded interviews: A research agenda
Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state of the art for key component technologies. A large number of important research issues are identified, and from that set of issues, a coherent research agenda is proposed.
Unsupervised naming of speakers in broadcast TV: using written names, pronounced names or both?
Person identification in TV broadcast video is a valuable tool for indexing. However, using biometric models is not a very sustainable option without a priori knowledge of the people present in the videos. Names pronounced in the audio (PN) or written on the screen (WN) can provide hypothesis names for the speakers. We propose an experimental comparison of the potential of these two modalities for extracting the true names of the speakers. Pronounced names offer many instances of citation, but transcription and named-entity detection errors halve the potential of this modality. By contrast, written-name detection benefits from improvements in video quality and is nowadays rather robust and efficient for naming speakers. Oracle experiments on the mapping between written names and speakers also show the complementarity of the PN and WN modalities.
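A minimal sketch of the written-names strategy, assuming diarized speaker turns and timestamped on-screen name detections are available: each speaker cluster receives the name whose on-screen display overlaps its speech the most. This majority-overlap vote is an illustration, not the authors' mapping algorithm.

```python
# Naming speaker clusters from on-screen written names: assign each
# cluster the name most often displayed while that cluster speaks.

from collections import Counter

def name_speakers(turns, overlays):
    """
    turns:    list of (speaker_id, start, end) diarization segments
    overlays: list of (name, start, end) written-name detections
    """
    votes: dict[str, Counter] = {}
    for spk, s0, s1 in turns:
        for name, n0, n1 in overlays:
            overlap = min(s1, n1) - max(s0, n0)
            if overlap > 0:
                votes.setdefault(spk, Counter())[name] += overlap
    return {spk: c.most_common(1)[0][0] for spk, c in votes.items()}

turns = [("spk_0", 0.0, 12.0), ("spk_1", 12.0, 20.0)]
overlays = [("Jane Doe", 1.0, 5.0), ("John Smith", 13.0, 18.0)]
print(name_speakers(turns, overlays))  # {'spk_0': 'Jane Doe', 'spk_1': 'John Smith'}
```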
Lessons Learned in ATCO2: 5000 hours of Air Traffic Control Communications for Robust Automatic Speech Recognition and Understanding
Voice communication between air traffic controllers (ATCos) and pilots is critical for ensuring safe and efficient air traffic control (ATC). The task requires high levels of awareness from ATCos and can be tedious and error-prone. Recent attempts have been made to integrate artificial intelligence (AI) into ATC in order to reduce the workload of ATCos. However, the development of data-driven AI systems for ATC demands large-scale annotated datasets, which are currently lacking in the field. This paper explores the lessons learned from the ATCO2 project, which aimed to develop a unique platform to collect and preprocess large amounts of ATC data from airspace in real time. Audio and surveillance data were collected from publicly accessible radio frequency channels with VHF receivers owned by a community of volunteers and later uploaded to OpenSky Network servers, which can be considered an "unlimited source" of data. In addition, this paper reviews previous work from ATCO2 partners, including (i) robust automatic speech recognition, (ii) natural language processing, (iii) English language identification of ATC communications, and (iv) the integration of surveillance data such as ADS-B. We believe that the pipeline developed during the ATCO2 project, along with the open-sourcing of its data, will encourage research in the ATC field. A sample of the ATCO2 corpus is available at https://www.atco2.org/data, while the full corpus can be purchased through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. We demonstrated that ATCO2 is an appropriate dataset for developing ASR engines when little or almost no in-domain ATC data is available: with a CNN-TDNNf Kaldi model, we reached word error rates as low as 17.9% and 24.9% on public ATC datasets, 6.6% and 7.6% better than an "out-of-domain" but supervised CNN-TDNNf model.
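For reference, the WER figures quoted above follow the standard definition, (substitutions + deletions + insertions) over the number of reference words. A minimal computation with the open-source jiwer package might look like the sketch below; the ATC-style strings are made up.

```python
# Word error rate as reported above: (S + D + I) / N over reference words.
# Uses the open-source `jiwer` package; the example strings are invented.

import jiwer

reference = "alfa charlie one two three descend flight level one two zero"
hypothesis = "alfa charlie one two tree descend level one two zero"

# One substitution ("tree") plus one deletion ("flight") over 11 reference
# words gives roughly 18.2%.
print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")
```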
QCompere @ REPERE 2013
We describe the QCompere consortium submissions to the REPERE 2013 evaluation campaign. The REPERE challenge aims at gathering four communities (face recognition, speaker identification, optical character recognition and named-entity detection) around the same goal: multimodal person recognition in TV broadcast. First, four mono-modal components are introduced (one for each of the foregoing communities), constituting the elementary building blocks of our various submissions. Then, depending on the target modality (speaker or face recognition) and on the task (supervised or unsupervised recognition), four different fusion techniques are introduced; they can be summarized as propagation-, classifier-, rule- or graph-based approaches. Finally, their performance is evaluated on the REPERE 2013 test set, and their advantages and limitations are discussed.
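As a toy illustration of the rule-based flavor of fusion, the sketch below prefers a confident speaker-identification hypothesis and otherwise propagates a name read from the screen by OCR. The threshold and score names are invented, not the consortium's actual rules.

```python
# Rule-based late fusion: trust the biometric speaker-ID hypothesis when
# it is confident, otherwise fall back to the on-screen (OCR) name.

def fuse(speaker_id_hyp, ocr_name_hyp, sid_score, tau=0.7):
    if speaker_id_hyp is not None and sid_score >= tau:
        return speaker_id_hyp  # confident biometric decision
    if ocr_name_hyp is not None:
        return ocr_name_hyp    # propagate the written name instead
    return "unknown"

print(fuse("Jane_DOE", "John_SMITH", 0.45))  # low SID score -> John_SMITH
```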
ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications
Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots via very high frequency radio channels. In order to incorporate these novel technologies into ATC (a low-resource domain), large-scale annotated datasets are required to develop data-driven AI systems such as automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research in the challenging ATC field, which has lagged behind due to the lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotation of speech data, and 3) extraction of ATC-related named entities. The corpus is split into three subsets. 1) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker-turn information, a signal-to-noise ratio estimate and an English language detection score per sample. Both are available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset of the original test set, which we offer for free at https://www.atco2.org/data. We expect the ATCO2 corpus to foster research on robust ASR and NLU, not only in the field of ATC communications but also in the general research community.
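Given the per-sample metadata listed above (automatic transcript, SNR estimate, English language detection score), one plausible use of the ATCO2-PL-set is to select clean English pseudo-labeled pairs for ASR training. The field names and thresholds in this sketch are assumptions for illustration, not the corpus's actual schema.

```python
# Selecting usable pseudo-labeled training pairs from a large unlabeled
# pool using per-sample metadata. Field names/thresholds are hypothetical.

def select_training_samples(samples, min_snr_db=5.0, min_english=0.5):
    kept = []
    for s in samples:
        if s["snr_db"] >= min_snr_db and s["english_score"] >= min_english:
            kept.append((s["audio_path"], s["auto_transcript"]))
    return kept

pool = [
    {"audio_path": "a.wav", "auto_transcript": "cleared to land runway two seven",
     "snr_db": 12.3, "english_score": 0.94},
    {"audio_path": "b.wav", "auto_transcript": "noisy non english speech",
     "snr_db": 1.8, "english_score": 0.31},
]
print(select_training_samples(pool))  # keeps only the clean English sample
```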
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings
This paper presents a streaming speaker-attributed automatic speech
recognition (SA-ASR) model that can recognize "who spoke what" with low latency
even when multiple people are speaking simultaneously. Our model is based on
token-level serialized output training (t-SOT) which was recently proposed to
transcribe multi-talker speech in a streaming fashion. To further recognize
speaker identities, we propose an encoder-decoder based speaker embedding
extractor that can estimate a speaker representation for each recognized token
not only from non-overlapping speech but also from overlapping speech. The
proposed speaker embedding, named t-vector, is extracted synchronously with the
t-SOT ASR model, enabling joint execution of speaker identification (SID) or
speaker diarization (SD) with the multi-talker transcription with low latency.
We evaluate the proposed model for a joint task of ASR and SID/SD by using
LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially
better accuracy than a prior streaming model and shows comparable or sometimes
even superior results to the state-of-the-art offline SA-ASR model.Comment: Submitted to Interspeech 202
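A minimal sketch of the speaker-identification step the abstract implies: each recognized token's embedding (the paper's t-vector) is matched to enrolled speaker profiles by cosine similarity. The function and data here are illustrative assumptions, not the authors' implementation.

```python
# Assigning a speaker to each recognized token by cosine similarity
# between its token-level embedding and enrolled speaker profiles.

import numpy as np

def attribute_tokens(t_vectors: np.ndarray, profiles: dict[str, np.ndarray]):
    """t_vectors: (num_tokens, dim); profiles: speaker name -> (dim,)."""
    names = list(profiles)
    P = np.stack([profiles[n] for n in names])  # (num_speakers, dim)
    P /= np.linalg.norm(P, axis=1, keepdims=True)
    T = t_vectors / np.linalg.norm(t_vectors, axis=1, keepdims=True)
    sims = T @ P.T                              # cosine similarity matrix
    return [names[i] for i in sims.argmax(axis=1)]

# Toy check: tokens built from the profiles map back to their speakers.
rng = np.random.default_rng(0)
profiles = {"alice": rng.normal(size=64), "bob": rng.normal(size=64)}
tokens = np.stack([profiles["alice"], profiles["bob"], profiles["alice"]])
print(attribute_tokens(tokens, profiles))  # ['alice', 'bob', 'alice']
```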