Interpretation of Multiparty Meetings: The AMI and AMIDA Projects
The AMI and AMIDA projects are collaborative EU projects concerned with the automatic recognition and interpretation of multiparty meetings. This paper provides an overview of the advances we have made in these projects, with a particular focus on the multimodal recording infrastructure, the publicly available AMI corpus of annotated meeting recordings, and the speech recognition framework that we have developed for this domain.
VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition
This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on two independently developed recent technologies: array-geometry-agnostic continuous speech separation, or VarArray, and streaming multi-talker ASR based on token-level serialized output training (t-SOT). To combine the best of both technologies, we design a t-SOT-based ASR model that generates a serialized multi-talker transcription from the two separated speech signals produced by VarArray. We also propose a pre-training scheme for this ASR model in which we simulate VarArray's output signals from monaural single-talker ASR training data. Conversation transcription experiments on the AMI meeting corpus show that a system based on the proposed framework significantly outperforms conventional ones. Our system achieves state-of-the-art word error rates of 13.7% and 15.5% on the AMI development and evaluation sets, respectively, in the multiple-distant-microphone setting while retaining streaming inference capability.
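The token-level serialization idea behind t-SOT can be illustrated with a minimal sketch: words from overlapping speakers are merged in chronological order into a single token stream, and a channel-change token marks every speaker switch. The `<cc>` token name and the tuple layout below are illustrative assumptions, not the paper's exact implementation, which also bounds the number of concurrent speakers.

```python
# Minimal sketch of token-level serialized output, in the spirit of t-SOT.
# Input: timed words from possibly overlapping speakers.
# Output: one flat token stream with "<cc>" marking speaker changes.

CC = "<cc>"  # channel-change token (name is an assumption)

def serialize(utterances):
    """utterances: list of (start_time, speaker_id, word) tuples."""
    words = sorted(utterances, key=lambda u: u[0])  # chronological order
    tokens, prev_spk = [], None
    for _, spk, word in words:
        if prev_spk is not None and spk != prev_spk:
            tokens.append(CC)  # speaker switched: emit channel-change token
        tokens.append(word)
        prev_spk = spk
    return tokens

# Two speakers whose words interleave in time:
stream = serialize([
    (0.0, "A", "hello"),
    (0.4, "B", "hi"),
    (0.8, "A", "there"),
    (1.2, "B", "everyone"),
])
# stream == ["hello", "<cc>", "hi", "<cc>", "there", "<cc>", "everyone"]
```

A single streaming decoder can then be trained on such serialized targets, which is what lets t-SOT handle overlap without one output head per speaker.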
Mapping the Klangdom Live: Cartographies for piano with two performers and electronics
The use of high-density loudspeaker arrays (HDLAs) has recently experienced rapid growth in a wide variety of technical and aesthetic approaches. Still less explored, however, are applications to interactive music with live acoustic instruments. How can immersive spatialization accompany an instrument that already has its own rich spatial diffusion pattern, like the grand piano, in the context of a score-based concert work? Potential models include treating the spatialized electronic sound in analogy to the diffusion pattern of the instrument, with spatial dimensions parametrized as functions of timbral features. Another approach is to map the concert hall as a three-dimensional projection of the instrument’s internal physical layout, a kind of virtual sonic microscope. Or, the diffusion of electronic spatial sound can be treated as an independent polyphonic element, complementary to but not dependent upon the instrument’s own spatial characteristics. Cartographies (2014), for piano with two performers and electronics, explores each of these models individually and in combination, as well as their technical implementation with the Meyer Sound Matrix3 system of the Südwestrundfunk Experimentalstudio in Freiburg, Germany, and the 43.4-channel Klangdom of the Institut für Musik und Akustik at the Zentrum für Kunst und Medien in Karlsruhe, Germany. The process of composing, producing, and performing the work raises intriguing questions, and offers invaluable hints, for the composition and performance of live interactive works with HDLAs in the future.
Learning to Rank Microphones for Distant Speech Recognition
Fully exploiting ad-hoc microphone networks for distant speech recognition is still an open issue. Empirical evidence shows that selecting the best microphone leads to significant improvements in recognition without any additional front-end processing effort. Current channel selection techniques rely on signal-, decoder-, or posterior-based features. Signal-based features are inexpensive to compute but do not always correlate with recognition performance; decoder- and posterior-based features exhibit better correlation but require substantial computational resources. In this work, we tackle the channel selection problem by proposing MicRank, a learning-to-rank framework in which a neural network is trained to rank the available channels directly from recognition performance on the training set. The proposed approach is agnostic to the array geometry and the type of recognition back-end. We investigate different learning-to-rank strategies using a purpose-built synthetic dataset and the CHiME-6 data. Results show that the proposed approach considerably improves over previous selection techniques, reaching comparable, and in some instances better, performance than oracle signal-based measures.
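One way such a learning-to-rank objective can be realized is a ListNet-style listwise loss: the network's channel scores are pushed toward a target distribution derived from per-channel recognition error, so that lower WER means higher target probability. This is a hedged sketch of the general idea, not MicRank's exact loss or architecture; the temperature parameter and the toy numbers are assumptions.

```python
import numpy as np

def listnet_loss(scores, wers, temp=1.0):
    """ListNet-style listwise loss for channel ranking.

    scores: model's raw score per channel (higher = preferred).
    wers:   word error rate per channel on training data.
    The target distribution is a softmax over -WER/temp, so
    lower-WER channels get higher target probability.
    """
    target = np.exp(-np.asarray(wers, dtype=float) / temp)
    target /= target.sum()
    logp = scores - np.log(np.exp(scores).sum())  # log-softmax of scores
    return -(target * logp).sum()                 # cross-entropy

# Three channels; channel 0 has the lowest WER.
wers = [0.10, 0.30, 0.50]
good = np.array([2.0, 0.0, -1.0])   # scores agree with the WER ranking
bad  = np.array([-1.0, 0.0, 2.0])   # scores invert the WER ranking
# listnet_loss(good, wers) < listnet_loss(bad, wers)
```

Minimizing this loss over many training utterances teaches the network to reproduce the WER-induced ordering from features alone, with no decoder needed at selection time.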
3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement
Disentangling uncorrelated information in speech utterances is a crucial research topic within the speech community. Different speech-related tasks focus on extracting distinct speech representations while minimizing the effects of other, uncorrelated information. We present a large-scale speech corpus to facilitate research on speech representation disentanglement. 3D-Speaker contains over 10,000 speakers, each of whom is simultaneously recorded by multiple Devices located at different Distances, and some speakers speak multiple Dialects. The controlled combinations of multi-dimensional audio data yield a matrix of diversely entangled speech representations, motivating intriguing methods to untangle them. The multi-domain nature of 3D-Speaker also makes it a suitable resource for evaluating large universal speech models and for experimenting with out-of-domain learning and self-supervised learning methods. https://3dspeaker.github.io