The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016
The 2016 speaker recognition evaluation (SRE'16) is the latest edition in the series of benchmarking events conducted by the National Institute of Standards and Technology (NIST). I4U is a joint entry to SRE'16 resulting from the collaboration and active exchange of information among researchers from sixteen institutes and universities across four continents. The joint submission and several of its 32 sub-systems were among the top-performing systems. Considerable effort was devoted to two major challenges, namely, unlabeled training data and the dataset shift from Switchboard-Mixer to the new Call My Net dataset. This paper summarizes the lessons learned, and presents our shared view from the sixteen research groups on recent advances, major paradigm shifts, and the common tool chain used in speaker recognition as we have witnessed in SRE'16. More importantly, we look into the intriguing question of fusing a large ensemble of sub-systems and the potential benefit of large-scale collaboration.
Peer reviewed
I4U System Description for NIST SRE'20 CTS Challenge
This manuscript describes the I4U submission to the 2020 NIST Speaker
Recognition Evaluation (SRE'20) Conversational Telephone Speech (CTS)
Challenge. The I4U submission resulted from active collaboration among
researchers across eight research teams - I2R (Singapore), UEF (Finland),
VALPT (Italy, Spain), NEC (Japan), THUEE (China), LIA (France), NUS
(Singapore), INRIA (France) and TJU (China). The submission was based on the
fusion of top-performing sub-systems and sub-fusion systems contributed by
individual teams. Effort was devoted to the use of common development and
validation sets, a shared submission schedule and milestones, and to minimizing
inconsistencies in trial list and score file formats across sites.
Comment: SRE 2021, NIST Speaker Recognition Evaluation Workshop, CTS Speaker
Recognition Challenge, 14-12 December 202
On the Robustness of Arabic Speech Dialect Identification
Arabic dialect identification (ADI) tools are an important part of the
large-scale data collection pipelines necessary for training speech recognition
models. As these pipelines require application of ADI tools to potentially
out-of-domain data, we aim to investigate how vulnerable the tools may be to
this domain shift. With self-supervised learning (SSL) models as a starting
point, we evaluate transfer learning and direct classification from SSL
features. We undertake our evaluation under rich conditions, with a goal to
develop ADI systems from pretrained models and ultimately evaluate performance
on newly collected data. In order to understand what factors contribute to
model decisions, we carry out a careful human study of a subset of our data.
Our analysis confirms that domain shift is a major challenge for ADI models. We
also find that while self-training does alleviate this challenge, it may be
insufficient for realistic conditions.
Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals
Recent years have seen growing efforts to develop spoofing countermeasures
(CMs) to protect automatic speaker verification (ASV) systems from being
deceived by manipulated or artificial inputs. The reliability of spoofing CMs
is typically gauged using the equal error rate (EER) metric. The primitive EER
fails to reflect application requirements and the impact of spoofing and CMs
upon ASV, and its use as a primary metric in traditional ASV research has long
been abandoned in favour of risk-based approaches to assessment. This paper
presents several new extensions to the tandem detection cost function (t-DCF),
a recent risk-based approach to assess the reliability of spoofing CMs deployed
in tandem with an ASV system. Extensions include a simplified version of the
t-DCF with fewer parameters, an analysis of a special case for a fixed ASV
system, simulations which give original insights into its interpretation and
new analyses using the ASVspoof 2019 database. It is hoped that adoption of the
t-DCF for CM assessment will help to foster closer collaboration between
the anti-spoofing and ASV research communities.
Comment: Published in IEEE/ACM Transactions on Audio, Speech, and Language
Processing (doi updated)
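The EER discussed in the abstract above can be made concrete: given genuine and impostor trial scores, it is the operating point where false-accept and false-reject rates coincide. A minimal sketch, where the function name and toy scores are illustrative and not taken from the paper:

```python
def compute_eer(genuine, impostor):
    """Equal error rate: the point where false-accept and false-reject rates meet.

    Higher score = more likely genuine. Sweeps every observed score as a
    candidate threshold. Illustrative sketch, not the paper's implementation.
    """
    best_gap, eer = float("inf"), 1.0
    for t in sorted(genuine + impostor):
        frr = sum(1 for s in genuine if s < t) / len(genuine)     # false rejects
        far = sum(1 for s in impostor if s >= t) / len(impostor)  # false accepts
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

# Toy scores: fully separable classes give EER = 0; class overlap raises it.
print(compute_eer([2.0, 2.5, 3.1], [-1.0, -0.5, 0.2]))  # 0.0
print(compute_eer([1.0, 2.0, 3.0], [0.0, 1.5, 2.5]))    # ~0.333
```

As the abstract argues, a single EER number hides application-dependent costs and priors, which is precisely what risk-based metrics such as the t-DCF reintroduce.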
Speech segmentation and speaker diarisation for transcription and translation
This dissertation outlines work related to Speech Segmentation – segmenting an audio
recording into regions of speech and non-speech, and Speaker Diarization – further
segmenting those regions into those pertaining to homogeneous speakers.
Knowing not only what was said but also who said it and when, has many useful
applications. As well as providing a richer level of transcription for speech, we will
show how such knowledge can improve Automatic Speech Recognition (ASR) system
performance and can also benefit downstream Natural Language Processing (NLP)
tasks such as machine translation and punctuation restoration.
While segmentation and diarization may appear to be relatively simple tasks to
describe, in practice we find that they are very challenging and are, in general, ill-defined
problems. Therefore, we first provide a formalisation of each of the problems
as the sub-division of speech within acoustic space and time. Here, we see that the
task can become very difficult when we want to partition this domain into our target
classes of speakers, whilst avoiding other classes that reside in the same space, such as
phonemes. We present a theoretical framework for describing and discussing the tasks
as well as introducing existing state-of-the-art methods and research.
Current Speaker Diarization systems are notoriously sensitive to hyper-parameters
and lack robustness across datasets. Therefore, we present a method which uses a series
of oracle experiments to expose the limitations of current systems and to identify
the system components to which these limitations can be attributed. We also demonstrate how Diarization
Error Rate (DER), the dominant error metric in the literature, is not a comprehensive
or reliable indicator of overall performance or of error propagation to subsequent
downstream tasks. These results inform our subsequent research.
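The DER critique above is easier to follow with the metric's definition in hand: DER sums missed speech, false-alarm speech, and speaker-confusion time, normalized by total reference speech time. A minimal frame-level sketch; the labels and helper name are hypothetical, and real scorers additionally find an optimal speaker mapping and apply a forgiveness collar:

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER = (miss + false alarm + confusion) / reference speech frames.

    reference/hypothesis: equal-length lists of per-frame labels; None marks
    non-speech, any other value is a speaker label. Assumes hypothesis speaker
    labels are already mapped to reference labels -- a real scorer (e.g. the
    NIST md-eval convention) optimizes this mapping and applies a collar.
    """
    assert len(reference) == len(hypothesis)
    miss = false_alarm = confusion = ref_speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            ref_speech += 1
            if hyp is None:
                miss += 1          # speech labelled as non-speech
            elif hyp != ref:
                confusion += 1     # speech attributed to the wrong speaker
        elif hyp is not None:
            false_alarm += 1       # non-speech labelled as speech
    return (miss + false_alarm + confusion) / ref_speech

# Toy example: 10 frames, two speakers; one confused frame and one missed frame.
ref = ["A", "A", "A", "B", "B", "B", None, None, "A", "A"]
hyp = ["A", "A", "B", "B", "B", None, None, None, "A", "A"]
print(diarization_error_rate(ref, hyp))  # (1 + 0 + 1) / 8 = 0.25
```

Because all three error types are pooled into a single number, two systems with equal DER can behave very differently downstream, which is the limitation the oracle experiments are designed to expose.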
We find that, as a precursor to Speaker Diarization, the task of Speech Segmentation
is a crucial first step in the system chain. Current methods typically do not account
for the inherent structure of spoken discourse. As such, we explore a novel method
which exploits an utterance-duration prior in order to better model the segment distribution
of speech. We show how this method improves not only segmentation, but also
the performance of subsequent speech recognition, machine translation and speaker
diarization systems.
Typical ASR transcriptions do not include punctuation and the task of enriching
transcriptions with this information is known as ‘punctuation restoration’. The benefit
is not only improved readability but also better compatibility with NLP systems
that expect sentence-like units such as in conventional machine translation. We show
how segmentation and diarization are related tasks that are able to contribute acoustic
information that complements existing linguistically-based punctuation approaches.
There is a growing demand for speech technology applications in the broadcast media
domain. This domain presents many new challenges including diverse noise and
recording conditions. We show that the capacity of existing GMM-HMM based speech
segmentation systems is limited for such scenarios and present a Deep Neural Network
(DNN) based method which offers a more robust speech segmentation method resulting
in improved speech recognition performance for a television broadcast dataset.
Ultimately, we are able to show that speech segmentation is an inherently ill-defined
problem whose solution is highly dependent on the downstream task
for which it is intended.
CONTRIBUTIONS TO EFFICIENT AUTOMATIC TRANSCRIPTION OF VIDEO LECTURES
Thesis by compendium of publications.
During the last years, on-line multimedia repositories have become key
knowledge assets thanks to the rise of the Internet, especially in the area of
education. Educational institutions around the world have devoted considerable effort
to explore different teaching methods, to improve the transmission of knowledge
and to reach a wider audience. As a result, online video lecture repositories
are now available and serve as complementary tools that can boost the learning
experience to better assimilate new concepts. In order to guarantee the success
of these repositories the transcription of each lecture plays a very important
role because it constitutes the first step towards the availability of many other
features. This transcription allows the searchability of learning materials,
enables translation into other languages, provides recommendation
functions, makes it possible to generate content summaries, and guarantees
access for people with hearing disabilities. However, the
transcription of these videos is expensive in terms of time and human cost.
To this end, this thesis aims to provide new tools and techniques that
ease the transcription of these repositories. In particular, we address the
development of a complete Automatic Speech Recognition toolkit with a special
focus on the Deep Learning techniques that contribute to provide accurate
transcriptions in real-world scenarios. This toolkit is tested against many
others in different international competitions, showing comparable transcription
quality. Moreover, a new technique to improve recognition accuracy is
proposed which makes use of Confidence Measures; this in turn
motivated new Confidence Measure estimation techniques that helped to
further improve transcription quality. To this end, a speaker-adapted
confidence measure approach was proposed for models based on Recurrent Neural
Networks.
The contributions proposed herein have been tested in real-life scenarios in
different educational repositories. In fact, the transLectures-UPV toolkit is
part of a set of tools for providing video lecture transcriptions in many
different Spanish and European universities and institutions.
Agua Teba, MÁD. (2019). CONTRIBUTIONS TO EFFICIENT AUTOMATIC TRANSCRIPTION OF VIDEO LECTURES [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/130198