9 research outputs found
Bio-motivated features and deep learning for robust speech recognition
International Mention in the doctoral degree
In spite of the enormous leap forward that Automatic Speech
Recognition (ASR) technologies have experienced over the last five years,
their performance under hard environmental conditions is still far from
that of humans, preventing their adoption in several real applications.
In this thesis the challenge of robustness of modern automatic speech
recognition systems is addressed following two main research lines.
The first one focuses on modeling the human auditory system to
improve the robustness of the feature extraction stage, yielding novel
auditory motivated features. Two main contributions are produced.
On the one hand, a model of the masking behaviour of the Human
Auditory System (HAS) is introduced, based on the non-linear filtering
of a speech spectro-temporal representation applied simultaneously
to both frequency and time domains. This filtering is accomplished
by using image processing techniques, in particular mathematical
morphology operations with a specifically designed Structuring Element
(SE) that closely resembles the masking phenomena that take
place in the cochlea. On the other hand, the temporal patterns of
auditory-nerve firings are modeled. Most conventional acoustic features
are based on short-time energy per frequency band, discarding
the information contained in the temporal patterns. Our contribution
is the design of several types of feature extraction schemes based on
the synchrony effect of auditory-nerve activity, showing that the modeling
of this effect can indeed improve speech recognition accuracy in
the presence of additive noise. Both models are further integrated into
the well-known Power Normalized Cepstral Coefficients (PNCC).
The second research line addresses the problem of robustness in
noisy environments by means of Deep Neural Network (DNN)-based
acoustic modeling and, in particular, Convolutional
Neural Network (CNN) architectures. A deep residual network
scheme is proposed and adapted for our purposes, allowing Residual
Networks (ResNets), originally intended for image processing tasks,
to be used in speech recognition where the network input is small
in comparison with usual image dimensions. We have observed that
ResNets on their own already enhance the robustness of the whole system
against noisy conditions. Moreover, our experiments demonstrate
that their combination with the auditory motivated features devised
in this thesis provides significant improvements in recognition accuracy
in comparison to other state-of-the-art CNN-based ASR systems
under mismatched conditions, while maintaining the performance in
matched scenarios.
The proposed methods have been thoroughly tested and compared
with other state-of-the-art proposals for a variety of datasets and
conditions. The obtained results prove that our methods outperform
other state-of-the-art approaches and reveal that they are suitable for
practical applications, especially where the operating conditions are
unknown.
The aim of this thesis is to propose solutions to the problem
of robust speech recognition; to that end, two lines of research
have been pursued.
In the first line, novel feature extraction schemes have been proposed, based on modeling the behaviour
of the human auditory system, in particular the phenomena
of masking and synchrony. In the second, we propose improving
recognition rates through the use of deep learning
techniques in conjunction with the proposed features.
The main goal of the proposed methods is to improve the
accuracy of the recognition system when the operating
conditions are unknown, although the opposite case has also been
addressed.
Specifically, our main proposals are the following:
Simulating the human auditory system in order to improve
the recognition rate under difficult conditions, mainly
in high-noise situations, by proposing novel feature
extraction schemes.
Following this direction, our main proposals are detailed below:
• Modeling the masking behaviour of the human
auditory system using image processing techniques
on the spectrum, specifically by designing
a morphological filter that captures this effect.
• Modeling the synchrony effect that takes place in the
auditory nerve.
• Integrating both models into the well-known Power
Normalized Cepstral Coefficients (PNCC).
Applying deep learning techniques in order
to make the system more robust to noise, in particular
through the use of deep convolutional neural networks such as
residual networks.
Finally, applying the proposed features in
combination with deep neural networks, with the main goal
of obtaining significant improvements when the training
and test conditions do not match.
Official Doctoral Program in Multimedia and Communications
Chair: Javier Ferreiros López. Secretary: Fernando Díaz de María. Member: Rubén Solera Ureñ
Deep Scattering and End-to-End Speech Models towards Low Resource Speech Recognition
Automatic Speech Recognition (ASR) has made major leaps in its advancement
largely due to two different machine learning models: Hidden Markov Models (HMMs)
and Deep Neural Networks (DNNs). State-of-the-art results have been achieved by
combining these two disparate methods to form a hybrid system. This also requires
that various components of the speech recognizer be trained independently based on
a probabilistic noisy channel model. Although this HMM-DNN hybrid ASR method
has been successful in recent studies, the independent development of the individual
components used in hybrid HMM-DNN models makes ASR development fragile and
expensive in terms of time-to-develop the various components and their associated
sub-systems. The resulting trade-off is that ASR systems are difficult to develop
and use, especially for new applications and languages.
The alternative approach, known as the end-to-end paradigm, uses a
single deep neural network architecture to encapsulate as many subcomponents
of speech recognition as possible in a single process. In this
paradigm, the latent variables of the sub-components are subsumed by the neural network
sub-architectures and their associated parameters. The gains of a simplified
ASR development process are, in turn, traded for higher internal model
complexity and greater computational resources needed to train the end-to-end models.
This research focuses on exploiting the development gains of end-to-end ASR
models for new and low-resource languages. Using a specialised lightweight
convolution-like neural network called the deep scattering network (DSN) to replace
the input layer of the end-to-end model, our objective was to measure the
performance of the end-to-end model using these augmented speech features while
checking whether the lightweight, wavelet-based architecture brought about any
improvements for low-resource speech recognition in particular.
The results showed that it is possible to use this compact strategy for speech
pattern recognition by deploying deep scattering network features with higher-dimensional
vectors than traditional speech features. With Word Error
Rates of 26.8% and 76.7% for the SVCSR and LVCSR tasks respectively, the ASR system
metrics fell a few WER points short of their respective baselines. In addition, training
times tended to be longer than those of the respective baselines, and the approach therefore
yielded no significant improvement for low-resource speech recognition training.
Multi-dialect Arabic broadcast speech recognition
Dialectal Arabic speech research suffers from the lack of labelled resources and
standardised orthography. There are three main challenges in dialectal Arabic
speech recognition: (i) finding labelled dialectal Arabic speech data, (ii) training
robust dialectal speech recognition models from limited labelled data and (iii)
evaluating speech recognition for dialects with no orthographic rules. This thesis
is concerned with the following three contributions:
Arabic Dialect Identification: We are mainly dealing with Arabic speech
without prior knowledge of the spoken dialect. Arabic dialects can be sufficiently
diverse that one can argue they are different languages
rather than dialects of the same language. We have two contributions:
First, we use crowdsourcing to annotate a multi-dialectal speech corpus collected
from the Al Jazeera TV channel. We obtained utterance-level dialect labels for 57
hours of high-quality speech, drawn from almost 1,000 hours, covering four major
varieties of dialectal Arabic (DA): Egyptian, Levantine, Gulf (Arabian Peninsula)
and North African (Moroccan). Second, we build an Arabic dialect identification
(ADI) system. We explored two main groups of features, namely acoustic
features and linguistic features. For the linguistic features, we look at a wide
range of features, addressing words, characters and phonemes. With respect to
acoustic features, we look at raw features such as mel-frequency cepstral coefficients
combined with shifted delta cepstra (MFCC-SDC), bottleneck features and
the i-vector as a latent variable. We studied both generative and discriminative
classifiers, in addition to deep learning approaches, namely deep neural networks
(DNNs) and convolutional neural networks (CNNs). In our work, we frame Arabic
dialect identification as a five-class challenge comprising the four previously
mentioned dialects as well as Modern Standard Arabic.
Arabic Speech Recognition: We describe our effort in building Arabic automatic
speech recognition (ASR) and in creating an open research community
to advance it. This section has two main goals. First, creating a framework for
Arabic ASR that is publicly available for research. We describe our effort in building
two multi-genre broadcast (MGB) challenges. MGB-2 focuses on broadcast
news using more than 1,200 hours of speech and 130M words of text collected
from the broadcast domain. MGB-3, however, focuses on dialectal multi-genre
data with limited non-orthographic speech collected from YouTube, with special
attention paid to transfer learning. Second, building a robust Arabic ASR system
and reporting a competitive word error rate (WER) to use it as a potential
benchmark to advance the state of the art in Arabic ASR. Our overall system is
a combination of five acoustic models (AM): unidirectional long short term memory
(LSTM), bidirectional LSTM (BLSTM), time delay neural network (TDNN),
TDNN layers along with LSTM layers (TDNN-LSTM) and finally TDNN layers
followed by BLSTM layers (TDNN-BLSTM). The AMs are purely
sequence-trained neural networks, trained with lattice-free maximum mutual information (LF-MMI).
The generated lattices are rescored using a four-gram language model
(LM) and a recurrent neural network with maximum entropy (RNNME) LM.
Our official WER is 13%, which is the lowest WER reported on this task.
Evaluation: The third part of the thesis addresses our effort in evaluating dialectal
speech with no orthographic rules. Our methods learn from multiple
transcribers and align the speech hypothesis to overcome the non-orthographic
aspects. Our multi-reference WER (MR-WER) approach is similar to the BLEU
score used in machine translation (MT). We have also automated this process
by learning different spelling variants from Twitter data. We automatically mine
a huge collection of tweets in an unsupervised fashion to build more than
11M n-to-m lexical pairs, and we propose a new evaluation metric: dialectal
WER (WERd). Finally, we tried to estimate the word error rate (e-WER) with
no reference transcription, using decoding and language features. We show that
our word error rate estimation is robust in many scenarios, with and without the
decoding features.
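Several abstracts in this listing report word error rate (WER), and the one above evaluates against multiple transcribers. The sketch below shows the standard WER computation (word-level edit distance divided by reference length) together with a deliberately simplified multi-reference variant that takes the lowest WER over the available references. That variant is an assumption made for illustration; it is not the MR-WER, WERd, or e-WER defined in the thesis, and the function names are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp_words, 1):
            cur.append(min(
                prev[j] + 1,                             # deletion
                cur[j - 1] + 1,                          # insertion
                prev[j - 1] + (ref_word != hyp_word),    # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Standard WER: edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def min_reference_wer(references, hypothesis):
    """Simplified stand-in for multi-reference scoring (an assumption):
    score the hypothesis against each reference and keep the lowest WER."""
    return min(wer(reference, hypothesis) for reference in references)


# Two transcribers spell the same word differently; the hypothesis matches one.
references = ["the colour is red", "the color is red"]
hypothesis = "the color is red"
single = wer(references[0], hypothesis)
multi = min_reference_wer(references, hypothesis)
```

Here the hypothesis is penalised for a spelling difference when scored against the first reference alone, but not when either transcriber's spelling is accepted, which is the motivation for multi-reference evaluation when no standard orthography exists.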
Arquitecturas y métodos en sistemas de reconocimiento automático de habla de gran vocabulario
The thesis presented in this document falls within the area of Automatic Speech
Recognition and, specifically, the design of large-vocabulary recognition
systems. In all cases, the underlying modeling technology is provided by
hidden Markov models, which today represent the dominant modeling paradigm.
Specifically, discrete and semi-continuous modeling techniques are used, both context-dependent and
context-independent.
First, starting from a classification of architectural alternatives in the design of
recognition systems, a theoretical study is made of the formulation of the behaviour of
multi-module architectures, in terms of both computational cost and recognition rate, defining
a design methodology for determining the suitability of particular modules for joint
use, which is validated by the corresponding experiments.
Likewise, emphasis is placed on the study and evaluation of some alternatives for
compressing the search space, establishing trade-offs between cost and rate, which
is the decisive pairing when designing real-time systems. Studies are presented
on different strategies for organizing the search space, oriented towards exploration and
search with dynamic programming algorithms: trees and graphs, deterministic and
non-deterministic, proposing promising solutions for increasing the inclusion rate obtainable
on graph structures (in which compressing the search space yields worse results
than linear or tree search). Especially important is the work on estimating
variable preselection lists, analysing parametric and non-parametric methods and focusing on
the use of neural networks as the estimation mechanism. A methodology has been proposed for selecting
input parameters, topologies and coding methods on the basis of their discriminative power
in a simplified task. This proposal has been extensively evaluated and compared with the
traditional approach of using fixed lists, showing the consistent improvement in both rate and
computational cost achievable with neural networks. This study of variable lists has been
extended naturally to the problem of estimating hypothesis reliability, and
these results have in turn been used to estimate list lengths,
also with good results.
With regard to the inventory of recognition units and the composition of the
dictionaries used (as far as the use of multiple pronunciations is concerned), data-driven and
knowledge-based methods are applied, evaluated and compared. In the section on introducing
pronunciation variants, the problem of having representative databases
is discussed at length, with emphasis on the importance of attending to and evaluating the marginal improvements
obtained with some of these methods.
The evaluation of results is carefully designed around two radically
different tasks: speaker-independent telephone speech and speaker-dependent isolated speech, both using large
vocabularies (up to 10,000 words). This makes it possible to draw conclusions and design guidelines for each
task, giving a better-founded generalization of the methods' merits or drawbacks.
To this end, analyses of statistical validity and significance are applied to put the
observed improvements or degradations in their proper place. New metrics and original comparison
mechanisms have been proposed for the evaluation processes.
IberSPEECH 2020: XI Jornadas en Tecnología del Habla and VII Iberian SLTech
IberSPEECH2020 is a two-day event, bringing together the best researchers and practitioners in speech and language technologies in Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, project presentations, laboratory activities, recent PhD theses, discussion panels, a round table, and awards for the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions, distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all the contributions, each submitted paper was reviewed by three members of the scientific review committee. All the papers in the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, Czech Republic, Ukraine, Slovenia). Furthermore, extended versions of selected papers will be published as a special issue of the Journal of Applied Sciences, “IberSPEECH 2020: Speech and Language Technologies for Iberian Languages”, published by MDPI with fully open access. In addition to regular paper sessions, the IberSPEECH2020 scientific program features the ALBAYZIN evaluation challenge session.
Red Española de Tecnologías del Habla. Universidad de Valladoli
Ways and Capacity in Archaeological Data Management in Serbia
Over the past year and due to the COVID-19 pandemic, the entire world has witnessed inequalities across borders and societies.
These include unequal access to archaeological resources, both physical and digital. Both archaeological data creators and users spent
a lot of time working from home, away from artefact collections and research data. However, this was the perfect moment to
understand the importance of making data freely and openly available, both nationally and internationally.
This is why the authors of this paper chose to select databases from various institutions responsible for the preservation
and protection of cultural heritage, in order to understand their policies regarding the accessibility and usage of the data they keep.
This will be done through simple visits to various websites and databases. The authors intend to check the volume and content, as well as
the importance, of the archaeological heritage on offer. In addition, they will assess whether the heritage has been adequately
classified and described and whether data is available in foreign languages.
It remains to be seen whether it is possible to access digital objects (documents and the accompanying metadata), whether access
is open to all users or requires hierarchical access rights, and what the policy on usage, reuse and distribution is. It also remains to
be seen whether public APIs exist and whether data can be collected through them. Where a public API exists, one needs
to check whether datasets are interoperable or messy, requiring data cleaning.
After visiting a certain number of websites, the authors expect to collect enough data to draw a satisfactory conclusion
about the accessibility and usage of Serbian archaeological web databases.
Neolithic land-use in the Dutch wetlands: estimating the land-use implications of resource exploitation strategies in the Middle Swifterbant Culture (4600-3900 BCE)
The Dutch wetlands witness the gradual adoption of Neolithic novelties by foraging societies during the Swifterbant period. Recent analyses provide new insights into the subsistence palette of Middle Swifterbant societies. Small-scale livestock herding and cultivation are in evidence at this time, but their importance is unclear. Within the framework of the PAGES Land-use at 6000BP project, we aim to translate the information on resource exploitation into information on land-use that can be incorporated into global climate modelling efforts, with attention to the importance of agriculture. A reconstruction of patterns of resource exploitation and their land-use dimensions is complicated by methodological issues in comparing the results of varied recent investigations. Analyses of organic residues in ceramics have attested to the cooking of aquatic foods, ruminant meat, porcine meat, as well as rare cases of dairy. In terms of vegetative matter, some ceramics exclusively yielded evidence of wild plants, while others preserve cereal remains. Elevated δ15N values in humans were interpreted as demonstrating an important aquatic component of the diet well into the 4th millennium BC. Yet recent assays on livestock remains suggest that grazing on salt marshes partly accounts for the human values. Finally, renewed archaeozoological investigations have shown the early presence of domestic animals to be more limited than previously thought. We discuss the relative importance of exploited resources to produce a best-fit interpretation of changing patterns of land-use during the Middle Swifterbant phase. Our review combines recent archaeological data with wider data on anthropogenic influence on the landscape.
Combining the results of plant macroremains, information from pollen cores about vegetation development, the structure of faunal assemblages, and finds of arable fields and dairy residue, we suggest the most parsimonious interpretation is one of a limited land-use footprint of cultivation and livestock keeping in Dutch wetlands between 4600 and 3900 BCE.
NWO. Vidi 276-60-004. Human Origin
Taphonomy, environment or human plant exploitation strategies?: Deciphering changes in Pleistocene-Holocene plant representation at Umhlatuzana rockshelter, South Africa
The period between ~40 and 20 ka BP, encompassing the transition from the Middle Stone Age (MSA) to the Later Stone Age (LSA), has long been of interest because of the associated technological change. Understanding this transition in southern Africa is complicated by the paucity of archaeological sites that span this period. With its occupation sequence spanning the last ~70,000 years, Umhlatuzana Rock Shelter is one of the few sites that record this transition. Umhlatuzana thus offers a great opportunity to study past environmental dynamics from the Late Pleistocene (MIS 4) to the Late Holocene, as well as past human subsistence strategies, social organisation, and technological and symbolic innovations. Although the preservation of organic remains (bones, seeds, and charcoal) at the site is poor, silica phytoliths are generally well preserved throughout the sequence. These microscopic silica particles can identify, to a reliable taxonomic level, different plant types that are no longer visible at the site because of decomposition or burning. Thus, to trace site occupation and plant resource use, and in turn to reconstruct past vegetation, we applied phytolith analyses to sediment samples from the newly excavated Umhlatuzana sequence. We present results on phytolith assemblage variability to determine change in plant use from the Pleistocene to the Holocene and discuss them in relation to taphonomic processes and human plant gathering strategies and activities. This study ultimately seeks to provide a palaeoenvironmental context for modes of occupation and will shed light on past human-environmental interactions in eastern South Africa.
NWO. Vidi 276-60-004. Human Origin