9 research outputs found

    Bio-motivated features and deep learning for robust speech recognition

    Get PDF
    Mención Internacional en el título de doctorIn spite of the enormous leap forward that the Automatic Speech Recognition (ASR) technologies has experienced over the last five years their performance under hard environmental condition is still far from that of humans preventing their adoption in several real applications. In this thesis the challenge of robustness of modern automatic speech recognition systems is addressed following two main research lines. The first one focuses on modeling the human auditory system to improve the robustness of the feature extraction stage yielding to novel auditory motivated features. Two main contributions are produced. On the one hand, a model of the masking behaviour of the Human Auditory System (HAS) is introduced, based on the non-linear filtering of a speech spectro-temporal representation applied simultaneously to both frequency and time domains. This filtering is accomplished by using image processing techniques, in particular mathematical morphology operations with an specifically designed Structuring Element (SE) that closely resembles the masking phenomena that take place in the cochlea. On the other hand, the temporal patterns of auditory-nerve firings are modeled. Most conventional acoustic features are based on short-time energy per frequency band discarding the information contained in the temporal patterns. Our contribution is the design of several types of feature extraction schemes based on the synchrony effect of auditory-nerve activity, showing that the modeling of this effect can indeed improve speech recognition accuracy in the presence of additive noise. Both models are further integrated into the well known Power Normalized Cepstral Coefficients (PNCC). The second research line addresses the problem of robustness in noisy environments by means of the use of Deep Neural Networks (DNNs)-based acoustic modeling and, in particular, of Convolutional Neural Networks (CNNs) architectures. A deep residual network scheme is proposed and adapted for our purposes, allowing Residual Networks (ResNets), originally intended for image processing tasks, to be used in speech recognition where the network input is small in comparison with usual image dimensions. We have observed that ResNets on their own already enhance the robustness of the whole system against noisy conditions. Moreover, our experiments demonstrate that their combination with the auditory motivated features devised in this thesis provide significant improvements in recognition accuracy in comparison to other state-of-the-art CNN-based ASR systems under mismatched conditions, while maintaining the performance in matched scenarios. The proposed methods have been thoroughly tested and compared with other state-of-the-art proposals for a variety of datasets and conditions. The obtained results prove that our methods outperform other state-of-the-art approaches and reveal that they are suitable for practical applications, specially where the operating conditions are unknown.El objetivo de esta tesis se centra en proponer soluciones al problema del reconocimiento de habla robusto; por ello, se han llevado a cabo dos líneas de investigación. En la primera líınea se han propuesto esquemas de extracción de características novedosos, basados en el modelado del comportamiento del sistema auditivo humano, modelando especialmente los fenómenos de enmascaramiento y sincronía. En la segunda, se propone mejorar las tasas de reconocimiento mediante el uso de técnicas de aprendizaje profundo, en conjunto con las características propuestas. Los métodos propuestos tienen como principal objetivo, mejorar la precisión del sistema de reconocimiento cuando las condiciones de operación no son conocidas, aunque el caso contrario también ha sido abordado. En concreto, nuestras principales propuestas son los siguientes: Simular el sistema auditivo humano con el objetivo de mejorar la tasa de reconocimiento en condiciones difíciles, principalmente en situaciones de alto ruido, proponiendo esquemas de extracción de características novedosos. Siguiendo esta dirección, nuestras principales propuestas se detallan a continuación: • Modelar el comportamiento de enmascaramiento del sistema auditivo humano, usando técnicas del procesado de imagen sobre el espectro, en concreto, llevando a cabo el diseño de un filtro morfológico que captura este efecto. • Modelar el efecto de la sincroní que tiene lugar en el nervio auditivo. • La integración de ambos modelos en los conocidos Power Normalized Cepstral Coefficients (PNCC). La aplicación de técnicas de aprendizaje profundo con el objetivo de hacer el sistema más robusto frente al ruido, en particular con el uso de redes neuronales convolucionales profundas, como pueden ser las redes residuales. Por último, la aplicación de las características propuestas en combinación con las redes neuronales profundas, con el objetivo principal de obtener mejoras significativas, cuando las condiciones de entrenamiento y test no coinciden.Programa Oficial de Doctorado en Multimedia y ComunicacionesPresidente: Javier Ferreiros López.- Secretario: Fernando Díaz de María.- Vocal: Rubén Solera Ureñ

    Deep Scattering and End-to-End Speech Models towards Low Resource Speech Recognition

    Get PDF
    Automatic Speech Recognition (ASR) has made major leaps in its advancement largely due to two different machine learning models: Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs). State-of-the art results have been achieved by combining these two disparate methods to form a hybrid system. This also requires that various components of the speech recognizer be trained independently based on a probabilistic noisy channel model. Although this HMM-DNN hybrid ASR method has been successful in recent studies, the independent development of the individual components used in hybrid HMM-DNN models makes ASR development fragile and expensive in terms of time-to-develop the various components and their associated sub-systems. The resulting trade-off is that ASR systems are difficult to develop and use especially for new applications and languages. The alternative approach, known as the end-to-end paradigm, makes use of a single deep neural-network architecture used to encapsulate as many as possible subcomponents of speech recognition as a single process. In the so-called end-to-end paradigm, latent variables of sub-components are subsumed by the neural network sub-architectures and the associated parameters. The end-to-end paradigm gains of a simplified ASR-development process again are traded for higher internal model complexity and computational resources needed to train the end-to-end models. This research focuses on taking advantage of the end-to-end model ASR development gains for new and low-resource languages. Using a specialised light weight convolution-like neural network called the deep scattering network (DSN) to replace the input layer of the end-to-end model, our objective was to measure the performance of the end-to-end model using these augmented speech features while checking to see if the light-weight, wavelet-based architecture brought about any improvements for low resource Speech recognition in particular. The results showed that it is possible to use this compact strategy for speech pattern recognition by deploying deep scattering network features with higher dimensional vectors when compared to traditional speech features. With Word Error Rates of 26.8% and 76.7% for SVCSR and LVCSR respective tasks, the ASR system metrics fell few WER points short of their respective baselines. In addition, training times tended to be longer when compared to their respective baselines and therefore had no significant improvement for low resource speech recognition training

    Multi-dialect Arabic broadcast speech recognition

    Get PDF
    Dialectal Arabic speech research suffers from the lack of labelled resources and standardised orthography. There are three main challenges in dialectal Arabic speech recognition: (i) finding labelled dialectal Arabic speech data, (ii) training robust dialectal speech recognition models from limited labelled data and (iii) evaluating speech recognition for dialects with no orthographic rules. This thesis is concerned with the following three contributions: Arabic Dialect Identification: We are mainly dealing with Arabic speech without prior knowledge of the spoken dialect. Arabic dialects could be sufficiently diverse to the extent that one can argue that they are different languages rather than dialects of the same language. We have two contributions: First, we use crowdsourcing to annotate a multi-dialectal speech corpus collected from Al Jazeera TV channel. We obtained utterance level dialect labels for 57 hours of high-quality consisting of four major varieties of dialectal Arabic (DA), comprised of Egyptian, Levantine, Gulf or Arabic peninsula, North African or Moroccan from almost 1,000 hours. Second, we build an Arabic dialect identification (ADI) system. We explored two main groups of features, namely acoustic features and linguistic features. For the linguistic features, we look at a wide range of features, addressing words, characters and phonemes. With respect to acoustic features, we look at raw features such as mel-frequency cepstral coefficients combined with shifted delta cepstra (MFCC-SDC), bottleneck features and the i-vector as a latent variable. We studied both generative and discriminative classifiers, in addition to deep learning approaches, namely deep neural network (DNN) and convolutional neural network (CNN). In our work, we propose Arabic as a five class dialect challenge comprising of the previously mentioned four dialects as well as modern standard Arabic. Arabic Speech Recognition: We introduce our effort in building Arabic automatic speech recognition (ASR) and we create an open research community to advance it. This section has two main goals: First, creating a framework for Arabic ASR that is publicly available for research. We address our effort in building two multi-genre broadcast (MGB) challenges. MGB-2 focuses on broadcast news using more than 1,200 hours of speech and 130M words of text collected from the broadcast domain. MGB-3, however, focuses on dialectal multi-genre data with limited non-orthographic speech collected from YouTube, with special attention paid to transfer learning. Second, building a robust Arabic ASR system and reporting a competitive word error rate (WER) to use it as a potential benchmark to advance the state of the art in Arabic ASR. Our overall system is a combination of five acoustic models (AM): unidirectional long short term memory (LSTM), bidirectional LSTM (BLSTM), time delay neural network (TDNN), TDNN layers along with LSTM layers (TDNN-LSTM) and finally TDNN layers followed by BLSTM layers (TDNN-BLSTM). The AM is trained using purely sequence trained neural networks lattice-free maximum mutual information (LFMMI). The generated lattices are rescored using a four-gram language model (LM) and a recurrent neural network with maximum entropy (RNNME) LM. Our official WER is 13%, which has the lowest WER reported on this task. Evaluation: The third part of the thesis addresses our effort in evaluating dialectal speech with no orthographic rules. Our methods learn from multiple transcribers and align the speech hypothesis to overcome the non-orthographic aspects. Our multi-reference WER (MR-WER) approach is similar to the BLEU score used in machine translation (MT). We have also automated this process by learning different spelling variants from Twitter data. We mine automatically from a huge collection of tweets in an unsupervised fashion to build more than 11M n-to-m lexical pairs, and we propose a new evaluation metric: dialectal WER (WERd). Finally, we tried to estimate the word error rate (e-WER) with no reference transcription using decoding and language features. We show that our word error rate estimation is robust for many scenarios with and without the decoding features

    Arquitecturas y métodos en sistemas de reconocimiento automático de habla de gran vocabulario

    Get PDF
    La tesis que se presenta en este documento, se enmarca en el área del Reconocimiento Automático de Habla y específicamente en el diseño de sistemas de reconocimiento de gran vocabulario. En todos los casos, la tecnología de base en lo que se refiere al modelado, la aportan los modelos ocultos de Markov que, hoy por hoy, representan el paradigma de modelado dominante. En concreto, se utilizarán técnicas de modelado discreto y semicontinuo, dependiente e independiente del contexto. En primer lugar, y a partir de una clasificación de alternativas arquitecturales en el diseño de sistemas de reconocimiento se hace un estudio teórico de la formulación del comportamiento de arquitecturas multi-módulo, tanto en coste computacional como en tasa de reconocimiento, definiendo una metodología de diseño para determinar la adecuación de módulos particulares de cara a su uso conjunto, que es validada con la experimentación correspondiente. Igualmente, se hace énfasis en el estudio y evaluación de algunas de las alternativas de compresión del espacio de búsqueda, estableciendo relaciones de compromiso entre coste y tasa, que es el binomio decisivo a la hora de abordar el diseño de sistemas en tiempo real. Se presentan estudios sobre distintas estrategias de organización del espacio de búsqueda orientadas a exploración y búsqueda con algoritmos de programación dinámica: árboles y grafos, deterministas y no deterministas, proponiendo soluciones prometedoras para incrementar la tasa de inclusión obtenible sobre estructuras de grafo (en las que la compresión del espacio de búsqueda produce peores resultados que con la búsqueda lineal o en árbol). Especialmente importante es el trabajo sobre estimación de listas variables de preselección, analizando métodos paramétricos y no paramétricos, centrándonos en el uso de redes neuronales como mecanismo estimador. Se ha propuesto una metodología de selección de parámetros de entrada, topologías y métodos de codificación, en base a su potencia discriminativa en una tarea simplificada. Dicha propuesta que ha sido ampliamente evaluada y comparada con el enfoque tradicional de uso de listas fijas, mostrando la consistente mejora tanto en tasa como en coste computacional conseguible con el uso de redes neuronales. Dicho estudio sobre listas variables ha sido extendido de forma natural al problema de estimación de fiabilidad de hipótesis, habiéndose aprovechando estos resultados, de nuevo, para la estimación de longitudes de listas, obteniendo también buenos resultados. En lo que respecta al repertorio de unidades de reconocimiento y a la composición de los diccionarios usados (en cuanto al uso de múltiples pronunciaciones), se aplican, evalúan y comparan métodos dirigidos por datos y basados en conocimiento. En el apartado de introducción de variantes de pronunciación se ha discutido ampliamente la problemática de contar con bases de datos representativas y haciendo énfasis en la importancia de atender y evaluar las mejoras marginales obtenidas con algunos de estos métodos. La evaluación de los resultados es planteada cuidadosamente, sobre dos tareas radicalmente distintas: habla telefónica independiente del locutor y habla aislada dependiente, ambas usando gran vocabulario (hasta 10000 palabras), lo que permite obtener conclusiones y claves de diseño para cada una de ellas, con lo que se consigue una generalización más fundamentada de su bondades o perjuicios. En este sentido se aplican análisis de validez y relevancia estadística que pongan en su justo sitio las mejoras o degradaciones observadas. En los procesos de evaluación se han propuesto nuevas métricas y mecanismos originales de comparación

    IberSPEECH 2020: XI Jornadas en Tecnología del Habla and VII Iberian SLTech

    Get PDF
    IberSPEECH2020 is a two-day event, bringing together the best researchers and practitioners in speech and language technologies in Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentation of projects, laboratories activities, recent PhD thesis, discussion panels, a round table, and awards to the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions that will be presented distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all the contributions, each submitted paper was reviewed by three members of the scientific review committee. All the papers in the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, Czech Republic, Ucrania, Slovenia). Furthermore, it is confirmed to publish an extension of selected papers as a special issue of the Journal of Applied Sciences, “IberSPEECH 2020: Speech and Language Technologies for Iberian Languages”, published by MDPI with fully open access. In addition to regular paper sessions, the IberSPEECH2020 scientific program features the following activities: the ALBAYZIN evaluation challenge session.Red Española de Tecnologías del Habla. Universidad de Valladoli

    Ways and Capacity in Archaeological Data Management in Serbia

    Get PDF
    Over the past year and due to the COVID-19 pandemic, the entire world has witnessed inequalities across borders and societies. They also include access to archaeological resources, both physical and digital. Both archaeological data creators and users spent a lot of time working from their homes, away from artefact collections and research data. However, this was the perfect moment to understand the importance of making data freely and openly available, both nationally and internationally. This is why the authors of this paper chose to make a selection of data bases from various institutions responsible for preservation and protection of cultural heritage, in order to understand their policies regarding accessibility and usage of the data they keep. This will be done by simple visits to various web-sites or data bases. They intend to check on the volume and content, but also importance of the offered archaeological heritage. In addition, the authors will estimate whether the heritage has adequately been classified and described and also check whether data is available in foreign languages. It needs to be seen whether it is possible to access digital objects (documents and the accompanying metadata), whether access is opened for all users or it requires a certain hierarchy access, what is the policy of usage, reusage and distribution etc. It remains to be seen whether there are public API or whether it is possible to collect data through API. In case that there is a public API, one needs to check whether datasets are interoperable or messy, requiring data cleaning. After having visited a certain number of web-sites, the authors expect to collect enough data to make a satisfactory conclusion about accessibility and usage of Serbian archaeological data web bases

    Neolithic land-use in the Dutch wetlands: estimating the land-use implications of resource exploitation strategies in the Middle Swifterbant Culture (4600-3900 BCE)

    Get PDF
    The Dutch wetlands witness the gradual adoption of Neolithic novelties by foraging societies during the Swifterbant period. Recent analyses provide new insights into the subsistence palette of Middle Swifterbant societies. Small-scale livestock herding and cultivation are in evidence at this time, but their importance if unclear. Within the framework of PAGES Land-use at 6000BP project, we aim to translate the information on resource exploitation into information on land-use that can be incorporated into global climate modelling efforts, with attention for the importance of agriculture. A reconstruction of patterns of resource exploitation and their land-use dimensions is complicated by methodological issues in comparing the results of varied recent investigations. Analyses of organic residues in ceramics have attested to the cooking of aquatic foods, ruminant meat, porcine meat, as well as rare cases of dairy. In terms of vegetative matter, some ceramics exclusively yielded evidence of wild plants, while others preserve cereal remains. Elevated δ15N values of human were interpreted as demonstrating an important aquatic component of the diet well into the 4th millennium BC. Yet recent assays on livestock remains suggest grazing on salt marshes partly accounts for the human values. Finally, renewed archaeozoological investigations have shown the early presence of domestic animals to be more limited than previously thought. We discuss the relative importance of exploited resources to produce a best-fit interpretation of changing patterns of land-use during the Middle Swifterbant phase. Our review combines recent archaeological data with wider data on anthropogenic influence on the landscape. Combining the results of plant macroremains, information from pollen cores about vegetation development, the structure of faunal assemblages, and finds of arable fields and dairy residue, we suggest the most parsimonious interpretation is one of a limited land-use footprint of cultivation and livestock keeping in Dutch wetlands between 4600 and 3900 BCE.NWOVidi 276-60-004Human Origin

    Taphonomy, environment or human plant exploitation strategies?: Deciphering changes in Pleistocene-Holocene plant representation at Umhlatuzana rockshelter, South Africa

    Get PDF
    The period between ~40 and 20 ka BP encompassing the Middle Stone Age (MSA) and Later Stone Age (LSA) transition has long been of interest because of the associated technological change. Understanding this transition in southern Africa is complicated by the paucity of archaeological sites that span this period. With its occupation sequence spanning the last ~70,000 years, Umhlatuzana Rock Shelter is one of the few sites that record this transition. Umhlatuzana thus offers a great opportunity to study past environmental dynamics from the Late Pleistocene (MIS 4) to the Late Holocene, and past human subsistence strategies, their social organisation, technological and symbolic innovations. Although organic preservation is poor (bones, seeds, and charcoal) at the site, silica phytoliths preserve generally well throughout the sequence. These microscopic silica particles can identify different plant types that are no longer visible at the site because of decomposition or burning to a reliable taxonomical level. Thus, to trace site occupation, plant resource use, and in turn reconstruct past vegetation, we applied phytolith analyses to sediment samples of the newly excavated Umhlatuzana sequence. We present results of the phytolith assemblage variability to determine change in plant use from the Pleistocene to the Holocene and discuss them in relation to taphonomical processes and human plant gathering strategies and activities. This study ultimately seeks to provide a palaeoenvironmental context for modes of occupation and will shed light on past human-environmental interactions in eastern South Africa.NWOVidi 276-60-004Human Origin
    corecore