
    An Empirical Evaluation of Zero Resource Acoustic Unit Discovery

    Acoustic unit discovery (AUD) is the process of automatically identifying a categorical acoustic unit inventory from speech and producing the corresponding acoustic unit tokenizations. AUD provides an important avenue for unsupervised acoustic model training in a zero-resource setting, where expert-provided linguistic knowledge and transcribed speech are unavailable. To further facilitate the zero-resource AUD process, this paper demonstrates that acoustic feature representations can be significantly improved by (i) performing linear discriminant analysis (LDA) in an unsupervised, self-trained fashion, and (ii) leveraging the resources of other languages by building a multilingual bottleneck (BN) feature extractor for effective cross-lingual generalization. Moreover, we perform comprehensive evaluations of AUD efficacy on multiple downstream speech applications; their correlated performance suggests that AUD evaluations are feasible using different alternative language resources when only a subset of these evaluation resources is available in typical zero-resource applications.
    Comment: 5 pages, 1 figure; Accepted for publication at ICASSP 201
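    The "self-trained" LDA idea in the abstract can be sketched roughly as follows: cluster unlabelled frames to obtain pseudo-labels, then fit LDA on those pseudo-labels. This is an illustrative sketch only; the frame dimensionality, the k-means pseudo-labelling step, and the unit-inventory size `n_units` are assumptions, not the paper's exact recipe.

    ```python
    # Sketch of unsupervised, self-trained LDA for acoustic features.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    frames = rng.normal(size=(1000, 39))   # stand-in for MFCC+delta frames

    n_units = 10                           # hypothesised acoustic unit inventory size
    pseudo_labels = KMeans(n_clusters=n_units, n_init=5,
                           random_state=0).fit_predict(frames)

    # LDA fitted on the pseudo-labels projects frames into a space that
    # maximises between-"unit" separation -- the self-trained step.
    lda = LinearDiscriminantAnalysis(n_components=n_units - 1)
    projected = lda.fit_transform(frames, pseudo_labels)
    print(projected.shape)                 # (1000, 9)
    ```

    In practice this loop would iterate: re-cluster in the projected space, refit LDA, and repeat until the pseudo-labels stabilise.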

    Early phonetic learning without phonetic categories: Insights from large-scale simulations on realistic input

    Before they even speak, infants become attuned to the sounds of the language(s) they hear, processing native phonetic contrasts more easily than non-native ones. For example, between 6-8 months and 10-12 months, infants learning American English get better at distinguishing English [ɹ] and [l], as in ‘rock’ vs ‘lock’, relative to infants learning Japanese. Influential accounts of this early phonetic learning phenomenon initially proposed that infants group sounds into native vowel- and consonant-like phonetic categories—like [ɹ] and [l] in English—through a statistical clustering mechanism dubbed ‘distributional learning’. The feasibility of this mechanism for learning phonetic categories has been challenged, however. Here we demonstrate that a distributional learning algorithm operating on naturalistic speech can predict early phonetic learning as observed in Japanese and American English infants, suggesting that infants might learn through distributional learning after all. We further show, however, that contrary to the original distributional learning proposal, our model learns units too brief and too fine-grained acoustically to correspond to phonetic categories. This challenges the influential idea that what infants learn are phonetic categories. More broadly, our work introduces a novel mechanism-driven approach to the study of early phonetic learning, together with a quantitative modeling framework that can handle realistic input. This allows, for the first time, accounts of early phonetic learning to be linked to concrete, systematic predictions regarding infants’ attunement.
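    The core of the distributional-learning proposal is unsupervised clustering of acoustic measurements with no category labels. A minimal sketch, under the assumption of two synthetic formant-like clouds standing in for tokens of two sounds (the dimensions and values are illustrative, not the study's actual features):

    ```python
    # Minimal sketch of distributional learning: fit a Gaussian mixture
    # to unlabelled acoustic tokens and read off the discovered "units".
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    sound_a = rng.normal(loc=[300.0, 2200.0], scale=60.0, size=(200, 2))
    sound_b = rng.normal(loc=[700.0, 1200.0], scale=60.0, size=(200, 2))
    tokens = np.vstack([sound_a, sound_b])     # no labels anywhere

    gmm = GaussianMixture(n_components=2, random_state=0).fit(tokens)
    units = gmm.predict(tokens)

    # With well-separated distributions, clustering recovers two units.
    print(len(set(units)))                     # 2
    ```

    The paper's point is that on realistic speech, the units such a mechanism discovers are briefer and acoustically finer-grained than phonetic categories, even though the learning signal itself is sufficient to predict infants' discrimination patterns.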

    Disentanglement Learning for Text-Free Voice Conversion

    Voice conversion (VC) aims to change the perceived speaker identity of a speech signal from one speaker to another while preserving the linguistic content. Recent state-of-the-art VC systems typically depend on automatic speech recognition (ASR) models and have achieved great success: results of recent challenges show that these VC systems have reached a level of performance close to real human voices. However, they rely heavily on the performance of the ASR models, which may degrade in practical applications because of the mismatch between training and test data. VC systems that are independent of ASR models are typically regarded as text-free systems. They commonly apply disentanglement learning methods, such as vector quantisation (VQ) or instance normalisation (IN), to remove the speaker information from a speech signal. However, text-free VC systems have not reached the same level of performance as text-dependent systems. This thesis mainly studies disentanglement learning methods for improving the performance of text-free VC systems. Three major contributions are summarised as follows. Firstly, to improve the performance of an auto-encoder based VC model, the information-loss issue caused by the model's VQ is studied. Two disentanglement learning methods are exploited to replace the VQ of the model. Experiments show that these two methods improve the naturalness and intelligibility of the model but hurt its speaker-similarity performance; the reason for this degradation is studied in further analysis experiments. Next, the performance and robustness of Generative Adversarial Network (GAN) based VC models are studied. To improve the performance and robustness of a GAN-based VC model, a new model is proposed.
    This new model introduces a new speaker adaptation layer that alleviates the information-loss issue caused by an IN-based speaker adaptation method. Experiments show that the proposed model outperformed the baseline models in VC performance and robustness. The third contribution studies whether Self-Supervised Learning (SSL) based VC models can reach the same level of performance as the state-of-the-art text-dependent models. An encoder-decoder framework is established for experiments; in this framework, a VC system implemented with an SSL model can be compared to a VC system implemented with an ASR model. Experimental results show that SSL-based VC models can reach the same level of naturalness as the state-of-the-art text-dependent VC models. SSL-based VC models also gained an advantage in intelligibility when tested on out-of-domain target speakers, but they performed worse on speaker similarity.
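    Instance normalisation, one of the two disentanglement methods the abstract names, removes per-channel statistics from a content representation; those statistics carry much of a speaker's global voice colouring. A minimal numpy sketch, with shapes and the additive "speaker bias" chosen purely for illustration (not the thesis's architecture):

    ```python
    # Sketch of instance normalisation (IN) as a speaker-disentanglement step.
    import numpy as np

    def instance_norm(features, eps=1e-5):
        # features: (channels, time); normalise each channel over time.
        mean = features.mean(axis=1, keepdims=True)
        std = features.std(axis=1, keepdims=True)
        return (features - mean) / (std + eps)

    rng = np.random.default_rng(0)
    content = rng.normal(size=(80, 120))            # mel-spectrogram-like content
    speaker_bias = rng.normal(size=(80, 1)) * 3.0   # crude per-channel speaker colouring

    normalised = instance_norm(content + speaker_bias)
    # After IN, per-channel means are ~0 regardless of the speaker bias.
    print(np.allclose(normalised.mean(axis=1), 0.0, atol=1e-6))   # True
    ```

    The thesis's information-loss concern is visible even in this toy: IN discards channel means and variances wholesale, so any linguistic information entangled with those statistics is lost along with the speaker information, motivating the proposed speaker adaptation layer.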

    Efficient interaction with large medical imaging databases

    Every day, hospitals and medical centers around the world produce large amounts of imaging content to support clinical decisions, medical research, and education. With the current trend towards evidence-based medicine, there is an increasing need for strategies that allow pathologists to properly interact with the valuable information such imaging repositories host and to extract relevant content for supporting decision making. Unfortunately, current systems are very limited at providing access to this content and extracting information from it because of various semantic and computational challenges. This thesis presents a complete pipeline, comprising three building blocks, that aims to improve the way pathologists and systems interact. The first building block consists of an adaptable strategy oriented to easing the access and visualization of histopathology imaging content. The second block explores the extraction of relevant information from such imaging content by exploiting low- and mid-level information obtained from the morphology and architecture of cell nuclei. The third block aims to integrate high-level information from the expert into the process of identifying relevant information in the imaging content. This final block not only attempts to deal with the semantic gap but also presents an alternative to manual annotation, a time-consuming and error-prone task. Different experiments were carried out and demonstrated that the introduced pipeline not only allows pathologists to navigate and visualize images but also to extract diagnostic and prognostic information that could potentially support clinical decisions.
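    The second block's low-level morphological descriptors can be illustrated with a toy example: given a segmented nucleus mask, compute area and an eccentricity proxy from image moments. The mask construction and the choice of features are assumptions for illustration, not the thesis's actual feature set.

    ```python
    # Illustrative nuclear-morphology features from a binary segmentation mask.
    import numpy as np

    def nucleus_features(mask):
        ys, xs = np.nonzero(mask)
        area = len(xs)                       # pixel count inside the nucleus
        cy, cx = ys.mean(), xs.mean()
        # Central second moments give an ellipse-based eccentricity proxy.
        mu20 = ((xs - cx) ** 2).mean()
        mu02 = ((ys - cy) ** 2).mean()
        mu11 = ((xs - cx) * (ys - cy)).mean()
        common = np.sqrt(4 * mu11 ** 2 + (mu20 - mu02) ** 2)
        lam1 = (mu20 + mu02 + common) / 2    # major-axis variance
        lam2 = (mu20 + mu02 - common) / 2    # minor-axis variance
        eccentricity = np.sqrt(1 - lam2 / lam1) if lam1 > 0 else 0.0
        return {"area": area, "eccentricity": eccentricity}

    # A circular mask should give a near-zero eccentricity.
    yy, xx = np.mgrid[:64, :64]
    circle = (yy - 32) ** 2 + (xx - 32) ** 2 <= 20 ** 2
    feats = nucleus_features(circle)
    print(feats["area"] > 0, feats["eccentricity"] < 0.2)
    ```

    Descriptors like these, aggregated over many nuclei, are the kind of mid-level architectural signal the pipeline could use before the expert's high-level input is integrated in the third block.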