257 research outputs found

    A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

    Full text link
    Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world languages do not have such resources or stable orthography. Systems constructed under these almost zero resource conditions are not only promising for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists to (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks are automatic phoneme discovery or lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language phonology. We present how the data was collected, cleaned and processed and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation.Comment: accepted to LREC 201

    Towards trustworthy phoneme boundary detection with autoregressive model and improved evaluation metric

    Full text link
    Phoneme boundary detection has been studied due to its central role in various speech applications. In this work, we point out that this task needs to be addressed not only by algorithmic way, but also by evaluation metric. To this end, we first propose a state-of-the-art phoneme boundary detector that operates in an autoregressive manner, dubbed SuperSeg. Experiments on the TIMIT and Buckeye corpora demonstrates that SuperSeg identifies phoneme boundaries with significant margin compared to existing models. Furthermore, we note that there is a limitation on the popular evaluation metric, R-value, and propose new evaluation metrics that prevent each boundary from contributing to evaluation multiple times. The proposed metrics reveal the weaknesses of non-autoregressive baselines and establishes a reliable criterion that suits for evaluating phoneme boundary detection.Comment: 5 pages, submitted to ICASSP 202

    Deep Learning for Audio Signal Processing

    Full text link
    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.Comment: 15 pages, 2 pdf figure

    Multi-Task Learning of Keyphrase Boundary Classification

    Full text link
    Keyphrase boundary classification (KBC) is the task of detecting keyphrases in scientific articles and labelling them with respect to predefined types. Although important in practice, this task is so far underexplored, partly due to the lack of labelled data. To overcome this, we explore several auxiliary tasks, including semantic super-sense tagging and identification of multi-word expressions, and cast the task as a multi-task learning problem with deep recurrent neural networks. Our multi-task models perform significantly better than previous state of the art approaches on two scientific KBC datasets, particularly for long keyphrases.Comment: ACL 201

    Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation

    Full text link
    We propose a self-supervised representation learning model for the task of unsupervised phoneme boundary detection. The model is a convolutional neural network that operates directly on the raw waveform. It is optimized to identify spectral changes in the signal using the Noise-Contrastive Estimation principle. At test time, a peak detection algorithm is applied over the model outputs to produce the final boundaries. As such, the proposed model is trained in a fully unsupervised manner with no manual annotations in the form of target boundaries nor phonetic transcriptions. We compare the proposed approach to several unsupervised baselines using both TIMIT and Buckeye corpora. Results suggest that our approach surpasses the baseline models and reaches state-of-the-art performance on both data sets. Furthermore, we experimented with expanding the training set with additional examples from the Librispeech corpus. We evaluated the resulting model on distributions and languages that were not seen during the training phase (English, Hebrew and German) and showed that utilizing additional untranscribed data is beneficial for model performance.Comment: Interspeech 2020 pape
    corecore