3,326 research outputs found

    Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

    Full text link
    Embedding audio signal segments into vectors with fixed dimensionality is attractive because all following processing will be easier and more efficient, for example modeling, classifying or indexing. Audio Word2Vec previously proposed was shown to be able to represent audio segments for spoken words as such vectors carrying information about the phonetic structures of the signal segments. However, each linguistic unit (word, syllable, phoneme in text form) corresponds to unlimited number of audio segments with vector representations inevitably spread over the embedding space, which causes some confusion. It is therefore desired to better cluster the audio embeddings such that those corresponding to the same linguistic unit can be more compactly distributed. In this paper, inspired by Siamese networks, we propose some approaches to achieve the above goal. This includes identifying positive and negative pairs from unlabeled data for Siamese style training, disentangling acoustic factors such as speaker characteristics from the audio embedding, handling unbalanced data distribution, and having the embedding processes learn from the adjacency relationships among data points. All these can be done in an unsupervised way. Improved performance was obtained in preliminary experiments on the LibriSpeech data set, including clustering characteristics analysis and applications of spoken term detection

    Learning acoustic word embeddings with phonetically associated triplet network

    Full text link
    Previous researches on acoustic word embeddings used in query-by-example spoken term detection have shown remarkable performance improvements when using a triplet network. However, the triplet network is trained using only a limited information about acoustic similarity between words. In this paper, we propose a novel architecture, phonetically associated triplet network (PATN), which aims at increasing discriminative power of acoustic word embeddings by utilizing phonetic information as well as word identity. The proposed model is learned to minimize a combined loss function that was made by introducing a cross entropy loss to the lower layer of LSTM-based triplet network. We observed that the proposed method performs significantly better than the baseline triplet network on a word discrimination task with the WSJ dataset resulting in over 20% relative improvement in recall rate at 1.0 false alarm per hour. Finally, we examined the generalization ability by conducting the out-of-domain test on the RM dataset.Comment: 5 pages, 4 figures, submitted to ICASSP 201

    Integrate Document Ranking Information into Confidence Measure Calculation for Spoken Term Detection

    Full text link
    This paper proposes an algorithm to improve the calculation of confidence measure for spoken term detection (STD). Given an input query term, the algorithm first calculates a measurement named document ranking weight for each document in the speech database to reflect its relevance with the query term by summing all the confidence measures of the hypothesized term occurrences in this document. The confidence measure of each term occurrence is then re-estimated through linear interpolation with the calculated document ranking weight to improve its reliability by integrating document-level information. Experiments are conducted on three standard STD tasks for Tamil, Vietnamese and English respectively. The experimental results all demonstrate that the proposed algorithm achieves consistent improvements over the state-of-the-art method for confidence measure calculation. Furthermore, this algorithm is still effective even if a high accuracy speech recognizer is not available, which makes it applicable for the languages with limited speech resources.Comment: 4 page

    Personalized Acoustic Modeling by Weakly Supervised Multi-Task Deep Learning using Acoustic Tokens Discovered from Unlabeled Data

    Full text link
    It is well known that recognizers personalized to each user are much more effective than user-independent recognizers. With the popularity of smartphones today, although it is not difficult to collect a large set of audio data for each user, it is difficult to transcribe it. However, it is now possible to automatically discover acoustic tokens from unlabeled personal data in an unsupervised way. We therefore propose a multi-task deep learning framework called a phoneme-token deep neural network (PTDNN), jointly trained from unsupervised acoustic tokens discovered from unlabeled data and very limited transcribed data for personalized acoustic modeling. We term this scenario "weakly supervised". The underlying intuition is that the high degree of similarity between the HMM states of acoustic token models and phoneme models may help them learn from each other in this multi-task learning framework. Initial experiments performed over a personalized audio data set recorded from Facebook posts demonstrated that very good improvements can be achieved in both frame accuracy and word accuracy over popularly-considered baselines such as fDLR, speaker code and lightly supervised adaptation. This approach complements existing speaker adaptation approaches and can be used jointly with such techniques to yield improved results.Comment: 5 pages, 5 figures, published in IEEE ICASSP 201

    Time Series Classification using the Hidden-Unit Logistic Model

    Full text link
    We present a new model for time series classification, called the hidden-unit logistic model, that uses binary stochastic hidden units to model latent structure in the data. The hidden units are connected in a chain structure that models temporal dependencies in the data. Compared to the prior models for time series classification such as the hidden conditional random field, our model can model very complex decision boundaries because the number of latent states grows exponentially with the number of hidden units. We demonstrate the strong performance of our model in experiments on a variety of (computer vision) tasks, including handwritten character recognition, speech recognition, facial expression, and action recognition. We also present a state-of-the-art system for facial action unit detection based on the hidden-unit logistic model.Comment: 17 pages, 4 figures, 3 table

    Streaming Small-Footprint Keyword Spotting using Sequence-to-Sequence Models

    Full text link
    We develop streaming keyword spotting systems using a recurrent neural network transducer (RNN-T) model: an all-neural, end-to-end trained, sequence-to-sequence model which jointly learns acoustic and language model components. Our models are trained to predict either phonemes or graphemes as subword units, thus allowing us to detect arbitrary keyword phrases, without any out-of-vocabulary words. In order to adapt the models to the requirements of keyword spotting, we propose a novel technique which biases the RNN-T system towards a specific keyword of interest. Our systems are compared against a strong sequence-trained, connectionist temporal classification (CTC) based "keyword-filler" baseline, which is augmented with a separate phoneme language model. Overall, our RNN-T system with the proposed biasing technique significantly improves performance over the baseline system.Comment: To appear in Proceedings of IEEE ASRU 201

    Information Extraction from Scientific Literature for Method Recommendation

    Full text link
    As a research community grows, more and more papers are published each year. As a result there is increasing demand for improved methods for finding relevant papers, automatically understanding the key ideas and recommending potential methods for a target problem. Despite advances in search engines, it is still hard to identify new technologies according to a researcher's need. Due to the large variety of domains and extremely limited annotated resources, there has been relatively little work on leveraging natural language processing in scientific recommendation. In this proposal, we aim at making scientific recommendations by extracting scientific terms from a large collection of scientific papers and organizing the terms into a knowledge graph. In preliminary work, we trained a scientific term extractor using a small amount of annotated data and obtained state-of-the-art performance by leveraging large amount of unannotated papers through applying multiple semi-supervised approaches. We propose to construct a knowledge graph in a way that can make minimal use of hand annotated data, using only the extracted terms, unsupervised relational signals such as co-occurrence, and structural external resources such as Wikipedia. Latent relations between scientific terms can be learned from the graph. Recommendations will be made through graph inference for both observed and unobserved relational pairs.Comment: Thesis Proposal. arXiv admin note: text overlap with arXiv:1708.0607

    End-to-end Language Identification using NetFV and NetVLAD

    Full text link
    In this paper, we apply the NetFV and NetVLAD layers for the end-to-end language identification task. NetFV and NetVLAD layers are the differentiable implementations of the standard Fisher Vector and Vector of Locally Aggregated Descriptors (VLAD) methods, respectively. Both of them can encode a sequence of feature vectors into a fixed dimensional vector which is very important to process those variable-length utterances. We first present the relevances and differences between the classical i-vector and the aforementioned encoding schemes. Then, we construct a flexible end-to-end framework including a convolutional neural network (CNN) architecture and an encoding layer (NetFV or NetVLAD) for the language identification task. Experimental results on the NIST LRE 2007 close-set task show that the proposed system achieves significant EER reductions against the conventional i-vector baseline and the CNN temporal average pooling system, respectively.Comment: Accepted for ISCSLP 201

    Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention

    Full text link
    Keyword spotting (KWS) and speaker verification (SV) have been studied independently although it is known that acoustic and speaker domains are complementary. In this paper, we propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information. The multi-task network tightly combines sub-networks aiming at performance improvement in challenging conditions such as noisy environments, open-vocabulary KWS, and short-duration SV, by introducing novel techniques of connectionist temporal classification (CTC)-based soft voice activity detection (VAD) and global query attention. Frame-level acoustic and speaker information is integrated with phonetically originated weights so that forms a word-level global representation. Then it is used for the aggregation of feature vectors to generate discriminative embeddings. Our proposed approach shows 4.06% and 26.71% relative improvements in equal error rate (EER) compared to the baselines for both tasks. We also present a visualization example and results of ablation experiments.Comment: Accepted to Interspeech 202

    An End-to-End Approach to Automatic Speech Assessment for Cantonese-speaking People with Aphasia

    Full text link
    Conventional automatic assessment of pathological speech usually follows two main steps: (1) extraction of pathology-specific features; (2) classification or regression on extracted features. Given the great variety of speech and language disorders, feature design is never a straightforward task, and yet it is most crucial to the performance of assessment. This paper presents an end-to-end approach to automatic speech assessment for Cantonese-speaking People With Aphasia (PWA). The assessment is formulated as a binary classification task to discriminate PWA with high scores of subjective assessment from those with low scores. The sequence-to-one Recurrent Neural Network with Gated Recurrent Unit (GRU-RNN) and Convolutional Neural Network (CNN) models are applied to realize the end-to-end mapping from fundamental speech features to the classification result. The pathology-specific features used for assessment can be learned implicitly by the neural network model. Class Activation Mapping (CAM) method is utilized to visualize how those features contribute to the assessment result. Our experimental results show that the end-to-end approach outperforms the conventional two-step approach in the classification task, and confirm that the CNN model is able to learn impairment-related features that are similar to human-designed features. The experimental results also suggest that CNN model performs better than sequence-to-one GRU-RNN model in this specific task
    corecore