Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection
Embedding audio signal segments into vectors of fixed dimensionality is
attractive because all subsequent processing, for example modeling,
classification, or indexing, becomes easier and more efficient. The previously
proposed Audio Word2Vec was shown to represent audio segments for spoken words
as such vectors, carrying information about the phonetic structures of the
signal segments. However, each linguistic unit (word, syllable, or phoneme in
text form) corresponds to an unlimited number of audio segments, whose vector
representations are inevitably spread over the embedding space, causing
confusion. It is therefore desirable to better cluster the audio embeddings so
that those corresponding to the same linguistic unit are more compactly
distributed. In this paper, inspired by Siamese networks, we propose several
approaches to this goal: identifying positive and negative pairs from
unlabeled data for Siamese-style training, disentangling acoustic factors such
as speaker characteristics from the audio embedding, handling unbalanced data
distributions, and having the embedding process learn from the adjacency
relationships among data points. All of this can be done in an unsupervised
way. Improved performance was obtained in preliminary experiments on the
LibriSpeech data set, including analysis of clustering characteristics and
applications to spoken term detection.
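
As an illustration of the Siamese-style training described above, here is a
minimal sketch of a contrastive pair loss over audio embeddings; the `margin`
value and the shared encoder are placeholder assumptions, not the authors'
actual configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_pair_loss(emb_a, emb_b, is_positive, margin=1.0):
    """Siamese-style pair loss: pull embeddings of positive pairs
    (assumed to come from the same linguistic unit) together, and
    push negative pairs apart until they are at least `margin` away.

    emb_a, emb_b: (batch, dim) embeddings from a shared audio encoder.
    is_positive:  (batch,) float tensor, 1.0 for positive pairs, else 0.0.
    """
    dist = F.pairwise_distance(emb_a, emb_b)            # distance per pair
    pos_term = is_positive * dist.pow(2)                # positives: shrink
    neg_term = (1.0 - is_positive) * F.relu(margin - dist).pow(2)  # negatives: enforce margin
    return (pos_term + neg_term).mean()
```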
Learning acoustic word embeddings with phonetically associated triplet network
Previous research on acoustic word embeddings used in query-by-example spoken
term detection has shown remarkable performance improvements when using a
triplet network. However, the triplet network is trained using only limited
information about the acoustic similarity between words. In this paper, we
propose a novel architecture, the phonetically associated triplet network
(PATN), which aims to increase the discriminative power of acoustic word
embeddings by utilizing phonetic information as well as word identity. The
proposed model is trained to minimize a combined loss function, formed by
adding a cross-entropy loss at the lower layer of an LSTM-based triplet
network. We observed that the proposed method performs significantly better
than the baseline triplet network on a word discrimination task with the WSJ
dataset, resulting in over 20% relative improvement in recall rate at 1.0
false alarm per hour. Finally, we examined the generalization ability by
conducting an out-of-domain test on the RM dataset.

Comment: 5 pages, 4 figures, submitted to ICASSP 201
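
A minimal sketch of the combined objective described above, assuming a
standard hinge-based triplet loss plus an auxiliary cross-entropy term; the
mixing weight `alpha` and the phone-classification head are illustrative
assumptions.

```python
import torch
import torch.nn.functional as F

def patn_style_loss(anchor, positive, negative, phone_logits, phone_targets,
                    margin=0.5, alpha=0.1):
    """Combined loss in the spirit of PATN: a triplet loss on the
    word-level embeddings plus a cross-entropy loss computed from a
    phone classifier attached to a lower LSTM layer.
    `alpha` (the mixing weight) is an assumed hyperparameter.
    """
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    ce = F.cross_entropy(phone_logits, phone_targets)  # phonetic supervision
    return triplet + alpha * ce
```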
Integrate Document Ranking Information into Confidence Measure Calculation for Spoken Term Detection
This paper proposes an algorithm to improve the calculation of confidence
measures for spoken term detection (STD). Given an input query term, the
algorithm first calculates a measure named the document ranking weight for
each document in the speech database, reflecting its relevance to the query
term, by summing all the confidence measures of the hypothesized term
occurrences in that document. The confidence measure of each term occurrence
is then re-estimated through linear interpolation with the calculated document
ranking weight, improving its reliability by integrating document-level
information. Experiments are conducted on three standard STD tasks, for Tamil,
Vietnamese, and English. The experimental results demonstrate that the
proposed algorithm achieves consistent improvements over the state-of-the-art
method for confidence measure calculation. Furthermore, the algorithm remains
effective even when a high-accuracy speech recognizer is not available, which
makes it applicable to languages with limited speech resources.

Comment: 4 page
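
A minimal sketch of the two steps described above, in plain Python; the
interpolation weight `lam`, the max-normalization, and the data layout are
illustrative assumptions, not the paper's exact formulation.

```python
from collections import defaultdict

def rescore_occurrences(occurrences, lam=0.5):
    """occurrences: list of (doc_id, confidence) pairs for the hypothesized
    occurrences of one query term across the speech database.

    Step 1: document ranking weight = sum of occurrence confidences per document.
    Step 2: re-estimate each occurrence's confidence by linearly interpolating
    it with its document's (normalized) ranking weight.
    """
    if not occurrences:
        return []
    doc_weight = defaultdict(float)
    for doc_id, conf in occurrences:
        doc_weight[doc_id] += conf
    max_w = max(doc_weight.values()) or 1.0        # normalize for comparability
    return [(doc_id, lam * conf + (1.0 - lam) * doc_weight[doc_id] / max_w)
            for doc_id, conf in occurrences]
```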
Personalized Acoustic Modeling by Weakly Supervised Multi-Task Deep Learning using Acoustic Tokens Discovered from Unlabeled Data
It is well known that recognizers personalized to each user are much more
effective than user-independent recognizers. With the popularity of
smartphones today, it is not difficult to collect a large set of audio data
for each user, but it is difficult to transcribe it. However, it is now
possible to automatically discover acoustic tokens from unlabeled personal
data in an unsupervised way. We therefore propose a multi-task deep learning
framework called a phoneme-token deep neural network (PTDNN), jointly trained
from unsupervised acoustic tokens discovered from unlabeled data and very
limited transcribed data, for personalized acoustic modeling. We term this
scenario "weakly supervised". The underlying intuition is that the high degree
of similarity between the HMM states of acoustic token models and phoneme
models may help them learn from each other in this multi-task learning
framework. Initial experiments on a personalized audio data set recorded from
Facebook posts demonstrated that substantial improvements can be achieved in
both frame accuracy and word accuracy over widely used baselines such as fDLR,
speaker codes, and lightly supervised adaptation. This approach complements
existing speaker adaptation approaches and can be used jointly with such
techniques to yield improved results.

Comment: 5 pages, 5 figures, published in IEEE ICASSP 201
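
A minimal sketch of the multi-task idea: a shared encoder with two softmax
heads, one for phonemes (from the limited transcribed data) and one for the
automatically discovered acoustic tokens. Layer sizes and output counts are
illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskAcousticNet(nn.Module):
    """Shared feature layers with two task-specific output heads,
    trained jointly on phoneme labels and discovered acoustic tokens."""
    def __init__(self, feat_dim=40, hidden=512, n_phones=42, n_tokens=256):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.phone_head = nn.Linear(hidden, n_phones)   # supervised task
        self.token_head = nn.Linear(hidden, n_tokens)   # acoustic-token task

    def forward(self, x):
        h = self.shared(x)
        return self.phone_head(h), self.token_head(h)
```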
Time Series Classification using the Hidden-Unit Logistic Model
We present a new model for time series classification, called the hidden-unit
logistic model, that uses binary stochastic hidden units to model latent
structure in the data. The hidden units are connected in a chain structure
that models temporal dependencies in the data. Compared to prior models for
time series classification, such as the hidden conditional random field, our
model can represent very complex decision boundaries because the number of
latent states grows exponentially with the number of hidden units. We
demonstrate the strong performance of our model in experiments on a variety of
(computer vision) tasks, including handwritten character recognition, speech
recognition, facial expression recognition, and action recognition. We also
present a state-of-the-art system for facial action unit detection based on
the hidden-unit logistic model.

Comment: 17 pages, 4 figures, 3 table
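
To make the exponential-capacity claim concrete, here is the count of latent
configurations, given the paper's stated setup of binary hidden units per time
step (the HCRF comparison figure is our illustration, not a quotation):

```latex
z_t \in \{0,1\}^H
\;\Longrightarrow\;
\#\{\text{latent configurations per step}\} = 2^H,
\qquad \text{e.g. } H = 10 \Rightarrow 2^{10} = 1024,
% where a single-state-variable model such as an HCRF
% would need 1024 explicit states to match.
```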
Streaming Small-Footprint Keyword Spotting using Sequence-to-Sequence Models
We develop streaming keyword spotting systems using a recurrent neural
network transducer (RNN-T) model: an all-neural, end-to-end trained,
sequence-to-sequence model which jointly learns acoustic and language model
components. Our models are trained to predict either phonemes or graphemes as
subword units, thus allowing us to detect arbitrary keyword phrases, without
any out-of-vocabulary words. In order to adapt the models to the requirements
of keyword spotting, we propose a novel technique which biases the RNN-T system
towards a specific keyword of interest.
Our systems are compared against a strong sequence-trained, connectionist
temporal classification (CTC)-based "keyword-filler" baseline, which is
augmented with a separate phoneme language model. Overall, our RNN-T system
with the proposed biasing technique significantly improves performance over
the baseline system.

Comment: To appear in Proceedings of IEEE ASRU 201
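
For context on the baseline, here is a minimal sketch of a keyword-filler
style detection score: the best monotonic alignment of the keyword's subword
sequence against frame posteriors, compared with a filler (background) score.
The dynamic program and the filler definition are generic assumptions, not the
paper's exact baseline.

```python
import numpy as np

def keyword_filler_score(log_probs, keyword):
    """log_probs: (T, V) frame-level log posteriors over subword units.
    keyword: list of unit indices. Returns a detection score: the best
    monotonic-alignment log-likelihood of the keyword minus a filler
    score that lets every frame take its most likely unit.
    """
    T, _ = log_probs.shape
    K = len(keyword)
    dp = np.full((T, K), -np.inf)
    dp[0, 0] = log_probs[0, keyword[0]]
    for t in range(1, T):
        for k in range(K):
            stay = dp[t - 1, k]                       # remain on unit k
            advance = dp[t - 1, k - 1] if k > 0 else -np.inf  # move to next unit
            dp[t, k] = max(stay, advance) + log_probs[t, keyword[k]]
    keyword_score = dp[T - 1, K - 1]
    filler_score = log_probs.max(axis=1).sum()        # background model
    return keyword_score - filler_score
```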
Information Extraction from Scientific Literature for Method Recommendation
As a research community grows, more and more papers are published each year.
As a result, there is increasing demand for improved methods of finding
relevant papers, automatically understanding their key ideas, and recommending
potential methods for a target problem. Despite advances in search engines, it
is still hard to identify new technologies according to a researcher's needs.
Due to the large variety of domains and extremely limited annotated resources,
there has been relatively little work on leveraging natural language
processing for scientific recommendation. In this proposal, we aim at making
scientific recommendations by extracting scientific terms from a large
collection of scientific papers and organizing the terms into a knowledge
graph. In preliminary work, we trained a scientific term extractor using a
small amount of annotated data and obtained state-of-the-art performance by
leveraging a large number of unannotated papers through multiple
semi-supervised approaches. We propose to construct a knowledge graph in a way
that makes minimal use of hand-annotated data, using only the extracted terms,
unsupervised relational signals such as co-occurrence, and structured external
resources such as Wikipedia. Latent relations between scientific terms can be
learned from the graph. Recommendations will be made through graph inference
for both observed and unobserved relational pairs.

Comment: Thesis Proposal. arXiv admin note: text overlap with arXiv:1708.0607
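
A minimal sketch of the co-occurrence signal mentioned above: building a
weighted term graph from per-paper term lists with networkx. The
term-extraction step is assumed to have already produced the lists, and
counting co-occurring papers as edge weights is an illustrative choice.

```python
from itertools import combinations
import networkx as nx

def build_cooccurrence_graph(papers_terms):
    """papers_terms: iterable of term lists, one per paper.
    Edges connect terms that co-occur in a paper; weights count
    how many papers each pair co-occurs in."""
    g = nx.Graph()
    for terms in papers_terms:
        for a, b in combinations(sorted(set(terms)), 2):
            if g.has_edge(a, b):
                g[a][b]["weight"] += 1
            else:
                g.add_edge(a, b, weight=1)
    return g

# Latent relations could then be learned from this graph,
# e.g. via node embeddings or link prediction.
graph = build_cooccurrence_graph([
    ["LSTM", "speech recognition", "CTC"],
    ["LSTM", "keyword spotting"],
])
print(graph["CTC"]["LSTM"]["weight"])  # 1
```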
End-to-end Language Identification using NetFV and NetVLAD
In this paper, we apply NetFV and NetVLAD layers to the end-to-end language
identification task. NetFV and NetVLAD layers are differentiable
implementations of the standard Fisher Vector and Vector of Locally Aggregated
Descriptors (VLAD) methods, respectively. Both can encode a sequence of
feature vectors into a fixed-dimensional vector, which is essential for
processing variable-length utterances. We first discuss the relationships and
differences between the classical i-vector and the aforementioned encoding
schemes. Then, we construct a flexible end-to-end framework comprising a
convolutional neural network (CNN) architecture and an encoding layer (NetFV
or NetVLAD) for the language identification task. Experimental results on the
NIST LRE 2007 closed-set task show that the proposed system achieves
significant EER reductions over both the conventional i-vector baseline and
the CNN temporal average pooling system.

Comment: Accepted for ISCSLP 201
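
A minimal sketch of a NetVLAD encoding layer, assuming D-dimensional frame
features and K learned clusters; the initialization and normalization order
follow the common formulation and are assumptions here, not the paper's exact
layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Differentiable VLAD: softly assign each frame to K clusters,
    accumulate residuals to the cluster centers, then normalize."""
    def __init__(self, dim=64, clusters=8):
        super().__init__()
        self.assign = nn.Linear(dim, clusters)          # soft-assignment logits
        self.centers = nn.Parameter(torch.randn(clusters, dim))

    def forward(self, x):                               # x: (B, T, D)
        a = F.softmax(self.assign(x), dim=-1)           # (B, T, K)
        resid = x.unsqueeze(2) - self.centers           # (B, T, K, D)
        vlad = (a.unsqueeze(-1) * resid).sum(dim=1)     # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)     # (B, K*D)
```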
Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention
Keyword spotting (KWS) and speaker verification (SV) have been studied
independently, although it is known that the acoustic and speaker domains are
complementary. In this paper, we propose a multi-task network that performs
KWS and SV simultaneously to fully utilize the interrelated domain
information. The multi-task network tightly combines sub-networks aimed at
improving performance in challenging conditions, such as noisy environments,
open-vocabulary KWS, and short-duration SV, by introducing the novel
techniques of connectionist temporal classification (CTC)-based soft voice
activity detection (VAD) and global query attention. Frame-level acoustic and
speaker information is integrated with phonetically originated weights to form
a word-level global representation, which is then used for the aggregation of
feature vectors to generate discriminative embeddings. Our proposed approach
shows 4.06% and 26.71% relative improvements in equal error rate (EER) over
the baselines for the two tasks. We also present a visualization example and
the results of ablation experiments.

Comment: Accepted to Interspeech 202
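
A minimal sketch of query-based attention aggregation in the spirit of the
global query attention described above: a learned global query scores each
frame, and the resulting weights pool frame-level features into a single
embedding. The single-head design and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalQueryPooling(nn.Module):
    """Aggregate frame-level features into one utterance-level
    embedding using attention weights from a learned global query."""
    def __init__(self, dim=128):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))

    def forward(self, frames):                   # frames: (B, T, D)
        scores = frames @ self.query             # (B, T)
        weights = F.softmax(scores, dim=1)       # attention over time
        return (weights.unsqueeze(-1) * frames).sum(dim=1)  # (B, D)
```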
An End-to-End Approach to Automatic Speech Assessment for Cantonese-speaking People with Aphasia
Conventional automatic assessment of pathological speech usually follows two
main steps: (1) extraction of pathology-specific features; (2) classification
or regression on the extracted features. Given the great variety of speech and
language disorders, feature design is never a straightforward task, yet it is
crucial to the performance of assessment. This paper presents an end-to-end
approach to automatic speech assessment for Cantonese-speaking People With
Aphasia (PWA). The assessment is formulated as a binary classification task
that discriminates PWA with high scores on subjective assessment from those
with low scores. Sequence-to-one Recurrent Neural Network with Gated Recurrent
Units (GRU-RNN) and Convolutional Neural Network (CNN) models are applied to
realize the end-to-end mapping from fundamental speech features to the
classification result. The pathology-specific features used for assessment
can thus be learned implicitly by the neural network model. The Class
Activation Mapping (CAM) method is used to visualize how those features
contribute to the assessment result. Our experimental results show that the
end-to-end approach outperforms the conventional two-step approach on the
classification task, and confirm that the CNN model is able to learn
impairment-related features similar to human-designed features. The
experimental results also suggest that the CNN model performs better than the
sequence-to-one GRU-RNN model on this specific task.
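
A minimal sketch of class activation mapping as used for the visualization
above, assuming a CNN that ends in global average pooling followed by a single
linear classifier; variable names are illustrative.

```python
import torch

def class_activation_map(feature_maps, fc_weight, class_idx):
    """feature_maps: (C, T) final conv features over time (or (C, H, W)).
    fc_weight: (num_classes, C) weights of the linear layer that follows
    global average pooling. Returns the CAM for `class_idx`: per-position
    scores showing which regions drive that class's logit.
    """
    w = fc_weight[class_idx]                           # (C,)
    cam = torch.einsum("c,c...->...", w, feature_maps) # weighted sum over channels
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)                    # normalize to [0, 1]
```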