458 research outputs found
Feature Trajectory Dynamic Time Warping for Clustering of Speech Segments
Dynamic time warping (DTW) can be used to compute the similarity between two
sequences of generally differing length. We propose a modification to DTW that
performs individual and independent pairwise alignment of feature trajectories.
The modified technique, termed feature trajectory dynamic time warping (FTDTW),
is applied as a similarity measure in the agglomerative hierarchical clustering
of speech segments. Experiments using MFCC and PLP parametrisations extracted
from TIMIT and from the Spoken Arabic Digit Dataset (SADD) show consistent and
statistically significant improvements in the quality of the resulting clusters
in terms of F-measure and normalised mutual information (NMI).Comment: 10 page
A summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition
We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.5 page(s
Investigation of Frame Alignments for GMM-based Digit-prompted Speaker Verification
Frame alignments can be computed by different methods in GMM-based speaker
verification. By incorporating a phonetic Gaussian mixture model (PGMM), we are
able to compare the performance using alignments extracted from the deep neural
networks (DNN) and the conventional hidden Markov model (HMM) in digit-prompted
speaker verification. Based on the different characteristics of these two
alignments, we present a novel content verification method to improve the
system security without much computational overhead. Our experiments on the
RSR2015 Part-3 digit-prompted task show that, the DNN based alignment performs
on par with the HMM alignment. The results also demonstrate the effectiveness
of the proposed Kullback-Leibler (KL) divergence based scoring to reject speech
with incorrect pass-phrases.Comment: accepted by APSIPA ASC 201
A Real-Time Lyrics Alignment System Using Chroma And Phonetic Features For Classical Vocal Performance
The goal of real-time lyrics alignment is to take live singing audio as input
and to pinpoint the exact position within given lyrics on the fly. The task can
benefit real-world applications such as the automatic subtitling of live
concerts or operas. However, designing a real-time model poses a great
challenge due to the constraints of only using past input and operating within
a minimal latency. Furthermore, due to the lack of datasets for real-time
models for lyrics alignment, previous studies have mostly evaluated with
private in-house datasets, resulting in a lack of standard evaluation methods.
This paper presents a real-time lyrics alignment system for classical vocal
performances with two contributions. First, we improve the lyrics alignment
algorithm by finding an optimal combination of chromagram and phonetic
posteriorgram (PPG) that capture melodic and phonetics features of the singing
voice, respectively. Second, we recast the Schubert Winterreise Dataset (SWD)
which contains multiple performance renditions of the same pieces as an
evaluation set for the real-time lyrics alignment.Comment: To Appear IEEE ICASSP 202
- …