352 research outputs found

Unsupervised Spoken Term Detection with Spoken Queries by Multi-level Acoustic Patterns with Varying Model Granularity

Authors: Chan Chun-an, Chung Cheng-Tao, Lee Lin-shan
Publication date: 07/09/2015

This paper presents a new approach for unsupervised Spoken Term Detection with spoken queries using multiple sets of acoustic patterns automatically discovered from the target corpus. The different pattern HMM configurations (number of states per model, number of distinct models, number of Gaussians per state) form a three-dimensional model granularity space. Different sets of acoustic patterns automatically discovered at different points properly distributed over this three-dimensional space are complementary to one another and can thus jointly capture the characteristics of the spoken terms. By representing the spoken content and spoken query as sequences of acoustic patterns, a series of approaches for matching the pattern index sequences while considering the signal variations are developed. In this way, not only can the on-line computation load be reduced, but the signal distributions caused by different speakers and acoustic conditions can be reasonably taken care of. The results indicate that this approach significantly outperformed the unsupervised feature-based DTW baseline by 16.16% in mean average precision on the TIMIT corpus. Comment: Accepted by ICASSP 201
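
For readers unfamiliar with the feature-based DTW baseline mentioned in the abstract, the following is a minimal sketch of textbook dynamic time warping between two sequences of feature vectors. It is not the paper's implementation: the function name `dtw_distance`, the Euclidean local distance, and the unconstrained step pattern are all assumptions made for illustration.

```python
import math


def dtw_distance(query, doc):
    """Textbook DTW alignment cost between two sequences of feature vectors.

    Assumption: Euclidean local distance and the basic three-way step
    pattern; the abstract does not state which variant the baseline uses.
    """
    n, m = len(query), len(doc)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning query[:i] with doc[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(query[i - 1], doc[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # query frame repeated against doc
                cost[i][j - 1],      # doc frame repeated against query
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]
```

A time-stretched copy of a sequence aligns at zero cost, which is the property that makes DTW a natural baseline for matching spoken queries of varying duration.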

arXiv.org e-Print Archive

Crossref

Personalized Acoustic Modeling by Weakly Supervised Multi-Task Deep Learning using Acoustic Tokens Discovered from Unlabeled Data

Authors: Chung Cheng-Tao, Lee Hung-Yi, Lee Lin-Shan, Wei Cheng-Kuan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 23/06/2017

It is well known that recognizers personalized to each user are much more effective than user-independent recognizers. With the popularity of smartphones today, although it is not difficult to collect a large set of audio data for each user, it is difficult to transcribe it. However, it is now possible to automatically discover acoustic tokens from unlabeled personal data in an unsupervised way. We therefore propose a multi-task deep learning framework called a phoneme-token deep neural network (PTDNN), jointly trained from unsupervised acoustic tokens discovered from unlabeled data and very limited transcribed data for personalized acoustic modeling. We term this scenario "weakly supervised". The underlying intuition is that the high degree of similarity between the HMM states of acoustic token models and phoneme models may help them learn from each other in this multi-task learning framework. Initial experiments performed over a personalized audio data set recorded from Facebook posts demonstrated that very good improvements can be achieved in both frame accuracy and word accuracy over popularly considered baselines such as fDLR, speaker code and lightly supervised adaptation. This approach complements existing speaker adaptation approaches and can be used jointly with such techniques to yield improved results. Comment: 5 pages, 5 figures, published in IEEE ICASSP 201
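
The multi-task idea in the abstract, one network trained on phoneme targets and acoustic-token targets at once, can be illustrated with a small sketch. Everything specific here is an assumption rather than the PTDNN as published: a single shared ReLU layer, two softmax heads, and a mixing weight `alpha` between the two cross-entropy terms.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def multitask_loss(x, shared, phone_head, token_head,
                   phone_label, token_label, alpha=0.5):
    """Joint loss for one frame with a shared layer and two output heads.

    Hypothetical layout: `shared`, `phone_head`, `token_head` are each a
    (weights, bias) pair; `alpha` weights the phoneme task.
    """
    hidden = [max(0.0, v) for v in linear(shared[0], shared[1], x)]  # ReLU
    p_phone = softmax(linear(phone_head[0], phone_head[1], hidden))
    p_token = softmax(linear(token_head[0], token_head[1], hidden))
    # Weighted sum of the two cross-entropy terms
    return -(alpha * math.log(p_phone[phone_label])
             + (1.0 - alpha) * math.log(p_token[token_label]))
```

Because both heads read the same hidden layer, gradients from the plentiful token labels would also shape the representation used for the scarce phoneme labels, which is the intuition the abstract describes.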

arXiv.org e-Print Archive

Crossref

Intelligent system for spoken term detection using the belief combination

Authors: Khan Wasiq, Kuru Kaya
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 13/02/2017

Spoken Term Detection (STD) can be considered a sub-task of automatic speech recognition that aims to extract partial information from speech signals in the form of query utterances. A variety of STD techniques in the literature employ a single source of evidence to determine whether a query utterance matches. In this manuscript, we develop an acoustic signal processing based approach for STD that incorporates techniques for silence removal, dynamic noise filtration, and evidence combination using Dempster-Shafer Theory (DST). A ‘spectral-temporal features based voiced segment detection’ and an ‘energy and zero-crossing rate based unvoiced segment detection’ are built to remove the silence segments in the speech signal. Comprehensive experiments have been performed on large speech datasets, and satisfactory results have been achieved with the proposed approach. Our approach improves on existing speaker-dependent STD approaches, specifically the reliability of query utterance spotting, by combining evidence from multiple belief sources.
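
The evidence combination step relies on Dempster-Shafer Theory. As an illustration, here is a minimal sketch of Dempster's rule of combination for two mass functions. The function name `dempster_combine`, the dictionary representation, and the example frame {match, mismatch} are assumptions; the abstract does not say how the belief sources are encoded.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function is a dict mapping a frozenset of hypotheses to its
    mass, with masses summing to 1 (an assumed representation).
    """
    combined = {}
    conflict = 0.0
    for set_a, mass_a in m1.items():
        for set_b, mass_b in m2.items():
            overlap = set_a & set_b
            if overlap:
                combined[overlap] = combined.get(overlap, 0.0) + mass_a * mass_b
            else:
                # Mass assigned to contradictory hypotheses
                conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Renormalise by the non-conflicting mass
    return {hyp: mass / (1.0 - conflict) for hyp, mass in combined.items()}


# Hypothetical example: two sources that each partly support "match"
MATCH = frozenset({"match"})
EITHER = frozenset({"match", "mismatch"})
fused = dempster_combine({MATCH: 0.6, EITHER: 0.4}, {MATCH: 0.7, EITHER: 0.3})
```

In this example the fused belief in "match" exceeds that of either source alone, which is how agreeing sources reinforce one another under the rule.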

CLoK

Crossref

E-space: Manchester Metropolitan University's Research Repository

Adaptive framing based similarity measurement between time warped speech signals using Kalman filter

Authors: A Akila, B Yegnanarayana, C Chun-An, G Ravindran, J Tejedor, K Thambiratmann, K Wasiq, Keeley Crockett, MM Pour, Muhammad Bilal, N Dave, OA Soluade, P Sharma, RR Lawrence, RS Tushar, S Dhingra, SA Zahorian, SG Mohinder, T Javier, Wasiq Khan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 17/04/2018

Similarity measurement between speech signals aims at calculating the degree of similarity using acoustic features, and has been receiving much interest due to the processing of large volumes of multimedia information. However, dynamic properties of speech signals, such as varying silence segments and the time warping factor, make it more challenging to measure the similarity between speech signals. This manuscript extends our earlier research on adaptive framing based similarity measurement between speech signals using a Kalman filter. Silence removal is enhanced by integrating multiple features for voiced and unvoiced speech segment detection. The adaptive frame size measurement is improved by using the acceleration/deceleration phenomenon of object linear motion. A dominant feature set is used to represent the speech signals, along with pre-calculated model parameters that are set by offline tuning of a Kalman filter. Performance is evaluated on additional datasets to assess the impact of the proposed model and silence removal approach on time warped speech similarity measurement. Detailed statistical results indicate an overall accuracy improvement from 91 to 98%, which demonstrates the superiority of the extended approach over our previous work on time warped continuous speech similarity measurement.
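
Since the abstract builds on a Kalman filter without giving its equations, a minimal scalar sketch may help orient readers. It assumes a random-walk state model with hypothetical noise variances `q` and `r`; the paper's actual state model, feature set, and tuned parameters are not given in the abstract and are not reproduced here.

```python
def kalman_1d(measurements, q=1e-3, r=1e-1, x0=0.0, p0=1.0):
    """Scalar Kalman filter over a sequence of noisy measurements.

    Assumptions: random-walk state (x_k = x_{k-1} + process noise),
    process variance q, measurement variance r, initial state x0 with
    variance p0. All defaults are illustrative values.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q                # predict: uncertainty grows by q
        gain = p / (p + r)       # Kalman gain
        x = x + gain * (z - x)   # update: move toward the measurement
        p = (1.0 - gain) * p     # posterior variance shrinks
        estimates.append(x)
    return estimates
```

Tuning `q` and `r` offline, as the abstract describes for its model parameters, trades responsiveness against smoothing: a larger `q` relative to `r` makes the estimate follow the measurements more closely.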

LJMU Research Online (Liverpool John Moores University)

Crossref

E-space: Manchester Metropolitan University's Research Repository