9,337 research outputs found
Wavelet transforms for non-uniform speech recognition
An algorithm for nonuniform speech segmentation and its application in speech recognition systems is presented. A method based on the Modulated Gaussian Wavelet Transform based Speech Analyser (MGWTSA) and the subsequent parametrization block is used to transform a uniform signal into a set of nonuniformly separated frames, with the accurate information being fed into a speech recognition system. The algorithm needs a frame characterizing the signal where necessary, trying to reduce the number of frames per signal as much as possible, without an appreciable reduction in the recognition rate of the system.Peer ReviewedPostprint (published version
Integrating user-centred design in the development of a silent speech interface based on permanent magnetic articulography
Abstract: A new wearable silent speech interface (SSI) based on Permanent Magnetic Articulography (PMA) was developed with the involvement of end users in the design process. Hence, desirable features such as appearance, port-ability, ease of use and light weight were integrated into the prototype. The aim of this paper is to address the challenges faced and the design considerations addressed during the development. Evaluation on both hardware and speech recognition performances are presented here. The new prototype shows a com-parable performance with its predecessor in terms of speech recognition accuracy (i.e. ~95% of word accuracy and ~75% of sequence accuracy), but significantly improved appearance, portability and hardware features in terms of min-iaturization and cost
Speech systems research at Texas Instruments
An assessment of automatic speech processing technology is presented. Fundamental problems in the development and the deployment of automatic speech processing systems are defined and a technology forecast for speech systems is presented
Multimodal One-Shot Learning of Speech and Images
Imagine a robot is shown new concepts visually together with spoken tags,
e.g. "milk", "eggs", "butter". After seeing one paired audio-visual example per
class, it is shown a new set of unseen instances of these objects, and asked to
pick the "milk". Without receiving any hard labels, could it learn to match the
new continuous speech input to the correct visual instance? Although unimodal
one-shot learning has been studied, where one labelled example in a single
modality is given per class, this example motivates multimodal one-shot
learning. Our main contribution is to formally define this task, and to propose
several baseline and advanced models. We use a dataset of paired spoken and
visual digits to specifically investigate recent advances in Siamese
convolutional neural networks. Our best Siamese model achieves twice the
accuracy of a nearest neighbour model using pixel-distance over images and
dynamic time warping over speech in 11-way cross-modal matching.Comment: 5 pages, 1 figure, 3 tables; accepted to ICASSP 201
Robust speech recognition based on a Bayesian prediction approach
We study a category of robust speech recognition problem in which mismatches exist between training and testing conditions, and no accurate knowledge of the mismatch mechanism is available. The only available information is the test data along with a set of pretrained Gaussian mixture continuous density hidden Markov models (CDHMMs). We investigate the problem from the viewpoint of Bayesian prediction. A simple prior distribution, namely constrained uniform distribution, is adopted to characterize the uncertainty of the mean vectors of the CDHMMs. Two methods, namely a model compensation technique based on Bayesian predictive density and a robust decision strategy called Viterbi Bayesian predictive classification are studied. The proposed methods are compared with the conventional Viterbi decoding algorithm in speaker-independent recognition experiments on isolated digits and TI connected digit strings (TIDTGITS), where the mismatches between training and testing conditions are caused by: (1) additive Gaussian white noise, (2) each of 25 types of actual additive ambient noises, and (3) gender difference. The experimental results show that the adopted prior distribution and the proposed techniques help to improve the performance robustness under the examined mismatch conditions.published_or_final_versio
Voice input/output capabilities at Perception Technology Corporation
Condensed resumes of key company personnel at the Perception Technology Corporation are presented. The staff possesses recognition, speech synthesis, speaker authentication, and language identification. Hardware and software engineers' capabilities are included
- …