PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords
This study presents a novel zero-shot user-defined keyword spotting model
that utilizes the audio-phoneme relationship of the keyword to improve
performance. Unlike previous approaches, which estimate similarity at the
utterance level, we use both utterance- and phoneme-level information. Our
proposed method comprises a two-stream speech encoder architecture, a
self-attention-based pattern extractor, and a phoneme-level detection loss for
high performance in various pronunciation environments. Based on experimental
results, our proposed model
outperforms the baseline model and achieves competitive performance compared
with full-shot keyword spotting models. Our proposed model significantly
improves the EER and AUC across all datasets, including familiar words, proper
nouns, and indistinguishable pronunciations, with an average relative
improvement of 67% and 80%, respectively. The implementation code of our
proposed model is available at https://github.com/ncsoft/PhonMatchNet
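The abstract reports gains in EER and AUC; a minimal sketch of how both metrics can be computed from raw detection scores (a plain threshold sweep in pure Python, not the authors' evaluation code) might look like:

```python
def roc_points(scores, labels):
    """Sweep thresholds over sorted scores; return (FPR, TPR) pairs."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

def eer(points):
    """Equal error rate: operating point where the false-accept rate
    meets the false-reject rate (1 - TPR)."""
    best = min(points, key=lambda p: abs(p[0] - (1 - p[1])))
    return (best[0] + (1 - best[1])) / 2
```

Lower EER and higher AUC are better, which is why the paper reports relative improvements on both.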
Study to determine potential flight applications and human factors design guidelines for voice recognition and synthesis systems
A study was conducted to determine potential commercial aircraft flight deck applications and implementation guidelines for voice recognition and synthesis. At first, a survey of voice recognition and synthesis technology was undertaken to develop a working knowledge base. Then, numerous potential aircraft and simulator flight deck voice applications were identified and each proposed application was rated on a number of criteria in order to achieve an overall payoff rating. The potential voice recognition applications fell into five general categories: programming, interrogation, data entry, switch and mode selection, and continuous/time-critical action control. The ratings of the first three categories showed the most promise of being beneficial to flight deck operations. Possible applications of voice synthesis systems were categorized as automatic or pilot selectable and many were rated as being potentially beneficial. In addition, voice system implementation guidelines and pertinent performance criteria are proposed. Finally, the findings of this study are compared with those made in a recent NASA study of a 1995 transport concept
Matching Latent Encoding for Audio-Text based Keyword Spotting
Using audio and text embeddings jointly for Keyword Spotting (KWS) has shown
high-quality results, but the key challenge of how to semantically align two
embeddings for multi-word keywords of different sequence lengths remains
largely unsolved. In this paper, we propose an audio-text-based end-to-end
model architecture for flexible KWS, which builds upon
learned audio and text embeddings. Our architecture uses a novel dynamic
programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally
partition the audio sequence into the same length as the word-based text
sequence using the monotonic alignment of spoken content. Our proposed model
consists of an encoder block to get audio and text embeddings, a projector
block to project individual embeddings to a common latent space, and an
audio-text aligner containing a novel DSP algorithm, which aligns the audio and
text embeddings to determine if the spoken content is the same as the text.
Experimental results show that our DSP is more effective than other
partitioning schemes, and the proposed architecture outperforms the
state of the art on the public dataset, improving Area Under the ROC
Curve (AUC) and Equal Error Rate (EER) by 14.4% and 28.9%, respectively
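The core DSP idea, monotonically partitioning the audio frames into as many contiguous segments as there are words and scoring each segment against its word, can be illustrated with a toy dynamic program. The mean-pooling and cosine scoring below are illustrative assumptions, not the paper's exact formulation:

```python
import math

def segment_score(frames, i, j, word_vec):
    """Cosine similarity between mean-pooled frames[i:j] and a word embedding."""
    d = len(word_vec)
    mean = [sum(f[k] for f in frames[i:j]) / (j - i) for k in range(d)]
    dot = sum(a * b for a, b in zip(mean, word_vec))
    na = math.sqrt(sum(a * a for a in mean))
    nb = math.sqrt(sum(b * b for b in word_vec))
    return dot / (na * nb + 1e-9)

def partition(frames, words):
    """DP: split frames into len(words) contiguous segments, maximizing
    total segment-to-word similarity under a monotonic alignment."""
    T, W = len(frames), len(words)
    NEG = float("-inf")
    dp = [[NEG] * (W + 1) for _ in range(T + 1)]
    back = [[0] * (W + 1) for _ in range(T + 1)]
    dp[0][0] = 0.0
    for w in range(1, W + 1):
        for t in range(w, T + 1):
            for s in range(w - 1, t):  # segment (s, t] aligned to word w
                if dp[s][w - 1] == NEG:
                    continue
                cand = dp[s][w - 1] + segment_score(frames, s, t, words[w - 1])
                if cand > dp[t][w]:
                    dp[t][w] = cand
                    back[t][w] = s
    # walk back pointers to recover the segment boundaries
    bounds, t = [], T
    for w in range(W, 0, -1):
        bounds.append((back[t][w], t))
        t = back[t][w]
    return bounds[::-1], dp[T][W]
```

After partitioning, the audio sequence has the same length as the word-based text sequence, so the two can be compared position by position.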
Automated speech and audio analysis for semantic access to multimedia
The deployment and integration of audio processing tools can enhance the semantic annotation of multimedia content, and as a consequence, improve the effectiveness of conceptual access tools. This paper overviews the various ways in which automatic speech and audio analysis can contribute to increased granularity of automatically extracted metadata. A number of techniques will be presented, including the alignment of speech and text resources, large vocabulary speech recognition, keyword spotting and speaker classification. The applicability of techniques will be discussed from a media crossing perspective. The added value of the techniques and their potential contribution to the content value chain will be illustrated by the description of two (complementary) demonstrators for browsing broadcast news archives
Exploiting Phoneme Similarities in Hybrid HMM-ANN Keyword Spotting
We propose a technique for generating alternative models for keywords in a hybrid hidden Markov model - artificial neural network (HMM-ANN) keyword spotting paradigm. Given a base pronunciation for a keyword from the lookup dictionary, our algorithm generates a new model for the keyword which takes into account the systematic errors made by the neural network, while avoiding models that can be confused with other words in the language. The new keyword model improves the keyword detection rate while minimally increasing the number of false alarms
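One way such alternative keyword models could be generated is sketched below: substitute phonemes that the network systematically confuses (per a confusion table) and discard variants that collide with words already in the lexicon. The confusion probabilities, pruning threshold, and phoneme symbols are hypothetical, not taken from the paper:

```python
from itertools import product

def alternative_models(base, confusions, lexicon, min_prob=0.1):
    """Generate alternative phoneme sequences for a keyword by substituting
    phonemes the recogniser systematically confuses, and discard any variant
    that collides with a word already in the lexicon."""
    options = []
    for ph in base:
        # keep the original phoneme plus frequently-confused substitutes
        subs = [(ph, 1.0)]
        for alt, p in confusions.get(ph, {}).items():
            if p >= min_prob:
                subs.append((alt, p))
        options.append(subs)
    variants = []
    for combo in product(*options):
        seq = tuple(ph for ph, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        if seq != tuple(base) and seq not in lexicon:
            variants.append((seq, prob))
    return sorted(variants, key=lambda v: -v[1])
```

Each surviving variant could then be compiled into an additional HMM path for the keyword, which is how the detection rate can rise without a matching rise in false alarms.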
A phonetic-based approach to query-by-example spoken term detection
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-41822-8_63. Query-by-Example Spoken Term Detection (QbE-STD) tasks are usually addressed by representing speech signals as sequences of feature vectors by means of a parametrization step, and then using a pattern-matching technique to find candidate detections. In this paper, we propose a phoneme-based approach in which the acoustic frames are first converted into vectors representing the a posteriori probabilities of every phoneme. This strategy is especially useful when the language of the task is known a priori. We then show how this representation can be used for QbE-STD using both a Segmental Dynamic Time Warping algorithm and a graph-based method. The proposed approach has been evaluated on a QbE-STD task in Spanish, and the results show that it can be an adequate strategy for tackling this kind of problem
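The template-matching step on phoneme posteriorgrams can be sketched with plain DTW; Segmental DTW additionally slides warping bands over the utterance, which is omitted here for brevity. The negative-log inner-product frame distance is a common choice for posteriorgrams, assumed here rather than taken from the paper:

```python
import math

def frame_dist(p, q, eps=1e-9):
    """Distance between two phoneme posterior vectors:
    -log of their inner product (small eps avoids log(0))."""
    return -math.log(sum(a * b for a, b in zip(p, q)) + eps)

def dtw(query, utterance):
    """Plain DTW alignment cost between a query posteriorgram and an
    utterance posteriorgram (both lists of posterior vectors)."""
    Q, U = len(query), len(utterance)
    INF = float("inf")
    dp = [[INF] * (U + 1) for _ in range(Q + 1)]
    dp[0][0] = 0.0
    for i in range(1, Q + 1):
        for j in range(1, U + 1):
            d = frame_dist(query[i - 1], utterance[j - 1])
            dp[i][j] = d + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[Q][U]
```

Low alignment cost marks a candidate detection; the same posteriorgram representation also feeds the graph-based method the abstract mentions.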
An acoustic-phonetic approach in automatic Arabic speech recognition
In a large vocabulary speech recognition system the broad phonetic classification
technique is used instead of detailed phonetic analysis to overcome the variability in the
acoustic realisation of utterances. The broad phonetic description of a word is used as a
means of lexical access, where the lexicon is structured into sets of words sharing the
same broad phonetic labelling.
This approach has been applied to a large vocabulary isolated word Arabic speech
recognition system. Statistical studies have been carried out on 10,000 Arabic words
(converted to phonemic form) involving different combinations of broad phonetic
classes. Some particular features of the Arabic language have been exploited. The results
show that vowels represent about 43% of the total number of phonemes. They also show
that about 38% of the words can uniquely be represented at this level by using eight
broad phonetic classes. When detailed vowel identification is introduced, the
percentage of uniquely specified words rises to 83%. These results suggest that a fully detailed
phonetic analysis of the speech signal is perhaps unnecessary.
In the adopted word recognition model, the consonants are classified into four broad
phonetic classes, while the vowels are described by their phonemic form. A set of 100
words uttered by several speakers has been used to test the performance of the
implemented approach.
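The lexical-access scheme described above, collapsing consonants to broad classes while keeping vowels in phonemic form, can be sketched as follows; the class inventory and phoneme symbols are illustrative, not the actual Arabic inventory used in the study:

```python
from collections import defaultdict

# Hypothetical mapping: consonants collapse to a few broad classes,
# vowels (anything unmapped here) keep their phonemic identity.
BROAD = {
    "b": "PLOSIVE", "t": "PLOSIVE", "k": "PLOSIVE", "d": "PLOSIVE",
    "s": "FRICATIVE", "f": "FRICATIVE", "z": "FRICATIVE",
    "m": "NASAL", "n": "NASAL",
    "l": "LIQUID", "r": "LIQUID",
}

def broad_label(phonemes):
    """Consonants collapse to their broad class; vowels stay phonemic."""
    return tuple(BROAD.get(ph, ph) for ph in phonemes)

def build_lexicon(words):
    """Bucket the lexicon by broad label: recognition retrieves a bucket,
    and a verification step then picks among its members."""
    buckets = defaultdict(list)
    for word, phonemes in words.items():
        buckets[broad_label(phonemes)].append(word)
    return buckets
```

A bucket with a single member is a uniquely specified word; the 38% and 83% figures above count exactly those buckets under the two labelling schemes.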
In the implemented recognition model, three procedures have been developed, namely
voiced-unvoiced-silence segmentation, vowel detection and identification, and automatic
spectral transition detection between phonemes within a word. The accuracy of both the
V-UV-S and vowel recognition procedures is almost perfect. A broad phonetic
segmentation procedure has been implemented, which exploits information from the
above mentioned three procedures. Simple phonological constraints have been used to
improve the accuracy of the segmentation process. The resulting sequence of labels is
used for lexical access to retrieve the word, or a small set of words sharing the same broad
phonetic labelling. When more than one word candidate is retrieved, a verification
procedure chooses the most likely one