Fast and Accurate OOV Decoder on High-Level Features
This work proposes a novel approach to the out-of-vocabulary (OOV) keyword search
(KWS) task. The approach uses high-level features from an
automatic speech recognition (ASR) system, so-called phoneme posterior based
(PPB) features, for decoding. These features are obtained by calculating
time-dependent phoneme posterior probabilities from word lattices and then
smoothing them. For the PPB features we developed a novel OOV decoder that is very fast,
simple and efficient. Experimental results are presented on the
Georgian language from the IARPA Babel Program, which was the test language in
the OpenKWS 2016 evaluation campaign. The results show that, in terms of the maximum
term weighted value (MTWV) metric and computational speed, for single ASR
systems, the proposed approach significantly outperforms the state-of-the-art
approach based on using in-vocabulary proxies for OOV keywords in the indexed
database. A comparison of the two OOV KWS approaches on the fusion results of
nine different ASR systems demonstrates that the proposed OOV decoder
outperforms the proxy-based approach in terms of the MTWV metric at
comparable processing speed. Other important advantages of the OOV decoder
include extremely low memory consumption and simplicity of its implementation
and parameter optimization.
Comment: Interspeech 2017, August 2017, Stockholm, Sweden. 201
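The MTWV metric named in this abstract can be illustrated with a minimal sketch. It assumes the commonly used term-weighted value definition, TWV = 1 - mean(P_miss + beta * P_FA), with beta = 999.9 and non-target trials approximated by seconds of speech minus true occurrences. None of these details are stated in the abstract, and the function names and data layout are hypothetical.

```python
BETA = 999.9  # assumed cost/prior weight; not given in the abstract


def twv(detections, n_true, speech_seconds, threshold, beta=BETA):
    """Term-weighted value at one decision threshold.

    detections: {keyword: [(score, is_correct), ...]} (hypothetical layout)
    n_true:     {keyword: number of true occurrences}
    """
    # Keywords with no true occurrence are skipped, since P_miss is undefined.
    keywords = [kw for kw, n in n_true.items() if n > 0]
    total = 0.0
    for kw in keywords:
        kept = [ok for score, ok in detections.get(kw, []) if score >= threshold]
        n_correct = sum(kept)
        n_false = len(kept) - n_correct
        p_miss = 1.0 - n_correct / n_true[kw]
        # Assumption: one non-target trial per second of speech.
        p_fa = n_false / (speech_seconds - n_true[kw])
        total += p_miss + beta * p_fa
    return 1.0 - total / len(keywords)


def mtwv(detections, n_true, speech_seconds, beta=BETA):
    """Maximum TWV over all thresholds that occur as detection scores."""
    thresholds = sorted({s for hits in detections.values() for s, _ in hits})
    # 0.0 is the TWV of rejecting every detection (all misses, no false alarms).
    return max([0.0] + [twv(detections, n_true, speech_seconds, t, beta)
                        for t in thresholds])
```

Because beta is large, a single false alarm per keyword can cancel most of the credit for a correct detection, which is why the choice of threshold matters so much for this metric.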
Phonetic Searching
An improved method and apparatus are disclosed which use probabilistic techniques to map an input search string to a prestored audio file, and recognize certain portions of a search string phonetically. An improved interface is disclosed which permits users to input search strings, linguistics, phonetics, or a combination of both, and also allows logic functions to be specified by indicating how far separated specific phonemes are in time.
Georgia Tech Research Corporatio
Morphological Segmentation for Keyword Spotting
We explore the impact of morphological segmentation on keyword spotting (KWS). Despite potential benefits, state-of-the-art KWS systems do not use morphological information. In this paper, we augment a state-of-the-art KWS system with sub-word units derived from supervised and unsupervised morphological segmentations, and compare with phonetic and syllabic segmentations. Our experiments demonstrate that morphemes improve overall performance of KWS systems. Syllabic units, however, rival the performance of morphological units when used in KWS. By combining morphological, phonetic and syllabic segmentations, we demonstrate substantial performance gains.
United States. Intelligence Advanced Research Projects Activity (United States. Army Research Laboratory Contract W911NF-12-C-0013
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR
Spoken term detection ALBAYZIN 2014 evaluation: overview, systems, results, and discussion
The electronic version of this article is the complete one and can be found online at: http://dx.doi.org/10.1186/s13636-015-0063-8
Spoken term detection (STD) aims at retrieving data from a speech repository given a textual representation of the search term. It is currently receiving much interest due to the large volume of multimedia information. STD differs from automatic speech recognition (ASR) in that ASR is interested in all the terms/words that appear in the speech data, whereas STD focuses on a selected list of search terms that must be detected within the speech data. This paper presents the systems submitted to the STD ALBAYZIN 2014 evaluation, held as a part of the ALBAYZIN 2014 evaluation campaign within the context of the IberSPEECH 2014 conference. This is the first STD evaluation that deals with the Spanish language. The evaluation consists of retrieving the speech files that contain the search terms, indicating their start and end times within the appropriate speech file, along with a score value that reflects the confidence given to the detection of the search term. The evaluation is conducted on a Spanish spontaneous speech database, which comprises a set of talks from workshops and amounts to about 7 h of speech. We present the database, the evaluation metrics, the systems submitted to the evaluation, the results, and a detailed discussion. Four different research groups took part in the evaluation. Evaluation results show reasonable performance for a moderate out-of-vocabulary term rate. This paper compares the systems submitted to the evaluation and provides an in-depth analysis based on some search term properties (term length, in-vocabulary/out-of-vocabulary terms, single-word/multi-word terms, and in-language/foreign terms).
This work has been partly supported by project CMC-V2
(TEC2012-37585-C02-01) from the Spanish Ministry of Economy and
Competitiveness. This research was also funded by the European Regional
Development Fund, the Galician Regional Government (GRC2014/024,
“Consolidation of Research Units: AtlantTIC Project” CN2012/160)
Joint decoding of tandem and hybrid systems for improved keyword spotting on low resource languages
Copyright © 2015 ISCA. Keyword spotting (KWS) for low-resource languages has drawn increasing attention in recent years. The state-of-the-art KWS systems are based on lattices or Confusion Networks (CN) generated by Automatic Speech Recognition (ASR) systems. It has been shown that considerable KWS gains can be obtained by combining the keyword detection results from different forms of ASR systems, e.g., Tandem and Hybrid systems. This paper investigates an alternative combination scheme for KWS using joint decoding. This scheme treats a Tandem system and a Hybrid system as two separate streams, and makes a linear combination of individual acoustic model log-likelihoods. Joint decoding is more efficient as it requires just a single pass of decoding and a single pass of keyword search. Experiments on six Babel OP2 development languages show that joint decoding is capable of providing consistent gains over each individual system. Moreover, it is possible to efficiently rescore the joint decoding lattices with Tandem or Hybrid acoustic models, and further KWS gains can be obtained by merging the detection posting lists from the joint decoding lattices and rescored lattices
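The stream combination described in this abstract, a linear combination of the Tandem and Hybrid acoustic model log-likelihoods, can be sketched as follows. The function name, the frame-by-state list layout, and the constraint that the two weights sum to one are assumptions for illustration. The abstract does not say how the weights are chosen.

```python
def combine_log_likelihoods(tandem_ll, hybrid_ll, tandem_weight=0.5):
    """Linearly combine two streams of per-frame, per-state log-likelihoods.

    tandem_ll, hybrid_ll: lists of frames, each frame a list of
    log-likelihoods over the same state set (hypothetical layout).
    tandem_weight: assumed interpolation weight; the Hybrid stream
    receives 1 - tandem_weight.
    """
    if len(tandem_ll) != len(hybrid_ll):
        raise ValueError("streams must cover the same number of frames")
    hybrid_weight = 1.0 - tandem_weight
    combined = []
    for tandem_frame, hybrid_frame in zip(tandem_ll, hybrid_ll):
        if len(tandem_frame) != len(hybrid_frame):
            raise ValueError("streams must share the same state set")
        combined.append([tandem_weight * t + hybrid_weight * h
                         for t, h in zip(tandem_frame, hybrid_frame)])
    return combined
```

Because the combination happens at the acoustic score level, the decoder sees a single score per frame and state, which is consistent with the single decoding pass and single keyword search pass mentioned above.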
Attention-Based Models for Text-Dependent Speaker Verification
Attention-based models have recently shown great performance on a range of
tasks, such as speech recognition, machine translation, and image captioning
due to their ability to summarize relevant information that spans the
entire length of an input sequence. In this paper, we analyze the use of
attention mechanisms for the problem of sequence summarization in our end-to-end
text-dependent speaker recognition system. We explore different topologies of
the attention layer and their variants, and compare different pooling methods on
the attention weights. Ultimately, we show that attention-based models can
improve the Equal Error Rate (EER) of our speaker verification system by
14% relative compared to our non-attention LSTM baseline model.
Comment: Submitted to ICASSP 201
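As a rough illustration of attention used for sequence summarization, the sketch below pools frame-level vectors into one fixed-length vector using softmax-normalized weights. It assumes the per-frame scores are already given. The scoring function, the attention topologies, and the pooling variants studied in the paper are not described in the abstract, so this is a generic sketch and not the authors' model.

```python
import math


def attention_pool(frames, scores):
    """Summarize a sequence as a softmax-weighted sum of its frames.

    frames: list of equal-length vectors, e.g. frame-level LSTM outputs
    scores: one scalar score per frame (assumed to be computed elsewhere)
    Returns (pooled_vector, weights).
    """
    if not frames or len(frames) != len(scores):
        raise ValueError("need one score per frame and at least one frame")
    # Subtract the maximum score for numerical stability of the softmax.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights
```

With equal scores this reduces to plain averaging over frames, which is one way to see what a non-attention pooling baseline would compute.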