6,911 research outputs found
Posterior-based confidence measures for spoken term detection
Confidence measures play a key role in spoken term detection (STD) tasks. The confidence measure expresses the posterior probability of the search term appearing in the detection period, given the speech. Traditional approaches are based on the acoustic and language model scores for candidate detections found using automatic speech recognition, with Bayes' rule being used to compute the desired posterior probability. In this paper, we present a novel direct posterior-based confidence measure which, instead of resorting to the Bayesian formula, calculates posterior probabilities from a multi-layer perceptron (MLP) directly. Compared with traditional Bayesian-based methods, the direct-posterior approach is conceptually and mathematically simpler. Moreover, the MLP-based model does not require assumptions to be made about the acoustic features such as their statistical distribution and the independence of static and dynamic co-efficients. Our experimental results in both English and Spanish demonstrate that the proposed direct posterior-based confidence improves STD performance
Out-of-vocabulary spoken term detection
Spoken term detection (STD) is a fundamental task for multimedia information
retrieval. A major challenge faced by an STD system is the serious performance reduction
when detecting out-of-vocabulary (OOV) terms. The difficulties arise not only
from the absence of pronunciations for such terms in the system dictionaries, but from
intrinsic uncertainty in pronunciations, significant diversity in term properties and a
high degree of weakness in acoustic and language modelling.
To tackle the OOV issue, we first applied the joint-multigram model to predict pronunciations
for OOV terms in a stochastic way. Based on this, we propose a stochastic
pronunciation model that considers all possible pronunciations for OOV terms so that
the high pronunciation uncertainty is compensated for.
Furthermore, to deal with the diversity in term properties, we propose a termdependent
discriminative decision strategy, which employs discriminative models to
integrate multiple informative factors and confidence measures into a classification
probability, which gives rise to minimum decision cost.
In addition, to address the weakness in acoustic and language modelling, we propose
a direct posterior confidence measure which replaces the generative models with
a discriminative model, such as a multi-layer perceptron (MLP), to obtain a robust
confidence for OOV term detection.
With these novel techniques, the STD performance on OOV terms was improved
substantially and significantly in our experiments set on meeting speech data
ASR error management for improving spoken language understanding
This paper addresses the problem of automatic speech recognition (ASR) error
detection and their use for improving spoken language understanding (SLU)
systems. In this study, the SLU task consists in automatically extracting, from
ASR transcriptions , semantic concepts and concept/values pairs in a e.g
touristic information system. An approach is proposed for enriching the set of
semantic labels with error specific labels and by using a recently proposed
neural approach based on word embeddings to compute well calibrated ASR
confidence measures. Experimental results are reported showing that it is
possible to decrease significantly the Concept/Value Error Rate with a state of
the art system, outperforming previously published results performance on the
same experimental data. It also shown that combining an SLU approach based on
conditional random fields with a neural encoder/decoder attention based
architecture , it is possible to effectively identifying confidence islands and
uncertain semantic output segments useful for deciding appropriate error
handling actions by the dialogue manager strategy .Comment: Interspeech 2017, Aug 2017, Stockholm, Sweden. 201
The uncertain representation ranking framework for concept-based video retrieval
Concept based video retrieval often relies on imperfect and uncertain concept detectors. We propose a general ranking framework to define effective and robust ranking functions, through explicitly addressing detector uncertainty. It can cope with multiple concept-based representations per video segment and it allows the re-use of effective text retrieval functions which are defined on similar representations. The final ranking status value is a weighted combination of two components: the expected score of the possible scores, which represents the risk-neutral choice, and the scoresâ standard deviation, which represents the risk or opportunity that the score for the actual representation is higher. The framework consistently improves the search performance in the shot retrieval task and the segment retrieval task over several baselines in five TRECVid collections and two collections which use simulated detectors of varying performance
Taking the bite out of automated naming of characters in TV video
We investigate the problem of automatically labelling appearances of characters in TV or film material
with their names. This is tremendously challenging due to the huge variation in imaged appearance of each character and the weakness and ambiguity of available annotation. However, we demonstrate that high precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties that we introduce are: (i) automatic generation of time stamped character annotation by aligning subtitles and transcripts; (ii) strengthening the supervisory information by identifying
when characters are speaking. In addition, we incorporate complementary cues of face matching and clothing matching to propose common annotations for face tracks, and consider choices of classifier which can potentially correct errors made in the automatic extraction of training data from the weak textual annotation. Results are presented on episodes of the TV series ââBuffy the Vampire Slayerâ
Evolutionary discriminative confidence estimation for spoken term detection
The final publication is available at Springer via http://dx.doi.org/10.1007/s11042-011-0913-zSpoken term detection (STD) is the task of searching for occurrences
of spoken terms in audio archives. It relies on robust confidence estimation
to make a hit/false alarm (FA) decision. In order to optimize the decision
in terms of the STD evaluation metric, the confidence has to be discriminative.
Multi-layer perceptrons (MLPs) and support vector machines (SVMs) exhibit
good performance in producing discriminative confidence; however they are
severely limited by the continuous objective functions, and are therefore less
capable of dealing with complex decision tasks. This leads to a substantial
performance reduction when measuring detection of out-of-vocabulary (OOV)
terms, where the high diversity in term properties usually leads to a complicated
decision boundary.
In this paper we present a new discriminative confidence estimation approach
based on evolutionary discriminant analysis (EDA). Unlike MLPs and
SVMs, EDA uses the classification error as its objective function, resulting
in a model optimized towards the evaluation metric. In addition, EDA combines
heterogeneous projection functions and classification strategies in decision
making, leading to a highly flexible classifier that is capable of dealing
with complex decision tasks. Finally, the evolutionary strategy of EDA reduces the risk of local minima. We tested the EDA-based confidence with a
state-of-the-art phoneme-based STD system on an English meeting domain
corpus, which employs a phoneme speech recognition system to produce lattices
within which the phoneme sequences corresponding to the enquiry terms
are searched. The test corpora comprise 11 hours of speech data recorded with
individual head-mounted microphones from 30 meetings carried out at several
institutes including ICSI; NIST; ISL; LDC; the Virginia Polytechnic Institute
and State University; and the University of Edinburgh. The experimental results
demonstrate that EDA considerably outperforms MLPs and SVMs on
both classification and confidence measurement in STD, and the advantage
is found to be more significant on OOV terms than on in-vocabulary (INV)
terms. In terms of classification performance, EDA achieved an equal error
rate (EER) of 11% on OOV terms, compared to 34% and 31% with MLPs and
SVMs respectively; for INV terms, an EER of 15% was obtained with EDA
compared to 17% obtained with MLPs and SVMs. In terms of STD performance
for OOV terms, EDA presented a significant relative improvement of
1.4% and 2.5% in terms of average term-weighted value (ATWV) over MLPs
and SVMs respectively.This work was partially supported by the French Ministry of Industry
(Innovative Web call) under contract 09.2.93.0966, âCollaborative Annotation for Video
Accessibilityâ (ACAV) and by âThe Adaptable Ambient Living Assistantâ (ALIAS) project
funded through the joint national Ambient Assisted Living (AAL) programme
- âŠ