1,438 research outputs found

    Augmented set of features for confidence estimation in spoken term detection

    Discriminative confidence estimation along with confidence normalisation have been shown to construct robust decision maker modules in spoken term detection (STD) systems. Discriminative confidence estimation, making use of term-dependent features, has been shown to improve the widely used lattice-based confidence estimation in STD. In this work, we augment the set of these term-dependent features and show a significant improvement in the STD performance both in terms of ATWV and DET curves in experiments conducted on a Spanish geographical corpus. This work also proposes a multiple linear regression analysis to carry out the feature selection. The most informative features selected by this analysis are then used within the discriminative confidence estimation in the STD system.
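For readers unfamiliar with ATWV, the following is a minimal sketch of how the metric is commonly computed, assuming the NIST STD 2006 definition (the abstract itself does not restate the formula). The function name, the dictionary inputs, and the default `beta` of 999.9 are illustrative assumptions rather than details taken from the paper.

```python
def atwv(n_true, n_correct, n_false_alarm, speech_seconds, beta=999.9):
    """Average term-weighted value at a fixed decision threshold.

    n_true, n_correct, n_false_alarm: dicts mapping each term to a count.
    speech_seconds: duration of the searched audio in seconds.
    beta: false-alarm weight (assumed NIST STD 2006 value).
    Terms with no true occurrence are skipped, since their miss
    probability is undefined.
    """
    terms = [t for t, n in n_true.items() if n > 0]
    if not terms:
        raise ValueError("at least one term must have a true occurrence")
    total = 0.0
    for t in terms:
        p_miss = 1.0 - n_correct.get(t, 0) / n_true[t]
        # Non-target trials are approximated as one per second of speech.
        p_fa = n_false_alarm.get(t, 0) / (speech_seconds - n_true[t])
        total += p_miss + beta * p_fa
    return 1.0 - total / len(terms)
```

Under this definition a system that finds every occurrence with no false alarms scores 1.0, and a system that returns nothing scores 0.0.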

    Term-Dependent Confidence for Out-of-Vocabulary Term Detection

    Within a spoken term detection (STD) system, the decision maker plays an important role in retrieving reliable detections. Most of the state-of-the-art STD systems make decisions based on a confidence measure that is term-independent, which poses a serious problem for out-of-vocabulary (OOV) term detection. In this paper, we study a term-dependent confidence measure based on confidence normalisation and discriminative modelling, particularly focusing on its remarkable effectiveness for detecting OOV terms. Experimental results indicate that the term-dependent confidence provides a much more significant improvement for OOV terms than for in-vocabulary terms.

    Evolutionary discriminative confidence estimation for spoken term detection

    The final publication is available at Springer via http://dx.doi.org/10.1007/s11042-011-0913-z. Spoken term detection (STD) is the task of searching for occurrences of spoken terms in audio archives. It relies on robust confidence estimation to make a hit/false alarm (FA) decision. In order to optimize the decision in terms of the STD evaluation metric, the confidence has to be discriminative. Multi-layer perceptrons (MLPs) and support vector machines (SVMs) exhibit good performance in producing discriminative confidence; however they are severely limited by the continuous objective functions, and are therefore less capable of dealing with complex decision tasks. This leads to a substantial performance reduction when measuring detection of out-of-vocabulary (OOV) terms, where the high diversity in term properties usually leads to a complicated decision boundary. In this paper we present a new discriminative confidence estimation approach based on evolutionary discriminant analysis (EDA). Unlike MLPs and SVMs, EDA uses the classification error as its objective function, resulting in a model optimized towards the evaluation metric. In addition, EDA combines heterogeneous projection functions and classification strategies in decision making, leading to a highly flexible classifier that is capable of dealing with complex decision tasks. Finally, the evolutionary strategy of EDA reduces the risk of local minima. We tested the EDA-based confidence with a state-of-the-art phoneme-based STD system on an English meeting domain corpus, which employs a phoneme speech recognition system to produce lattices within which the phoneme sequences corresponding to the enquiry terms are searched. The test corpora comprise 11 hours of speech data recorded with individual head-mounted microphones from 30 meetings carried out at several institutes including ICSI; NIST; ISL; LDC; the Virginia Polytechnic Institute and State University; and the University of Edinburgh.
The experimental results demonstrate that EDA considerably outperforms MLPs and SVMs on both classification and confidence measurement in STD, and the advantage is found to be more significant on OOV terms than on in-vocabulary (INV) terms. In terms of classification performance, EDA achieved an equal error rate (EER) of 11% on OOV terms, compared to 34% and 31% with MLPs and SVMs respectively; for INV terms, an EER of 15% was obtained with EDA compared to 17% obtained with MLPs and SVMs. In terms of STD performance for OOV terms, EDA presented a significant relative improvement of 1.4% and 2.5% in terms of average term-weighted value (ATWV) over MLPs and SVMs respectively. This work was partially supported by the French Ministry of Industry (Innovative Web call) under contract 09.2.93.0966, ‘Collaborative Annotation for Video Accessibility’ (ACAV) and by ‘The Adaptable Ambient Living Assistant’ (ALIAS) project funded through the joint national Ambient Assisted Living (AAL) programme.
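The equal error rate quoted above is the operating point at which the false-alarm rate and the miss rate coincide. As a rough sketch of how it can be estimated from scored detections (the function name and input layout are assumptions for illustration, not the authors' evaluation code):

```python
def equal_error_rate(scores, labels):
    """Estimate the EER from confidence scores and binary labels.

    scores: confidence values, higher means more likely a hit.
    labels: 1 for a true hit, 0 for a false alarm.
    A detection is accepted when its score is >= the threshold.
    Returns the mean of the two error rates at the threshold where
    they are closest, a simple estimate without interpolation.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one hit and one false alarm")
    best_gap, eer = float("inf"), None
    for thr in sorted(set(scores)):
        false_accepts = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        misses = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        far = false_accepts / n_neg   # false-alarm rate
        frr = misses / n_pos          # miss rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

With only a handful of detections the estimate is coarse; evaluation toolkits usually interpolate along the DET curve instead.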

    Stochastic Pronunciation Modelling for Out-of-Vocabulary Spoken Term Detection

    Spoken term detection (STD) is the name given to the task of searching large amounts of audio for occurrences of spoken terms, which are typically single words or short phrases. One reason that STD is a hard task is that search terms tend to contain a disproportionate number of out-of-vocabulary (OOV) words. The most common approach to STD uses subword units. This, in conjunction with some method for predicting pronunciations of OOVs from their written form, enables the detection of OOV terms, but performance is considerably worse than for in-vocabulary terms. This performance differential can be largely attributed to the special properties of OOVs. One such property is the high degree of uncertainty in the pronunciation of OOVs. We present a stochastic pronunciation model (SPM) which explicitly deals with this uncertainty. The key insight is to search for all possible pronunciations when detecting an OOV term, explicitly capturing the uncertainty in pronunciation. This requires a probabilistic model of pronunciation, able to estimate a distribution over all possible pronunciations. We use a joint-multigram model (JMM) for this and compare the JMM-based SPM with the conventional soft match approach. Experiments using speech from the meetings domain demonstrate that the SPM performs better than soft match in most operating regions, especially at low false alarm probabilities. Furthermore, SPM and soft match are found to be complementary: their combination provides further performance gains.

    I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance

    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start by demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features and by higher-level features related to intelligibility, obtained from an automatic speech recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient.

    Frame-level features conveying phonetic information for language and speaker recognition

    150 p. This Thesis, developed in the Software Technologies Working Group of the Department of Electricity and Electronics of the University of the Basque Country, focuses on the research field of spoken language and speaker recognition technologies. More specifically, the research carried out studies the design of a set of features conveying spectral acoustic and phonotactic information, searches for the optimal feature extraction parameters, and analyses the integration and usage of the features in language recognition systems, and the complementarity of these approaches with regard to state-of-the-art systems. The study reveals that systems trained on the proposed set of features, denoted as Phone Log-Likelihood Ratios (PLLRs), are highly competitive, outperforming in several benchmarks other state-of-the-art systems. Moreover, PLLR-based systems also provide complementary information with regard to other phonotactic and acoustic approaches, which makes them suitable in fusions to improve the overall performance of spoken language recognition systems. The usage of these features is also studied in speaker recognition tasks. In this context, the results attained by the approaches based on PLLR features are not as remarkable as the ones of systems based on standard acoustic features, but they still provide complementary information that can be used to enhance the overall performance of the speaker recognition systems.
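The abstract does not give the PLLR formula. A minimal sketch, assuming the commonly cited definition in which each frame's phone posteriors are mapped to their log-odds, could look like this (the function name and the clipping constant `eps` are illustrative assumptions):

```python
import math

def pllr(posteriors, eps=1e-10):
    """Map one frame of phone posteriors to log-likelihood-ratio features.

    posteriors: per-phone posterior probabilities for a single frame,
    assumed to be produced by a phone decoder and to sum to one.
    Each value p becomes log(p / (1 - p)); values are clipped to
    [eps, 1 - eps] so the logarithm stays finite.
    """
    clipped = [min(max(p, eps), 1.0 - eps) for p in posteriors]
    return [math.log(p / (1.0 - p)) for p in clipped]
```

Applying the function to every frame yields one feature vector per frame, with one dimension per phone, which can then be modelled like conventional acoustic features.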

    Sparse Classifier Fusion for Speaker Verification


    Street slang and schizophrenia

    We report the case of a 26 year old streetwise young postman who presented with a six month history of reduced occupational and social function, low mood, and lack of motivation. He complained of feeling less sociable and less interested in his friends and of being clumsy and finding it harder to think. He was otherwise fit and healthy, with no physical abnormalities, neurological signs, or objective cognitive impairments. There was no history of a recent stressor that might have precipitated his symptoms. He was referred to a specialist service for patients in the prodromal phase of psychotic illness for further assessment after he had seen his general practitioner and the local community mental health team. The differential diagnosis at this stage was depression, the prodrome of schizophrenia, or no formal clinical disorder. His premorbid occupational and social function had been good. There was no history of abnormal social, language, and motor development and he left school with two A levels. After three years of service at the post office he had been promoted to a supervisory role. He had a good relationship with his family and had six or so good friends. There had been a number of previous heterosexual relationships, although none in the past year. Aside from smoking cannabis on two occasions when he was 19, there was no history of illicit substance use. Detailed and repeated assessment of his mental state found a normal affect, no delusions, hallucinations, or catatonia, and no cognitive dysfunction. His speech, however, was peppered with what seemed (to his middle class and older psychiatrist) to be an unusual use of words, although he said they were street slang (table). It was thus unclear whether he was displaying subtle signs of formal thought disorder (manifest as disorganised speech, including the use of unusual words or phrases, and neologisms) or using a "street" argot.
This was a crucial diagnostic distinction as thought disorder is a feature of psychotic illnesses and can indicate a diagnosis of schizophrenia. We sought to verify his explanations using an online dictionary of slang (urbandictionary.com). To our surprise, many of the words he used were listed and the definitions accorded with those he gave (see table). We further investigated whether his speech showed evidence of thought disorder by examining recordings of his speech as he described a series of ambiguous pictures from the thematic apperception test, a procedure that elicits thought disordered speech. His speech was transcribed and rated with the thought and language index, a standardised scale for assessing thought disorder. Slang used in a linguistically appropriate way is not scored as abnormal on this scale. His score was 5.25, primarily reflecting a mild loosening of associations. For example, he described a picture of a boat on a lake thus: "There’s a boat and a tree. There seems to be a reflection. There are no beds, and I wonder why there are no beds. There’s a breeze going through the branches of the tree." His score was outside the normal range (mean for normal controls 0.88, SD 1.15) and indicates subtle thought disorder, equivalent to that evident in remitted patients with schizophrenia (mean in remitted patients 3.89, SD 2.56) but lower than that in patients with formal thought disorder (mean 27.4, SD 8.3). Over the following year his social and occupational functioning deteriorated further, and he developed frank formal thought disorder as well as grandiose and persecutory delusions to the extent that he met DSM-IV criteria for schizophrenia. His speech was assessed as before, and the thought and language index score had increased to 11.75. This mainly reflected abnormalities on items comprising "positive" thought disorder, particularly the use of neologisms such as "chronocolising" and non-sequiturs. 
To our knowledge this is the first case report to describe difficulties in distinguishing "street" argots from formal thought disorder. It is perhaps not surprising that slang can complicate the assessment of disorganised speech as psychotic illnesses usually develop in young adults, whereas the assessing clinician is often from an older generation (and different sociocultural background) less familiar with contemporary urban slang. Online resources offer a means of distinguishing street argot from neologisms or a peculiar use of words, and linguistic rating scales may be a useful adjunct to clinical assessment when thought disorder is subtle. Differentiating thought disorder from slang can be especially difficult in the context of "prodromal" signs of psychosis, when speech abnormalities, if present, are usually subtle. Nevertheless, accurate speech assessment is important as subtle thought disorder can, as in this case, predate the subsequent onset of schizophrenia, and early detection and treatment of psychosis might be associated with a better long term clinical outcome.