    Identification of Non-Linguistic Speech Features

    Over the last decade, technological advances have made it possible to envision real-world applications of speech technologies. It is possible to foresee applications where the spoken query is to be recognized without prior knowledge of the language being spoken, for example, at information centers in public places such as train stations and airports. Other applications may require accurate identification of the speaker for security reasons, including control of access to confidential information or telephone-based transactions. Ideally, the speaker's identity can be verified continually during the transaction, in a manner completely transparent to the user. With these views in mind, this paper presents a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique is shown to be effective for text-independent language, sex, and speaker identification and can enable better and friendlier human-machine interaction. With 2 s of speech, the language can be identified with better than 99% accuracy. The error in sex identification is about 1% on a per-sentence basis, and speaker identification accuracies of 98.5% on TIMIT (168 speakers) and 99.2% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% with two utterances for both corpora. An experiment using unsupervised adaptation for speaker identification on the 168 TIMIT speakers gave the same identification accuracies as those obtained with supervised adaptation.
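
The likelihood-based decision mentioned in this abstract can be illustrated with a minimal sketch. It assumes that each candidate class (a language, a sex, or a speaker) has its own set of phone models and that per-frame acoustic log-likelihoods have already been computed under each set. The function name `identify` and the scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def identify(frame_loglik: Dict[str, List[float]]) -> str:
    """Return the label whose models give the highest total log-likelihood.

    frame_loglik maps each candidate label to the per-frame acoustic
    log-likelihoods obtained with that label's model set (assumed given).
    """
    totals = {label: sum(scores) for label, scores in frame_loglik.items()}
    return max(totals, key=totals.get)


# Hypothetical per-frame scores for a short utterance.
scores = {
    "english": [-4.1, -3.9, -4.3],
    "french": [-3.2, -3.0, -3.5],
}
best = identify(scores)
```

The same argmax rule would apply whether the labels are languages, sexes, or speakers; only the model sets differ.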

    Investigating techniques for low resource conversational speech recognition

    In this paper we investigate various techniques for building effective speech-to-text (STT) and keyword search (KWS) systems for low-resource conversational speech. Sub-word decoding and graphemic mappings were assessed in order to detect out-of-vocabulary keywords. To deal with the limited amount of transcribed data, semi-supervised training and data selection methods were investigated. Robust acoustic features produced via data augmentation were evaluated for acoustic modeling. For language modeling, automatically retrieved conversational-like Web data was used, as well as neural network based models. We report STT improvements with all the techniques, but interestingly only some improve KWS performance. Results are reported for the Swahili language in the context of the 2015 OpenKWS Evaluation.

    OLIVE: Speech-Based Video Retrieval

    This paper describes the Olive project, which aims to support automated indexing of video material by use of human language technologies. Olive uses speech recognition to automatically derive transcriptions of the sound tracks, generating time-coded linguistic elements which serve as the basis for text-based retrieval functionality. The retrieval demonstrator builds on and extends the architecture from the Pop-Eye project, a system applying human language technology to subtitles for the disclosure of video fragments.

    A Scalable Video Search Engine Based on Audio Content Indexing and Topic Segmentation

    One important class of online videos is that of news broadcasts. Most news organisations provide near-immediate access to topical news broadcasts over the Internet, through RSS streams or podcasts. Until lately, technology has not made it possible for a user to automatically go to the smaller parts, within a longer broadcast, that might interest them. Recent advances in both speech recognition systems and natural language processing have led to a number of robust tools that allow us to provide users with quicker, more focussed access to relevant segments of one or more news broadcast videos. Here we present our new interface for browsing or searching news broadcasts (video/audio) that exploits these new language processing tools to (i) provide immediate access to topical passages within news broadcasts, (ii) browse news broadcasts by events as well as by people, places and organisations, (iii) perform cross-lingual search of news broadcasts, (iv) search for news through a map interface, (v) browse news by trending topics, and (vi) see automatically generated textual clues for news segments before listening. Our publicly searchable demonstrator currently indexes daily broadcast news content from 50 sources in English, French, Chinese, Arabic, Spanish, Dutch and Russian. Comment: NEM Summit, Torino, Italy (2011)

    Incidence and main factors associated with early unplanned hospital readmission among French medical inpatients aged 75 and over admitted through emergency units

    Background: among elderly patients, readmission in the month following hospital discharge is a frequent occurrence which involves a risk of functional decline, particularly among frail subjects. While previous studies have identified risk factors of early readmission, geriatric syndromes, as markers of frailty, have not been assessed as potential predictors. Objective: to evaluate the risk of early unplanned readmission, and to identify predictors in inpatients aged 75 and over, admitted to medical wards through emergency departments. Design: prospective multi-centre study. Setting: nine French hospitals. Subjects: one thousand three hundred and six medical inpatients, aged 75 and older, admitted through emergency departments (SAFES cohort). Methods: using logistic regressions, factors associated with early unplanned re-hospitalisation (defined as first unplanned readmission in the thirty days after discharge) were identified using data from the first week of the index hospital stay, obtained by comprehensive geriatric assessment. Results: data from a thousand of the 1,306 inpatients were analysed. Early unplanned readmission occurred in 14.2% of inpatients and was not related to sociodemographic characteristics, comorbidity burden or cognitive impairment. Pressure sores (OR = 2.05, 95% CI = 1.0-3.9), poor overall condition (OR = 2.01, 95% CI = 1.3-3.0), recent loss of ability for self-feeding (OR = 1.9, 95% CI = 1.2-2.9) and prior hospitalisation during the last 3 months (OR = 1.6, 95% CI = 1.1-2.5) were found to be risk factors, while sight disorders appeared to be negatively associated (OR = 0.5, 95% CI = 0.3-0.8). Conclusions: markers of frailty (poor overall condition, pressure sores, prior hospitalisation) or severe disability (for self-feeding) were the most important predictors of early readmission among elderly medical inpatients. Early identification could facilitate preventive strategies in risk groups.
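
For readers unfamiliar with how odds ratios and their confidence intervals relate to a logistic regression, the following is a minimal sketch of the standard Wald computation: the odds ratio is the exponential of the fitted coefficient, and the 95% interval is the exponential of the coefficient plus or minus 1.96 standard errors. The abstract does not report coefficients or standard errors, so the values of `beta` and `std_err` below are hypothetical, as is the function name.

```python
import math
from typing import Tuple


def odds_ratio_with_ci(beta: float, std_err: float,
                       z: float = 1.96) -> Tuple[float, float, float]:
    """Convert a logistic-regression coefficient into (OR, CI low, CI high).

    beta    : fitted log-odds coefficient (hypothetical input here)
    std_err : its standard error (hypothetical input here)
    z       : normal quantile, 1.96 for a 95% Wald interval
    """
    return (
        math.exp(beta),
        math.exp(beta - z * std_err),
        math.exp(beta + z * std_err),
    )


# Hypothetical coefficient corresponding to an odds ratio of exactly 2.
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(beta=math.log(2.0), std_err=0.25)
```

An interval that excludes 1 (as for the risk factors listed above) indicates an association that is significant at the 5% level under this approximation.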

    The Vocapia Research ASR Systems for Evalita 2011

    This document describes the speech recognizers submitted by Vocapia Research to the Evalita 2011 evaluation for the open unconstrained automatic speech recognition (ASR) task. The aim of this evaluation was to perform automatic speech recognition of parliament audio sessions in the Italian language. Two systems were submitted. The primary system has a single decoding pass and was optimized to run in real time. The contrastive system, developed in collaboration with LIMSI-CNRS, has two decoding passes and runs in about 5×RT. The case-insensitive word error rates (WER) of these systems on the Evalita development data are 10.2% and 9.3%, respectively.
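
As an illustration of the metric reported above, here is a minimal sketch of a case-insensitive word error rate: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the scoring tool used in the evaluation, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Case-insensitive WER: word-level edit distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one deleted word out of five reference words.
wer = word_error_rate("Il Senato approva la legge", "il senato approva legge")
```

The real-time factor quoted in the abstract is a separate measure: processing time divided by audio duration, so 5×RT means roughly five seconds of computation per second of audio.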

    The INRIA-LIM-VocR and AXES submissions to Trecvid 2014 Multimedia Event Detection

    This paper describes our participation in the 2014 edition of the TrecVid Multimedia Event Detection task. Our system is based on a collection of local visual and audio descriptors, which are aggregated to global descriptors, one for each type of low-level descriptor, using Fisher vectors. Besides these features, we use two features based on convolutional networks: one for the visual channel, and one for the audio channel. Additional high-level features are extracted using ASR and OCR features. Finally, we used mid-level attribute features based on object and action detectors trained on external datasets. Our two submissions (INRIA-LIM-VocR and AXES) are identical in terms of all the components, except for the ASR system that is used. We present an overview of the features and the classification techniques, and experimentally evaluate our system on TrecVid MED 2011 data.

    Loss of independence in Katz's ADL ability in connection with an acute hospitalization: early clinical markers in French older people

    Background: The preservation of autonomy and of the ability of elderly people to carry out the basic activities of daily living, beyond the therapeutic care of any pathologies, appears to be one of the main objectives of care during hospitalization. Objectives: To identify early clinical markers associated with the loss of independence in elderly people in short-stay hospitals. Methods: Among the 1,306 subjects making up the prospective and multicenter SAFES cohort study (Sujet Âgé Fragile: Évolution et suivi; frail elderly subjects, evaluation and follow-up), 619 medical inpatients, not disabled at baseline and hospitalized through an emergency department, were considered. Data used in a multinomial logistic regression were obtained through a comprehensive geriatric assessment (CGA) conducted in the first week of hospitalization. Dependency levels were assessed at baseline, at inclusion and at 30 days using Katz's ADL index. Baseline was defined as the dependence level before occurrence of the event motivating hospitalization. To limit the influence of rehabilitation on the level of dependence, only stays shorter than 30 days were considered. Results: Of the 619 patients, 514 were eligible, 15 died and 90 were still hospitalized at the end point. Two-thirds of subjects were women, with a mean age of 83. At day 30, 162 patients (31%) were not disabled, 61 (12%) were moderately disabled and 291 (57%) were severely disabled. No socio-demographic variables seemed to influence the day 30 dependence level. Lack of autonomy (odds ratio (OR) = 1.9, 95% confidence interval (CI) = 1.2-3.6), walking difficulties (OR = 2.7, 95% CI = 1.3-5.6), fall risk (OR = 2.1, 95% CI = 1.3-6.8) and malnutrition risk (OR = 2.2, 95% CI = 1.5-7.6) were found in multifactorial analysis to be clinical markers for loss of independence. Conclusions: Beyond considerations on the design of preventive policies targeting the at-risk populations identified here, the identification of functional factors (lack of autonomy, walking difficulties, risk of falling) suggests above all that consideration needs to be given to the organization per se of the French geriatric hospital care system, and in particular to the relevance of maintaining sector-type segregation between wards for acute care and those involved in rehabilitation.

    Acoustic-based speaker diarization

    This thesis addresses speaker diarization for different types of audio recordings, in particular broadcast news (BN) and meetings. The speaker diarization task aims to answer the question "who spoke when" for a given audio stream. The work assumes that no a priori knowledge of the speakers' voices or of the number of speakers is available. The principal originality of the proposed BN speaker diarization system lies in the combination of two speaker clustering stages: a first Bayesian Information Criterion (BIC) based clustering using single full-covariance Gaussian models provides an under-clustering, and the resulting clusters are recombined in a second clustering stage relying on speaker identification techniques with Gaussian Mixture Models (GMMs) adapted from a generic model. The BN speaker diarization system was validated in both the international NIST Rich Transcription 2004 Fall (RT-04F) evaluation and the French ESTER 2005 evaluation of the Technolangue EVALDA project, and provided the best diarization results in both evaluations. The BN system has also been adapted to the meeting domain. The adapted system integrates a new speech activity detector based on the log-likelihood ratio. Various feature normalization techniques and different sets of acoustic features were also explored during this adaptation. In the last NIST RT meeting recognition evaluation, this meeting diarization system had an overlap diarization error of approximately 26% on the conference and lecture test data.
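
The BIC-based merging criterion used in the first clustering stage can be illustrated with a minimal sketch. To stay self-contained it assumes one-dimensional features, so each cluster is modeled by a single Gaussian with a mean and a variance; the thesis uses full-covariance Gaussians over multi-dimensional acoustic features, where the log-variance becomes a log-determinant and the parameter count grows accordingly. The function name `delta_bic`, the `penalty_weight` parameter, and the data are hypothetical.

```python
import math
from statistics import pvariance
from typing import Sequence


def delta_bic(a: Sequence[float], b: Sequence[float],
              penalty_weight: float = 1.0) -> float:
    """Delta-BIC between modeling two clusters separately vs. merged.

    Negative values favor merging (one Gaussian explains both clusters);
    positive values favor keeping the clusters apart.
    Simplification: 1-D features with maximum-likelihood variances.
    """
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    merged = list(a) + list(b)

    # Log-likelihood gain of two separate Gaussians over one merged Gaussian.
    gain = 0.5 * (n * math.log(pvariance(merged))
                  - n_a * math.log(pvariance(a))
                  - n_b * math.log(pvariance(b)))

    # Complexity penalty for the extra Gaussian: one mean plus one variance.
    n_params = 2
    penalty = 0.5 * penalty_weight * n_params * math.log(n)
    return gain - penalty


# Invented 1-D features: two clusters from similar sources, one distant.
cluster_a = [0.1, -0.2, 0.05, 0.3, -0.1, 0.0, 0.2, -0.3]
cluster_c = [0.15, -0.25, 0.0, 0.25, -0.05, 0.05, 0.1, -0.2]
cluster_b = [5.1, 4.8, 5.05, 5.3, 4.9, 5.0, 5.2, 4.7]

merge_similar = delta_bic(cluster_a, cluster_c) < 0
merge_distant = delta_bic(cluster_a, cluster_b) < 0
```

In an agglomerative scheme of this kind, the pair of clusters with the lowest delta-BIC would typically be merged first, and merging would stop once no pair has a negative value; a larger `penalty_weight` makes merging more likely.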