Towards Affordable Disclosure of Spoken Word Archives
This paper presents and discusses ongoing work aiming at affordable disclosure of real-world spoken word archives in general, and in particular of a collection of recorded interviews with Dutch survivors of the World War II concentration camp Buchenwald. Given such collections, the least we want to be able to provide is search at different levels and a flexible way of presenting results. Strategies for automatic annotation based on speech recognition, supporting, e.g., within-document search, are outlined and discussed with respect to the Buchenwald interview collection. In addition, usability aspects of spoken word search are discussed on the basis of our experiences with the online Buchenwald web portal. It is concluded that, although user feedback is generally fairly positive, automatic annotation performance is still far from satisfactory and requires additional research.
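As an illustration of how ASR-based annotation can support within-document search, the sketch below indexes a time-aligned transcript so that a query word maps back to its time offsets in the recording; the transcript data and function names are hypothetical, not taken from the Buchenwald portal.

```python
from collections import defaultdict

def build_index(words):
    """Build an inverted index from (word, start_time) pairs,
    as produced by a time-aligned ASR transcript."""
    index = defaultdict(list)
    for word, start in words:
        index[word.lower()].append(start)
    return index

def search(index, query):
    """Return the time offsets at which the query word is spoken,
    so playback can jump straight to the relevant fragment."""
    return index.get(query.lower(), [])

# Hypothetical ASR output: (word, start time in seconds).
transcript = [("the", 0.0), ("camp", 0.4), ("was", 0.9),
              ("liberated", 1.2), ("in", 1.9), ("april", 2.1)]
index = build_index(transcript)
print(search(index, "liberated"))  # [1.2]
```

In a real portal the index would of course also store interview and segment identifiers, but the principle is the same: the transcript turns an audio stream into a searchable, time-addressable document.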
Access to recorded interviews: A research agenda
Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state-of-the-art for key component technologies. A large number of important research issues are identified, and from that set of issues a coherent research agenda is proposed.
Unsupervised video indexing on audiovisual characterization of persons
This thesis proposes a method for the unsupervised characterization of persons in audiovisual documents, exploiting data related to their physical appearance and their voice. In general, automatic identification methods, whether for video or audio, require a large amount of a priori knowledge about the content. In this work, the goal is to study the two modalities in a correlated way and to exploit their respective properties collaboratively and robustly, in order to produce a reliable result that is as independent as possible of any a priori knowledge. More specifically, we studied the characteristics of the audio stream and proposed several methods for speaker segmentation and clustering, which we evaluated in a French evaluation campaign. We then carried out an in-depth study of visual descriptors (face, clothing) that served as the basis for novel approaches to the detection, tracking, and clustering of people within a document. Finally, the work focused on the fusion of audio and video data, proposing an approach based on the computation of a co-occurrence matrix that allowed us to establish an association between the audio index and the video index and to correct them both. We can thus produce a dynamic audiovisual model of the speakers.
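The abstract gives no implementation details, but the co-occurrence idea can be sketched as follows: count, per frame, how often each audio speaker cluster is active together with each video person cluster, then associate each audio cluster with its most frequent video partner. The frame labels below are synthetic toy data, not output of the thesis system.

```python
import numpy as np

# Per-frame labels from the two independent indexes (toy data):
# audio_idx[t] = speaker cluster from diarization,
# video_idx[t] = person cluster from face/clothing tracking.
audio_idx = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0])
video_idx = np.array([2, 2, 2, 0, 0, 0, 0, 2, 2])

n_audio = audio_idx.max() + 1
n_video = video_idx.max() + 1

# Co-occurrence matrix: how often audio cluster i and video
# cluster j are active in the same frame.
cooc = np.zeros((n_audio, n_video), dtype=int)
for a, v in zip(audio_idx, video_idx):
    cooc[a, v] += 1

# Associate each audio cluster with its most co-occurring video cluster.
mapping = {a: int(cooc[a].argmax()) for a in range(n_audio)}
print(mapping)  # {0: 2, 1: 0}
```

Once such a mapping exists, disagreements between the two indexes (frames where the audio label and the mapped video label conflict) can be used to correct either index, which is the "correction" step the abstract mentions.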
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
Segmentation, Diarization and Speech Transcription: Surprise Data Unraveled
In this thesis, research on large vocabulary continuous speech recognition for unknown audio conditions is presented. For automatic speech recognition systems based on statistical methods, it is important that the conditions of the audio used for training the statistical models match the conditions of the audio to be processed. Any mismatch will decrease the accuracy of the recognition. If it is unpredictable what kind of data can be expected, or in other words if the conditions of the audio to be processed are unknown, it is impossible to tune the models. If the material consists of `surprise data' the output of the system is likely to be poor. In this thesis methods are presented for which no external training data is required for training models. These novel methods have been implemented in a large vocabulary continuous speech recognition system called SHoUT. This system consists of three subsystems: speech/non-speech classification, speaker diarization and automatic speech recognition. The speech/non-speech classification subsystem separates speech from silence and unknown audible non-speech events. The type of non-speech present in audio recordings can vary from paper shuffling in recordings of meetings to sound effects in television shows. Because it is unknown what type of non-speech needs to be detected, it is not possible to train high quality statistical models for each type of non-speech sound. The speech/non-speech classification subsystem, also called the speech activity detection subsystem, does not attempt to classify all audible non-speech in a single run. Instead, first a bootstrap speech/silence classification is obtained using a standard speech activity component. Next, the models for speech, silence and audible non-speech are trained on the target audio using the bootstrap classification. 
This approach makes it possible to classify speech and non-speech with high accuracy, without the need to know what kinds of sound are present in the audio recording. Once all non-speech is filtered out of the audio, it is the task of the speaker diarization subsystem to determine how many speakers occur in the recording and exactly when they are speaking. The speaker diarization subsystem applies agglomerative clustering to create clusters of speech fragments for each speaker in the recording. First, statistical speaker models are created on random chunks of the recording; by iteratively realigning the data, retraining the models, and merging models that represent the same speaker, accurate speaker models are obtained for speaker clustering. This method does not require any statistical models developed on a training set, which makes the diarization subsystem insensitive to variation in audio conditions. Unfortunately, because of the computational complexity of the algorithm, this clustering method is slow for long recordings. Two variations of the subsystem are presented that reduce the required computational effort, so that the subsystem is applicable to long audio recordings as well. The automatic speech recognition subsystem developed for this research is based on Viterbi decoding on a fixed pronunciation prefix tree. Using the fixed tree, a flexible modular decoder could be developed, but it was not straightforward to apply full language model look-ahead efficiently. In this thesis a novel method is discussed that makes it possible to apply language model look-ahead effectively on the fixed tree. Also, to obtain higher speech recognition accuracy on audio with unknown acoustic conditions, a selection of the numerous known methods for robust automatic speech recognition is applied and evaluated in this thesis. The three individual subsystems as well as the entire system have been successfully evaluated on three international benchmarks.
The diarization subsystem was evaluated at the NIST RT06s benchmark, and the speech activity detection subsystem was tested at RT07s. The entire system was evaluated at N-Best, the first automatic speech recognition benchmark for Dutch.
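A heavily simplified sketch of the agglomerative clustering idea described above, assuming single-Gaussian segment models over 1-D toy features and a GLR-style merge score in place of the full iterative realignment-and-retraining pipeline used in SHoUT:

```python
import numpy as np

def glr(x, y):
    """Generalized-likelihood-ratio-style similarity between two
    segments modelled as single Gaussians (a stand-in for the
    full GMM-based merge criterion of real diarization systems).
    Values near zero mean "plausibly the same speaker"."""
    xy = np.concatenate([x, y])
    def ll(z):  # log-likelihood of z under its own Gaussian fit
        return -0.5 * len(z) * np.log(z.var() + 1e-9)
    return ll(xy) - ll(x) - ll(y)

def agglomerative_diarization(segments, threshold=-5.0):
    """Repeatedly merge the most similar pair of clusters until
    no pair looks like the same speaker any more."""
    clusters = [np.asarray(s, dtype=float) for s in segments]
    while len(clusters) > 1:
        pairs = [(glr(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = max(pairs)
        if best < threshold:          # stopping criterion
            break
        clusters[i] = np.concatenate([clusters[i], clusters[j]])
        del clusters[j]
    return clusters

# Toy 1-D "feature" segments: two segments each from two speakers.
a1, a2 = [0.0, 0.1, 0.2, 0.1], [0.1, 0.0, 0.15, 0.05]
b1, b2 = [5.0, 5.1, 4.9, 5.05], [5.2, 5.0, 4.95, 5.1]
clusters = agglomerative_diarization([a1, a2, b1, b2])
print(len(clusters))  # number of speakers found: 2
```

The quadratic number of pair comparisons per merge step is the same scaling issue the thesis addresses with its faster subsystem variants.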
Adaptation of speech recognition systems to selected real-world deployment conditions
This habilitation thesis deals with adaptation of automatic speech
recognition (ASR) systems to selected real-world deployment conditions.
It is presented in the form of a collection of twelve articles
dealing with this task; I am the main author or a co-author of these
articles. They were published in the course of several consecutive
research projects, in which I participated both as a member of the
research team and as the principal investigator or a co-investigator.
These articles can be divided into three main groups according to
their topics. They have in common the effort to adapt a particular
ASR system to a specific factor or deployment condition that affects
its function or accuracy.
The first group of articles is focused on an unsupervised speaker
adaptation task, where the ASR system adapts its parameters to
the specific voice characteristics of one particular speaker. The second
part deals with a) methods allowing the system to identify
non-speech events on the input, and b) the related task of recognition
of speech with non-speech events, particularly music, in the
background. Finally, the third part is devoted to the methods
that allow the transcription of an audio signal containing multilingual
utterances. It includes a) approaches for adapting the existing
recognition system to a new language and b) methods for identification
of the language from the audio signal.
The two aforementioned identification tasks are investigated in
particular under the demanding and less explored frame-wise processing
scenario, which is the only one suitable for on-line deployment, e.g.,
for streamed data.
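As a hypothetical illustration of frame-wise on-line processing (not code from the thesis), the sketch below smooths per-frame class scores with an exponential moving average and emits a decision after every frame, with no look-ahead, as a streamed-data setting requires:

```python
import numpy as np

def framewise_decision(frame_scores, alpha=0.9):
    """Smooth per-frame class scores with an exponential moving
    average and emit a decision after every frame; only past
    frames are used, so the method works on a live stream."""
    smoothed = np.zeros(frame_scores.shape[1])
    decisions = []
    for scores in frame_scores:
        smoothed = alpha * smoothed + (1 - alpha) * scores
        decisions.append(int(smoothed.argmax()))
    return decisions

# Toy per-frame scores for two classes (e.g., two languages):
# noisy frames favouring class 0, then a switch to class 1.
scores = np.array([[0.9, 0.1]] * 5 + [[0.2, 0.8]] * 15)
print(framewise_decision(scores))
```

The smoothing constant trades off latency against stability: a larger `alpha` suppresses spurious per-frame flips but reacts more slowly to a genuine change of speaker language or of background condition.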