11 research outputs found

    Different Approaches for Speaker Diarization

    Get PDF
    Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources and other signal source/channel characteristics. Speaker diarization is the task of determining “who spoke when?†in an audio or video recording that contains an unknown amount of speech and also an unknown number of speakers. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives and increasing the richness of automatic transcriptions, making them more readable. Over recent years, however, speaker diarization has become an important key technology for many tasks, such as navigation, retrieval or higher-level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in the area of conferences. The application domains, from broadcast news, to lectures and meetings, vary greatly and pose different problems, such as access to multiple microphones and multimodal information or overlapping speech

    Tuning-Robust Initialization Methods for Speaker Diarization

    Get PDF
    This paper investigates a typical speaker diarization system regarding its robustness against initialization parameter variation and presents a method to reduce manual tuning of these values significantly. The behavior of an agglomerative hierarchical clustering system is studied to determine which initialization parameters impact accuracy most. We show that the accuracy of typical systems is indeed very sensitive to the values chosen for the initialization parameters and factors such as the duration of speech in the recording. We then present a solution that reduces the sensitivity of the initialization values and therefore reduces the need for manual tuning significantly while at the same time increasing the accuracy of the system. For short meetings extracted from the previous (2006, 2007, and 2009) National Institute of Standards and Technology (NIST) Rich Transcription (RT) evaluation data, the decrease of the diarization error rate is up to 50% relative. The approach consists of a novel initialization parameter estimation method for speaker diarization that uses agglomerative clustering with Bayesian information criterion (BIC) and Gaussian mixture models (GMMs) of frame-based cepstral features (MFCCs). The estimation method balances the relationship between the optimal value of the seconds of speech data per Gaussian and the duration of the speech data and is combined with a novel nonuniform initialization method. This approach results in a system that performs better than the current ICSI baseline engine on datasets of the NIST RT evaluations of the years 2006, 2007, and 2009

    Robust Speaker Diarization for Short Speech Recordings

    Get PDF
    We investigate a state-of-the-art Speaker Diarization system regarding its behavior on meetings that are much shorter (from 500 seconds down to 100 seconds) than those typically analyzed in Speaker Diarization benchmarks. First, the problems inherent to this task are analyzed. Then, we propose an approach that consists of a novel initialization parameter estimation method for typical state-of-the-art diarization approaches. The estimation method balances the relationship between the optimal value of the duration of speech data per Gaussian and the duration of the speech data, which is verified experimentally for the first time in this article. As a result, the Diarization Error Rate for short meetings extracted from the 2006, 2007, and 2009 NIST RT evaluation data is decreased by up to 50% relative

    An Information Theoretic Combination of MFCC and TDOA Features for Speaker Diarization

    Full text link

    Towards a Reference Architecture for Context-Aware Services

    Get PDF
    This Chapter describes an infrastructure for multi-modal perceptual systems which aims at developing and realizing computer services that are delivered to humans in an implicit and unobtrusive way. The framework presented here supports the implementation of humancentric context-aware applications providing non-obtrusive assistance to participants in events such as meetings, lectures, conferences and presentations taking place in indoor "smart spaces". We emphasize on the design and implementation of an agent-based framework that supports "pluggable" service logic in the sense that the service developer can concentrate on the service logic independently of the underlying middleware. Furthermore, we give an example of the architecture’s ability to support the cooperation of multiple services in a meeting scenario using an intelligent connector service and a semantic web oriented travel service. The framework was developed as part of the project CHIL (Computers in the Human Interaction Loop). The vision of CHIL was to be able to provide context-aware human centric services which will operate in the background, provide assistance to the participants in the CHIL spaces and undertake tedious tasks in an unobtrusive way. To achieve this, significant effort had to be put in designing efficient context extraction components so that the CHIL system can acquire an accurate perspective of the current state of the CHIL space. However, the CHIL services required a much more sophisticated modelling of the actual event, rather than simple and fluctuating impressions of it. Furthermore, by nature the CHIL spaces are highly dynamic and heterogeneous; people join or leave, sensors fail or are restarted, user devices connect to the network, etc. To manage this diverse infrastructure, sophisticated techniques were necessary that can map all entities present in the CHIL system and provide information to all components which may require it. From these facts, one can easily understand that in addition to highly sophisticated components at an individual level, another mechanism (or a combination of mechanisms) should be present which can handle this infrastructure. The CHIL Reference Architecture for Multi Modal Systems lies in the background, and provides the solid, high performance and robust backbone for the CHIL services. Each individual need is assigned to a specially designed and integrated layer which is docked to the individual component, and provides all the necessary actions to enable the component to be plugged in the CHIL framework

    Face pose estimation in monocular images

    Get PDF
    People use orientation of their faces to convey rich, inter-personal information. For example, a person will direct his face to indicate who the intended target of the conversation is. Similarly in a conversation, face orientation is a non-verbal cue to listener when to switch role and start speaking, and a nod indicates that a person has understands, or agrees with, what is being said. Further more, face pose estimation plays an important role in human-computer interaction, virtual reality applications, human behaviour analysis, pose-independent face recognition, driver s vigilance assessment, gaze estimation, etc. Robust face recognition has been a focus of research in computer vision community for more than two decades. Although substantial research has been done and numerous methods have been proposed for face recognition, there remain challenges in this field. One of these is face recognition under varying poses and that is why face pose estimation is still an important research area. In computer vision, face pose estimation is the process of inferring the face orientation from digital imagery. It requires a serious of image processing steps to transform a pixel-based representation of a human face into a high-level concept of direction. An ideal face pose estimator should be invariant to a variety of image-changing factors such as camera distortion, lighting condition, skin colour, projective geometry, facial hairs, facial expressions, presence of accessories like glasses and hats, etc. Face pose estimation has been a focus of research for about two decades and numerous research contributions have been presented in this field. Face pose estimation techniques in literature have still some shortcomings and limitations in terms of accuracy, applicability to monocular images, being autonomous, identity and lighting variations, image resolution variations, range of face motion, computational expense, presence of facial hairs, presence of accessories like glasses and hats, etc. These shortcomings of existing face pose estimation techniques motivated the research work presented in this thesis. The main focus of this research is to design and develop novel face pose estimation algorithms that improve automatic face pose estimation in terms of processing time, computational expense, and invariance to different conditions

    Face pose estimation in monocular images

    Get PDF
    People use orientation of their faces to convey rich, inter-personal information. For example, a person will direct his face to indicate who the intended target of the conversation is. Similarly in a conversation, face orientation is a non-verbal cue to listener when to switch role and start speaking, and a nod indicates that a person has understands, or agrees with, what is being said. Further more, face pose estimation plays an important role in human-computer interaction, virtual reality applications, human behaviour analysis, pose-independent face recognition, driver s vigilance assessment, gaze estimation, etc. Robust face recognition has been a focus of research in computer vision community for more than two decades. Although substantial research has been done and numerous methods have been proposed for face recognition, there remain challenges in this field. One of these is face recognition under varying poses and that is why face pose estimation is still an important research area. In computer vision, face pose estimation is the process of inferring the face orientation from digital imagery. It requires a serious of image processing steps to transform a pixel-based representation of a human face into a high-level concept of direction. An ideal face pose estimator should be invariant to a variety of image-changing factors such as camera distortion, lighting condition, skin colour, projective geometry, facial hairs, facial expressions, presence of accessories like glasses and hats, etc. Face pose estimation has been a focus of research for about two decades and numerous research contributions have been presented in this field. Face pose estimation techniques in literature have still some shortcomings and limitations in terms of accuracy, applicability to monocular images, being autonomous, identity and lighting variations, image resolution variations, range of face motion, computational expense, presence of facial hairs, presence of accessories like glasses and hats, etc. These shortcomings of existing face pose estimation techniques motivated the research work presented in this thesis. The main focus of this research is to design and develop novel face pose estimation algorithms that improve automatic face pose estimation in terms of processing time, computational expense, and invariance to different conditions.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Face pose estimation in monocular images

    Get PDF
    People use orientation of their faces to convey rich, inter-personal information. For example, a person will direct his face to indicate who the intended target of the conversation is. Similarly in a conversation, face orientation is a non-verbal cue to listener when to switch role and start speaking, and a nod indicates that a person has understands, or agrees with, what is being said. Further more, face pose estimation plays an important role in human-computer interaction, virtual reality applications, human behaviour analysis, pose-independent face recognition, driver s vigilance assessment, gaze estimation, etc. Robust face recognition has been a focus of research in computer vision community for more than two decades. Although substantial research has been done and numerous methods have been proposed for face recognition, there remain challenges in this field. One of these is face recognition under varying poses and that is why face pose estimation is still an important research area. In computer vision, face pose estimation is the process of inferring the face orientation from digital imagery. It requires a serious of image processing steps to transform a pixel-based representation of a human face into a high-level concept of direction. An ideal face pose estimator should be invariant to a variety of image-changing factors such as camera distortion, lighting condition, skin colour, projective geometry, facial hairs, facial expressions, presence of accessories like glasses and hats, etc. Face pose estimation has been a focus of research for about two decades and numerous research contributions have been presented in this field. Face pose estimation techniques in literature have still some shortcomings and limitations in terms of accuracy, applicability to monocular images, being autonomous, identity and lighting variations, image resolution variations, range of face motion, computational expense, presence of facial hairs, presence of accessories like glasses and hats, etc. These shortcomings of existing face pose estimation techniques motivated the research work presented in this thesis. The main focus of this research is to design and develop novel face pose estimation algorithms that improve automatic face pose estimation in terms of processing time, computational expense, and invariance to different conditions.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Modelling the nonstationarity of speech in the maximum negentropy beamformer

    Get PDF
    State-of-the-art automatic speech recognition (ASR) systems can achieve very low word error rates (WERs) of below 5% on data recorded with headsets. However, in many situations such as ASR at meetings or in the car, far field microphones on the table, walls or devices such as laptops are preferable to microphones that have to be worn close to the user\u27s mouths. Unfortunately, the distance between speakers and microphones introduces significant noise and reverberation, and as a consequence the WERs of current ASR systems on this data tend to be unacceptably high (30-50% upwards). The use of a microphone array, i.e. several microphones, can alleviate the problem somewhat by performing spatial filtering: beamforming techniques combine the sensors\u27 output in a way that focuses the processing on a particular direction. Assuming that the signal of interest comes from a different direction than the noise, this can improve the signal quality and reduce the WER by filtering out sounds coming from non-relevant directions. Historically, array processing techniques developed from research on non-speech data, e.g. in the fields of sonar and radar, and as a consequence most techniques were not created to specifically address beamforming in the context of ASR. While this generality can be seen as an advantage in theory, it also means that these methods ignore characteristics which could be used to improve the process in a way that benefits ASR. An example of beamforming adapted to speech processing is the recently proposed maximum negentropy beamformer (MNB), which exploits the statistical characteristics of speech as follows. "Clean" headset speech differs from noisy or reverberant speech in its statistical distribution, which is much less Gaussian in the clean case. Since negentropy is a measure of non-Gaussianity, choosing beamformer weights that maximise the negentropy of the output leads to speech that is closer to clean speech in its distribution, and this in turn has been shown to lead to improved WERs [Kumatani et al., 2009]. In this thesis several refinements of the MNB algorithm are proposed and evaluated. Firstly, a number of modifications to the original MNB configuration are proposed based on theoretical or practical concerns. These changes concern the probability density function (pdf) used to model speech, the estimation of the pdf parameters, and the method of calculating the negentropy. Secondly, a further step is taken to reflect the characteristics of speech by introducing time-varying pdf parameters. The original MNB uses fixed estimates per utterance, which do not account for the nonstationarity of speech. Several time-dependent variance estimates are therefore proposed, beginning with a simple moving average window and including the HMM-MNB, which derives the variance estimate from a set of auxiliary hidden Markov models. All beamformer algorithms presented in this thesis are evaluated through far-field ASR experiments on the Multi-Channel Wall Street Journal Audio-Visual Corpus, a database of utterances captured with real far-field sensors, in a realistic acoustic environment, and spoken by real speakers. While the proposed methods do not lead to an improvement in ASR performance, a more efficient MNB algorithm is developed, and it is shown that comparable results can be achieved with significantly less data than all frames of the utterance, a result which is of particular relevance for real-time implementations.Automatische Spracherkennungssysteme können heutzutage sehr niedrige Wortfehlerraten (WER) unter 5% erreichen, wenn die Sprachdaten mit einem Headset oder anderem Nahbesprechungsmikrofon aufgezeichnet wurden. Allerdings hat das Tragen eines mundnahen Mikrofons in vielen Situationen, wie z.B. der Spracherkennung im Auto oder während einer Besprechung, praktische Nachteile, und ein auf dem Tisch, an der Wand oder am Laptop befestigtes Mikrofon wäre in dem Fall vorteilhaft. Bei einer größeren Distanz zwischen Mikrofon und Sprecher werden andererseits aber verstärkt Hintergrundgeräusche und Hall aufgenommen, wodurch die Wortfehlerraten häufig in einen unakzeptablen Bereich von 30—50% und höher steigen. Ein Mikrofonarray, d.h. eine Gruppe von Mikrofonen, kann hierbei durch räumliches Filtern in gewissem Maße Abhilfe schaffen: sogenannte Beamforming-Methoden können die Daten der einzelnen Sensoren so kombinieren, dass der Fokus auf eine bestimmte Richtung gerichtet wird. Wenn nun ein Zielsignal aus einer anderen Richtung als die Störgeräusche kommt, kann dieser Prozess die Signalqualität erhöhen und WER-Werte reduzieren, indem die Geräusche aus den nicht-relevanten Richtungen herausgefiltert werden. Da Beamforming-Techniken sich aus der Forschung an nicht-sprachlichen Daten wie Sonar und Radar entwickelt haben, sind die wenigsten Methoden in diesem Bereich speziell auf das Problem der Spracherkennung ausgerichtet. Während eine Anwendungsunabhängigkeit von Vorteil sein kann, bedeutet sie aber auch, dass Eigenschaften der Spracherkennung ignoriert werden, die zur Verbesserung des Ergebnisses genutzt werden könnten. Ein Beispiel für einen Beamforming-Algorithmus, der speziell für die Verarbeitung von Sprache entwickelt wurde, ist der Maximum Negentropy Beamformer (MNB). Der MNB nutzt die Tatsache, dass "saubere" Sprache, die mit einem Nahbesprechungsmikrofon aufgenommen wurde, eine andere Wahrscheinlichkeitsverteilung aufweist als verrauschte oder verhallte Sprache: Die Verteilung sauberer Sprache unterscheidet sich von der Normalverteilung sehr viel stärker als die von fern aufgezeichneter Sprache. Der MNB wählt Beamforming-Gewichte, die den Negentropy-Wert maximieren, und da Negentropy misst, wie sehr sich eine Verteilung von der Normalverteilung unterscheidet, ähnelt die vom MNB produzierte Sprache statistisch gesehen sauberer Sprache, was zu verbesserten WER-Werten geführt hat [Kumatani et al., 2009]. Das Thema dieser Dissertation ist die Entwicklung und Evaluierung von verschiedenen Modifikationen des MNB. Erstens wird eine Anzahl von praktisch und theoretisch motivierten Veränderungen vorgeschlagen, die die Form der Wahrscheinlichkeitsverteilung zur Sprachmodellierung, die Schätzung der Parameter dieser Verteilung und die Berechnung der Negentropy-Werte betreffen. Zweitens wird ein weiterer Schritt zur Berücksichtigung der Eigenschaften von Sprache unternommen, indem die Zeitabhängigkeit der Verteilungsparameter eingeführt wird; im ursprünglichen MNB-Algorithmus sind diese für eine Äußerung konstant, was im Gegensatz zur nicht-konstanten Eigenschaft von Sprache steht. Mehrere zeitabhängige Varianz-Schätzungmethoden werden beschrieben und evaluiert, von einem einfachen gleitenden Durchschnittswert bis zum komplexeren HMM-MNB, der die Varianz aus Hidden-Markov-Modellen ableitet. Alle Beamforming-Algorithmen, die in dieser Arbeit vorgestellt werden, werden durch Spracherkennungsexperimente mit dem Multi-Channel Wall Street Journal Audio-Visual Corpus evaluiert. Dieser Korpus wurde nicht durch Simulation erstellt, sondern besteht aus Äußerungen von Personen, die mit echten Sensoren in einer realistischen akustischen Umgebung aufgenommen wurden. Die Ergebnisse zeigen, dass mit den bisher entwickelten Methoden keine Verbesserung der Wortfehlerrate erreicht werden kann. Allerdings wurde ein effizienterer MNB-Algorithmus entwickelt, der vergleichbare Erkennungsraten mit deutlich weniger Sprachdaten erreichen kann, was vor allem für eine Echtzeitimplementierung relevant ist

    New insights into hierarchical clustering and linguistic normalization for speaker diarization

    Get PDF
    Face au volume croissant de données audio et multimédia, les technologies liées à l'indexation de données et à l'analyse de contenu ont suscité beaucoup d'intérêt dans la communauté scientifique. Parmi celles-ci, la segmentation et le regroupement en locuteurs, répondant ainsi à la question 'Qui parle quand ?' a émergé comme une technique de pointe dans la communauté de traitement de la parole. D'importants progrès ont été réalisés dans le domaine ces dernières années principalement menés par les évaluations internationales du NIST. Tout au long de ces évaluations, deux approches se sont démarquées : l'une est bottom-up et l'autre top-down. L'ensemble des systèmes les plus performants ces dernières années furent essentiellement des systèmes types bottom-up, cependant nous expliquons dans cette thèse que l'approche top-down comporte elle aussi certains avantages. En effet, dans un premier temps, nous montrons qu'après avoir introduit une nouvelle composante de purification des clusters dans l'approche top-down, nous obtenons des performances comparables à celles de l'approche bottom-up. De plus, en étudiant en détails les deux types d'approches nous montrons que celles-ci se comportent différemment face à la discrimination des locuteurs et la robustesse face à la composante lexicale. Ces différences sont alors exploitées au travers d'un nouveau système combinant les deux approches. Enfin, nous présentons une nouvelle technologie capable de limiter l'influence de la composante lexicale, source potentielle d'artefacts dans le regroupement et la segmentation en locuteurs. Notre nouvelle approche se nomme Phone Adaptive Training par analogie au Speaker Adaptive TrainingThe ever-expanding volume of available audio and multimedia data has elevated technologies related to content indexing and structuring to the forefront of research. Speaker diarization, commonly referred to as the who spoke when?' task, is one such example and has emerged as a prominent, core enabling technology in the wider speech processing research community. Speaker diarization involves the detection of speaker turns within an audio document (segmentation) and the grouping together of all same-speaker segments (clustering). Much progress has been made in the field over recent years partly spearheaded by the NIST Rich Transcription evaluations focus on meeting domain, in the proceedings of which are found two general approaches: top-down and bottom-up. Even though the best performing systems over recent years have all been bottom-up approaches we show in this thesis that the top-down approach is not without significant merit. Indeed we first introduce a new purification component leading to competitive performance to the bottom-up approach. Moreover, while investigating the two diarization approaches more thoroughly we show that they behave differently in discriminating between individual speakers and in normalizing unwanted acoustic variation, i.e.\ that which does not pertain to different speakers. This difference of behaviours leads to a new top-down/bottom-up system combination outperforming the respective baseline system. Finally, we introduce a new technology able to limit the influence of linguistic effects, responsible for biasing the convergence of the diarization system. Our novel approach is referred to as Phone Adaptive Training (PAT).PARIS-Télécom ParisTech (751132302) / SudocSudocFranceF
    corecore