26 research outputs found

    Bird species recognition using unsupervised modeling of individual vocalization elements


    ROBUST SPEAKER RECOGNITION BASED ON LATENT VARIABLE MODELS

    Automatic speaker recognition in uncontrolled environments is a very challenging task due to channel distortions, additive noise, and reverberation. To address these issues, this thesis studies probabilistic latent variable models of short-term spectral information that leverage large amounts of data to achieve robustness in challenging conditions. Current speaker recognition systems represent an entire speech utterance as a single point in a high-dimensional space, a representation known as a "supervector". This thesis starts by analyzing the properties of this representation. A novel procedure for visualizing supervectors is presented, providing qualitative insight into the information they capture. We then propose the use of an overcomplete dictionary to explicitly decompose a supervector into a speaker-specific component and an undesired-variability component. An algorithm to learn the dictionary from a large collection of data is discussed and analyzed. A subset of the dictionary entries is learned to represent speaker-specific information and another subset to represent distortions. After encoding the supervector as a linear combination of the dictionary entries, the undesired variability is removed by discarding the contribution of the distortion components. This paradigm is closely related to the previously proposed Joint Factor Analysis modeling of supervectors. We establish a connection between the two approaches and show how our proposed method provides improvements in both computation and recognition accuracy. An alternative way to handle undesired variability in supervector representations is to first project them into a lower-dimensional space and then model them in the reduced subspace. This low-dimensional projection is known as an "i-vector". Unfortunately, i-vectors exhibit non-Gaussian behavior, and direct statistical modeling requires the use of heavy-tailed distributions for optimal performance. These approaches lack closed-form solutions and are therefore hard to analyze; moreover, they do not scale well to large datasets. Instead of directly modeling i-vectors, we propose to first apply a non-linear transformation and then use a linear-Gaussian model. We present two alternative transformations and show experimentally that the transformed i-vectors can be optimally modeled by a simple linear-Gaussian model (factor analysis). We evaluate our method on a benchmark dataset with a large amount of channel variability and show that the results compare favorably with competing approaches. Moreover, our approach has closed-form solutions and scales gracefully to large datasets. Finally, a multi-classifier architecture trained in a multicondition fashion is proposed to address the problem of speaker recognition in the presence of additive noise. A large number of experiments are conducted to analyze the proposed architecture and to obtain guidelines for optimal performance in noisy environments. Overall, it is shown that multicondition training of multi-classifier architectures not only provides strong robustness in the anticipated conditions, but also generalizes well to unseen conditions.
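
    As an illustrative aside (not taken from the thesis itself), the following minimal Python sketch shows one plausible instance of the described pipeline: a non-linear transformation of i-vectors, here assumed to be length normalization, followed by a simple shared-covariance linear-Gaussian back-end that stands in for the factor-analysis model. The choice of transformation, the array shapes, and all function names are assumptions made for illustration.

    import numpy as np

    def length_normalize(ivectors, eps=1e-12):
        # Assumed non-linear transformation: project each i-vector onto the unit hypersphere.
        norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
        return ivectors / np.maximum(norms, eps)

    def fit_gaussian_backend(ivectors, labels):
        # Minimal linear-Gaussian back-end: one mean per speaker, one shared covariance.
        classes = np.unique(labels)
        means = np.stack([ivectors[labels == c].mean(axis=0) for c in classes])
        centered = ivectors - means[np.searchsorted(classes, labels)]
        cov = centered.T @ centered / len(ivectors)
        return classes, means, cov

    # Usage with synthetic data (100 utterances, 400-dimensional i-vectors, 10 speakers):
    rng = np.random.default_rng(0)
    x, y = rng.normal(size=(100, 400)), rng.integers(0, 10, size=100)
    classes, means, cov = fit_gaussian_backend(length_normalize(x), y)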

    A survey on artificial intelligence-based acoustic source identification

    The concept of Acoustic Source Identification (ASI), which refers to the process of identifying noise sources, has attracted increasing attention in recent years. ASI technology can be used for surveillance, monitoring, and maintenance applications across a wide range of sectors, such as defence, manufacturing, healthcare, and agriculture. Acoustic signature analysis and pattern recognition remain the core technologies for noise source identification. Manual identification of acoustic signatures, however, has become increasingly challenging as dataset sizes grow. As a result, the use of Artificial Intelligence (AI) techniques for identifying noise sources has become increasingly relevant and useful. In this paper, we provide a comprehensive review of AI-based acoustic source identification techniques. We analyze the strengths and weaknesses of AI-based ASI processes and the associated methods proposed in the literature. Additionally, we present a detailed survey of ASI applications in machinery, underwater settings, environment/event source recognition, healthcare, and other fields. We also highlight relevant research directions.

    Advanced Biometrics with Deep Learning

    Biometrics, such as fingerprint, iris, face, handprint, hand-vein, speech, and gait recognition, have become a commonplace means of identity management in a wide variety of applications. Biometric systems follow a typical pipeline composed of separate preprocessing, feature extraction, and classification stages. Deep learning, as a data-driven representation-learning approach, has been shown to be a promising alternative to conventional data-agnostic, handcrafted preprocessing and feature extraction for biometric systems. Furthermore, deep learning offers an end-to-end learning paradigm that unifies preprocessing, feature extraction, and recognition based solely on the biometric data. This Special Issue has collected 12 high-quality, state-of-the-art research papers that deal with challenging issues in advanced biometric systems based on deep learning. The 12 papers can be divided into four categories according to biometric modality: face biometrics, medical electronic signals (EEG and ECG), voiceprint, and others.

    A speaker classification framework for non-intrusive user modeling : speech-based personalization of in-car services

    Speaker Classification, i.e. the automatic detection of certain characteristics of a person based on his or her voice, has a variety of applications in modern computer technology and artificial intelligence: as a non-intrusive source for user modeling, it can be employed to personalize human-machine interfaces in numerous domains. This dissertation presents a principled approach to the design of a novel Speaker Classification system for automatic age and gender recognition that meets these demands. Based on literature studies, methods and concepts dealing with the underlying pattern recognition task are developed. The final system consists of an incremental GMM-SVM supervector architecture with several optimizations. An extensive data-driven experiment series explores the parameter space and serves as an evaluation of the component. Further experiments investigate the language independence of the approach. As an essential part of this thesis, a framework is developed that implements all tasks associated with the design and evaluation of Speaker Classification in an integrated development environment capable of generating efficient runtime modules for multiple platforms. Applications from the automotive field and other domains demonstrate the practical benefit of the technology for personalization, e.g. by increasing the lead time of local danger warnings for elderly drivers.
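
    As a purely illustrative sketch (not the dissertation's implementation), the Python code below outlines the general GMM-SVM supervector idea the abstract refers to: a universal background model is trained on pooled MFCC frames, its means are MAP-adapted to each utterance and stacked into a supervector, and the fixed-length supervectors feed an SVM for age/gender classification. The use of scikit-learn, the parameter values, and all function names are assumptions.

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.svm import SVC

    def train_ubm(pooled_frames, n_components=64):
        # Universal background model over MFCC frames pooled from many speakers.
        ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
        ubm.fit(pooled_frames)
        return ubm

    def gmm_supervector(ubm, frames, relevance=16.0):
        # MAP-adapt the UBM means to one utterance and stack them into a supervector.
        post = ubm.predict_proba(frames)              # (n_frames, n_components)
        counts = post.sum(axis=0)                     # soft occupation counts
        first_order = post.T @ frames                 # first-order statistics
        alpha = (counts / (counts + relevance))[:, None]
        adapted = alpha * (first_order / np.maximum(counts[:, None], 1e-8)) \
                  + (1 - alpha) * ubm.means_
        return adapted.ravel()

    # The supervectors then serve as fixed-length SVM inputs, e.g.:
    # svm = SVC(kernel="linear").fit(np.stack(list_of_supervectors), age_gender_labels)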

    Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge

    More than a decade has passed since research on the automatic recognition of emotion from speech became a new field of research alongside its 'big brothers', speech and speaker recognition. This article attempts to provide a short overview of where we are today, how we got there, and what this can tell us about where to go next and how we might get there. In the first part, we address the basic phenomenon, reflecting on the last fifteen years and commenting on databases, modelling and annotation, the unit of analysis, and prototypicality. We then shift to automatic processing, including discussions of features, classification, robustness, evaluation, and implementation and system integration. From there we move to the first comparative challenge on emotion recognition from speech, the INTERSPEECH 2009 Emotion Challenge organised by (some of) the authors, describing the Challenge's database, Sub-Challenges, participants and their approaches, the winners, and the fusion of results, as well as the lessons actually learnt, before finally addressing the long-standing problems and promising future directions. (C) 2011 Elsevier B.V. All rights reserved. Schuller B., Batliner A., Steidl S., Seppi D., "Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge", Speech Communication, vol. 53, no. 9-10, pp. 1062-1087, November 2011.

    Deep Learning for Action and Gesture Recognition in Image Sequences: A Survey

    Interest in automatic action and gesture recognition has grown considerably in the last few years, due in part to the large number of application domains for this type of technology. As in many other computer vision areas, deep learning-based methods have quickly become a reference methodology for obtaining state-of-the-art performance in both tasks. This chapter is a survey of current deep learning-based methodologies for action and gesture recognition in sequences of images. The survey reviews both fundamental and cutting-edge methodologies reported in the last few years. We introduce a taxonomy that summarizes important aspects of deep learning for approaching both tasks. Details of the proposed architectures, fusion strategies, main datasets, and competitions are reviewed. We also summarize and discuss the main works proposed so far, with particular interest in how they treat the temporal dimension of the data, their salient features, and the opportunities and challenges for future research. To the best of our knowledge, this is the first survey on the topic. We expect this survey to become a reference in this ever-dynamic field of research.

    Application of automatic speech recognition technologies to singing

    The research field of Music Information Retrieval is concerned with the automatic analysis of musical characteristics. One aspect that has not received much attention so far is the automatic analysis of sung lyrics. The field of Automatic Speech Recognition, on the other hand, has produced many methods for the automatic analysis of speech, but these have rarely been applied to singing. This thesis analyzes the feasibility of applying various speech recognition methods to singing and suggests adaptations. In addition, routes to practical applications of these systems are described. Five tasks are considered: phoneme recognition, language identification, keyword spotting, lyrics-to-audio alignment, and retrieval of lyrics from sung queries. The main bottleneck in almost all of these tasks lies in the recognition of phonemes from sung audio. Conventional models trained on speech do not perform well when applied to singing, and training models on singing is difficult due to a lack of annotated data. This thesis offers two approaches for generating such data sets. In the first, speech recordings are made more "song-like"; in the second, textual lyrics are automatically aligned to an existing singing data set. In both cases, the new data sets are then used to train new acoustic models, offering considerable improvements over models trained on speech. Building on these improved acoustic models, speech recognition algorithms for the individual tasks were adapted to singing either by improving their robustness to the differing characteristics of singing or by exploiting specific features of singing performances. Examples of improved robustness include the use of keyword-filler HMMs for keyword spotting, an i-vector approach for language identification, and a method for alignment and lyrics retrieval that allows highly varying durations. Features of singing are exploited in various ways: in an approach for language identification that is well suited to long recordings; in a method for keyword spotting based on phoneme durations in singing; and in an algorithm for alignment and retrieval that exploits known phoneme confusions in singing.
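
    As one illustrative possibility (not the thesis's exact method), the sketch below shows a simple dynamic-programming alignment of a lyrics phoneme sequence to frame-level phoneme posteriors in which phoneme durations are left unconstrained, so long sung notes simply map many frames onto the same phoneme. The input formats and function names are assumed; the posteriors would come from an acoustic model such as the singing-adapted models described above.

    import numpy as np

    def align_lyrics(posteriors, phoneme_ids):
        # posteriors  : (n_frames, n_phonemes) frame-wise phoneme probabilities
        # phoneme_ids : phoneme indices spelling out the lyrics, in order
        # Returns one phoneme index per frame (monotonic forced alignment).
        n_frames, n_states = len(posteriors), len(phoneme_ids)
        log_post = np.log(posteriors[:, phoneme_ids] + 1e-10)
        score = np.full((n_frames, n_states), -np.inf)
        back = np.zeros((n_frames, n_states), dtype=int)
        score[0, 0] = log_post[0, 0]
        for t in range(1, n_frames):
            for s in range(n_states):
                stay = score[t - 1, s]
                advance = score[t - 1, s - 1] if s > 0 else -np.inf
                back[t, s] = 0 if stay >= advance else 1
                score[t, s] = max(stay, advance) + log_post[t, s]
        # Backtrace from the last lyrics phoneme at the last frame.
        path, s = [], n_states - 1
        for t in range(n_frames - 1, -1, -1):
            path.append(phoneme_ids[s])
            s -= back[t, s]
        return path[::-1]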

    Deliverable D1.4 Visual, text and audio information analysis for hypervideo, final release

    Having extensively evaluated the performance of the technologies included in the first release of the WP1 multimedia analysis tools, using content from the LinkedTV scenarios and through participation in international benchmarking activities, concrete decisions were made regarding the appropriateness and importance of each individual method or combination of methods. Combined with an updated list of information needs for each scenario, these decisions led to a new set of analysis requirements to be addressed by the final release of the WP1 analysis techniques. To this end, coordinated efforts in three directions, namely (a) improving a number of methods in terms of accuracy and time efficiency, (b) developing new technologies, and (c) defining synergies between methods to obtain new types of information via multimodal processing, resulted in the final set of multimedia analysis methods for video hyperlinking. Moreover, the different analysis modules developed have been integrated into a web-based infrastructure, allowing the multitude of WP1 technologies to be linked fully automatically with the overall LinkedTV platform.