25 research outputs found

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    Computational Intelligence and Human- Computer Interaction: Modern Methods and Applications

    Get PDF
    The present book contains all of the articles that were accepted and published in the Special Issue of MDPI’s journal Mathematics titled "Computational Intelligence and Human–Computer Interaction: Modern Methods and Applications". This Special Issue covered a wide range of topics connected to the theory and application of different computational intelligence techniques to the domain of human–computer interaction, such as automatic speech recognition, speech processing and analysis, virtual reality, emotion-aware applications, digital storytelling, natural language processing, smart cars and devices, and online learning. We hope that this book will be interesting and useful for those working in various areas of artificial intelligence, human–computer interaction, and software engineering as well as for those who are interested in how these domains are connected in real-life situations

    Robust Distributed Multi-Source Detection and Labeling in Wireless Acoustic Sensor Networks

    Get PDF
    The growing demand in complex signal processing methods associated with low-energy large scale wireless acoustic sensor networks (WASNs) urges the shift to a new information and communication technologies (ICT) paradigm. The emerging research perception aspires for an appealing wireless network communication where multiple heterogeneous devices with different interests can cooperate in various signal processing tasks (MDMT). Contributions in this doctoral thesis focus on distributed multi-source detection and labeling applied to audio enhancement scenarios pursuing an MDMT fashioned node-specific source-of-interest signal enhancement in WASNs. In fact, an accurate detection and labeling is a pre-requisite to pursue the MDMT paradigm where nodes in the WASN communicate effectively their sources-of-interest and, therefore, multiple signal processing tasks can be enhanced via cooperation. First, a novel framework based on a dominant source model in distributed WASNs for resolving the activity detection of multiple speech sources in a reverberant and noisy environment is introduced. A preliminary rank-one multiplicative non-negative independent component analysis (M-NICA) for unique dominant energy source extraction given associated node clusters is presented. Partitional algorithms that minimize the within-cluster mean absolute deviation (MAD) and weighted MAD objectives are proposed to determine the cluster membership of the unmixed energies, and thus establish a source specific voice activity recognition. In a second study, improving the energy signal separation to alleviate the multiple source activity discrimination task is targeted. Sparsity inducing penalties are enforced on iterative rank-one singular value decomposition layers to extract sparse right rotations. Then, sparse non-negative blind energy separation is realized using multiplicative updates. Hence, the multiple source detection problem is converted into a sparse non-negative source energy decorrelation. Sparsity tunes the supposedly non-active energy signatures to exactly zero-valued energies so that it is easier to identify active energies and an activity detector can be constructed in a straightforward manner. In a centralized scenario, the activity decision is controlled by a fusion center that delivers the binary source activity detection for every participating energy source. This strategy gives precise detection results for small source numbers. With a growing number of interfering sources, the distributed detection approach is more promising. Conjointly, a robust distributed energy separation algorithm for multiple competing sources is proposed. A robust and regularized tνMt_{\nu}M-estimation of the covariance matrix of the mixed energies is employed. This approach yields a simple activity decision using only the robustly unmixed energy signatures of the sources in the WASN. The performance of the robust activity detector is validated with a distributed adaptive node-specific signal estimation method for speech enhancement. The latter enhances the quality and intelligibility of the signal while exploiting the accurately estimated multi-source voice decision patterns. In contrast to the original M-NICA for source separation, the extracted binary activity patterns with the robust energy separation significantly improve the node-specific signal estimation. Due to the increased computational complexity caused by the additional step of energy signal separation, a new approach to solving the detection question of multi-device multi-source networks is presented. Stability selection for iterative extraction of robust right singular vectors is considered. The sub-sampling selection technique provides transparency in properly choosing the regularization variable in the Lasso optimization problem. In this way, the strongest sparse right singular vectors using a robust ℓ1\ell_1-norm and stability selection are the set of basis vectors that describe the input data efficiently. Active/non-active source classification is achieved based on a robust Mahalanobis classifier. For this, a robust MM-estimator of the covariance matrix in the Mahalanobis distance is utilized. Extensive evaluation in centralized and distributed settings is performed to assess the effectiveness of the proposed approach. Thus, overcoming the computationally demanding source separation scheme is possible via exploiting robust stability selection for sparse multi-energy feature extraction. With respect to the labeling problem of various sources in a WASN, a robust approach is introduced that exploits the direction-of-arrival of the impinging source signals. A short-time Fourier transform-based subspace method estimates the angles of locally stationary wide band signals using a uniform linear array. The median of angles estimated at every frequency bin is utilized to obtain the overall angle for each participating source. The features, in this case, exploit the similarity across devices in the particular frequency bins that produce reliable direction-of-arrival estimates for each source. Reliability is defined with respect to the median across frequencies. All source-specific frequency bands that contribute to correct estimated angles are selected. A feature vector is formed for every source at each device by storing the frequency bin indices that lie within the upper and lower interval of the median absolute deviation scale of the estimated angle. Labeling is accomplished by a distributed clustering of the extracted angle-based feature vectors using consensus averaging

    Acoustic event detection and localization using distributed microphone arrays

    Get PDF
    Automatic acoustic scene analysis is a complex task that involves several functionalities: detection (time), localization (space), separation, recognition, etc. This thesis focuses on both acoustic event detection (AED) and acoustic source localization (ASL), when several sources may be simultaneously present in a room. In particular, the experimentation work is carried out with a meeting-room scenario. Unlike previous works that either employed models of all possible sound combinations or additionally used video signals, in this thesis, the time overlapping sound problem is tackled by exploiting the signal diversity that results from the usage of multiple microphone array beamformers. The core of this thesis work is a rather computationally efficient approach that consists of three processing stages. In the first, a set of (null) steering beamformers is used to carry out diverse partial signal separations, by using multiple arbitrarily located linear microphone arrays, each of them composed of a small number of microphones. In the second stage, each of the beamformer output goes through a classification step, which uses models for all the targeted sound classes (HMM-GMM, in the experiments). Then, in a third stage, the classifier scores, either being intra- or inter-array, are combined using a probabilistic criterion (like MAP) or a machine learning fusion technique (fuzzy integral (FI), in the experiments). The above-mentioned processing scheme is applied in this thesis to a set of complexity-increasing problems, which are defined by the assumptions made regarding identities (plus time endpoints) and/or positions of sounds. In fact, the thesis report starts with the problem of unambiguously mapping the identities to the positions, continues with AED (positions assumed) and ASL (identities assumed), and ends with the integration of AED and ASL in a single system, which does not need any assumption about identities or positions. The evaluation experiments are carried out in a meeting-room scenario, where two sources are temporally overlapped; one of them is always speech and the other is an acoustic event from a pre-defined set. Two different databases are used, one that is produced by merging signals actually recorded in the UPCÂżs department smart-room, and the other consists of overlapping sound signals directly recorded in the same room and in a rather spontaneous way. From the experimental results with a single array, it can be observed that the proposed detection system performs better than either the model based system or a blind source separation based system. Moreover, the product rule based combination and the FI based fusion of the scores resulting from the multiple arrays improve the accuracies further. On the other hand, the posterior position assignment is performed with a very small error rate. Regarding ASL and assuming an accurate AED system output, the 1-source localization performance of the proposed system is slightly better than that of the widely-used SRP-PHAT system, working in an event-based mode, and it even performs significantly better than the latter one in the more complex 2-source scenario. Finally, though the joint system suffers from a slight degradation in terms of classification accuracy with respect to the case where the source positions are known, it shows the advantage of carrying out the two tasks, recognition and localization, with a single system, and it allows the inclusion of information about the prior probabilities of the source positions. It is worth noticing also that, although the acoustic scenario used for experimentation is rather limited, the approach and its formalism were developed for a general case, where the number and identities of sources are not constrained

    Contextual Social Networking

    Get PDF
    The thesis centers around the multi-faceted research question of how contexts may be detected and derived that can be used for new context aware Social Networking services and for improving the usefulness of existing Social Networking services, giving rise to the notion of Contextual Social Networking. In a first foundational part, we characterize the closely related fields of Contextual-, Mobile-, and Decentralized Social Networking using different methods and focusing on different detailed aspects. A second part focuses on the question of how short-term and long-term social contexts as especially interesting forms of context for Social Networking may be derived. We focus on NLP based methods for the characterization of social relations as a typical form of long-term social contexts and on Mobile Social Signal Processing methods for deriving short-term social contexts on the basis of geometry of interaction and audio. We furthermore investigate, how personal social agents may combine such social context elements on various levels of abstraction. The third part discusses new and improved context aware Social Networking service concepts. We investigate special forms of awareness services, new forms of social information retrieval, social recommender systems, context aware privacy concepts and services and platforms supporting Open Innovation and creative processes. This version of the thesis does not contain the included publications because of copyrights of the journals etc. Contact in terms of the version with all included publications: Georg Groh, [email protected] zentrale Gegenstand der vorliegenden Arbeit ist die vielschichtige Frage, wie Kontexte detektiert und abgeleitet werden können, die dazu dienen können, neuartige kontextbewusste Social Networking Dienste zu schaffen und bestehende Dienste in ihrem Nutzwert zu verbessern. Die (noch nicht abgeschlossene) erfolgreiche Umsetzung dieses Programmes fĂĽhrt auf ein Konzept, das man als Contextual Social Networking bezeichnen kann. In einem grundlegenden ersten Teil werden die eng zusammenhängenden Gebiete Contextual Social Networking, Mobile Social Networking und Decentralized Social Networking mit verschiedenen Methoden und unter Fokussierung auf verschiedene Detail-Aspekte näher beleuchtet und in Zusammenhang gesetzt. Ein zweiter Teil behandelt die Frage, wie soziale Kurzzeit- und Langzeit-Kontexte als fĂĽr das Social Networking besonders interessante Formen von Kontext gemessen und abgeleitet werden können. Ein Fokus liegt hierbei auf NLP Methoden zur Charakterisierung sozialer Beziehungen als einer typischen Form von sozialem Langzeit-Kontext. Ein weiterer Schwerpunkt liegt auf Methoden aus dem Mobile Social Signal Processing zur Ableitung sinnvoller sozialer Kurzzeit-Kontexte auf der Basis von Interaktionsgeometrien und Audio-Daten. Es wird ferner untersucht, wie persönliche soziale Agenten Kontext-Elemente verschiedener Abstraktionsgrade miteinander kombinieren können. Der dritte Teil behandelt neuartige und verbesserte Konzepte fĂĽr kontextbewusste Social Networking Dienste. Es werden spezielle Formen von Awareness Diensten, neue Formen von sozialem Information Retrieval, Konzepte fĂĽr kontextbewusstes Privacy Management und Dienste und Plattformen zur UnterstĂĽtzung von Open Innovation und Kreativität untersucht und vorgestellt. Diese Version der Habilitationsschrift enthält die inkludierten Publikationen zurVermeidung von Copyright-Verletzungen auf Seiten der Journals u.a. nicht. Kontakt in Bezug auf die Version mit allen inkludierten Publikationen: Georg Groh, [email protected]

    Multimedia Context Awareness for Smart Mobile Environments

    Get PDF
    openNowadays the development of the IoT framework and the resulting huge number of smart connected devices opens the door to exploit the presence of multiple smart nodes to accomplish a variety of tasks. Multimedia context awareness, together with the concept of ambient intelligence, is tightly related to the IoT framework, and it can be applied to a large number of smart scenarios. In this thesis, the aim is to study and analyze the role of context awareness in different applications related to smart mobile environments, such as future smart spaces and connected cities. Indeed, this research work focuses on different aspects of ambient intelligence, such as audio-awareness and wireless-awareness. In particular, this thesis tackles two main research topics: the first one, related to the framework of audio-awareness, concerns a multiple observations approach for smart speaker recognition in mobile environments; the second one, tied to the concept of wireless-awareness, regards Unmanned Aerial Vehicle (UAV) detection based on WiFi statistical fingerprint analysis.openXXXI CICLO - SC. E TECN. ING. ELETTR. E DELLE TEL. - Ambienti cognitivi interattiviGaribotto, Chiar

    Sound Source Localization and Modeling: Spherical Harmonics Domain Approaches

    Get PDF
    Sound source localization has been an important research topic in the acoustic signal processing community because of its wide use in many acoustic applications, including speech separation, speech enhancement, sound event detection, automatic speech recognition, automated camera steering, and virtual reality. In the recent decade, there is a growing interest in the research of sound source localization using higher-order microphone arrays, which are capable of recording and analyzing the soundfield over a target spatial area. This thesis studies a novel source feature called the relative harmonic coefficient, that easily estimated from the higher-order microphone measurements. This source feature has direct applications for sound source localization due to its sole dependence on the source position. This thesis proposes two novel sound source localization algorithms using the relative harmonic coefficients: (i) a low-complexity single source localization approach that localizes the source' elevation and azimuth separately. This approach is also appliable to acoustic enhancement for the higher-order microphone array recordings; (ii) a semi-supervised multi-source localization algorithm in a noisy and reverberant environment. Although this approach uses a learning schema, it still has a strong potential to be implemented in practice because only a limited number of labeled measurements are required. However, this algorithm has an inherent limitation as it requires the availability of single-source components. Thus, it is unusable in scenarios where the original recordings have limited single-source components (e.g., multiple sources simultaneously active). To address this issue, we develop a novel MUSIC framework based approach that directly uses simultaneous multi-source recordings. This developed MUSIC approach uses robust measurements of relative sound pressure from the higher-order microphone and is shown to be more suitable in noisy environments than the traditional MUSIC method. While the proposed approaches address the source localization problems, in practice, the broader problem of source localization has some more common challenges, which have received less attention. One such challenge is the common assumption of the sound sources being omnidirectional, which is hardly the case with a typical commercial loudspeaker. Therefore, in this thesis, we analyze the broader problem of analyzing directional characteristics of the commercial loudspeakers by deriving equivalent theoretical acoustic models. Several acoustic models are investigated, including plane waves decomposition, point source decomposition, and mixed source decomposition. We finally conduct extensive experimental examinations to see which acoustic model has more similar characteristics with commercial loudspeakers

    Pre-processing of Speech Signals for Robust Parameter Estimation

    Get PDF
    corecore