41 research outputs found

    Spatial, Spectral, and Perceptual Nonlinear Noise Reduction for Hands-free Microphones in a Car

    Speech enhancement in an automobile is a challenging problem because interference can come from engine noise, fans, music, wind, road noise, reverberation, echo, and passengers engaging in other conversations. Hands-free microphones make the situation worse because the strength of the desired speech signal decreases with increased distance between the microphone and talker. Automobile safety is improved when the driver can use a hands-free interface to phones and other devices instead of taking his eyes off the road. The demand for high-quality hands-free communication in the automobile requires the introduction of more powerful algorithms. This thesis shows that a unique combination of five algorithms can achieve superior speech enhancement for a hands-free system when compared to beamforming or spectral subtraction alone. Several different designs were analyzed and tested before converging on the configuration that achieved the best results. Beamforming, voice activity detection, spectral subtraction, perceptual nonlinear weighting, and talker isolation via pitch tracking all work together in a complementary, iterative manner to create a speech enhancement system capable of significantly enhancing real-world speech signals. The following conclusions are supported by the simulation results using data recorded in a car and are in strong agreement with theory. Adaptive beamforming, like the Generalized Sidelobe Canceller (GSC), can be used effectively if the filters adapt only during silent data frames, because too much of the desired speech is cancelled otherwise. Spectral subtraction removes stationary noise, while perceptual weighting prevents the introduction of offensive audible noise artifacts. Talker isolation via pitch tracking can perform better when used after beamforming and spectral subtraction because of the higher accuracy obtained after initial noise removal. Iterating the algorithm once increases the accuracy of the Voice Activity Detection (VAD), which improves the overall performance of the algorithm. Placing the microphone(s) on the ceiling above the head and slightly forward of the desired talker appears to be the best location in an automobile, based on the experiments performed in this thesis. Objective speech quality measures show that the algorithm removes a majority of the stationary noise in a hands-free environment of an automobile with relatively minimal speech distortion.
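
    As a rough illustration of two of the stages described above, the sketch below gates a noise estimate with a simple energy-based voice activity detector and then applies magnitude spectral subtraction with a spectral floor standing in for the perceptual weighting. It is a minimal sketch with assumed parameter values, not the thesis's actual algorithm.

```python
# Minimal sketch of two of the stages above: an energy-based VAD gating the
# noise estimate, then magnitude spectral subtraction with a spectral floor.
# Parameter values (nperseg, oversub, floor) are illustrative assumptions.
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, fs, nperseg=512, oversub=2.0, floor=0.05):
    # STFT analysis
    _, _, Z = stft(noisy, fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)

    # Crude energy-based VAD: frames well below the median energy are "silence"
    frame_energy = np.sum(mag ** 2, axis=0)
    silence = frame_energy < 0.5 * np.median(frame_energy)

    # Noise spectrum estimated only on the silent frames
    noise_mag = (mag[:, silence] if silence.any() else mag).mean(axis=1, keepdims=True)

    # Over-subtraction with a spectral floor to limit musical-noise artifacts
    clean_mag = np.maximum(mag - oversub * noise_mag, floor * mag)

    # Resynthesis with the noisy phase
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs, nperseg=nperseg)
    return enhanced
```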

    GCC-PHAT based head orientation estimation

    This work presents a novel two-step algorithm to estimate the orientation of speakers in a smart-room environment equipped with microphone arrays. First, the position of the speaker is estimated by the SRP-PHAT algorithm, and the time delay of arrival for each microphone pair with respect to the detected position is computed. In the second step, the value of the cross-correlation at the estimated time delay is used as the fundamental characteristic from which to derive the speaker orientation. The proposed method performs consistently better than other state-of-the-art acoustic techniques on a purposely recorded database and on the CLEAR head pose database.
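
    The sketch below illustrates the GCC-PHAT cross-correlation for one microphone pair and the step of reading its value at the time delay implied by an estimated source position, which is the feature the paper builds its orientation estimate on. Function and variable names are illustrative, not taken from the paper.

```python
# Sketch of GCC-PHAT for one microphone pair and of reading the cross-
# correlation value at the delay predicted from an estimated source position
# (the orientation feature described above). Names are illustrative.
import numpy as np

def gcc_phat(x1, x2, fs):
    # Assumes two equal-length frames from one microphone pair
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12                       # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2]))   # center the zero lag
    lags = np.arange(-(n // 2), n // 2) / fs             # lags in seconds
    return lags, cc

def value_at_delay(lags, cc, tau):
    # Cross-correlation value at the TDOA implied by the estimated position
    return cc[np.argmin(np.abs(lags - tau))]
```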

    Acoustic event detection and localization using distributed microphone arrays

    Automatic acoustic scene analysis is a complex task that involves several functionalities: detection (time), localization (space), separation, recognition, etc. This thesis focuses on both acoustic event detection (AED) and acoustic source localization (ASL), when several sources may be simultaneously present in a room. In particular, the experimentation work is carried out with a meeting-room scenario. Unlike previous works that either employed models of all possible sound combinations or additionally used video signals, in this thesis the time-overlapping sound problem is tackled by exploiting the signal diversity that results from the usage of multiple microphone array beamformers. The core of this thesis work is a rather computationally efficient approach that consists of three processing stages. In the first, a set of (null) steering beamformers is used to carry out diverse partial signal separations, by using multiple arbitrarily located linear microphone arrays, each of them composed of a small number of microphones. In the second stage, each beamformer output goes through a classification step, which uses models for all the targeted sound classes (HMM-GMM, in the experiments). Then, in a third stage, the classifier scores, either intra- or inter-array, are combined using a probabilistic criterion (like MAP) or a machine learning fusion technique (fuzzy integral (FI), in the experiments). The above-mentioned processing scheme is applied in this thesis to a set of complexity-increasing problems, which are defined by the assumptions made regarding identities (plus time endpoints) and/or positions of sounds. In fact, the thesis report starts with the problem of unambiguously mapping the identities to the positions, continues with AED (positions assumed) and ASL (identities assumed), and ends with the integration of AED and ASL in a single system, which does not need any assumption about identities or positions. The evaluation experiments are carried out in a meeting-room scenario, where two sources are temporally overlapped; one of them is always speech and the other is an acoustic event from a pre-defined set. Two different databases are used: one produced by merging signals actually recorded in the UPC's department smart-room, and the other consisting of overlapping sound signals directly recorded in the same room in a rather spontaneous way. From the experimental results with a single array, it can be observed that the proposed detection system performs better than either the model-based system or a blind source separation based system. Moreover, the product rule based combination and the FI-based fusion of the scores resulting from the multiple arrays improve the accuracies further. On the other hand, the posterior position assignment is performed with a very small error rate. Regarding ASL, and assuming an accurate AED system output, the 1-source localization performance of the proposed system is slightly better than that of the widely used SRP-PHAT system working in an event-based mode, and it even performs significantly better than the latter in the more complex 2-source scenario. Finally, though the joint system suffers from a slight degradation in terms of classification accuracy with respect to the case where the source positions are known, it shows the advantage of carrying out the two tasks, recognition and localization, with a single system, and it allows the inclusion of information about the prior probabilities of the source positions. It is also worth noticing that, although the acoustic scenario used for experimentation is rather limited, the approach and its formalism were developed for a general case, where the number and identities of sources are not constrained.
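
    As a toy illustration of the third stage described above, the sketch below fuses per-array classifier posteriors with a simplified product rule (a MAP-style decision, optionally weighted by class priors). The scores, shapes, and prior handling are assumptions for illustration only, not the thesis's exact fusion scheme.

```python
# Toy sketch of the third stage: per-array class posteriors fused with a
# simplified product rule (a MAP-style decision, optionally weighted by class
# priors). Scores, shapes, and the prior handling are illustrative assumptions.
import numpy as np

def product_rule_fusion(posteriors, priors=None):
    """posteriors: (n_arrays, n_classes) array, each row summing to 1."""
    posteriors = np.asarray(posteriors, dtype=float)
    log_scores = np.log(posteriors + 1e-12).sum(axis=0)   # product rule in log domain
    if priors is not None:
        log_scores += np.log(np.asarray(priors) + 1e-12)  # optional class prior
    return int(np.argmax(log_scores)), log_scores

# Example: three arrays scoring four acoustic event classes
scores = [[0.6, 0.2, 0.1, 0.1],
          [0.5, 0.3, 0.1, 0.1],
          [0.4, 0.4, 0.1, 0.1]]
best_class, _ = product_rule_fusion(scores)   # -> 0
```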

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other applications that are able to operate in real-world environments, such as mobile communication services and smart homes.

    Multimodal assessment of emotional responses by physiological monitoring: novel auditory and visual elicitation strategies in traditional and virtual reality environments

    This doctoral thesis explores novel strategies to quantify emotions and listening effort through the monitoring of physiological signals. Emotions are a complex aspect of the human experience, playing a crucial role in our survival and adaptation to the environment. The study of emotions fosters important applications, such as Human-Computer and Human-Robot interaction or the clinical assessment and treatment of mental health conditions such as depression, anxiety, stress, chronic anger, and mood disorders. Listening effort is also an important area of study, as it provides insight into the listeners’ challenges that are usually not identified by traditional audiometric measures. The research is divided into three lines of work, each with a unique emphasis on the methods of emotion elicitation and the stimuli that are most effective in producing emotional responses, with a specific focus on auditory stimuli. The research fostered the creation of three experimental protocols, as well as the use of an available online protocol, for studying emotional responses, including monitoring of both peripheral and central physiological signals such as skin conductance, respiration, pupil dilation, electrocardiogram, blood volume pulse, and electroencephalography. An experimental protocol was created for the study of listening effort using a speech-in-noise test designed to be short and not to induce fatigue. The results revealed that listening effort is a complex problem that cannot be studied with a univariate approach, thus necessitating the use of multiple physiological markers to study different physiological dimensions. Specifically, the findings demonstrate a strong association between the level of auditory exertion and the amount of attention and involvement directed towards stimuli that are readily comprehensible compared with those that demand greater exertion. Continuing with the auditory domain, peripheral physiological signals were studied in order to discriminate four emotions elicited in a subject who listened to music for 21 days, using a previously designed and publicly available protocol. Surprisingly, the processed physiological signals were able to clearly separate the four emotions at the physiological level, demonstrating that music, which is not typically studied extensively in the literature, can be an effective stimulus for eliciting emotions. Following these results, a flat-screen protocol was created to compare physiological responses to purely visual, purely auditory, and combined audiovisual emotional stimuli. The results show that auditory stimuli are more effective in separating emotions at the physiological level, and the subjects were found to be much more attentive during the audio-only phase. In order to overcome the limitations of emotional protocols carried out in a laboratory environment, which may elicit fewer emotions due to being an unnatural setting for the subjects under study, a final emotional elicitation protocol was created using virtual reality. Scenes similar to reality were created to elicit four distinct emotions, and at the physiological level it was noted that this environment is more effective in eliciting emotions. To our knowledge, this is the first protocol specifically designed for virtual reality that elicits diverse emotions. Furthermore, even in terms of classification, the use of virtual reality has been shown to be superior to traditional flat-screen protocols, opening the door to virtual reality for the study of conditions related to emotional control.
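
    As a loose illustration of the multivariate approach described above, the sketch below extracts a few simple features from two peripheral signals (skin conductance and ECG) over one stimulus window. The feature choices, thresholds, and the suggested classifier are assumptions for illustration, not the thesis's actual pipeline.

```python
# Loose sketch of the multivariate idea above: a few simple features from two
# peripheral signals over one stimulus window. Feature choices, thresholds, and
# the suggested classifier are assumptions, not the thesis's actual pipeline.
import numpy as np
from scipy.signal import find_peaks

def window_features(eda, ecg, fs):
    # Skin conductance level and variability over the window
    eda_mean, eda_std = float(np.mean(eda)), float(np.std(eda))

    # Crude heart rate and HRV proxy from ECG R-peaks (assumes a clean signal)
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs), height=np.mean(ecg))
    rr = np.diff(peaks) / fs                         # inter-beat intervals (s)
    hr = 60.0 / np.mean(rr) if rr.size > 1 else 0.0
    hrv = float(np.std(rr)) if rr.size > 1 else 0.0

    return [eda_mean, eda_std, hr, hrv]

# One feature vector per stimulus window could then feed any standard
# classifier (e.g. an SVM) trained on the elicited-emotion labels.
```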