109 research outputs found

    Comparison for Improvements of Singing Voice Detection System Based on Vocal Separation

    Full text link
    Singing voice detection is the task of identifying whether a given frame contains singing vocals. It has been one of the main components in music information retrieval (MIR), and is applicable to melody extraction, artist recognition, and music discovery in popular music. Although several methods have been proposed, a more robust and more complete system is desired to improve detection performance. In this paper, our motivation is to provide an extensive comparison of the different stages of singing voice detection. Based on this analysis, a novel method is proposed to build a more efficient singing voice detection system. The proposed system has three main parts. The first is a pre-processing stage of singing voice separation that extracts the vocal from the music; the improvements from several singing voice separation methods were compared to decide which one to integrate into the singing voice detection system. The second is a deep neural network based classifier that labels the given frames; different deep models for classification were also compared. The last is a post-processing stage that filters out anomalous frames in the classifier's predictions; a median filter and a Hidden Markov Model (HMM) based filter were compared for this step. Through this step-by-step module extension, the different methods were compared and analyzed. Finally, classification performance on two public datasets indicates that the proposed approach, based on the Long-term Recurrent Convolutional Networks (LRCN) model, is a promising alternative. Comment: 15 pages
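
    To make the post-processing step concrete, below is a minimal sketch (not from the paper) of median-filter smoothing applied to frame-level vocal/non-vocal predictions; the threshold and kernel size are illustrative assumptions.

```python
# Hypothetical sketch of median-filter post-processing on per-frame
# vocal probabilities; threshold and kernel size are assumed values.
import numpy as np
from scipy.signal import medfilt

def smooth_predictions(frame_probs, threshold=0.5, kernel_size=5):
    """Binarise per-frame vocal probabilities, then remove isolated
    anomalous frames with a median filter."""
    labels = (np.asarray(frame_probs) >= threshold).astype(float)
    return medfilt(labels, kernel_size=kernel_size).astype(int)

# A lone low-probability frame inside a vocal segment is smoothed away:
probs = [0.9, 0.8, 0.9, 0.85, 0.2, 0.9, 0.8, 0.9, 0.95, 0.9, 0.85, 0.9]
print(smooth_predictions(probs))  # all ones after smoothing
```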

    EigenScape : A Database of Spatial Acoustic Scene Recordings

    Get PDF
    The classification of acoustic scenes and events is an emerging area of research in the field of machine listening. Most of the research conducted so far uses spectral features extracted from monaural or stereophonic audio rather than spatial features extracted from multichannel recordings. This is partly due to the lack, thus far, of a substantial body of spatial recordings of acoustic scenes. This paper formally introduces EigenScape, a new database of fourth-order Ambisonic recordings of eight different acoustic scene classes. The potential applications of a spatial machine listening system are discussed before detailed information on the recording process and dataset is provided. A baseline spatial classification system using directional audio coding (DirAC) techniques is detailed and results from this classifier are presented. The classifier is shown to give good overall scene classification accuracy across the dataset: 7 of the 8 scenes are classified with greater than 60% accuracy, and overall accuracy is 11% higher than when Mel-frequency cepstral coefficient (MFCC) features are used. Further analysis of the results shows potential improvements to the classifier. It is concluded that the results validate the new database and show that spatial features can characterise acoustic scenes and as such are worthy of further investigation.
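
    As an illustration of the kind of spatial feature a DirAC-style baseline relies on, the hedged sketch below (not the authors' code) estimates a per-frame direction of arrival from the first-order (W, X, Y, Z) channels of an Ambisonic recording via the active intensity vector; a full DirAC analysis operates per frequency band, and the frame parameters here are assumed.

```python
# Illustrative, simplified broadband estimate of per-frame direction of
# arrival from first-order Ambisonic channels, using the active intensity
# vector that underlies DirAC-style spatial features.
import numpy as np

def intensity_doa(w, x, y, z, frame_len=1024, hop=512):
    """Return per-frame azimuth and elevation (radians); frame_len and hop
    are assumed analysis parameters."""
    azimuths, elevations = [], []
    for start in range(0, len(w) - frame_len + 1, hop):
        sl = slice(start, start + frame_len)
        # Active intensity ~ pressure (W) times particle velocity (X, Y, Z).
        ix, iy, iz = (np.mean(w[sl] * c[sl]) for c in (x, y, z))
        azimuths.append(np.arctan2(iy, ix))
        elevations.append(np.arctan2(iz, np.hypot(ix, iy)))
    return np.array(azimuths), np.array(elevations)
```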

    An Imperceptible Method to Monitor Human Activity by Using Sensor Data with CNN and Bi-directional LSTM

    Get PDF
    Deep learning (DL) algorithms have substantially increased research in recognizing day-to-day human activities. Methods for recognizing human activities through DL will only be useful if they work well in real-time applications. The activities of elderly people need to be monitored to detect any abnormalities in their health and to suggest a healthy lifestyle based on their day-to-day activities. Most existing approaches used videos or static photographs to recognize activities, which makes individuals feel anxious that they are being monitored. To address this limitation, we utilize the cognitive capabilities of DL algorithms and use sensor data, collected from a smart home dataset, as input to the proposed model for recognizing elderly people's activities without any intrusion on their privacy. Early DL models for human activity recognition relied on single-sensor data, which is static and falls short when recognizing dynamic, multi-sensor data. In this research we propose a DL architecture based on a blend of a deep Convolutional Neural Network (CNN) and a Bi-directional Long Short-Term Memory (Bi-LSTM) network, which replaces human intervention by automatically extracting features from multifunctional sensing devices to reliably recognize activities. Throughout the investigation we utilized Tulum, a benchmark dataset that contains logs of sensor data. We show that our methodology outperforms existing approaches, achieving an accuracy of 98.76% and an F1 score of 0.98.
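
    A minimal PyTorch sketch of the described CNN plus Bi-LSTM blend is given below; the layer sizes, window length, number of sensors, and number of activity classes are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of a CNN + Bi-LSTM activity classifier over windows of
# sensor readings shaped (batch, time, sensors); all sizes are illustrative.
import torch
import torch.nn as nn

class CnnBiLstm(nn.Module):
    def __init__(self, n_sensors=20, n_classes=10):
        super().__init__()
        # 1-D convolutions extract local patterns along the time axis.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_sensors, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # The Bi-LSTM models longer-range temporal dependencies in both directions.
        self.lstm = nn.LSTM(64, 64, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, n_classes)

    def forward(self, x):                                  # x: (batch, time, sensors)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)    # -> (batch, time, 64)
        out, _ = self.lstm(h)
        return self.fc(out[:, -1])                         # logits from last time step

logits = CnnBiLstm()(torch.randn(8, 50, 20))  # e.g. 50 time steps, 20 sensors
```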

    Hearing What You Cannot See: Acoustic Vehicle Detection Around Corners

    Full text link
    This work proposes to use passive acoustic perception as an additional sensing modality for intelligent vehicles. We demonstrate that approaching vehicles behind blind corners can be detected by sound before such vehicles enter the line-of-sight. We have equipped a research vehicle with a roof-mounted microphone array, and show on data collected with this sensor setup that wall reflections provide information on the presence and direction of occluded approaching vehicles. A novel method is presented to classify if and from what direction a vehicle is approaching before it is visible, using as input Direction-of-Arrival features that can be efficiently computed from the streaming microphone array data. Since the local geometry around the ego-vehicle affects the perceived patterns, we systematically study several environment types and investigate generalization across these environments. With a static ego-vehicle, an accuracy of 0.92 is achieved on the hidden vehicle classification task. Compared to a state-of-the-art visual detector, Faster R-CNN, our pipeline achieves the same accuracy more than one second ahead, providing crucial reaction time for the situations we study. While the ego-vehicle is driving, we demonstrate positive results on acoustic detection, still achieving an accuracy of 0.84 within one environment type. We further study failure cases across environments to identify future research directions. Comment: Accepted to IEEE Robotics & Automation Letters (2021), DOI: 10.1109/LRA.2021.3062254. Code, Data & Video: https://github.com/tudelft-iv/occluded_vehicle_acoustic_detectio
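
    As a rough illustration of how direction-of-arrival information can be extracted from a microphone pair, the sketch below implements standard GCC-PHAT time-delay estimation; the paper's actual DoA feature computation over the full array may differ.

```python
# Hedged sketch: GCC-PHAT time-delay estimation between two microphones, a
# common building block for Direction-of-Arrival features.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)  # PHAT weighting
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# The delay maps to an angle via tau = d * sin(theta) / c for a microphone
# pair spaced d metres apart (c ~ 343 m/s).
```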

    GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion Causality for Speech Emotion Recognition

    Full text link
    In human-computer interaction, Speech Emotion Recognition (SER) plays an essential role in understanding the user's intent and improving the interactive experience. Similar emotional utterances have diverse speaker characteristics but share common antecedents and consequences, so an essential challenge for SER is how to produce robust and discriminative representations through the causality between speech emotions. In this paper, we propose a Gated Multi-scale Temporal Convolutional Network (GM-TCNet) that constructs a novel emotional causality representation learning component with a multi-scale receptive field. GM-TCNet deploys this component, built from dilated causal convolution layers and a gating mechanism, to capture the dynamics of emotion across the time domain. In addition, it utilizes skip connections that fuse high-level features from different gated convolution blocks to capture abundant and subtle emotion changes in human speech. GM-TCNet uses a single type of feature, mel-frequency cepstral coefficients, as input, passes it through the gated temporal convolutional module to generate high-level features, and finally feeds the features to an emotion classifier to accomplish the SER task. Experimental results show that our model achieves the highest performance in most cases compared to state-of-the-art techniques. Comment: The source code is available at: https://github.com/Jiaxin-Ye/GM-TCNe
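
    The sketch below shows, in PyTorch, one possible gated dilated causal convolution block of the kind GM-TCNet stacks at multiple dilation rates; channel counts, kernel width, and the dilation schedule are illustrative assumptions rather than the authors' settings.

```python
# Rough sketch of a gated dilated causal convolution block with a residual
# connection; stacking such blocks at growing dilations widens the receptive field.
import torch
import torch.nn as nn

class GatedCausalBlock(nn.Module):
    def __init__(self, channels=64, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left padding keeps the conv causal
        self.filter = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                         # x: (batch, channels, time)
        h = nn.functional.pad(x, (self.pad, 0))   # pad only on the left (past)
        out = torch.tanh(self.filter(h)) * torch.sigmoid(self.gate(h))
        return x + out                            # residual connection

# Dilations 1, 2, 4, 8 give a multi-scale receptive field; skip outputs from
# each block can be fused before the emotion classifier.
blocks = nn.Sequential(*[GatedCausalBlock(dilation=d) for d in (1, 2, 4, 8)])
y = blocks(torch.randn(4, 64, 300))               # e.g. 300 MFCC frames
```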

    Acoustic sensor network geometry calibration and applications

    Get PDF
    In the modern world, we are increasingly surrounded by computing devices with communication links and one or more microphones, such as smartphones, tablets, laptops, or hearing aids. These devices can work together as nodes in an acoustic sensor network (ASN). Such networks are a growing platform that opens the possibility for many practical applications. ASN-based speech enhancement, source localization, and event detection can be applied to teleconferencing, camera control, automation, or assisted living. For these kinds of applications, awareness of auditory objects and their spatial positioning are key properties. In order to provide these two kinds of information, novel methods have been developed in this thesis. Information on the type of auditory objects is provided by a novel real-time sound classification method. Information on the position of human speakers is provided by a novel localization and tracking method. In order to localize with respect to the ASN, the relative arrangement of the sensor nodes has to be known; therefore, different novel geometry calibration methods were developed.

    Sound classification: The first method addresses the task of identifying auditory objects. A novel application of the bag-of-features (BoF) paradigm to acoustic event classification and detection was introduced. It can be used for event and speech detection as well as for speaker identification. The use of both mel frequency cepstral coefficient (MFCC) and Gammatone frequency cepstral coefficient (GFCC) features improves the classification accuracy. By using soft quantization and introducing supervised training for the BoF model, superior accuracy is achieved. The method generalizes well from limited training data, works online, and can be computed in a fraction of real time. Through a dedicated training strategy based on a hierarchy of stationarity, the detection of speech in mixtures with noise was realized, making the method robust against severe noise levels corrupting the speech signal. It is thus possible to provide control information to a beamformer in order to realize blind speech enhancement, and a reliable improvement is achieved in the presence of one or more stationary noise sources.

    Speaker localization: The localization method enables each node to determine the direction of arrival (DoA) of concurrent sound sources. The author's neuro-biologically inspired speaker localization method for microphone arrays was refined for use in ASNs. By implementing a dedicated cochlear and midbrain model, it is robust against the reverberation found in indoor rooms. In order to better model the unknown number of concurrent speakers, an application of the EM algorithm that realizes probabilistic clustering according to auditory scene analysis (ASA) principles was introduced. Based on this approach, a system for Euclidean tracking in ASNs was designed. Each node applies the node-wise localization method and shares probabilistic DoA estimates, together with an estimate of the spectral distribution, with the network. As this information is relatively sparse, it can be transmitted with low bandwidth, and the system is robust against jitter and transmission errors. The information from all nodes is integrated according to spectral similarity to correctly associate concurrent speakers. By incorporating the intersection angle in the triangulation, the precision of the Euclidean localization is improved. Tracks of concurrent speakers are computed over time, as is shown with recordings in a reverberant room.

    Geometry calibration: The central task of geometry calibration has been solved with special focus on sensor nodes equipped with multiple microphones, and novel methods were developed for different scenarios. An audio-visual method was introduced for the calibration of ASNs in video conferencing scenarios: the DoA estimates are fused with visual speaker tracking in order to provide sensor positions in a common coordinate system. A novel acoustic calibration method determines the relative positioning of the nodes from ambient sounds alone. Unlike previous methods that only infer the positioning of distributed microphones, the DoA is incorporated, so it becomes possible to calibrate the orientation of the nodes with high accuracy. This is very important for all applications using the spatial information, as the triangulation error increases dramatically with bad orientation estimates. As speech events can be used, calibration becomes possible without the requirement of playing dedicated calibration sounds. Based on this, an online method employing a genetic algorithm with incremental measurements was introduced. By using the robust speech localization method, the calibration is computed in parallel to the tracking. The online method is able to calibrate ASNs in real time, as is shown with recordings of natural speakers in a reverberant room.

    The informed acoustic sensor network: All new methods are important building blocks for the use of ASNs. The online methods for localization and calibration both make use of the neuro-biologically inspired processing in the nodes, which leads to state-of-the-art results even in reverberant enclosures. The high robustness and reliability can be improved even further by including the event detection method in order to exclude non-speech events. When all methods are combined, both semantic information on what is happening in the acoustic scene and spatial information on the positioning of the speakers and sensor nodes are automatically acquired in real time. This realizes truly informed audio processing in ASNs. Practical applicability is shown by application to recordings in reverberant rooms. The contribution of this thesis is thus not only to advance the state of the art in automatically acquiring information on the acoustic scene, but also to push the practical applicability of such methods.
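
    As a small illustration of the triangulation step behind the Euclidean localization, the sketch below intersects the bearing lines implied by two nodes' DoA estimates in 2D via least squares; node positions, orientations, and angles are made-up example values. A weighting by intersection angle, as described above, could be added to down-weight near-parallel bearings.

```python
# Illustrative least-squares intersection of bearing lines from per-node
# DoA estimates (2-D). All values below are hypothetical examples.
import numpy as np

def triangulate(positions, orientations, doas):
    """positions: (N, 2) node coordinates; orientations and doas: node headings
    and local DoA estimates in radians. Returns the least-squares intersection."""
    A, b = [], []
    for p, o, a in zip(np.asarray(positions, dtype=float), orientations, doas):
        d = np.array([np.cos(o + a), np.sin(o + a)])  # bearing in global coordinates
        n = np.array([-d[1], d[0]])                   # normal to the bearing line
        A.append(n)
        b.append(n @ p)
    return np.linalg.lstsq(np.array(A), np.array(b), rcond=None)[0]

# Two nodes at (0, 0) and (4, 0), both facing +x, see a source at 45 and 135 degrees:
print(triangulate([(0, 0), (4, 0)], [0.0, 0.0], [np.pi / 4, 3 * np.pi / 4]))
# -> approximately [2., 2.]
```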

    Binaural sound source localization using machine learning with spiking neural networks features extraction

    Get PDF
    Human and animal binaural hearing systems are able to take advantage of a variety of cues to localise sound sources in 3D space using only two sensors. This work presents a bionic system that utilises aspects of binaural hearing in an automated source localisation task. A head and torso emulator (KEMAR) is used to acquire binaural signals, and a spiking neural network is used to compare the signals from the two sensors. The firing rates of coincidence neurons in the spiking neural network model provide information about the location of a sound source. Previous methods have used a winner-takes-all approach, where the location of the coincidence neuron with the maximum firing rate indicates the likely azimuth and elevation. This was shown to be accurate for single sources, but when multiple sources are present the accuracy reduces significantly. To improve the robustness of the methodology, an alternative approach is developed where the spiking neural network is used as a feature pre-processor: the firing rates of all coincidence neurons are used as inputs to a machine learning model trained to predict source location for both single and multiple sources. This novel application of spiking neural networks as a binaural feature extraction method, with the features processed by deep neural networks, is used to localise multi-source sound signals emitted from different locations. Results show that the proposed bionic binaural emulator can accurately localise sources, including multiple and complex sources, with 99% correctly predicted angles from the single-source localisation model and 91% from the multi-source localisation model. The impact of background noise on localisation performance has also been investigated and shows significant degradation of performance. The multi-source localisation model was therefore trained with multi-condition background noise at SNRs of 10 dB, 0 dB, and -10 dB and tested at controlled SNRs. The findings demonstrate an enhancement in model performance compared with noise-free training data.
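
    To make the two-stage idea concrete, the sketch below stands in for the second stage: the firing rates of all coincidence neurons are treated as a feature vector and fed to a standard classifier (a scikit-learn MLP here, as a stand-in for the deep networks used in the work); the data is random placeholder rather than real output of the spiking front end.

```python
# Hypothetical second stage: classify source azimuth bins from coincidence-neuron
# firing-rate features. Feature and label arrays are placeholders.
import numpy as np
from sklearn.neural_network import MLPClassifier

n_examples, n_coincidence_neurons, n_azimuth_classes = 500, 128, 36
rng = np.random.default_rng(0)
firing_rates = rng.random((n_examples, n_coincidence_neurons))  # placeholder features
azimuths = rng.integers(0, n_azimuth_classes, size=n_examples)  # placeholder labels

model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300)
model.fit(firing_rates, azimuths)
predicted_angle_bins = model.predict(firing_rates[:5])
```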