
    Unsupervised environment recognition and modeling using sound sensing


    Context-dependent sound event detection

    The work presented in this article studies how context information can be used in automatic sound event detection, and how the detection system can benefit from such information. Humans use context information to make more accurate predictions about sound events and to rule out events that are unlikely in a given context. We propose a similar use of context information in the automatic sound event detection process. The proposed approach is composed of two stages: an automatic context recognition stage and a sound event detection stage. Contexts are modeled using Gaussian mixture models, and sound events are modeled using three-state left-to-right hidden Markov models. In the first stage, the audio context of the tested signal is recognized. Based on the recognized context, a context-specific set of sound event classes is selected for the sound event detection stage. The event detection stage also uses context-dependent acoustic models and count-based event priors. Two alternative event detection approaches are studied. In the first, a monophonic event sequence is output by detecting the most prominent sound event at each time instant using Viterbi decoding. The second approach introduces a new method for producing a polyphonic event sequence by detecting multiple overlapping sound events using multiple restricted Viterbi passes. A new metric is introduced to evaluate sound event detection performance at various levels of polyphony; it combines detection accuracy and a coarse time-resolution error into a single value, making comparison of detection algorithms simpler. The two-step approach was found to improve the results substantially compared to the context-independent baseline system: at the block level, detection accuracy can be almost doubled by using the proposed context-dependent event detection.
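    As a rough illustration of the two-stage idea described above (not the article's implementation), the sketch below recognizes the context with one Gaussian mixture model per context and then runs a single Viterbi pass over a context-specific set of event classes. The per-event three-state HMMs are collapsed here into per-frame log-likelihoods for brevity, and the feature matrices, context and event names, and switch penalty are all hypothetical.

```python
# Minimal sketch of a two-stage, context-dependent sound event detector.
import numpy as np
from sklearn.mixture import GaussianMixture

# --- Stage 1: context recognition with one GMM per context -----------------
def recognize_context(frames, context_gmms):
    """Pick the context whose GMM gives the highest mean log-likelihood."""
    scores = {ctx: gmm.score(frames) for ctx, gmm in context_gmms.items()}
    return max(scores, key=scores.get)

# --- Stage 2: monophonic event detection via a Viterbi pass ----------------
def viterbi_events(frame_loglik, event_prior, switch_penalty=-5.0):
    """frame_loglik: (T, E) per-frame log-likelihoods for E context-specific
    event classes; event_prior: (E,) count-based log-priors.
    Returns the index of the most prominent event for each frame."""
    T, E = frame_loglik.shape
    delta = event_prior + frame_loglik[0]
    psi = np.zeros((T, E), dtype=int)
    for t in range(1, T):
        trans = delta[:, None] + switch_penalty * (1 - np.eye(E))
        psi[t] = trans.argmax(axis=0)
        delta = trans.max(axis=0) + frame_loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = psi[t, path[t]]
    return path

# Toy demo with random "MFCC" frames and two hypothetical contexts.
rng = np.random.default_rng(0)
gmms = {ctx: GaussianMixture(n_components=2, random_state=0).fit(
            rng.normal(shift, 1.0, size=(200, 13)))
        for ctx, shift in [("street", 0.0), ("office", 3.0)]}
test_frames = rng.normal(3.0, 1.0, size=(50, 13))
ctx = recognize_context(test_frames, gmms)               # likely "office"
events = {"street": ["car", "footsteps"], "office": ["speech", "keyboard"]}[ctx]
loglik = rng.normal(size=(50, len(events)))              # stand-in likelihoods
path = viterbi_events(loglik, np.zeros(len(events)))
```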

    Masked Conditional Neural Networks for Sound Recognition

    Sound recognition has been studied for decades to grant machines the human hearing ability. Advances in this field support a range of applications, from industrial ones such as fault detection in machines and noise monitoring to household applications such as surveillance and hearing aids. As in any pattern recognition task, sound recognition depends on the reliability of the extracted features and the recognition model. The problem has been approached through decades of hand-crafted features used in combination with models based on neural networks or statistical models such as Gaussian mixtures and hidden Markov models. Neural networks are currently being considered as a way to automate the feature extraction stage together with their already established role in recognition, and the performance of such models is approaching that of hand-crafted features. However, current neural network based models are not primarily designed around the nature of the sound signal and may not optimally harness its distinctive properties. This thesis proposes neural network models that exploit the nature of the time-frequency representation of the sound signal. We propose the ConditionaL Neural Network (CLNN) and the Masked ConditionaL Neural Network (MCLNN). The CLNN is designed to account for the temporal dimension of a signal and serves as the framework for the MCLNN. The MCLNN embeds a filterbank-like behaviour within the network using a specially designed binary mask. The masking subdivides the frequency range of the signal into bands and allows concurrent consideration of different feature combinations, analogous to manually hand-crafting the optimum set of features for a recognition task. The proposed models have been evaluated through an extensive set of experiments using a range of publicly available datasets of music genres and environmental sounds, where they surpass state-of-the-art convolutional neural networks and several hand-crafted feature approaches.
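    A minimal numpy sketch of the masking idea follows: a binary band mask restricts which frequency bins each hidden unit can see, and each output frame is conditioned on a window of neighbouring frames. The bandwidth, stride, window order, and the ReLU-style nonlinearity are illustrative assumptions, not the thesis's exact MCLNN configuration.

```python
# Sketch of a masked, temporally conditioned layer in the spirit of the MCLNN.
import numpy as np

def band_mask(n_in, n_hidden, bandwidth=20, stride=5):
    """Binary mask: each hidden unit only sees one band of frequency bins."""
    mask = np.zeros((n_in, n_hidden))
    for j in range(n_hidden):
        start = (j * stride) % max(n_in - bandwidth + 1, 1)
        mask[start:start + bandwidth, j] = 1.0
    return mask

def masked_conditional_layer(frames, weights, mask, order=2):
    """frames: (T, n_in) spectrogram frames.
    weights: list of 2*order + 1 matrices of shape (n_in, n_hidden), one per
    frame offset in [-order, order]; every matrix is restricted by the same
    binary band mask, giving the filterbank-like behaviour."""
    T, _ = frames.shape
    n_hidden = mask.shape[1]
    out = np.zeros((T, n_hidden))
    for t in range(order, T - order):
        acc = np.zeros(n_hidden)
        for offset, W in zip(range(-order, order + 1), weights):
            acc += frames[t + offset] @ (W * mask)
        out[t] = np.maximum(acc, 0.0)   # ReLU stand-in for the nonlinearity
    return out

# Toy usage with random data and random weights (hypothetical sizes).
rng = np.random.default_rng(0)
n_in, n_hidden, order = 60, 40, 2
frames = rng.normal(size=(100, n_in))
weights = [rng.normal(scale=0.1, size=(n_in, n_hidden))
           for _ in range(2 * order + 1)]
hidden = masked_conditional_layer(frames, weights, band_mask(n_in, n_hidden))
```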

    Detección e identificación de señales sonoras en entornos asistivos [Detection and identification of sound signals in assistive environments]

    The main objective of the work developed in this doctoral thesis is the study and applicability of techniques for recognizing non-speech sounds, such as doorbells, running taps, alarm clocks, etc., that help improve the independence and quality of life of people with hearing impairments. In this research, recognition systems have been developed that are capable of working in real time using professional microphones in fixed locations. These systems have been designed both to alert people with hearing problems to sounds of interest and for use in intelligent systems that exploit this information to recognize activities of daily living. However, the main contribution of this thesis lies in investigating this type of system on mobile phones, where the hardware resources are more limited and the training conditions of the sounds differ from the validation or test conditions. It has been shown how, by optimizing the detection and classification algorithms, these systems can run in real time on mobile devices. The work in this field has led to the development of a functional mobile phone application, capable of running in real time and designed following accessibility guidelines to support people with hearing impairments.

    Spatial and Content-based Audio Processing using Stochastic Optimization Methods

    Stochastic optimization (SO) represents a category of numerical optimization approaches in which the search for the optimal solution involves randomness in a constructive manner. As also shown in this thesis, stochastic optimization techniques and models have become an important and notable paradigm in a wide range of application areas, including transportation models, financial instruments, and network design. Stochastic optimization is especially suited to problems that are too difficult or impossible to solve analytically with deterministic optimization approaches. In this thesis, the focus is on applying several stochastic optimization algorithms to two audio-specific application areas, namely sniper positioning and content-based audio classification and retrieval. In short, the first application belongs to the area of spatial audio, whereas the latter is a topic of machine learning and, more specifically, multimedia information retrieval. The SO algorithms considered in the thesis are particle filtering (PF), particle swarm optimization (PSO), and simulated annealing (SA), which are extended, combined, and applied to the specified problems in a novel manner. Because of their iterative and evolving nature, the PSO algorithms in particular are often included in the category of evolutionary algorithms. For the sniper positioning application, the PF and SA algorithms are employed to optimize the parameters of a mathematical shock wave model based on observed firing event wavefronts. Such an inverse problem is suitable for a Bayesian approach, which is the main motivation for including PF among the considered optimization methods. It is shown, also with SA, that by applying the stated shock wave model, the proposed stochastic parameter estimation approach provides statistically reliable and qualified results. The content-based audio classification part of the thesis is based on a dedicated framework consisting of several individual binary classifiers. In this work, artificial neural networks (ANNs) are used within the framework, and their parameters and network structures are optimized based on the desired outputs, i.e. the ground truth class labels. The optimization is carried out using a multi-dimensional extension of the regular PSO algorithm (MD PSO). The audio retrieval experiments are performed in the context of feature generation (synthesis), an approach for generating new audio features/attributes from conventional features originally extracted from a particular audio database. Here the MD PSO algorithm is applied to optimize the parameters of the feature generation process, wherein the dimensionality of the generated feature vector is also optimized. Both from a practical perspective and from the viewpoint of complexity theory, stochastic optimization techniques are often computationally demanding. Because of this, the practical implementations discussed in this thesis are designed to be directly applicable to parallel computing. This is an important and topical issue considering the continuous growth of computing grids and cloud services; indeed, many of the results achieved in this thesis are computed using a grid of several computers. Furthermore, since personal computers and mobile handsets also include an increasing number of processor cores, such parallel implementations are not limited to grid servers.
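    To make the optimization side concrete, the sketch below shows a compact, generic particle swarm optimizer fitting parameters to observations. The quadratic objective and all numerical settings are placeholders; neither the thesis's shock wave model nor the multi-dimensional (MD) PSO extension is reproduced here.

```python
# Plain particle swarm optimization (PSO) for box-constrained minimization.
import numpy as np

def pso(objective, bounds, n_particles=30, n_iter=200,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over the box `bounds = (lo, hi)`."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pos = rng.uniform(lo, hi, size=(n_particles, lo.size))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), np.array([objective(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, *pos.shape))
        # Inertia plus attraction toward personal and global bests.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Toy usage: recover two parameters of a noisy observation model.
rng = np.random.default_rng(1)
observed = np.array([1.5, -0.3]) + 0.01 * rng.normal(size=2)
best, best_err = pso(lambda p: float(np.sum((p - observed) ** 2)),
                     bounds=(np.array([-5.0, -5.0]), np.array([5.0, 5.0])))
```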

    The Impact of Sound on Virtual Landscape Perception: An Empirical Evaluation of Aural-Visual Interaction for 3D Visualization

    An understanding of quantitative and qualitative landscape characteristics is necessary to successfully articulate intervention or change in the landscape. In landscape planning and design, 3D visualizations have been used to successfully communicate various aspects of landscape to a diverse population, though they have been shown to lag behind real-world experience in perceptual experiments. There is evidence that engaging other senses can alter the perception of 3D visualizations, which this thesis used as a departure point for the research project. Three research questions guide the investigation. The first research question is: How do fundamental elements in visualizations (i.e. terrain, vegetation and built form) interact with fundamental sound types (i.e. anthropogenic, mechanical and natural) to affect perceived realism of, and preference for, 3D landscape visualization? The research used the empirical methods of a controlled experiment and statistical analysis of quantitative survey responses to examine perceptual responses to the interaction of aural and visual stimuli in St. James’s Park, London, UK. The visualizations were sourced from Google Earth and the sounds were recorded in situ; Google Earth was chosen because it is used increasingly in landscape planning and design processes, yet has received very little perceptual research attention. The second research question is: Do different user characteristics interact with combined aural-visual stimuli to alter perceived realism and preferences for 3D visualization? The final research question emerged out of the experiment design and concentrates on research methodology: How effective is the Internet for aural-visual data collection compared to the laboratory setting? The results of the quantitative analysis can be summarized as follows. For research question 1, the results show that sound alters 3D visualization perception both positively and negatively, and that the effect varies by landscape element. For all visual conditions, mechanical sound significantly lowers preference. For visualizations showing terrain only, both perceived realism and preference are significantly lowered by anthropogenic sound and significantly raised by natural sound. For visualizations showing a combination of terrain with built form, anthropogenic and mechanical sound significantly raise perceived realism. For visualizations showing a combination of terrain, vegetation and some built form, a more complicated interaction occurs for realism, moderated by the amount of built form in the scene: with no buildings in the scene, traffic and speech significantly lower realism ratings in similar ways, while with a small amount of built form visible, speech significantly raises realism ratings. Preference was lowered by anthropogenic and mechanical sound more in this condition than in the other two visual conditions. For research question 2, the results confirm that perceived realism can vary by gender and first language, and preference by age, first language, cultural and professional background, and 3D familiarity. Finally, for research question 3 and its implications for Internet-based multisensory experiments, there is strong evidence that audio hardware and experimental condition (laboratory vs. online) do not significantly alter realism and preference ratings, though larger display sizes can have a significant but very small effect on preference ratings (+/- 0.08 on a 5-point scale).
    The results indicate that sound significantly alters the perception of realism and preference for landscape simulated via 3D visualizations, with the congruence of aural and visual stimuli having a strong impact on both perceptual responses. The results provide important empirical evidence for future research to build upon, and raise important questions relating to the authenticity of landscape experience, particularly when relying solely on visual material, as visuals alone do not accurately simulate landscape experience. In addition, the research confirms the cross-sensory nature of perception in virtual environments. As a result, the inclusion of sound in landscape visualization and aesthetic research is concluded to be of critical importance. The research results suggest that, when using sound with 3D visualizations, the sound content should match the visualized material, and that sounds containing human speech should be avoided unless there is a very strong reason to include them (e.g. there are humans in the visualization). The final chapter discusses opportunities for integrating sound with 3D visualizations in order to increase the perception of realism and preference in landscape planning and design processes, and concludes with areas for future research.