329 research outputs found
Machine Learning for Auditory Hierarchy
Coleman, W. (2021). Machine Learning for Auditory Hierarchy. This dissertation is submitted for the degree of Doctor of Philosophy, Technological University Dublin. Audio content is predominantly delivered in a stereo audio file of a static, pre-formed mix. The content creator makes volume, position and effects decisions, generally for presentation in stereo speakers, but has no control ultimately over how the content will be consumed. This leads to poor listener experience when, for example, a feature film is mixed such that the dialogue is at a low level relative to the sound effects. Consumers can complain that they must turn the volume up to hear the words, but back down again because the effects levels are too loud. Addressing this problem requires a television mix optimised for the stereo speakers used in the vast majority of homes, which is not always available
Proceedings of the 6th International Workshop on Folk Music Analysis, 15-17 June, 2016
The Folk Music Analysis Workshop brings together computational music analysis and ethnomusicology. Both symbolic and audio representations of music are considered, with a broad range of scientific approaches being applied (signal processing, graph theory, deep learning). The workshop features a range of interesting talks from international researchers in areas such as Indian classical music, Iranian singing, Ottoman-Turkish Makam music scores, Flamenco singing, Irish traditional music, Georgian traditional music and Dutch folk songs. Invited guest speakers were Anja Volk, Utrecht University and Peter Browne, Technological University Dublin
Representation of speech in the primary auditory cortex and its implications for robust speech processing
Speech has evolved as a primary form of communication between humans. This most used means of communication has been the subject of intense study for years, but there is still a lot that we do not know about it. It is an oft repeated fact, that even the performance of the best speech processing algorithms still lags far behind that of the average human, It seems inescapable that unless we know more about the way the brain performs this task, our machines can not go much further. This thesis focuses on the question of speech representation in the brain, both from a physiological and technological perspective. We explore the representation of speech through the encoding of its smallest elements - phonemic features - in the primary auditory cortex. We report on how population of neurons with diverse tuning properties respond discriminately to phonemes resulting in explicit encoding of their parameters. Next, we show that this sparse encoding of the phonemic features is a simple consequence of the linear spectro-temporal properties of the auditory cortical neurons and that a Spectro-Temporal receptive field model can predict similar patterns of activation. This is an important step toward the realization of systems that operate based on the same principles as the cortex. Using an inverse method of reconstruction, we shall also explore the extent to which phonemic features are preserved in the cortical representation of noisy speech. The results suggest that the cortical responses are more robust to noise and that the important features of phonemes are preserved in the cortical representation even in noise. Finally, we explain how a model of this cortical representation can be used for speech processing and enhancement applications to improve their robustness and performance
An Automatic audio segmentation system for radio newscast
Current web search engines generally do not enable searches into audio files. Informative
metadata would allow searches into audio files, but producing such metadata is a tedious manual task. Tools for automatic production of metadata are therefore needed. This project describes the work done on the development of an automatic audio segmentation system which can be used for this metadata extraction. In this work the radio newscast are divided into segments in which there is only one speaker. Audio features used in this project include
Mel Frequency Cepstral Coefficients. This feature was extracted from audio files that were stored in a WAV format, using CLAM. Model-Selection-Based segmentation is used to
segment audio signals using this feature
Recommended from our members
Human-based percussion and self-similarity detection in electroacoustic music
textElectroacoustic music is music that uses electronic technology for the compositional manipulation of sound, and is a unique genre of music for many reasons. Analyzing electroacoustic music requires special measures, some of which are integrated into the design of a preliminary percussion analysis tool set for electroacoustic music. This tool set is designed to incorporate the human processing of music and sound. Models of the human auditory periphery are used as a front end to the analysis algorithms. The audio properties of percussivity and self-similarity are chosen as the focus because these properties are computable and informative. A collection of human judgments about percussion was undertaken to acquire clearly specified, sound-event dimensions that humans use as a percussive cue. A total of 29 participants was asked to make judgments about the percussivity of 360 pairs of synthesized snare-drum sounds. The grouped results indicate that of the dimensions tested rise time is the strongest cue for percussivity. String resonance also has a strong effect, but because of the complex nature of string resonance, it is not a fundamental dimension of a sound event. Gross spectral filtering also has an effect on the judgment of percussivity but the effect is weaker than for rise time and string resonance. Gross spectral filtering also has less effect when the stronger cue of rise time is modified simultaneously. A percussivity-profile algorithm (PPA) is designed to identify those instants in pieces of music that humans also would identify as percussive. The PPA is implemented using a time-domain, channel-based approach and psychoacoustic models. The input parameters are tuned to maximize performance at matching participants’ choices in the percussion-judgment collection. After the PPA is tuned, the PPA then is used to analyze pieces of electroacoustic music. Real electroacoustic music introduces new challenges for the PPA, though those same challenges might affect human judgment as well. A similarity matrix is combined with the PPA in order to find self-similarity in the percussive sounds of electroacoustic music. This percussive similarity matrix is then used to identify structural characteristics in two pieces of electroacoustic music.Electrical and Computer Engineerin
Models and Analysis of Vocal Emissions for Biomedical Applications
The International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA) came into being in 1999 from the particularly felt need of sharing know-how, objectives and results between areas that until then seemed quite distinct such as bioengineering, medicine and singing. MAVEBA deals with all aspects concerning the study of the human voice with applications ranging from the neonate to the adult and elderly. Over the years the initial issues have grown and spread also in other aspects of research such as occupational voice disorders, neurology, rehabilitation, image and video analysis. MAVEBA takes place every two years always in Firenze, Italy. This edition celebrates twenty years of uninterrupted and succesfully research in the field of voice analysis
Sound Object Recognition
Humans are constantly exposed to a variety of acoustic stimuli ranging from music and speech to more complex acoustic scenes like a noisy marketplace. The human auditory perception mechanism is able to analyze these different kinds of sounds and extract meaningful information suggesting that the same processing mechanism is capable of representing different sound classes. In this thesis, we test this hypothesis by proposing a high dimensional sound object representation framework, that captures the various modulations of sound by performing a multi-resolution mapping. We then show that this model is able to capture a wide variety of sound classes (speech, music, soundscapes) by applying it to the tasks of speech recognition, speaker verification, musical instrument recognition and acoustic soundscape recognition.
We propose a multi-resolution analysis approach that captures the detailed variations in the spectral characterists as a basis for recognizing sound objects. We then show how such a system can be fine tuned to capture both the message information (speech content) and the messenger information (speaker identity). This system is shown to outperform state-of-art system for noise robustness at both automatic speech recognition and speaker verification tasks.
The proposed analysis scheme with the included ability to analyze temporal modulations was used to capture musical sound objects. We showed that using a model of cortical processing, we were able to accurately replicate the human perceptual similarity judgments and also were able to get a good classification performance on a large set of musical instruments. We also show that neither just the spectral feature or the marginals of the proposed model are sufficient to capture human perception. Moreover, we were able to extend this model to continuous musical recordings by proposing a new method to extract notes from the recordings.
Complex acoustic scenes like a sports stadium have multiple sources producing sounds at the same time. We show that the proposed representation scheme can not only capture these complex acoustic scenes, but provides a flexible mechanism to adapt to target sources of interest. The human auditory perception system is known to be a complex system where there are both bottom-up analysis pathways and top-down feedback mechanisms. The top-down feedback enhances the output of the bottom-up system to better realize the target sounds. In this thesis we propose an implementation of top-down attention module which is complimentary to the high dimensional acoustic feature extraction mechanism. This attention module is a distributed system operating at multiple stages of representation, effectively acting as a retuning mechanism, that adapts the same system to different tasks. We showed that such an adaptation mechanism is able to tremendously improve the performance of the system at detecting the target source in the presence of various distracting background sources
- …