
    Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy

    The MAVEBA Workshop is held every two years, and its proceedings collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, and biomedical engineering methods for the analysis of voice signals and images in support of the clinical diagnosis and classification of vocal pathologies. The Workshop is sponsored by Ente Cassa Risparmio di Firenze, COST Action 2103, the journal Biomedical Signal Processing and Control (Elsevier), and the IEEE Biomedical Engineering Society. Special issues of international journals collecting selected papers from the conference have been, and will be, published.

    Deep Learning for Audio Signal Processing

    Given the recent surge in developments of deep learning, this article provides a review of state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side by side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified. Comment: 15 pages, 2 PDF figures
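
Since the review singles out log-mel spectra as a dominant feature representation, a minimal sketch of computing them may help fix ideas. It assumes the librosa library and a placeholder file name; the window, hop, and mel-band values below are common choices, not ones prescribed by the article.

```python
import librosa
import numpy as np

# Load an audio file (path is a placeholder) at a 16 kHz mono sample rate.
y, sr = librosa.load("example.wav", sr=16000)

# Mel-scaled power spectrogram: 25 ms windows with a 10 ms hop, 64 mel bands.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=64
)

# Convert power to decibels to obtain the log-mel representation.
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, n_frames)
```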

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The MAVEBA Workshop is held every two years, and its proceedings collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, and biomedical engineering methods for the analysis of voice signals and images in support of the clinical diagnosis and classification of vocal pathologies.

    Know2Look: Commonsense Knowledge for Visual Search

    With the rise in popularity of social media, images accompanied by contextual text form a huge section of the web. However, search and retrieval of documents still depend largely on textual cues alone. Although visual cues have started to gain focus, imperfections in object/scene detection mean that they do not yet lead to significantly improved results. We hypothesize that using background commonsense knowledge about query terms can significantly aid the retrieval of documents with associated images. To this end we deploy three different modalities - text, visual cues, and commonsense knowledge pertaining to the query - as a recipe for efficient search and retrieval.
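
The abstract does not describe how the three modalities are combined, so the following is only a hypothetical sketch of one simple possibility, a weighted late fusion of per-modality relevance scores; the `Document` fields, weights, and function names are all assumptions, not part of Know2Look.

```python
from dataclasses import dataclass

@dataclass
class Document:
    text_score: float         # relevance of the document's text to the query
    visual_score: float       # confidence that detected objects/scenes match the query
    commonsense_score: float  # overlap with commonsense facts related to the query

def fused_score(doc: Document, w_text: float = 0.5, w_vis: float = 0.3, w_cs: float = 0.2) -> float:
    """Late fusion: a weighted sum of the three modality scores (weights are illustrative)."""
    return w_text * doc.text_score + w_vis * doc.visual_score + w_cs * doc.commonsense_score

def rank(documents: list[Document]) -> list[Document]:
    """Return documents ordered by fused relevance, best first."""
    return sorted(documents, key=fused_score, reverse=True)
```

In practice the weights would be tuned on held-out queries, and each score could come from a much richer model than a single number.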

    Audio-visual football video analysis, from structure detection to attention analysis

    Sport video is an important video genre, and content-based sports video analysis attracts great interest from both industry and academia. A sports video is characterised by repetitive temporal structures, relatively plain content, and strong spatio-temporal variations, such as quick camera switches and swift local motions, so specific techniques are needed to exploit these characteristics. For an efficient and effective sports video analysis system, there are three fundamental questions: (1) what are the key stories in sports videos; (2) what attracts a viewer's interest; and (3) how can game highlights be identified. This thesis is developed around these questions. We approach them from two different perspectives, and three research contributions are presented in turn: replay detection, attack temporal-structure decomposition, and attention-based highlight identification.

    Replay segments convey the most important content in sports videos, so detecting them is an efficient way to collect game highlights. However, replay is an artefact of editing, and its form evolves with advances in video editing tools. The composition of a replay is complex, including logo transitions, slow motions, viewpoint switches and normal-speed video clips. Since logo transition clips are pervasive in game collections from the FIFA World Cup 2002, FIFA World Cup 2006 and UEFA Championship 2006, we take logo transition detection as an effective replacement for replay detection. A two-pass system was developed, combining a five-layer AdaBoost classifier with logo template matching across the entire video. The five-layer AdaBoost classifier uses shot duration, average game-pitch ratio, average motion, sequential colour histograms, and shot frequency between two neighbouring logo transitions to filter out logo-transition candidates. Subsequently, a logo template is constructed and employed to find all logo transition sequences. The precision and recall of this system for replay detection are both 100% on a five-game evaluation collection.

    An attack structure is a team's competition for a score, and is therefore a conceptually fundamental unit of a football video as well as of other sports videos. We review the literature on content-based temporal structures, such as the play-break structure, and develop a three-step system for automatic attack-structure decomposition. Four content-based shot classes, namely play, focus, replay and break, are identified from low-level visual features. A four-state hidden Markov model is trained to simulate the transition processes among these shot classes. Since attack structures are the longest repetitive temporal units in a sports video, a suffix tree is proposed to find the longest repeated substring in the label sequence of shot-class transitions (a simplified sketch follows this abstract). The occurrences of this substring are regarded as the kernel of an attack hidden Markov process, so the decomposition of attack structures becomes a boundary likelihood comparison between two Markov chains.

    Highlights are what attract notice, and attention is a psychological measure of "notice". A brief survey of the psychological background of attention, attention estimation from visual and auditory cues, and multi-modality attention fusion is presented. We propose two attention models for sports video analysis: the role-based attention model and the multiresolution autoregressive framework. The role-based attention model is based on the perceptual structure of video watching; it removes reflection bias among modality salience signals and combines these signals through reflectors. The multiresolution autoregressive (MAR) framework treats salience signals as a group of smooth random processes which follow a similar trend but are corrupted by noise, and estimates a noiseless signal from these coarse, noisy observations by multiresolution analysis. Related algorithms are developed, such as event segmentation on a MAR tree and real-time event detection. The experiments show that these attention-based approaches can find goal events with high precision. Moreover, the results of MAR-based highlight detection on the final games of the 2002 and 2006 FIFA World Cups are highly similar to the highlights professionally labelled by the BBC and FIFA.
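
One concrete step in the attack-structure decomposition above is finding the longest repeated substring in the shot-class label sequence, for which the thesis proposes a suffix tree. The sketch below is only a simplified illustration of that idea, not the author's implementation: it finds the longest repeated substring by sorting suffixes, which is adequate for short label sequences, and the single-letter alphabet P/F/R/B for play, focus, replay and break is an assumed encoding.

```python
def longest_repeated_substring(labels: str) -> str:
    """Longest substring occurring at least twice, found via sorted suffixes.

    A suffix tree gives this in linear time; sorting suffixes and comparing
    neighbouring pairs is a simpler (quadratic) stand-in for illustration.
    """
    suffixes = sorted(labels[i:] for i in range(len(labels)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # Longest common prefix of two neighbouring suffixes.
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best

# Example: shot-class labels P(lay), F(ocus), R(eplay), B(reak) as a sequence.
sequence = "PFRBPFRB"
print(longest_repeated_substring(sequence))  # -> "PFRB"
```

For full-length match videos, a linear-time suffix tree or suffix automaton would replace the quadratic neighbour comparison.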

    Incongruent Visual Cues Affect the Perception of Mandarin Vowel But Not Tone

    Over the past few decades, a large number of audiovisual speech studies have focused on the visual cues of consonants and vowels while neglecting those relating to lexical tones. In this study, we investigate whether incongruent audiovisual information interferes with the perception of lexical tones. We found that, for both Chinese and English speakers, incongruence between the auditory signal and the visemic mouth shape (i.e., visual form information) significantly affected reaction times and reduced the identification accuracy of vowels. However, incongruent lip movements (i.e., visual timing information) did not interfere with the perception of auditory lexical tone. We conclude that, in contrast to vowel perception, auditory tone perception seems relatively impervious to visual congruence cues, at least under these restricted laboratory conditions. The salience of visual form and timing information is discussed on the basis of this finding.

    Functional imaging studies of visual-auditory integration in man.

    This thesis investigates the central nervous system's ability to integrate visual and auditory information from the sensory environment into unified conscious perception. It develops the possibility that the principle of functional specialisation may be applicable in the multisensory domain. The first aim was to establish the neuroanatomical location at which visual and auditory stimuli are integrated in sensory perception. The second was to investigate the neural correlates of visual-auditory synchronicity, which would be expected to play a vital role in establishing which visual and auditory stimuli should be perceptually integrated. Four functional magnetic resonance imaging studies identified brain areas specialised for: the integration of dynamic visual and auditory cues derived from the same everyday environmental events (Experiment 1); discriminating relative synchronicity between dynamic, cyclic, abstract visual and auditory stimuli (Experiments 2 and 3); and the aesthetic evaluation of visually and acoustically perceived art (Experiment 4). Experiment 1 provided evidence to suggest that the posterior temporo-parietal junction may be an important site of crossmodal integration. Experiment 2 revealed for the first time significant activation of the right anterior frontal operculum (aFO) when visual and auditory stimuli cycled asynchronously. Experiment 3 confirmed and developed this observation, as the right aFO was activated only during crossmodal (visual-auditory), but not intramodal (visual-visual, auditory-auditory), asynchrony. Experiment 3 also demonstrated bilateral activation of the amygdala during crossmodal synchrony. Experiment 4 revealed the neural correlates of supramodal, contemplative, aesthetic evaluation within the medial fronto-polar cortex; activity at this locus varied parametrically according to the degree of subjective aesthetic beauty, for both visual art and musical extracts. The most robust finding of this thesis is that activity in the right aFO increases when concurrently perceived visual and auditory stimuli deviate from crossmodal synchrony, which may veto the crossmodal integration of unrelated stimuli into unified conscious perception.

    Emotional Prosody Processing in the Schizophrenia Spectrum.

    Emotional prosody processing impairment is proposed to be a main contributing factor in the formation of auditory verbal hallucinations in patients with schizophrenia. In order to evaluate this assumption, five experiments in healthy, highly schizotypal and schizophrenia populations are presented. The first part of the thesis seeks to reveal the neural underpinnings of emotional prosody comprehension (EPC) in a non-clinical population, as well as the modulation of prosodic abilities by hallucination traits. By revealing the brain representation of EPC, an overlap at the neural level between EPC and auditory verbal hallucinations (AVH) was strongly suggested. By assessing the influence of hallucinatory traits on EPC abilities, a continuum in the schizophrenia spectrum was established in which the high-schizotypal population mirrors the neurocognitive profile of schizophrenia patients. Moreover, by studying the relation between AVH and EPC in a non-clinical population, potential confounding effects of medication on the findings were minimized. The second part of the thesis assessed two EPC-related abilities in schizophrenia patients with and without hallucinations. Firstly, voice identity recognition, a skill which relies on the analysis of some of the same acoustical features as EPC, was evaluated in patients and controls. Finally, the last study presented in the current thesis assessed the influence that implicit processing of emotional prosody has on selective attention in patients and controls. Both patient studies demonstrate that voice identity recognition deficits, as well as abnormal modulation of selective attention by implicit emotional prosody, are related to hallucinations exclusively and not to schizophrenia in general. In the final discussion, a model in which EPC deficits are a crucial factor in the formation of AVH is evaluated. The experimental findings presented in the previous chapters strongly suggest that the perception of prosodic features is impaired in patients with AVH, resulting in aberrant perception of irrelevant auditory objects with emotional prosodic salience, which capture the attention of the hearer and whose sources (speaker identity) cannot be recognized. Such impairments may be due to structural and functional abnormalities in a network which comprises the superior temporal gyrus as a central element.

    Automatic transcription of polyphonic music exploiting temporal evolution

    Automatic music transcription is the process of converting an audio recording into a symbolic representation using musical notation. It has numerous applications in music information retrieval, computational musicology, and the creation of interactive systems. Even for expert musicians, transcribing polyphonic pieces of music is not a trivial task, and while the problem of automatic pitch estimation for monophonic signals is considered to be solved, the creation of an automated system able to transcribe polyphonic music without restrictions on the degree of polyphony and the instrument type remains open. In this thesis, research on automatic transcription is performed by explicitly incorporating information on the temporal evolution of sounds. First efforts address the problem by focusing on signal processing techniques and by proposing audio features that utilise temporal characteristics. Techniques for note onset and offset detection are also utilised to improve transcription performance. Subsequent approaches propose transcription models based on shift-invariant probabilistic latent component analysis (SI-PLCA), modelling the temporal evolution of notes in a multiple-instrument case and supporting frequency modulations in produced notes. Datasets and annotations for transcription research have also been created during this work. The proposed systems have been evaluated both privately and publicly within the Music Information Retrieval Evaluation eXchange (MIREX) framework, and have been shown to outperform several state-of-the-art transcription approaches. The developed techniques have also been employed for other tasks related to music technology, such as key modulation detection, temperament estimation, and automatic piano tutoring. Finally, the proposed music transcription models have also been utilised in a wider context, namely for modelling acoustic scenes.
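
The abstract mentions note onset and offset detection as one ingredient for improving transcription. The snippet below is a minimal sketch of a generic spectral-flux onset detector using librosa, shown only as an illustration of that ingredient rather than the systems developed in the thesis; the file name is a placeholder.

```python
import librosa

# Load a recording (path is a placeholder); sr=None keeps the native sample rate.
y, sr = librosa.load("piano_recording.wav", sr=None)

# Spectral-flux style onset strength envelope computed from a mel spectrogram.
onset_env = librosa.onset.onset_strength(y=y, sr=sr)

# Pick peaks in the envelope and report onset times in seconds.
onset_times = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr, units="time")
print(onset_times)
```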