
    Towards An Intelligent Fuzzy Based Multimodal Two Stage Speech Enhancement System

    This thesis presents a novel two-stage multimodal speech enhancement system that makes use of both visual and audio information to filter speech, and explores the extension of this system with fuzzy logic to demonstrate proof of concept for an envisaged autonomous, adaptive, and context-aware multimodal system. The design of the proposed cognitively inspired framework is scalable: the techniques used in individual parts of the system can be upgraded, and there is scope for the initial framework presented here to be expanded. In the proposed system, the concept of single-modality two-stage filtering is extended to include the visual modality. Noisy speech received by a microphone array is first pre-processed by visually derived Wiener filtering, employing a novel use of the Gaussian Mixture Regression (GMR) technique on associated visual speech information extracted with a state-of-the-art Semi Adaptive Appearance Models (SAAM) based lip-tracking approach. This pre-processed speech is then enhanced further by audio-only beamforming using a state-of-the-art Transfer Function Generalised Sidelobe Canceller (TFGSC) approach. The resulting system is designed to function in challenging noisy speech environments, and was evaluated using speech sentences from different speakers in the GRID corpus together with a range of noise recordings. Both objective and subjective test results, employing the widely used Perceptual Evaluation of Speech Quality (PESQ) measure, a composite objective measure, and subjective listening tests, show that this initial system delivers very encouraging results when filtering speech mixtures in difficult reverberant environments. Some limitations of this initial framework are identified, and the extension of the multimodal system is explored through the development of a fuzzy logic based framework and a proof-of-concept demonstration. Results show that the proposed autonomous, adaptive, and context-aware multimodal framework delivers very positive results in difficult noisy speech environments, making cognitively inspired use of audio and visual information depending on environmental conditions. Finally, concluding remarks are made, along with proposals for future work.
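
    As an illustrative aside, the first-stage idea described above (a Wiener filter whose clean-speech statistics are derived from visual cues) can be sketched roughly as follows. This is a minimal sketch, not the thesis's implementation: the regressor object and its predict method stand in for the GMR mapping from lip-tracking features to speech power spectra, and the noise power spectrum is assumed to be estimated separately.

        import numpy as np

        def wiener_gain(speech_psd, noise_psd, gain_floor=1e-3):
            # Classical Wiener gain S / (S + N), floored to limit musical noise.
            return np.maximum(speech_psd / (speech_psd + noise_psd + 1e-12), gain_floor)

        def visually_assisted_wiener(noisy_stft, visual_features, regressor, noise_psd):
            # noisy_stft:      (frames, bins) complex STFT of one microphone channel.
            # visual_features: (frames, dims) lip-shape features from a lip tracker.
            # regressor:       any model mapping visual features to per-frame estimates of
            #                  the clean-speech power spectrum (hypothetical interface; in
            #                  the thesis this role is played by Gaussian Mixture Regression).
            speech_psd = regressor.predict(visual_features)   # (frames, bins)
            gain = wiener_gain(speech_psd, noise_psd)
            return gain * noisy_stft                          # input to the second (beamforming) stage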

    Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady-State Vowel Categorization

    Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. The transformation from speaker-dependent to speaker-independent language representations enables speech to be learned and understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models. National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624).
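
    As a toy illustration of the speaker-normalization concept (not the cortical strip-map or Adaptive Resonance Theory model itself), one can separate a speaker-dependent offset from a largely speaker-independent vowel pattern in log-formant space and then categorize the normalized pattern against stored vowel centroids:

        import numpy as np

        def normalize(formants_hz):
            # formants_hz: (n_tokens, 3) F1-F3 values for one speaker.
            logf = np.log(formants_hz)
            offset = logf.mean(axis=0)       # speaker-dependent component (identity cue)
            return logf - offset, offset     # speaker-invariant patterns + identity

        def categorize(pattern, centroids):
            # Nearest-centroid vowel categorization in the normalized space.
            return int(np.argmin(np.linalg.norm(centroids - pattern, axis=1)))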

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on the information provided by European projects and national initiatives related to multimedia search, as well as domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective. The technical perspective includes an up-to-date view on content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines. From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.

    The aceToolbox: low-level audiovisual feature extraction for retrieval and classification

    In this paper we present an overview of a software platform developed within the aceMedia project, termed the aceToolbox, that provides global and local low-level feature extraction from audio-visual content. The toolbox is based on the MPEG-7 eXperimental Model (XM), with extensions to provide descriptor extraction from arbitrarily shaped image segments, thereby supporting local descriptors that reflect real image content. We describe the architecture of the toolbox, provide an overview of the descriptors supported to date, and briefly describe the segmentation algorithm provided. We then demonstrate the usefulness of the toolbox in the context of two different content-processing scenarios: similarity-based retrieval in large collections and scene-level classification of still images.
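
    As a hypothetical illustration of local descriptor extraction from an arbitrarily shaped segment (this is not the aceToolbox API or one of its MPEG-7 descriptors, just the general masking idea), a colour histogram can be computed only over the pixels inside a binary segment mask:

        import numpy as np

        def masked_colour_histogram(image, mask, bins=8):
            # image: (H, W, 3) uint8 RGB frame; mask: (H, W) boolean segment mask.
            pixels = image[mask]                              # pixels inside the segment only
            hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
            return (hist / max(len(pixels), 1)).ravel()       # normalised descriptor vector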

    The role of auditory information in audiovisual speech integration

    Communication between two people involves collecting and integrating information from different senses. An example in speech perception is when a listener relies on auditory input to hear spoken words and on visual input to read lips, making it easier to communicate in a noisy environment. Listeners are able to use visual cues to fill in missing auditory information when the auditory signal has been compromised in some way (e.g., hearing loss or a noisy environment). Interestingly, listeners integrate auditory and visual information during the perception of speech even when one of those senses proves to be more than sufficient. Grant and Seitz (1998) found a great deal of variability in listeners' performance on auditory-visual speech perception tasks. These findings have posed a number of questions about why and how multi-sensory integration occurs. Research on “optimal integration” suggests that listener, talker, or acoustic characteristics may influence auditory-visual integration. The present study focused on characteristics of the auditory signal that might promote auditory-visual integration, specifically examining whether removal of information from the signal would produce greater use of the visual input and thus greater integration. CVC syllables from 5 talkers were degraded by selectively removing spectral fine structure while maintaining the temporal envelope characteristics of the waveform. The resulting stimuli were output through 2-, 4-, 6-, and 8-channel bandpass filters. Results for 10 normal-hearing listeners showed auditory-visual integration in all conditions, but the amount of integration did not vary across the different auditory signal manipulations. In addition, substantial across-talker differences were observed in auditory intelligibility in the 2-channel condition. Interestingly, the degree of audiovisual integration produced by different talkers was unrelated to auditory intelligibility. Implications of these results for our understanding of the processes underlying auditory-visual integration are discussed. Advisor: Janet M. Weisenberger. Arts and Sciences Collegiate Undergraduate Scholarship; Social and Behavioral Sciences Undergraduate Research Scholarship.
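
    A minimal sketch of the kind of degradation described above (temporal envelope preserved, spectral fine structure replaced with noise) is a basic channel vocoder; the filter order, band edges, and envelope-extraction method below are assumptions for illustration, not the study's exact stimulus-generation procedure:

        import numpy as np
        from scipy.signal import butter, sosfiltfilt, hilbert

        def noise_vocode(signal, fs, n_channels=4, f_lo=100.0, f_hi=7000.0):
            # Split the signal into n_channels bands, take each band's temporal
            # envelope, and use it to modulate band-limited noise, discarding the
            # spectral fine structure while keeping the envelope cues.
            edges = np.geomspace(f_lo, f_hi, n_channels + 1)
            rng = np.random.default_rng(0)
            out = np.zeros(len(signal))
            for lo, hi in zip(edges[:-1], edges[1:]):
                sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
                band = sosfiltfilt(sos, signal)
                envelope = np.abs(hilbert(band))                        # temporal envelope
                carrier = sosfiltfilt(sos, rng.standard_normal(len(signal)))
                out += envelope * carrier
            return out / (np.max(np.abs(out)) + 1e-12)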

    Visual and Auditory Characteristics of Talkers in Multimodal Integration

    3rd place at the 2009 Denman Undergraduate Research Forum. In perceiving speech, there are three different elements of the interaction that can affect how the signal is interpreted: the talker, the signal (both visual and auditory), and the listener. Each of these elements inherently contains substantial variability, which will, in turn, affect the audio-visual speech percept. Since the work of McGurk in the 1970s, which showed that speech perception is a multimodal process incorporating both auditory and visual cues, there have been numerous investigations of the impact of these elements on multimodal integration of speech. The impact of talker characteristics on audio-visual integration has received the least attention to date. A recent study by Andrews (2007) provided an initial look at talker characteristics. In her study, the audiovisual integration produced by 14 talkers was examined, and substantial differences across talkers were found in both auditory and audiovisual intelligibility. However, the talker characteristics that promoted audiovisual integration were not specifically identified. The present study began to address this question by analyzing audiovisual integration performance using two types of reduced-information speech syllables produced by five talkers. In one reduction, fine-structure information was replaced with band-limited noise but the temporal envelope was retained; in the other, the syllables were reduced to a set of three sine waves that followed the formant structure of the syllable (sine-wave speech). Syllables were presented under audio-visual conditions to 10 listeners. Results indicated substantial across-talker differences, with the pattern of talker differences not affected by the type of reduction of the auditory signal. Analysis of confusion matrices provided directions for further analysis of specific auditory and visual speech tokens. College of the Arts and Sciences Undergraduate Scholarship; Social and Behavioral Sciences Undergraduate Research Scholarship. No embargo.
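
    The second reduction described above, sine-wave speech, replaces each formant with a single time-varying sinusoid. A minimal sketch is given below, assuming formant frequency and amplitude tracks are already available from a formant tracker; this is an illustration of the technique, not the study's stimulus-generation code.

        import numpy as np

        def sine_wave_speech(formant_tracks, amp_tracks, fs, hop_s=0.01):
            # formant_tracks: (3, n_frames) F1-F3 frequencies in Hz per analysis frame.
            # amp_tracks:     (3, n_frames) matching amplitudes.
            n_frames = formant_tracks.shape[1]
            t_frames = np.arange(n_frames) * hop_s
            t = np.arange(int(n_frames * hop_s * fs)) / fs
            out = np.zeros(len(t))
            for freqs, amps in zip(formant_tracks, amp_tracks):
                f = np.interp(t, t_frames, freqs)              # per-sample frequency track
                a = np.interp(t, t_frames, amps)
                phase = 2 * np.pi * np.cumsum(f) / fs          # integrate frequency to phase
                out += a * np.sin(phase)
            return out / (np.max(np.abs(out)) + 1e-12)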
