
    Exploiting correlogram structure for robust speech recognition with multiple speech sources

    This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture in the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as tightly coupled processes. In the first stage, sound source separation is performed in the correlogram domain. For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located at delays corresponding to multiples of the pitch period. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local pitch estimates are then computed for each spectral group and are used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together with the spectral representation, are passed to a 'speech fragment decoder', which uses 'missing data' techniques with clean speech models to simultaneously search for the acoustic evidence that best matches model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments across different conditions, which results in significantly better recognition accuracy.
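
    To make the front end concrete, here is a minimal sketch of a correlogram, a per-channel short-time autocorrelation, with a summary over channels whose peak lag gives a pitch estimate. It is a toy illustration, not the paper's system: a Butterworth band-pass filterbank stands in for an auditory (e.g. gammatone) filterbank, and the channel count, frame length and lag range are arbitrary choices.

```python
import numpy as np
from scipy.signal import butter, lfilter

def correlogram(x, fs, n_channels=16, frame_len=1024, max_lag=400):
    """One frame of a correlogram: normalised autocorrelation per channel."""
    # Illustrative log-spaced Butterworth filterbank; the paper's system
    # would use an auditory filterbank such as a gammatone bank.
    edges = np.geomspace(80.0, min(4000.0, 0.45 * fs), n_channels + 1)
    frame = x[:frame_len] * np.hanning(frame_len)
    acg = np.zeros((n_channels, max_lag))
    for c in range(n_channels):
        b, a = butter(2, [edges[c], edges[c + 1]], btype="band", fs=fs)
        y = lfilter(b, a, frame)
        r = np.correlate(y, y, mode="full")[frame_len - 1:]  # lags 0, 1, ...
        acg[c] = r[:max_lag] / (r[0] + 1e-12)
    return acg

# Summary correlogram: sum over channels; its peak lag ~ pitch period.
fs = 16000
t = np.arange(2048) / fs
x = np.sign(np.sin(2 * np.pi * 200 * t))      # crude 200 Hz periodic source
summary = correlogram(x, fs).sum(axis=0)
lag = 20 + np.argmax(summary[20:])            # ignore trivial short lags
print(f"estimated pitch: {fs / lag:.1f} Hz")  # ~200 Hz
```

    In the paper it is the tree-like symmetry of these structures, not just the summary peak, that drives the grouping of spectral components.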

    Audio-coupled video content understanding of unconstrained video sequences

    Unconstrained video understanding is a difficult task. The main aim of this thesis is to recognise the nature of objects, activities and environment in a given video clip using both audio and video information. Traditionally, audio and video information has not been applied together for solving such complex tasks, and for the first time we propose, develop, implement and test a new framework of multi-modal (audio and video) data analysis for context understanding and labelling of unconstrained videos. The framework relies on feature selection techniques and introduces a novel algorithm (PCFS) that is faster than the well-established SFFS algorithm. We use the framework to study the benefits of combining audio and video information in a number of different problems. We begin by developing two independent content recognition modules. The first is based on image sequence analysis alone, and uses a range of colour, shape, texture and statistical features from image regions with a trained classifier to recognise the identity of the objects, activities and environment present. The second module uses audio information only, and recognises activities and environment. Both approaches are preceded by detailed pre-processing to ensure that correct video segments containing both audio and video content are present, and that the developed system can be made robust to changes in camera movement, illumination, random object behaviour, etc. For both audio and video analysis, we use a hierarchical approach of multi-stage classification so that difficult classification tasks can be decomposed into simpler, smaller tasks. When combining both modalities, we compare fusion techniques at different levels of integration and propose a novel algorithm that combines the advantages of both feature- and decision-level fusion. The analysis is evaluated on a large amount of test data comprising unconstrained videos collected for this work. Finally, we propose a decision correction algorithm which shows that further steps towards combining multi-modal classification information effectively with semantic knowledge generate the best possible results.
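
    PCFS itself is the thesis's contribution and its details are not given in this abstract, so it is not reproduced here. For reference, below is a minimal sketch of the well-established SFFS baseline it is compared against (sequential floating forward selection: greedy forward inclusion followed by conditional backward exclusion), with a hypothetical correlation-based criterion standing in for a real classifier-based one.

```python
import numpy as np

def sffs(J, n_features, target_size):
    """Sequential Floating Forward Selection.

    J: criterion mapping a tuple of feature indices to a score (higher is
    better). Returns a subset of target_size indices."""
    subset, best = [], {}              # best score observed per subset size
    while len(subset) < target_size:
        # Inclusion: add the feature that most improves the criterion.
        remaining = [f for f in range(n_features) if f not in subset]
        subset.append(max(remaining, key=lambda f: J(tuple(subset + [f]))))
        best[len(subset)] = J(tuple(subset))
        # Conditional exclusion: drop a feature while doing so beats the
        # best score previously seen at the smaller size.
        while len(subset) > 2:
            f_rm = max(subset, key=lambda f: J(tuple(s for s in subset if s != f)))
            reduced = tuple(s for s in subset if s != f_rm)
            if J(reduced) > best.get(len(reduced), -np.inf):
                subset, best[len(reduced)] = list(reduced), J(reduced)
            else:
                break
    return subset

# Toy usage: recover the three informative features of a linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, [1, 4, 7]].sum(axis=1) + 0.1 * rng.normal(size=200)
score = lambda S: abs(np.corrcoef(X[:, list(S)].sum(axis=1), y)[0, 1])
print(sorted(sffs(score, 10, 3)))      # expected: [1, 4, 7]
```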

    Seeing sound: a new way to illustrate auditory objects and their neural correlates

    This thesis develops a new method for time-frequency signal processing and examines the relevance of the new representation in studies of neural coding in songbirds. The method groups together associated regions of the time-frequency plane into objects defined by time-frequency contours. By combining information about structurally stable contour shapes over multiple time-scales and angles, a signal decomposition is produced that distributes resolution adaptively. As a result, distinct signal components are represented in their own most parsimonious forms. Next, through neural recordings in singing birds, it was found that activity in song premotor cortex is significantly correlated with the objects defined by this new representation of sound. In this process, an automated way of finding sub-syllable acoustic transitions in birdsongs was first developed, and increased spiking probability was then found at the boundaries of these acoustic transitions. Finally, a new approach to studying auditory cortical sequence processing more generally is proposed. In this approach, songbirds were trained to discriminate Morse-code-like sequences of clicks, and the neural correlates of this behavior were examined in primary and secondary auditory cortex. A distinct transformation of auditory responses to the sequences of clicks was found as information is transferred from primary to secondary auditory areas. Neurons in secondary auditory areas respond asynchronously and selectively, in a manner that depends on the temporal context of the click. This transformation from a temporal to a spatial representation of sound provides a possible basis for the songbird's natural ability to discriminate complex temporal sequences.
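
    The contour-based decomposition itself is the thesis's method and is not reconstructed here. As a rough illustration of one ingredient, the sketch below links per-frame spectral peaks into time-frequency contours and repeats the analysis at two window lengths, showing how different time-scales favour different signal components; the peak threshold, frequency tolerance and window sizes are arbitrary assumptions.

```python
import numpy as np
from scipy.signal import stft

def ridge_contours(x, fs, nperseg, freq_tol=2, min_len=3):
    """Chain per-frame spectral peaks into crude time-frequency contours.

    A 'contour' here is a run of local magnitude maxima whose bin index
    drifts by at most freq_tol bins between consecutive frames."""
    f, t, Z = stft(x, fs=fs, nperseg=nperseg)
    mag = np.abs(Z)
    done, active = [], {}                     # active: bin -> [(time, freq)]
    for j in range(mag.shape[1]):
        col = mag[:, j]
        peaks = [i for i in range(1, len(col) - 1)
                 if col[i] > col[i - 1] and col[i] > col[i + 1]
                 and col[i] > 0.1 * mag.max()]
        nxt = {}
        for i in peaks:
            prev = next((b for b in active if abs(b - i) <= freq_tol), None)
            chain = active.pop(prev) if prev is not None else []
            chain.append((t[j], f[i]))
            nxt[i] = chain
        done.extend(active.values())          # chains that ended this frame
        active = nxt
    done.extend(active.values())
    return [c for c in done if len(c) >= min_len]

# A steady tone yields few, long contours at the long-window scale.
fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
for nperseg in (128, 1024):
    cs = ridge_contours(x, fs, nperseg)
    print(f"window {nperseg}: {len(cs)} contours, longest {max(map(len, cs))}")
```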

    Improving the quality, analysis and interpretation of body sounds acquired in challenging clinical settings

    Despite advances in medicine and technology, acute lower respiratory diseases are a leading cause of sickness and mortality worldwide, particularly affecting countries where access to appropriate medical technology and expertise is scarce. Chest auscultation provides a low-cost, non-invasive, widely available tool for the examination of pulmonary health. Despite universal adoption, its use is hampered by a number of issues, including subjectivity in interpretation and vulnerability to ambient noise, which limit its diagnostic capability. Digital auscultation and computerized methods come as a natural aid towards overcoming these limitations. We address the demanding real-life scenario of pediatric lung auscultation in busy clinical settings, guided by two major objectives: 1) can we improve the quality of the delicate auscultated sounds and reduce unwanted noise contamination; and 2) can we augment the screening capabilities of current stethoscopes using computerized lung sound analysis to capture the presence of abnormal breaths, and can we standardize the findings? To address the first objective, we developed an adaptive noise suppression scheme that tackles contamination coming from a variety of sources, including subject-centric and electronic artifacts and environmental noise. The proposed method was validated using objective and subjective measures, including an expert reviewer panel and objective signal quality metrics. Results revealed the ability and superiority of the proposed method to i) suppress unwanted noise when compared to state-of-the-art technology, and ii) faithfully maintain the signature of the delicate body sounds. The second objective was addressed by exploring feature representations that capture the distinct characteristics of body sounds. A biomimetic approach was employed, and the acoustic signal was projected onto high-dimensional spaces spanning time, frequency, temporal dynamics and spectral modulations. Trained classifiers produced localized decisions on these breath content features, indicating lung disease. Unlike existing work, our proposed scheme is further able to combine and integrate the localized decisions into an individual, patient-level evaluation. A large corpus of annotated patient data was used to validate our approach, demonstrating the superiority of the proposed features and patient evaluation scheme. Overall, the findings indicate that improved, accessible auscultation care is possible, a step towards affordable health care solutions with worldwide impact.
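
    The adaptive suppression scheme itself is not specified in this abstract, so the sketch below shows only the classic spectral-subtraction baseline such schemes are typically measured against, under the assumption that the recording opens with a noise-only stretch from which a stationary noise spectrum can be estimated. Segment length, FFT size and spectral floor are illustrative choices.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs, noise_sec=0.5, nperseg=512, floor=0.05):
    """Classic spectral subtraction (a baseline, not the paper's method).

    Estimates a stationary noise magnitude from the first noise_sec seconds,
    subtracts it per bin, and keeps a small spectral floor to limit the
    'musical noise' artifacts of plain subtraction."""
    f, t, Z = stft(x, fs=fs, nperseg=nperseg)
    hop = nperseg // 2                              # scipy's default hop
    n_noise = max(1, int(noise_sec * fs / hop))
    noise_mag = np.abs(Z[:, :n_noise]).mean(axis=1, keepdims=True)
    mag = np.abs(Z)
    clean = np.maximum(mag - noise_mag, floor * mag) * np.exp(1j * np.angle(Z))
    _, y = istft(clean, fs=fs, nperseg=nperseg)
    return y

# Toy usage: a 300 Hz tone in white noise, with a noise-only first 0.5 s.
fs = 8000
rng = np.random.default_rng(1)
sig = np.zeros(2 * fs)
sig[fs // 2:] = np.sin(2 * np.pi * 300 * np.arange(2 * fs - fs // 2) / fs)
denoised = spectral_subtraction(sig + 0.3 * rng.normal(size=2 * fs), fs)
```

    The paper's scheme goes further, adapting to subject-centric and electronic artifacts rather than assuming a stationary noise floor.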

    Recent Advances in Signal Processing

    Signal processing is a critical component of most new technological developments and of a wide variety of applications in both science and engineering. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary and Gaussian, favoring closed-form tractability over real-world accuracy; these constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics and engineering. This book is targeted primarily toward students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five areas depending on the application at hand: image processing, speech processing, communication systems, time-series analysis and educational packages, in that order. The book has the advantage of providing a collection of applications that are completely independent and self-contained; the interested reader can therefore choose any chapter and skip to another without losing continuity.

    Guided Matching Pursuit and its Application to Sound Source Separation

    Over the last couple of decades there has been increasing interest in the application of source separation technologies to musical signal processing. Given a signal that consists of a mixture of musical sources, source separation aims at extracting and/or isolating the signals that correspond to the original sources. A system capable of high-quality source separation could be an invaluable tool for the sound engineer as well as the end user. Applications of source separation include, but are not limited to, remixing, up-mixing, spatial re-configuration, individual source modification such as filtering, pitch detection/correction and time stretching, music transcription, voice recognition and source-specific audio coding. Of particular interest is the problem of separating sources from a mixture comprising two channels (the 2.0 format), since this is still the most commonly used format in the music industry and in most domestic listening environments. When the number of sources is greater than the number of mixtures (which is usually the case with stereophonic recordings), the problem of source separation becomes under-determined and traditional source separation techniques, such as “Independent Component Analysis” (ICA), cannot be successfully applied. In such cases a family of techniques known as “Sparse Component Analysis” (SCA) is better suited. In short, the mixture signal is transformed into a new domain where the individual sources are sparsely represented, which implies that their corresponding coefficients will have disjoint (or almost disjoint) supports. Taking advantage of this property, along with the spatial information within the mixture and any other prior information available, it is possible to identify the sources in the new domain and separate them by returning to the time domain. Sparser representations lead to higher-quality separation. Nevertheless, the most commonly used front end for an SCA system is the ubiquitous short-time Fourier transform (STFT), which, although a sparsifying transform, is not the best choice for this job. A better alternative is the matching pursuit (MP) decomposition. MP is an iterative algorithm that decomposes a signal into a set of elementary waveforms called atoms, chosen from an over-complete dictionary in such a way that they represent the inherent signal structures. A crucial part of MP is the creation of the dictionary, which directly affects the results of the decomposition and subsequently the quality of source separation. Selecting an appropriate dictionary can prove a difficult task, and an adaptive approach is therefore appropriate. This work proposes a new MP variant termed guided matching pursuit (GMP), which adds a new pre-processing step to the main sequence of the MP algorithm. The purpose of this step is to analyse the signal and extract important features, termed guide maps, that are used to create dynamic mini-dictionaries comprising atoms which are expected to correlate well with the underlying signal structures, thus leading to focused and more efficient searches around particular supports of the signal. The algorithm is accompanied by a modular and highly flexible MATLAB implementation suited to the processing of long-duration audio signals. Finally, the new algorithm is applied to the source separation of two-channel linear instantaneous mixtures, and preliminary testing demonstrates that the performance of GMP is on par with that of state-of-the-art systems.
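
    GMP's guide maps and mini-dictionaries are the thesis's contribution and are not reproduced here; the sketch below shows only the plain MP iteration it builds on, greedily projecting the residual onto the best-matching atom of an over-complete dictionary. The Gabor-style dictionary of windowed cosines, its layout and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

def matching_pursuit(x, D, n_iter=5):
    """Plain MP: pick the atom with the largest inner product with the
    residual, subtract its projection, repeat. D has unit-norm columns."""
    residual = x.astype(float)
    picks = []
    for _ in range(n_iter):
        corr = D.T @ residual
        k = int(np.argmax(np.abs(corr)))
        picks.append((k, corr[k]))            # (atom index, coefficient)
        residual = residual - corr[k] * D[:, k]
    return picks, residual

# Illustrative over-complete dictionary: windowed cosines ("Gabor-like"
# atoms) at 20 frequencies and 6 time positions.
N, L = 256, 64
atoms = []
for fr in np.linspace(0.02, 0.4, 20):         # cycles per sample
    for start in range(0, N - L, 32):
        a = np.zeros(N)
        a[start:start + L] = np.hanning(L) * np.cos(2 * np.pi * fr * np.arange(L))
        atoms.append(a / np.linalg.norm(a))
D = np.column_stack(atoms)

# Recover a two-atom signal buried in light noise.
rng = np.random.default_rng(0)
x = 2.0 * D[:, 5] + 1.0 * D[:, 40] + 0.01 * rng.normal(size=N)
picks, res = matching_pursuit(x, D)
print(picks[:2], "residual energy:", float(res @ res))
```

    Per the abstract, GMP's pre-processing step would replace the fixed dictionary D with dynamic mini-dictionaries built from guide maps around particular supports of the signal.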