43,659 research outputs found

    FaceFilter: Audio-visual speech separation using still images

    The objective of this paper is to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network. Unlike previous works that used lip movement in video clips or pre-enrolled speaker information as an auxiliary conditional feature, we use a single face image of the target speaker. In this task, the conditional feature is obtained from facial appearance through a cross-modal biometric task, where audio and visual identity representations are shared in a latent space. Identities learnt from facial images force the network to isolate the matched speakers and extract their voices from the mixed speech. This solves the permutation problem caused by swapped channel outputs, which occurs frequently in speech separation tasks. The proposed method is far more practical than video-based speech separation since user profile images are readily available on many platforms. Also, unlike speaker-aware separation methods, it is applicable to separation with unseen speakers who have never been enrolled before. We show strong qualitative and quantitative results on challenging real-world examples.
    Comment: Under submission as a conference paper. Video examples: https://youtu.be/ku9xoLh62
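    The permutation problem mentioned in this abstract can be illustrated with a small sketch. It is not the paper's method: the abstract says face conditioning removes the ambiguity, whereas the code below shows a different, commonly used workaround (a permutation-invariant loss) purely to make the problem concrete. The function names and toy signals are hypothetical.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def fixed_order_loss(outputs, targets):
    """Loss when output channel i is always compared with target i."""
    return sum(mse(o, t) for o, t in zip(outputs, targets)) / len(targets)


def permutation_invariant_loss(outputs, targets):
    """Lowest loss over every assignment of output channels to targets.

    Returns (loss, assignment), where assignment[i] is the index of the
    target matched to output channel i.
    """
    best_loss, best_perm = None, None
    for perm in permutations(range(len(targets))):
        loss = sum(mse(outputs[i], targets[j]) for i, j in enumerate(perm))
        loss /= len(targets)
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy signals (assumed for illustration): the separator recovered both
# speakers perfectly but emitted them on swapped channels.
targets = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
outputs = [[-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]]

swapped_loss = fixed_order_loss(outputs, targets)
pit_loss, assignment = permutation_invariant_loss(outputs, targets)
```

    With a fixed channel order, a perfect but swapped separation is penalised heavily. Conditioning on a target identity, as the abstract describes, sidesteps the search over assignments because the output is tied to one named speaker.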

    Differences in hearing acuity among “normal-hearing” young adults modulate the neural basis for speech comprehension

    In this paper, we investigate how subtle differences in hearing acuity affect the neural systems supporting speech processing in young adults. Auditory sentence comprehension requires perceiving a complex acoustic signal and performing linguistic operations to extract the correct meaning. We used functional MRI to monitor human brain activity while adults aged 18–41 years listened to spoken sentences. The sentences varied in their level of syntactic processing demands, containing either a subject-relative or object-relative center-embedded clause. All participants self-reported normal hearing, confirmed by audiometric testing, with some variation within a clinically normal range. We found that participants showed activity related to sentence processing in a left-lateralized frontotemporal network. Although accuracy was generally high, participants still made some errors, which were associated with increased activity in bilateral cingulo-opercular and frontoparietal attention networks. A whole-brain regression analysis revealed that activity in a right anterior middle frontal gyrus (aMFG) component of the frontoparietal attention network was related to individual differences in hearing acuity, such that listeners with poorer hearing showed greater recruitment of this region when successfully understanding a sentence. The activity in the right aMFG for listeners with poor hearing did not differ as a function of sentence type, suggesting a general mechanism that is independent of linguistic processing demands. Our results suggest that even modest variations in hearing ability impact the systems supporting auditory speech comprehension, and that auditory sentence comprehension entails the coordination of a left perisylvian network that is sensitive to linguistic variation with an executive attention network that responds to acoustic challenge.

    Data-Driven Representation Learning in Multimodal Feature Fusion

    Modern machine learning systems leverage data and features from multiple modalities to gain more predictive power. In most scenarios, the modalities are vastly different and the acquired data are heterogeneous in nature. Consequently, building highly effective fusion algorithms is central to achieving improved model robustness and inference performance. This dissertation focuses on representation learning approaches as the fusion strategy. Specifically, the objective is to learn a shared latent representation that jointly exploits the structural information encoded in all modalities, such that a straightforward learning model can be adopted to obtain the prediction. We first consider sensor fusion, a typical multimodal fusion problem critical to building a pervasive computing platform. A systematic fusion technique is described to support both multiple sensors and descriptors for activity recognition. Multiple Kernel Learning (MKL) algorithms, which aim to learn the optimal combination of kernels, have been successfully applied to numerous fusion problems in computer vision and other fields. Utilizing the MKL formulation, we next describe an auto-context algorithm for learning image context via fusion with low-level descriptors. Furthermore, a principled fusion algorithm using deep learning to optimize kernel machines is developed. By bridging deep architectures with kernel optimization, this approach leverages the benefits of both paradigms and is applied to a wide variety of fusion problems. In many real-world applications, the modalities exhibit highly specific data structures, such as time sequences and graphs, and consequently a special design of the learning architecture is needed. In order to improve temporal modeling for multivariate sequences, we developed two architectures centered around attention models. A novel clinical time series analysis model is proposed for several critical problems in healthcare. Another model, coupled with a triplet ranking loss as a metric learning framework, is described to better solve speaker diarization. Compared to state-of-the-art recurrent networks, these attention-based multivariate analysis tools achieve improved performance while having a lower computational complexity. Finally, in order to perform community detection on multilayer graphs, a fusion algorithm is described that derives node embeddings using word embedding techniques and also exploits the complementary relational information contained in each layer of the graph.
    Dissertation/Thesis: Doctoral Dissertation, Electrical Engineering 201
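    The triplet ranking loss named in this abstract can be sketched in a few lines. This is a minimal illustration of the standard hinge-style formulation, assuming Euclidean distance and a margin of 1.0; the dissertation's actual distance, margin, and embedding model are not given in the abstract, and the function names and toy vectors below are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_ranking_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that is zero once the positive is closer to the anchor
    than the negative by at least `margin` (margin value is an assumption).
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D "speaker embeddings" (assumed for illustration).
anchor = [0.0, 0.0]
same_speaker = [0.0, 1.0]    # distance 1 from the anchor
other_speaker = [3.0, 4.0]   # distance 5 from the anchor

# Well-separated triplet: no penalty.
loss_easy = triplet_ranking_loss(anchor, same_speaker, other_speaker)

# Roles reversed: the "positive" is far away, so the loss is large.
loss_hard = triplet_ranking_loss(anchor, other_speaker, same_speaker)
```

    Minimising this quantity over many triplets pulls segments from the same speaker together and pushes different speakers apart, which is the property a diarization system clusters on.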

    Grounding semantics in robots for Visual Question Answering

    In this thesis I describe an operational implementation of an object detection and description system that is incorporated into an end-to-end Visual Question Answering system, and evaluate it on two visual question answering datasets for compositional language and elementary visual reasoning.