Search CORE

43,659 research outputs found

FaceFilter: Audio-visual speech separation using still images

Author: Choe Soyeon
Chung Joon Son
Chung Soo-Whan
Kang Hong-Goo
Publication venue: 'International Speech Communication Association'
Publication date: 14/05/2020
Field of study

The objective of this paper is to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network. Unlike previous works that used lip movement on video clips or pre-enrolled speaker information as an auxiliary conditional feature, we use a single face image of the target speaker. In this task, the conditional feature is obtained from facial appearance in cross-modal biometric task, where audio and visual identity representations are shared in latent space. Learnt identities from facial images enforce the network to isolate matched speakers and extract the voices from mixed speech. It solves the permutation problem caused by swapped channel outputs, frequently occurred in speech separation tasks. The proposed method is far more practical than video-based speech separation since user profile images are readily available on many platforms. Also, unlike speaker-aware separation methods, it is applicable on separation with unseen speakers who have never been enrolled before. We show strong qualitative and quantitative results on challenging real-world examples.Comment: Under submission as a conference paper. Video examples: https://youtu.be/ku9xoLh62

arXiv.org e-Print Archive

Crossref

Differences in hearing acuity among “normal-hearing” young adults modulate the neural basis for speech comprehension

Author: Grossman Murray
Kotloff Ethan
Lee Yune S
Min Nam-Eun
Peelle Jonathan E
Wingfield Arthur
Publication venue: Digital Commons@Becker
Publication date: 01/01/2018
Field of study

AbstractIn this paper, we investigate how subtle differences in hearing acuity affect the neural systems supporting speech processing in young adults. Auditory sentence comprehension requires perceiving a complex acoustic signal and performing linguistic operations to extract the correct meaning. We used functional MRI to monitor human brain activity while adults aged 18–41 years listened to spoken sentences. The sentences varied in their level of syntactic processing demands, containing either a subject-relative or object-relative center-embedded clause. All participants self-reported normal hearing, confirmed by audiometric testing, with some variation within a clinically normal range. We found that participants showed activity related to sentence processing in a left-lateralized frontotemporal network. Although accuracy was generally high, participants still made some errors, which were associated with increased activity in bilateral cingulo-opercular and frontoparietal attention networks. A whole-brain regression analysis revealed that activity in a right anterior middle frontal gyrus (aMFG) component of the frontoparietal attention network was related to individual differences in hearing acuity, such that listeners with poorer hearing showed greater recruitment of this region when successfully understanding a sentence. The activity in right aMFGs for listeners with poor hearing did not differ as a function of sentence type, suggesting a general mechanism that is independent of linguistic processing demands. Our results suggest that even modest variations in hearing ability impact the systems supporting auditory speech comprehension, and that auditory sentence comprehension entails the coordination of a left perisylvian network that is sensitive to linguistic variation with an executive attention network that responds to acoustic challenge.</jats:p

Crossref

Digital Commons@Becker

REPRESENTATION LEARNING FOR EMOTION RECOGNITION AND MENTAL HEALTH ANALYSIS

Author: Yang Kailai
Publication venue
Publication date: 01/08/2023
Field of study

The University of Manchester - Institutional Repository

Data-Driven Representation Learning in Multimodal Feature Fusion

Author
Publication venue
Publication date: 01/01/2018
Field of study

abstract: Modern machine learning systems leverage data and features from multiple modalities to gain more predictive power. In most scenarios, the modalities are vastly different and the acquired data are heterogeneous in nature. Consequently, building highly effective fusion algorithms is at the core to achieve improved model robustness and inferencing performance. This dissertation focuses on the representation learning approaches as the fusion strategy. Specifically, the objective is to learn the shared latent representation which jointly exploit the structural information encoded in all modalities, such that a straightforward learning model can be adopted to obtain the prediction. We first consider sensor fusion, a typical multimodal fusion problem critical to building a pervasive computing platform. A systematic fusion technique is described to support both multiple sensors and descriptors for activity recognition. Targeted to learn the optimal combination of kernels, Multiple Kernel Learning (MKL) algorithms have been successfully applied to numerous fusion problems in computer vision etc. Utilizing the MKL formulation, next we describe an auto-context algorithm for learning image context via the fusion with low-level descriptors. Furthermore, a principled fusion algorithm using deep learning to optimize kernel machines is developed. By bridging deep architectures with kernel optimization, this approach leverages the benefits of both paradigms and is applied to a wide variety of fusion problems. In many real-world applications, the modalities exhibit highly specific data structures, such as time sequences and graphs, and consequently, special design of the learning architecture is needed. In order to improve the temporal modeling for multivariate sequences, we developed two architectures centered around attention models. A novel clinical time series analysis model is proposed for several critical problems in healthcare. Another model coupled with triplet ranking loss as metric learning framework is described to better solve speaker diarization. Compared to state-of-the-art recurrent networks, these attention-based multivariate analysis tools achieve improved performance while having a lower computational complexity. Finally, in order to perform community detection on multilayer graphs, a fusion algorithm is described to derive node embedding from word embedding techniques and also exploit the complementary relational information contained in each layer of the graph.Dissertation/ThesisDoctoral Dissertation Electrical Engineering 201

ASU Digital Repository

Grounding semantics in robots for Visual Question Answering

Author: Wahle Björn
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2019
Field of study

In this thesis I describe an operational implementation of an object detection and description system that incorporates in an end-to-end Visual Question Answering system and evaluated it on two visual question answering datasets for compositional language and elementary visual reasoning

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC