Cycle-Consistent Deep Generative Hashing for Cross-Modal Retrieval
In this paper, we propose a novel deep generative approach to cross-modal retrieval that learns hash functions in the absence of paired training samples through a cycle-consistency loss. Our approach employs an adversarial training scheme to learn a pair of hash functions that enable translation between modalities while preserving the underlying semantic relationship. To endow the hash codes of each input-output pair with semantics, a cycle-consistency loss is further imposed on top of the adversarial training to strengthen the correlation between inputs and their corresponding outputs. Our approach learns hash functions generatively, so that the learned hash codes maximally correlate each input-output correspondence while also regenerating the inputs, thereby minimizing information loss. The learning-to-hash embedding is thus performed by jointly optimizing the parameters of the hash functions across modalities as well as the associated generative models. Extensive experiments on a variety of large-scale cross-modal datasets demonstrate that our proposed method achieves better retrieval results than state-of-the-art methods.
Comment: To appear in IEEE Trans. Image Processing.
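To make the objective concrete, here is a minimal sketch of how a cycle-consistency term combined with a hash-code correlation term could look. This is not the authors' released code: the generator and hash-function names, the tanh relaxation of the binary codes, and the L1/MSE loss choices are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cycle_hash_loss(x_img, x_txt, G_i2t, G_t2i, H_img, H_txt, lam=1.0):
    """Sketch of a cycle-consistent hashing objective.

    G_i2t / G_t2i: generators translating between modalities (assumed).
    H_img / H_txt: hash functions; tanh relaxes codes into [-1, 1].
    """
    # Translate each modality into the other, then back again.
    fake_txt = G_i2t(x_img)
    fake_img = G_t2i(x_txt)
    rec_img = G_t2i(fake_txt)   # image -> text -> image
    rec_txt = G_i2t(fake_img)   # text -> image -> text

    # Cycle consistency: reconstructions should match the inputs,
    # so the translation preserves (regenerates) the input content.
    cyc = F.l1_loss(rec_img, x_img) + F.l1_loss(rec_txt, x_txt)

    # Correlate the hash codes of each input-output correspondence.
    code_img = torch.tanh(H_img(x_img))
    code_txt = torch.tanh(H_txt(fake_txt))
    corr = F.mse_loss(code_img, code_txt)

    return cyc + lam * corr
```

In practice this term would be added to the adversarial (discriminator) losses and optimized jointly over the generators and hash functions, as the abstract describes.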
Towards uncertainty-aware and label-efficient machine learning of human expressive behaviour
The ability to recognise emotional expressions from non-verbal behaviour plays a key role in human-human interaction. Endowing machines with the same ability is critical to enriching human-computer interaction. Despite receiving widespread attention so far, human-level automatic recognition of affective expressions is still an elusive task for machines. Towards improving the current state of machine learning methods applied to affect recognition, this thesis identifies two challenges: label ambiguity and label scarcity.
Firstly, this thesis notes that it is difficult to establish a clear one-to-one mapping between inputs (face images or speech segments) and their target emotion labels, considering that emotion perception is inherently subjective. As a result, the problem of label ambiguity naturally arises in the manual annotations of affect. Ignoring this fundamental problem, most existing affect recognition methods implicitly assume a one-to-one input-target mapping and use deterministic function learning. In contrast, this thesis proposes to learn non-deterministic functions based on uncertainty-aware probabilistic models, as they can naturally accommodate the one-to-many input-target mapping. Besides improving affect recognition performance, the proposed uncertainty-aware models in this thesis demonstrate three important applications: adaptive multimodal affect fusion, human-in-the-loop learning of affect, and improved performance on downstream behavioural analysis tasks such as personality trait estimation.
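One common way to realise such an uncertainty-aware model is a heteroscedastic Gaussian output head trained with a negative log-likelihood loss; the sketch below illustrates that general pattern, not the thesis's specific architecture, and the feature dimension and layer names are assumptions.

```python
import torch
import torch.nn as nn

class GaussianAffectHead(nn.Module):
    """Predicts a distribution over an affect target (e.g. valence)
    instead of a point estimate, so one input can map to many
    plausible labels."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.mean = nn.Linear(feat_dim, 1)
        self.log_var = nn.Linear(feat_dim, 1)  # predicted uncertainty

    def forward(self, features):
        return self.mean(features), self.log_var(features)

def gaussian_nll(mean, log_var, target):
    # Heteroscedastic negative log-likelihood: ambiguous samples can
    # reduce their loss by reporting a higher predicted variance.
    return (0.5 * torch.exp(-log_var) * (target - mean) ** 2
            + 0.5 * log_var).mean()
```

The predicted variance is what enables the applications listed above: low-confidence modalities can be down-weighted in fusion, and high-uncertainty samples can be prioritised for human annotation.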
Secondly, this thesis aims to address the challenge of scarcity of affect-labelled datasets, caused by the cumbersome and time-consuming nature of the affect annotation process. To this end, this thesis notes that the audio and visual feature encoders used in existing models are label-inefficient, i.e., learning them requires large amounts of labelled training data. As a solution, this thesis proposes to pre-train the feature encoders using unlabelled data to make them more label-efficient, i.e., able to achieve good emotion recognition performance from as few labelled training examples as possible. A novel self-supervised pre-training method is proposed in this thesis by posing hand-engineered emotion features as task-specific representation-learning priors. By leveraging large amounts of unlabelled audiovisual data, the proposed self-supervised pre-training method demonstrates much better label efficiency compared to commonly employed pre-training methods.
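One plausible reading of "hand-engineered features as priors" is to pre-train the encoder to regress such features from unlabelled data, so no emotion labels are needed. The sketch below follows that reading; the `feature_extractor` (e.g. an eGeMAPS-style acoustic descriptor function) and all names are assumptions, not the thesis's exact method.

```python
import torch.nn.functional as F

def pretrain_step(encoder, head, optimizer, wav_batch, feature_extractor):
    """One self-supervised step: regress hand-engineered emotion
    features (the 'prior') from raw unlabelled audio."""
    target = feature_extractor(wav_batch)   # hand-crafted features
    pred = head(encoder(wav_batch))         # learned representation
    loss = F.mse_loss(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After pre-training, the encoder is kept and fine-tuned on the small labelled set, which is where the label-efficiency gain would show up.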
Words have a weight: language as a source of inner grounding and flexibility in abstract concepts
The role played by language in our cognitive lives is a topic at the centre of contemporary debates in cognitive (neuro)science. In this paper we illustrate and compare two theories that offer embodied explanations of this role: the WAT (Words As social Tools) and the LENS (Language is an Embodied Neuroenhancement and Scaffold) theories. WAT and LENS differ from other current proposals because they connect the impact of the neurologically realized language system on our cognition to the ways in which language shapes our interaction with the physical and social environment. Examining these theories together, their tenets and supporting evidence, sharpens our understanding of each, but also contributes to a better understanding of the contribution that language might make to the acquisition, representation and use of abstract concepts. Here we focus on how language provides a source of inner grounding, especially through metacognition and inner speech, and supports the flexibility of our thought. Overall, the paper outlines a promising research program focused on the importance of language to abstract concepts within the context of a flexible, multimodal, and multilevel conception of embodied cognition.
Connectionist perspectives on language learning, representation and processing.
The field of formal linguistics was founded on the premise that language is mentally represented as a deterministic symbolic grammar. While this approach has captured many important characteristics of the world's languages, it has also led to a tendency to focus theoretical questions on the correct formalization of grammatical rules while de-emphasizing the role of learning and statistics in language development and processing. In this review we present a different approach to language research that has emerged from the parallel distributed processing or 'connectionist' enterprise. In the connectionist framework, mental operations are studied by simulating learning and processing within networks of artificial neurons. With that in mind, we discuss recent progress in connectionist models of auditory word recognition, reading, morphology, and syntactic processing. We argue that connectionist models can capture many important characteristics of how language is learned, represented, and processed, while also providing new insights about the source of these behavioral patterns. Just as importantly, the networks naturally capture irregular (non-rule-like) patterns that are common within languages, something that has been difficult to reconcile with rule-based accounts of language without positing separate mechanisms for rules and exceptions.
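As a toy illustration of the framework described here (not any specific published model), a single network can be trained on regular items and exceptions with one set of weights and no separate rule mechanism; the input/output patterns below are arbitrary stand-ins for, e.g., a spelling-to-sound mapping.

```python
import torch
import torch.nn as nn

# Toy connectionist network: one set of weights learns both the
# regular items and the exceptions, with no separate rule system.
net = nn.Sequential(nn.Linear(10, 32), nn.Sigmoid(), nn.Linear(32, 10))
opt = torch.optim.SGD(net.parameters(), lr=0.5)
loss_fn = nn.MSELoss()

inputs = torch.rand(100, 10)
targets = inputs.clone()           # regular items follow a "rule" (identity)
targets[:10] = torch.rand(10, 10)  # a minority of exceptions break it

for epoch in range(2000):
    opt.zero_grad()
    loss = loss_fn(net(inputs), targets)
    loss.backward()
    opt.step()
```

The network gradually absorbs both pattern types into the same weights, which is the behaviour the review contrasts with dual-mechanism rule-plus-exception accounts.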
Sensitivity of human auditory cortex to rapid frequency modulation revealed by multivariate representational similarity analysis.
Functional Magnetic Resonance Imaging (fMRI) was used to investigate the extent, magnitude, and pattern of brain activity in response to rapid frequency-modulated sounds. We examined this by manipulating the direction (rise vs. fall) and the rate (fast vs. slow) of the apparent pitch of iterated rippled noise (IRN) bursts. Acoustic parameters were selected to capture features used in phoneme contrasts; however, the stimuli themselves were not perceived as speech per se. Participants were scanned as they passively listened to sounds in an event-related paradigm. Univariate analyses revealed a greater level and extent of activation in bilateral auditory cortex in response to frequency-modulated sweeps compared to steady-state sounds. This effect was stronger in the left hemisphere. However, no regions showed selectivity for either rate or direction of frequency modulation. In contrast, multivoxel pattern analysis (MVPA) revealed feature-specific encoding for direction of modulation in auditory cortex bilaterally. Moreover, this effect was strongest when analyses were restricted to anatomical regions lying outside Heschl's gyrus. We found no support for feature-specific encoding of frequency modulation rate. Differential findings of modulation rate and direction of modulation are discussed with respect to their relevance to phonetic discrimination.
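MVPA of this kind typically amounts to training a classifier on per-trial voxel patterns within a region of interest and testing decoding accuracy under cross-validation. The generic sketch below illustrates that logic with synthetic data; the array shapes, linear-SVM choice, and 5-fold scheme are assumptions, not the study's exact pipeline.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: one row of voxel activations per trial within an auditory ROI;
# y: direction of frequency modulation (0 = rising, 1 = falling).
rng = np.random.default_rng(0)
X = rng.standard_normal((80, 500))   # 80 trials x 500 voxels (toy data)
y = rng.integers(0, 2, size=80)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(clf, X, y, cv=5)
# Above-chance decoding accuracy implies the ROI carries
# feature-specific information about modulation direction,
# even when univariate contrasts show no selectivity.
print(scores.mean())
```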
Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction
Multimodal fusion and multitask learning are two vital topics in machine learning. Despite fruitful progress, existing methods for both problems remain brittle to the same challenge: it is difficult to integrate the common information across modalities (resp. tasks) while preserving the specific patterns of each modality (resp. task). Moreover, although they are closely related to each other, multimodal fusion and multitask learning have rarely been explored within the same methodological framework. In this paper, we propose the Channel-Exchanging-Network (CEN), which is self-adaptive, parameter-free, and, more importantly, applicable to both multimodal fusion and multitask learning. At its core, CEN dynamically exchanges channels between subnetworks of different modalities. Specifically, the channel-exchanging process is self-guided by individual channel importance, measured by the magnitude of the Batch-Normalization (BN) scaling factor during training. For the application of dense image prediction, the validity of CEN is tested in four different scenarios: multimodal fusion, cycle multimodal fusion, multitask learning, and multimodal multitask learning. Extensive experiments on semantic segmentation with RGB-D data and image translation from multi-domain input verify the effectiveness of CEN compared to current state-of-the-art methods. Detailed ablation studies have also been carried out, affirming the advantage of each component we propose.
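The core mechanism described above can be sketched roughly as follows: channels whose BN scaling factor (gamma) falls below a threshold are treated as uninformative and replaced by the corresponding channels of the other modality. The threshold value, tensor shapes, and two-modality setting are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def exchange_channels(x_a, x_b, bn_a, bn_b, threshold=1e-2):
    """Rough sketch of CEN-style channel exchanging between two
    modality subnetworks.

    x_a, x_b: feature maps of shape (N, C, H, W) after BN.
    bn_a, bn_b: the BatchNorm2d layers that produced them; the
    magnitude of their scaling factor (gamma, i.e. bn.weight)
    measures channel importance.
    """
    # Channels whose BN gamma is near zero are deemed uninformative...
    dead_a = (bn_a.weight.abs() < threshold).view(1, -1, 1, 1)
    dead_b = (bn_b.weight.abs() < threshold).view(1, -1, 1, 1)

    # ...and are replaced by the other modality's channels.
    out_a = torch.where(dead_a, x_b, x_a)
    out_b = torch.where(dead_b, x_a, x_b)
    return out_a, out_b
```

Because the decision is read off the BN parameters already being trained, the exchange adds no extra parameters, which is what makes the scheme self-adaptive and parameter-free as claimed.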