    Redundancy-Adaptive Multimodal Learning for Imperfect Data

    Multimodal models trained on complete modality data often exhibit a substantial decrease in performance when faced with imperfect data containing corruptions or missing modalities. To address this robustness challenge, prior methods have explored approaches based on augmentation, consistency, or uncertainty, but these approaches come with drawbacks related to data complexity, representation, and learning, potentially diminishing their overall effectiveness. In response to these challenges, this study introduces a novel approach, Redundancy-Adaptive Multimodal Learning (RAML). RAML efficiently harnesses information redundancy across multiple modalities to combat the issues posed by imperfect data while remaining compatible with the complete modality. Specifically, RAML achieves redundancy-lossless information extraction through separate unimodal discriminative tasks and enforces a proper norm constraint on each unimodal feature representation. Furthermore, RAML explicitly enhances multimodal fusion by leveraging fine-grained redundancy among unimodal features to learn correspondences between corrupted and untainted information. Extensive experiments on various benchmark datasets under diverse conditions consistently demonstrate that RAML outperforms state-of-the-art methods by a significant margin.

    Towards uncertainty-aware and label-efficient machine learning of human expressive behaviour

    The ability to recognise emotional expressions from non-verbal behaviour plays a key role in human-human interaction. Endowing machines with the same ability is critical to enriching human-computer interaction. Despite receiving widespread attention so far, human-level automatic recognition of affective expressions is still an elusive task for machines. Towards improving the current state of machine learning methods applied to affect recognition, this thesis identifies two challenges: label ambiguity and label scarcity. Firstly, this thesis notes that it is difficult to establish a clear one-to-one mapping between inputs (face images or speech segments) and their target emotion labels, considering that emotion perception is inherently subjective. As a result, the problem of label ambiguity naturally arises in the manual annotations of affect. Ignoring this fundamental problem, most existing affect recognition methods implicitly assume a one-to-one input-target mapping and use deterministic function learning. In contrast, this thesis proposes to learn non-deterministic functions based on uncertainty-aware probabilistic models, as they can naturally accommodate the one-to-many input-target mapping. Besides improving the affect recognition performance, the proposed uncertainty-aware models in this thesis demonstrate three important applications: adaptive multimodal affect fusion, human-in-the-loop learning of affect, and improved performance on downstream behavioural analysis tasks like personality traits estimation. Secondly, this thesis aims to address the challenge of scarcity of affect-labelled datasets, caused by the cumbersome and time-consuming nature of the affect annotation process. To this end, this thesis notes that audio and visual feature encoders used in the existing models are label-inefficient, i.e., learning them requires large amounts of labelled training data.
As a solution, this thesis proposes to pre-train the feature encoders using unlabelled data to make them more label-efficient, i.e., to achieve good emotion recognition performance using as few labelled training examples as possible. A novel self-supervised pre-training method is proposed in this thesis by posing hand-engineered emotion features as task-specific representation learning priors. By leveraging large amounts of unlabelled audiovisual data, the proposed self-supervised pre-training method demonstrates much better label efficiency compared to the commonly employed pre-training methods.

    The active inference approach to ecological perception: general information dynamics for natural and artificial embodied cognition

    The emerging neurocomputational vision of humans as embodied, ecologically embedded, social agents, who shape and are shaped by their environment, offers a golden opportunity to revisit and revise ideas about the physical and information-theoretic underpinnings of life, mind, and consciousness itself. In particular, the active inference framework (AIF) makes it possible to bridge connections from computational neuroscience and robotics/AI to ecological psychology and phenomenology, revealing common underpinnings and overcoming key limitations. AIF opposes the mechanistic to the reductive, while staying fully grounded in a naturalistic and information-theoretic foundation, using the principle of free energy minimization. The latter provides a theoretical basis for a unified treatment of particles, organisms, and interactive machines, spanning from the inorganic to organic, non-life to life, and natural to artificial agents. We provide a brief introduction to AIF, then explore its implications for evolutionary theory, ecological psychology, embodied phenomenology, and robotics/AI research. We conclude the paper by considering implications for machine consciousness.

    Infinite Hidden Conditional Random Fields for the Recognition of Human Behaviour

    While detecting and interpreting temporal patterns of nonverbal behavioral cues in a given context is a natural and often unconscious process for humans, it remains a rather difficult task for computer systems. In this thesis we are primarily motivated by the problem of recognizing expressions of high-level behavior, and specifically agreement and disagreement. We thoroughly dissect the problem by surveying the nonverbal behavioral cues that could be present during displays of agreement and disagreement; we discuss a number of methods that could be used or adapted to detect these suggested cues; we list some publicly available databases these tools could be trained on for the analysis of spontaneous, audiovisual instances of agreement and disagreement; we examine the few existing attempts at agreement and disagreement classification; and we discuss the challenges in automatically detecting agreement and disagreement. We present experiments that show that an existing discriminative graphical model, the Hidden Conditional Random Field (HCRF), is the best performing on this task. The HCRF is a discriminative latent variable model which has been previously shown to successfully learn the hidden structure of a given classification problem (provided an appropriate validation of the number of hidden states). We show here that HCRFs are also able to capture what makes each of these social attitudes unique. We present an efficient technique to analyze the concepts learned by the HCRF model and show that these coincide with the findings from social psychology regarding which cues are most prevalent in agreement and disagreement. Our experiments are performed on a spontaneous expressions dataset curated from real televised debates. The HCRF model outperforms conventional approaches such as Hidden Markov Models and Support Vector Machines.
Subsequently, we examine existing graphical models that use Bayesian nonparametrics to have a countably infinite number of hidden states and adapt their complexity to the data at hand. We identify a gap in the literature, namely the lack of a discriminative graphical model of this kind, and we present our suggestion for the first such model: an HCRF with an infinite number of hidden states, the Infinite Hidden Conditional Random Field (IHCRF). In summary, the IHCRF is an undirected discriminative graphical model for sequence classification and uses a countably infinite number of hidden states. We present two variants of this model. The first is a fully nonparametric model that relies on Hierarchical Dirichlet Processes and a Markov Chain Monte Carlo inference approach. The second is a semi-parametric model that uses Dirichlet Process Mixtures and relies on a mean-field variational inference approach. We show that both models are able to converge to a correct number of represented hidden states, and perform as well as the best finite HCRFs (chosen via cross-validation) for the difficult tasks of recognizing instances of agreement, disagreement, and pain in audiovisual sequences.