141 research outputs found
AI-generated Content for Various Data Modalities: A Survey
AI-generated content (AIGC) methods aim to produce text, images, videos, 3D
assets, and other media using AI algorithms. Due to its wide range of
applications and the demonstrated potential of recent works, AIGC developments
have been attracting lots of attention recently, and AIGC methods have been
developed for various data modalities, such as image, video, text, 3D shape (as
voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human
avatar (body and head), 3D motion, and audio -- each presenting different
characteristics and challenges. Furthermore, there have also been many
significant developments in cross-modality AIGC methods, where generative
methods can receive conditioning input in one modality and produce outputs in
another. Examples include going from various modalities to image, video, 3D
shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar),
and audio modalities. In this paper, we provide a comprehensive review of AIGC
methods across different data modalities, including both single-modality and
cross-modality methods, highlighting the various challenges, representative
works, and recent technical directions in each setting. We also survey the
representative datasets throughout the modalities, and present comparative
results for various modalities. Moreover, we also discuss the challenges and
potential future research directions
Adversarial Training in Affective Computing and Sentiment Analysis: Recent Advances and Perspectives
Over the past few years, adversarial training has become an extremely active
research topic and has been successfully applied to various Artificial
Intelligence (AI) domains. As a potentially crucial technique for the
development of the next generation of emotional AI systems, we herein provide a
comprehensive overview of the application of adversarial training to affective
computing and sentiment analysis. Various representative adversarial training
algorithms are explained and discussed accordingly, aimed at tackling diverse
challenges associated with emotional AI systems. Further, we highlight a range
of potential future research directions. We expect that this overview will help
facilitate the development of adversarial training for affective computing and
sentiment analysis in both the academic and industrial communities
Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field
that aims to design computer agents with intelligent capabilities such as
understanding, reasoning, and learning through integrating multiple
communicative modalities, including linguistic, acoustic, visual, tactile, and
physiological messages. With the recent interest in video understanding,
embodied autonomous agents, text-to-image generation, and multisensor fusion in
application domains such as healthcare and robotics, multimodal machine
learning has brought unique computational and theoretical challenges to the
machine learning community given the heterogeneity of data sources and the
interconnections often found between modalities. However, the breadth of
progress in multimodal research has made it difficult to identify the common
themes and open questions in the field. By synthesizing a broad range of
application domains and theoretical frameworks from both historical and recent
perspectives, this paper is designed to provide an overview of the
computational and theoretical foundations of multimodal machine learning. We
start by defining two key principles of modality heterogeneity and
interconnections that have driven subsequent innovations, and propose a
taxonomy of 6 core technical challenges: representation, alignment, reasoning,
generation, transference, and quantification covering historical and recent
trends. Recent technical achievements will be presented through the lens of
this taxonomy, allowing researchers to understand the similarities and
differences across new approaches. We end by motivating several open problems
for future research as identified by our taxonomy
Recommended from our members
Visual Dynamics Models for Robotic Planning and Control
For a robot to interact with its environment, it must perceive the world and understand how the world evolves as a consequence of its actions. This thesis studies a few methods that a robot can use to respond to its observations, with a focus on instances that can leverage visual dynamic models. In general, these are models of how the visual observations of a robot evolves as a consequence of its actions. This could be in the form of predictive models that directly predict the future in the space of image pixels, in the space of visual features extracted from these images, or in the space of compact learned latent representations. The three instances that this thesis studies are in the context of visual servoing, visual planning, and representation learning for reinforcement learning. In the first case, we combine learned visual features with learning single-step predictive dynamics models and reinforcement learning to learn visual servoing mechanisms. In the second case, we use a deterministic multi-step video prediction model to achieve various manipulation tasks through visual planning. In addition, we show that conventional video prediction models are unequipped to model uncertainty and multiple futures, which could limit the planning capabilities of the robot. To address this, we propose a stochastic video prediction model that is trained with a combination of variational losses, adversarial losses, and perceptual losses, and show that this model can predict futures that are more realistic, diverse, and accurate. Unlike the first two cases, in which the dynamics model is used to make predictions for decision-making, the third case learns the model solely for representation learning. We learn a stochastic sequential latent variable model to learn a latent representation, and then use it as an intermediate representation for reinforcement learning. We show that this approach improves final performance and sample efficiency
Modeling Visual Rhetoric and Semantics in Multimedia
Recent advances in machine learning have enabled computer vision algorithms to model complicated visual phenomena with accuracies unthinkable a mere decade ago. Their high-performance on a plethora of vision-related tasks has enabled computer vision researchers to begin to move beyond traditional visual recognition problems to tasks requiring higher-level image understanding. However, most computer vision research still focuses on describing what images, text, or other media literally portrays. In contrast, in this dissertation we focus on learning how and why such content is portrayed. Rather than viewing media for its content, we recast the problem as understanding visual communication and visual rhetoric. For example, the same content may be portrayed in different ways in order to present the story the author wishes to convey. We thus seek to model not only the content of the media, but its authorial intent and latent messaging. Understanding how and why visual content is portrayed a certain way requires understanding higher level abstract semantic concepts which are themselves latent within visual media. By latent, we mean the concept is not readily visually accessible within a single image (e.g. right vs left political bias), in contrast to explicit visual semantic concepts such as objects.
Specifically, we study the problems of modeling photographic style (how professional photographers portray their subjects), understanding visual persuasion in image advertisements, modeling political bias in multimedia (image and text) news articles, and learning cross-modal semantic representations. While most past research in vision and natural language processing studies the case where visual content and paired text are highly aligned (as in the case of image captions), we target the case where each modality conveys complementary information to tell a larger story. We particularly focus on the problem of learning cross-modal representations from multimedia exhibiting weak alignment between the image and text modalities. A variety of techniques are presented which improve modeling of multimedia rhetoric in real-world data and enable more robust artificially intelligent systems
On discovering and learning structure under limited supervision
Les formes, les surfaces, les événements et les objets (vivants et non vivants) constituent le monde. L'intelligence des agents naturels, tels que les humains, va au-delà de la simple reconnaissance de formes. Nous excellons à construire des représentations et à distiller des connaissances pour comprendre et déduire la structure du monde. Spécifiquement, le développement de telles capacités de raisonnement peut se produire même avec une supervision limitée.
D'autre part, malgré son développement phénoménal, les succès majeurs de l'apprentissage automatique, en particulier des modèles d'apprentissage profond, se situent principalement dans les tâches qui ont accès à de grands ensembles de données annotées. Dans cette thèse, nous proposons de nouvelles solutions pour aider à combler cette lacune en permettant aux modèles d'apprentissage automatique d'apprendre la structure et de permettre un raisonnement efficace en présence de tâches faiblement supervisés.
Le thème récurrent de la thèse tente de s'articuler autour de la question « Comment un système perceptif peut-il apprendre à organiser des informations sensorielles en connaissances utiles sous une supervision limitée ? » Et il aborde les thèmes de la géométrie, de la composition et des associations dans quatre articles distincts avec des applications à la vision par ordinateur (CV) et à l'apprentissage par renforcement (RL).
Notre première contribution ---Pix2Shape---présente une approche basée sur l'analyse par synthèse pour la perception. Pix2Shape exploite des modèles génératifs probabilistes pour apprendre des représentations 3D à partir d'images 2D uniques. Le formalisme qui en résulte nous offre une nouvelle façon de distiller l'information d'une scène ainsi qu'une représentation puissantes des images. Nous y parvenons en augmentant l'apprentissage profond non supervisé avec des biais inductifs basés sur la physique pour décomposer la structure causale des images en géométrie, orientation, pose, réflectance et éclairage.
Notre deuxième contribution ---MILe--- aborde les problèmes d'ambiguïté dans les ensembles de données à label unique tels que ImageNet. Il est souvent inapproprié de décrire une image avec un seul label lorsqu'il est composé de plus d'un objet proéminent. Nous montrons que l'intégration d'idées issues de la littérature linguistique cognitive et l'imposition de biais inductifs appropriés aident à distiller de multiples descriptions possibles à l'aide d'ensembles de données aussi faiblement étiquetés.
Ensuite, nous passons au paradigme d'apprentissage par renforcement, et considérons un agent interagissant avec son environnement sans signal de récompense. Notre troisième contribution ---HaC--- est une approche non supervisée basée sur la curiosité pour apprendre les associations entre les modalités visuelles et tactiles. Cela aide l'agent à explorer l'environnement de manière autonome et à utiliser davantage ses connaissances pour s'adapter aux tâches en aval. La supervision dense des récompenses n'est pas toujours disponible (ou n'est pas facile à concevoir), dans de tels cas, une exploration efficace est utile pour générer un comportement significatif de manière auto-supervisée.
Pour notre contribution finale, nous abordons l'information limitée contenue dans les représentations obtenues par des agents RL non supervisés. Ceci peut avoir un effet néfaste sur la performance des agents lorsque leur perception est basée sur des images de haute dimension. Notre approche a base de modèles combine l'exploration et la planification sans récompense pour affiner efficacement les modèles pré-formés non supervisés, obtenant des résultats comparables à un agent entraîné spécifiquement sur ces tâches. Il s'agit d'une étape vers la création d'agents capables de généraliser rapidement à plusieurs tâches en utilisant uniquement des images comme perception.Shapes, surfaces, events, and objects (living and non-living) constitute the world. The intelligence of natural agents, such as humans is beyond pattern recognition. We excel at building representations and distilling knowledge to understand and infer the structure of the world. Critically, the development of such reasoning capabilities can occur even with limited supervision.
On the other hand, despite its phenomenal development, the major successes of machine learning, in particular, deep learning models are primarily in tasks that have access to large annotated datasets. In this dissertation, we propose novel solutions to help address this gap by enabling machine learning models to learn the structure and enable effective reasoning in the presence of weakly supervised settings.
The recurring theme of the thesis tries to revolve around the question of "How can a perceptual system learn to organize sensory information into useful knowledge under limited supervision?" And it discusses the themes of geometry, compositions, and associations in four separate articles with applications to computer vision (CV) and reinforcement learning (RL).
Our first contribution ---Pix2Shape---presents an analysis-by-synthesis based approach(also referred to as inverse graphics) for perception. Pix2Shape leverages probabilistic generative models to learn 3D-aware representations from single 2D images. The resulting formalism allows us to perform a novel view synthesis of a scene and produce powerful representations of images. We achieve this by augmenting unsupervised learning with physically based inductive biases to decompose a scene structure into geometry, pose, reflectance and lighting.
Our Second contribution ---MILe--- addresses the ambiguity issues in single-labeled datasets such as ImageNet. It is often inappropriate to describe an image with a single label when it is composed of more than one prominent object. We show that integrating ideas from Cognitive linguistic literature and imposing appropriate inductive biases helps in distilling multiple possible descriptions using such weakly labeled datasets.
Next, moving into the RL setting, we consider an agent interacting with its environment without a reward signal. Our third Contribution ---HaC--- is a curiosity based unsupervised approach to learning associations between visual and tactile modalities. This aids the agent to explore the environment in an analogous self-guided fashion and further use this knowledge to adapt to downstream tasks.
In the absence of reward supervision, intrinsic movitivation is useful to generate meaningful behavior in a self-supervised manner.
In our final contribution, we address the representation learning bottleneck in unsupervised RL agents that has detrimental effect on the performance on high-dimensional pixel based inputs. Our model-based approach combines reward-free exploration and planning to efficiently fine-tune unsupervised pre-trained models, achieving comparable results to task-specific baselines. This is a step towards building agents that can generalize quickly on more than a single task using image inputs alone
Generative Models for Novelty Detection Applications in abnormal event and situational changedetection from data series
Novelty detection is a process for distinguishing the observations that differ in some respect
from the observations that the model is trained on. Novelty detection is one of the fundamental
requirements of a good classification or identification system since sometimes the
test data contains observations that were not known at the training time. In other words, the
novelty class is often is not presented during the training phase or not well defined.
In light of the above, one-class classifiers and generative methods can efficiently model
such problems. However, due to the unavailability of data from the novelty class, training
an end-to-end model is a challenging task itself. Therefore, detecting the Novel classes in
unsupervised and semi-supervised settings is a crucial step in such tasks.
In this thesis, we propose several methods to model the novelty detection problem in
unsupervised and semi-supervised fashion. The proposed frameworks applied to different
related applications of anomaly and outlier detection tasks. The results show the superior of
our proposed methods in compare to the baselines and state-of-the-art methods
- …