CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting
Contrastive Language-Image Pre-training (CLIP) has begun to appear in many
computer vision tasks and has achieved promising performance. However, it
remains underexplored whether CLIP can be generalized to 3D hand pose
estimation, as bridging text prompts with pose-aware features presents
significant challenges due to the discrete nature of joint positions in 3D
space. In this paper, we make one of the first attempts to propose a novel 3D
hand pose estimator from monocular images, dubbed CLIP-Hand3D, which
successfully bridges the gap between text prompts and irregular detailed pose
distribution. In particular, the distribution order of hand joints in various
3D space directions is derived from pose labels, forming corresponding text
prompts that are subsequently encoded into text representations.
Simultaneously, 21 hand joints in the 3D space are retrieved, and their spatial
distribution (in x, y, and z axes) is encoded to form pose-aware features.
Subsequently, we maximize semantic consistency for a pair of pose-text features
following a CLIP-based contrastive learning paradigm. Furthermore, a
coarse-to-fine mesh regressor is designed, which is capable of effectively
querying joint-aware cues from the feature pyramid. Extensive experiments on
several public hand benchmarks show that the proposed model attains a
significantly faster inference speed while achieving state-of-the-art
performance compared to methods using a backbone of similar scale.
Comment: Accepted in Proceedings of the 31st ACM International Conference on
Multimedia (MM '23)
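The CLIP-based contrastive step described in this abstract can be illustrated with a minimal sketch. It assumes the standard symmetric cross-entropy (InfoNCE) objective used by CLIP, applied to a batch of matched pose and text feature vectors. The function names, the toy 2-D features, and the temperature of 0.07 are illustrative assumptions, not details taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(pose_feats, text_feats, temperature=0.07):
    """Symmetric cross-entropy over a batch of matched (pose, text) pairs.

    Row i of pose_feats is assumed to match row i of text_feats;
    every other pairing in the batch is treated as a negative.
    """
    pose = [l2_normalize(p) for p in pose_feats]
    text = [l2_normalize(t) for t in text_feats]
    n = len(pose)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(p, t)) / temperature for t in text]
              for p in pose]
    logits_t = [list(col) for col in zip(*logits)]
    loss_pose_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_text_to_pose = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_pose_to_text + loss_text_to_pose)

# Toy batch: aligned pairs should score a lower loss than swapped pairs.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimizing this loss pulls each pose feature toward its own text feature and away from the others in the batch, which is one common reading of "maximizing semantic consistency" for a pair of features.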
Inverting Adversarially Robust Networks for Image Synthesis
Recent research in adversarially robust classifiers suggests their
representations tend to be aligned with human perception, which makes them
attractive for image synthesis and restoration applications. Despite favorable
empirical results on a few downstream tasks, their advantages are limited to
slow and sensitive optimization-based techniques. Moreover, their use on
generative models remains unexplored. This work proposes the use of robust
representations as a perceptual primitive for feature inversion models, and
shows their benefits with respect to standard non-robust image features. We
empirically show that adopting robust representations as an image prior
significantly improves the reconstruction accuracy of CNN-based feature
inversion models. Furthermore, it allows reconstructing images at multiple
scales out-of-the-box. Following these findings, we propose an
encoding-decoding network based on robust representations and show its
advantages for applications such as anomaly detection, style transfer and image
denoising.
Closing the gap: Exact maximum likelihood training of generative autoencoders using invertible layers
In this work, we provide an exact likelihood alternative to the variational
training of generative autoencoders. We show that VAE-style autoencoders can be
constructed using invertible layers, which offer a tractable exact likelihood
without the need for any regularization terms. This is achieved while leaving
complete freedom in the choice of encoder, decoder and prior architectures,
making our approach a drop-in replacement for the training of existing VAEs and
VAE-style models. We refer to the resulting models as Autoencoders within Flows
(AEF), since the encoder, decoder and prior are defined as individual layers of
an overall invertible architecture. We show that the approach results in
strikingly higher performance than architecturally equivalent VAEs in terms of
log-likelihood, sample quality and denoising performance. In a broad sense, the
main ambition of this work is to close the gap between the normalizing flow and
autoencoder literature under the common framework of invertibility and exact
maximum likelihood.
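The tractable exact likelihood that invertible layers provide rests on the change-of-variables identity log p(x) = log p_z(f(x)) + log |det df/dx|. The sketch below illustrates only that identity, using a hypothetical one-dimensional affine layer with a standard normal prior. It is not the AEF architecture, and all function names are assumptions made for illustration.

```python
import math

def standard_normal_logpdf(z):
    """Log-density of N(0, 1) at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def affine_flow_logpdf(x, scale, shift):
    """Exact log-likelihood of x under the invertible map z = (x - shift) / scale.

    The log-determinant term accounts for how the map stretches volume:
    dz/dx = 1 / scale, so log|det| = -log|scale|.
    """
    z = (x - shift) / scale
    log_det = -math.log(abs(scale))
    return standard_normal_logpdf(z) + log_det

def gaussian_logpdf(x, mean, std):
    """Closed-form log-density of N(mean, std^2), used here only as a check."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))

# The flow likelihood should agree with the closed-form Gaussian density.
x = 1.3
print(abs(affine_flow_logpdf(x, 2.0, 0.5) - gaussian_logpdf(x, 0.5, 2.0)) < 1e-12)
```

Because the likelihood is computed exactly, no variational bound or extra regularization term is needed in this toy case, which mirrors the property the abstract attributes to invertible layers.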
A Review of Graph Neural Networks and Their Applications in Power Systems
Deep neural networks have revolutionized many machine learning tasks in power
systems, ranging from pattern recognition to signal processing. The data in
these tasks is typically represented in Euclidean domains. Nevertheless, there
is an increasing number of applications in power systems, where data are
collected from non-Euclidean domains and represented as graph-structured data
with high dimensional features and interdependency among nodes. The complexity
of graph-structured data has brought significant challenges to the existing
deep neural networks defined in Euclidean domains. Recently, many publications
generalizing deep neural networks for graph-structured data in power systems
have emerged. In this paper, a comprehensive overview of graph neural networks
(GNNs) in power systems is presented. Specifically, several classical paradigms
of GNN structures (e.g., graph convolutional networks) are summarized, and key
applications in power systems, such as fault scenario application, time series
prediction, power flow calculation, and data generation, are reviewed in detail.
Furthermore, the main issues and some research trends in the applications of
GNNs in power systems are discussed.
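As a concrete instance of the graph convolutional networks mentioned in this abstract, the sketch below applies one layer of the widely used symmetric-normalized propagation rule, H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W). The three-bus line network, the single scalar feature per bus, and the function names are hypothetical and chosen only for illustration. The review itself does not prescribe this code.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own feature.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization by the degrees of both endpoints.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Hypothetical 3-bus line network 0 - 1 - 2 with one feature per bus.
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
features = [[1.0], [0.0], [0.0]]
weights = [[1.0]]
print(gcn_layer(adjacency, features, weights))
```

After one layer, the feature at bus 0 has spread to its neighbour bus 1 but not to bus 2, which shows how each layer mixes information only along the edges of the network.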
Incorporating Human Expertise in Robot Motion Learning and Synthesis
With the exponential growth of robotics and the fast development of their advanced cognitive and motor capabilities, one can start to envision humans and robots working together in unstructured environments. Yet, for that to be possible, robots need to be programmed for such types of complex scenarios, which demands significant domain knowledge in robotics and control. One viable approach to enable robots to acquire skills in a more flexible and efficient way is to give them the capability to learn autonomously from human demonstrations and expertise through interaction. Such a framework helps to make the creation of skills in robots more social and less demanding on programming and robotics expertise. Yet, current imitation learning approaches suffer from significant limitations, mainly in the flexibility and efficiency of representing, learning and reasoning about motor tasks. This thesis addresses this problem by exploring cost-function-based approaches to learning robot motion control, perception and the interplay between them. To begin with, the thesis proposes an efficient probabilistic algorithm to learn an impedance controller to accommodate motion contacts. The learning algorithm is able to incorporate important domain constraints, e.g., about force representation and decomposition, which are nontrivial to handle by standard techniques. Compliant handwriting motions are developed on an articulated robot arm and a multi-fingered hand. This work provides a flexible approach to learn robot motion conforming to both task and domain constraints. Furthermore, the thesis also contributes techniques to learn from and reason about demonstrations with partial observability. The proposed approach combines inverse optimal control and ensemble methods, yielding tractable learning of cost functions with latent variables. Two task priors are further incorporated. 
The first, a human kinematics prior, results in a model which synthesizes rich and believable dynamical handwriting. The latter prior enforces dynamics on the latent variable and facilitates real-time human intention cognition and on-line motion adaptation in collaborative robot tasks. Finally, the thesis establishes a link between control and perception modalities. This work offers an analysis that bridges inverse optimal control and deep generative models, as well as a novel algorithm that learns cost features and embeds the modal coupling prior. This work contributes an end-to-end system for synthesizing arm joint motion from letter image pixels. The results highlight its robustness against noisy and out-of-sample sensory inputs. Overall, the proposed approach endows robots with the potential to reason about diverse unstructured data, which is nowadays pervasive but hard to process for current imitation learning.
Deep Neural Networks for Visual Reasoning, Program Induction, and Text-to-Image Synthesis.
Deep neural networks excel at pattern recognition, especially in the setting of large scale supervised learning. A combination of better hardware, more data, and algorithmic improvements has yielded breakthroughs in image classification, speech recognition and other perception problems. The research frontier has shifted towards the weak side of neural networks: reasoning, planning, and (like all machine learning algorithms) creativity. How can we advance along this frontier using the same generic techniques so effective in pattern recognition, i.e., gradient descent with backpropagation? In this thesis I develop neural architectures with new capabilities in visual reasoning, program induction and text-to-image synthesis. I propose two models that disentangle the latent visual factors of variation that give rise to images, and enable analogical reasoning in the latent space. I show how to augment a recurrent network with a memory of programs that enables the learning of compositional structure for more data-efficient and generalizable program induction. Finally, I develop a generative neural network that translates descriptions of birds, flowers and other categories into compelling natural images.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/135763/1/reedscot_1.pd
Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes
Humans and animals have a rich and flexible understanding of the physical
world, which enables them to infer the underlying dynamical trajectories of
objects and events, plausible future states, and use that to plan and
anticipate the consequences of actions. However, the neural mechanisms
underlying these computations are unclear. We combine a goal-driven modeling
approach with dense neurophysiological data and high-throughput human
behavioral readouts to directly impinge on this question. Specifically, we
construct and evaluate several classes of sensory-cognitive networks to predict
the future state of rich, ethologically-relevant environments, ranging from
self-supervised end-to-end models with pixel-wise or object-centric objectives,
to models that future predict in the latent space of purely static image-based
or dynamic video-based pretrained foundation models. We find strong
differentiation across these model classes in their ability to predict neural
and behavioral data both within and across diverse environments. In particular,
we find that neural responses are currently best predicted by models trained to
predict the future state of their environment in the latent space of pretrained
foundation models optimized for dynamic scenes in a self-supervised manner.
Notably, models that future predict in the latent space of video foundation
models that are optimized to support a diverse range of sensorimotor tasks,
reasonably match both human behavioral error patterns and neural dynamics
across all environmental scenarios that we were able to test. Overall, these
findings suggest that the neural mechanisms and behaviors of primate mental
simulation are thus far most consistent with being optimized to future predict
on dynamic, reusable visual representations that are useful for embodied AI
more generally.
Comment: 17 pages, 6 figures
Pathway to Future Symbiotic Creativity
This report presents a comprehensive view of our vision for the development
path of human-machine symbiotic art creation. We propose a classification
of the creative system with a hierarchy of 5 classes, showing the pathway of
creativity evolving from a mimic-human artist (Turing Artists) to a Machine
artist in its own right. We begin with an overview of the limitations of the
Turing Artists then focus on the top two-level systems, Machine Artists,
emphasizing machine-human communication in art creation. In art creation, it is
necessary for machines to understand humans' mental states, including desires,
appreciation, and emotions; humans also need to understand machines' creative
capabilities and limitations. The rapid development of immersive environments
and their further evolution into the new concept of the metaverse enable symbiotic art
creation through unprecedented flexibility of bi-directional communication
between artists and art manifestation environments. By examining the latest
sensor and XR technologies, we illustrate the novel way for art data collection
to constitute the base of a new form of human-machine bidirectional
communication and understanding in art creation. Based on such communication
and understanding mechanisms, we propose a novel framework for building future
Machine artists, which comes with the philosophy that a human-compatible AI
system should be based on the "human-in-the-loop" principle rather than the
traditional "end-to-end" dogma. By proposing a new form of inverse
reinforcement learning model, we outline the platform design of machine
artists, demonstrate its functions and showcase some examples of technologies
we have developed. We also provide a systematic exposition of the ecosystem for
AI-based symbiotic art form and community with an economic model built on NFT
technology. Ethical issues for the development of machine artists are also
discussed.