Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework
Action recognition from videos, i.e., classifying a video into one of the
pre-defined action types, has been a popular topic in the communities of
artificial intelligence, multimedia, and signal processing. However, existing
methods usually consider an input video as a whole and learn models, e.g.,
Convolutional Neural Networks (CNNs), with coarse video-level class labels.
These methods can only output an action class for the video, but cannot provide
fine-grained and explainable cues to answer why the video shows a specific
action. Therefore, researchers have started to focus on a new task, Part-level
Action Parsing (PAP), which aims not only to predict the video-level action but
also to recognize the frame-level fine-grained actions or interactions of body parts
for each person in the video. To this end, we propose a coarse-to-fine
framework for this challenging task. In particular, our framework first
predicts the video-level class of the input video, then localizes the body
parts and predicts the part-level actions. Moreover, to balance accuracy and
computation in part-level action parsing, we propose to recognize the
part-level actions from segment-level features. Furthermore, to overcome the
ambiguity of body parts, we propose a pose-guided positional embedding method
to accurately localize body parts. Through comprehensive experiments on a
large-scale dataset, i.e., Kinetics-TPS, our framework achieves
state-of-the-art performance, outperforming existing methods with a 31.10% ROC
score.
Comment: Accepted by IEEE ISCAS 2022, 5 pages, 2 figures. arXiv admin note:
text overlap with arXiv:2110.0336
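The two-stage pipeline and the pose-guided positional embedding can be
illustrated with a short sketch. The following PyTorch code is a minimal
illustration under assumed tensor shapes; the module names, feature dimension,
and class counts are placeholders, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PoseGuidedPartParser(nn.Module):
    """Coarse video-level classification, then fine part-level parsing."""

    def __init__(self, feat_dim=512, num_video_classes=24,
                 num_part_actions=74):
        super().__init__()
        self.video_head = nn.Linear(feat_dim, num_video_classes)
        # Pose-guided positional embedding: project 2D keypoints into the
        # feature space so that part-level features are tied to locations.
        self.pos_embed = nn.Linear(2, feat_dim)
        self.part_head = nn.Linear(feat_dim, num_part_actions)

    def forward(self, segment_feats, keypoints):
        # segment_feats: (B, T, D) segment-level features, cheaper than
        #                per-frame features
        # keypoints:     (B, T, P, 2) normalized body-part coordinates
        video_logits = self.video_head(segment_feats.mean(dim=1))  # coarse
        part_feats = segment_feats.unsqueeze(2) + self.pos_embed(keypoints)
        part_logits = self.part_head(part_feats)  # (B, T, P, actions), fine
        return video_logits, part_logits
```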
RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
Text-to-image customization, which aims to synthesize text-driven images for
the given subjects, has recently revolutionized content creation. Existing
works follow the pseudo-word paradigm, i.e., represent the given subjects as
pseudo-words and then compose them with the given text. However, the inherent
entangled influence scope of pseudo-words with the given text results in a
dual-optimum paradox, i.e., the similarity of the given subjects and the
controllability of the given text cannot be optimal simultaneously. We
present RealCustom, which for the first time disentangles similarity from
controllability by precisely limiting the subject's influence to relevant parts
only, achieved by gradually narrowing a real text word from its general
connotation to the specific subject and using its cross-attention to
distinguish relevance.
Specifically, RealCustom introduces a novel "train-inference" decoupled
framework: (1) during training, RealCustom learns general alignment between
visual conditions to original textual conditions by a novel adaptive scoring
module to adaptively modulate influence quantity; (2) during inference, a novel
adaptive mask guidance strategy is proposed to iteratively update the influence
scope and influence quantity of the given subjects to gradually narrow the
generation of the real text word. Comprehensive experiments demonstrate the
superior real-time customization ability of RealCustom in the open domain,
achieving both unprecedented similarity of the given subjects and
controllability of the given text for the first time. The project page is
https://corleone-huang.github.io/realcustom/.
Comment: Accepted by CVPR202
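The inference-time adaptive mask guidance can be sketched briefly. The
function below is a minimal PyTorch illustration of one iterative update of
the influence scope and quantity; the top-k thresholding and the blending
rule are assumed simplifications, not the paper's exact formulation.

```python
import torch

def adaptive_mask_step(cross_attn, subject_feat, text_feat, top_ratio=0.25):
    """One iterative update of influence scope (mask) and quantity.

    cross_attn:   (H*W,) attention of the narrowed real text word
    subject_feat: (H*W, D) features carrying the given subject
    text_feat:    (H*W, D) features driven by the original text
    """
    # Influence scope: keep only the most subject-relevant regions.
    k = max(1, int(top_ratio * cross_attn.numel()))
    thresh = torch.topk(cross_attn, k).values.min()
    mask = (cross_attn >= thresh).float().unsqueeze(-1)        # (H*W, 1)
    # Influence quantity: blend the subject in only inside the mask,
    # weighted by the normalized attention itself.
    weight = (cross_attn / cross_attn.max()).unsqueeze(-1)
    gate = mask * weight
    return gate * subject_feat + (1 - gate) * text_feat
```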
MCDAN: a Multi-scale Context-enhanced Dynamic Attention Network for Diffusion Prediction
Information diffusion prediction aims at predicting the target users in the
information diffusion path on social networks. Prior works mainly focus on the
observed structure or sequence of cascades, trying to predict which users will
be passively infected by the cascade. In this study, we argue that user intent
understanding is also a key part of information diffusion prediction. We
thereby propose a novel Multi-scale Context-enhanced Dynamic Attention Network
(MCDAN) to predict which user will most likely join the observed current
cascades. Specifically, to consider the global interactive relationship among
users, we take full advantage of user friendships and global cascading
relationships, which are extracted from the social network and historical
cascades, respectively. To refine the model's ability to understand the user's
preference for the current cascade, we propose a multi-scale sequential
hypergraph attention module to capture the dynamic preference of users at
different time scales. Moreover, we design a contextual attention enhancement
module to strengthen the interaction of user representations within the current
cascade. Finally, to incorporate each user's own susceptibility, we construct a
susceptibility label for each user based on user susceptibility analysis and
use the rank of this label for auxiliary prediction. We conduct experiments
on four widely used datasets and show that MCDAN significantly outperforms
state-of-the-art models, with average improvements of up to 10.61% in
Hits@100 and 9.71% in MAP@100.
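For reference, Hits@k and MAP@k in diffusion prediction are typically
computed per prediction step against a single ground-truth next user; the
sketch below shows that common formulation, not MCDAN's specific evaluation
code.

```python
import numpy as np

def hits_and_ap_at_k(scores, target, k=100):
    """Hits@k and AP@k for one prediction step with one ground truth.

    scores: (num_users,) model scores over all candidate users
    target: index of the ground-truth next user in the cascade
    """
    topk = np.argsort(-scores)[:k]            # ranked top-k candidates
    rank = np.where(topk == target)[0]
    hit = float(rank.size > 0)                # 1 if target is in top-k
    ap = 1.0 / (rank[0] + 1) if hit else 0.0  # reciprocal rank within top-k
    return hit, ap
```

The per-step values are then averaged over all prediction steps and test
cascades to obtain Hits@100 and MAP@100.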
Context-Aware Visual Policy Network for Fine-Grained Image Captioning
With the maturity of visual detection techniques, we are more ambitious in
describing visual content with open-vocabulary, fine-grained and free-form
language, i.e., the task of image captioning. In particular, we are interested
in generating longer, richer and more fine-grained sentences and paragraphs as
image descriptions. Image captioning can be translated to the task of
sequential language prediction given visual content, where the output sequence
forms a natural language description with plausible grammar. However, existing
image captioning methods focus only on the language policy but not the visual
policy, and thus fail to capture the visual context that is crucial for
compositional reasoning such as object relationships (e.g., "man riding horse")
and visual comparisons (e.g., "small(er) cat"). This issue is especially severe when
generating longer sequences such as a paragraph. To fill the gap, we propose a
Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language
generation: image sentence captioning and image paragraph captioning. During
captioning, CAVP explicitly considers the previous visual attentions as
context, and decides whether the context is used for the current word/sentence
generation given the current visual attention. Compared with the traditional
visual attention mechanism, which fixes only a single visual region at each step,
CAVP can attend to complex visual compositions over time. The whole image
captioning model -- CAVP and its subsequent language policy network -- can be
efficiently optimized end-to-end by using an actor-critic policy gradient
method. We have demonstrated the effectiveness of CAVP by state-of-the-art
performances on MS-COCO and Stanford captioning datasets, using various metrics
and sensible visualizations of qualitative visual context.
Comment: Accepted to IEEE Transactions on Pattern Analysis and Machine
Intelligence (T-PAMI). Extended version of "Context-Aware Visual Policy
Network for Sequence-Level Image Captioning", ACM MM 2018 (arXiv:1808.05864).
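The sequence-level optimization can be sketched in a few lines. The loss
below is a generic actor-critic policy-gradient formulation for captioning,
assuming per-token log-probabilities of a sampled caption, a sentence-level
reward such as CIDEr, and a learned value baseline; it illustrates the
training signal rather than CAVP's exact implementation.

```python
import torch
import torch.nn.functional as F

def actor_critic_caption_loss(log_probs, rewards, values):
    """Actor-critic policy-gradient loss for sequence-level captioning.

    log_probs: (B, T) log pi(w_t | context) of the sampled caption tokens
    rewards:   (B,)   sentence-level reward of each sample, e.g. CIDEr
    values:    (B,)   critic's estimate of the expected reward (baseline)
    """
    advantage = (rewards - values).detach()    # baseline reduces variance
    policy_loss = -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()
    critic_loss = F.mse_loss(values, rewards)  # fit the baseline
    return policy_loss + critic_loss
```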
Promoting Generalization for Exact Solvers via Adversarial Instance Augmentation
Machine learning has been successfully applied to improve the efficiency of
Mixed-Integer Linear Programming (MILP) solvers. However, the learning-based
solvers often suffer from severe performance degradation on unseen MILP
instances -- especially on large-scale instances from a perturbed environment
-- due to the limited diversity of training distributions. To tackle this
problem, we propose Adversarial Instance Augmentation for B&B Solvers
(AdaSolver), a novel approach that promotes data diversity for learning-based
branching modules in branch-and-bound (B&B) solvers and does not require
knowledge of the problem type to generate new instances. We use the bipartite graph
representations for MILP instances and obtain various perturbed instances to
regularize the solver by augmenting the graph structures with a learned
augmentation policy. The major technical contribution of AdaSolver is that we
formulate the non-differentiable instance augmentation as a contextual bandit
problem and adversarially train the learning-based solver and augmentation
policy, enabling efficient gradient-based training of the augmentation policy.
To the best of our knowledge, AdaSolver is the first general and effective
framework for understanding and improving the generalization of both
imitation-learning-based (IL-based) and reinforcement-learning-based (RL-based)
B&B solvers. Extensive experiments demonstrate that, by producing diverse
augmented instances, AdaSolver leads to remarkable efficiency improvements
across various distributions.
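The contextual-bandit view of instance augmentation can be sketched
compactly. Below is a minimal REINFORCE-style update for an augmentation
policy, with hypothetical dimensions and a caller-supplied reward function
standing in for the solver-in-the-loop adversarial reward; it is an
illustration of the formulation, not AdaSolver's API.

```python
import torch
import torch.nn as nn

CTX_DIM, NUM_ARMS = 64, 8   # hypothetical context size and arm count

aug_policy = nn.Linear(CTX_DIM, NUM_ARMS)   # contextual-bandit policy
opt = torch.optim.Adam(aug_policy.parameters(), lr=1e-3)

def bandit_update(context, reward_fn):
    """One policy-gradient update of the augmentation policy.

    context:   (CTX_DIM,) embedding of the MILP instance's bipartite graph
    reward_fn: arm -> float, e.g. the solver's performance drop on the
               instance perturbed by that arm (adversarial reward)
    """
    probs = torch.softmax(aug_policy(context), dim=-1)
    arm = torch.multinomial(probs, 1).item()   # sample one augmentation arm
    reward = reward_fn(arm)                    # non-differentiable environment
    loss = -torch.log(probs[arm]) * reward     # REINFORCE estimator
    opt.zero_grad(); loss.backward(); opt.step()
    return arm
```

The solver is then trained (by imitation or reinforcement learning) on the
instances the policy has made harder, which forms the adversarial half of
the loop.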