Preserving Modality Structure Improves Multi-Modal Learning
Self-supervised learning on large-scale multi-modal datasets allows learning
semantically meaningful embeddings in a joint multi-modal representation space
without relying on human annotations. These joint embeddings enable zero-shot
cross-modal tasks like retrieval and classification. However, these methods
often struggle to generalize well on out-of-domain data as they ignore the
semantic structure present in modality-specific embeddings. In this context, we
propose a novel Semantic-Structure-Preserving Consistency approach to improve
generalizability by preserving the modality-specific relationships in the joint
embedding space. To capture modality-specific semantic relationships between
samples, we propose to learn multiple anchors and represent the multifaceted
relationship between samples with respect to their relationship with these
anchors. To assign multiple anchors to each sample, we propose a novel
Multi-Assignment Sinkhorn-Knopp algorithm. Our experiments demonstrate
that our proposed approach learns semantically meaningful anchors in a
self-supervised manner. Furthermore, our evaluation on the MSR-VTT and YouCook2
datasets demonstrates that our proposed multi-anchor assignment-based solution
achieves state-of-the-art performance and generalizes to both in- and
out-of-domain datasets.
Code: https://github.com/Swetha5/Multi_Sinkhorn_Knopp
Comment: Accepted at ICCV 202
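The paper's Multi-Assignment Sinkhorn-Knopp algorithm builds on the classic Sinkhorn-Knopp iteration. As a hedged sketch (the base iteration only, not the paper's multi-assignment variant, with toy sample-to-anchor affinities), it alternately rescales rows and columns of a positive score matrix until it is approximately doubly stochastic:

```python
import numpy as np

def sinkhorn_knopp(scores, n_iters=100):
    """Balance a positive matrix so rows and columns each sum to ~1."""
    P = np.asarray(scores, dtype=float)
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)  # row normalization
        P = P / P.sum(axis=0, keepdims=True)  # column normalization
    return P

# Toy sample-to-anchor affinities (3 samples x 3 anchors).
P = sinkhorn_knopp(np.array([[4.0, 1.0, 1.0],
                             [1.0, 3.0, 1.0],
                             [1.0, 1.0, 2.0]]))
```

The balanced matrix can then be read as a soft assignment of samples to anchors; the multi-assignment variant in the paper adapts the constraints so each sample can be assigned to several anchors.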
EvIcon: Designing High-Usability Icon with Human-in-the-loop Exploration and IconCLIP
Interface icons are prevalent in various digital applications. Due to limited
time and budgets, many designers rely on informal evaluation, which often
results in icons with poor usability. In this paper, we propose a unique
human-in-the-loop framework that allows our target users, i.e., novice and
professional UI designers, to improve the usability of interface icons
efficiently. We formulate several usability criteria into a perceptual
usability function and enable users to iteratively revise an icon set with an
interactive design tool, EvIcon. We take a large-scale pre-trained joint
image-text embedding (CLIP) and fine-tune it to embed icon visuals with icon
tags in the same embedding space (IconCLIP). During the revision process, our
design tool provides two types of instant perceptual usability feedback. First,
we provide perceptual usability feedback modeled by deep learning models
trained on IconCLIP embeddings and crowdsourced perceptual ratings. Second, we
use the embedding space of IconCLIP to assist users in improving the visual
distinguishability among icons within the user-prepared icon set. To provide
the perceptual prediction, we compiled IconCEPT10K, the first large-scale
dataset of perceptual usability ratings over interface icons, by
conducting a crowdsourcing study. We demonstrated that our framework could
benefit the interface icon revision process of UI designers with a wide range
of professional experience. Moreover, the interface icons designed using our
framework achieved better semantic distance and familiarity, as verified by an
additional online user study.
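One way to sketch the distinguishability feedback described above: score each icon by how close its embedding is to its nearest neighbor within the set (a high nearest-neighbor cosine similarity means the icon is easily confused). The embeddings and function name below are illustrative stand-ins, not IconCLIP's actual API:

```python
import numpy as np

def distinguishability(embeddings):
    """Per-icon score: 1 minus cosine similarity to the nearest other icon."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = E @ E.T                    # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)   # ignore self-similarity
    return 1.0 - sim.max(axis=1)     # higher = more distinguishable

# Toy 2-D "embeddings": the first two icons look alike, the third does not.
icons = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
scores = distinguishability(icons)
```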
Modeling relation paths for knowledge base completion via joint adversarial training
Knowledge Base Completion (KBC), which aims at determining the missing
relations between entity pairs, has received increasing attention in recent
years. Most existing KBC methods focus on either embedding the Knowledge Base
(KB) into a specific semantic space or leveraging the joint probability of
Random Walks (RWs) on multi-hop paths. Few unified models adequately take
both semantic and path-related features into consideration. In this
paper, we propose a novel method to explore the intrinsic relationship between
the single relation (i.e. 1-hop path) and multi-hop paths between paired
entities. We use Hierarchical Attention Networks (HANs) to select important
relations in multi-hop paths and encode them into low-dimensional vectors. By
treating relations and multi-hop paths as two different input sources, we use a
feature extractor, which is shared by two downstream components (i.e. relation
classifier and source discriminator), to capture shared/similar information
between them. By joint adversarial training, we encourage our model to extract
features from the multi-hop paths which are representative for relation
completion. We apply the trained model (except for the source discriminator) to
several large-scale KBs for relation completion. Experimental results show that
our method outperforms existing path information-based approaches. Since each
sub-module of our model is readily interpretable, our model can be applied to a
wide range of relation learning tasks.
Comment: Accepted by Knowledge-Based System
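The core of attention-based path encoding can be sketched in a few lines: score each relation vector on a multi-hop path against a query vector, softmax the scores, and pool the relations with those weights. This is a minimal illustration with toy numpy data, not the paper's full Hierarchical Attention Network:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attention_pool(relation_vecs, query):
    """Weight each hop's relation vector by its dot-product score
    with the query, then return the weighted sum as the path embedding."""
    weights = softmax(relation_vecs @ query)  # one weight per hop
    return weights @ relation_vecs            # weighted sum -> path vector

path = np.array([[1.0, 0.0],   # hop 1
                 [0.0, 1.0],   # hop 2
                 [1.0, 1.0]])  # hop 3
q = np.array([1.0, 0.0])       # toy query favoring the first dimension
vec = attention_pool(path, q)
```

Hops aligned with the query (hops 1 and 3 here) dominate the pooled vector, mirroring how the HAN selects important relations along a path.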
GANzzle: Reframing jigsaw puzzle solving as a retrieval task using a generative mental image
Puzzle solving is a combinatorial challenge due to the difficulty of matching
adjacent pieces. Instead, we infer a mental image from all pieces, against
which a given piece can then be matched, avoiding the combinatorial explosion.
Exploiting advancements in Generative Adversarial methods, we learn how to
reconstruct the image given a set of unordered pieces, allowing the model to
learn a joint embedding space to match an encoding of each piece to the cropped
layer of the generator. Therefore, we frame the problem as an R@1 retrieval
task, and then solve the linear assignment using differentiable Hungarian
attention, making the process end-to-end. In doing so, our model is puzzle-size
agnostic, in contrast to prior deep learning methods, which handle a single
size. We evaluate on two new large-scale datasets, where our model is on par
with deep learning methods, while generalizing to multiple puzzle sizes.
Comment: Accepted at International Conference on Image Processing (ICIP22
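The final assignment step can be illustrated with the standard (non-differentiable) Hungarian algorithm: given a piece-to-location similarity matrix like the one the joint embedding would produce, find the one-to-one assignment that maximizes total similarity. The matrix values here are toy data; the paper itself uses a differentiable Hungarian attention rather than scipy's solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy 3x3 piece-to-location similarity matrix.
similarity = np.array([[0.9, 0.1, 0.0],
                       [0.2, 0.8, 0.1],
                       [0.0, 0.3, 0.7]])

# linear_sum_assignment minimizes cost, so negate to maximize similarity.
rows, cols = linear_sum_assignment(-similarity)
assignment = dict(zip(rows.tolist(), cols.tolist()))  # piece -> location
```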
On Aggregation of Unsupervised Deep Binary Descriptor with Weak Bits
Despite the thrilling success achieved by existing binary descriptors, most of
them are still mired in three limitations: 1) vulnerability to geometric
transformations; 2) inability to preserve the manifold structure when learning
binary codes; 3) no guarantee of finding the true match if multiple candidates
happen to have the same Hamming distance to a given query. Together, these
limitations make binary descriptors less effective in large-scale visual
recognition tasks. In this paper, we propose a novel learning-based feature
descriptor, namely Unsupervised Deep Binary Descriptor (UDBD), which learns
transformation-invariant binary descriptors by projecting the original data and
their transformed sets into a joint binary space. Moreover, we introduce an
ℓ2,1-norm loss term into the binary embedding process to simultaneously gain
robustness against data noise and reduce the probability of mistakenly flipping
bits of the binary descriptor; on top of this, a graph constraint is used to
preserve the original manifold structure in the binary space. Furthermore, a
weak-bit mechanism is adopted to find the real match among candidates sharing
the same minimum Hamming distance, thus enhancing matching performance.
Extensive experimental results on public datasets show the superiority of UDBD
over state-of-the-art methods in terms of matching and retrieval accuracy.
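The weak-bit idea described above can be sketched as a tie-breaking rule: when several database codes tie at the minimum Hamming distance to a query, re-rank the tied candidates while ignoring the query's unreliable ("weak") bits. Names and data below are illustrative, not UDBD's exact formulation:

```python
import numpy as np

def hamming(a, b):
    """Number of positions where two binary codes differ."""
    return int(np.count_nonzero(a != b))

def match_with_weak_bits(query, weak_mask, database):
    """weak_mask[i] is True where the query's i-th bit is unreliable
    (e.g. its pre-binarization magnitude was close to the threshold)."""
    dists = [hamming(query, code) for code in database]
    best = min(dists)
    tied = [i for i, d in enumerate(dists) if d == best]
    if len(tied) == 1:
        return tied[0]
    strong = ~weak_mask  # re-rank ties using only reliable bits
    return min(tied, key=lambda i: hamming(query[strong], database[i][strong]))

db = np.array([[1, 0, 1, 1],
               [1, 1, 1, 0]], dtype=bool)
q = np.array([1, 1, 1, 1], dtype=bool)
weak = np.array([0, 0, 0, 1], dtype=bool)  # last query bit is unreliable
# Both codes are at Hamming distance 1; the weak-bit rule picks db[1],
# which matches the query exactly on the reliable bits.
```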