AMC: Attention guided Multi-modal Correlation Learning for Image Search
Given a user's query, traditional image search systems rank images according
to their relevance to a single modality (e.g., image content or surrounding
text). Nowadays, an increasing number of images on the Internet are available
with associated metadata in rich modalities (e.g., titles, keywords, tags,
etc.), which can be exploited for a better similarity measure with queries. In
this paper, we leverage visual and textual modalities for image search by
learning their correlation with the input query. According to the query's intent,
an attention mechanism can be introduced to adaptively balance the importance of
different modalities. We propose a novel Attention guided Multi-modal
Correlation (AMC) learning method that consists of a jointly learned hierarchy
of intra- and inter-attention networks. Conditioned on the query's intent,
intra-attention networks (i.e., a visual intra-attention network and a language
intra-attention network) attend to informative parts within each modality; a
multi-modal inter-attention network promotes the importance of the most
query-relevant modalities. In experiments, we evaluate AMC models on the search
logs of two real-world image search engines and show a significant boost in
the ranking of user-clicked images in search results. Additionally, we extend
AMC models to the caption ranking task on the COCO dataset and achieve competitive
results compared with recent state-of-the-art methods.
Comment: CVPR 2017
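
To make the intra/inter-attention hierarchy concrete, the following is a minimal PyTorch sketch of query-conditioned attention fusion in the same spirit: an intra-attention module summarizes the items of one modality conditioned on the query, and an inter-attention module weights the modality summaries. The module names, dimensions, and the additive scoring function are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraAttention(nn.Module):
    # Attends over the items of one modality (e.g., image regions or keyword
    # embeddings), conditioned on the query, and returns one modality vector.
    def __init__(self, item_dim, query_dim, hidden_dim=256):
        super().__init__()
        self.proj_item = nn.Linear(item_dim, hidden_dim)
        self.proj_query = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, items, query):
        # items: (B, N, item_dim), query: (B, query_dim)
        h = torch.tanh(self.proj_item(items) + self.proj_query(query).unsqueeze(1))
        alpha = F.softmax(self.score(h), dim=1)      # (B, N, 1) weights over items
        return (alpha * items).sum(dim=1)            # (B, item_dim) attended summary

class InterAttention(nn.Module):
    # Weights the per-modality summaries (visual vs. textual) by the query's intent.
    def __init__(self, dims, query_dim, embed_dim=256):
        super().__init__()
        self.to_embed = nn.ModuleList([nn.Linear(d, embed_dim) for d in dims])
        self.gate = nn.Linear(query_dim, len(dims))

    def forward(self, modality_vecs, query):
        # modality_vecs: list of (B, d_i) summaries, query: (B, query_dim)
        w = F.softmax(self.gate(query), dim=-1)      # (B, M) modality importance
        embedded = torch.stack([f(v) for f, v in zip(self.to_embed, modality_vecs)], dim=1)
        return (w.unsqueeze(-1) * embedded).sum(dim=1)   # (B, embed_dim) fused vector

Ranking would then score each image by the similarity (e.g., cosine) between this fused vector and a query embedding.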
Visual to Sound: Generating Natural Sound for Videos in the Wild
As two of the five traditional human senses (sight, hearing, taste, smell,
and touch), vision and sound are basic sources through which humans understand
the world. Often correlated during natural events, these two modalities combine
to jointly affect human perception. In this paper, we pose the task of
generating sound given visual input. Such capabilities could help enable
applications in virtual reality (generating sound for virtual scenes
automatically) or provide additional accessibility to images or videos for
people with visual impairments. As a first step in this direction, we apply
learning-based methods to generate raw waveform samples given input video
frames. We evaluate our models on a dataset of videos containing a variety of
sounds (such as ambient sounds and sounds from people/animals). Our experiments
show that the generated sounds are fairly realistic and have good temporal
synchronization with the visual inputs.
Comment: Project page: http://bvision11.cs.unc.edu/bigpen/yipin/visual2sound_webpage/visual2sound.htm
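
As a rough illustration of the setup (video frame features in, raw waveform samples out), here is a minimal PyTorch sketch; the GRU decoder, the layer sizes, and the samples-per-frame value are assumptions for illustration, not the authors' model.

import torch
import torch.nn as nn

class Frames2Waveform(nn.Module):
    def __init__(self, frame_feat_dim=512, hidden_dim=256, samples_per_frame=735):
        # samples_per_frame ~ 735 corresponds to 22.05 kHz audio at 30 fps (assumption).
        super().__init__()
        self.rnn = nn.GRU(frame_feat_dim, hidden_dim, batch_first=True)
        self.to_wave = nn.Linear(hidden_dim, samples_per_frame)

    def forward(self, frame_feats):
        # frame_feats: (B, T, frame_feat_dim), one visual feature per video frame
        h, _ = self.rnn(frame_feats)                 # (B, T, hidden_dim)
        chunks = torch.tanh(self.to_wave(h))         # (B, T, samples_per_frame) in [-1, 1]
        return chunks.reshape(chunks.size(0), -1)    # (B, T * samples_per_frame) raw waveform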
Speeding up Context-based Sentence Representation Learning with Non-autoregressive Convolutional Decoding
Context plays an important role in human language understanding, so it may
also be useful for machines that learn vector representations of language. In
this paper, we explore an asymmetric encoder-decoder structure for unsupervised
context-based sentence representation learning. We carefully design
experiments to show that neither an autoregressive decoder nor an RNN decoder
is required. We then design a model that keeps an RNN as the
encoder while using a non-autoregressive convolutional decoder. We further
combine a suite of effective designs to significantly improve model efficiency
while also achieving better performance. Our model is trained on two different
large unlabelled corpora, and in both cases the transferability is evaluated on
a set of downstream NLP tasks. We empirically show that our model is simple and
fast while producing rich sentence representations that excel in downstream
tasks.
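
The asymmetry can be sketched as follows: an RNN encoder produces the sentence representation, and a convolutional decoder predicts the word embeddings of the following sentence for all positions in parallel (non-autoregressively). The dimensions, number of convolutional layers, and fixed target length below are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class RNNEncoderConvDecoder(nn.Module):
    def __init__(self, emb_dim=300, hidden_dim=600, target_len=30):
        super().__init__()
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.target_len = target_len
        # Non-autoregressive: all target positions are predicted at once.
        self.decoder = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, emb_dim, kernel_size=3, padding=1),
        )

    def forward(self, src_word_embs):
        # src_word_embs: (B, L, emb_dim) embeddings of the current sentence
        _, h = self.encoder(src_word_embs)            # h: (1, B, hidden_dim)
        sent = h.squeeze(0)                           # (B, hidden_dim) sentence vector
        # Tile the sentence vector over the target positions and decode in one shot.
        tiled = sent.unsqueeze(-1).expand(-1, -1, self.target_len).contiguous()
        pred = self.decoder(tiled)                    # (B, emb_dim, target_len)
        return sent, pred.transpose(1, 2)             # representation + predicted embeddings

Training would then compare the predicted embeddings against the next sentence's word embeddings, and only the encoder is kept for the downstream transfer tasks.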
Rethinking Skip-thought: A Neighborhood based Approach
We study the skip-thought model with neighborhood information as weak
supervision. More specifically, we propose a skip-thought neighbor model that
considers adjacent sentences as a neighborhood. We train our skip-thought
neighbor model on a large corpus with continuous sentences, and then evaluate
the trained model on 7 tasks, which include semantic relatedness, paraphrase
detection, and classification benchmarks. Both quantitative comparison and
qualitative investigation are conducted. We empirically show that our
skip-thought neighbor model performs as well as the skip-thought model on
evaluation tasks. In addition, we find that incorporating an autoencoder path
into our model does not help it perform better, while it hurts the
performance of the skip-thought model.
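
For context, a skip-thought-style model conditions a decoder on the encoded sentence to reconstruct an adjacent (neighbor) sentence; the sketch below uses a single shared GRU decoder and arbitrary sizes as simplifying assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class SkipThoughtNeighbor(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=300, hidden_dim=600):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sentence, neighbor):
        # sentence, neighbor: (B, L) token ids; neighbor is an adjacent sentence.
        _, h = self.encoder(self.emb(sentence))            # (1, B, hidden_dim) sentence code
        dec_out, _ = self.decoder(self.emb(neighbor), h)   # decoder conditioned on the code
        return self.out(dec_out)                           # (B, L, vocab_size) logits

Training applies cross-entropy between these logits and the neighbor's (shifted) tokens; after training, the encoder alone provides the sentence representations used on the evaluation tasks.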
Diversified Texture Synthesis with Feed-forward Networks
Recent progress in deep discriminative and generative modeling has shown
promising results on texture synthesis. However, existing feed-forward
methods trade off generality for efficiency and suffer from several issues,
such as limited generality (i.e., one network must be built per texture), lack of
diversity (i.e., they always produce visually identical output), and suboptimality
(i.e., they generate less satisfying visual effects). In this work, we focus on
solving these issues for improved texture synthesis. We propose a deep
generative feed-forward network which enables efficient synthesis of multiple
textures within one single network and meaningful interpolation between them.
Meanwhile, a suite of important techniques is introduced to achieve better
convergence and diversity. With extensive experiments, we demonstrate the
effectiveness of the proposed model and techniques for synthesizing a large
number of textures and show its applications in stylization.
Comment: accepted by CVPR 2017
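
The single-network, multi-texture idea can be illustrated with a toy feed-forward generator that conditions on a texture selection vector; a one-hot selection picks one texture, while a soft mixture interpolates between textures. The architecture and sizes below are assumptions for illustration, not the paper's network.

import torch
import torch.nn as nn

class MultiTextureGenerator(nn.Module):
    def __init__(self, num_textures=10, code_dim=64):
        super().__init__()
        self.texture_codes = nn.Embedding(num_textures, code_dim)  # one learned code per texture
        self.net = nn.Sequential(
            nn.Conv2d(3 + code_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1), nn.Tanh(),
        )

    def forward(self, noise_img, selection):
        # noise_img: (B, 3, H, W) noise input; selection: (B, num_textures) weights
        # (one-hot to select a texture, or a soft mixture to interpolate between two).
        code = selection @ self.texture_codes.weight                 # (B, code_dim)
        B, _, H, W = noise_img.shape
        code_map = code.view(B, -1, 1, 1).expand(B, code.size(1), H, W)
        return self.net(torch.cat([noise_img, code_map], dim=1))    # (B, 3, H, W) synthesized texture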