Task-adaptive Asymmetric Deep Cross-modal Hashing
Supervised cross-modal hashing aims to embed the semantic correlations of
heterogeneous modality data into binary hash codes with discriminative
semantic labels. Because of its advantages in retrieval and storage efficiency,
it is widely used for efficient cross-modal retrieval. However,
existing studies handle the different tasks of cross-modal retrieval equally,
simply learning the same pair of hash functions for them in a symmetric way.
Under such circumstances, the uniqueness of each cross-modal
retrieval task is ignored, which may lead to sub-optimal performance.
Motivated by this, we present a Task-adaptive Asymmetric Deep Cross-modal
Hashing (TA-ADCMH) method in this paper. It learns task-adaptive hash
functions for the two sub-retrieval tasks via simultaneous modality representation
and asymmetric hash learning. Unlike previous cross-modal hashing approaches,
our learning framework jointly optimizes semantic preservation, which transforms
deep features of multimedia data into binary hash codes, and semantic
regression, which directly regresses the query modality representation to an
explicit label. With our model, the binary codes effectively preserve the semantic
correlations across different modalities while adaptively capturing the
query semantics. The superiority of TA-ADCMH is demonstrated from many aspects
on two standard datasets.
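To make the joint objective concrete, the sketch below shows, in PyTorch, how a semantic-preservation term and a semantic-regression term of this kind can be combined asymmetrically. All module names (QueryNet, asymmetric_hash_loss), dimensions, and loss weights are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryNet(nn.Module):
    """Maps query-modality features (e.g., images) to relaxed hash codes."""
    def __init__(self, in_dim, code_len):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                nn.Linear(512, code_len))

    def forward(self, x):
        return torch.tanh(self.fc(x))  # relaxed binary codes in (-1, 1)

def asymmetric_hash_loss(u, v, B, labels, W, alpha=1.0, beta=1.0):
    # u: relaxed codes of the query modality, shape (batch, code_len)
    # v: relaxed codes of the database modality, shape (batch, code_len)
    # B: binary database codes in {-1, +1}, shape (batch, code_len)
    # labels: one-hot semantic labels, shape (batch, n_classes)
    # W: regression matrix (hypothetical), shape (code_len, n_classes)
    S = (labels @ labels.t() > 0).float()   # pairwise semantic similarity
    code_len = u.size(1)
    # semantic preservation: inner products between relaxed query codes and
    # binary database codes should agree with the similarity matrix
    preserve = F.mse_loss(u @ B.t() / code_len, S)
    # semantic regression: the query representation is regressed to labels
    regress = F.mse_loss(u @ W, labels)
    # quantization: keep relaxed codes close to their binary counterparts
    quant = F.mse_loss(v, B)
    return preserve + alpha * regress + beta * quant

u = QueryNet(4096, 32)(torch.randn(8, 4096))
v = torch.tanh(torch.randn(8, 32))           # stand-in for the other modality
B = torch.sign(torch.randn(8, 32))
labels = F.one_hot(torch.randint(0, 10, (8,)), num_classes=10).float()
W = torch.randn(32, 10)
print(asymmetric_hash_loss(u, v, B, labels, W))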
Audio Description from Image by Modal Translation Network
Audio is the main medium through which visually impaired people obtain information. In
reality, visual data of all kinds is ubiquitous, while the corresponding audio
data is often missing. To help visually impaired people better perceive
the information around them, an image-to-audio-description (I2AD) task is
proposed in this paper to generate audio descriptions from images. To tackle
this entirely new task, a modal translation network (MT-Net) from the visual to
the auditory sense is proposed. The proposed MT-Net comprises three progressive
sub-networks: 1) feature learning, 2) cross-modal mapping, and 3) audio
generation. First, the feature learning sub-network aims to learn semantic
features from image and audio, including image feature learning and audio
feature learning. Second, the cross-modal mapping sub-network transforms the
image feature into a cross-modal representation with the same semantic concept
as the audio feature. In this way, the correlation between inter-modal data is
effectively mined, easing the heterogeneous gap between image and audio.
Finally, the audio generation sub-network is designed to generate the audio
waveform from the cross-modal representation. The generated waveform is then
interpolated according to the sampling frequency to obtain the corresponding
audio file. As this is the first attempt to explore the I2AD task, three
large-scale datasets with abundant manual audio descriptions are built.
Experiments on these datasets verify the feasibility of generating intelligible
audio directly from an image and the effectiveness of the proposed method.
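As a rough illustration of the three progressive stages, the following PyTorch skeleton chains an image encoder, a cross-modal mapper, and a waveform generator. Layer sizes, module names, and the 16000-sample output length are assumptions for exposition, not MT-Net's actual architecture.

import torch
import torch.nn as nn

class ImageEncoder(nn.Module):          # stage 1: image feature learning
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, img):
        return self.fc(self.conv(img).flatten(1))

class CrossModalMapper(nn.Module):      # stage 2: image feature -> shared space
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, img_feat):
        return self.mlp(img_feat)       # trained to match audio features

class AudioGenerator(nn.Module):        # stage 3: representation -> waveform
    def __init__(self, feat_dim=256, n_samples=16000):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU(),
                                nn.Linear(1024, n_samples), nn.Tanh())

    def forward(self, z):
        return self.fc(z)               # waveform samples in [-1, 1]

img = torch.randn(1, 3, 128, 128)
wave = AudioGenerator()(CrossModalMapper()(ImageEncoder()(img)))
print(wave.shape)                       # torch.Size([1, 16000])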
Survey on Deep Multi-modal Data Analytics: Collaboration, Rivalry and Fusion
With the development of web technology, multi-modal or multi-view data has
surged as a major form of big data, where each modality/view encodes an individual
property of the data objects. Often, different modalities are complementary to one
another. This fact has motivated a great deal of research on fusing the
multi-modal feature spaces to comprehensively characterize the data objects.
Most existing state-of-the-art methods focus on how to fuse the energy or
information from multi-modal spaces to deliver performance superior to
their single-modal counterparts. Recently, deep neural networks have
emerged as a powerful architecture for capturing the nonlinear distribution
of high-dimensional multimedia data, and naturally for multi-modal data as well.
Substantial empirical studies have been carried out to demonstrate the advantages
of deep multi-modal methods, which can essentially deepen
the fusion of multi-modal deep feature spaces. In this paper, we provide a
substantial overview of the existing state of the art in the field of
multi-modal data analytics from shallow to deep spaces. Throughout this survey,
we further indicate that the critical components of this field are
collaboration, adversarial competition, and fusion over multi-modal spaces.
Finally, we share our viewpoints regarding some future directions in this
field.
Comment: Appearing at ACM TOMM, 26 pages
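As a small illustration of the fusion theme surveyed above, the sketch below projects two modal feature spaces into a shared space and blends them with a learned gate. The class name GatedFusion and all dimensions are hypothetical; the survey covers many fusion strategies beyond this simple one.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Projects two modal feature spaces to a shared space and blends them."""
    def __init__(self, dim_a, dim_b, dim_shared=128):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_shared)
        self.proj_b = nn.Linear(dim_b, dim_shared)
        self.gate = nn.Linear(2 * dim_shared, dim_shared)

    def forward(self, feat_a, feat_b):
        a, b = self.proj_a(feat_a), self.proj_b(feat_b)
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1 - g) * b      # learned per-dimension blend

# e.g., 512-d image features fused with 300-d text features
fused = GatedFusion(512, 300)(torch.randn(4, 512), torch.randn(4, 300))
print(fused.shape)                      # torch.Size([4, 128])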