2,115 research outputs found
Supervised cross-modal factor analysis for multiple modal data classification
In this paper we study the problem of learning from multiple modal data for
purpose of document classification. In this problem, each document is composed
two different modals of data, i.e., an image and a text. Cross-modal factor
analysis (CFA) has been proposed to project the two different modals of data to
a shared data space, so that the classification of a image or a text can be
performed directly in this space. A disadvantage of CFA is that it has ignored
the supervision information. In this paper, we improve CFA by incorporating the
supervision information to represent and classify both image and text modals of
documents. We project both image and text data to a shared data space by factor
analysis, and then train a class label predictor in the shared space to use the
class label information. The factor analysis parameter and the predictor
parameter are learned jointly by solving one single objective function. With
this objective function, we minimize the distance between the projections of
image and text of the same document, and the classification error of the
projection measured by hinge loss function. The objective function is optimized
by an alternate optimization strategy in an iterative algorithm. Experiments in
two different multiple modal document data sets show the advantage of the
proposed algorithm over other CFA methods
AspectMMKG: A Multi-modal Knowledge Graph with Aspect-aware Entities
Multi-modal knowledge graphs (MMKGs) combine different modal data (e.g., text
and image) for a comprehensive understanding of entities. Despite the recent
progress of large-scale MMKGs, existing MMKGs neglect the multi-aspect nature
of entities, limiting the ability to comprehend entities from various
perspectives. In this paper, we construct AspectMMKG, the first MMKG with
aspect-related images by matching images to different entity aspects.
Specifically, we collect aspect-related images from a knowledge base, and
further extract aspect-related sentences from the knowledge base as queries to
retrieve a large number of aspect-related images via an online image search
engine. Finally, AspectMMKG contains 2,380 entities, 18,139 entity aspects, and
645,383 aspect-related images. We demonstrate the usability of AspectMMKG in
entity aspect linking (EAL) downstream task and show that previous EAL models
achieve a new state-of-the-art performance with the help of AspectMMKG. To
facilitate the research on aspect-related MMKG, we further propose an
aspect-related image retrieval (AIR) model, that aims to correct and expand
aspect-related images in AspectMMKG. We train an AIR model to learn the
relationship between entity image and entity aspect-related images by
incorporating entity image, aspect, and aspect image information. Experimental
results indicate that the AIR model could retrieve suitable images for a given
entity w.r.t different aspects.Comment: Accepted by CIKM 202
Joint Intermodal and Intramodal Label Transfers for Extremely Rare or Unseen Classes
In this paper, we present a label transfer model from texts to images for
image classification tasks. The problem of image classification is often much
more challenging than text classification. On one hand, labeled text data is
more widely available than the labeled images for classification tasks. On the
other hand, text data tends to have natural semantic interpretability, and they
are often more directly related to class labels. On the contrary, the image
features are not directly related to concepts inherent in class labels. One of
our goals in this paper is to develop a model for revealing the functional
relationships between text and image features as to directly transfer
intermodal and intramodal labels to annotate the images. This is implemented by
learning a transfer function as a bridge to propagate the labels between two
multimodal spaces. However, the intermodal label transfers could be undermined
by blindly transferring the labels of noisy texts to annotate images. To
mitigate this problem, we present an intramodal label transfer process, which
complements the intermodal label transfer by transferring the image labels
instead when relevant text is absent from the source corpus. In addition, we
generalize the inter-modal label transfer to zero-shot learning scenario where
there are only text examples available to label unseen classes of images
without any positive image examples. We evaluate our algorithm on an image
classification task and show the effectiveness with respect to the other
compared algorithms.Comment: The paper has been accepted by IEEE Transactions on Pattern Analysis
and Machine Intelligence. It will apear in a future issu
Visual Named Entity Linking: A New Dataset and A Baseline
Visual Entity Linking (VEL) is a task to link regions of images with their
corresponding entities in Knowledge Bases (KBs), which is beneficial for many
computer vision tasks such as image retrieval, image caption, and visual
question answering. While existing tasks in VEL either rely on textual data to
complement a multi-modal linking or only link objects with general entities,
which fails to perform named entity linking on large amounts of image data. In
this paper, we consider a purely Visual-based Named Entity Linking (VNEL) task,
where the input only consists of an image. The task is to identify objects of
interest (i.e., visual entity mentions) in images and link them to
corresponding named entities in KBs. Since each entity often contains rich
visual and textual information in KBs, we thus propose three different
sub-tasks, i.e., visual to visual entity linking (V2VEL), visual to textual
entity linking (V2TEL), and visual to visual-textual entity linking (V2VTEL).
In addition, we present a high-quality human-annotated visual person linking
dataset, named WIKIPerson. Based on WIKIPerson, we establish a series of
baseline algorithms for the solution of each sub-task, and conduct experiments
to verify the quality of proposed datasets and the effectiveness of baseline
methods. We envision this work to be helpful for soliciting more works
regarding VNEL in the future. The codes and datasets are publicly available at
https://github.com/ict-bigdatalab/VNEL.Comment: 13 pages, 11 figures, published to EMNLP 2022(findings
Neural Cross-Lingual Entity Linking
A major challenge in Entity Linking (EL) is making effective use of
contextual information to disambiguate mentions to Wikipedia that might refer
to different entities in different contexts. The problem exacerbates with
cross-lingual EL which involves linking mentions written in non-English
documents to entries in the English Wikipedia: to compare textual clues across
languages we need to compute similarity between textual fragments across
languages. In this paper, we propose a neural EL model that trains fine-grained
similarities and dissimilarities between the query and candidate document from
multiple perspectives, combined with convolution and tensor networks. Further,
we show that this English-trained system can be applied, in zero-shot learning,
to other languages by making surprisingly effective use of multi-lingual
embeddings. The proposed system has strong empirical evidence yielding
state-of-the-art results in English as well as cross-lingual: Spanish and
Chinese TAC 2015 datasets.Comment: Association for the Advancement of Artificial Intelligence (AAAI),
201
- …