Multimodal Classification of Urban Micro-Events
In this paper we seek methods to effectively detect urban micro-events. Urban
micro-events are events that occur in cities, have limited geographical
coverage, and typically affect only a small group of citizens. Because of their
small scale, they are difficult to identify in most data sources. However, by
using citizen sensing to gather data, detecting them becomes feasible. The data
gathered by citizen sensing is often multimodal and, as a consequence, the
information required to detect urban micro-events is distributed over multiple
modalities. This makes it essential to have a classifier capable of combining
them. In this paper we explore several methods of creating such a classifier,
including early, late, and hybrid fusion, as well as representation learning
using multimodal graphs. We evaluate performance on a real-world dataset
obtained from a live citizen reporting system. We show that a multimodal
approach yields higher performance than unimodal alternatives. Furthermore, we
demonstrate that our hybrid combination of early and late fusion with
multimodal embeddings performs best in classifying urban micro-events.
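As a rough illustration of the fusion strategies this abstract contrasts, the following minimal PyTorch sketch shows early fusion (concatenating modality features before classification) versus late fusion (combining per-modality predictions). The dimensions and layer sizes (TEXT_DIM, IMG_DIM, N_CLASSES) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of early vs. late fusion for two modalities (PyTorch).
# Dimensions and layer sizes are illustrative, not taken from the paper.
import torch
import torch.nn as nn

TEXT_DIM, IMG_DIM, N_CLASSES = 300, 512, 10

class EarlyFusion(nn.Module):
    """Concatenate modality features, then classify the joint vector."""
    def __init__(self):
        super().__init__()
        self.clf = nn.Sequential(
            nn.Linear(TEXT_DIM + IMG_DIM, 128), nn.ReLU(),
            nn.Linear(128, N_CLASSES))

    def forward(self, text_feat, img_feat):
        return self.clf(torch.cat([text_feat, img_feat], dim=-1))

class LateFusion(nn.Module):
    """Classify each modality separately, then average the logits."""
    def __init__(self):
        super().__init__()
        self.text_clf = nn.Linear(TEXT_DIM, N_CLASSES)
        self.img_clf = nn.Linear(IMG_DIM, N_CLASSES)

    def forward(self, text_feat, img_feat):
        return 0.5 * (self.text_clf(text_feat) + self.img_clf(img_feat))
```

A hybrid scheme of the kind the abstract describes would combine both signals, e.g. feeding the joint vector and the per-modality logits into a final classifier.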
Adversarial Multimodal Representation Learning for Click-Through Rate Prediction
For better user experience and business effectiveness, Click-Through Rate
(CTR) prediction has been one of the most important tasks in E-commerce.
Although extensive CTR prediction models have been proposed, learning good
representations of items from multimodal features remains less investigated,
considering that an item in E-commerce usually contains multiple heterogeneous
modalities. Previous works either concatenate the multiple modality features,
which is equivalent to giving a fixed importance weight to each modality, or
learn dynamic weights of different modalities for different items through
techniques such as attention mechanisms. However, there usually exists common
redundant information across multiple modalities, and dynamic weights computed
from this redundant information may not correctly reflect the true importance
of each modality. To address this, we explore the complementarity and
redundancy of modalities by treating modality-specific and modality-invariant
features differently. We propose a novel Multimodal Adversarial Representation
Network (MARN) for the CTR prediction task. A multimodal attention network
first calculates the weights of the multiple modalities for each item according
to its modality-specific features. Then a multimodal adversarial network learns
modality-invariant representations, for which a double-discriminator strategy
is introduced. Finally, we obtain the multimodal item representations by
combining both modality-specific and modality-invariant representations. We
conduct extensive experiments on both public and industrial datasets, and the
proposed method consistently achieves remarkable improvements over
state-of-the-art methods. Moreover, the approach has been deployed in an
operational E-commerce system, where online A/B testing further demonstrates
its effectiveness.
Comment: Accepted to WWW 2020, 10 pages.
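To make the two ingredients concrete, the sketch below shows (a) per-item attention weights computed from modality-specific features and (b) adversarial learning of modality-invariant features via gradient reversal against a modality discriminator. All names and dimensions are illustrative assumptions; gradient reversal is used here as a common stand-in for adversarial training and is not the paper's exact double-discriminator strategy.

```python
# Hedged sketch of two MARN-style components (PyTorch); illustrative only.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips gradients in the backward pass,
    so the encoder learns to fool the modality discriminator."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad

class ModalityAttention(nn.Module):
    """Score each modality from its own (modality-specific) features."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                 # feats: (batch, n_modalities, dim)
        w = torch.softmax(self.score(feats), dim=1)   # per-item modality weights
        return (w * feats).sum(dim=1)          # weighted modality-specific part

def invariance_logits(discriminator, feats):
    """Discriminator guesses the source modality; reversed gradients push
    the upstream encoder toward modality-invariant features."""
    return discriminator(GradReverse.apply(feats))
```

The final item representation would then combine the attention-weighted modality-specific part with the adversarially trained modality-invariant part, as the abstract describes.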
Temporal Cross-Media Retrieval with Soft-Smoothing
Multimedia information has strong temporal correlations that shape the way
modalities co-occur over time. In this paper we study the dynamic nature of
multimedia and social-media information, where the temporal dimension emerges
as a strong source of evidence for learning the temporal correlations across
visual and textual modalities. So far, cross-media retrieval models have
explored the correlations between different modalities (e.g. text and image) to
learn a common subspace in which semantically similar instances lie in the same
neighbourhood. Building on such knowledge, we propose a novel temporal
cross-media neural architecture that departs from standard cross-media methods
by explicitly accounting for the temporal dimension through temporal subspace
learning. The model is softly constrained with temporal and inter-modality
constraints that guide the new subspace learning task by favouring temporal
correlations between semantically similar and temporally close instances.
Experiments on three distinct datasets show that accounting for time is
important for cross-media retrieval: the proposed method outperforms a set of
baselines on the task of temporal cross-media retrieval, demonstrating its
effectiveness for performing temporal
subspace learning.
Comment: To appear in ACM MM 2018.
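One way to picture such a soft temporal constraint is a pairwise loss in which semantically similar cross-modal pairs are pulled together in the subspace more strongly the closer they are in time. The Gaussian decay and the sigma parameter below are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a temporally weighted pairwise constraint (PyTorch).
# The Gaussian temporal weighting is an assumed form, not the paper's.
import torch

def temporal_soft_loss(emb_a, emb_b, t_a, t_b, sigma=1.0):
    """emb_a, emb_b: (batch, dim) embeddings of matching cross-modal pairs;
    t_a, t_b: (batch,) timestamps. Returns a scalar loss."""
    dist = (emb_a - emb_b).pow(2).sum(dim=-1)              # subspace distance
    w = torch.exp(-(t_a - t_b).pow(2) / (2 * sigma ** 2))  # temporal proximity
    return (w * dist).mean()                               # soft, not hard, constraint
```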