Deep Multimodal Speaker Naming
Automatic speaker naming is the problem of localizing as well as identifying
each speaking character in a TV/movie/live show video. This problem is
challenging mainly because of its multimodal nature: the face cue alone is
insufficient to achieve good performance. Previous multimodal approaches to
this problem usually process the data of different modalities individually and
merge them using handcrafted heuristics. Such approaches work well for simple
scenes, but fail to achieve high performance for speakers with large appearance
variations. In this paper, we propose a novel convolutional neural network
(CNN) based learning framework to automatically learn the fusion function of
both face and audio cues. We show that, without using face tracking, facial
landmark localization, or subtitles/transcripts, our system with robust
multimodal feature extraction achieves state-of-the-art speaker naming
performance when evaluated on two diverse TV series. The dataset and
implementation of our algorithm are publicly available online.
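As a rough illustration of the learned fusion described above, the PyTorch
sketch below processes a face crop and a fixed-length audio descriptor in
separate branches and learns the fusion jointly with the speaker classifier.
The layer sizes, the 64x64 input resolution, and the audio feature dimension
are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch of a face+audio fusion network for speaker naming.
import torch
import torch.nn as nn

class FaceAudioFusionNet(nn.Module):
    def __init__(self, num_speakers: int, audio_dim: int = 375):
        super().__init__()
        # Face branch: small CNN over cropped faces (assumed 64x64 RGB).
        self.face_branch = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
        )
        # Audio branch: MLP over a fixed-length descriptor (e.g. MFCCs).
        self.audio_branch = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(),
        )
        # The fusion function is learned together with the classifier
        # rather than handcrafted, which is the key point of the abstract.
        self.classifier = nn.Sequential(
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, num_speakers),
        )

    def forward(self, face: torch.Tensor, audio: torch.Tensor):
        fused = torch.cat([self.face_branch(face),
                           self.audio_branch(audio)], dim=1)
        return self.classifier(fused)  # per-speaker logits
```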
Framework for constructing multimodal transport networks and routing using a graph database: A case study in London
Most prior multimodal transport networks have been organized as relational databases with multilayer structures to support transport management and routing; however, their strict schemas make such databases hard to extend and slow to update with new networks and timetables. This study aimed to develop multimodal transport networks using a graph database, which offers efficient updates and extensions, high performance on relation-based queries, and flexible integration for multimodal routing. As a case study, a database was constructed for London transport networks, and routing tests were performed under various conditions. The constructed multimodal graph database showed stable performance in processing iterative queries, and multi-stop routing in particular was made more efficient. By applying the proposed framework, databases for multimodal routing can be readily constructed for other regions, while the flexible schema of the graph database enables responses to diversified routing requests, such as personalized routing through integration with various unstructured information.
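The abstract does not specify the engine, but a property-graph database such
as Neo4j matches the description. The sketch below shows what a multimodal
route query could look like under an assumed schema in which stops are
(:Stop) nodes linked by [:CONNECTS] relationships that carry a transport
mode; the node labels, property names, and station names are hypothetical.

```python
# Hedged sketch: multimodal routing over an assumed Neo4j schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

# Find a shortest multimodal path between two named stops and report the
# transport mode used on each leg.
QUERY = """
MATCH (a:Stop {name: $origin}), (b:Stop {name: $destination}),
      p = shortestPath((a)-[:CONNECTS*..30]-(b))
RETURN [n IN nodes(p) | n.name]         AS stops,
       [r IN relationships(p) | r.mode] AS modes
"""

with driver.session() as session:
    record = session.run(QUERY, origin="Paddington",
                         destination="Bank").single()
    if record:
        print(record["stops"])  # ordered stops along the route
        print(record["modes"])  # mode of each leg (tube, bus, rail, ...)

driver.close()
```

Because relationships are first-class in the graph model, adding a new mode
or timetable amounts to inserting nodes and edges rather than migrating a
rigid relational schema, which is the expandability argument made above.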
Multimodal Network Alignment
A multimodal network encodes relationships between the same set of nodes in
multiple settings, and network alignment is a powerful tool for transferring
information and insight between a pair of networks. We propose a method for
multimodal network alignment that computes a matrix which indicates the
alignment, but produces the result as a low-rank factorization directly. We
then propose new methods to compute approximate maximum weight matchings of
low-rank matrices to produce an alignment. We evaluate our approach by
applying it to synthetic networks and by using it to de-anonymize a
multimodal transportation
network.

Comment: 14 pages, 6 figures, SIAM Data Mining 201
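To make the matching step concrete, here is a hedged NumPy sketch of a
simple greedy approximation that materializes one row of X ≈ UVᵀ at a time,
so the full m × n alignment matrix is never formed. The greedy rule is an
illustrative stand-in; the paper proposes its own approximation algorithms
for maximum weight matchings of low-rank matrices.

```python
# Greedy approximate matching from a low-rank factorization X ≈ U @ V.T.
import numpy as np

def greedy_low_rank_matching(U: np.ndarray, V: np.ndarray):
    """U: (m, k), V: (n, k); returns a list of matched (i, j) pairs."""
    m = U.shape[0]
    candidates = []
    for i in range(m):
        row = U[i] @ V.T          # i-th row of X, computed on the fly
        j = int(np.argmax(row))   # best column for this row
        candidates.append((row[j], i, j))
    candidates.sort(reverse=True)  # commit high-scoring pairs first

    used_rows, used_cols, matching = set(), set(), []
    for score, i, j in candidates:
        # In this simple variant, a row whose best column is already
        # taken is left unmatched rather than re-scored.
        if i in used_rows or j in used_cols:
            continue
        matching.append((i, j))
        used_rows.add(i)
        used_cols.add(j)
    return matching
```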
LIDAR-Camera Fusion for Road Detection Using Fully Convolutional Neural Networks
In this work, a deep learning approach has been developed to carry out road
detection by fusing LIDAR point clouds and camera images. An unstructured and
sparse point cloud is first projected onto the camera image plane and then
upsampled to obtain a set of dense 2D images encoding spatial information.
Several fully convolutional neural networks (FCNs) are then trained to carry
out road detection, either by using data from a single sensor, or by using
three fusion strategies: early, late, and the newly proposed cross fusion.
Whereas in the former two fusion approaches, the integration of multimodal
information is carried out at a predefined depth level, the cross fusion FCN is
designed to directly learn from data where to integrate information; this is
accomplished by using trainable cross connections between the LIDAR and the
camera processing branches.
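A minimal sketch of the cross-connection idea, assuming PyTorch, two
symmetric convolutional branches, and scalar per-block connection weights
(the paper's exact parameterization may differ):

```python
# Hedged sketch of a cross fusion block: camera and LIDAR branches exchange
# information through trainable cross connections at every depth.
import torch
import torch.nn as nn

class CrossFusionBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.cam_conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.lid_conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        # Cross-connection weights are learned; initialized to zero so the
        # branches start out independent, as in late fusion.
        self.cam_to_lid = nn.Parameter(torch.zeros(1))
        self.lid_to_cam = nn.Parameter(torch.zeros(1))
        self.act = nn.ReLU()

    def forward(self, cam: torch.Tensor, lid: torch.Tensor):
        c = self.cam_conv(cam)
        l = self.lid_conv(lid)
        cam_out = self.act(c + self.lid_to_cam * l)
        lid_out = self.act(l + self.cam_to_lid * c)
        return cam_out, lid_out
```

Stacking such blocks and decoding to a per-pixel road mask recovers the FCN
shape; with all cross weights held at zero the model reduces to late fusion,
which shows why learning them subsumes fusion at a predefined depth.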
To further highlight the benefits of using a multimodal system for road
detection, a data set consisting of visually challenging scenes was extracted
from driving sequences of the KITTI raw data set. It was then demonstrated
that, as expected, a purely camera-based FCN severely underperforms on this
data set. A multimodal system, on the other hand, is still able to provide high
accuracy. Finally, the proposed cross fusion FCN was evaluated on the KITTI
road benchmark where it achieved excellent performance, with a MaxF score of
96.03%, ranking it among the top-performing approaches.
Interactive Search and Exploration in Online Discussion Forums Using Multimodal Embeddings
In this paper we present a novel interactive multimodal learning system,
which facilitates search and exploration in large networks of social multimedia
users. It allows the analyst to identify and select users of interest, and to
find similar users in an interactive learning setting. Our approach is based on
novel multimodal representations of users, words and concepts, which we
simultaneously learn by deploying a general-purpose neural embedding model. We
show these representations to be useful not only for categorizing users, but
also for automatically generating user and community profiles. Inspired by
traditional summarization approaches, we create the profiles by selecting
diverse and representative content from all available modalities, i.e., the
text, image, and user modalities. The usefulness of the approach is evaluated
using artificial actors, which simulate user behavior in a relevance feedback
scenario. Multiple experiments were conducted in order to evaluate the quality
of our multimodal representations, to compare different embedding strategies,
and to determine the importance of different modalities. We demonstrate the
capabilities of the proposed approach on two different multimedia collections
originating from the violent online extremism forum Stormfront and the
microblogging platform Twitter, which are particularly interesting due to the
high semantic level of the discussions they feature.
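One way such a shared space could be learned with a general-purpose
embedding model is sketched below with gensim's Word2Vec. Treating every
post as a mixed token sequence of a user ID, its words, and detected image
concepts is an assumption about how the joint vocabulary might be built; the
user:: and concept:: prefixes and the sample tokens are hypothetical.

```python
# Hedged sketch: joint user/word/concept embeddings via skip-gram.
from gensim.models import Word2Vec

posts = [
    ["user::alice", "rally", "downtown", "concept::crowd", "concept::flag"],
    ["user::bob", "new", "thread", "about", "music", "concept::guitar"],
    # ... one mixed token sequence per forum post or tweet
]

model = Word2Vec(posts, vector_size=100, window=5, min_count=1,
                 sg=1, epochs=20)

# Users, words, and visual concepts share one space, so similar users can
# be retrieved directly, supporting the interactive search loop:
print(model.wv.most_similar("user::alice", topn=5))
```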