Recent Research Advances on Interactive Machine Learning
Interactive Machine Learning (IML) is an iterative learning process that
tightly couples a human with a machine learner, and it is widely used by
researchers and practitioners to solve a wide variety of real-world
application problems. Although recent years have witnessed the proliferation of
IML in the field of visual analytics, most recent surveys either focus on a
specific area of IML or aim to summarize a visualization field that is too
generic for IML. In this paper, we systematically review the recent literature
on IML and classify it under a task-oriented taxonomy of our own. We conclude
the survey with a discussion of open challenges and research opportunities that
we believe can inspire future work in IML.
News Across Languages - Cross-Lingual Document Similarity and Event Tracking
In today's world, we follow news which is distributed globally. Significant
events are reported by different sources and in different languages. In this
work, we address the problem of tracking events in a large multilingual
stream. Within a recently developed system, Event Registry, we examine two
aspects of this problem: how to compare articles in different languages and how
to link collections of articles in different languages which refer to the same
event. Taking a multilingual stream and clusters of articles from each
language, we compare different cross-lingual document similarity measures based
on Wikipedia. This allows us to compute the similarity of any two articles
regardless of language. Building on previous work, we show there are methods
which scale well and can compute a meaningful similarity between articles from
languages with little or no direct overlap in the training data. Using this
capability, we then propose an approach to link clusters of articles across
languages which represent the same event. We provide an extensive evaluation of
the system as a whole, as well as an evaluation of the quality and robustness
of the similarity measure and the linking algorithm.
Comment: Accepted for publication in Journal of Artificial Intelligence
Research, Special Track on Cross-language Algorithms and Applications
Necessary and Sufficient Conditions and a Provably Efficient Algorithm for Separable Topic Discovery
We develop necessary and sufficient conditions and a novel provably
consistent and efficient algorithm for discovering topics (latent factors) from
observations (documents) that are realized from a probabilistic mixture of
shared latent factors that have certain properties. Our focus is on the class
of topic models in which each shared latent factor contains a novel word that
is unique to that factor, a property that has come to be known as separability.
Our algorithm is based on the key insight that the novel words correspond to
the extreme points of the convex hull formed by the row-vectors of a suitably
normalized word co-occurrence matrix. We leverage this geometric insight to
establish polynomial computation and sample complexity bounds based on a few
isotropic random projections of the rows of the normalized word co-occurrence
matrix. Our proposed random-projections-based algorithm is naturally amenable
to an efficient distributed implementation and is attractive for modern
web-scale distributed data mining applications.
Comment: Typo corrected; Revised argument in Lemma 3 and
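The geometric idea in the abstract above, that novel words sit at the extreme points of a convex hull and can be exposed by random projections, can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `extreme_rows`, the toy matrix, and the number of projections are assumptions made for illustration, and the normalization of the word co-occurrence matrix is taken as already done.

```python
import random

def extreme_rows(rows, num_projections=50, seed=0):
    """Return indices of rows that maximize at least one random projection.

    A row that maximizes a linear function over a finite point set is an
    extreme point of the set's convex hull, so every index returned here is
    a candidate "novel word" row (illustrative assumption, not the paper's
    full procedure).
    """
    rng = random.Random(seed)
    dim = len(rows[0])
    found = set()
    for _ in range(num_projections):
        # Isotropic direction: independent standard Gaussian coordinates.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scores = [sum(x * d for x, d in zip(row, direction)) for row in rows]
        found.add(max(range(len(rows)), key=scores.__getitem__))
    return sorted(found)

# Hypothetical toy data: three corner rows plus rows that are mixtures of them.
rows = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.4, 0.2],
]
candidates = extreme_rows(rows)
print(candidates)
```

In this toy input the mixture rows are strict convex combinations of the corner rows, so only the corner rows can win a projection. Each projection is independent of the others, which is what makes the approach easy to distribute.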
Learning Semantics for Image Annotation
Image search and retrieval engines rely heavily on textual annotation in
order to match word queries to a set of candidate images. A system that can
automatically annotate images with meaningful text can be highly beneficial for
such engines. Currently, the approaches to develop such systems try to
establish relationships between keywords and visual features of images. In this
paper, we make three main contributions to this area. (i) We transform this
problem from the low-level keyword space to the high-level semantics space that
we refer to as the "image theme". (ii) Instead of treating each possible
keyword independently, we use latent Dirichlet allocation to learn image themes
from the associated texts in a training phase. Images are then annotated with
image themes rather than keywords, using a modified continuous relevance model,
which takes into account the spatial coherence and the visual continuity among
images of a common theme. (iii) To achieve more coherent annotations among
images of a common theme, we integrate ConceptNet in learning the semantics of
images, and hence augment image descriptions beyond annotations provided by
humans. Images are thus further annotated by a few of the most significant
words of the prominent image theme. Our extensive experiments show that a
coherent theme-based image annotation using high-level semantics results in
improved precision and recall as compared with equivalent classical keyword
annotation systems.
Learning Taxonomies of Concepts and not Words using Contextualized Word Representations: A Position Paper
Taxonomies are semantic hierarchies of concepts. One limitation of current
taxonomy learning systems is that they define concepts as single words. This
position paper argues that contextualized word representations, which recently
achieved state-of-the-art results on many competitive NLP tasks, are a
promising method to address this limitation. We outline a novel approach for
taxonomy learning that (1) defines concepts as synsets, (2) learns
density-based approximations of contextualized word representations, and (3)
can measure similarity and hypernymy among them.
Comment: 5 pages, 1 figure
Contextualization of topics: Browsing through the universe of bibliographic information
This paper describes how semantic indexing can help to generate a contextual
overview of topics and visually compare clusters of articles. The method was
originally developed for an innovative information exploration tool, called
Ariadne, which operates on bibliographic databases with tens of millions of
records. In this paper, the method behind Ariadne is further developed and
applied to the research question of the special issue "Same data, different
results" - the better understanding of topic (re-)construction by different
bibliometric approaches. For the case of the Astro dataset of 111,616 articles
in astronomy and astrophysics, a new instantiation of the interactive exploring
tool, LittleAriadne, has been created. This paper contributes to the overall
challenge to delineate and define topics in two different ways. First, we
produce two clustering solutions based on vector representations of articles in
a lexical space. These vectors are built on semantic indexing of entities
associated with those articles. Second, we discuss how LittleAriadne can be
used to browse through the network of topical terms, authors, journals,
citations and various cluster solutions of the Astro dataset. More
specifically, we treat the assignment of an article to the different clustering
solutions as an additional element of its bibliographic record. Keeping the
principle of semantic indexing on the level of such an extended list of
entities of the bibliographic record, LittleAriadne in turn provides a
visualization of the context of a specific clustering solution. It also conveys
the similarity of article clusters produced by different algorithms, hence
representing a complementary approach to other possible means of comparison.
Comment: Special Issue of Scientometrics: Same data - different results?
Towards a comparative approach to the identification of thematic structures
in science
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up the cross-modal retrieval, a number of binary representation learning
methods are proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristics of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables
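The speed argument for binary representation learning in the abstract above rests on comparing codes in a common Hamming space. A minimal sketch of that retrieval step follows, assuming the binary codes have already been learned. The 8-bit codes, the item names, and the helper `rank_by_hamming` are hypothetical and are not taken from the survey.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded binary codes."""
    return bin(a ^ b).count("1")

def rank_by_hamming(query_code, database):
    """Sort (item_id, code) pairs by Hamming distance to the query code."""
    return sorted(database, key=lambda item: hamming(query_code, item[1]))

# Hypothetical 8-bit codes: one text query and three image codes that are
# assumed to live in the same Hamming space.
text_query = 0b10110010
image_codes = [
    ("img_b", 0b01001101),
    ("img_a", 0b10110011),
    ("img_c", 0b10100000),
]
ranked = rank_by_hamming(text_query, image_codes)
print([item_id for item_id, _ in ranked])
```

Because the distance reduces to an XOR followed by a bit count, the ranking needs no floating-point arithmetic, which is the practical appeal of mapping different modalities into one binary space.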
Improving Coreference Resolution by Learning Entity-Level Distributed Representations
A long-standing challenge in coreference resolution has been the
incorporation of entity-level information - features defined over clusters of
mentions instead of mention pairs. We present a neural network based
coreference system that produces high-dimensional vector representations for
pairs of coreference clusters. Using these representations, our system learns
when combining clusters is desirable. We train the system with a
learning-to-search algorithm that teaches it which local decisions (cluster
merges) will lead to a high-scoring final coreference partition. The system
substantially outperforms the current state-of-the-art on the English and
Chinese portions of the CoNLL 2012 Shared Task dataset despite using few
hand-engineered features.
Comment: Accepted for publication at the Association for Computational
Linguistics (ACL), 201
State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"
Several Networks of Excellence have been set up in the framework of the
European FP5 research program. Among these Networks of Excellence, the NEMIS
project focuses on the field of Text Mining.
Within this field, document processing and visualization was identified as
one of the key topics and the WG1 working group was created in the NEMIS
project, to carry out a detailed survey of techniques associated with the text
mining process and to identify the relevant research topics in related research
areas.
In this document we present the results of this comprehensive survey. The
report includes a description of the current state-of-the-art and practice, a
roadmap for follow-up research in the identified areas, and recommendations for
anticipated technological development in the domain of text mining.
Comment: 54 pages, Report of Working Group 1 for the European Network of
Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS)
Multi-view Anomaly Detection via Probabilistic Latent Variable Models
We propose a nonparametric Bayesian probabilistic latent variable model for
multi-view anomaly detection, which is the task of finding instances that have
inconsistent views. With the proposed model, all views of a non-anomalous
instance are assumed to be generated from a single latent vector. On the other
hand, an anomalous instance is assumed to have multiple latent vectors, and its
different views are generated from different latent vectors. By inferring the
number of latent vectors used for each instance with Dirichlet process priors,
we obtain multi-view anomaly scores. The proposed model can be seen as a robust
extension of probabilistic canonical correlation analysis for noisy multi-view
data. We present Bayesian inference procedures for the proposed model based on
a stochastic EM algorithm. The effectiveness of the proposed model is
demonstrated in terms of performance when detecting multi-view anomalies and
imputing missing values in multi-view data with anomalies.