Analyzing Evolving Stories in News Articles
There is an overwhelming number of news articles published every day around
the globe. Following the evolution of a news story is a difficult task because
no mechanism is available to track back in time and study the diffusion of
relevant events in digital news feeds. The techniques developed so far to
extract meaningful information from a massive corpus rely on similarity
search, which results in a myopic loopback to the same topic without providing
the insights needed to hypothesize the origin of a story that may be
completely different from the news today. In this paper, we present an
algorithm that mines historical data to detect the origin of an event, segments
the timeline into disjoint groups of coherent news articles, and outlines the
most important documents in a timeline with a soft probability to provide a
better understanding of the evolution of a story. Qualitative and quantitative
approaches to evaluate our framework demonstrate that our algorithm discovers
statistically significant and meaningful stories in reasonable time.
Additionally, a relevant case study on a set of news articles demonstrates that
the generated output of the algorithm holds the promise to aid prediction of
future entities in a story.
Comment: This is a pre-print of an article published in the International
Journal of Data Science and Analytics. The final authenticated version is
available online at: https://doi.org/10.1007/s41060-017-0091-
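The timeline-segmentation step the abstract describes (splitting a chronologically ordered stream of articles into disjoint groups of coherent news) can be sketched with a simple bag-of-words cosine heuristic. This is only an illustration under assumed thresholds and features, not the paper's actual algorithm:

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector for one article."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def segment_timeline(articles, threshold=0.2):
    """Split a chronologically ordered list of articles into disjoint
    coherent segments: start a new segment whenever an article's
    similarity to its predecessor drops below `threshold`."""
    segments, current = [], [articles[0]]
    for prev, cur in zip(articles, articles[1:]):
        if cosine(vectorize(prev), vectorize(cur)) < threshold:
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments

story = [
    "election campaign rally speech",
    "election campaign debate speech",
    "flood rescue emergency response",
    "flood damage emergency aid",
]
print(len(segment_timeline(story)))  # the topic shift yields 2 segments
```

A real system would also need the origin-detection and document-importance scoring the abstract mentions; this fragment covers only the disjoint-segmentation idea.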
Paper evolution graph: Multi-view structural retrieval for academic literature
Academic literature retrieval is concerned with the selection of papers that
are most likely to match a user's information needs. Most of the retrieval
systems are limited to list-output models, in which the retrieval results are
isolated from each other. In this work, we aim to uncover the relationships of
the retrieval results and propose a method for building structural retrieval
results for academic literature, which we call a paper evolution graph (PEG).
A PEG describes the evolution of the diverse aspects of input queries through
several evolution chains of papers. By utilizing the author, citation and
content information, PEGs can uncover the various underlying relationships
among the papers and present the evolution of articles from multiple
viewpoints. Our system supports three types of input queries: keyword,
single-paper and two-paper queries. The construction of a PEG mainly consists
of three steps. First, the papers are soft-clustered into communities via
metagraph factorization during which the topic distribution of each paper is
obtained. Second, topically cohesive evolution chains are extracted from the
communities that are relevant to the query. Each chain focuses on one aspect of
the query. Finally, the extracted chains are combined to generate a PEG, which
fully covers all the topics of the query. The experimental results on a
real-world dataset demonstrate that the proposed method is able to construct
meaningful PEGs.
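The last two of the three construction steps (chain extraction from relevant communities, then chain combination into a PEG) can be sketched in miniature. Here the metagraph factorization of step one is replaced by given topic distributions, and all titles, weights, and thresholds are illustrative assumptions:

```python
def extract_chain(papers, topic, min_weight=0.3):
    """Step two: keep papers that weight `topic` strongly and order
    them chronologically to form one topically cohesive chain."""
    relevant = [p for p in papers if p["topics"].get(topic, 0.0) >= min_weight]
    return sorted(relevant, key=lambda p: p["year"])

def build_peg(papers, query_topics):
    """Step three: combine one chain per query aspect so the graph
    covers all the topics of the query."""
    return {t: [p["title"] for p in extract_chain(papers, t)]
            for t in query_topics}

papers = [
    {"title": "A", "year": 2001, "topics": {"retrieval": 0.8}},
    {"title": "B", "year": 2005, "topics": {"retrieval": 0.5, "ranking": 0.5}},
    {"title": "C", "year": 2010, "topics": {"ranking": 0.9}},
]
peg = build_peg(papers, ["retrieval", "ranking"])
print(peg)  # {'retrieval': ['A', 'B'], 'ranking': ['B', 'C']}
```

Note how paper B, soft-clustered into both communities, appears in both chains; soft membership is what lets one paper bridge multiple aspects of the query.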
A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques
The amount of text that is generated every day is increasing dramatically.
This tremendous volume of mostly unstructured text cannot be simply processed
and perceived by computers. Therefore, efficient and effective techniques and
algorithms are required to discover useful patterns. Text mining is the task of
extracting meaningful information from text, which has gained significant
attention in recent years. In this paper, we describe several of the most
fundamental text mining tasks and techniques including text pre-processing,
classification and clustering. Additionally, we briefly explain text mining in
biomedical and health care domains.
Comment: some reference formats have been updated
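Of the tasks the survey lists, text pre-processing is the most mechanical and easy to illustrate. A minimal sketch of tokenization, lowercasing, and stop-word removal follows; the tiny stop-word list is an assumption, not a standard resource:

```python
import re

# Illustrative stop-word list; real pipelines use curated resources.
STOPWORDS = {"the", "is", "of", "and", "a", "to"}

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, drop stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The amount of text is increasing."))
# ['amount', 'text', 'increasing']
```

The classification and clustering techniques the survey covers typically consume exactly this kind of token stream, usually after a further weighting step such as TF-IDF.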
Capturing Semantic Similarity for Entity Linking with Convolutional Neural Networks
A key challenge in entity linking is making effective use of contextual
information to disambiguate mentions that might refer to different entities in
different contexts. We present a model that uses convolutional neural networks
to capture semantic correspondence between a mention's context and a proposed
target entity. These convolutional networks operate at multiple granularities
to exploit various kinds of topic information, and their rich parameterization
gives them the capacity to learn which n-grams characterize different topics.
We combine these networks with a sparse linear model to achieve
state-of-the-art performance on multiple entity linking datasets, outperforming
the prior systems of Durrett and Klein (2014) and Nguyen et al. (2014).
Comment: Accepted at NAACL 201
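The multi-granularity idea can be sketched schematically: score a candidate entity by comparing its description with the mention's local context and with the whole document, then combine the two similarities linearly. Bag-of-words cosine stands in here for the paper's learned convolutional representations, and the weights are illustrative, not the trained sparse linear model:

```python
import math
from collections import Counter

def cos(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def entity_score(mention_window, document, entity_desc, weights=(0.5, 0.5)):
    """Fine granularity: mention window; coarse granularity: document."""
    e = Counter(entity_desc.split())
    fine = cos(Counter(mention_window.split()), e)
    coarse = cos(Counter(document.split()), e)
    return weights[0] * fine + weights[1] * coarse

mention = "the jaguar engine roared"
document = "car review the jaguar engine roared down the track"
car = "jaguar british car engine luxury"
animal = "jaguar large cat rainforest predator"
# Context disambiguates: the car sense outscores the animal sense.
assert entity_score(mention, document, car) > entity_score(mention, document, animal)
```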
Temporally Coherent Bayesian Models for Entity Discovery in Videos by Tracklet Clustering
A video can be represented as a sequence of tracklets, each spanning 10-20
frames, and associated with one entity (e.g., a person). The task of \emph{Entity
Discovery} in videos can be naturally posed as tracklet clustering. We approach
this task by leveraging \emph{Temporal Coherence} (TC): the fundamental property
of videos that each tracklet is likely to be associated with the same entity as
its temporal neighbors. Our major contributions are the first Bayesian
nonparametric models for TC at the tracklet level. We extend the Chinese
Restaurant Process (CRP) to propose TC-CRP, and further to the Temporally Coherent Chinese
Restaurant Franchise (TC-CRF) to jointly model short temporal segments. On the
task of discovering persons in TV serial videos without meta-data like scripts,
these methods show considerable improvement in cluster purity and person
coverage compared to state-of-the-art approaches to tracklet clustering. We
represent entities with mixture components, and tracklets with vectors of very
generic features, which can work for any type of entity (not necessarily
person). The proposed methods can perform online tracklet clustering on
streaming videos with little performance deterioration unlike existing
approaches, and can automatically reject tracklets resulting from false
detections. Finally, we discuss entity-driven video summarization, where some
temporal segments of the video are selected automatically based on the
discovered entities.
Comment: 11 pages
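The temporal-coherence prior can be sketched with a greedy online rule in the spirit of TC-CRP: each tracklet prefers its temporal predecessor's cluster, falls back to any sufficiently close existing cluster, and opens a new one otherwise. The 1-D features, distance, and threshold are illustrative stand-ins for the Bayesian nonparametric model:

```python
def tc_cluster(tracklets, new_cluster_dist=0.5):
    """Greedy online assignment over a stream of 1-D feature values."""
    labels, centers = [], []
    for x in tracklets:
        if labels and abs(x - centers[labels[-1]]) < new_cluster_dist:
            labels.append(labels[-1])          # stay with temporal neighbor
        else:
            matches = [c for c, m in enumerate(centers)
                       if abs(x - m) < new_cluster_dist]
            if matches:
                labels.append(matches[0])      # rejoin an existing entity
            else:
                centers.append(x)              # open a new entity
                labels.append(len(centers) - 1)
    return labels

print(tc_cluster([0.0, 0.1, 5.0, 5.1, 0.05]))  # [0, 0, 1, 1, 0]
```

Because assignment is one pass over the stream, this toy version shares the abstract's key operational property: it can run online over streaming video without revisiting earlier tracklets.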
SSP: Semantic Space Projection for Knowledge Graph Embedding with Text Descriptions
Knowledge representation is an important, long-standing topic in AI, and there
has been a large amount of work on knowledge graph embedding, which projects
symbolic entities and relations into low-dimensional, real-valued vector space.
However, most embedding methods merely concentrate on data fitting and ignore
the explicit semantic expression, leading to uninterpretable representations.
Thus, traditional embedding methods have limited potential for many
applications such as question answering and entity classification. To this
end, this paper proposes a semantic representation method for knowledge graph
\textbf{(KSR)}, which imposes a two-level hierarchical generative process that
globally extracts many aspects and then locally assigns a specific category in
each aspect for every triple. Since both aspects and categories are
semantics-relevant, the collection of categories in each aspect is treated as
the semantic representation of this triple. Extensive experiments show that our
model outperforms other state-of-the-art baselines substantially.
Comment: Submitted to AAAI.201
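The two-level structure the abstract describes (globally extracted aspects, and a per-triple category within each aspect) can be shown schematically. The aspects, categories, and assignment rule below are toy assumptions, not the paper's learned generative model:

```python
# Globally extracted aspects, each with its own category inventory.
ASPECTS = {
    "domain": ["person", "place", "organization"],
    "relation_type": ["biographical", "geographic", "professional"],
}

def semantic_representation(triple, assign):
    """One category per aspect; the collection of assigned categories
    is the triple's interpretable semantic representation."""
    return {aspect: assign(triple, aspect, cats)
            for aspect, cats in ASPECTS.items()}

def toy_assign(triple, aspect, cats):
    """Hypothetical rule keyed on the relation name, for illustration."""
    head, rel, tail = triple
    if aspect == "domain":
        return "person" if rel in {"born_in", "advisor"} else "place"
    return "biographical" if rel == "born_in" else "professional"

rep = semantic_representation(("Einstein", "born_in", "Ulm"), toy_assign)
print(rep)  # {'domain': 'person', 'relation_type': 'biographical'}
```

The point of the structure, as the abstract argues, is interpretability: unlike a dense embedding vector, each coordinate of this representation names a semantic category.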
Doctoral Advisor or Medical Condition: Towards Entity-specific Rankings of Knowledge Base Properties [Extended Version]
In knowledge bases such as Wikidata, it is possible to assert a large set of
properties for entities, ranging from generic ones such as name and place of
birth to highly profession-specific or background-specific ones such as
doctoral advisor or medical condition. Determining a preference or ranking in
this large set is a challenge in tasks such as prioritisation of edits or
natural-language generation. Most previous approaches to ranking knowledge base
properties are purely data-driven, that is, as we show, mistake frequency for
interestingness.
In this work, we have developed a human-annotated dataset of 350 preference
judgments among pairs of knowledge base properties for fixed entities. From
this set, we isolate a subset of pairs for which humans show a high level of
agreement (87.5% on average). We show, however, that baseline and
state-of-the-art techniques achieve only 61.3% precision in predicting human
preferences for this subset.
We then analyze what contributes to one property being rated as more
important than another one, and identify that at least three factors play a
role, namely (i) general frequency, (ii) applicability to similar entities and
(iii) semantic similarity between property and entity. We experimentally
analyze the contribution of each factor and show that a combination of
techniques addressing all the three factors achieves 74% precision on the task.
The dataset is available at
www.kaggle.com/srazniewski/wikidatapropertyranking.
Comment: Extended version of an ADMA 2017 conference paper
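Combining the three identified factors (general frequency, applicability to similar entities, and property-entity semantic similarity) can be sketched as a weighted score. The weights and feature values below are illustrative assumptions, not the paper's fitted model:

```python
def property_score(freq, applicability, semantic_sim,
                   weights=(0.2, 0.4, 0.4)):
    """Weighted combination of the three factors, each scaled to [0, 1]."""
    w1, w2, w3 = weights
    return w1 * freq + w2 * applicability + w3 * semantic_sim

# For a physicist entity, 'doctoral advisor' should outrank
# 'medical condition' despite lower global frequency -- exactly the
# failure mode of purely frequency-driven rankings.
advisor = property_score(freq=0.3, applicability=0.9, semantic_sim=0.8)
condition = property_score(freq=0.6, applicability=0.2, semantic_sim=0.1)
assert advisor > condition
```

Down-weighting raw frequency relative to the two entity-specific factors is what separates this combination from the purely data-driven baselines the abstract criticizes.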
Discriminative Probabilistic Models for Relational Data
In many supervised learning tasks, the entities to be labeled are related to
each other in complex ways and their labels are not independent. For example,
in hypertext classification, the labels of linked pages are highly correlated.
A standard approach is to classify each entity independently, ignoring the
correlations between them. Recently, Probabilistic Relational Models, a
relational version of Bayesian networks, were used to define a joint
probabilistic model for a collection of related entities. In this paper, we
present an alternative framework that builds on (conditional) Markov networks
and addresses two limitations of the previous approach. First, undirected
models do not impose the acyclicity constraint that hinders representation of
many important relational dependencies in directed models. Second, undirected
models are well suited for discriminative training, where we optimize the
conditional likelihood of the labels given the features, which generally
improves classification accuracy. We show how to train these models
effectively, and how to use approximate probabilistic inference over the
learned model for collective classification of multiple related entities. We
provide experimental results on a webpage classification task, showing that
accuracy can be significantly improved by modeling relational dependencies.
Comment: Appears in Proceedings of the Eighteenth Conference on Uncertainty in
Artificial Intelligence (UAI2002)
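The collective-classification idea can be sketched with a simple iterative scheme: instead of labeling each page independently, repeatedly re-label each node from its own evidence plus the current labels of its linked neighbors. This stands in for approximate inference in the learned Markov network; the blending rule and scores are illustrative assumptions:

```python
def collective_classify(features, links, iterations=5):
    """features: node -> local score in [0, 1] for class 1.
    links: node -> list of linked neighbor nodes.
    Each node's label blends local evidence with the neighbor vote."""
    labels = {n: int(s > 0.5) for n, s in features.items()}
    for _ in range(iterations):
        for n, s in features.items():
            if links[n]:
                vote = sum(labels[m] for m in links[n]) / len(links[n])
                labels[n] = int(0.5 * s + 0.5 * vote > 0.5)
    return labels

# Node 'b' has weak local evidence, but its confident neighbor 'a'
# pulls it to class 1 -- the correlation an independent classifier misses.
features = {"a": 0.9, "b": 0.55, "c": 0.45, "d": 0.1}
links = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(collective_classify(features, links))
# {'a': 1, 'b': 1, 'c': 0, 'd': 0}
```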
State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"
Several Networks of Excellence have been set up in the framework of the
European FP5 research program. Among these Networks of Excellence, the NEMIS
project focuses on the field of Text Mining.
Within this field, document processing and visualization was identified as
one of the key topics and the WG1 working group was created in the NEMIS
project, to carry out a detailed survey of techniques associated with the text
mining process and to identify the relevant research topics in related research
areas.
In this document we present the results of this comprehensive survey. The
report includes a description of the current state-of-the-art and practice, a
roadmap for follow-up research in the identified areas, and recommendations for
anticipated technological development in the domain of text mining.
Comment: 54 pages, Report of Working Group 1 for the European Network of
Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS)
Learning Fine-Grained Knowledge about Contingent Relations between Everyday Events
Much of the user-generated content on social media is provided by ordinary
people telling stories about their daily lives. We develop and test a novel
method for learning fine-grained common-sense knowledge from these stories
about contingent (causal and conditional) relationships between everyday
events. This type of knowledge is useful for text and story understanding,
information extraction, question answering, and text summarization. We test and
compare different methods for learning contingency relations, and compare what
is learned from topic-sorted story collections vs. general-domain stories. Our
experiments show that using topic-specific datasets enables learning
finer-grained knowledge about events and results in significant improvement
over the baselines. An evaluation on Amazon Mechanical Turk shows 82% of the
relations between events that we learn from topic-sorted stories are judged as
contingent.
Comment: SIGDIAL 201
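One common way to mine contingent event pairs from narrative text, in the spirit of what the abstract describes, is to score ordered pairs of adjacent events by how much more often they co-occur in that order than chance (a pointwise-mutual-information-style "causal potential"). The corpus and scoring details below are illustrative assumptions:

```python
import math
from collections import Counter

def contingency_scores(stories):
    """stories: lists of event strings in narrative order.
    Returns PMI-style scores for ordered adjacent event pairs."""
    events, pairs, n = Counter(), Counter(), 0
    for story in stories:
        events.update(story)
        n += len(story)
        pairs.update(zip(story, story[1:]))  # ordered adjacent pairs
    total_pairs = sum(pairs.values())
    return {
        (a, b): math.log((c / total_pairs) /
                         ((events[a] / n) * (events[b] / n)))
        for (a, b), c in pairs.items()
    }

stories = [
    ["order_food", "eat", "pay"],
    ["order_food", "eat", "pay"],
    ["cook", "eat", "leave"],
    ["leave", "drive"],
]
scores = contingency_scores(stories)
# The strongly contingent pair outscores the incidental one.
assert scores[("order_food", "eat")] > scores[("eat", "leave")]
```

Restricting the corpus to topic-sorted story collections, as the paper's experiments do, sharpens these counts and is what enables the finer-grained knowledge the abstract reports.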