Using Explicit Semantic Analysis for Cross-Lingual Link Discovery
This paper explores how to automatically generate cross-language links between resources in large document collections. The paper presents new methods for Cross-Lingual Link Discovery (CLLD) based on Explicit Semantic Analysis (ESA). The methods are applicable to any multilingual document collection. In this report, we present a comparative study of these methods on the Wikipedia corpus and provide new insights into the evaluation of link discovery systems. In particular, we measure the agreement of human annotators in linking articles in different language versions of Wikipedia, and compare it to the results achieved by the presented methods.
Seeding with Costly Network Information
We study the task of selecting nodes in a social network of size , to
seed a diffusion with maximum expected spread size, under the independent
cascade model with cascade probability . Most of the previous work on this
problem (known as influence maximization) focuses on efficient algorithms to
approximate the optimal seed set with provable guarantees, given the knowledge
of the entire network. However, in practice, obtaining full knowledge of the
network is very costly. To address this gap, we first study the achievable
guarantees using influence samples. We provide an approximation
algorithm with a tight $(1-1/e)\,\mathrm{OPT} - \epsilon n$ guarantee, using
influence samples and show that this dependence on
is asymptotically optimal. We then propose a probing algorithm that queries
edges from the graph and uses them to find a seed set with the
same almost tight approximation guarantee. We also provide a matching (up to
logarithmic factors) lower bound on the required number of edges. To address
the dependence of our probing algorithm on the independent cascade probability
, we show that it is impossible to maintain the same approximation
guarantees by controlling the discrepancy between the probing and seeding
cascade probabilities. Instead, we propose to down-sample the probed edges to
match the seeding cascade probability, provided that it does not exceed that of
probing. Finally, we test our algorithms on real-world data to quantify the
trade-off between the cost of obtaining more refined network information and
the benefit of the added information for guiding improved seeding strategies.
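The down-sampling step mentioned in this abstract can be illustrated with a minimal sketch. It assumes one plausible reading: each probed edge was realized independently with the probing cascade probability, so keeping each one independently with probability `p_seed / p_probe` yields edges that are live with probability `p_seed`. The function name `downsample_edges` and its parameters are hypothetical and are not taken from the paper.

```python
import random


def downsample_edges(probed_edges, p_probe, p_seed, rng=None):
    """Thin probed edges so each is live with overall probability p_seed.

    Hypothetical helper: assumes every edge in probed_edges was realized
    independently with probability p_probe. Independent thinning with
    ratio p_seed / p_probe only works when p_seed <= p_probe.
    """
    if not 0 < p_seed <= p_probe <= 1:
        raise ValueError("requires 0 < p_seed <= p_probe <= 1")
    if rng is None:
        rng = random.Random()
    keep_probability = p_seed / p_probe
    # Keep each probed edge independently with the thinning ratio.
    return [edge for edge in probed_edges if rng.random() < keep_probability]
```

When the two probabilities are equal, the ratio is 1 and every probed edge is kept. When the seeding probability exceeds the probing probability, the sketch raises an error, mirroring the condition stated in the abstract.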
Substructure Discovery Using Minimum Description Length and Background Knowledge
The ability to identify interesting and repetitive substructures is an
essential component to discovering knowledge in structural data. We describe a
new version of our SUBDUE substructure discovery system based on the minimum
description length principle. The SUBDUE system discovers substructures that
compress the original data and represent structural concepts in the data. By
replacing previously-discovered substructures in the data, multiple passes of
SUBDUE produce a hierarchical description of the structural regularities in the
data. SUBDUE uses a computationally-bounded inexact graph match that identifies
similar, but not identical, instances of a substructure and finds an
approximate measure of closeness of two substructures when under computational
constraints. In addition to the minimum description length principle, other
background knowledge can be used by SUBDUE to guide the search towards more
appropriate substructures. Experiments in a variety of domains demonstrate
SUBDUE's ability to find substructures capable of compressing the original data
and to discover structural concepts important to the domain. Description of
Online Appendix: This is a compressed tar file containing the SUBDUE discovery
system, written in C. The program accepts as input databases represented in
graph form, and will output discovered substructures with their corresponding
value. Comment: See http://www.jair.org/ for an online appendix and other files
accompanying this article.
Discovery Is Never By Chance: Designing for (Un)Serendipity
Serendipity has a long tradition in the history of science as having played a key role in many significant discoveries. Computer scientists, valuing the role of serendipity in discovery, have attempted to design systems that encourage serendipity. However, that research has focused primarily on only one aspect of serendipity: that of chance encounters. In reality, for serendipity to be valuable, chance encounters must be synthesized into insight. In this paper we show, through a formal consideration of serendipity and analysis of how various systems have seized on attributes of interpreting serendipity, that there is a richer space for design to support serendipitous creativity, innovation and discovery than has been tapped to date. We discuss how ideas might be encoded to be shared or discovered by ‘association-hunting’ agents. We propose considering not only the inventor’s role in perceiving serendipity, but also how that inventor’s perception may be enhanced to increase the opportunity for serendipity. We explore the role of environment and how we can better enable serendipitous discoveries to find a home more readily and immediately.