Mining Missing Hyperlinks from Human Navigation Traces: A Case Study of Wikipedia
Hyperlinks are an essential feature of the World Wide Web. They are
especially important for online encyclopedias such as Wikipedia: an article can
often only be understood in the context of related articles, and hyperlinks
make it easy to explore this context. But important links are often missing,
and several methods have been proposed to alleviate this problem by learning a
linking model based on the structure of the existing links. Here we propose a
novel approach to identifying missing links in Wikipedia. We build on the fact
that the ultimate purpose of Wikipedia links is to aid navigation. Rather than
merely suggesting new links that are in tune with the structure of existing
links, our method finds missing links that would immediately enhance
Wikipedia's navigability. We leverage data sets of navigation paths collected
through a Wikipedia-based human-computation game in which users must find a
short path from a start to a target article by only clicking links encountered
along the way. We harness human navigational traces to identify a set of
candidates for missing links and then rank these candidates. Experiments show
that our procedure identifies missing links of high quality.
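The candidate-then-rank idea described above can be illustrated with a minimal sketch. The abstract does not say how candidates are generated or ranked, so the rule used here is an assumption: any article visited before the target on a path, and lacking a direct link to that target, is a candidate source, and candidates are ranked by the number of supporting paths. The function name `candidate_links` and the toy data are hypothetical.

```python
from collections import Counter


def candidate_links(paths, existing_links):
    """Rank (source, target) pairs that navigation paths suggest are missing.

    paths: list of article-title lists, each ending at the target article.
    existing_links: set of (source, target) pairs already present.
    Assumed rule (not from the abstract): every article seen before the
    target is a candidate source if it has no direct link to the target.
    """
    counts = Counter()
    for path in paths:
        target = path[-1]
        # A set ensures a source revisited within one path is counted once.
        for source in set(path[:-1]):
            if source != target and (source, target) not in existing_links:
                counts[(source, target)] += 1
    # Most-supported candidates first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data, invented for illustration only.
paths = [
    ["Apple", "Fruit", "Plant", "Photosynthesis"],
    ["Apple", "Tree", "Plant", "Photosynthesis"],
    ["Banana", "Fruit", "Plant", "Photosynthesis"],
]
existing_links = {
    ("Apple", "Fruit"), ("Fruit", "Plant"), ("Plant", "Photosynthesis"),
    ("Apple", "Tree"), ("Tree", "Plant"), ("Banana", "Fruit"),
}
ranked = candidate_links(paths, existing_links)
```

In this toy run, the pairs from "Apple" and "Fruit" to "Photosynthesis" are each supported by two paths and rank above the pairs from "Banana" and "Tree", which are supported by one path each.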
Uncovering New Links Through Interaction Duration
Link Prediction is the problem of inferring new relationships among nodes in a network that are likely to form in the near future. Classical approaches mainly consider neighborhood structure similarity when linking nodes. However, we may also want to take into account whether the two nodes we are going to link will benefit from the link by actively interacting over time. For instance, it is better to link two nodes u and v if we know that they will interact in the social network in the future, rather than suggesting a node w who may never interact with u. Thus, the longer the interaction is estimated to last (i.e., the more persistent it is), the higher the priority for connecting the two nodes.
This thesis focuses on the problem of predicting how long two nodes will interact in a network by identifying potential pairs of nodes (u, v) that are not connected, yet show some Indirect Interaction. “Indirect Interaction” means that there is a particular action involving both nodes, which depends on the type of network. For example, in social networks such as Facebook, there are users who are not friends but interact with each other’s wall posts. On the Wikipedia hyperlink network, it happens when readers navigate from page u to page v through the search box (in the top right corner of page u), and there is no explicit link on page u to v. This research explores cases that involve multiple interactions between u and v during an observational time interval [t_0, t_1). Two supervised learning approaches are proposed for the problem. Given a set of network-based predictors, the basic approach consists of learning a binary classifier to predict whether or not an observed Indirect Interaction will last in the future. The second and more fine-grained approach consists of estimating how long the interaction will last by modeling the problem via Survival Analysis or as a Regression task. Once the duration is estimated, this information is leveraged for the Link Prediction task.
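As a rough illustration of what modeling interaction duration via Survival Analysis can look like, the sketch below computes a Kaplan-Meier survival curve in plain Python. The abstract does not name a specific estimator, so Kaplan-Meier is an assumption chosen because it is a standard nonparametric method. The sketch also ignores the network-based predictors mentioned above, and the function name and toy durations are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where an interaction ended.

    durations[i]: number of time units interaction i was followed.
    observed[i]: True if the interaction ended inside the observation
                 window, False if it was still ongoing (censored).
    """
    event_times = sorted({d for d, o in zip(durations, observed) if o})
    survival = 1.0
    curve = []
    for t in event_times:
        # Interactions still being followed just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Interactions that ended exactly at time t.
        ended = sum(1 for d, o in zip(durations, observed) if o and d == t)
        survival *= 1.0 - ended / at_risk
        curve.append((t, survival))
    return curve


# Toy data, invented for illustration: five indirect interactions,
# two of which were still active when observation stopped.
durations = [1, 2, 2, 3, 4]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
```

For this toy input the estimated probability that an interaction lasts beyond time 1, 2, and 3 is approximately 0.8, 0.6, and 0.3. A pair whose estimated curve decays slowly would, under the thesis's premise, receive higher priority as a link suggestion.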
Experiments were performed on the longitudinal Facebook network and wall-interaction dataset and on the Wikipedia Clickstream dataset to test this approach to predicting the Duration of Interaction and to Link Prediction. The results show that the fine-grained approach performs best, with an AUROC of 85.4% on Facebook and 77% on Wikipedia for Link Prediction. Moreover, this approach beats a Link Prediction model that is based only on network properties and does not consider the Duration of Interaction, which achieves an AUROC of 80% and 68% on Facebook and Wikipedia, respectively.
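For readers unfamiliar with the metric reported above, AUROC can be read as the probability that a randomly chosen true link is scored higher than a randomly chosen non-link. The sketch below computes it directly from that definition. It is a generic illustration, not the thesis's evaluation code, and the labels and scores are invented.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for a true link, 0 for a non-link.
    scores: predicted score for each candidate pair (higher = more likely).
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
value = auroc(labels, scores)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the value is 8/9, or about 0.89. The quadratic loop is kept for clarity; a rank-based computation would be preferable on large datasets.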
New frontiers in supervised word sense disambiguation: building multilingual resources and neural models on a large scale
Word Sense Disambiguation is a long-standing task in Natural Language Processing
(NLP), lying at the core of human language understanding. While it has already
been studied from many different angles over the years, ranging from knowledge
based systems to semi-supervised and fully supervised models, the field seems to
be slowing down with respect to other NLP tasks, e.g., part-of-speech tagging and
dependency parsing. Despite the organization of several international competitions
aimed at evaluating Word Sense Disambiguation systems, the evaluation of automatic
systems has been problematic mainly due to the lack of a reliable evaluation
framework allowing a direct quantitative comparison.
To this end we develop a unified evaluation framework and analyze the performance
of various Word Sense Disambiguation systems in a fair setup. The results
show that supervised systems clearly outperform knowledge-based models. Among
the supervised systems, a linear classifier trained on conventional local features
still proves to be a hard baseline to beat. Nonetheless, recent approaches exploiting
neural networks on unlabeled corpora achieve promising results, surpassing this
hard baseline in most test sets. Even though supervised systems tend to perform
best in terms of accuracy, they often lose ground to more flexible knowledge-based
solutions, which do not require training for every disambiguation target. To bridge
this gap we adopt a different perspective and rely on sequence learning to frame
the disambiguation problem: we propose and study in depth a series of end-to-end
neural architectures directly tailored to the task, from bidirectional Long Short-Term
Memory to encoder-decoder models. Our extensive evaluation over standard
benchmarks and in multiple languages shows that sequence learning enables more
versatile all-words models that consistently lead to state-of-the-art results, even
against models trained with engineered features.
However, supervised systems need annotated training corpora and the few available
to date are of limited size: this is mainly due to the expensive and time-consuming
process of annotating a wide variety of word senses at a reasonably high
scale, i.e., the so-called knowledge acquisition bottleneck. To address this issue, we
also present different strategies to automatically acquire high-quality sense-annotated
data in multiple languages, without any manual effort. We assess the quality of the
sense annotations both intrinsically and extrinsically, achieving competitive results
on multiple tasks.
Harnessing sense-level information for semantically augmented knowledge extraction
Nowadays, building accurate computational models for the semantics of language lies at the very core of Natural Language Processing and Artificial Intelligence. A first and foremost step in this respect consists in moving from word-based to sense-based approaches, in which operating explicitly at the level of word senses enables a model to produce more accurate and unambiguous results. At the same time, word senses create a bridge towards structured lexico-semantic resources, where the vast amount of available machine-readable information can help overcome the shortage of annotated data in many languages and domains of knowledge.
This latter phenomenon, known as the knowledge acquisition bottleneck, is a crucial problem that hampers the development of large-scale, data-driven approaches for many Natural Language Processing tasks, especially when lexical semantics is directly involved. One of these tasks is Information Extraction, where an effective model has to cope with data sparsity, as well as with lexical ambiguity that can arise at the level of both arguments and relational phrases. Even in more recent Information Extraction approaches where semantics is implicitly modeled, these issues have not yet been addressed in their entirety. On the other hand, obtaining explicit sense-level information is a very demanding task in its own right, which can rarely be performed with high accuracy on a large scale. With this in mind, in this thesis we will tackle a two-fold objective: our first focus will be on studying fully automatic approaches to obtain high-quality sense-level information from textual corpora; then, we will investigate in depth where and how such sense-level information has the potential to enhance the extraction of knowledge from open text.
In the first part of this work, we will explore three different disambiguation scenarios (semi-structured text, parallel text, and definitional text) and devise automatic disambiguation strategies that are not only capable of scaling to different corpus sizes and different languages, but that actually take advantage of a multilingual and/or heterogeneous setting to improve and refine their performance. As a result, we will obtain three sense-annotated resources that, when tested experimentally with a baseline system in a series of downstream semantic tasks (i.e., Word Sense Disambiguation, Entity Linking, Semantic Similarity), show very competitive performance on standard benchmarks against both manual and semi-automatic competitors.
In the second part we will instead focus on Information Extraction, with an emphasis on Open Information Extraction (OIE), where issues like sparsity and lexical ambiguity are especially critical, and study how best to exploit sense-level information within the extraction process. We will start by showing that enforcing a deeper semantic analysis in a definitional setting enables a full-fledged extraction pipeline to compete with state-of-the-art approaches based on much larger (but noisier) data. We will then demonstrate how working at the sense level at the end of an extraction pipeline is also beneficial: indeed, by leveraging sense-based techniques, very heterogeneous OIE-derived data can be aligned semantically, and unified with respect to a common sense inventory. Finally, we will briefly shift the focus to the more constrained setting of hypernym discovery, and study a sense-aware supervised framework for the task that is robust and effective, even when trained on heterogeneous OIE-derived hypernymic knowledge.