Mining Meaning from Wikipedia
Wikipedia is a goldmine of information, not just for its many readers, but
also for the growing community of researchers who recognize it as a resource of
exceptional scale and utility. It represents a vast investment of manual effort
and judgment: a huge, constantly evolving tapestry of concepts and relations
that is being applied to a host of tasks.
This article provides a comprehensive description of this work. It focuses on
research that extracts and makes use of the concepts, relations, facts and
descriptions found in Wikipedia, and organizes the work into four broad
categories: applying Wikipedia to natural language processing; using it to
facilitate information retrieval; using it for information extraction; and
treating it as a resource for ontology building. The article addresses how Wikipedia is being used as is,
how it is being improved and adapted, and how it is being combined with other
structures to create entirely new resources. We identify the research groups
and individuals involved, and how their work has developed in the last few
years. We provide a comprehensive list of the open-source software they have
produced.
Comment: An extensive survey of re-using information in Wikipedia in natural
language processing, information retrieval and extraction, and ontology
building. Accepted for publication in International Journal of Human-Computer
Studies.
Commonsense knowledge acquisition and applications
Computers are increasingly expected to make smart decisions based on what humans consider commonsense. This would require computers to understand their environment, including properties of objects in the environment (e.g., a wheel is round), relations between objects (e.g., two wheels are part of a bike, or a bike is slower than a car) and interactions of objects (e.g., a driver drives a car on the road).
The goal of this dissertation is to investigate automated methods for acquisition of large-scale, semantically organized commonsense knowledge. Prior state-of-the-art methods to acquire commonsense are either not automated or based on shallow representations. Thus, they cannot produce large-scale, semantically organized commonsense knowledge.
To achieve the goal, we divide the problem space into three research directions, constituting our core contributions:
1. Properties of objects: acquisition of properties like hasSize, hasShape, etc. We develop WebChild, a semi-supervised method to compile semantically organized properties.
2. Relationships between objects: acquisition of relations like largerThan, partOf, memberOf, etc. We develop CMPKB, a linear-programming based method to compile comparative relations, and, we develop PWKB, a method based on statistical and logical inference to compile part-whole relations.
3. Interactions between objects: acquisition of activities like drive a car, park a car, etc., with attributes such as temporal or spatial attributes. We develop Knowlywood, a method based on semantic parsing and probabilistic graphical models to compile activity knowledge.
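The statistical and logical inference behind PWKB is not detailed here, but one standard logical component for part-whole relations is closing them under transitivity. A minimal sketch of only that component, with illustrative triples not taken from the dissertation:

```python
def transitive_closure(pairs):
    """Close a set of (part, whole) pairs under transitivity:
    if (a, b) and (b, c) hold, then (a, c) holds."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Illustrative partOf assertions
parts = {("wheel", "bike"), ("spoke", "wheel")}
print(transitive_closure(parts))
# includes ("spoke", "bike") by transitivity
```

A real system would combine such inference with statistical confidence scores on each extracted pair, as the abstract indicates.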
Together, these methods result in the construction of a large, clean and semantically organized Commonsense Knowledge Base that we call WebChild KB.
Computers are increasingly expected to make smart decisions based on commonsense knowledge. This requires that computers understand their environment, including properties of objects (e.g., a wheel is round), relations between objects (e.g., a bike has two wheels, a bike is slower than a car) and interactions of objects (e.g., a driver drives a car on the road).
The goal of this dissertation is to create automated methods for the acquisition of large-scale, semantically organized commonsense knowledge. This is difficult owing to the following properties of commonsense knowledge. It is: (i) implicit and sparse, since people do not explicitly state the obvious, (ii) multimodal, since it is distributed over textual and visual content, (iii) affected by reporting bias, since unusual facts are reported disproportionately often, (iv) context-dependent, and therefore carries limited statistical confidence.
Prior methods in this field are either not automated or based on shallow representations. Thus, they cannot produce large-scale, semantically organized commonsense knowledge.
To achieve our goal, we divide the problem space into three research directions, which form the core contribution of this dissertation:
1. Properties of objects: acquisition of properties like hasSize, hasShape, etc. We develop WebChild, a semi-supervised method to compile semantically organized properties.
2. Relationships between objects: acquisition of relations like largerThan, partOf, memberOf, etc. We develop CMPKB, a linear-programming based method to compile comparative relations. Furthermore, we develop PWKB, a method based on statistical and logical inference to compile part-whole relations.
3. Interactions between objects: acquisition of activities like drive a car, park a car, etc. with temporal and spatial attributes. We develop Knowlywood, a method based on semantic parsing and probabilistic graphical models to compile activity knowledge.
As a result of these methods, we construct a large, clean and semantically organized commonsense knowledge base, which we call WebChild KB.
Statistical Extraction of Multilingual Natural Language Patterns for RDF Predicates: Algorithms and Applications
The Data Web has undergone a tremendous growth period.
It currently consists of more than 3300 publicly available knowledge bases describing millions of resources from various domains, such as life sciences, government or geography, with over 89 billion facts.
Similarly, the Document Web has grown to approximately 4.55 billion websites, with 300 million photos uploaded to Facebook and 3.5 billion Google searches performed on average every day.
However, there is a gap between the Document Web and the Data Web: knowledge bases on the Data Web are most commonly extracted from structured or semi-structured sources, whereas the majority of information available on the Web is contained in unstructured sources such as news articles, blog posts, photos and forum discussions.
As a result, data on the Data Web not only misses a significant fraction of the available information but also suffers from a lack of timeliness, since typical extraction methods are time-consuming and can only be carried out periodically.
Furthermore, provenance information is rarely taken into consideration and therefore gets lost in the transformation process.
In addition, users are accustomed to entering keyword queries to satisfy their information needs.
With the availability of machine-readable knowledge bases, lay users could be empowered to issue more specific questions and get more precise answers.
In this thesis, we address the problem of Relation Extraction, one of the key challenges pertaining to closing the gap between the Document Web and the Data Web by four means.
First, we present a distant supervision approach that allows finding multilingual natural language representations of formal relations already contained in the Data Web.
We use these natural language representations to find sentences on the Document Web that contain unseen instances of this relation between two entities.
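The distant-supervision step described above can be sketched as follows: for each formal triple already in a knowledge base, find sentences mentioning both entities and record the connecting surface text as a candidate pattern for the relation. The triples, sentences and helper below are illustrative, not taken from the thesis:

```python
def harvest_patterns(triples, sentences):
    """For each (subject, relation, object) triple, collect the text
    between subject and object mentions as a candidate pattern."""
    patterns = {}
    for subj, rel, obj in triples:
        for sent in sentences:
            if subj in sent and obj in sent:
                start = sent.index(subj) + len(subj)
                end = sent.index(obj)
                if start < end:
                    pattern = sent[start:end].strip()
                    patterns.setdefault(rel, set()).add(pattern)
    return patterns

triples = [("Berlin", "capitalOf", "Germany")]
sentences = ["Berlin is the capital of Germany."]
print(harvest_patterns(triples, sentences))
# {'capitalOf': {'is the capital of'}}
```

Patterns harvested this way can then be matched against new sentences to propose unseen instances of the relation between other entity pairs.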
Second, we address the problem of data actuality by presenting a real-time data stream RDF extraction framework and utilize this framework to extract RDF from RSS news feeds.
Third, we present a novel fact validation algorithm based on natural language representations that can not only verify or falsify a given triple, but also find trustworthy sources for it on the Web and estimate a time scope in which the triple holds true.
The features used by this algorithm to determine if a website is indeed trustworthy are used as provenance information and therewith help to create metadata for facts in the Data Web.
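A fact validation step along these lines can be sketched as scoring a triple by the summed trust of sources in which a known relation pattern connects the subject and object. The trust values, URLs and matching scheme below are illustrative assumptions, not the thesis's actual features:

```python
def validate(subj, obj, patterns, sources):
    """Score a triple by summing the trust of sources whose text
    contains subject, a known relation pattern, and object in sequence."""
    score = 0.0
    evidence = []
    for url, trust, text in sources:
        if any(f"{subj} {p} {obj}" in text for p in patterns):
            score += trust
            evidence.append(url)  # record the supporting source
    return score, evidence

patterns = ["is the capital of"]
sources = [
    ("http://example.org/a", 0.9, "Berlin is the capital of Germany."),
    ("http://example.org/b", 0.2, "Berlin is a city."),
]
score, evidence = validate("Berlin", "Germany", patterns, sources)
# score == 0.9; evidence lists only the supporting source
```

The recorded evidence per triple is what can then be stored as provenance metadata alongside the fact.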
Finally, we present a question answering system that uses the natural language representations to map natural language questions to formal SPARQL queries, allowing lay users to make use of the large amounts of data available on the Data Web to satisfy their information needs.
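Mapping a natural-language question to a SPARQL query via such patterns can be sketched as template instantiation. The pattern table and the DBpedia-style prefixes below are illustrative assumptions, not the system's actual pattern inventory:

```python
import re

# Illustrative mapping from surface patterns to RDF predicates
PATTERNS = {
    r"what is the capital of (\w+)": "dbo:capital",
}

def to_sparql(question):
    """Translate a question into a SPARQL query using pattern templates."""
    for pattern, predicate in PATTERNS.items():
        m = re.match(pattern, question.lower().rstrip("?"))
        if m:
            entity = m.group(1).capitalize()
            return f"SELECT ?x WHERE {{ dbr:{entity} {predicate} ?x }}"
    return None  # no known pattern matched the question

print(to_sparql("What is the capital of Germany?"))
# SELECT ?x WHERE { dbr:Germany dbo:capital ?x }
```

A real system would resolve entity mentions against the knowledge base rather than capitalizing the surface string, but the pattern-to-predicate mapping is the core idea.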
The Value of Everything: Ranking and Association with Encyclopedic Knowledge
This dissertation describes WikiRank, an unsupervised method of assigning relative values to elements of a broad-coverage encyclopedic information source in order to identify those entries that may be relevant to a given piece of text. The valuation given to an entry is based not on textual similarity but instead on the links that associate entries, and an estimation of the expected frequency of visitation that would be given to each entry based on those associations in context. This estimation of the relative frequency of visitation is embodied in modifications to the random-walk interpretation of the PageRank algorithm. WikiRank is an effective algorithm to support natural language processing applications. First, it is shown to exceed the performance of previous machine learning algorithms for the task of automatic topic identification, providing results comparable to those of human annotators. Second, WikiRank is found useful for the task of recognizing text-based paraphrases on a semantic level, by comparing the distribution of attention generated by two pieces of text using the encyclopedic resource as a common reference. Finally, WikiRank is shown to have the ability to use its base of encyclopedic knowledge to recognize terms from different ontologies as describing the same thing, thus allowing for the automatic generation of mapping links between ontologies. The conclusion of this thesis is that the "knowledge access heuristic" is valuable and that a ranking process based on a large encyclopedic resource can form the basis for an extendable general-purpose mechanism capable of identifying relevant concepts by association, which in turn can be effectively utilized for enumeration and comparison at a semantic level.
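The dissertation's specific modifications to PageRank are not given here, but the baseline random-walk computation it builds on is plain power iteration over a link graph. A minimal sketch, with an illustrative three-node graph standing in for encyclopedia entries:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration for PageRank on a dict of node -> list of outlinks."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # teleportation mass shared uniformly by all nodes
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
            else:  # dangling node: redistribute its mass uniformly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# Illustrative link graph among three entries
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# "C" is linked from both A and B, so it receives the highest rank
```

WikiRank's context-sensitive estimation would bias this walk toward entries associated with the input text, rather than treating all links uniformly as above.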