From Text to Knowledge
The global information space provided by the World Wide Web has dramatically changed
the way knowledge is shared all over the world. To make this enormous information
space accessible, search engines index the uploaded contents and provide efficient
algorithmic machinery for ranking the importance of documents with respect to an input
query. All major search engines such as Google, Yahoo or Bing are keyword-based, which
is indisputably a powerful tool for accessing information needs centered around documents.
However, this unstructured, document-oriented paradigm of the World Wide Web has serious drawbacks when searching for specific knowledge about real-world entities.
When asking for advanced facts about entities, today's search engines are not very good at providing accurate answers. Hand-built knowledge bases such as Wikipedia or its structured counterpart DBpedia are excellent sources that provide common facts. However, these knowledge bases are far from complete, and most of the knowledge still lies buried in unstructured documents.
Statistical machine learning methods have great potential to help bridge the gap between text and knowledge by (semi-)automatically transforming the unstructured representation of today's World Wide Web into a more structured one. This
thesis is devoted to reducing this gap with Probabilistic Graphical Models. Probabilistic
Graphical Models play a crucial role in modern pattern recognition as they merge two important fields of applied mathematics: Graph Theory and Probability Theory.
The first part of the thesis will present a novel system called Text2SemRel that is able to (semi-)automatically construct knowledge bases from textual document collections. The resulting knowledge base consists of facts centered around entities and their relations.
An essential part of the system is a novel algorithm for extracting relations between entity
mentions that is based on Conditional Random Fields, which are Undirected Probabilistic Graphical Models.
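To make the sequence-labeling view of entity-mention tagging concrete, the following is a minimal sketch. It uses a structured perceptron with Viterbi decoding as a simpler stand-in for the thesis's Conditional Random Fields (both are first-order sequence models over the same feature functions); the labels, features, and toy sentences are invented for illustration, not taken from Text2SemRel.

```python
# Minimal structured sequence tagger with Viterbi decoding.
# NOTE: a structured-perceptron stand-in, not the thesis's actual CRF;
# the label set, feature templates, and toy data are invented.
from collections import defaultdict

LABELS = ["O", "GENE", "DISEASE"]

def features(words, i, prev_label):
    w = words[i]
    return [f"w={w}", f"suf3={w[-3:]}", f"prev={prev_label}",
            f"cap={w[0].isupper()}"]

def viterbi(words, weights):
    # Dynamic programming: best[i][label] = (best score, back-pointer)
    best = [{}]
    for lab in LABELS:
        s = sum(weights[(f, lab)] for f in features(words, 0, "<s>"))
        best[0][lab] = (s, None)
    for i in range(1, len(words)):
        best.append({})
        for lab in LABELS:
            cands = []
            for prev in LABELS:
                s = best[i-1][prev][0] + sum(
                    weights[(f, lab)] for f in features(words, i, prev))
                cands.append((s, prev))
            best[i][lab] = max(cands)
    # Backtrace the highest-scoring label sequence
    lab = max(LABELS, key=lambda l: best[-1][l][0])
    path = [lab]
    for i in range(len(words) - 1, 0, -1):
        lab = best[i][lab][1]
        path.append(lab)
    return list(reversed(path))

def train(sentences, epochs=5):
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in sentences:
            pred = viterbi(words, weights)
            if pred != gold:
                # Perceptron update: reward gold features, penalise predicted ones
                prev_g = prev_p = "<s>"
                for i, (g, p) in enumerate(zip(gold, pred)):
                    for f in features(words, i, prev_g):
                        weights[(f, g)] += 1.0
                    for f in features(words, i, prev_p):
                        weights[(f, p)] -= 1.0
                    prev_g, prev_p = g, p
    return weights

toy = [(["BRCA1", "causes", "cancer"], ["GENE", "O", "DISEASE"]),
       (["TP53", "causes", "sarcoma"], ["GENE", "O", "DISEASE"])]
w = train(toy)
print(viterbi(["BRCA1", "causes", "cancer"], w))
```

A CRF replaces the perceptron update with gradient steps on the conditional log-likelihood, but the feature functions and Viterbi decoder are shared between the two models.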
In the second part of the thesis, we will use the power of Directed Probabilistic Graphical Models to solve important knowledge discovery tasks in semantically annotated large document collections. In particular, we present extensions of the Latent Dirichlet Allocation framework that are able to learn in an unsupervised way the statistical semantic
dependencies between unstructured representations such as documents and their semantic annotations. Semantic annotations of documents might refer to concepts originating from a thesaurus or ontology but also to user-generated informal tags in social tagging
systems. These forms of annotations represent a first step towards the conversion to a more structured form of the World Wide Web.
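As background for these extensions, a minimal collapsed Gibbs sampler for vanilla Latent Dirichlet Allocation can be sketched as follows. This is plain LDA, not the annotation-aware extensions the thesis proposes; the corpus, topic count, and hyperparameters are invented for illustration.

```python
# Collapsed Gibbs sampling for vanilla LDA (stdlib only).
# NOTE: a sketch of the base model, not the thesis's extensions;
# corpus and hyperparameters are invented.
import random
from collections import defaultdict

def lda_gibbs(docs, K=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    # z[d][i]: current topic of the i-th token of document d
    z = [[rng.randrange(K) for _ in d] for d in docs]
    ndk = [[0] * K for _ in docs]               # document-topic counts
    nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
    nk = [0] * K                                # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Full conditional: p(k) propto (ndk+alpha)*(nkw+beta)/(nk+V*beta)
                p = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                     for j in range(K)]
                r = rng.random() * sum(p)
                for j in range(K):
                    r -= p[j]
                    if r <= 0:
                        k = j
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

docs = [["gene", "protein", "gene", "cell"],
        ["tag", "web", "tag", "social"],
        ["gene", "cell", "protein"],
        ["web", "social", "tag"]]
ndk, nkw = lda_gibbs(docs)
```

The thesis's extensions generate the semantic annotations of a document from the same latent topic mixture as its words, so that topics capture the statistical dependencies between the two representations.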
In the last part of the thesis, we demonstrate the large-scale applicability of the proposed fact extraction system Text2SemRel. In particular, we extract semantic relations between genes and diseases from a large biomedical textual repository. The resulting knowledge base contains far more potential disease genes than are currently stored in curated databases. Thus, the proposed system is able to unlock knowledge currently buried in the literature. The literature-derived human gene-disease network is then analyzed against existing state-of-the-art curated databases. We analyze the derived knowledge base quantitatively by comparing it with several curated databases with regard to, among other things, database size and properties of known disease genes. Our experimental analysis shows that the facts extracted from the literature are of high quality.
Information Extraction in Illicit Domains
Extracting useful entities and attribute values from illicit domains such as
human trafficking is a challenging problem with the potential for widespread
social impact. Such domains employ atypical language models, have `long tails'
and suffer from the problem of concept drift. In this paper, we propose a
lightweight, feature-agnostic Information Extraction (IE) paradigm specifically
designed for such domains. Our approach uses raw, unlabeled text from an
initial corpus, and a few (12-120) seed annotations per domain-specific
attribute, to learn robust IE models for unobserved pages and websites.
Empirically, we demonstrate that our approach can outperform feature-centric
Conditional Random Field baselines by over 18% F-Measure on five annotated
sets of real-world human trafficking datasets in both low-supervision and
high-supervision settings. We also show that our approach is demonstrably
robust to concept drift, and can be efficiently bootstrapped even in a serial
computing environment.
Comment: 10 pages, ACM WWW 201
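The core idea of learning from a handful of seed annotations over raw text can be illustrated with a small sketch: represent each candidate token by the corpus contexts it appears in, then accept candidates whose context vector is similar to the seeds'. The corpus, seeds, and attribute are invented, and the paper's actual system learns richer word representations and a supervised classifier on top of them.

```python
# Seed-driven candidate labeling via distributional context similarity.
# NOTE: an illustrative simplification of the paper's approach; the
# corpus, seed set, and candidates below are invented.
import math
from collections import Counter

def context_vectors(sentences, window=2):
    """Map each token to a bag of the tokens seen around it."""
    vecs = {}
    for sent in sentences:
        for i, w in enumerate(sent):
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(a, b):
    num = sum(a[k] * b[k] for k in a if k in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def label_candidates(vecs, seeds, candidates, threshold=0.3):
    # Centroid of the seed annotations' context vectors
    centroid = Counter()
    for s in seeds:
        centroid.update(vecs.get(s, Counter()))
    return {c: cosine(vecs.get(c, Counter()), centroid) >= threshold
            for c in candidates}

corpus = [["ad", "posted", "in", "houston", "yesterday"],
          ["ad", "posted", "in", "dallas", "yesterday"],
          ["ad", "posted", "in", "austin", "today"],
          ["call", "now", "for", "details", "today"]]
vecs = context_vectors(corpus)
labels = label_candidates(vecs, seeds=["houston"], candidates=["dallas", "now"])
print(labels)  # "dallas" shares the seed's contexts; "now" does not
```

Because the representations come from unlabeled text, only the handful of seed labels per attribute requires human effort, which is what makes the approach viable in low-supervision, fast-drifting domains.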
Conceptualising teachers' professional learning with Web 2.0
Purpose – This paper seeks to identify and develop an exploratory framework for conceptualising how teachers might use the affordances of Web 2.0 technologies to support their own professional learning.
Design/methodology/approach – The paper draws on a large corpus of literature and recent research evidence to identify the principal elements and features of professional learning and the underlying affordances of Web 2.0 technologies and applications. It generates an exploratory conceptual framework based on the emerging findings from this review using a socio-cultural theoretical perspective. The framework is explored through three individual illustrations which are drawn from a much larger case study which the author is undertaking within a newly established Academy in the North of England.
Findings – The findings indicate that there is potential value in exploring professional learning with Web 2.0 technologies in the ways described. The framework offers an exploratory instrument to examine how professional learning for teachers could be supported with Web 2.0 technologies in ways that might have significant benefits over traditional methods of continuing professional development (CPD).
Originality/value – The potential value and affordances of Web 2.0 technologies for teachers' professional learning are largely unexplored and under-theorised, and this work seeks to establish a framework for further discussion and empirical exploration.
Artequakt: Generating tailored biographies from automatically annotated fragments from the web
The Artequakt project seeks to automatically generate narrative biographies of artists from knowledge that has been extracted from the Web and maintained in a knowledge base. An overview of the system architecture is presented here and the three key components of that architecture are explained in detail, namely knowledge extraction, information management and biography construction. Conclusions are drawn from the initial experiences of the project and future progress is detailed.
Identifying Web Tables - Supporting a Neglected Type of Content on the Web
The abundance of data on the Internet facilitates the improvement of
extraction and processing tools. The trend in open data publishing
encourages the adoption of structured formats like CSV and RDF. However, there
is still a plethora of unstructured data on the Web which we assume contains
semantics. For this reason, we propose an approach to derive semantics from web
tables, which are still the most popular publishing tool on the Web. The paper
also discusses methods and services for unstructured data extraction and
processing as well as machine learning techniques to enhance such a workflow.
The eventual result is a framework to process, publish and visualize linked
open data. The software enables table extraction from various open data
sources in the HTML format and an automatic export to the RDF format, making
the data linked. The paper also gives an evaluation of machine learning
techniques in conjunction with string similarity functions to be applied in a
table recognition task.
Comment: 9 pages, 4 figures
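The table-to-RDF step described above can be sketched minimally: parse an HTML table, treat the header row as predicates, and emit one triple per body cell. The base URI and sample table are invented, and the paper's framework adds ML-based table recognition and string-similarity matching on top of a conversion like this.

```python
# Sketch: convert an HTML table to N-Triples-style RDF statements.
# NOTE: the base URI and sample table are invented for illustration.
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect an HTML table's cells into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell, self.in_cell = [], [], "", False
    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell, self.cell = True, ""
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
            self.row.append(self.cell.strip())
        elif tag == "tr":
            self.rows.append(self.row)
            self.row = []
    def handle_data(self, data):
        if self.in_cell:
            self.cell += data

def table_to_triples(html, base="http://example.org/"):
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    triples = []
    for i, row in enumerate(body):
        subject = f"<{base}row/{i}>"  # one subject resource per data row
        for pred, value in zip(header, row):
            # Header cell becomes the predicate, body cell the literal value
            triples.append(f'{subject} <{base}{pred.lower()}> "{value}" .')
    return triples

html = """<table>
<tr><th>City</th><th>Population</th></tr>
<tr><td>Berlin</td><td>3644826</td></tr>
<tr><td>Prague</td><td>1301132</td></tr>
</table>"""
for t in table_to_triples(html):
    print(t)
```

The hard part, which the paper's machine learning and string-similarity components address, is deciding whether a given `<table>` is genuine data rather than page layout, and mapping free-text headers onto vocabulary terms.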