The Semantic Web promises a way of linking distributed information at a granular
level by interconnecting compact data items instead of complete HTML pages. New
data is gradually being added to the Semantic Web, but there is a need to incorporate existing
knowledge. This thesis explores ways to convert a coherent body of information
from various structured and unstructured formats into the necessary graph form. The
transformation work crosses several currently active disciplines, and there are further
research questions that can be addressed once the graph has been built.
Hybrid databases, such as the cultural heritage one used here, consist of structured
relational tables associated with free text documents. Access to the data is hampered by
complex schemas, confusing terminology and difficulties in searching the text effectively.
This thesis describes how hybrid data can be unified by assembly into a graph.
A major component task is the conversion of relational database content to RDF. This
is an active research field, to which this work contributes by examining weaknesses in
some existing methods and proposing alternatives.
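
As an illustration only, the simplest form of relational-to-RDF conversion maps each row to a subject and each non-key column to a predicate. The sketch below is a minimal version of that naive mapping, not the method proposed in this thesis; the base URI, the table and column names, and the sample rows are all hypothetical.

```python
# Hypothetical sketch of a naive row-to-triple mapping.
# BASE, the table name, the column names and the rows are illustrative assumptions.
BASE = "http://example.org/"


def rows_to_triples(table, key, rows):
    """Yield one (subject, predicate, object) triple per non-null, non-key cell."""
    for row in rows:
        subject = f"{BASE}{table}/{row[key]}"
        for column, value in row.items():
            if column == key or value is None:
                continue  # the key becomes the subject; nulls produce no triple
            yield (subject, f"{BASE}{table}#{column}", value)


rows = [
    {"id": 1, "name": "Site A", "period": "Bronze Age"},
    {"id": 2, "name": "Site B", "period": None},
]
triples = list(rows_to_triples("site", "id", rows))
```

A mapping of this kind reproduces the relational schema in the graph without simplifying it, which is one reason a more considered conversion may be preferable.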
The next significant element of the work is an attempt to extract structure automatically
from English text using natural language processing methods. The first claim
made is that the semantic content of the text documents can be adequately captured as
a set of binary relations forming a directed graph. It is shown that the data can then
be grounded using existing domain thesauri, by building an upper ontology structure
from these. A schema for cultural heritage data is proposed, intended to be generic for
that domain and as compact as possible.
Another hypothesis is that use of a graph will assist retrieval. The structure is
uniform and very simple, and the graph can be queried even if the predicates (or edge
labels) are unknown. Additional benefits of the graph structure are examined, such as
using path length between nodes as a measure of relatedness (unavailable in a relational
database, where there is no equivalent concept of locality), and building information
summaries by grouping the attributes of nodes that share predicates.
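
Two of these ideas can be sketched in a few lines, assuming the graph is held as a plain list of (subject, predicate, object) triples. The node and predicate names are hypothetical, and treating edges as undirected when measuring path length is an assumption made for this illustration, not a statement of the approach taken in the thesis.

```python
from collections import deque

# Hypothetical triples; identifiers are illustrative only.
triples = [
    ("site:1", "locatedIn", "place:X"),
    ("site:2", "locatedIn", "place:X"),
    ("site:1", "hasType", "type:T"),
    ("find:9", "foundAt", "site:2"),
]


def edges_touching(node, triples):
    """Return every triple that mentions node, whatever the predicate is."""
    return [t for t in triples if node in (t[0], t[2])]


def path_length(start, goal, triples):
    """Fewest edges between two nodes, ignoring edge direction (an assumption).

    Returns None when the nodes are not connected.
    """
    neighbours = {}
    for s, _, o in triples:
        neighbours.setdefault(s, set()).add(o)
        neighbours.setdefault(o, set()).add(s)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:  # breadth-first search reaches the goal by a shortest path
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Here `edges_touching` needs no knowledge of the predicate vocabulary, and a smaller `path_length` can be read as a closer relationship between two nodes.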
These claims are tested by comparing queries across the original and the new
data structures. The graph must correctly answer the queries that the original
database handled, and should also return valid answers to queries that could
not previously be answered or whose results were incomplete.