A Corpus-based Approach to the Chinese Word Segmentation
For a society based upon laws and reason, it has become too easy for us to believe
that we live in a world without them. And given that our linguistic wisdom was
originally motivated by the search for rules, it seems strange that we now consider
these rules to be the exceptions and take the exceptions as the norm.
The task of contemporary computational linguistics is to describe these
exceptions. In particular, for most language processing needs it suffices to
describe the argument and predicate within an elementary sentence, under the
framework of local grammar. Therefore, a corpus-based approach to the Chinese
Word Segmentation problem is proposed as the first step towards a local grammar
for the Chinese language.
The two main issues with existing lexicon-based approaches are (a) the classification
of unknown character sequences, i.e. sequences that are not listed in
the lexicon, and (b) the disambiguation of situations where two candidate words
overlap.
For (a), we propose an automatic method of enriching the lexicon by comparing
candidate sequences to occurrences of the same strings in a manually segmented
reference corpus, and by using machine learning methods to select the optimal
segmentation for them. These methods are developed in the course of the thesis
specifically for this task. The possibility of applying these machine learning
methods to the NP-extraction and alignment domains will also be discussed.
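The corpus-lookup idea above can be sketched roughly as follows. This is an illustrative sketch only, not the thesis's actual method: the corpus representation (sentences as lists of manually segmented word tokens) and all function names are assumptions, and a simple majority vote stands in for the machine learning component.

```python
from collections import Counter

def segmentation_votes(candidate, segmented_corpus):
    """Count how a manually segmented reference corpus segments `candidate`.

    segmented_corpus: iterable of sentences, each a list of word tokens.
    Returns a Counter mapping each observed segmentation of `candidate`
    (a tuple of the reference words covering it) to its frequency.
    """
    votes = Counter()
    for sentence in segmented_corpus:
        text = "".join(sentence)
        start = text.find(candidate)
        while start != -1:
            # Recover which reference words overlap this occurrence.
            covering, pos = [], 0
            for word in sentence:
                end = pos + len(word)
                if pos < start + len(candidate) and end > start:
                    covering.append(word)
                pos = end
            votes[tuple(covering)] += 1
            start = text.find(candidate, start + 1)
    return votes

def best_segmentation(candidate, segmented_corpus):
    """Pick the most frequent reference segmentation, if any."""
    votes = segmentation_votes(candidate, segmented_corpus)
    return max(votes, key=votes.get) if votes else None
```

For example, given the candidate 研究生 and a corpus where it is twice segmented as one word and once split across 研究 and 生物, the majority vote keeps it as a single lexicon entry.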
(b) is approached by designing a general processing framework for Chinese text,
which will be called multi-level processing. Under this framework, sentences are
recursively split into fragments according to language-specific but
domain-independent heuristics. The resulting fragments then define the ultimate
boundaries between candidate words and therefore resolve any segmentation
ambiguity caused by overlapping sequences. A new shallow semantic annotation is
also proposed under the framework of multi-level processing.
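A minimal sketch of such recursive fragment splitting is given below. The delimiter levels chosen here (major punctuation first, then minor) are an illustrative assumption; the thesis's actual heuristics are not specified in this abstract. The point is only that, once fragments are fixed, no candidate word can straddle a fragment boundary.

```python
# Hypothetical delimiter levels: split first at sentence-final
# punctuation, then split each resulting piece at minor punctuation.
LEVELS = ["。！？", "，；：", "、"]

def split_fragments(text, level=0):
    """Recursively split `text` into fragments, one delimiter level at a time.

    Returns a flat list of punctuation-free fragments. Word candidates are
    then only sought inside a single fragment, so fragment boundaries act
    as hard word boundaries and remove overlap ambiguities across them.
    """
    if level >= len(LEVELS):
        return [text] if text else []
    fragments = []
    piece = ""
    for ch in text:
        if ch in LEVELS[level]:
            fragments.extend(split_fragments(piece, level + 1))
            piece = ""
        else:
            piece += ch
    fragments.extend(split_fragments(piece, level + 1))
    return fragments
```

For instance, 他说：你好，世界。再见！ is first cut at the full stop and exclamation mark, and the first piece is then cut again at the colon and comma, yielding the four fragments 他说, 你好, 世界, 再见.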
A word segmentation algorithm based on these principles has been implemented
and tested; results of the evaluation are given and compared to the performance of
previous approaches as reported in the literature.
The first chapter of this thesis discusses the goals of segmentation and introduces
some background concepts. The second chapter analyses the current state-of-the-art
approaches to Chinese language segmentation. Chapter 3 proposes a new corpus-based
approach to the identification of unknown words. In Chapter 4, a new shallow
semantic annotation is proposed under the framework of multi-level processing.
Exploiting general-purpose background knowledge for automated schema matching
The schema matching task is an integral part of the data integration process. It is usually the first step in integrating data. Schema matching is typically very complex and time-consuming. It is therefore still largely carried out by humans. One reason for the low degree of automation is the fact that schemas are often defined with deep background knowledge that is not itself present within the schemas. Overcoming the problem of missing background knowledge is a core challenge in automating the data integration process.
In this dissertation, the task of matching semantic models, so-called ontologies, with the help of external background knowledge is investigated in depth in Part I. Throughout this thesis, the focus lies on large, general-purpose resources, since domain-specific resources are rarely available for most domains. Besides new knowledge resources, this thesis also explores new strategies to exploit such resources.
A technical base for the development and comparison of matching systems is presented in Part II. The framework introduced here allows for simple and modularized matcher development (with background knowledge sources) and for extensive evaluations of matching systems.
Among the largest structured sources of general-purpose background knowledge are knowledge graphs, which have grown significantly in size in recent years. However, exploiting such graphs is not trivial. In Part III, knowledge graph embeddings are explored, analyzed, and compared. Multiple improvements to existing approaches are presented.
In Part IV, numerous concrete matching systems which exploit general-purpose background knowledge are presented. Furthermore, exploitation strategies and resources are analyzed and compared. This dissertation closes with a perspective on real-world applications.