Grammatical Functions and Possibilistic Reasoning for the Extraction and Representation of Semantic Knowledge in Text Documents
This study seeks to explore and develop innovative methods for the extraction of semantic knowledge from unlabelled written English documents and the representation of this knowledge using a formal mathematical expression to facilitate its use in practical applications.
The first method developed in this research focuses on semantic information extraction. To perform this task, the study introduces a natural language processing (NLP) method designed to extract information-rich keywords from English sentences. The method first learns a set of rules that guide the extraction of keywords from parts of sentences. Once this learning stage is complete, the method extracts keywords from complete sentences by matching each sentence to the most similar sequence of rules. The key innovation in this method is the use of a part-of-speech hierarchy: by raising words to increasingly general grammatical categories in this hierarchy, the system can compare rules, compute the degree of similarity between them, and learn new rules.
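The idea of comparing rules by raising words through a part-of-speech hierarchy can be sketched as follows. The hierarchy, the tag names, and the similarity scoring below are illustrative assumptions for exposition, not the thesis's actual definitions:

```python
# Illustrative part-of-speech hierarchy: each tag maps to a more general
# category, with "ANY" as the implicit root. (Hypothetical, Penn-style tags.)
POS_HIERARCHY = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ",
    "NOUN": "ANY", "VERB": "ANY", "ADJ": "ANY",
}

def ancestors(tag):
    """Return the tag followed by all of its more general categories, root last."""
    chain = [tag]
    while tag in POS_HIERARCHY:
        tag = POS_HIERARCHY[tag]
        chain.append(tag)
    return chain

def tag_similarity(a, b):
    """1.0 for identical tags; lower as more generalisation steps are needed."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    common = [t for t in chain_a if t in chain_b]
    if not common:
        return 0.0
    # Total generalisation steps to reach the most specific shared category.
    steps = chain_a.index(common[0]) + chain_b.index(common[0])
    return 1.0 / (1 + steps)

def rule_similarity(rule_a, rule_b):
    """Average tag-by-tag similarity of two equal-length tag sequences."""
    if len(rule_a) != len(rule_b):
        return 0.0
    return sum(tag_similarity(a, b) for a, b in zip(rule_a, rule_b)) / len(rule_a)
```

Under this toy scoring, "NN" and "NNS" are fairly similar (both generalise to "NOUN" in one step each), while "NN" and "VB" only meet at the root and score much lower; two rules can then be matched even when their tags differ at the surface level.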
The second method developed in this study addresses the problem of knowledge representation. This method processes triplets of keywords through several successive steps to represent information contained in the triplets using possibility distributions. These distributions represent the possibility of a topic given a particular triplet of keywords. Using this methodology, the information contained in the natural language triplets can be quantified and represented in a mathematical format, which can be easily used in a number of applications, such as document classifiers.
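A minimal sketch of what a possibility distribution over topics might look like for one keyword triplet. The derivation from co-occurrence counts and the example data are assumptions for illustration, not the thesis's construction; the sketch only shows the max-normalisation that distinguishes a possibility distribution from a probability distribution:

```python
def possibility_distribution(counts):
    """Map topic -> count into topic -> possibility degree.

    In possibility theory a distribution is normalised so that its maximum
    equals 1.0; unlike a probability distribution, the degrees need not sum to 1.
    """
    peak = max(counts.values())
    return {topic: c / peak for topic, c in counts.items()}

# Hypothetical co-occurrence counts of the triplet ("bank", "interest", "loan")
# with each candidate topic in some corpus.
counts = {"finance": 40, "geography": 5, "sports": 1}
pi = possibility_distribution(counts)
# The most supported topic receives possibility degree 1.0.
```

Represented this way, each triplet becomes a small numeric object that a downstream application such as a document classifier can combine and compare directly.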
In further extensions to the research, a theoretical justification and mathematical development for both methods are provided, and examples are given to illustrate these notions. Sample applications are also developed based on these methods, and the experimental results generated through these implementations are presented and thoroughly analyzed to confirm that the methods are reliable in practice.
Language-independent pre-processing of large document bases for text classification
Text classification is a well-known topic in the research of knowledge discovery in databases. Algorithms for text classification generally involve two stages. The first is concerned with the identification of textual features (i.e. words and/or phrases) that may be relevant to the classification process. The second is concerned with classification rule mining and the categorisation of "unseen" textual data. The first stage is the subject of this thesis; it often involves an analysis of text that is language-specific (and possibly domain-specific), and that may also be computationally costly, especially when dealing with large datasets. Existing approaches to this stage are therefore not generally applicable to all languages. In this thesis, we examine a number of alternative keyword selection methods and phrase generation strategies, coupled with two potential significant word list construction mechanisms and two final significant word selection mechanisms, to identify those words and/or phrases in a given textual dataset that can be expected to distinguish between classes using simple, language-independent statistical properties. We present experimental results, using common (large) textual datasets in two distinct languages, to show that the proposed approaches can produce good performance with respect to both classification accuracy and processing efficiency. In other words, the study presented in this thesis demonstrates that the traditional text classification problem can be solved efficiently in a language-independent (and domain-independent) manner.
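The notion of selecting significant words purely from language-independent statistical properties can be sketched as follows. The contrast score (spread of a token's relative frequency across classes) and the whitespace tokenisation are illustrative assumptions, not the thesis's actual mechanisms; the point is that no stemming, stop lists, or other language-specific resources are used:

```python
from collections import Counter

def class_frequencies(docs_by_class):
    """Relative token frequencies per class, from raw whitespace-split text."""
    freqs = {}
    for label, docs in docs_by_class.items():
        counts = Counter(tok for doc in docs for tok in doc.lower().split())
        total = sum(counts.values())
        freqs[label] = {tok: c / total for tok, c in counts.items()}
    return freqs

def significant_words(docs_by_class, top_n=10):
    """Rank tokens by how unevenly they are distributed across classes."""
    freqs = class_frequencies(docs_by_class)
    vocab = {tok for f in freqs.values() for tok in f}

    def contrast(tok):
        vals = [f.get(tok, 0.0) for f in freqs.values()]
        return max(vals) - min(vals)  # large spread = class-discriminating

    return sorted(vocab, key=contrast, reverse=True)[:top_n]

# Hypothetical two-class dataset; function words like "the" appear in both
# classes with similar frequency and so score near the bottom.
docs = {
    "sport": ["the match ended in a draw", "the team won the match"],
    "finance": ["the bank raised the interest rate", "the market fell"],
}
top = significant_words(docs, top_n=3)
```

Because the score depends only on token counts, the same code applies unchanged to a dataset in any language, which is the property the thesis exploits.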