
    Semantically enhanced document clustering

    This thesis advocates the view that traditional document clustering can be significantly improved by representing documents at different levels of abstraction at which the similarity between documents is considered. The improvement is with regard to the alignment of the clustering solutions with human judgement. The proposed methodology employs semantics with which the conceptual similarity between documents is measured. The goal is to design algorithms which implement the methodology in order to solve the following research problems: (i) how to obtain multiple deterministic clustering solutions; (ii) how to produce coherent large-scale clustering solutions across domains, regardless of the number of clusters; (iii) how to obtain clustering solutions which align well with human judgement; and (iv) how to produce specific clustering solutions from the perspective of the user's understanding of the domain of interest. The developed clustering methodology enhances separation between clusters and improves coherence within clusters generated across several domains by using levels of abstraction. The methodology employs a semantically enhanced text stemmer, developed for the purpose of producing coherent clustering, and a concept index that provides a generic document representation with reduced dimensionality. These characteristics of the methodology make it feasible to apply computationally expensive similarity measures such as the Earth Mover's Distance (EMD), which theoretically aligns the clustering solutions more closely with human judgement, thereby addressing the limitations of traditional text document clustering. A threshold for similarity between documents that employs many-to-many similarity matching is proposed and experimentally shown to help traditional clustering algorithms produce clustering solutions aligned more closely with human judgement.
The experimental validation demonstrates the scalability of the semantically enhanced document clustering methodology and supports the contributions: (i) multiple deterministic clustering solutions and different viewpoints on a document collection are obtained; (ii) the use of concept indexing as a document representation technique in the domain of document clustering is beneficial for producing coherent clusters across domains; (iii) the SETS algorithm provides improved text normalisation by using external knowledge; (iv) a method is provided for measuring similarity between documents on a large scale by using many-to-many matching; (v) a semantically enhanced methodology employs levels of abstraction that correspond to a user's background, understanding and motivation. The achieved results will benefit the research community working in the areas of document management, information retrieval, data mining and knowledge management.
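The many-to-many similarity matching with a threshold described above can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: documents are assumed to be represented as lists of concept vectors (e.g. rows of a concept index), the score averages best matches in both directions, and the threshold value is an arbitrary placeholder.

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def many_to_many_similarity(doc_a, doc_b):
    # Each document is a list of concept vectors. Every concept in one
    # document is matched against every concept in the other (many-to-many),
    # and the best matches in both directions are averaged, making the
    # score symmetric.
    a_to_b = sum(max(cosine(u, v) for v in doc_b) for u in doc_a) / len(doc_a)
    b_to_a = sum(max(cosine(v, u) for u in doc_a) for v in doc_b) / len(doc_b)
    return (a_to_b + b_to_a) / 2

def same_cluster(doc_a, doc_b, threshold=0.5):
    # The proposed threshold: documents count as similar enough to cluster
    # together only if their many-to-many score clears it. 0.5 is a
    # placeholder, not a value from the thesis.
    return many_to_many_similarity(doc_a, doc_b) >= threshold
```

A full EMD-based measure would additionally weight each concept match by a ground distance over the concept space; the averaged best-match form above is the simplest many-to-many variant.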

    Classification management and use in a networked environment : the case of the Universal Decimal Classification

    In the Internet information space, advanced information retrieval (IR) methods and automatic text processing are used in conjunction with traditional knowledge organization systems (KOS). New information technology provides a platform for better KOS publishing, exploitation and sharing, for both human and machine use. Networked KOS services are now being planned and developed as powerful tools for resource discovery. They will enable automatic contextualisation, interpretation and query matching to different indexing languages. The Semantic Web promises to be an environment in which the quality of semantic relationships in bibliographic classification systems can be fully exploited. Their use in the networked environment is, however, limited by the fact that they are not prepared or made available for advanced machine processing. The UDC was chosen for this research because of its widespread use and its long-term presence in online information retrieval systems. It was also the first system to be used for the automatic classification of Internet resources, and the first to be made available as a classification tool on the Web. The objective of this research is to establish the advantages of using the UDC for information retrieval in a networked environment, to highlight the problems of automation and classification exchange, and to offer possible solutions. The first research question was: is there enough evidence of the use of classification on the Internet to justify further development with this particular environment in mind? The second: what are the automation requirements for the full exploitation of the UDC and its exchange? The third: which areas are in need of improvement, and what specific recommendations can be made for implementing the UDC in a networked environment?
A summary of changes required in the management and development of the UDC to facilitate its full adaptation for future use is drawn from this analysis.
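As a minimal illustration of what preparing a classification for "advanced machine processing" might involve, a single UDC class can be serialized as a machine-readable record. The field names and values below are assumptions made for the sketch, not the official UDC exchange format; real exchange would build on a standard such as SKOS.

```python
import json

# One UDC class as an illustrative machine-readable record. "notation",
# "caption", and "broader" are assumed field names; the caption text is
# likewise illustrative rather than quoted from the UDC schedules.
udc_class = {
    "notation": "025.45",  # the class number
    "caption": "Universal systems of library classification",
    "broader": "025.4",    # notation of the parent class
    "language": "en",
}

# Serializing with stable key order gives a record that downstream
# services can parse, compare, and version.
record = json.dumps(udc_class, sort_keys=True)
```

A networked KOS service would expose many such records, letting retrieval systems resolve a notation to its caption and walk broader/narrower links automatically.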

    Annotation persistence over dynamic documents

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 2005. Includes bibliographical references (p. 212-216). Annotations, as a routine practice of actively engaging with reading materials, are heavily used in the paper world to augment the usefulness of documents. By annotation, we mean a large variety of creative manipulations by which the otherwise passive reader becomes actively involved in a document. Annotations in digital form possess many benefits paper annotations do not, such as annotation searching, annotation multi-referencing, and annotation sharing. The digital form also introduces challenges to the process of annotation. This study looks at one of them: annotation persistence over dynamic documents. With the development of annotation software, users now have the opportunity to annotate documents which they don't own, or to which they don't have write permission. In annotation software, annotations are normally created and saved independently of the document, and the owners of the documents being annotated may have no knowledge that third parties are annotating their contents. When document contents are modified, annotation software faces a difficult situation in which annotations need to be reattached. Reattaching annotations in a revised version of a document is a crucial component of annotation system design. Annotation persistence over document versions is a complicated and challenging problem, as documents can go through various changes between versions. In this thesis, we treat annotation persistence over dynamic documents as a specialized information retrieval problem. We then design a scheme to reposition annotations between versions by three mechanisms: meta-structure information matching, keyword matching, and content semantics matching. Content semantics matching is the determining factor in our annotation persistence scheme design.
Latent Semantic Analysis, an innovative information retrieval model, is used to extract and compare document semantics. Two editions of an introductory computer science textbook are used to evaluate the annotation persistence scheme proposed in this study. The evaluation provides substantial evidence that the proposed scheme makes the right decisions on repositioning annotations based on their degree of modification, i.e. it reattaches annotations if modifications are light, and orphans annotations if modifications are heavy. By Shaomin Wang, Ph.D.
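The reattach-or-orphan decision described above can be sketched as follows. This is an illustration only: it substitutes raw term-vector cosine similarity for the thesis's LSA projection, and the reattach threshold is an arbitrary placeholder rather than a tuned value.

```python
from collections import Counter
from math import sqrt

def term_cosine(text_a, text_b):
    # Stand-in for the thesis's LSA step: cosine similarity over raw term
    # counts. The real scheme projects both passages into an LSA space
    # first, so paraphrases score higher than they do here.
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(c * c for c in a.values()))
    nb = sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def reposition(anchor_text, candidates, reattach_at=0.7):
    # anchor_text: the passage the annotation pointed at in the old
    # version; candidates: passages from the revised version. Returns the
    # best candidate if the modification is light, or None (the annotation
    # is orphaned) if the modification is heavy.
    score, best = max((term_cosine(anchor_text, c), c) for c in candidates)
    return best if score >= reattach_at else None
```

The same structure extends naturally to the full scheme: compute meta-structure and keyword scores alongside the semantic score and combine them before applying the threshold.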

    Computational Methods for Analyzing Health News Coverage

    Researchers who investigate the media's coverage of health have historically relied on keyword searches to retrieve relevant health news coverage and on manual content analysis to categorize and score health news text. These methods are problematic. Manual content analysis is labor-intensive, time-consuming, and inherently subjective because it relies on human coders to review, score, and annotate content. Retrieving relevant health news coverage using keywords is challenging because manually defining an optimal keyword query, especially for complex health topics and media analysis concepts, can be very difficult, and the optimal query may vary with when the news was published, the type of news published, and the target audience of the coverage. This dissertation investigated computational methods that can assist health news investigators by facilitating these tasks. The first step was to identify the research methods currently used by investigators, and the research questions and health topics researchers tend to investigate. To capture this information, an extensive literature review of health news analyses was performed; no review of this type and scope could be found in the research literature. The review confirmed that researchers overwhelmingly rely on manual content analysis to analyze the text of health news coverage, and on keyword searching to identify relevant health news articles. To investigate the use of computational methods for facilitating these tasks, classifiers that categorize health news on relevance to the topic of obesity, and on their news framing, were developed and evaluated. The obesity news classifier developed for this dissertation outperformed alternative methods, including searching based on keyword appearance. Classifying on the framing of health news proved to be a more difficult task.
The news framing classifiers performed well, but the results suggest that the underlying features of health news coverage that contribute to its framing are a richer and more useful source of framing information than binary news framing classifications. The third step in this dissertation was to use the findings of the literature review and the classifier studies to design the SalientHealthNews system. The purpose of SalientHealthNews is to facilitate the use of computational and data mining techniques for health news investigation, hypothesis testing, and hypothesis generation. To illustrate its features and algorithms, SalientHealthNews was used to generate preliminary data for a study investigating how framing features vary in health and obesity news coverage that discusses populations with health disparities. This research contributes to the study of the media's coverage of health, first by providing a detailed description of how health news is studied and which health news topics are investigated, then by demonstrating that certain tasks performed in health news analyses can be facilitated by computational methods, and lastly by describing the design of a system that will facilitate the use of computational and data mining techniques for the study of health news. These contributions should further the study of health news by expanding the methods available to health news analysis researchers, leaving researchers better equipped to accurately and consistently evaluate the media's coverage of health. Knowledge of the quality of health news coverage should in turn lead to better-informed health journalists, healthcare providers, and healthcare consumers, ultimately improving individual and public health.
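As an illustration of the kind of relevance classifier the dissertation compares against keyword search, the sketch below trains a minimal multinomial Naive Bayes model on invented example headlines. It is not the dissertation's classifier; the training data, class labels, and Laplace smoothing choice are all assumptions made for the sketch.

```python
from collections import Counter
from math import log

class NaiveBayesRelevance:
    # Minimal multinomial Naive Bayes with Laplace smoothing; a sketch of
    # the technique, not the dissertation's implementation.

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.counts.values() for w in c}
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        for c in self.classes:
            total = sum(self.counts[c].values())
            lp = log(self.priors[c])
            for w in doc.lower().split():
                # Laplace smoothing keeps unseen words from zeroing a class.
                lp += log((self.counts[c][w] + 1) / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

# Invented toy training data, for illustration only.
train_docs = [
    "obesity rates rise in children",
    "new obesity drug trial results",
    "stock market closes higher",
    "election results announced today",
]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
clf = NaiveBayesRelevance().fit(train_docs, train_labels)
```

Unlike a keyword query, a trained model of this kind weighs all words in an article, so relevance does not hinge on the presence of a single hand-picked term.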

    An improved method for text summarization using lexical chains

    This work is directed toward the creation of a system for automatically summarizing documents by extracting selected sentences. Several heuristics, including position, cue words, and title words, are used in conjunction with lexical chain information to create a salience function that is used to rank sentences for extraction. Compiler technology, including the Flex and Bison tools, is used to create the AutoExtract summarizer, which extracts and combines this information from the raw text. The WordNet database is used for the creation of the lexical chains. The AutoExtract summarizer performed better than the Microsoft Word97 AutoSummarize tool and the Sinope commercial summarizer in tests against ideal extracts and in tests judged by humans.
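The salience function described above can be sketched as a weighted combination of the named heuristics. The weights, word lists, and the flat word-set stand-in for WordNet lexical chains are illustrative assumptions, not the values or chain-building algorithm used by AutoExtract.

```python
def salience(sentence, index, n_sentences, title_words, cue_words, chain_words):
    # Weighted sum of the heuristics named in the abstract: sentence
    # position, cue words, title words, and lexical-chain membership.
    # The weights (2.0, 1.5, 1.5, 1.0) are illustrative placeholders.
    words = set(sentence.lower().split())
    position = 1.0 - index / n_sentences      # earlier sentences score higher
    cue = len(words & cue_words)
    title = len(words & title_words)
    chain = len(words & chain_words)          # stand-in for WordNet chains
    return 2.0 * position + 1.5 * cue + 1.5 * title + 1.0 * chain

def extract_summary(sentences, title_words, cue_words, chain_words, k=2):
    # Rank all sentences by salience, keep the top k, and emit them in
    # their original document order.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: salience(sentences[i], i, len(sentences),
                               title_words, cue_words, chain_words),
        reverse=True,
    )
    keep = sorted(ranked[:k])
    return [sentences[i] for i in keep]
```

Re-emitting the chosen sentences in document order, as the last step does, is what makes a sentence-extraction summary read coherently.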

    24th Nordic Conference on Computational Linguistics (NoDaLiDa)


    The Future of Information Sciences : INFuture2009 : Digital Resources and Knowledge Sharing

    Get PDF