393 research outputs found

    Automatic generation of inter-passage links based on semantic similarity

    This paper investigates the use and the predictive potential of semantic similarity measures for the automatic generation of links across different documents and passages. First, the correlation between the way people link content and the results produced by standard semantic similarity measures is investigated. The relation between semantic similarity and document length is then analysed. Based on these findings, a new method for link generation is formulated and tested.
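The idea of proposing links from similarity scores can be illustrated with a minimal sketch. The abstract does not say which similarity measures or thresholds the paper uses, so everything here is an assumption for illustration: bag-of-words cosine similarity stands in for a "standard semantic similarity measure", and the names `tokenize`, `cosine_similarity`, `suggest_links` and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic tokens only (a deliberately crude tokenizer).
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    # Cosine similarity between the term-frequency vectors of two passages.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def suggest_links(passages, threshold=0.3):
    # Propose a link for every passage pair whose similarity reaches the
    # (assumed, untuned) threshold; strongest candidates come first.
    links = []
    for i in range(len(passages)):
        for j in range(i + 1, len(passages)):
            score = cosine_similarity(passages[i], passages[j])
            if score >= threshold:
                links.append((i, j, score))
    return sorted(links, key=lambda link: link[2], reverse=True)


passages = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock markets fell sharply",
]
links = suggest_links(passages)
```

Raw term-frequency vectors tend to grow more alike as texts get longer, which is one reason the relation between similarity and document length matters when choosing a threshold.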

    Applying Wikipedia to Interactive Information Retrieval

    There are many opportunities to improve the interactivity of information retrieval systems beyond the ubiquitous search box. One idea is to use knowledge bases (e.g. controlled vocabularies, classification schemes, thesauri and ontologies) to organize, describe and navigate the information space. These resources are popular in libraries and specialist collections, but have proven too expensive and narrow to be applied to everyday web-scale search. Wikipedia has the potential to bring structured knowledge into more widespread use. This online, collaboratively generated encyclopaedia is one of the largest and most consulted reference works in existence. It is broader, deeper and more agile than the knowledge bases put forward to assist retrieval in the past. Rendering this resource machine-readable is a challenging task that has captured the interest of many researchers. Many see it as a key step required to break the knowledge acquisition bottleneck that crippled previous efforts. This thesis claims that the roadblock can be sidestepped: Wikipedia can be applied effectively to open-domain information retrieval with minimal natural language processing or information extraction. The key is to focus on gathering and applying human-readable rather than machine-readable knowledge. To demonstrate this claim, the thesis tackles three separate problems: extracting knowledge from Wikipedia; connecting it to textual documents; and applying it to the retrieval process. First, we demonstrate that a large thesaurus-like structure can be obtained directly from Wikipedia, and that accurate measures of semantic relatedness can be efficiently mined from it. Second, we show that Wikipedia provides the necessary features and training data for existing data mining techniques to accurately detect and disambiguate topics when they are mentioned in plain text. Third, we provide two systems and user studies that demonstrate the utility of the Wikipedia-derived knowledge base for interactive information retrieval.

    A protocol for constructing a domain-specific ontology for use in biomedical information extraction using lexical-chaining analysis

    To perform more semantics-based information extraction, we require specialized domain models. We develop a hybrid approach for constructing such a domain-specific ontology, which integrates key concepts from the protein-protein interaction domain with the Gene Ontology. In addition, we present a method for using the domain-specific ontology in a discourse-based analysis module for analyzing full-text articles on protein interactions. The analysis module uses a lexical chaining technique to extract strings of semantically related words that represent the topic structure of the text. We show that the domain-specific ontology improves the performance of the lexical-chaining module. In addition, the topic structure represented by the lexical chains contains important information on protein-protein interactions appearing in the same textual context.
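As a rough illustration of lexical chaining in general (not the module described in this abstract), the sketch below greedily groups words into chains of related terms. The `related` dictionary is a hypothetical stand-in for an ontology lookup, and the function name `build_chains` is an assumption.

```python
def build_chains(words, related):
    # Greedy chaining: attach each word to the first existing chain that
    # contains the same word or a related one; otherwise start a new chain.
    chains = []
    for word in words:
        for chain in chains:
            if any(
                word == member
                or word in related.get(member, set())
                or member in related.get(word, set())
                for member in chain
            ):
                chain.append(word)
                break
        else:
            # No existing chain accepted the word.
            chains.append([word])
    return chains


# Hypothetical relatedness map standing in for a domain ontology.
related = {
    "protein": {"kinase", "receptor"},
    "binding": {"interaction"},
}
words = ["protein", "binding", "kinase", "interaction", "cell"]
chains = build_chains(words, related)
```

In this toy setup, richer entries in `related` let more words join existing chains, which is the kind of effect a domain-specific ontology would be expected to have on chaining.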

    Human-competitive automatic topic indexing

    Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor-intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers, and by amateurs. We claim that the algorithm is human-competitive because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages.
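The two-stage shape of such an algorithm (select candidates, then rank them by their properties) can be sketched as follows. This is only an illustration under stated assumptions: the thesis combines properties with a learned model, whereas this sketch uses two simple statistical features (relative frequency and position of first occurrence) with hand-set weights. The names `select_candidates`, `rank_candidates`, `STOPWORDS` and the weight values are hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption, not from the thesis).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "by"}


def select_candidates(text, max_len=2):
    # Stage 1: collect word n-grams that do not start or end with a stopword,
    # recording how often each occurs and where it first appears.
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = {}
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            info = candidates.setdefault(" ".join(gram), {"count": 0, "first": i})
            info["count"] += 1
    return candidates, len(tokens)


def rank_candidates(candidates, n_tokens, weights=(1.0, 0.5)):
    # Stage 2: score each candidate from its properties. The fixed weights
    # stand in for a model that would be learned from human-indexed examples.
    w_freq, w_pos = weights
    scored = []
    for phrase, info in candidates.items():
        frequency = info["count"] / n_tokens
        earliness = 1.0 - info["first"] / n_tokens
        scored.append((w_freq * frequency + w_pos * earliness, phrase))
    # Highest score first; ties are broken alphabetically.
    return [phrase for _, phrase in sorted(scored, key=lambda s: (-s[0], s[1]))]


text = "Topic indexing identifies topics. Topic indexing helps readers."
candidates, n_tokens = select_candidates(text)
ranked = rank_candidates(candidates, n_tokens)
```

Keeping the two stages separate means the candidate source (free n-grams, a controlled vocabulary, or Wikipedia article names) could be swapped without touching the ranking step.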

    Approaches to Automatic Text Structuring

    Structured text helps readers to better understand the content of documents. In classic newspaper texts or books, some structure already exists. In Web 2.0, the amount of textual data, especially user-generated data, has increased dramatically. As a result, there exists a large amount of textual data that lacks structure, which makes it more difficult to understand. In this thesis, we explore techniques for automatic text structuring to help readers fulfill their information needs. Useful techniques for automatic text structuring are keyphrase identification, table-of-contents generation, and link identification. We improve state-of-the-art results for approaches to text structuring on several benchmark datasets. In addition, we present new representative datasets for users' everyday tasks. We evaluate the quality of text structuring approaches with regard to these scenarios and discover that the quality of an approach depends strongly on the dataset to which it is applied. In the first chapter of this thesis, we establish the theoretical foundations regarding text structuring. We describe our findings from a user survey regarding web usage, from which we derive three typical scenarios of Internet users. We then proceed to the three main contributions of this thesis. We evaluate approaches to keyphrase identification both by extracting and assigning keyphrases for English and German datasets. We find that unsupervised keyphrase extraction yields stable results, but for datasets with predefined keyphrases, additional filtering of keyphrases and assignment approaches yield even higher results. We present a decompounding extension, which further improves results for datasets with shorter texts. We construct hierarchical tables of contents for documents from three English datasets and discover that the results for hierarchy identification are sufficient for an automatic system, but for segment title generation, user interaction based on suggestions is required. We investigate approaches to link identification, including the subtasks of identifying the mention (anchor) of the link and linking the mention to an entity (target). Approaches that make use of the Wikipedia link structure perform best, as long as there is sufficient training data available. For identifying links to sense inventories other than Wikipedia, approaches that do not make use of the link structure outperform the approaches using existing links. We further analyze the effect of senses on computing similarities. In contrast to entity linking, where most entities can be discriminated by their name, we consider cases where multiple entities with the same name exist. We discover that similarity depends on the selected sense inventory. To foster future evaluation of natural language processing components for text structuring, we present two prototypes of text structuring systems, which integrate techniques for automatic text structuring in a wiki setting and in an e-learning setting with eBooks.

    Semantic Interaction in Web-based Retrieval Systems : Adopting Semantic Web Technologies and Social Networking Paradigms for Interacting with Semi-structured Web Data

    Existing web retrieval models for exploration and interaction with web data do not take into account semantic information, nor do they allow for new forms of interaction by employing meaningful interaction and navigation metaphors in 2D/3D. This thesis researches means for introducing a semantic dimension into the search and exploration process of web content to enable a significantly positive user experience. To this end, an inherently dynamic view beyond single concepts and models from semantic information processing, information extraction and human-machine interaction is adopted. Essential tasks for semantic interaction such as semantic annotation, semantic mediation and semantic human-computer interaction were identified and elaborated for two general application scenarios in web retrieval: Web-based Question Answering in a knowledge-based dialogue system and semantic exploration of information spaces in 2D/3D.