195 research outputs found

    Human-competitive automatic topic indexing

    Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers, and by amateurs. We claim that the algorithm is human-competitive because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages

    User-centered semantic dataset retrieval

    Finding relevant research data is an increasingly important but time-consuming task in daily research practice. Several studies report on difficulties in dataset search, e.g., scholars retrieve only part of the pertinent data, and important information cannot be displayed in the user interface. Overcoming these problems has motivated a number of research efforts in computer science, such as text mining and semantic search. In particular, the emergence of the Semantic Web opens a variety of novel research perspectives. Motivated by these challenges, the overall aim of this work is to analyze the current obstacles in dataset search and to propose and develop a novel semantic dataset search. The studied domain is biodiversity research, a domain that explores the diversity of life, habitats and ecosystems. This thesis has three main contributions: (1) We evaluate the current situation in dataset search in a user study, and we compare a semantic search with a classical keyword search to explore the suitability of semantic web technologies for dataset search. (2) We generate a question corpus and develop an information model to determine which scientific topics scholars in biodiversity research are interested in. Moreover, we also analyze the gap between current metadata and scholarly search interests, and we explore whether metadata and user interests match. (3) We propose and develop an improved dataset search based on three components: (A) a text mining pipeline, enriching metadata and queries with semantic categories and URIs, (B) a retrieval component with a semantic index over categories and URIs and (C) a user interface that enables a search within categories and a search including further hierarchical relations. Following user-centered design principles, we ensure user involvement in various user studies during the development process.

    ENHANCING IMAGE FINDABILITY THROUGH A DUAL-PERSPECTIVE NAVIGATION FRAMEWORK

    This dissertation investigates whether users locate desired images more efficiently and effectively when they are provided with information descriptors from both experts and the general public. This study develops a way to support image finding through a human-computer interface by providing subject headings and social tags about the image collection and preserving the information scent (Pirolli, 2007) during the image search experience. In order to improve search performance, most proposed solutions integrating experts’ annotations and social tags focus on how to utilize controlled vocabularies to structure folksonomies, which are taxonomies created by multiple users (Peters, 2009). However, these solutions merely map terms from one domain into the other without considering the inherent differences between the two. In addition, many websites reflect the benefits of using both descriptors by applying a multiple interface approach (McGrenere, Baecker, & Booth, 2002), but this type of navigational support only allows users to access one information source at a time. By contrast, this study develops an approach that integrates these two features to facilitate finding resources without changing their nature or forcing users to choose one means or the other. Driven by the concept of information scent, the main contribution of this dissertation is an experiment exploring whether images can be found more efficiently and effectively when multiple access routes with two information descriptors are provided to users in the dual-perspective navigation framework. In this study, the framework proved more effective and efficient than the subject heading-only and tag-only interfaces for exploratory tasks. This finding can assist interface designers who struggle to determine what information best helps users and facilitates searching tasks.
Although this study explicitly focuses on image search, the result may be applicable to a wide variety of other domains. The lack of textual content in image systems makes images particularly hard to locate using traditional search methods. While professionals describe the items in a collection of images, the crowd's social tags augment this professional effort in a cost-effective manner.

    Pictures in words : indexing, folksonomy and representation of subject content in historic photographs

    Subject access to images is a major issue for image collections. Research is needed to understand how indexing and tagging contribute to make the subjects of historic photographs accessible. This thesis firstly investigates the evidence of cognitive dissonance between indexers and users in the way they attribute subjects to historic photographs, and, secondly, how indexers and users might work together to enhance subject description. It analyses how current indexing and social tagging represent the subject content of historic photographs. It also suggests a practical way indexers can work with taggers to deal with the classic problem of resource constraints and to enhance metadata to make photo collections more accessible. In an original application of the Shatford/Panofsky classification matrix within the applications domain of historic images, patterns of subject attribution are explored between taggers and professional indexers. The study was conducted in two stages. The first stage (Studies A to D) investigated how professional indexers and taggers represent the subject content of historic photographs and revealed differences based on Shatford/Panofsky. The indexers (Study A) demonstrated a propensity for specific and generic subjects and almost complete avoidance of abstracts. In contrast, a pilot study with users (Study B) and with baseline taggers (Studies C and D) showed their propensity for generics and equal inclination to specifics and abstracts. The evidence supports the conclusion that indexers and users approach the subject content of historic photographs differently, demonstrating cognitive dissonance, a conflict between how they appear to think about and interpret images. The second stage (Study E) demonstrated that an online training intervention affected tagging behaviour. The intervention resulted in increased tagging and fuller representation of all subject facets according to the Shatford/Panofsky classification matrix. 
The evidence showed that trained taggers tagged more generic and abstract facets than untrained taggers. Importantly, this suggests that training supports the annotation of the higher levels of subject content and so potentially provides enhanced intellectual access. The research demonstrated a practical way institutions can work with taggers to extend the representation of subject content in historic photographs. Improved subject description is critical for intellectual access and retrieval in the cultural heritage space. Through systematic application of the training method a richer corpus of descriptors might be created that enhances machine based information retrieval via automatic extraction

    Getting More out of Biomedical Documents with GATE's Full Lifecycle Open Source Text Analytics.

    This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. First, in genome-wide association studies which have contributed to discovery of a head and neck cancer mutation association. Second, medical records analysis which has significantly increased the statistical power of treatment/outcome models in the UK’s largest psychiatric patient cohort. Third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors’ own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

    #MPLP: a Comparison of Domain Novice and Expert User-generated Tags in a Minimally Processed Digital Archive

    The high costs of creating and maintaining digital archives precluded many archives from providing users with digital content or increasing the amount of digitized materials. Studies have shown users increasingly demand immediate online access to archival materials with detailed descriptions (access points). The adoption of minimal processing to digital archives limits the access points to the folder or series level rather than the item-level description users desire. User-generated content, such as tags, could supplement the minimally processed metadata, though users are reluctant to trust or use unmediated tags. This dissertation project explores the potential for controlling/mediating the supplemental metadata from user-generated tags through inclusion of only expert domain user-generated tags. The study was designed to answer three research questions with two parts each: 1(a) What are the similarities and differences between tags generated by expert and novice users in a minimally processed digital archive?, 1(b) Are there differences between expert and novice users' opinions of the tagging experience and tag creation considerations?, 2(a) In what ways do tags generated by expert and/or novice users in a minimally processed collection correspond with metadata in a traditionally processed digital archive?, 2(b) Does user knowledge affect the proportion of tags matching unselected metadata in a minimally processed digital archive?, 3(a) In what ways do tags generated by expert and/or novice users in a minimally processed collection correspond with existing users' search terms in a digital archive?, and 3(b) Does user knowledge affect the proportion of tags matching query terms in a minimally processed digital archive? The dissertation project was a mixed-methods, quasi-experimental design focused on tag generation within a sample minimally processed digital archive. The study used a sample collection of fifteen documents and fifteen photographs.
Sixty participants divided into two groups (novices and experts) based on assessed prior knowledge of the sample collection's domain generated tags for fifteen documents and fifteen photographs (a minimum of one tag per object). Participants completed a pre-questionnaire identifying prior knowledge, and use of social tagging and archives. Additionally, participants provided their opinions regarding factors associated with tagging including the tagging experience and considerations while creating tags through structured and open-ended questions in a post-questionnaire. An open-coding analysis of the created tags developed a coding scheme of six major categories and six subcategories. Application of the coding scheme categorized all generated tags. Additional descriptive statistics summarized the number of tags created by each domain group (expert, novice) for all objects and divided by format (photograph, document). T-tests and Chi-square tests explored the associations (and associative strengths) between domain knowledge and the number of tags created or types of tags created for all objects and divided by format. The subsequent analysis compared the tags with the metadata from the existing collection not displayed within the sample collection participants used. Descriptive statistics summarized the proportion of tags matching unselected metadata and Chi-square tests analyzed the findings for associations with domain knowledge. Finally, the author extracted existing users' query terms from one month of server-log data and compared the generated-tags and unselected metadata. Descriptive statistics summarized the proportion of tags and unselected metadata matching query terms, and Chi-square tests analyzed the findings for associations with domain knowledge. Based on the findings, the author discussed the theoretical and practical implications of including social tags within a minimally processed digital archive.

    Linking genes to literature: text mining, information extraction, and retrieval applications for biology

    Efficient access to information contained in online scientific literature collections is essential for life science research, playing a crucial role from the initial stage of experiment planning to the final interpretation and communication of the results. The biological literature also constitutes the main information source for manual literature curation used by expert-curated databases. Following the increasing popularity of web-based applications for analyzing biological data, new text-mining and information extraction strategies are being implemented. These systems exploit existing regularities in natural language to extract biologically relevant information from electronic texts automatically. The aim of the BioCreative challenge is to promote the development of such tools and to provide insight into their performance. This review presents a general introduction to the main characteristics and applications of currently available text-mining systems for life sciences in terms of the following: the type of biological information demands being addressed; the level of information granularity of both user queries and results; and the features and methods commonly exploited by these applications. The current trend in biomedical text mining points toward an increasing diversification in terms of application types and techniques, together with integration of domain-specific resources such as ontologies. Additional descriptions of some of the systems discussed here are available on the internet

    The Use of Social Tags in Text and Image Searching on the Web.

    In recent years, tags have become a standard feature on a diverse range of sites on the Web, accompanying blog posts, photos, videos, and online news stories. Tags are descriptive terms attached to Internet resources. Despite the rapid adoption of tagging, how people use tags during the search process is not well understood. There is little empirical data on the use and perceptions of tags created by those other than the searcher. Previous research on tags focused on the motivations and behaviors of taggers, although non-taggers represent a larger proportion of Web users than taggers. This study examines how people use tags, created by others, during the search process. Forty-eight subjects were each assigned four search tasks in a within-subjects study. Subjects searched for text documents and images in a controlled laboratory setting, using information retrieval interfaces differing in their incorporation of tags. User behavior and perception data were collected through search logs and interviews. Both direct and indirect uses of tags across the search process were examined. Tags are used directly when they are clicked on, resulting in a new query, while tags are used indirectly when used for judgments of relevance or to obtain additional terms for query reformulation. Tags increased interactions with the information retrieval system, as subjects issued more queries and saw more search results when using the tagged interface. For both text and image searches, tags were used for query reformulation, predictive judgment, and evaluative judgment of relevance. Subjects interacted most frequently with tags on the search results page, using them for query reformulation and predictive judgment. Tags were more likely to be used for predictive judgment in text searches than in image searches. Subjects’ understanding of tags focused on the role of tags in search, especially findability through a search engine. 
Tags were not uniformly perceived as being user-generated; site owners and automatic generation were mentioned as sources of tags. Several implications for the design of search interfaces and presentation of tags to support information interactions are discussed in the conclusion.
Ph.D., Information, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/89816/1/kimym_1.pd

    The Future of Information Sciences : INFuture2009 : Digital Resources and Knowledge Sharing
