    Reports of the DAS02 Working Groups

    This document is a collection of four working group reports in the areas of digital libraries, document image retrieval, layout analysis, and Web document analysis. These reports were the outcome of discussions by participants at the Fifth IAPR International Workshop on Document Analysis Systems held in Princeton, NJ, on 19-21 August 2002.

    DIAL 2004 Working Group Report on Acquisition Quality Control

    This report summarizes the discussions of the Working Group on Acquisition Quality at the International Workshop on Document Image Analysis for Libraries, Palo Alto, CA, 23-24 January 2004. Acquisition of the image is one of the most time-intensive components of forming a digital library, and the quality of the acquisition will affect all later stages of the digital library project. The current state of the art in acquisition is analyzed. Problems and suggested improvements for image acquisition and storage formats follow, along with the special problems associated with acquisition from microfilm. A list of general suggestions was developed, complemented by a wish list of things the Working Group would like to see addressed in future acquisition discussions.

    Virtual Wrap-Up Presentation: Digital Libraries, Intelligent Data Analytics, and Augmented Description

    Includes framing, overview, and discussion of the explorations pursued as part of the Digital Libraries, Intelligent Data Analytics, and Augmented Description demonstration project, carried out by members of the Aida digital libraries research team at the University of Nebraska-Lincoln through a research services contract with the Library of Congress. This presentation covered: Aida research team and background for the demonstration project; broad outlines of “Digital Libraries, Intelligent Data Analytics, and Augmented Description”; what changed for us as a research team over the collaboration and why; deliverables of our work; thoughts toward “What next”; and deep-dives into the explorations. The machine learning explorations, which focus on historic document materials from the Library of Congress, include image segmentation; visual context extraction from textual materials; text extraction from images; document/corpus quality assessment; differentiation among documents created via different means; differentiation among printed, handwritten, and mixed content; and metadata generation. Preliminary take-aways discussed include an expanded sense of how these projects may be useful, with greater emphasis on internal use within the Library of Congress; consideration of how crowd-sourced information can aid machine learning, as well as what may be well-suited to the crowd, to the machine, and to domain experts; the need for analysis of the materials through a variety of strategies to inform machine learning; and greater awareness of the full range of resources (computational, human, technical, social) necessary to do this work.

    Opening Books and the National Corpus of Graduate Research

    Virginia Tech University Libraries, in collaboration with Virginia Tech Department of Computer Science and Old Dominion University Department of Computer Science, request $505,214 in grant funding for a 3-year project, the goal of which is to bring computational access to book-length documents, demonstrating this with Electronic Theses and Dissertations (ETDs). The project is motivated by the following library and community needs. (1) Despite huge volumes of book-length documents in digital libraries, there is a lack of models offering effective and efficient computational access to these long documents. (2) Nationwide open access services for ETDs generally function at the metadata level. Much important knowledge and scientific data lie hidden in ETDs, and we need better tools to mine the content and facilitate the identification, discovery, and reuse of these important components. (3) A wide range of audiences can potentially benefit from this research, including but not limited to librarians, students, authors, educators, researchers, and other interested readers. We will answer the following key research questions: (1) How can we effectively identify and extract key parts (chapters, sections, tables, figures, citations), in both born-digital and page-image formats? (2) How can we develop effective automatic classification as well as chapter summarization techniques? (3) How can our ETD digital library most effectively serve stakeholders? In response to these questions, we plan to first compile an ETD corpus consisting of at least 50,000 documents from multiple institutional repositories. We will make the corpus inclusive and diverse, covering a range of degrees (master’s and doctoral), years, graduate programs (STEM and non-STEM), and authors (from HBCUs and non-HBCUs). Testing first with this sample, we will investigate three major research areas (RAs), outlined below.
RA 1: Document analysis and extraction, in which we experiment with machine/deep learning models for effective ETD segmentation and subsequent information extraction. Anticipated results of this research include new software tools that can be used and adapted by libraries for automatic extraction of structural metadata and document components (chapters, sections, figures, tables, citations, bibliographies) from ETDs, applied to both page-image and born-digital documents. RA 2: Adding value, in which we investigate techniques and build machine/deep learning models to automatically summarize and classify ETD chapters. Anticipated results of this research include software implementations of a chapter-level text summarizer that generates paragraph-length summaries of ETD chapters, and a multi-label classifier that assigns subject categories to ETD chapters. Our aim is to develop software that can be adapted or replicated by libraries to add value to their existing ETD services. RA 3: User services, in which we study users to identify and understand their information needs and information-seeking behaviors, so that we may establish corresponding requirements for the user interface and service components most useful for interacting with ETD content. Basing our design decisions on empirical evidence obtained from user analysis, we will construct a prototype system to demonstrate how these components can improve the user experience with ETD collections, and ultimately increase the capacity of libraries to provide access to ETDs and other long-form document content. Our project brings to bear cutting-edge computer science and machine/deep learning technologies to advance discovery, use, and potential for reuse of the knowledge hidden in the text of books and book-length documents.
In addition, by focusing on libraries' ETD collections (where legal restrictions from book publishers generally are not applicable), our research will open this rich corpus of graduate research and scholarship, leverage ETDs to advance further research and education, and allow libraries to achieve greater impact.

    A semantic-based system for querying personal digital libraries

    This is the author's accepted manuscript. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-540-28640-0_4. Copyright @ Springer 2004. The decreasing cost and the increasing availability of new technologies are enabling people to create their own digital libraries. One of the main topics in personal digital libraries is allowing people to select interesting information among all the different digital formats available today (pdf, html, tiff, etc.). Moreover, the increasing availability of these on-line libraries, as well as the advent of the so-called Semantic Web [1], is raising the demand for converting paper documents into digital, possibly semantically annotated, documents. These motivations drove us to design a new system which could enable the user to interact with and query documents independently of the digital formats in which they are represented. In order to achieve this independence from the format, we consider all the digital documents contained in a digital library as images. Our system tries to automatically detect the layout of the digital documents and recognize the geometric regions of interest. All the extracted information is then encoded with respect to a reference ontology, so that the user can query his digital library by typing free text or browsing the ontology.

    Judging a book by its cover: interface elements that affect reader selection of ebooks

    Digital library research has demonstrated the impact of content presentation on both search and reading behaviours. In this paper, we scrutinise the influence of ebook presentation on user behaviour, focussing on document thumbnails and the first page view. We demonstrate that flaws in presentation increase the volume of short time-span reading, and reduce the likelihood of long-span reading when compared to other documents. This reflects other patterns of information seeking behaviour that demonstrate increased short-term reading when information content is uncertain, and suggests an ineffective use of reader time on less useful content

    Detecting Family Resemblance: Automated Genre Classification.

    This paper presents results in automated genre classification of digital documents in PDF format. It describes genre classification as an important ingredient in contextualising scientific data and in retrieving targeted material for improving research. The current paper compares the role of visual layout, stylistic features and language model features in clustering documents and presents results in retrieving five selected genres (Scientific Article, Thesis, Periodicals, Business Report, and Form) from a pool of materials populated with documents of the nineteen most popular genres found in our experimental data set.

    Visual collaging of music in a digital library

    This article explores the role visual browsing can play within a digital music library. The context for the work is provided through a review of related techniques drawn from the fields of digital libraries and human-computer interaction. Implemented within the open source digital library toolkit Greenstone, a prototype system is described that combines images located through textual metadata with a visualisation technique known as collaging to provide a leisurely, undirected interaction with a music collection. Emphasis in the article is given to the augmentations of the basic technique to work in the musical domain.

    Searching for Ground Truth: a stepping stone in automating genre classification

    This paper examines genre classification of documents and its role in enabling the effective automated management of digital documents by digital libraries and other repositories. We have previously presented genre classification as a valuable step toward achieving automated extraction of descriptive metadata for digital material. Here, we present results from experiments using human labellers, conducted to assist in genre characterisation and the prediction of obstacles which need to be overcome by an automated system, and to contribute to the process of creating a solid testbed corpus for extending automated genre classification and testing metadata extraction tools across genres. We also describe the performance of two classifiers based on image and stylistic modeling features in labelling the data resulting from the agreement of three human labellers across fifteen genre classes.

    An Experimental Digital Library Platform - A Demonstrator Prototype for the DigLib Project at SICS

    Within the framework of the Digital Library project at SICS, this thesis describes the implementation of a demonstrator prototype of a digital library (DigLib): an experimental platform integrating several functions in one common interface. It includes descriptions of the structure and formats of the digital library collection, the tailoring of the search engine Dienst, the construction of a keyword extraction tool, and the design and development of the interface. The platform was realised through sicsDAIS, an agent interaction and presentation system, and is to be used for testing and evaluating various tools for information seeking. The platform supports various user interaction strategies by providing: search in bibliographic records (Dienst); an index of keywords (the Keyword Extraction Function (KEF)); and browsing through the hierarchical structure of the collection. KEF was developed for this thesis work, and extracts and presents keywords from Swedish documents. Although based on a comparatively simple algorithm, KEF contributes by supplying a long-felt want in the area of Information Retrieval. Evaluations of the tasks and the interface still remain to be done, but the digital library is very much up and running. By implementing the platform through sicsDAIS, DigLib can deploy additional tools and search engines without interfering with already running modules. If desired, agents providing services other than those SICS can supply can be plugged in.