20,270 research outputs found

    Template-Based Metadata Extraction for Heterogeneous Collection

    Get PDF
    With the growth of the Internet and related tools, there has been a rapid growth of online resources. In particular, high-quality OCR (Optical Character Recognition) tools have made it easy to convert an existing corpus into digital form and make it available online. However, a number of organizations have legacy collections that lack metadata. The lack of metadata hampers not only the discovery and dispersion of these collections over the Web, but also their interoperability with other collections. Unfortunately, manual metadata creation is expensive and time-consuming for a large collection, and most existing automated metadata extraction approaches have focused on specific domains and homogeneous collections. Developing an approach to extract metadata automatically from a large heterogeneous legacy collection poses a number of challenges. In particular, the following issues need to be addressed: (1) Heterogeneity, i.e. how to achieve high accuracy for a heterogeneous collection; (2) Scaling, i.e. how to apply an automated metadata extraction approach to a very large collection; (3) Evolution, i.e. how to process new documents added to a collection over time; (4) Adaptability, i.e. how to apply an approach to a new document collection; (5) Complexity, i.e. how many document features can be handled, and how complex the features should be. In this dissertation, we propose a template-based metadata extraction approach to address these issues. The key idea for addressing heterogeneity is to classify documents into equivalent groups so that each group contains only similar documents. Next, for each document group we create a template that contains a set of rules instructing a template engine how to extract metadata from documents in the group. Templates are written in an XML-based language and kept in separate files. Decoupling the rules from program code and representing them in an XML format makes our approach easy to adapt to another collection with documents in different styles. We developed our test bed by downloading about 10,000 documents from the DTIC (Defense Technical Information Center) collection, which consists of scanned documents in PDF (Portable Document Format). We evaluated our approach on this test bed, and our results are encouraging. We have also demonstrated how the extracted metadata can be utilized to integrate our test bed with an interoperable digital library framework based on OAI (Open Archives Initiative).
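
    The general mechanism can be sketched as follows; the field names, rule types, and regular expressions below are hypothetical and are not the dissertation's actual template language. The sketch shows one way an XML rule template kept in a separate file could drive a small template engine that extracts metadata from OCR text:

    # Illustrative sketch of a template-driven extractor; the template format and
    # field names are assumptions, not the dissertation's actual rule language.
    import re
    import xml.etree.ElementTree as ET

    TEMPLATE = r"""
    <template group="technical-report">
      <field name="title">
        <rule type="regex" pattern="(?m)^TITLE:\s*(.+)$"/>
      </field>
      <field name="author">
        <rule type="regex" pattern="(?m)^AUTHOR\(S\):\s*(.+)$"/>
      </field>
      <field name="date">
        <rule type="regex" pattern="(?m)^REPORT DATE:\s*(.+)$"/>
      </field>
    </template>
    """

    def extract_metadata(text, template_xml):
        """Apply each field's rules to the OCR text; the first matching rule wins."""
        metadata = {}
        for field in ET.fromstring(template_xml).findall("field"):
            for rule in field.findall("rule"):
                match = re.search(rule.get("pattern"), text)
                if match:
                    metadata[field.get("name")] = match.group(1).strip()
                    break
        return metadata

    sample = "TITLE: A Study of Document Layouts\nAUTHOR(S): J. Doe\nREPORT DATE: 1998"
    print(extract_metadata(sample, TEMPLATE))
    # {'title': 'A Study of Document Layouts', 'author': 'J. Doe', 'date': '1998'}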

    Baseline Detection in Historical Documents using Convolutional U-Nets

    Full text link
    Baseline detection is still a challenging task for heterogeneous collections of historical documents. We present a novel approach to baseline extraction in such settings, which turned out to be the winning entry to the ICDAR 2017 Competition on Baseline Detection (cBAD). It utilizes deep convolutional nets (CNNs) both for the actual extraction of baselines and for a simple form of layout analysis in a pre-processing step. To the best of our knowledge, it is the first CNN-based system for baseline extraction that applies a U-Net architecture with sliding-window detection, profiting from the high local accuracy of the extracted candidate lines. A final baseline post-processing step complements our approach, compensating for inaccuracies mainly due to missing context information during sliding-window detection. We experimentally evaluate the components of our system individually on the cBAD dataset. Moreover, we investigate how it generalizes to different data by means of the dataset used for the baseline extraction task of the ICDAR 2017 Competition on Layout Analysis for Challenging Medieval Manuscripts (HisDoc). A comparison with the results reported for HisDoc shows that our system also outperforms the contestants of that competition. Comment: 6 pages, accepted to DAS 2018
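
    As a rough illustration of this kind of pixel-labelling approach (not the authors' published architecture or configuration), the PyTorch sketch below defines a very small U-Net with one skip connection and applies it to a page image with overlapping sliding windows; all layer sizes, the window size, and the stride are assumptions:

    # Illustrative sketch only: a tiny U-Net that scores each pixel as baseline /
    # background, run over a page with a sliding window. All sizes are made up.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def conv_block(c_in, c_out):
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
        )

    class TinyUNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.enc1 = conv_block(1, 16)                # full resolution
            self.enc2 = conv_block(16, 32)               # half resolution
            self.dec = conv_block(16 + 32, 16)           # after skip connection
            self.out = nn.Conv2d(16, 1, 1)               # per-pixel baseline score

        def forward(self, x):
            e1 = self.enc1(x)
            e2 = self.enc2(F.max_pool2d(e1, 2))
            up = F.interpolate(e2, scale_factor=2, mode="bilinear", align_corners=False)
            d = self.dec(torch.cat([up, e1], dim=1))     # skip connection
            return torch.sigmoid(self.out(d))

    def sliding_window_predict(model, page, win=256, stride=128):
        """Score overlapping crops and average the predictions where they overlap."""
        h, w = page.shape[-2:]
        scores = torch.zeros(1, 1, h, w)
        counts = torch.zeros(1, 1, h, w)
        for y in range(0, max(h - win, 0) + 1, stride):
            for x in range(0, max(w - win, 0) + 1, stride):
                with torch.no_grad():
                    scores[..., y:y + win, x:x + win] += model(page[..., y:y + win, x:x + win])
                counts[..., y:y + win, x:x + win] += 1
        return scores / counts.clamp(min=1)

    page = torch.rand(1, 1, 512, 512)                    # dummy grayscale page image
    baseline_map = sliding_window_predict(TinyUNet().eval(), page)
    print(baseline_map.shape)                            # torch.Size([1, 1, 512, 512])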

    Selective relevance feedback using term characteristics

    Get PDF
    This paper presents a new relevance feedback technique that selectively combines evidence based on how terms are used within documents. By considering how terms are used within documents, we can better describe the features that might make a document relevant and thus improve retrieval effectiveness. In this paper we present an initial, experimental investigation of this technique, incorporating new and existing measures for describing the information content of a document. The results from these experiments support our hypothesis that extending relevance feedback to take into account how terms are used within documents can improve the performance of relevance feedback.
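
    As a rough sketch of this kind of selective feedback (the paper's actual term characteristics and measures are not reproduced here), the code below expands a query only with terms that are both frequent and reasonably spread across a relevant document; the spread measure and the threshold are hypothetical stand-ins:

    # Hypothetical sketch of selective relevance feedback: only terms whose
    # within-document usage looks informative (here: spread across the document)
    # are used for query expansion. The measure and threshold are illustrative.
    from collections import Counter

    def term_spread(term, doc_tokens, n_segments=4):
        """Fraction of roughly equal document segments in which the term occurs."""
        seg_len = max(len(doc_tokens) // n_segments, 1)
        segments = [doc_tokens[i:i + seg_len] for i in range(0, len(doc_tokens), seg_len)]
        return sum(term in seg for seg in segments) / len(segments)

    def selective_feedback(query_terms, relevant_docs, spread_threshold=0.4, top_k=5):
        """Expand the query with well-spread, frequent terms from relevant documents."""
        scores = Counter()
        for doc in relevant_docs:
            tokens = doc.lower().split()
            for term, tf in Counter(tokens).items():
                if term in query_terms:
                    continue
                if term_spread(term, tokens) >= spread_threshold:   # selective step
                    scores[term] += tf
        return query_terms + [term for term, _ in scores.most_common(top_k)]

    docs = ["neural ranking models improve retrieval while neural models rerank documents",
            "retrieval effectiveness of neural rankers measured on standard collections"]
    print(selective_feedback(["retrieval"], docs))
    # ['retrieval', 'neural', 'models']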

    Economics and Engineering for Preserving Digital Content

    Get PDF
    Progress towards practical long-term preservation seems to be stalled. Preservationists cannot afford specially developed technology, but must exploit what is created for the marketplace. Economic and technical facts suggest that most preservation work should be shifted from repository institutions to information producers and consumers. Prior publications describe solutions for all known conceptual challenges of preserving a single digital object, but do not deal with software development or scaling to large collections. Much of the document-handling software needed is available; it has, however, not yet been selected, adapted, integrated, or deployed for digital preservation. The daily tools of both information producers and information consumers can be extended to embed preservation packaging without unduly burdening these users. We describe a practical strategy for detailed design and implementation. Document handling is intrinsically complicated because of human sensitivity to communication nuances. Our engineering section therefore starts by discussing how project managers can master the many pertinent details.

    Final Research Report for Sound Design and Audio Player

    Get PDF
    This deliverable describes the work on Task 4.3, Algorithms for sound design and feature developments for audio player. The audio player runs on the in-store player (ISP) and takes care of rendering the music playlists via beat-synchronous automatic DJ mixing, taking advantage of the rich musical content description extracted in T4.2 (beat markers, structural segmentation into intro and outro, musical and sound content classification). The deliverable covers prototypes and final results on: (1) automatic beat-synchronous mixing by beat alignment and time stretching – we developed an algorithm for beat alignment and scheduling of time-stretched tracks; (2) compensation of play-duration changes introduced by time stretching – in order to make the playlist generator independent of beat mixing, we chose to readjust the tempo of played tracks such that their stretched duration is the same as their original duration; (3) prospective research on the extraction of data from DJ mixes – to alleviate the lack of extensive ground-truth databases of DJ mixing practices, we propose steps towards extracting this data from existing mixes by alignment and unmixing of the tracks in a mix; we also show how these methods can be evaluated even without labelled test data, and propose an open dataset for further research; (4) a description of the software player module, a GUI-less application that runs on the ISP and performs streaming of tracks from disk and beat-synchronous mixing. The estimation of cue points where tracks should cross-fade is now described in D4.7, Final Research Report on Auto-Tagging of Music. EC/H2020/688122/EU/Artist-to-Business-to-Business-to-Consumer Audio Branding System/ABC DJ
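
    The duration compensation described in item (2) comes down to simple arithmetic: if the crossfade region is time-stretched by some ratio to align beats with the neighbouring track, the body of the track is stretched by a compensating ratio so that the total played duration equals the original one. A minimal sketch with made-up track parameters (and no actual audio processing):

    # Illustrative arithmetic only, not the deliverable's actual algorithm.
    # Stretch ratios are expressed as output seconds per input second.

    def beat_align_ratio(track_bpm, target_bpm):
        """Duration scale factor that makes a track at track_bpm play at target_bpm."""
        return track_bpm / target_bpm            # < 1 means the track is sped up

    def compensating_ratio(total_dur, mix_dur, mix_ratio):
        """Ratio for the non-mixed part so the track keeps its original duration.

        Solves: mix_dur * mix_ratio + (total_dur - mix_dur) * body_ratio == total_dur
        """
        return (total_dur - mix_dur * mix_ratio) / (total_dur - mix_dur)

    track_bpm, target_bpm = 120.0, 126.0         # hypothetical tempi
    total_dur, mix_dur = 240.0, 20.0             # track and crossfade length in seconds

    mix_ratio = beat_align_ratio(track_bpm, target_bpm)
    body_ratio = compensating_ratio(total_dur, mix_dur, mix_ratio)
    played = mix_dur * mix_ratio + (total_dur - mix_dur) * body_ratio

    print(round(mix_ratio, 4), round(body_ratio, 4), round(played, 6))   # 0.9524 1.0043 240.0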

    Damage function for historic paper. Part I: Fitness for use

    Get PDF
    Background: In heritage science literature and in preventive conservation practice, damage functions are used to model material behaviour, and specifically damage (unacceptable change), as a result of the presence of a stressor over time. For such functions to be of use in the context of collection management, it is important to define a range of parameters, such as who the stakeholders are (e.g. the public, curators, researchers), the mode of use (e.g. display, storage, manual handling), the long-term planning horizon (i.e. when in the future it is deemed acceptable for an item to become damaged or unfit for use), and what the threshold of damage is, i.e. the extent of physical change assessed as damage.
    Results: In this paper, we explore the threshold of fitness for use for archival and library paper documents used for display or reading in the context of access in reading rooms by the general public. Change is considered in the context of discolouration and mechanical deterioration such as tears and missing pieces: forms of physical deterioration that accumulate with time in libraries and archives. We also explore whether the threshold of fitness for use is defined differently for objects perceived to be of different value, and for different modes of use. The data were collected in a series of fitness-for-use workshops carried out with readers/visitors in heritage institutions using principles of Design of Experiments.
    Conclusions: The results show that when no particular value is pre-assigned to an archival or library document, missing pieces influenced readers'/visitors' subjective judgements of fitness for use to a greater extent than did discolouration and tears (which had little or no influence). This finding was most apparent in the display context in comparison to the reading-room context, and it applied best when readers/visitors were not given a value scenario (in comparison to when they were asked to think about the document having personal or historic value). It can be estimated that, in general, items become unfit when text is evidently missing. However, if the visitor/reader is prompted to think of a document in terms of its historic value, then change in a document has little impact on fitness for use.

    Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps

    Full text link
    Concept maps can be used to concisely represent important information and to bring structure into large document collections. Therefore, we study a variant of multi-document summarization that produces summaries in the form of concept maps. However, suitable evaluation datasets for this task are currently missing. To close this gap, we present a newly created corpus of concept maps that summarize heterogeneous collections of web documents on educational topics. It was created using a novel crowdsourcing approach that allows us to efficiently determine important elements in large document collections. We release the corpus along with a baseline system and a proposed evaluation protocol to enable further research on this variant of summarization. Comment: Published at EMNLP 2017
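
    For illustration, a concept map can be treated as a graph of (concept, relation, concept) propositions; the sketch below keeps only propositions whose importance score passes a threshold and prints the resulting map. The example propositions, scores, and threshold are invented and are not taken from the released corpus:

    # Hypothetical sketch: a concept map as labelled propositions filtered by an
    # importance score. Data, scores, and the threshold below are made up.
    from collections import defaultdict

    scored_propositions = [
        ("online courses", "reduce", "tuition costs", 0.9),
        ("online courses", "require", "self-discipline", 0.7),
        ("tuition costs", "are rising at", "public universities", 0.4),
    ]

    def build_concept_map(propositions, min_score=0.5):
        """Keep important propositions and index them as an adjacency structure."""
        concept_map = defaultdict(list)
        for source, relation, target, score in propositions:
            if score >= min_score:
                concept_map[source].append((relation, target))
        return dict(concept_map)

    for concept, edges in build_concept_map(scored_propositions).items():
        for relation, target in edges:
            print(f"{concept} --[{relation}]--> {target}")
    # online courses --[reduce]--> tuition costs
    # online courses --[require]--> self-discipline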