Ranking for Scalable Information Extraction
Information extraction systems are complex software tools that discover structured information in natural language text. For instance, an information extraction system trained to extract tuples for an Occurs-in(Natural Disaster, Location) relation may extract the tuple ⟨tsunami, Hawaii⟩ from the sentence: "A tsunami swept the coast of Hawaii." Having information in structured form enables more sophisticated querying and data mining than what is possible over the natural language text. Unfortunately, information extraction is a time-consuming task. For example, a state-of-the-art information extraction system to extract Occurs-in tuples may take up to two hours to process only 1,000 text documents. Since document collections routinely contain millions of documents or more, improving the efficiency and scalability of the information extraction process over these collections is critical. As a significant step towards this goal, this dissertation presents approaches for (i) enabling the deployment of efficient information extraction systems and (ii) scaling the information extraction process to large volumes of text.
To enable the deployment of efficient information extraction systems, we have developed two crucial building blocks for this task. As a first contribution, we have created REEL, a toolkit to easily implement, evaluate, and deploy full-fledged relation extraction systems. REEL, in contrast to existing toolkits, effectively modularizes the key components involved in relation extraction systems and can integrate other long-established text processing and machine learning toolkits. To define a relation extraction system for a new relation and text collection, users only need to specify the desired configuration, which makes REEL a powerful framework for both research and application building. As a second contribution, we have addressed the problem of building representative extraction task-specific document samples from collections, a step often required by approaches for efficient information extraction. Specifically, we devised fully automatic document sampling techniques for information extraction that can produce better-quality document samples than the state-of-the-art sampling strategies; furthermore, our techniques are substantially more efficient than the existing alternative approaches.
To scale the information extraction process to large volumes of text, we have developed approaches that address the efficiency and scalability of the extraction process by focusing the extraction effort on the collections, documents, and sentences worth processing for a given extraction task. For collections, we have studied both (adaptations of) state-of-the-art approaches for estimating the number of documents in a collection that lead to the extraction of tuples and information extraction-specific approaches. Using these estimates, we can identify the collections worth processing and, for efficiency, ignore the rest. For documents, we have developed an adaptive document ranking approach that relies on learning-to-rank techniques to prioritize the documents that are likely to produce tuples for an extraction task of choice. Our approach revises the (learned) ranking decisions periodically as the extraction process progresses and new characteristics of the useful documents are revealed. Finally, for sentences, we have developed an approach based on the sparse group selection problem that identifies sentences, modeled as groups of words, that best characterize the extraction task. Beyond identifying sentences worth processing, our approach aims at selecting sentences that lead to the extraction of unseen, novel tuples. Our approaches are lightweight and efficient, and dramatically improve the efficiency and scalability of the information extraction process. We can often complete the extraction task by focusing on just a very small fraction of the available text, namely, the text that contains relevant information for the extraction task at hand. Our approaches therefore constitute a substantial step towards efficient and scalable information extraction over large volumes of text.
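As a rough illustration of the adaptive document-ranking idea (a minimal sketch, not the dissertation's implementation), the loop below periodically re-trains a pointwise ranker on the documents processed so far, labeled by whether they yielded tuples, and uses it to prioritize the remaining documents; `extract_tuples` is a hypothetical stand-in for a real extraction system:

```python
# Minimal sketch of adaptive document ranking for information extraction.
# A logistic regression over TF-IDF features stands in for the learned ranker.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def extract_tuples(doc):
    """Hypothetical Occurs-in(NaturalDisaster, Location) extractor."""
    return [("tsunami", "Hawaii")] if "tsunami" in doc.lower() else []

def adaptive_extraction(docs, seed_size=20, batch_size=10):
    X = TfidfVectorizer().fit_transform(docs)   # document features
    remaining = list(range(len(docs)))
    processed, labels, tuples = [], [], []

    def run(batch):
        for i in batch:
            found = extract_tuples(docs[i])
            tuples.extend(found)
            processed.append(i)
            labels.append(1 if found else 0)    # did this document yield tuples?

    run(remaining[:seed_size])                  # unranked bootstrap batch
    remaining = remaining[seed_size:]
    while remaining:
        if len(set(labels)) > 1:                # the ranker needs both classes
            model = LogisticRegression(max_iter=1000).fit(X[processed], labels)
            scores = model.predict_proba(X[remaining])[:, 1]
            # Re-rank periodically: most promising documents first.
            remaining = [i for _, i in sorted(zip(scores, remaining), reverse=True)]
        run(remaining[:batch_size])
        remaining = remaining[batch_size:]
    return tuples
```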
Template-Based Metadata Extraction for Heterogeneous Collection
The growth of the Internet and related tools has driven a rapid expansion of online resources. In particular, by using high-quality OCR (Optical Character Recognition) tools it has become easy to convert an existing corpus into digital form and make it available online. However, a number of organizations have legacy collections that lack metadata. The lack of metadata hampers not only the discovery and dissemination of these collections over the Web, but also their interoperability with other collections. Unfortunately, manual metadata creation is expensive and time-consuming for a large collection, and most existing automated metadata extraction approaches have focused on specific domains and homogeneous collections.
Developing an approach to extract metadata automatically from a large heterogeneous legacy collection poses a number of challenges. In particular, the following issues need to be addressed: (1) Heterogeneity, i.e., how to achieve high accuracy for a heterogeneous collection; (2) Scaling, i.e., how to apply an automated metadata extraction approach to a very large collection; (3) Evolution, i.e., how to process new documents added to a collection over time; (4) Adaptability, i.e., how to apply an approach to a new document collection; (5) Complexity, i.e., how many document features can be handled, and how complex the features should be.
In this dissertation, we propose a template-based metadata extraction approach to address these issues. The key idea for addressing heterogeneity is to classify documents into equivalence groups so that each group contains only similar documents. Next, for each document group we create a template: a set of rules that instruct a template engine on how to extract metadata from the documents in the group. Templates are written in an XML-based language and kept in separate files. Decoupling the rules from program code and representing them in an XML format makes our approach easy to adapt to other collections with documents in different styles.
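To make the template idea concrete, here is a hypothetical illustration: an XML file of extraction rules for one document group, applied by a small rule engine. The element and attribute names are invented for this sketch, not the dissertation's actual template language:

```python
# Hypothetical template: one regex rule per metadata field for a document group.
import re
import xml.etree.ElementTree as ET

TEMPLATE = """
<template group="dtic-technical-report">
  <rule field="title"  pattern="(?m)^TITLE:\\s*(.+)$"/>
  <rule field="author" pattern="(?m)^AUTHOR\\(S\\):\\s*(.+)$"/>
  <rule field="date"   pattern="(?m)^REPORT DATE:\\s*(.+)$"/>
</template>
"""

def extract_metadata(text, template_xml):
    """Apply each rule's regex to the document text; first match wins."""
    metadata = {}
    for rule in ET.fromstring(template_xml).findall("rule"):
        match = re.search(rule.get("pattern"), text)
        if match:
            metadata[rule.get("field")] = match.group(1).strip()
    return metadata

doc = "TITLE: Baseline Study\nAUTHOR(S): J. Doe\nREPORT DATE: 1998-04-01\n"
print(extract_metadata(doc, TEMPLATE))
# {'title': 'Baseline Study', 'author': 'J. Doe', 'date': '1998-04-01'}
```

Keeping the rules in separate XML files, as the dissertation describes, means adapting to a new collection amounts to writing new templates rather than changing the engine.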
We developed our test bed by downloading about 10,000 documents from the DTIC (Defense Technical Information Center) collection, which consists of scanned documents in PDF (Portable Document Format). We evaluated our approach on this test bed, and our results are encouraging. We have also demonstrated how the extracted metadata can be utilized to integrate our test bed with an interoperable digital library framework based on OAI (Open Archives Initiative).
Baseline Detection in Historical Documents using Convolutional U-Nets
Baseline detection is still a challenging task for heterogeneous collections of historical documents. We present a novel approach to baseline extraction in such settings, which produced the winning entry to the ICDAR 2017 Competition on Baseline Detection (cBAD). It employs deep convolutional neural networks (CNNs) both for the actual extraction of baselines and for a simple form of layout analysis in a pre-processing step. To the best of our knowledge, it is the first CNN-based system for baseline extraction that applies a U-Net architecture and sliding-window detection, profiting from the high local accuracy of the extracted candidate lines. A final baseline post-processing step complements our approach, compensating for inaccuracies mainly due to missing context information during sliding-window detection. We experimentally evaluate the components of our system individually on the cBAD dataset. Moreover, we investigate how it generalizes to different data using the dataset from the baseline extraction task of the ICDAR 2017 Competition on Layout Analysis for Challenging Medieval Manuscripts (HisDoc). A comparison with the results reported for HisDoc shows that it also outperforms the contestants of the latter.
Comment: 6 pages, accepted to DAS 2018
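To make the architecture concrete, here is a minimal sketch (illustrative depths, channel counts, and window sizes, not the authors' configuration) of a tiny U-Net that maps a grayscale patch to per-pixel baseline probabilities, with sliding-window inference over a full page:

```python
# Minimal U-Net sketch (PyTorch) for per-pixel baseline prediction.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(1, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = conv_block(32, 16)          # 16 skip + 16 upsampled channels
        self.head = nn.Conv2d(16, 1, 1)        # per-pixel baseline logit

    def forward(self, x):                      # expects even H and W
        e1 = self.enc1(x)                      # full resolution
        e2 = self.enc2(self.pool(e1))          # half resolution
        d = self.dec(torch.cat([self.up(e2), e1], dim=1))
        return self.head(d)

def _starts(total, win, stride):
    s = list(range(0, total - win + 1, stride))
    if s[-1] != total - win:
        s.append(total - win)                  # always cover the far edge
    return s

def predict_page(model, page, win=256, stride=192):
    """Sliding-window inference over a page tensor of shape (1, H, W);
    assumes H, W >= win. Overlapping predictions are averaged."""
    _, H, W = page.shape
    out, hits = torch.zeros(1, H, W), torch.zeros(1, H, W)
    model.eval()
    with torch.no_grad():
        for y in _starts(H, win, stride):
            for x in _starts(W, win, stride):
                patch = page[:, y:y+win, x:x+win].unsqueeze(0)
                out[:, y:y+win, x:x+win] += torch.sigmoid(model(patch))[0]
                hits[:, y:y+win, x:x+win] += 1
    return out / hits
```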
Selective relevance feedback using term characteristics
This paper presents a new relevance feedback technique that selectively combines evidence based on how terms are used within documents. By considering how terms are used within documents, we can better describe the features that might make a document relevant and thus improve retrieval effectiveness. In this paper we present an initial, experimental investigation of this technique, incorporating new and existing measures for describing the information content of a document. The results from these experiments support our hypothesis that extending relevance feedback to take into account how terms are used within documents can improve the performance of relevance feedback.
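A toy sketch of the general idea (Rocchio-style expansion used as a stand-in; the `spread` measure is invented for illustration and is not one of the paper's measures): candidate expansion terms from relevant documents are weighted not only by frequency but also by a characteristic of how they are used within each document:

```python
# Query expansion where term weight is modulated by a within-document
# usage characteristic (here: how widely occurrences span the document).
from collections import Counter

def spread(term, tokens):
    """Fraction of the document spanned by occurrences of `term`."""
    pos = [i for i, t in enumerate(tokens) if t == term]
    if len(pos) < 2:
        return 0.0
    return (pos[-1] - pos[0]) / max(len(tokens) - 1, 1)

def expand_query(query_terms, relevant_docs, k=5, beta=0.75):
    """Score candidates by frequency in relevant documents, weighted by
    their usage characteristic; return the top-k expansion terms."""
    scores = Counter()
    for doc in relevant_docs:
        tokens = doc.lower().split()
        for term, f in Counter(tokens).items():
            if term not in query_terms:
                scores[term] += beta * f * spread(term, tokens)
    return [t for t, s in scores.most_common(k) if s > 0]

docs = ["baseline detection in historical documents uses baseline labels "
        "across pages so baseline evidence spans the whole document"]
print(expand_query({"historical"}, docs, k=3))   # ['baseline']
```

Terms that recur throughout a relevant document outrank terms that merely cluster in one passage, which is one way "usage within documents" can refine feedback.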
Economics and Engineering for Preserving Digital Content
Progress towards practical long-term preservation seems to be stalled. Preservationists cannot afford specially developed technology, but must exploit what is created for the marketplace.
Economic and technical facts suggest that most preservation work should be shifted from repository institutions to information producers and consumers. Prior publications describe solutions for all known conceptual challenges of preserving a single digital object, but do not deal with software development or scaling to large collections. Much of the needed document-handling software is available; it has, however, not yet been selected, adapted, integrated, or deployed for digital preservation. The daily tools of both information producers and information consumers can be extended to embed preservation packaging without unduly burdening these users.
We describe a practical strategy for detailed design and implementation. Document handling is intrinsically complicated because of human sensitivity to communication nuances. Our engineering section therefore starts by discussing how project managers can master the many pertinent details.
Final Research Report for Sound Design and Audio Player
This deliverable describes the work on Task 4.3 Algorithms for sound design and feature developments for audio player. The audio player runs on the in-store player (ISP) and takes care of rendering the music playlists via beat-synchronous automatic DJ mixing, taking advantage of the rich musical content description extracted in T4.2 (beat markers, structural segmentation into intro and outro, musical and sound content classification).
The deliverable covers prototypes and final results on: (1) automatic beat-synchronous mixing by beat alignment and time stretching: we developed an algorithm for beat alignment and scheduling of time-stretched tracks; (2) compensation of play duration changes introduced by time stretching: in order to make the playlist generator independent of beat mixing, we chose to readjust the tempo of played tracks such that their stretched duration is the same as their original duration; (3) prospective research on the extraction of data from DJ mixes: to alleviate the lack of extensive ground truth databases of DJ mixing practices, we propose steps towards extracting this data from existing mixes by alignment and unmixing of the tracks in a mix. We also show how these methods can be evaluated even without labelled test data, and propose an open dataset for further research; (4) a description of the software player module, a GUI-less application to run on the ISP that performs streaming of tracks from disk and beat-synchronous mixing.
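One plausible way to realize the compensation described in item (2), sketched under assumed simplifications (a track split into two crossfade regions and a body; this split and all names are illustrative, not the deliverable's algorithm): stretch the crossfades to the mix tempo, then pick a compensating tempo ratio for the body so the total stretched duration equals the original duration.

```python
# Duration compensation sketch: stretching by tempo ratio r scales a
# segment's duration by 1/r, so solve for the body ratio that keeps the
# whole track at its original length.

def body_tempo_ratio(duration, fade, mix_ratio):
    """
    duration:  original track duration in seconds
    fade:      original duration of each crossfade region in seconds
    mix_ratio: tempo ratio applied in the crossfades (mix_bpm / track_bpm)
    Returns the tempo ratio for the track body so total duration is unchanged.
    """
    body = duration - 2 * fade
    stretched_fades = 2 * fade / mix_ratio
    return body / (duration - stretched_fades)

# Example: a 240 s track, 10 s crossfades, sped up 5% at the fades.
r = body_tempo_ratio(240.0, 10.0, 1.05)
total = (240.0 - 20.0) / r + 20.0 / 1.05
print(round(r, 4), round(total, 2))   # r ≈ 0.9957 (body slightly slowed), total == 240.0
```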
The estimation of cue points where tracks should cross-fade is now described in D4.7 Final Research Report on Auto-Tagging of Music. EC/H2020/688122/EU/Artist-to-Business-to-Business-to-Consumer Audio Branding System/ABC DJ
Damage function for historic paper. Part I: Fitness for use
Background: In heritage science literature and in preventive conservation practice, damage functions are used to model material behaviour, and specifically damage (unacceptable change), as a result of the presence of a stressor over time. For such functions to be of use in the context of collection management, it is important to define a range of parameters, such as who the stakeholders are (e.g. the public, curators, researchers), the mode of use (e.g. display, storage, manual handling), the long-term planning horizon (i.e. when in the future it is deemed acceptable for an item to become damaged or unfit for use), and the threshold of damage, i.e. the extent of physical change assessed as damage.
Results: In this paper, we explore the threshold of fitness for use for archival and library paper documents used for display or reading in the context of access in reading rooms by the general public. Change is considered in terms of discolouration and mechanical deterioration such as tears and missing pieces: forms of physical deterioration that accumulate with time in libraries and archives. We also explore whether the threshold of fitness for use is defined differently for objects perceived to be of different value, and for different modes of use. The data were collected in a series of fitness-for-use workshops carried out with readers/visitors in heritage institutions using principles of Design of Experiments.
Conclusions: The results show that when no particular value is pre-assigned to an archival or library document, missing pieces influenced readers/visitors' subjective judgements of fitness for use to a greater extent than did discolouration and tears (which had little or no influence). This finding was most apparent in the display context in comparison to the reading room context, and applied best when readers/visitors were not given a value scenario (in comparison to when they were asked to think of the document as having personal or historic value). It can be estimated that, in general, items become unfit when text is evidently missing. However, if the visitor/reader is prompted to think of a document in terms of its historic value, then change in a document has little impact on fitness for use.
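Purely as an illustration of the shape such a damage function might take (an invented toy model, not the paper's fitted results), the probability of an "unfit" judgement can be pictured as a logistic curve over the fraction of missing text, near zero below a threshold and saturating above it:

```python
# Toy damage-function shape: probability that readers judge an item unfit
# for use as a logistic function of missing-text extent. The threshold and
# steepness values are invented for illustration only.
import math

def p_unfit(missing_text_fraction, threshold=0.05, steepness=80.0):
    """Logistic curve: ~0 below the threshold, ~1 well above it."""
    return 1.0 / (1.0 + math.exp(-steepness * (missing_text_fraction - threshold)))

for frac in (0.0, 0.02, 0.05, 0.10):
    print(f"{frac:.2f} -> {p_unfit(frac):.2f}")
# 0.00 -> 0.02, 0.02 -> 0.08, 0.05 -> 0.50, 0.10 -> 0.98
```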
Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps
Concept maps can be used to concisely represent important information and bring structure into large document collections. Therefore, we study a variant of multi-document summarization that produces summaries in the form of concept maps. However, suitable evaluation datasets for this task are currently missing. To close this gap, we present a newly created corpus of concept maps that summarize heterogeneous collections of web documents on educational topics. It was created using a novel crowdsourcing approach that allows us to efficiently determine important elements in large document collections. We release the corpus along with a baseline system and a proposed evaluation protocol to enable further research on this variant of summarization.
Comment: Published at EMNLP 2017
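For concreteness, a concept map of the kind studied here can be represented as a set of labeled propositions over concepts; the sketch below is a generic representation, not the released corpus's actual file format:

```python
# A concept map as a labeled graph: concepts are nodes, propositions
# (concept, relation, concept) are the labeled edges.
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposition:
    source: str      # concept node
    relation: str    # edge label
    target: str      # concept node

concept_map = {
    Proposition("concept maps", "represent", "important information"),
    Proposition("concept maps", "bring structure into", "document collections"),
    Proposition("crowdsourcing", "determines", "important elements"),
}

# Recover the node set from the labeled edges.
concepts = {p.source for p in concept_map} | {p.target for p in concept_map}
print(sorted(concepts))
```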