60,338 research outputs found

    Universal Indexes for Highly Repetitive Document Collections

    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.
    Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
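    The contrast between gap-encoding and run-length compression of differential inverted lists can be illustrated with a minimal sketch. It is not the authors' implementation; the function names (`gaps`, `run_length`, `decode`) and the sample posting list are assumptions made for illustration only. The point is that a term occurring in many consecutive versions yields long runs of gap 1, which collapse into a few (gap, count) pairs.

```python
from itertools import groupby

def gaps(postings):
    """Differential encoding: first document id, then differences
    between consecutive ids (hypothetical helper, for illustration)."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def run_length(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def decode(runs):
    """Rebuild the original posting list from run-length encoded gaps."""
    out, current = [], 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            out.append(current)
    return out

# Assumed example: a term present in versions 3-7 and 20-23 of a document.
postings = [3, 4, 5, 6, 7, 20, 21, 22, 23]
runs = run_length(gaps(postings))
print(runs)
print(decode(runs) == postings)
```

    In this toy case nine gaps shrink to four pairs. Savings on a real collection depend on how repetitive it is; the Lempel-Ziv and grammar-based variants mentioned in the abstract are not shown here.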

    Indexing Audio Documents by using Latent Semantic Analysis and SOM

    This paper describes an important application for state-of-the-art automatic speech recognition, natural language processing and information retrieval systems. Methods for enhancing the indexing of spoken documents by using latent semantic analysis and self-organizing maps are presented, motivated and tested. The idea is to extract extra information from the structure of the document collection and use it for more accurate indexing by generating new index terms and stochastic index weights. Indexing methods are evaluated for two broadcast news databases (one French and one English) using the average document perplexity defined in this paper and test queries analyzed by human experts.

    Template Mining for Information Extraction from Digital Documents

    Published or submitted for publication.

    Perspectival Plurality, Relativism, and Multiple Indexing

    In this paper I focus on a recently discussed phenomenon illustrated by sentences containing predicates of taste: the phenomenon of "perspectival plurality", whereby sentences containing two or more predicates of taste have readings according to which each predicate pertains to a different perspective. This phenomenon has been shown to be problematic for (at least certain versions of) relativism. My main aim is to further the discussion by showing that the phenomenon extends to other perspectival expressions than predicates of taste and by proposing a general solution to the problem raised by it on behalf of the relativist. The core claim of the solution ("multiple indexing") is that utterances of sentences containing perspectival expressions should be evaluated with respect to (possibly infinite) sequences of perspective parameters.

    Hierarchical index sets in algebraic modelling languages

    Multi-dimensional algebraic modelling languages make extensive use of simple and compound index sets. In this paper the multi-dimensional modelling paradigm is extended with the concept of a hierarchical index set to support the use of hierarchical data structures. The appropriate reference and indexing mechanisms are introduced, together with mechanisms to support various set operations. Special attention is paid to the Cartesian product of two hierarchical index sets. The modelling of multi-stage programming models is supported through the introduction of a hierarchical indexing mechanism. The extensions proposed in this paper are compared to existing facilities designed to support the modelling of hierarchical structures

    Visualization of database structures for information retrieval

    This paper describes the Book House system, which is designed to support children's information retrieval in libraries as part of their education. It is a shareware program available on CD‐ROM or floppy disks, and comprises functionality for database searching as well as for classifying and storing book information in the database. The system concept is based on an understanding of children's domain structures and their capabilities for categorization of information needs in connection with their activities in schools, in school libraries or in public libraries. These structures are visualized in the interface by using metaphors and multimedia technology. Through the use of text, images and animation, the Book House encourages children ‐ even at a very early age ‐ to learn by doing in an enjoyable way, which plays on their previous experiences with computer games. Both words and pictures can be used for searching; this makes the system suitable for all age groups. Even children who have not yet learned to read properly can, by selecting pictures, search for and find those books they would like to have read aloud. Thus, at the very beginning of their school life, they can learn to search for books on their own. For the library community, such a system will provide an extended service which will increase the number of children's own searches and also improve the relevance, quality and utilization of the book collections in the libraries. A market research report on the need for an annual indexing service for books in the Book House format is in preparation by the Danish Library Centre A/S

    #Socialtagging: Defining its Role in the Academic Library

    The information environment is rapidly changing, affecting the ways in which information is organized and accessed. User needs and expectations have also changed due to the overwhelming influence of Web 2.0 tools. Conventional information systems no longer support evolving user needs. Based on current research, we explore a method that integrates the structure of controlled languages with the flexibility and adaptability of social tagging. This article discusses the current research and usage of social tagging and Web 2.0 applications within the academic library. Types of tags, the semiotics of tagging and its influence on indexing are covered

    Matching Queries to Frequently Asked Questions: Search Functionality for the MRSA Web-Portal

    As part of the long-term EUREGIO MRSA-net project, a system was developed that enables health care workers and the general public to quickly find answers to their questions regarding the MRSA pathogen. This paper focuses on how these questions can be answered using Information Retrieval (IR) and Natural Language Processing (NLP) techniques on a Frequently-Asked-Questions-style (FAQ) database.