The NASA Astrophysics Data System: Architecture
The powerful discovery capabilities available in the ADS bibliographic
services are possible thanks to the design of a flexible search and retrieval
system based on a relational database model. Bibliographic records are stored
as a corpus of structured documents containing fielded data and metadata, while
discipline-specific knowledge is segregated in a set of files independent of
the bibliographic data itself.
The creation and management of links to both internal and external resources
associated with each bibliography in the database is made possible by
representing them as a set of document properties and their attributes.
To improve global access to the ADS data holdings, a number of mirror sites
have been created by cloning the database contents and software on a variety of
hardware and software platforms.
The procedures used to create and manage the database and its mirrors have
been written as a set of scripts that can be run in either an interactive or
unsupervised fashion.
The ADS can be accessed at http://adswww.harvard.edu
Comment: 25 pages, 8 figures, 3 tables
DSpace How-To Guide: Tips and tricks for managing common DSpace chores
This short booklet introduces the most common non-obvious customization tasks for newcomers to DSpace administration. It has been written against the current stable version of DSpace, 1.3.2.
We have tried to include instructions for different operating systems as required;
most customizations, however, work identically across platforms.
Machine aided indexing from natural language text
The NASA Lexical Dictionary (NLD) Machine Aided Indexing (MAI) system was designed to (1) reuse the indexing of the Defense Technical Information Center (DTIC); (2) reuse the indexing of the Department of Energy (DOE); and (3) reduce the time required for original indexing. This was done by automatically generating appropriate NASA thesaurus terms from either the other agency's index terms, or, for original indexing, from document titles and abstracts. The NASA STI Program staff devised two different ways to generate thesaurus terms from text. The first group of programs identified noun phrases by a parsing method that allowed for conjunctions and certain prepositions, on the assumption that indexable concepts are found in such phrases. Results were not always satisfactory, and it was noted that indexable concepts often occurred outside of noun phrases. The first method also proved to be too slow for the ultimate goal of interactive (online) MAI. The second group of programs used the knowledge base (KB), word proximity, and frequency of word and phrase occurrence to identify indexable concepts. Both methods are described and illustrated. Online MAI has been achieved, as well as several spinoff benefits, which are also described
BlogForever D2.4: Weblog spider prototype and associated methodology
This document presents the evaluation of different solutions for capturing blogs, sets out the established methodology, and describes the developed blog spider prototype.
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
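
To make the contrast between gap-encoding and run-length compression of differential lists concrete, here is a minimal Python sketch. It is an illustration only, not the encoding used in the paper: the function names and the sample posting list are assumptions, and a real index would further compress the resulting pairs with a variable-length integer code.

```python
from itertools import groupby


def gap_encode(postings):
    """Turn a sorted list of document ids into differences between neighbours."""
    gaps = []
    previous = 0
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def run_length_encode(gaps):
    """Collapse consecutive equal gaps into (gap, count) pairs."""
    return [(gap, len(list(run))) for gap, run in groupby(gaps)]


def decode(runs):
    """Rebuild the original posting list from (gap, count) pairs."""
    postings = []
    current = 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Hypothetical posting list: a word kept across consecutive versions of a
# document appears under consecutive ids, so its gap list has long runs of 1.
postings = [3, 4, 5, 6, 7, 8, 20, 21, 22, 23]
gaps = gap_encode(postings)
runs = run_length_encode(gaps)
```

In this example the ten gaps collapse into four pairs, and the saving grows with the number of near-identical versions, because each additional version only lengthens an existing run.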
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.
Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Skłodowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
Spoken query processing for interactive information retrieval
It has long been recognised that interactivity improves the effectiveness of information retrieval systems. Speech is the most natural and interactive medium of communication and recent progress in speech recognition is making it possible to build systems that interact with the user via speech. However, given the typical length of queries submitted to information retrieval systems, it is easy to imagine that the effects of word recognition errors in spoken queries must be severely destructive on the system's effectiveness. The experimental work reported in this paper shows that the use of classical information retrieval techniques for spoken query processing is robust to considerably high levels of word recognition errors, in particular for long queries. Moreover, in the case of short queries, both standard relevance feedback and pseudo relevance feedback can be effectively employed to improve the effectiveness of spoken query processing
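
The pseudo relevance feedback mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not the system evaluated in the paper: the term-count scoring, the function names, the parameters `top_k` and `extra_terms`, and the sample documents are all assumptions made for the example.

```python
from collections import Counter


def score(query_terms, doc_terms):
    """Sum the occurrences of each distinct query term in the document."""
    counts = Counter(doc_terms)
    return sum(counts[term] for term in set(query_terms))


def pseudo_relevance_feedback(query, docs, top_k=2, extra_terms=2):
    """Expand a query with frequent terms from the top-ranked documents."""
    query_terms = query.lower().split()
    tokenized = [doc.lower().split() for doc in docs]
    # First pass: rank documents by the original (possibly misrecognised) query.
    ranked = sorted(
        range(len(docs)),
        key=lambda i: score(query_terms, tokenized[i]),
        reverse=True,
    )
    # Assume the top_k documents are relevant and pool their non-query terms.
    pool = Counter()
    for i in ranked[:top_k]:
        pool.update(t for t in tokenized[i] if t not in query_terms)
    expansion = [term for term, _ in pool.most_common(extra_terms)]
    return query_terms + expansion


# Hypothetical three-document collection.
docs = [
    "speech recognition errors recognition speech recognition",
    "cooking pasta recipes",
    "speech retrieval errors retrieval",
]
expanded = pseudo_relevance_feedback("speech errors", docs)
```

The expanded query adds terms from the top-ranked documents, which gives a short query more terms to match on in a second retrieval pass.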