4,222 research outputs found
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Sk{\l}odowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
Impliance: A Next Generation Information Management Appliance
ably successful in building a large market and adapting to the changes of the
last three decades, its impact on the broader market of information management
is surprisingly limited. If we were to design an information management system
from scratch, based upon today's requirements and hardware capabilities, would
it look anything like today's database systems?" In this paper, we introduce
Impliance, a next-generation information management system consisting of
hardware and software components integrated to form an easy-to-administer
appliance that can store, retrieve, and analyze all types of structured,
semi-structured, and unstructured information. We first summarize the trends
that will shape information management for the foreseeable future. Those trends
imply three major requirements for Impliance: (1) to be able to store, manage,
and uniformly query all data, not just structured records; (2) to be able to
scale out as the volume of this data grows; and (3) to be simple and robust in
operation. We then describe four key ideas that are uniquely combined in
Impliance to address these requirements, namely the ideas of: (a) integrating
software and off-the-shelf hardware into a generic information appliance; (b)
automatically discovering, organizing, and managing all data - unstructured as
well as structured - in a uniform way; (c) achieving scale-out by exploiting
simple, massive parallel processing, and (d) virtualizing compute and storage
resources to unify, simplify, and streamline the management of Impliance.
Impliance is an ambitious, long-term effort to define simpler, more robust, and
more scalable information systems for tomorrow's enterprises.Comment: This article is published under a Creative Commons License Agreement
(http://creativecommons.org/licenses/by/2.5/.) You may copy, distribute,
display, and perform the work, make derivative works and make commercial use
of the work, but, you must attribute the work to the author and CIDR 2007.
3rd Biennial Conference on Innovative Data Systems Research (CIDR) January
710, 2007, Asilomar, California, US
Recommended from our members
ICOPER Project - Deliverable 4.3 ISURE: Recommendations for extending effective reuse, embodied in the ICOPER CD&R
The purpose of this document is to capture the ideas and recommendations, within and beyond the ICOPER community, concerning the reuse of learning content, including appropriate methodologies as well as established strategies for remixing and repurposing reusable resources. The overall remit of this work focuses on describing the key issues that are related to extending effective reuse embodied in such materials. The objective of this investigation, is to support the reuse of learning content whilst considering how it could be originally created and then adapted with that ‘reuse’ in mind. In these circumstances a survey on effective reuse best practices can often provide an insight into the main challenges and benefits involved in the process of creating, remixing and repurposing what we are now designating as Reusable Learning Content (RLC).
Several key issues are analysed in this report: Recommendations for extending effective reuse, building upon those described in the previous related deliverables 4.1 Content Development Methodologies and 4.2 Quality Control and Web 2.0 technologies. The findings of this current survey, however, provide further recommendations and strategies for using and developing this reusable learning content. In the spirit of ‘reuse’, this work also aims to serve as a foundation for the many different stakeholders and users within, and beyond, the ICOPER community who are interested in reusing learning resources.
This report analyses a variety of information. Evidence has been gathered from a qualitative survey that has focused on the technical and pedagogical recommendations suggested by a Special Interest Group (SIG) on the most innovative practices with respect to new media content authors (for content authoring or modification) and course designers (for unit creation). This extended community includes a wider collection of OER specialists. This collected evidence, in the form of video and audio interviews, has also been represented as multimedia assets potentially helpful for learning and useful as learning content in the New Media Space (See section 4 for further details).
Section 2 of this report introduces the concept of reusable learning content and reusability. Section 3 discusses an application created by the ICOPER community to enhance the opportunities for developing reusable content. Section 4 of this report provides an overview of the methodology used for the qualitative survey. Section 5 presents a summary of thematic findings. Section 6 highlights a list of recommendations for effective reuse of educational content, which were derived from thematic analysis described in Appendix A. Finally, section 7 summarises the key outcomes of this work
Designing Reusable Systems that Can Handle Change - Description-Driven Systems : Revisiting Object-Oriented Principles
In the age of the Cloud and so-called Big Data systems must be increasingly
flexible, reconfigurable and adaptable to change in addition to being developed
rapidly. As a consequence, designing systems to cater for evolution is becoming
critical to their success. To be able to cope with change, systems must have
the capability of reuse and the ability to adapt as and when necessary to
changes in requirements. Allowing systems to be self-describing is one way to
facilitate this. To address the issues of reuse in designing evolvable systems,
this paper proposes a so-called description-driven approach to systems design.
This approach enables new versions of data structures and processes to be
created alongside the old, thereby providing a history of changes to the
underlying data models and enabling the capture of provenance data. The
efficacy of the description-driven approach is exemplified by the CRISTAL
project. CRISTAL is based on description-driven design principles; it uses
versions of stored descriptions to define various versions of data which can be
stored in diverse forms. This paper discusses the need for capturing holistic
system description when modelling large-scale distributed systems.Comment: 8 pages, 1 figure and 1 table. Accepted by the 9th Int Conf on the
Evaluation of Novel Approaches to Software Engineering (ENASE'14). Lisbon,
Portugal. April 201
- …