Search CORE

579 research outputs found

Efficient data representation for XML in peer-based systems

Author: Christopher Foley
Gourlay R.
Tripney B.
Wilson J.
Publication venue: 'Emerald'
Publication date: 01/01/2010
Field of study

Purpose - New directions in the provision of end-user computing experiences mean that the best way to share data between small mobile computing devices needs to be determined. Partitioning large structures so that they can be shared efficiently provides a basis for data-intensive applications on such platforms. The partitioned structure can be compressed using dictionary-based approaches and then directly queried without firstly decompressing the whole structure. Design/methodology/approach - The paper describes an architecture for partitioning XML into structural and dictionary elements and the subsequent manipulation of the dictionary elements to make the best use of available space. Findings - The results indicate that considerable savings are available by removing duplicate dictionaries. The paper also identifies the most effective strategy for defining dictionary scope. Research limitations/implications - This evaluation is based on a range of benchmark XML structures and the approach to minimising dictionary size shows benefit in the majority of these. Where structures are small and regular, the benefits of efficient dictionary representation are lost. The authors' future research now focuses on heuristics for further partitioning of structural elements. Practical implications - Mobile applications that need access to large data collections will benefit from the findings of this research. Traditional client/server architectures are not suited to dealing with high volume demands from a multitude of small mobile devices. Peer data sharing provides a more scalable solution and the experiments that the paper describes demonstrate the most effective way of sharing data in this context. Social implications - Many services are available via smartphone devices but users are wary of exploiting the full potential because of the need to conserve battery power. The approach mitigates this challenge and consequently expands the potential for users to benefit from mobile information systems. This will have impact in areas such as advertising, entertainment and education but will depend on the acceptability of file sharing being extended from the desktop to the mobile environment. Originality/value - The original work characterises the most effective way of sharing large data sets between small mobile devices. This will save battery power on devices such as smartphones, thus providing benefits to users of such devices

Crossref

University of Strathclyde Institutional Repository

Enlighten

Compressed materialised views of semi-structured data

Author: Gourlay Richard
Tripney Brian
Wilson John
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2007
Field of study

Query performance issues over semi-structured data have led to the emergence of materialised XML views as a means of restricting the data structure processed by a query. However preserving the conventional representation of such views remains a significant limiting factor especially in the context of mobile devices where processing power, memory usage and bandwidth are significant factors. To explore the concept of a compressed materialised view, we extend our earlier work on structural XML compression to produce a combination of structural summarisation and data compression techniques. These techniques provide a basis for efficiently dealing with both structural queries and valuebased predicates. We evaluate the effectiveness of such a scheme, presenting results and performance measures that show advantages of using such structures

Crossref

University of Strathclyde Institutional Repository

Enlighten

Designing a resource-efficient data structure for mobile data systems

Author: Gourlay Richard Scott
Publication venue
Publication date: 01/07/2006
Field of study

Designing data structures for use in mobile devices requires attention on optimising data volumes with associated benefits for data transmission, storage space and battery use. For semi-structured data, tree summarisation techniques can be used to reduce the volume of structured elements while dictionary compression can efficiently deal with value-based predicates. This project seeks to investigate and evaluate an integration of the two approaches. The key strength of this technique is that both structural and value predicates could be resolved within one graph while further allowing for compression of the resulting data structure. As the current trend is towards the requirement for working with larger semi-structured data sets this work would allow for the utilisation of much larger data sets whilst reducing requirements on bandwidth and minimising the memory necessary both for the storage and querying of the data

University of Strathclyde Institutional Repository

bdbms -- A Database Management System for Biological Data

Author: Aref Walid G.
Eltabakh Mohamed Y.
Ouzzani Mourad
Publication venue
Publication date: 01/12/2006
Field of study

Biologists are increasingly using databases for storing and managing their data. Biological databases typically consist of a mixture of raw data, metadata, sequences, annotations, and related data obtained from various sources. Current database technology lacks several functionalities that are needed by biological databases. In this paper, we introduce bdbms, an extensible prototype database management system for supporting biological data. bdbms extends the functionalities of current DBMSs to include: (1) Annotation and provenance management including storage, indexing, manipulation, and querying of annotation and provenance as first class objects in bdbms, (2) Local dependency tracking to track the dependencies and derivations among data items, (3) Update authorization to support data curation via content-based authorization, in contrast to identity-based authorization, and (4) New access methods and their supporting operators that support pattern matching on various types of compressed biological data types. This paper presents the design of bdbms along with the techniques proposed to support these functionalities including an extension to SQL. We also outline some open issues in building bdbms.Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/.) You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but, you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR) January 710, 2007, Asilomar, California, US

arXiv.org e-Print Archive

CiteSeerX

Purdue E-Pubs

Efficient storage and decoding of SURF feature points

Author: McCusker Kealan
McGuinness Kevin
O'Connor Noel E.
O'Hare Neil
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 01/01/2012
Field of study

Practical use of SURF feature points in large-scale indexing and retrieval engines requires an efficient means for storing and decoding these features. This paper investigates several methods for compression and storage of SURF feature points, considering both storage consumption and disk-read efficiency. We compare each scheme with a baseline plain-text encoding scheme as used by many existing SURF implementations. Our final proposed scheme significantly reduces both the time required to load and decode feature points, and the space required to store them on disk

Crossref

Irish Universities

DCU Online Research Access Service

TopSig: Topology Preserving Document Signatures

Author: De Vries Christopher M.
Geva Shlomo
Publication venue
Publication date: 01/01/2011
Field of study

Performance comparisons between File Signatures and Inverted Files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but these were not so far linked to general purpose, signature file based, search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We are able to demonstrate significant improvements in the performance of signature file based indexing and retrieval, performance that is comparable to that of state of the art inverted file based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings and from the theoretical perspective it positions the file signatures model in the class of Vector Space retrieval models.Comment: 12 pages, 8 figures, CIKM 201

arXiv.org e-Print Archive

CiteSeerX

Crossref

Queensland University of Technology ePrints Archive

간결한 자료구조를 활용한 반구조화된 문서 형식들의 공간 효율적 표현법

Author: 이준희
Publication venue: 서울대학교 대학원
Publication date: 01/02/2021
Field of study

학위논문 (박사) -- 서울대학교 대학원 : 공과대학 전기·컴퓨터공학부, 2021. 2. Srinivasa Rao Satti.Numerous big data are generated from a plethora of sources. Most of the data stored as files contain a non-fixed type of schema, so that the files are suitable to be maintained as semi-structured document formats. A number of those formats, such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language) are suggested to sustain hierarchy in the original corpora of data. Several data models structuring the gathered data - including RDF (Resource Description Framework) - depend on the semi-structured document formats to be serialized and transferred for future processing. Since the semi-structured document formats focus on readability and verbosity, redundant space is required to organize and maintain the document. Even though general-purpose compression schemes are widely used to compact the documents, applying those algorithms hinder future handling of the corpora, owing to loss of internal structures. The area of succinct data structures is widely investigated and researched in theory, to provide answers to the queries while the encoded data occupy space close to the information-theoretic lower bound. Bit vectors and trees are the notable succinct data structures. Nevertheless, there were few attempts to apply the idea of succinct data structures to represent the semi-structured documents in space-efficient manner. In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. The core functionality of this representation is its compactness and query-ability derived from enriched functions of succinct data structures. Incorporation of (a) bit indexed arrays, (b) succinct ordinal trees, and (c) compression techniques engineers the compact representation. We implement this representation in practice, and show by experiments that construction of this representation decreases the disk usage by up to 60% while occupying 90% less RAM. We also allow processing a document in partial manner, to allow processing of larger corpus of big data even in the constrained environment. In parallel to establishing the aforementioned compact semi-structured document representation, we provide and reinforce some of the existing compression schemes in this dissertation. We first suggest an idea to encode an array of integers that is not necessarily sorted. This compaction scheme improves upon the existing universal code systems, by assistance of succinct bit vector structure. We show that our suggested algorithm reduces space usage by up to 44% while consuming 15% less time than the original code system, while the algorithm additionally supports random access of elements upon the encoded array. We also reinforce the SBH bitmap index compression algorithm. The main strength of this scheme is the use of intermediate super-bucket during operations, giving better performance on querying through a combination of compressed bitmap indexes. Inspired from splits done during the intermediate process of the SBH algorithm, we give an improved compression mechanism supporting parallelism that could be utilized in both CPUs and GPUs. We show by experiments that this CPU parallel processing optimization diminishes compression and decompression times by up to 38% in a 4-core machine without modifying the bitmap compressed form. For GPUs, the new algorithm gives 48% faster query processing time in the experiments, compared to the previous existing bitmap index compression schemes.셀 수 없는 빅 데이터가 다양한 원본로부터 생성되고 있다. 이들 데이터의 대부분은 고정되지 않은 종류의 스키마를 포함한 파일 형태로 저장되는데, 이로 인하여 반구조화된 문서 형식을 이용하여 파일을 유지하는 것이 적합하다. XML, JSON 및 YAML과 같은 종류의 반구조화된 문서 형식이 데이터에 내재하는 구조를 유지하기 위하여 제안되었다. 수집된 데이터를 구조화하는 RDF와 같은 여러 데이터 모델들은 사후 처리를 위한 저장 및 전송을 위하여 반구조화된 문서 형식에 의존한다. 반구조화된 문서 형식은 가독성과 다변성에 집중하기 때문에, 문서를 구조화하고 유지하기 위하여 추가적인 공간을 필요로 한다. 문서를 압축시키기 위하여 일반적인 압축 기법들이 널리 사용되고 있으나, 이들 기법들을 적용하게 되면 문서의 내부 구조의 손실로 인하여 데이터의 사후 처리가 어렵게 된다. 데이터를 정보이론적 하한에 가까운 공간만을 사용하여 저장을 가능하게 하면서 질의에 대한 응답을 제공하는 간결한 자료구조는 이론적으로 널리 연구되고 있는 분야이다. 비트열과 트리가 널리 알려진 간결한 자료구조들이다. 그러나 반구조화된 문서들을 저장하는 데 간결한 자료구조의 아이디어를 적용한 연구는 거의 진행되지 않았다. 본 학위논문을 통해 우리는 다양한 종류의 반구조화된 문서 형식을 통일되게 표현하는 공간 효율적 표현법을 제시한다. 이 기법의 주요한 기능은 간결한 자료구조가 강점으로 가지는 특성에 기반한 간결성과 질의 가능성이다. 비트열로 인덱싱된 배열, 간결한 순서 있는 트리 및 다양한 압축 기법을 통합하여 해당 표현법을 고안하였다. 이 기법은 실재적으로 구현되었고, 실험을 통하여 이 기법을 적용한 반구조화된 문서들은 최대 60% 적은 디스크 공간과 90% 적은 메모리 공간을 통해 표현될 수 있다는 것을 보인다. 더불어 본 학위논문에서 반구조화된 문서들은 분할적으로 표현이 가능함을 보이고, 이를 통하여 제한된 환경에서도 빅 데이터를 표현한 문서들을 처리할 수 있다는 것을 보인다. 앞서 언급한 공간 효율적 반구조화된 문서 표현법을 구축함과 동시에, 본 학위논문에서 이미 존재하는 압축 기법 중 일부를 추가적으로 개선한다. 첫째로, 본 학위논문에서는 정렬 여부에 관계없는 정수 배열을 부호화하는 아이디어를 제시한다. 이 기법은 이미 존재하는 범용 코드 시스템을 개선한 형태로, 간결한 비트열 자료구조를 이용한다. 제안된 알고리즘은 기존 범용 코드 시스템에 비해 최대 44\% 적은 공간을 사용할 뿐만 아니라 15\% 적은 부호화 시간을 필요로 하며, 기존 시스템에서 제공하지 않는 부호화된 배열에서의 임의 접근을 지원한다. 또한 본 학위논문에서는 비트맵 인덱스 압축에 사용되는 SBH 알고리즘을 개선시킨다. 해당 기법의 주된 강점은 부호화와 복호화 진행 시 중간 매개인 슈퍼버켓을 사용함으로써 여러 압축된 비트맵 인덱스에 대한 질의 성능을 개선시키는 것이다. 위 압축 알고리즘의 중간 과정에서 진행되는 분할에서 영감을 얻어, 본 학위논문에서 CPU 및 GPU에 적용 가능한 개선된 병렬화 압축 매커니즘을 제시한다. 실험을 통해 CPU 병렬 최적화가 이루어진 알고리즘은 압축된 형태의 변형 없이 4코어 컴퓨터에서 최대 38\%의 압축 및 해제 시간을 감소시킨다는 것을 보인다. GPU 병렬 최적화는 기존에 존재하는 GPU 비트맵 압축 기법에 비해 48\% 빠른 질의 처리 시간을 필요로 함을 확인한다.Chapter 1 Introduction 1 1.1 Contribution 3 1.2 Organization 5 Chapter 2 Background 6 2.1 Model of Computation 6 2.2 Succinct Data Structures 7 Chapter 3 Space-efficient Representation of Integer Arrays 9 3.1 Introduction 9 3.2 Preliminaries 10 3.2.1 Universal Code System 10 3.2.2 Bit Vector 13 3.3 Algorithm Description 13 3.3.1 Main Principle 14 3.3.2 Optimization in the Implementation 16 3.4 Experimental Results 16 Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing 19 4.1 Introduction 19 4.2 Related Work 23 4.2.1 Byte-aligned Bitmap Code (BBC) 24 4.2.2 Word-Aligned Hybrid (WAH) 27 4.2.3 WAH-derived Algorithms 28 4.2.4 GPU-based WAH Algorithms 31 4.2.5 Super Byte-aligned Hybrid (SBH) 33 4.3 Parallelizing SBH 38 4.3.1 CPU Parallelism 38 4.3.2 GPU Parallelism 39 4.4 Experimental Results 40 4.4.1 Plain Version 41 4.4.2 Parallelized Version 46 4.4.3 Summary 49 Chapter 5 Space-efficient Representation of Semi-structured Document Formats 50 5.1 Preliminaries 50 5.1.1 Semi-structured Document Formats 50 5.1.2 Resource Description Framework 57 5.1.3 Succinct Ordinal Tree Representations 60 5.1.4 String Compression Schemes 64 5.2 Representation 66 5.2.1 Bit String Indexed Array 67 5.2.2 Main Structure 68 5.2.3 Single Document as a Collection of Chunks 72 5.2.4 Supporting Queries 73 5.3 Experimental Results 75 5.3.1 Datasets 76 5.3.2 Construction Time 78 5.3.3 RAM Usage during Construction 80 5.3.4 Disk Usage and Serialization Time 83 5.3.5 Chunk Division 83 5.3.6 String Compression 88 5.3.7 Query Time 89 Chapter 6 Conclusion 94 Bibliography 96 요약 109 Acknowledgements 111Docto

SNU Open Repository and Archive

Enhancing Content-And-Structure Information Retrieval using a Native XML Database

Author: Pehcevski Jovan
Thom James A.
Vercoustre Anne-Marie
Publication venue
Publication date: 01/01/2004
Field of study

Three approaches to content-and-structure XML retrieval are analysed in this paper: first by using Zettair, a full-text information retrieval system; second by using eXist, a native XML database, and third by using a hybrid XML retrieval system that uses eXist to produce the final answers from likely relevant articles retrieved by Zettair. INEX 2003 content-and-structure topics can be classified in two categories: the first retrieving full articles as final answers, and the second retrieving more specific elements within articles as final answers. We show that for both topic categories our initial hybrid system improves the retrieval effectiveness of a native XML database. For ranking the final answer elements, we propose and evaluate a novel retrieval model that utilises the structural relationships between the answer elements of a native XML database and retrieves Coherent Retrieval Elements. The final results of our experiments show that when the XML retrieval task focusses on highly relevant elements our hybrid XML retrieval system with the Coherent Retrieval Elements module is 1.8 times more effective than Zettair and 3 times more effective than eXist, and yields an effective content-and-structure XML retrieval

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

Hal-Diderot

Compressed Text Indexes:From Theory to Practice!

Author: Ferragina Paolo
Gonzalez Rodrigo
Navarro Gonzalo
Venturini Rossano
Publication venue
Publication date: 01/01/2007
Field of study

A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications. The goal of this paper is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective testbeds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel and exciting technology

arXiv.org e-Print Archive

CiteSeerX

Archivio della Ricerca - Università di Pisa