
    Content-Aware DataGuides for Indexing Large Collections of XML Documents

    XML is well-suited for modelling structured data with textual content. However, most indexing approaches perform structure and content matching independently, combining the retrieved path and keyword occurrences in a third step. This paper shows that retrieval in XML documents can be accelerated significantly by processing text and structure simultaneously during all retrieval phases. To this end, the Content-Aware DataGuide (CADG) enhances the well-known DataGuide with (1) simultaneous keyword and path matching and (2) a precomputed content/structure join. Extensive experiments prove the CADG to be 50-90% faster than the DataGuide for various types of queries and documents, including difficult cases such as poorly structured queries and recursive document paths. A new query classification scheme identifies precise query characteristics with a predominant influence on the performance of the individual indices. The experiments show that the CADG is applicable to many real-world applications, in particular large collections of heterogeneously structured XML documents.
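    To make the idea concrete, below is a minimal sketch (in Rust, not taken from the paper) of a path-summary tree whose nodes also carry an inverted keyword index, so a query can match structure and content in a single traversal rather than joining path and keyword results afterwards. The type names and the exact annotation layout are assumptions for illustration only, not the CADG's actual design.

```rust
use std::collections::{HashMap, HashSet};

// Sketch of a content-aware path summary: each summary node stores an
// inverted keyword index, so keyword and path matching happen together.
// Layout and names are illustrative assumptions, not the CADG itself.
#[derive(Default)]
struct PathNode {
    children: HashMap<String, PathNode>,      // element tag -> child summary node
    postings: HashMap<String, HashSet<u32>>,  // keyword -> document ids
}

impl PathNode {
    fn insert(&mut self, path: &[&str], keyword: &str, doc: u32) {
        match path.split_first() {
            Some((tag, rest)) => self
                .children
                .entry((*tag).to_string())
                .or_default()
                .insert(rest, keyword, doc),
            None => {
                self.postings
                    .entry(keyword.to_string())
                    .or_default()
                    .insert(doc);
            }
        }
    }

    // Follow the query path and look the keyword up at the target node:
    // structure and content are matched simultaneously.
    fn query(&self, path: &[&str], keyword: &str) -> HashSet<u32> {
        match path.split_first() {
            Some((tag, rest)) => self
                .children
                .get(*tag)
                .map(|c| c.query(rest, keyword))
                .unwrap_or_default(),
            None => self.postings.get(keyword).cloned().unwrap_or_default(),
        }
    }
}

fn main() {
    let mut root = PathNode::default();
    root.insert(&["article", "title"], "xml", 1);
    root.insert(&["article", "body"], "xml", 2);
    println!("{:?}", root.query(&["article", "title"], "xml")); // {1}
}
```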

    myTrustedCloud: Trusted cloud infrastructure for security-critical computation and data management

    Copyright © 2012 IEEE. Cloud Computing provides an optimal infrastructure to utilise and share both computational and data resources whilst allowing a pay-per-use model, useful to cost-effectively manage hardware investment or to maximise its utilisation. Cloud Computing also offers transitory access to scalable amounts of computational resources, something that is particularly important due to the time and financial constraints of many user communities. The growing number of communities that are adopting large public cloud resources such as Amazon Web Services [1] or Microsoft Azure [2] proves the success and hence usefulness of the Cloud Computing paradigm. Nonetheless, the typical use cases for public clouds involve non-business-critical applications, particularly where concerns about the security of applications or of data deposited within shared public services are binding requisites. In this paper, a use case is presented illustrating how the integration of Trusted Computing technologies into an available cloud infrastructure - Eucalyptus - allows the security-critical energy industry to exploit the flexibility and potential economic benefits of the Cloud Computing paradigm for their business-critical applications.

    High Performance Document Store Implementation in Rust

    Databases are a core part of any application which requires persistence of data. The performance of applications involving the use of database systems is directly proportional to how fast their database read-write operations are. The aim of this project was to build a high-performance document store which can support a variety of applications which require data storage and retrieval of some kind. This document store can be used as an independently running backend service which can be utilized by search engines, applications which deal with keeping records, etc. We used Rust to make this document store fast, robust, and memory efficient. The document store is a server which can return documents based on the key provided in the request. It has the capability to read WebArchive (warc extension) files as a feed to the linear-hashing-based datastore. The paper focuses on the implementation of this application and how it is a relevant backend system for a search engine. We performed various tests for the functionalities that are offered by the application and documented the noteworthy results. The linear hash-table can insert 10,000 records with key-value pairs sized 16 bytes in 10 seconds, where a similar implementation in JavaScript takes around 40 seconds for the same workload. The insertion time in our implementation increases logarithmically. The hash-table supports retrieval of 10,000 similarly sized records in under 5 seconds. The WebArchive parser utility supports the parsing of 10,000 records of a compressed (gzip extension) warc file in an average of 70 seconds, which is approximately the same as the warcio library in Python; this time increases linearly with the number of records read. Together with the conversion of warc records into a format suitable for the linear hash-table, the application takes an average of 80 seconds. The warc utility also has the ability to write warc files: it can write 10,000 warc files with a body of around 60 bytes each in an average of 2 seconds. Detailed performance comparisons with other similar tools are also documented in the paper.
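    The abstract's central data structure is a linear hash table, which grows one bucket at a time instead of doubling the whole table. The following Rust sketch illustrates that split rule under stated assumptions; the names (LinearHashMap, MAX_LOAD) and the in-memory layout are illustrative only, since the paper's store adds persistence, WARC ingestion, and a server front end.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Minimal in-memory sketch of linear hashing; not the paper's implementation.
struct LinearHashMap {
    buckets: Vec<Vec<(String, String)>>,
    level: u32,   // current round of splitting
    split: usize, // next bucket to split
    items: usize,
}

const INITIAL_BUCKETS: usize = 4;
const MAX_LOAD: f64 = 0.75; // assumed load-factor threshold

fn hash(key: &str) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

impl LinearHashMap {
    fn new() -> Self {
        Self { buckets: vec![Vec::new(); INITIAL_BUCKETS], level: 0, split: 0, items: 0 }
    }

    // Address with h mod (N * 2^level); buckets before the split pointer
    // have already been split, so they use the next level instead.
    fn index(&self, key: &str) -> usize {
        let h = hash(key) as usize;
        let idx = h % (INITIAL_BUCKETS << self.level);
        if idx < self.split { h % (INITIAL_BUCKETS << (self.level + 1)) } else { idx }
    }

    fn insert(&mut self, key: String, value: String) {
        let idx = self.index(&key);
        self.buckets[idx].push((key, value));
        self.items += 1;
        if self.items as f64 / self.buckets.len() as f64 > MAX_LOAD {
            self.split_one();
        }
    }

    fn get(&self, key: &str) -> Option<&String> {
        self.buckets[self.index(key)]
            .iter()
            .find(|(k, _)| k == key)
            .map(|(_, v)| v)
    }

    // Split exactly one bucket per overflow: the table grows incrementally,
    // which is what keeps insertion cost smooth as the store fills up.
    fn split_one(&mut self) {
        self.buckets.push(Vec::new());
        let old = std::mem::take(&mut self.buckets[self.split]);
        let next_level_size = INITIAL_BUCKETS << (self.level + 1);
        for (k, v) in old {
            let idx = hash(&k) as usize % next_level_size;
            self.buckets[idx].push((k, v));
        }
        self.split += 1;
        if self.split == INITIAL_BUCKETS << self.level {
            self.level += 1;
            self.split = 0;
        }
    }
}

fn main() {
    let mut store = LinearHashMap::new();
    for i in 0..1000 {
        store.insert(format!("key{i}"), format!("value{i}"));
    }
    assert_eq!(store.get("key42"), Some(&"value42".to_string()));
    println!("ok: {} buckets", store.buckets.len());
}
```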

    Flattening an object algebra to provide performance

    Algebraic transformation and optimization techniques have been the method of choice in relational query execution, but applying them in object-oriented (OO) DBMSs is difficult due to the complexity of OO query languages. This paper demonstrates that the problem can be simplified by mapping an OO data model to the binary relational model implemented by Monet, a state-of-the-art database kernel. We present a generic mapping scheme to flatten data models and study the case of a straightforward OO model. We show how flattening enabled us to implement a query algebra using only a very limited set of simple operations. The required primitives and query execution strategies are discussed, and their performance is evaluated on the 1-GByte TPC-D benchmark (the Transaction Processing Performance Council's Benchmark D), showing that our divide-and-conquer approach yields excellent results.
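    The flattening idea is easiest to see on a tiny example: each object attribute becomes its own binary relation of (object id, value) pairs, and queries reduce to selections and joins over those tables. The Rust sketch below illustrates this vertical decomposition under stated assumptions; the type names and the join helper are illustrative, not Monet's actual API.

```rust
// Sketch of flattening objects into binary relations (oid, value), in the
// spirit of the Monet mapping described above. Illustrative only.
#[derive(Debug)]
struct Person { name: String, age: u32 }

fn main() {
    let people = vec![
        (1u32, Person { name: "Ada".into(), age: 36 }),
        (2u32, Person { name: "Alan".into(), age: 41 }),
    ];

    // Vertical decomposition: one binary table per attribute.
    let name_bat: Vec<(u32, String)> =
        people.iter().map(|(oid, p)| (*oid, p.name.clone())).collect();
    let age_bat: Vec<(u32, u32)> =
        people.iter().map(|(oid, p)| (*oid, p.age)).collect();

    // "Names of people older than 40" becomes a select on one binary table
    // followed by a join on oid -- simple, set-oriented primitives.
    let old_oids: Vec<u32> = age_bat.iter()
        .filter(|(_, age)| *age > 40)
        .map(|(oid, _)| *oid)
        .collect();
    let names: Vec<&String> = name_bat.iter()
        .filter(|(oid, _)| old_oids.contains(oid))
        .map(|(_, name)| name)
        .collect();

    println!("{names:?}"); // ["Alan"]
}
```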

    Audiovisual preservation strategies, data models and value-chains

    This is a report on preservation strategies, models and value-chains for digital file-based audiovisual content. The report includes: (a) current and emerging value-chains and business models for audiovisual preservation; (b) a comparison of preservation strategies for audiovisual content, including their strengths and weaknesses; and (c) a review of current preservation metadata models and requirements for extension to support audiovisual files.

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been introduced to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions. Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors.
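    For readers unfamiliar with the programming model the survey builds on, here is a minimal single-process sketch of MapReduce-style word counting in Rust: a map function emits (key, value) pairs, a shuffle step groups them by key, and a reduce function folds each group. The function names are illustrative assumptions; a real framework distributes these phases over a cluster and handles data distribution, scheduling and fault tolerance, which this sketch deliberately omits.

```rust
use std::collections::HashMap;

// Map: emit one (word, 1) pair per token in the input split.
fn map_phase(doc: &str) -> Vec<(String, u32)> {
    doc.split_whitespace().map(|w| (w.to_lowercase(), 1)).collect()
}

// Shuffle: group all emitted values by key (done by the framework in practice).
fn shuffle(pairs: Vec<(String, u32)>) -> HashMap<String, Vec<u32>> {
    let mut groups: HashMap<String, Vec<u32>> = HashMap::new();
    for (k, v) in pairs {
        groups.entry(k).or_default().push(v);
    }
    groups
}

// Reduce: fold the values of one key into a single result.
fn reduce_phase(key: &str, values: &[u32]) -> (String, u32) {
    (key.to_string(), values.iter().sum())
}

fn main() {
    let docs = ["the quick brown fox", "the lazy dog", "the fox"];
    let mapped: Vec<(String, u32)> = docs.iter().flat_map(|d| map_phase(d)).collect();
    let grouped = shuffle(mapped);
    let mut counts: Vec<(String, u32)> =
        grouped.iter().map(|(k, v)| reduce_phase(k, v)).collect();
    counts.sort();
    println!("{counts:?}"); // e.g. [("brown", 1), ..., ("the", 3)]
}
```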
