232 research outputs found

    Scalable Storage for Digital Libraries

    Get PDF
    I propose a storage system optimised for digital libraries. Its key features are its heterogeneous scalability; its integration and exploitation of rich semantic metadata associated with digital objects; its use of a name space; and its aggressive performance optimisation in the digital library domain

    NV-tree : a scalable disk-based high-dimensional index

    Get PDF
    This thesis presents the NV-tree (Nearest Vector tree), which addresses thespecific problem of efficiently and effectively finding the approximatek-nearest neighbors within large collections of high-dimensional data points.The NV-tree is a very compact index, as only six bytes are kept in the in-dex for each high-dimensional descriptor. It thus scales extremely well whenindexing large collections of high-dimensional descriptors. The NV-tree ef-ficiently produces results of good quality, even at such a large scale that theindices can no longer be kept entirely in main memory. We demonstrate thiswith extensive experiments presenting results from various collection sizesfrom 36 million up to nearly 30 billion SIFT (Scale Invariant Feature Trans-form) descriptors.We also study the conditions under which a nearest neighbour search pro-vides meaningful results. Following this analysis we compare the NV-tree toLSH (Locality Sensitive Hashing), the most popular method for -distancesearch, showing that the NV-tree outperforms LSH when it comes to theproblem of nearest neighbour retrieval. Beyond this analysis we also dis-cuss how the NV-tree index can be used in practise in industrial applicationsand address two frequently overlooked requirements: dynamicity—the abil-ity to cope with on-line insertions of new high-dimensional items into theindexed collection—and durability—the ability to recover from crashes andavoid losing the indexed data if a failure occurs. As far as we know, no othernearest neighbor algorithm published so far is able to cope with all threerequirements: scale, dynamicity and durability.Í þessari ritgerð setjum við fram vísinn NV-tré (e. NV-tree) sem lausn áákveðnu afmörkuðu vandamáli: að finna, á hraðvirkan og markvirkan hátt,nálgun áknæstu nágrönnum í stóru safni margvíðra gagnapunkta. NV-tréðer mjög fyrirferðarlítill vísir, þar sem aðeins sex bæti eru geymd fyrir hvernmargvíðan lýsivektor (e. descriptor). NV-tréð skalast því mjög vel þegar þvíer beitt á stór söfn margvíðra lýsivektora. NV-tréð skilar góðum niðurstöðumá skömmum tíma, jafnvel þegar vísarnir komast ekki fyrir í minni. Viðsýnum fram á þetta með niðurstöðum tilrauna á söfnum sem innihalda frá 36milljónum upp í nærri 30 milljarða SIFT (e. Scale Invariant Feature Trans-form) lýsivektora. Við rannsökum einnig þau skilyrði sem þurfa að vera fyrir hendi til að leitað næstu nágrönnum skili merkingarbærum niðurstöðum. Í framhaldi afþeirri greiningu berum við NV-tréð saman við LSH (e. Locality SensitiveHashing), sem er vinsælasta aðferðin fyrir -fjarlægðarleit, og sýnum að NV-tréð er mun hraðvirkara en LSH. Til viðbótar við þessa greiningu ræðumvið hagnýtingu NV-trésins í iðnaði og uppfyllum tvær þarfir sem oft er litiðframhjá: breytileika (e. dynamicity)—getu til að höndla í rauntíma viðbæ-tur við lýsingasafnið—og varanleika (e. durability)—getu til að endurheimtavísinn og forðast gagnatap ef um tölvubilun er að ræða. Að því er við bestvitum, uppfyllir enginn annar þekktur vísir allar þessar þrjár þarfir: skalan-leika, breytileika og varanleika

    Database machines in support of very large databases

    Get PDF
    Software database management systems were developed in response to the needs of early data processing applications. Database machine research developed as a result of certain performance deficiencies of these software systems. This thesis discusses the history of database machines designed to improve the performance of database processing and focuses primarily on the Teradata DBC/1012, the only successfully marketed database machine that supports very large databases today. Also reviewed is the response of IBM to the performance needs of its database customers; this response has been in terms of improvements in both software and hardware support for database processing. In conclusion, an analysis is made of the future of database machines, in particular the DBC/1012, in light of recent IBM enhancements and its immense customer base

    Data partitioning and load balancing in parallel disk systems

    Get PDF
    Parallel disk systems provide opportunities for exploiting I/O parallelism in two possible ways, namely via inter-request and intra-request parallelism. In this paper we discuss the main issues in performance tuning of such systems, namely striping and load balancing, and show their relationship to response time and throughput. We outline the main components of an intelligent file system that optimizes striping by taking into account the requirements of the applications, and performs load balancing by judicious file allocation and dynamic redistributions of the data when access patterns change. Our system uses simple but effective heuristics that incur only little overhead. We present performance experiments based on synthetic workloads and real-life traces

    A Survey on Spatial Indexing

    Get PDF
    Spatial information processing has been a centre of attention of research in the previous decade. In spatial databases, data related with spatial coordinates and extents are retrieved based on spatial proximity. A large number of spatial indexes have been proposed to make ease of efficient indexing of spatial objects in large databases and spatial data retrieval. The goal of this paper is to review the advance techniques of the access methods. This paper tries to classify the existing multidimensional access methods, according to the types of indexing, and their performance over spatial queries. K-d trees out performs quad tress without requiring additional memory usage

    The 1993 Space and Earth Science Data Compression Workshop

    Get PDF
    The Earth Observing System Data and Information System (EOSDIS) is described in terms of its data volume, data rate, and data distribution requirements. Opportunities for data compression in EOSDIS are discussed

    ULTRA-FAST AND MEMORY-EFFICIENT LOOKUPS FOR CLOUD, NETWORKED SYSTEMS, AND MASSIVE DATA MANAGEMENT

    Get PDF
    Systems that process big data (e.g., high-traffic networks and large-scale storage) prefer data structures and algorithms with small memory and fast processing speed. Efficient and fast algorithms play an essential role in system design, despite the improvement of hardware. This dissertation is organized around a novel algorithm called Othello Hashing. Othello Hashing supports ultra-fast and memory-efficient key-value lookup, and it fits the requirements of the core algorithms of many large-scale systems and big data applications. Using Othello hashing, combined with domain expertise in cloud, computer networks, big data, and bioinformatics, I developed the following applications that resolve several major challenges in the area. Concise: Forwarding Information Base. A Forwarding Information Base is a data structure used by the data plane of a forwarding device to determine the proper forwarding actions for packets. The polymorphic property of Othello Hashing the separation of its query and control functionalities, which is a perfect match to the programmable networks such as Software Defined Networks. Using Othello Hashing, we built a fast and scalable FIB named \textit{Concise}. Extensive evaluation results on three different platforms show that Concise outperforms other FIB designs. SDLB: Cloud Load Balancer. In a cloud network, the layer-4 load balancer servers is a device that acts as a reverse proxy and distributes network or application traffic across a number of servers. We built a software load balancer with Othello Hashing techniques named SDLB. SDLB is able to accomplish two functionalities of the SDLB using one Othello query: to find the designated server for packets of ongoing sessions and to distribute new or session-free packets. MetaOthello: Taxonomic Classification of Metagenomic Sequences. Metagenomic read classification is a critical step in the identification and quantification of microbial species sampled by high-throughput sequencing. Due to the growing popularity of metagenomic data in both basic science and clinical applications, as well as the increasing volume of data being generated, efficient and accurate algorithms are in high demand. We built a system to support efficient classification of taxonomic sequences using its k-mer signatures. SeqOthello: RNA-seq Sequence Search Engine. Advances in the study of functional genomics produced a vast supply of RNA-seq datasets. However, how to quickly query and extract information from sequencing resources remains a challenging problem and has been the bottleneck for the broader dissemination of sequencing efforts. The challenge resides in both the sheer volume of the data and its nature of unstructured representation. Using the Othello Hashing techniques, we built the SeqOthello sequence search engine. SeqOthello is a reference-free, alignment-free, and parameter-free sequence search system that supports arbitrary sequence query against large collections of RNA-seq experiments, which enables large-scale integrative studies using sequence-level data
    corecore