
    Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets

    Similarity search is critical for many database applications, including the increasingly popular online services for Content-Based Multimedia Retrieval (CBMR). These services, which include image search engines, must handle an overwhelming volume of data while keeping response times low. Scalability is therefore imperative for similarity search in Web-scale applications, but most existing methods are sequential and target shared-memory machines. Here we address these issues with a distributed, efficient, and scalable index based on Locality-Sensitive Hashing (LSH). LSH is one of the most efficient and popular techniques for similarity search, but its poor referential locality has made its implementation a challenging problem. Our solution is based on a widely asynchronous dataflow parallelization with a number of optimizations, including a hierarchical parallelization that decouples indexing and data storage, locality-aware data partition strategies that reduce message passing, and multi-probing to limit memory usage. The proposed parallelization attained an efficiency of 90% on a distributed system with about 800 CPU cores. In particular, the locality-aware data partition reduced the number of messages exchanged by 30%. Our parallel LSH was evaluated using the largest public dataset for similarity search (to the best of our knowledge), with 10^9 128-d SIFT descriptors extracted from Web images. This is two orders of magnitude larger than the datasets that previous LSH parallelizations could handle.
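
    A minimal, single-node sketch of the underlying LSH bucketing, assuming p-stable (Gaussian) random projections, the family commonly used for Euclidean search over SIFT-like descriptors; the table count, projection count, and bucket width are illustrative, and the distributed dataflow parallelization, hierarchical decoupling, and multi-probing described above are not reproduced.

    import numpy as np
    from collections import defaultdict

    class LSHIndex:
        """Toy multi-table LSH index over dense float vectors."""

        def __init__(self, dim=128, n_tables=8, n_projections=12, bucket_width=4.0, seed=0):
            rng = np.random.default_rng(seed)
            self.w = bucket_width
            # One (k x dim) projection matrix and offset vector per hash table.
            self.projections = [rng.normal(size=(n_projections, dim)) for _ in range(n_tables)]
            self.offsets = [rng.uniform(0, bucket_width, size=n_projections) for _ in range(n_tables)]
            self.tables = [defaultdict(list) for _ in range(n_tables)]
            self.vectors = []

        def _key(self, t, v):
            # Quantized projections form the bucket key for table t.
            h = np.floor((self.projections[t] @ v + self.offsets[t]) / self.w)
            return tuple(h.astype(int))

        def add(self, v):
            idx = len(self.vectors)
            self.vectors.append(np.asarray(v, dtype=np.float32))
            for t in range(len(self.tables)):
                self.tables[t][self._key(t, v)].append(idx)

        def query(self, q, top_k=10):
            # Collect candidates from the buckets q falls into, then rank them exactly.
            candidates = set()
            for t in range(len(self.tables)):
                candidates.update(self.tables[t].get(self._key(t, q), []))
            return sorted(candidates, key=lambda i: np.linalg.norm(self.vectors[i] - q))[:top_k]

    # Usage: index 1,000 random 128-d vectors and query one of them.
    index = LSHIndex()
    data = np.random.default_rng(1).normal(size=(1000, 128))
    for v in data:
        index.add(v)
    print(index.query(data[0]))

    Multi-probing, mentioned in the abstract, would additionally visit neighbouring buckets in each table so that fewer tables (and therefore less memory) achieve comparable recall.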

    A Survey on Large Scale Metadata Server for Big Data Storage

    Big Data is defined as a high volume of varied data with an exponential growth rate. Data are amalgamated to generate revenue, which results in large data silos. Data are the oil of the modern IT industry, and they are growing at an exponential pace. The access mechanism of these data silos is defined by metadata, which are decoupled from the data servers for various reasons, for instance, ease of maintenance. The metadata are stored in a metadata server (MDS); therefore, studying the MDS is essential when designing a large-scale storage system. An MDS must accommodate many parameters in its architecture, and that architecture depends on the requirements of the storage system. Thus, MDSs are categorized in various ways depending on the underlying architecture and design methodology. This article surveys the various kinds of MDS architectures, designs, and methodologies. It emphasizes clustered MDS (cMDS), and the survey is organized around a) Bloom filter-based MDS, b) Client-funded MDS, c) Geo-aware MDS, d) Cache-aware MDS, e) Load-aware MDS, f) Hash-based MDS, and g) Tree-based MDS. Additionally, the article presents the issues and challenges of MDS for mammoth-sized data. Comment: Submitted to ACM for possible publication.
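
    A hypothetical sketch of the idea behind a Bloom filter-based MDS, one of the categories listed above: each metadata server advertises a Bloom filter over the pathnames it owns, so a client can probe the filters locally and contact only the likely owner. The filter size, hash scheme, and server names are illustrative and not taken from the survey.

    import hashlib

    class BloomFilter:
        def __init__(self, n_bits=1 << 16, n_hashes=4):
            self.n_bits, self.n_hashes = n_bits, n_hashes
            self.bits = bytearray(n_bits // 8)

        def _positions(self, key):
            # Double hashing derived from one SHA-256 digest (illustrative choice).
            digest = hashlib.sha256(key.encode()).digest()
            h1 = int.from_bytes(digest[:8], "little")
            h2 = int.from_bytes(digest[8:16], "little") | 1
            return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    # Each MDS owns a partition of the namespace and publishes its filter.
    mds_filters = {"mds0": BloomFilter(), "mds1": BloomFilter()}
    mds_filters["mds0"].add("/home/alice/data.csv")
    mds_filters["mds1"].add("/projects/genome/run42.log")

    def locate(path):
        # Returns the candidate servers; false positives are possible.
        return [name for name, bf in mds_filters.items() if bf.might_contain(path)]

    print(locate("/projects/genome/run42.log"))

    Because Bloom filters admit false positives, the selected server must still confirm that it actually owns the path before serving the metadata.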

    Comparative Analysis of Distributed and Parallel File Systems' Internal Techniques

    File system optimization is the most common task in the file system field and is usually seen as the key file system problem; optimization clearly dominates commercial development, while the problem of developing a new file system architecture arises more frequently in academia. End users tend to treat performance as the central problem of file system evolution, an understanding that stems from the common view of persistent memory as a slow subsystem. As a result, improving the performance of data processing is treated as a problem of file system performance optimization. However, the evolution of physical technologies for persistent data storage requires significant changes in the concepts and approaches behind file systems' internal techniques. Generally speaking, trying only to improve file system efficiency cannot resolve all the issues of file systems as a technological direction; moreover, it can impede the evolution of file system technology as a whole. It is impossible to satisfy end users' expectations by means of file system optimization alone. New persistent storage technologies may call the very necessity of file systems into question unless revolutionary new file system approaches are proposed. However, a file system embodies a paradigm of information structuring that is very important for the end user as a human being. Two classes of tasks therefore need to be distinguished: (1) optimization tasks and (2) tasks of elaborating a new architectural vision or paradigm. Frequently, a project goal that really requires the elaboration of a new paradigm degenerates into an optimization task. End-user expectations form a complex and contradictory set of requirements; optimization tasks alone cannot address all the current needs of end users in the file system field, and these expectations require solving the tasks of elaborating a new architectural vision or paradigm.

    Implicit LOD using points ordering for processing and visualisation in Point Cloud Servers

    Lidar datasets now commonly reach billions of points and are very dense. Using these point clouds becomes challenging, as the sheer number of points is intractable for most applications and for visualisation. In this work we propose a new paradigm to easily obtain a portable geometric Level of Details (LOD) inside a Point Cloud Server. The main idea is not to store the LOD information in an additional external file, but instead to store it implicitly by exploiting the order of the points. The point cloud is divided into groups (patches), and the points within each patch are ordered so that reading them in that order gradually provides more and more detail on the patch. We demonstrate the interest of our method with several classical uses of LOD, such as visualisation of massive point clouds, algorithm acceleration, and fast density peak detection and correction. Comment: This article is a split of a previous one, because that article covered two topics that were too loosely related.
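
    A minimal sketch of the implicit-LOD idea, assuming a greedy farthest-point ordering as a stand-in for the Point Cloud Server's own ordering scheme: the points of a patch are reordered so that any prefix of the stored order is a spatially well-spread subsample, and selecting a LOD reduces to reading the first k points.

    import numpy as np

    def lod_order(points):
        """Reorder points so that every prefix is a coarse, well-spread subsample."""
        points = np.asarray(points, dtype=float)
        n = len(points)
        order = [0]                       # start from an arbitrary seed point
        dist = np.linalg.norm(points - points[0], axis=1)
        for _ in range(n - 1):
            nxt = int(np.argmax(dist))    # point farthest from everything chosen so far
            order.append(nxt)
            dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
        return points[order]

    patch = np.random.default_rng(0).uniform(0, 1, size=(256, 3))
    ordered = lod_order(patch)
    coarse, fine = ordered[:16], ordered[:128]   # two LODs from prefixes of one array
    print(coarse.shape, fine.shape)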

    TripleID-Q: RDF Query Processing Framework using GPU

    Resource Description Framework (RDF) data represents information linkage around the Internet. It uses Internationalized Resource Identifiers (IRIs), which can refer to external information. Typically, RDF data is serialized as a large text file containing millions of relationships. In this work, we propose a framework based on TripleID-Q for query processing of large RDF data on a GPU. The key elements of the framework are 1) a compact representation suitable for a Graphics Processing Unit (GPU) and 2) a simple representation conversion method which reduces the preprocessing overhead. Together with the framework, we propose parallel algorithms which utilize thousands of GPU threads to look for specific data for a given query as well as to perform basic query operations such as union, join, and filter. The TripleID representation is 3-4 times smaller than the original representation. Querying TripleID data using a GPU is up to 108 times faster than using a traditional RDF tool, and the speedup can exceed 1,000 times over a traditional RDF store when processing a complex query with union and join of many subqueries. Comment: 14 pages.
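
    A hypothetical sketch of the compact triple-of-IDs representation described above: every subject, predicate, and object term is dictionary-encoded to an integer, so the RDF graph becomes a dense fixed-width integer array that can be copied to a GPU and scanned in parallel. The actual TripleID-Q conversion method and GPU kernels are not shown.

    import numpy as np

    def encode(triples):
        """Dictionary-encode RDF terms to integer IDs; return an ID table and the dictionary."""
        dictionary, encoded = {}, []
        for s, p, o in triples:
            ids = []
            for term in (s, p, o):
                if term not in dictionary:
                    dictionary[term] = len(dictionary)
                ids.append(dictionary[term])
            encoded.append(ids)
        return np.asarray(encoded, dtype=np.uint32), dictionary

    triples = [
        ("<http://example.org/alice>", "<http://xmlns.com/foaf/0.1/knows>", "<http://example.org/bob>"),
        ("<http://example.org/bob>",   "<http://xmlns.com/foaf/0.1/name>",  '"Bob"'),
    ]
    table, dictionary = encode(triples)

    # A simple triple-pattern scan (?s knows bob); on a GPU each thread would test one row.
    knows = dictionary["<http://xmlns.com/foaf/0.1/knows>"]
    bob = dictionary["<http://example.org/bob>"]
    matches = table[(table[:, 1] == knows) & (table[:, 2] == bob)]
    print(matches)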

    ReHub: Extending Hub Labels for Reverse k-Nearest Neighbor Queries on Large-Scale Networks

    Quite recently, the algorithmic community has focused on solving multiple shortest-path query problems beyond simple vertex-to-vertex queries, especially in the context of road networks. Unfortunately, this research does not generalize to large-scale graphs, e.g., social or collaboration networks, nor does it efficiently answer Reverse k-Nearest Neighbor (RkNN) queries, which are of practical relevance to a wide range of applications. To remedy this, we propose ReHub, a novel main-memory algorithm that extends the Hub Labeling technique to efficiently answer RkNN queries on large-scale networks. Our experimentation shows that ReHub is the best overall solution for this type of query, requiring only minimal preprocessing and providing very fast query times.
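
    A minimal sketch of the hub-labeling primitive that ReHub builds on, using toy hand-written labels: every vertex stores a sorted list of (hub, distance) pairs, and the distance between two vertices is the minimum of d(u,h) + d(h,v) over their common hubs. The naive RkNN loop on top is only for illustration; ReHub's preprocessing and query machinery avoid this exhaustive scan.

    def hub_distance(label_u, label_v):
        # Merge the two hub-sorted labels like a sorted-list intersection.
        best, i, j = float("inf"), 0, 0
        while i < len(label_u) and j < len(label_v):
            hu, du = label_u[i]
            hv, dv = label_v[j]
            if hu == hv:
                best = min(best, du + dv)
                i, j = i + 1, j + 1
            elif hu < hv:
                i += 1
            else:
                j += 1
        return best

    def reverse_knn(query, objects, labels, k=1):
        # Naive RkNN on top of the distance oracle: o is a result if the query
        # is at least as close to o as o's k-th nearest other object.
        result = []
        for o in objects:
            dists = sorted(hub_distance(labels[o], labels[x]) for x in objects if x != o)
            if hub_distance(labels[o], labels[query]) <= dists[k - 1]:
                result.append(o)
        return result

    labels = {  # toy labels: (hub id, distance) pairs, sorted by hub id
        "a": [(0, 1), (1, 4)],
        "b": [(0, 2), (2, 3)],
        "c": [(1, 1), (2, 2)],
    }
    print(hub_distance(labels["a"], labels["b"]))   # 3, via hub 0
    print(reverse_knn("a", ["b", "c"], labels, k=1))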

    Architectures for High Performance Computing and Data Systems using Byte-Addressable Persistent Memory

    Non-volatile, byte-addressable memory technology with performance close to main memory promises to revolutionise computing systems in the near future. Such memory technology provides the potential for extremely large memory regions (i.e., more than 3 TB per server), very high-performance I/O, and new ways of storing and sharing data for applications and workflows. This paper outlines an architecture that has been designed to exploit such memory for High Performance Computing and High Performance Data Analytics systems, along with descriptions of how applications could benefit from such hardware.
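
    A minimal sketch of byte-addressable access to a persistent region via a memory-mapped file; real deployments of the hardware described above would use a DAX-mounted file system or a persistent-memory library rather than a plain file on disk, and the path and sizes below are illustrative only.

    import mmap, os, struct

    PATH = "/tmp/pmem_region.bin"    # hypothetical stand-in for a persistent-memory device
    SIZE = 1 << 20                   # 1 MiB region

    # Create and size the backing file once.
    with open(PATH, "wb") as f:
        f.truncate(SIZE)

    fd = os.open(PATH, os.O_RDWR)
    region = mmap.mmap(fd, SIZE)

    # Store a record at a fixed byte offset, then flush it so it survives restarts.
    offset = 128
    region[offset:offset + 8] = struct.pack("<q", 42)
    region.flush(offset & ~4095, 4096)   # flush the page containing the record

    print(struct.unpack("<q", region[offset:offset + 8])[0])
    region.close()
    os.close(fd)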

    Big Data Analytics in Bioinformatics: A Machine Learning Perspective

    Bioinformatics research is characterized by voluminous and incremental datasets and complex data analytics methods. The machine learning methods used in bioinformatics are iterative and parallel, and they can be scaled to handle big data using distributed and parallel computing technologies. Usually, big data tools perform computation in batch mode and are not optimized for iterative processing and high data dependency among operations. In recent years, parallel, incremental, and multi-view machine learning algorithms have been proposed. Similarly, graph-based architectures and in-memory big data tools have been developed to minimize I/O cost and optimize iterative processing. However, standard big data architectures and tools are still lacking for many important bioinformatics problems, such as fast construction of co-expression and regulatory networks and salient module identification, detection of complexes over growing protein-protein interaction data, fast analysis of massive DNA, RNA, and protein sequence data, and fast querying on incremental and heterogeneous disease networks. This paper addresses the issues and challenges posed by several big data problems in bioinformatics and gives an overview of the state of the art and future research opportunities. Comment: 20-page survey paper on big data analytics in bioinformatics.

    Large-scale image analysis using docker sandboxing

    With the advent of specialized hardware such as Graphics Processing Units (GPUs), large-scale image localization, classification, and retrieval have seen increased prevalence. Designing a scalable software architecture that co-evolves with such specialized hardware is a challenge in the commercial setting. In this paper, we describe one such architecture (Cortexica) that leverages the scalability of GPUs and the sandboxing offered by Docker containers. This allows for the flexibility of mixing different computer architectures as well as computational algorithms, with the security of a trusted environment. We illustrate the utility of this framework in a commercial setting, i.e., searching for multiple products in an image by combining image localisation and retrieval.

    DXRAM's Fault-Tolerance Mechanisms Meet High Speed I/O Devices

    In-memory key-value stores provide consistent low-latency access to all objects, which is important for interactive large-scale applications like social media networks or online graph analytics, and also opens up new application areas. But when storing data in RAM on thousands of servers, one has to consider server failures, and only a few in-memory key-value stores provide automatic online recovery of failed servers. The most prominent example of these systems is RAMCloud. Another system with sophisticated fault-tolerance mechanisms is DXRAM, which is optimized for small data objects. In this report, we detail the remote replication process, which is based on logs, investigate selection strategies for the reorganization of these logs, and evaluate the reorganization performance for sequential, random, Zipf, and hot-and-cold distributions in DXRAM. This is also the first time DXRAM's backup system is evaluated with high-speed I/O devices, specifically a 56 Gbit/s InfiniBand interconnect and PCIe SSDs. Furthermore, we discuss the copyset replica distribution to reduce the probability of data loss and the adaptations of the original approach for DXRAM. Comment: 21 pages, 20 figures.
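
    A hypothetical sketch of copyset-style replica placement as mentioned above: backup servers are partitioned into fixed groups of size R (the replication factor), and all replicas of an object go to one group, so data is lost only if every member of the same small group fails simultaneously. DXRAM's actual adaptations of the copyset approach and its log layout are not reproduced here.

    import random, zlib

    def build_copysets(servers, replication_factor, seed=0):
        """Partition the backup servers into disjoint groups of size R."""
        random.seed(seed)
        shuffled = servers[:]
        random.shuffle(shuffled)
        return [shuffled[i:i + replication_factor]
                for i in range(0, len(shuffled) - replication_factor + 1, replication_factor)]

    def place_replicas(object_id, copysets):
        # Map an object to one copyset via a stable hash (illustrative choice).
        return copysets[zlib.crc32(object_id.encode()) % len(copysets)]

    servers = [f"backup-{i:02d}" for i in range(12)]
    copysets = build_copysets(servers, replication_factor=3)
    print(place_replicas("chunk-4711", copysets))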