
    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Big data systems development is full of challenges, given the variety of application areas and domains that this technology promises to serve. Fundamental design decisions in big data systems design typically include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies into an optimized solution for a specific real-world problem, big data systems are no exception. As far as the storage aspect of any big data system is concerned, the primary facet is the storage infrastructure, and NoSQL appears to be the technology that best fulfills its requirements. However, every big data application has different data characteristics, and thus its data fits a different data model. This paper presents a feature and use case analysis and comparison of the four main data models, namely document-oriented, key-value, graph, and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points that a developer must consider when making a choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings, and possible use cases of the available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed.
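    As a rough illustration of the comparison above (not taken from the paper), the Python sketch below shows how one and the same record could be shaped under the four data models discussed; the record and its field names are invented for illustration.

        # Illustrative sketch (not from the paper): how one "user" record might be
        # shaped under the four NoSQL data models the survey compares.

        # Document-oriented (e.g. MongoDB): one self-contained, nested document.
        document = {
            "_id": "u42",
            "name": "Alice",
            "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
        }

        # Key-value (e.g. Redis): an opaque value addressed by a single key.
        key_value = {"user:u42": '{"name": "Alice", "orders": ["A1", "B7"]}'}

        # Wide-column (e.g. Cassandra/HBase): row key plus column families,
        # and the set of columns can vary from row to row.
        wide_column = {
            "u42": {
                "profile": {"name": "Alice"},
                "orders": {"A1": 2, "B7": 1},
            }
        }

        # Graph (e.g. Neo4j): explicit nodes and typed relationships.
        graph = {
            "nodes": [{"id": "u42", "label": "User"}, {"id": "A1", "label": "Product"}],
            "edges": [{"from": "u42", "to": "A1", "type": "ORDERED", "qty": 2}],
        }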

    Impliance: A Next Generation Information Management Appliance

    […]ably successful in building a large market and adapting to the changes of the last three decades, its impact on the broader market of information management is surprisingly limited. "If we were to design an information management system from scratch, based upon today's requirements and hardware capabilities, would it look anything like today's database systems?" In this paper, we introduce Impliance, a next-generation information management system consisting of hardware and software components integrated to form an easy-to-administer appliance that can store, retrieve, and analyze all types of structured, semi-structured, and unstructured information. We first summarize the trends that will shape information management for the foreseeable future. Those trends imply three major requirements for Impliance: (1) to be able to store, manage, and uniformly query all data, not just structured records; (2) to be able to scale out as the volume of this data grows; and (3) to be simple and robust in operation. We then describe four key ideas that are uniquely combined in Impliance to address these requirements, namely: (a) integrating software and off-the-shelf hardware into a generic information appliance; (b) automatically discovering, organizing, and managing all data - unstructured as well as structured - in a uniform way; (c) achieving scale-out by exploiting simple, massively parallel processing; and (d) virtualizing compute and storage resources to unify, simplify, and streamline the management of Impliance. Impliance is an ambitious, long-term effort to define simpler, more robust, and more scalable information systems for tomorrow's enterprises. Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, US.

    Benchmarking Scalability of NoSQL Databases for Geospatial Queries

    NoSQL databases provide an edge when it comes to dealing with big unstructured data. The flexibility, agility, and scalability offered by NoSQL databases become increasingly essential when dealing with geospatial data. The proliferation of geospatial applications has tremendously increased the variety, velocity, and volume of data that data stores must manage. Such characteristics of big spatial data surpass the capability and anticipated use cases of relational databases. Because we can choose from an extensive collection of NoSQL databases these days, it becomes vital for organizations to make an informed decision. NoSQL database benchmarks provide system architects, who shoulder the considerable burden of selecting the right technology for their data stores, with a vital starting point and source of information. The major utility of these benchmarks is in reproducing experiments on similar experimental data, which can verify and streamline the selection of an optimal tool for data management needs in the early phases of development. The goal of this research is to develop a benchmark that can compare the performance of NoSQL databases for querying complex geospatial data. We analyze the throughput, latency, and runtime of MongoDB and Couchbase to identify the right fit for our use case, and in doing so demonstrate a systematic process that can be followed to make an optimal choice of data store. This benchmark can be extended easily to any NoSQL database that supports geospatial querying.
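    A minimal sketch of the kind of measurement such a benchmark performs, assuming a MongoDB collection named benchmark.places with a GeoJSON location field (both names, the query point, and the 5 km radius are assumptions; this is not the benchmark harness described above):

        # Time one geospatial query against MongoDB with pymongo.
        import time
        from pymongo import MongoClient, GEOSPHERE

        client = MongoClient("mongodb://localhost:27017")
        coll = client["benchmark"]["places"]
        coll.create_index([("location", GEOSPHERE)])   # 2dsphere index for GeoJSON points

        query = {
            "location": {
                "$near": {
                    "$geometry": {"type": "Point", "coordinates": [-86.80, 36.16]},
                    "$maxDistance": 5000,              # metres
                }
            }
        }

        start = time.perf_counter()
        results = list(coll.find(query).limit(100))
        latency_ms = (time.perf_counter() - start) * 1000
        print(f"{len(results)} results in {latency_ms:.1f} ms")

    Repeating such timed queries under increasing load is what yields the throughput and latency figures the abstract refers to.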

    Efficient top K temporal spatial keyword search

    Massive amounts of data that are geo-tagged and associated with text information are being generated at an unprecedented scale in many emerging applications such as location-based services and social networks. Owing to their importance, a large body of work has focused on efficiently computing various spatial keyword queries. In this paper, we study the top-k temporal spatial keyword query, which considers three important constraints during the search: time, spatial proximity, and textual relevance. We propose a novel index structure, namely the SSG-tree, to efficiently insert and delete spatio-temporal web objects at high rates. Based on the SSG-tree, an efficient algorithm is developed to support top-k temporal spatial keyword queries. We show via extensive experimentation with real spatial databases that our method outperforms alternative techniques.
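    The sketch below is a naive baseline, not the SSG-tree: it only illustrates how the three constraints (textual relevance, spatial proximity, time) might be folded into a single top-k score. The linear weighting, the weights, and the field names are all assumptions.

        import heapq, math

        def score(obj, q, alpha=0.4, beta=0.3, gamma=0.3):
            # Fraction of query keywords the object contains.
            text_rel = len(set(obj["keywords"]) & set(q["keywords"])) / max(len(q["keywords"]), 1)
            spatial = 1.0 / (1.0 + math.dist(obj["xy"], q["xy"]))   # closer -> higher
            recency = 1.0 / (1.0 + abs(obj["t"] - q["t"]))          # nearer in time -> higher
            return alpha * text_rel + beta * spatial + gamma * recency

        def top_k(objects, q, k=10):
            # A real system would prune candidates with an index (e.g. the SSG-tree)
            # instead of scanning and scoring every object as done here.
            return heapq.nlargest(k, objects, key=lambda o: score(o, q))

        objects = [
            {"id": 1, "xy": (0.1, 0.2), "t": 95, "keywords": ["cafe", "wifi"]},
            {"id": 2, "xy": (5.0, 4.0), "t": 99, "keywords": ["cafe"]},
        ]
        query = {"xy": (0.0, 0.0), "t": 100, "keywords": ["cafe", "wifi"]}
        print(top_k(objects, query, k=1))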

    PaaS Cloud Service for Cost-Effective Harvesting, Processing and Linking of Unstructured Open Government Data

    The aim of this project is to develop a cloud platform service for transforming Open Government Data into Linked Open Government Data. The service receives as input a log file, produced by a web crawler, containing over 3,000,000 lines, each holding the URL of a public document. It opens each document, reads its content and, using Estnltk (open source tools for Estonian natural language processing), extracts the names of organizations, locations, and people. Using the Python library RDFlib, these names are added to a Resource Description Framework (RDF) graph, in which each name is linked to the URLs of the documents that contain it. To archive the state of each document at the time it was read, the service downloads all processed documents. The service also re-checks the processed documents monthly, so that new RDF relations can be generated if any of the documents have changed. The generated RDF graphs are publicly available, and the service includes a SPARQL endpoint for users (through a graphical user interface) and for machines (through web services), enabling cost-effective querying of linked entities from the RDF files. An important challenge of this service is processing speed, because the documents behind these 3+ million URLs may be large. To achieve this, parallel processes are run wherever possible, using several Google Compute Engine virtual machines and all CPUs in each virtual machine.
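    A minimal sketch of the RDF-building step with RDFlib, assuming entity extraction with Estnltk has already produced (type, name) pairs; the namespace, the predicate choice (dcterms:references), and the example URLs are assumptions rather than the thesis implementation:

        from rdflib import Graph, Literal, Namespace
        from rdflib.namespace import DCTERMS, RDF, RDFS
        from rdflib import URIRef

        EX = Namespace("http://example.org/entity/")       # hypothetical namespace

        def add_entities(graph, doc_url, entities):
            doc = URIRef(doc_url)
            for kind, name in entities:                    # kind: "Person", "Org", "Loc"
                ent = EX[name.replace(" ", "_")]
                graph.add((ent, RDF.type, EX[kind]))
                graph.add((ent, RDFS.label, Literal(name, lang="et")))
                graph.add((doc, DCTERMS.references, ent))  # the document mentions this entity
            return graph

        g = Graph()
        add_entities(g, "https://www.example.ee/doc/123.pdf",
                     [("Person", "Mari Maasikas"), ("Org", "Tartu Ülikool")])
        print(g.serialize(format="turtle"))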

    An Extensible "SCHEMA-LESS" Database Framework for Managing High-Throughput Semi-Structured Documents

    Object-relational database management systems are an integrated, hybrid, cooperative approach that combines the best practices of both the relational model, utilizing SQL queries, and the object-oriented, semantic paradigm for supporting complex data creation. In this paper, a highly scalable, information-on-demand database framework, called NETMARK, is introduced. NETMARK takes advantage of the Oracle 8i object-relational database, using physical address data types for very efficient keyword search of records spanning both context and content. NETMARK was originally developed in early 2000 as a research and development prototype to manage the vast amounts of unstructured and semi-structured documents existing within NASA enterprises. Today, NETMARK is a flexible, high-throughput open database framework for managing, storing, and searching unstructured or semi-structured arbitrary hierarchical models, such as XML and HTML.
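    The sketch below illustrates the general idea of keyword search spanning context (element names) and content (text) over shredded XML nodes; it is not NETMARK's actual Oracle 8i schema, and the table layout and example document are invented for illustration.

        import sqlite3                              # stand-in for the object-relational store
        import xml.etree.ElementTree as ET

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE nodes (doc TEXT, path TEXT, tag TEXT, content TEXT)")

        def shred(doc_id, xml_text):
            # Store every element as a row so both its name and its text are searchable.
            root = ET.fromstring(xml_text)
            def walk(elem, path):
                p = f"{path}/{elem.tag}"
                conn.execute("INSERT INTO nodes VALUES (?, ?, ?, ?)",
                             (doc_id, p, elem.tag, (elem.text or "").strip()))
                for child in elem:
                    walk(child, p)
            walk(root, "")

        shred("doc1", "<report><title>Wind tunnel test</title><body>Results pending</body></report>")

        # Keyword search over context (tag) and content (text) alike.
        rows = conn.execute(
            "SELECT doc, path, content FROM nodes WHERE tag LIKE ? OR content LIKE ?",
            ("%title%", "%tunnel%")).fetchall()
        print(rows)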

    Lean Middleware

    This paper describes an approach to achieving data integration across multiple sources in an enterprise in a manner that is cost-efficient and economically scalable. We present an approach that does not rely on major investment in structured, heavy-weight database systems for data storage, or in heavy-weight middleware responsible for integrated access. Instead, the approach centers on pushing any required data structure and semantics functionality (the schema) to application clients, as well as pushing integration specification and functionality to the clients, where integration can be performed on the fly.
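    A minimal sketch of this client-side ("lean") integration idea, not the paper's implementation; the source layouts, the mapping format, and the field names are invented for illustration.

        # The schema and the integration mapping live in the client, which reconciles
        # raw records from two sources on the fly.
        RAW_HR = [{"emp_no": "17", "full_name": "A. Smith"}]
        RAW_PAYROLL = [{"id": 17, "monthly_pay_eur": 4200}]

        # Client-held schema: canonical field -> [(source, source field, converter), ...]
        MAPPING = {
            "employee_id": [("hr", "emp_no", int), ("payroll", "id", int)],
            "name":        [("hr", "full_name", str)],
            "salary_eur":  [("payroll", "monthly_pay_eur", float)],
        }

        def integrate(hr_rows, payroll_rows):
            sources = {"hr": hr_rows, "payroll": payroll_rows}
            merged = {}
            for canonical, bindings in MAPPING.items():
                for source, field, convert in bindings:
                    for row in sources[source]:
                        key = int(row.get("emp_no", row.get("id")))   # shared join key
                        merged.setdefault(key, {})[canonical] = convert(row[field])
            return list(merged.values())

        print(integrate(RAW_HR, RAW_PAYROLL))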