1,133 research outputs found

    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Big data systems development is full of challenges, given the variety of application areas and domains that this technology promises to serve. Typically, the fundamental decisions in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies into an optimized solution for a specific real-world problem, big data systems are no exception. As far as the storage aspect of a big data system is concerned, the primary facet is the storage infrastructure, and NoSQL seems to be the technology that fulfills its requirements. However, every big data application has different data characteristics, and thus its data fits a different data model. This paper presents a feature and use case analysis and comparison of the four main data models, namely document-oriented, key-value, graph, and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points that a developer must consider when making a choice. Typically, big data storage needs to communicate with the execution engine and with other processing and visualization technologies to create a comprehensive solution. This brings a second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings, and possible use cases of the available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed.

    Towards Conceptual and Logical Modelling of NoSQL Databases

    NoSQL databases support the ability to handle large volumes of data in the absence of an explicit data schema. On the other hand, schema information is sometimes essential for applications during data retrieval. Consequently, there are approaches to schema construction in, e.g., the JSON DB and graph DB communities. The difference between a conceptual and database schema is often vague in this case. We use functional constructs – typed attributes for a conceptual view of DB that provide a sufficiently structured approach for expressing semantics of document and graph data. Attribute names are natural language expressions. Such typed functional data objects can be manipulated by terms of a typed λ-calculus, providing powerful nonprocedural query features for considered data structures. The calculus is extendible. Logical, arithmetic, and aggregation functions can be included there. Conceptual and database modelling merge in this case

    Disaster Data Management in Cloud Environments

    Facilitating decision-making in a vital discipline such as disaster management requires information gathering, sharing, and integration on a global scale and across governments, industries, communities, and academia. A large quantity of immensely heterogeneous disaster-related data is available; however, current data management solutions offer few or no integration capabilities and limited potential for collaboration. Moreover, recent advances in cloud computing, Big Data, and NoSQL have opened the door for new solutions in disaster data management. In this thesis, a Knowledge as a Service (KaaS) framework is proposed for disaster cloud data management (Disaster-CDM) with the objectives of 1) facilitating information gathering and sharing, 2) storing large amounts of disaster-related data from diverse sources, and 3) facilitating search and supporting interoperability and integration. Data are stored in a cloud environment taking advantage of NoSQL data stores. The proposed framework is generic, but this thesis focuses on the disaster management domain and data formats commonly present in that domain, i.e., file-style formats such as PDF, text, MS Office files, and images. The framework component responsible for addressing simulation models is SimOnto. SimOnto, as proposed in this work, transforms domain simulation models into an ontology-based representation with the goal of facilitating integration with other data sources, supporting simulation model querying, and enabling rule and constraint validation. Two case studies presented in this thesis illustrate the use of Disaster-CDM on the data collected during the Disaster Response Network Enabled Platform (DR-NEP) project. The first case study demonstrates Disaster-CDM integration capabilities by full-text search and querying services. In contrast to direct full-text search, Disaster-CDM full-text search also includes simulation model files as well as text contained in image files. 
Moreover, Disaster-CDM provides querying capabilities and this case study demonstrates how file-style data can be queried by taking advantage of a NoSQL document data store. The second case study focuses on simulation models and uses SimOnto to transform proprietary simulation models into ontology-based models which are then stored in a graph database. This case study demonstrates Disaster-CDM benefits by showing how simulation models can be queried and how model compliance with rules and constraints can be validated

    Querying heterogeneous data in NoSQL document stores

    This thesis discusses the problems related to querying heterogeneous data in document-oriented "not-only SQL" (noSQL) storage systems. These systems have undergone significant development in recent years due to their ability to manage large amounts of documents in a flexible and efficient manner. They rely on the "schema-less" principle, under which there is no requirement to consider a single schema for a set of data, called a collection of documents. This flexibility in data structures makes query formulation more complex, because users need to know all the different schemas of the data manipulated when writing queries. The work developed in this thesis was carried out within the neOCampus project. It focuses on the manipulation and querying of structurally heterogeneous document collections, mainly the problem of variable schemas. We propose the construction of a data dictionary that makes it possible to find all the schemas of the documents. Each key, a dictionary entry, corresponds to an absolute or partial path existing in at least one document of the collection. This key is associated with all the corresponding absolute paths throughout the collection of heterogeneous documents. The dictionary is then exploited to reformulate user queries automatically and transparently. User queries are formulated using the dictionary keys (partial or absolute paths) and are automatically reformulated using the dictionary so as to consider all the existing absolute paths in all documents of the collection. In this thesis, we conduct a state-of-the-art survey of the work on querying structurally heterogeneous documents, and we propose a classification. Then, we compare these works according to criteria that make it possible to position and differentiate our contribution. We formally define the classical concepts related to document-oriented systems (document, collection, etc.). Then, we extend this formalisation with additional concepts: absolute and partial paths, document schemas, dictionary. For manipulating and querying heterogeneous documents, we define a closed minimal algebraic kernel composed of five operators: selection, projection, unnest, aggregation, and join (left join). We define each operator and explain its classical evaluation by the native document querying engine. Then we establish the reformulation rules for each of these operators based on the dictionary. We define the process of reformulating user queries, which produces a query that can be evaluated by most document querying engines while keeping the logic of the classical operators (non-existent paths, null values). We show how the reformulation of a query initially constructed with partial and/or absolute paths makes it possible to solve the problem of structural heterogeneity of documents. Finally, we conduct experiments to validate the formal concepts that we introduce throughout this thesis. We evaluate the construction and maintenance of the dictionary by changing the configuration in terms of the number of structures per collection and the collection size. Then, we evaluate the query reformulation engine by comparing it to query evaluation in a context without structural heterogeneity and then in a context of executing multiple queries. All our experiments were conducted on synthetic collections with several levels of nesting, different numbers of structures per collection, and varying collection sizes. Recently, we deployed our contributions in the neOCampus project to query data from heterogeneous sensors installed in different classrooms and the library on the campus of the University of Toulouse III-Paul Sabatier.
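
    The dictionary-based reformulation described in this abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`absolute_paths`, `build_dictionary`, `select`) and the sample collection are hypothetical, a partial path is assumed here to be any trailing sub-path of an absolute path, and arrays are ignored for brevity.

```python
from collections import defaultdict


def absolute_paths(doc, prefix=""):
    """Yield every dotted path from the document root to a nested key."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path
        if isinstance(value, dict):
            yield from absolute_paths(value, path)


def build_dictionary(collection):
    """Map each absolute path, and each trailing sub-path (assumed here to
    play the role of a partial path), to the absolute paths it may denote."""
    dictionary = defaultdict(set)
    for doc in collection:
        for path in absolute_paths(doc):
            parts = path.split(".")
            for start in range(len(parts)):
                dictionary[".".join(parts[start:])].add(path)
    return dictionary


def get(doc, path):
    """Follow a dotted path; return None when the path does not exist."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def select(collection, dictionary, key, value):
    """Selection rewritten as a disjunction over all absolute paths of `key`."""
    paths = dictionary.get(key, set())
    return [doc for doc in collection
            if any(get(doc, p) == value for p in paths)]


# Hypothetical collection in which the same field sits at different depths.
collection = [
    {"title": "A", "year": 2019},
    {"info": {"title": "B", "year": 2019}},
    {"meta": {"details": {"year": 2020}}},
]
dictionary = build_dictionary(collection)
matches = select(collection, dictionary, "year", 2019)
```

    Under these assumptions, a user who writes the query against the single key `year` still retrieves documents in which the field is stored as `info.year`, without having to know every schema in the collection.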

    Collaborative knowledge as a service applied to the disaster management domain

    Cloud computing offers services which promise to meet continuously increasing computing demands by using a large number of networked resources. However, data heterogeneity remains a major hurdle for data interoperability and data integration. In this context, a Knowledge as a Service (KaaS) approach has been proposed with the aim of generating knowledge from heterogeneous data and making it available as a service. In this paper, a Collaborative Knowledge as a Service (CKaaS) architecture is proposed, with the objective of satisfying consumer knowledge needs by integrating disparate cloud knowledge through collaboration among distributed KaaS entities. The NIST cloud computing reference architecture is extended by adding a KaaS layer that integrates diverse sources of data stored in a cloud environment. CKaaS implementation is domain-specific; therefore, this paper presents its application to the disaster management domain. A use case demonstrates collaboration of knowledge providers and shows how CKaaS operates with simulation models

    Bridging the gap between the semantic web and big data: answering SPARQL queries over NoSQL databases

    The database field has become much more diverse, and as a result a variety of non-relational (NoSQL) databases have been created, including JSON document databases and key-value stores, as well as extensible markup language (XML) and graph databases. The emergence of this new generation of data services has resolved some of the problems associated with big data. However, in the haste to address the challenges of big data, NoSQL systems abandoned several core database features that make databases efficient and functional, for instance the global view, which enables users to access data regardless of how it is logically structured or physically stored in its sources. In this article, we propose a method for querying non-relational databases based on the ontology-based data access (OBDA) framework, by delegating SPARQL protocol and resource description framework (RDF) query language (SPARQL) queries from the ontology to the NoSQL database. We applied the method to a popular database called Couchbase and discuss the results obtained.

    The Forgotten Document-Oriented Database Management Systems: An Overview and Benchmark of Native XML DODBMSes in Comparison with JSON DODBMSes

    In the current context of Big Data, a multitude of new NoSQL solutions for storing, managing, and extracting information and patterns from semi-structured data have been proposed and implemented. These solutions were developed to relieve the issue of rigid data structures present in relational databases, by introducing semi-structured and flexible schema design. As current data generated by different sources and devices, especially from IoT sensors and actuators, use either XML or JSON format, depending on the application, database technologies that store and query semi-structured data in XML format are needed. Thus, Native XML Databases, which were initially designed to manipulate XML data using standardized querying languages, i.e., XQuery and XPath, were rebranded as NoSQL Document-Oriented Databases Systems. Currently, the majority of these solutions have been replaced with the more modern JSON based Database Management Systems. However, we believe that XML-based solutions can still deliver performance in executing complex queries on heterogeneous collections. Unfortunately nowadays, research lacks a clear comparison of the scalability and performance for database technologies that store and query documents in XML versus the more modern JSON format. Moreover, to the best of our knowledge, there are no Big Data-compliant benchmarks for such database technologies. In this paper, we present a comparison for selected Document-Oriented Database Systems that either use the XML format to encode documents, i.e., BaseX, eXist-db, and Sedna, or the JSON format, i.e., MongoDB, CouchDB, and Couchbase. To underline the performance differences we also propose a benchmark that uses a heterogeneous complex schema on a large DBLP corpus. Comment: 28 pages, 6 figures, 7 tables