33 research outputs found

    Schema Profiling of Document Stores

    In document stores, the schema is a soft concept and the documents in a collection can have different schemata; this gives designers and implementers greater flexibility, but it requires extra effort to understand the rules that drove the use of alternative schemata when heterogeneous documents are to be analyzed or integrated. In this paper we outline a technique, called schema profiling, to explain the schema variants within a collection in a document store by capturing the hidden rules behind the use of these variants; we express these rules in the form of a decision tree, called a schema profile, whose main feature is the coexistence of value-based and schema-based conditions. Consistent with the requirements we elicited from real users, we aim at creating explicative, precise, and concise schema profiles; to quantitatively assess these qualities we introduce a novel measure of entropy.
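
    To make the entropy idea concrete, here is a minimal, illustrative sketch (not the authors' algorithm): it treats each document's set of top-level field names as its schema variant and measures how mixed a collection is. The paper's schema profiles partition documents using value-based and schema-based conditions so that the entropy within each leaf drops.

        from collections import Counter
        from math import log2

        def schema_signature(doc):
            """Structural signature of a document: its sorted top-level field names."""
            return tuple(sorted(doc.keys()))

        def schema_entropy(docs):
            """Shannon entropy of the distribution of schema variants.
            A profile whose conditions separate the variants well yields
            leaves with entropy close to zero."""
            counts = Counter(schema_signature(d) for d in docs)
            total = sum(counts.values())
            return -sum((c / total) * log2(c / total) for c in counts.values())

        docs = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "nick": "c"}]
        print(schema_entropy(docs))  # mixed schema variants -> entropy > 0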

    Evaluating cloud database migration options using workload models

    A key challenge in porting enterprise software systems to the cloud is the migration of their database. Choosing a cloud provider and service option (e.g., a database-as-a-service or a manually configured set of virtual machines) typically requires estimating the cost and duration of the migration for each option under consideration. Many organisations also require this information for budgeting and planning purposes. Existing cloud migration research focuses on the software components and therefore does not address this need. We introduce a two-stage approach that accurately estimates the migration cost, migration duration, and cloud running costs of relational databases. The first stage of our approach obtains workload and structure models of the database to be migrated from database logs and the database schema. The second stage performs a discrete-event simulation using these models to obtain the cost and duration estimates. We implemented software tools that automate both stages of our approach. An extensive evaluation compares the estimates from our approach against results from real-world cloud database migrations.
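
    For the second stage only, a toy discrete-event simulation under assumed parameters (table sizes, link bandwidth, hourly price, number of parallel transfer channels) that estimates a migration's duration and cost. The paper's simulator is driven by workload and structure models mined from logs and the schema, which this sketch does not attempt.

        import heapq

        def simulate_migration(table_sizes_gb, bandwidth_gbps, workers, cost_per_hour):
            """Toy discrete-event simulation: `workers` parallel channels pull
            tables from a queue; returns (duration_hours, migration_cost)."""
            free_at = [0.0] * workers          # time at which each channel is free
            heapq.heapify(free_at)
            for size in sorted(table_sizes_gb, reverse=True):  # largest tables first
                start = heapq.heappop(free_at)
                transfer_h = size / (bandwidth_gbps * 3600 / 8)  # GB over a Gbit/s link, in hours
                heapq.heappush(free_at, start + transfer_h)
            duration = max(free_at)
            return duration, duration * cost_per_hour

        dur, cost = simulate_migration([120, 40, 300, 15], bandwidth_gbps=1.0,
                                       workers=2, cost_per_hour=4.5)
        print(f"estimated duration: {dur:.2f} h, cost: ${cost:.2f}")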

    Querying heterogeneous data in NoSQL document stores

    This thesis addresses the problem of querying heterogeneous data in document-oriented systems. Document-oriented "not-only SQL" (NoSQL) storage systems have undergone significant development in recent years due to their ability to manage large amounts of documents in a flexible and efficient manner. These systems rely on the "schema-less" concept, whereby no single schema is required for a set of data, called a collection of documents. This flexibility in data structuring makes query formulation more complex: users need to know all the different schemata of the data manipulated when writing queries. The work developed in this thesis is part of the neOCampus project. It focuses on the manipulation and querying of structurally heterogeneous document collections, mainly the problem of variable schemata. We propose the construction of a data dictionary that makes it possible to find all the schemata of the documents. Each key, a dictionary entry, corresponds to an absolute or partial path existing in at least one document of the collection. Each key is associated with all the corresponding absolute paths throughout the collection of heterogeneous documents. The dictionary is then exploited to automatically and transparently reformulate user queries. User queries are formulated using the dictionary keys (partial or absolute paths) and are automatically rewritten using the dictionary so as to take into account all the absolute paths existing in the documents of the collection. In this thesis, we survey the state of the art of work addressing the querying of structurally heterogeneous documents, and we propose a classification. We then compare these works according to criteria that make it possible to position and differentiate our contribution. We formally define the classical concepts related to document-oriented systems (document, collection, etc.) and extend this formalisation with additional concepts: absolute and partial paths, document schemata, and the dictionary. For manipulating and querying heterogeneous documents, we define a closed minimal algebraic kernel composed of five operators: selection, projection, unnest, aggregation, and join (left join). We define each operator and explain its classical evaluation by a native document query engine. We then establish the reformulation rules for each of these operators based on the dictionary. We define the process of reformulating user queries, which produces a query that can be evaluated by most document query engines while preserving the logic of the classical operators (nonexistent paths, null values). We show how the reformulation of a query initially constructed with partial and/or absolute paths solves the problem of the structural heterogeneity of documents.
    Finally, we conduct experiments to validate the formal concepts introduced throughout this thesis. We evaluate the construction and maintenance of the dictionary while varying the configuration in terms of the number of structures per collection and the collection size. We then evaluate the query reformulation engine by comparing it to query evaluation in a context without structural heterogeneity, and then in a multi-query context. All our experiments were conducted on synthetic collections with several levels of nesting, different numbers of structures per collection, and varying collection sizes. Recently, we integrated our contribution into the neOCampus project to handle heterogeneity when querying data from sensors installed in classrooms and the library on the campus of the University of Toulouse III - Paul Sabatier.
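
    A minimal sketch of the dictionary idea, assuming dot-separated paths and purely nested JSON objects: every partial path (suffix) maps to the set of absolute paths it matches across the collection, which is exactly what a rewriter needs to expand a user's partial path into a disjunction over all structural variants.

        def collect_paths(doc, prefix=""):
            """Yield every absolute dot-path present in a (possibly nested) document."""
            for key, value in doc.items():
                path = f"{prefix}.{key}" if prefix else key
                yield path
                if isinstance(value, dict):
                    yield from collect_paths(value, path)

        def build_dictionary(collection):
            """Map each partial path (suffix) to the absolute paths it matches."""
            dictionary = {}
            for doc in collection:
                for abs_path in collect_paths(doc):
                    parts = abs_path.split(".")
                    for i in range(len(parts)):
                        dictionary.setdefault(".".join(parts[i:]), set()).add(abs_path)
            return dictionary

        docs = [{"sensor": {"temp": 21}}, {"room": {"sensor": {"temp": 19}}}]
        d = build_dictionary(docs)
        # A query on the partial path "sensor.temp" is rewritten as a
        # disjunction over every matching absolute path in the collection:
        print(d["sensor.temp"])  # {'sensor.temp', 'room.sensor.temp'}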

    Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data

    Thesis (Ph.D.) - Indiana University, Computer Sciences, 2015.
    As Big Data processing problems evolve, many modern applications demonstrate special characteristics. Data exists in the form of both large historical datasets and high-speed real-time streams, and many analysis pipelines require integrated parallel batch processing and stream processing. Despite the large size of the whole dataset, most analyses focus on specific subsets according to certain criteria. Correspondingly, integrated support for efficient queries and post-query analysis is required. To address the system-level requirements brought by such characteristics, this dissertation proposes a scalable architecture for integrated queries, batch analysis, and streaming analysis of Big Data in the cloud. We verify its effectiveness using a representative application domain - social media data analysis - and tackle related research challenges emerging from each module of the architecture by integrating and extending multiple state-of-the-art Big Data storage and processing systems. In the storage layer, we reveal that existing text indexing techniques do not work well for the unique queries of social data, which put constraints on both textual content and social context. To address this issue, we propose a flexible indexing framework over NoSQL databases to support fully customizable index structures, which can embed the necessary social context information for efficient queries. The batch analysis module demonstrates that analysis workflows consist of multiple algorithms with different computation and communication patterns, which are suited to different processing frameworks. To achieve efficient workflows, we build an integrated analysis stack based on YARN and make novel use of customized indices in developing sophisticated analysis algorithms. In the streaming analysis module, the high-dimensional representation of social media streams poses special challenges for parallel stream clustering. Due to the sparsity of the high-dimensional data, the traditional synchronization method becomes expensive and severely impacts the scalability of the algorithm. We therefore design a novel strategy that broadcasts the incremental changes, rather than the whole centroids of the clusters, to achieve scalable parallel stream clustering. Performance tests using real applications show that our solutions for parallel data loading/indexing, queries, analysis tasks, and stream clustering all significantly outperform implementations using current state-of-the-art technologies.
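
    To isolate the broadcast strategy for parallel stream clustering (a toy, not the dissertation's implementation): with sparse high-dimensional points, shipping only the accumulated (dimension, change) pairs per synchronization round is far cheaper than shipping whole centroids.

        class SparseDelta:
            """Accumulates sparse incremental updates to one cluster's centroid;
            only the changed dimensions are broadcast, not the full vector."""
            def __init__(self):
                self.changes = {}      # dimension index -> accumulated change
                self.new_points = 0    # points absorbed since the last sync

            def add_point(self, sparse_point):
                for dim, value in sparse_point.items():
                    self.changes[dim] = self.changes.get(dim, 0.0) + value
                self.new_points += 1

        def apply_delta(centroid_sum, centroid_count, delta):
            """Apply a broadcast delta to a worker's replica of the cluster state."""
            for dim, value in delta.changes.items():
                centroid_sum[dim] = centroid_sum.get(dim, 0.0) + value
            return centroid_sum, centroid_count + delta.new_points

        delta = SparseDelta()
        delta.add_point({3: 1.0, 117: 0.5})   # sparse vector: two nonzero dimensions
        delta.add_point({3: 0.5})
        print(apply_delta({}, 0, delta))      # ({3: 1.5, 117: 0.5}, 2)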

    Maintaining Logical Consistency in Cassandra

    Unlike relational databases, in NoSQL databases such as Cassandra it is very common for data to be duplicated across different tables. This is because tables are usually designed around the queries, with no relationships between them, in order to favour query performance. Consequently, if the data are not updated appropriately, inconsistencies can arise in the stored information. It is relatively easy to introduce defects that cause data inconsistency in Cassandra, especially during the evolution of a system in which new tables are created, and these defects are hard to detect using conventional dynamic testing techniques. It is the developer who must take care of maintaining this consistency by adding and updating the appropriate procedures. This work proposes a preventive approach to these problems, establishing the processes needed to ensure the quality of the data from the point of view of their consistency, thereby easing the developer's tasks. These processes include: (1) a static analysis that considers the conceptual model, the queries, and the logical model of the application, in order to identify which elements (tables or columns) of the database are affected by a change, and (2) the determination and execution of the operations that ensure the consistency of the information. Ministerio de Economía y Competitividad TIN2016-76956-C3-1-R/-2-R. Principado de Asturias GRUPIN14-00
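
    A small illustration of the maintenance problem, with hypothetical table and column names: once a conceptual attribute is known to be duplicated across several query-driven tables, the updates that keep every copy consistent can be derived mechanically, in the spirit of step (2) above.

        # Hypothetical mapping: conceptual attribute -> Cassandra tables
        # (and their key columns) holding a duplicated copy of it.
        DUPLICATED_IN = {
            "user.email": [
                ("users_by_id", "user_id"),
                ("users_by_country", "country, user_id"),
            ],
        }

        def updates_for_change(attribute, new_value, key_values):
            """Emit one CQL UPDATE per table storing a copy of `attribute`,
            so that no replica of the value is left stale."""
            column = attribute.split(".")[-1]
            statements = []
            for table, key_cols in DUPLICATED_IN[attribute]:
                where = " AND ".join(f"{k.strip()} = %s" for k in key_cols.split(","))
                statements.append((f"UPDATE {table} SET {column} = %s WHERE {where}",
                                   [new_value, *key_values[table]]))
            return statements

        for stmt, params in updates_for_change("user.email", "a@b.org",
                {"users_by_id": [42], "users_by_country": ["ES", 42]}):
            print(stmt, params)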

    An ontology-based secure design framework for graph-based databases

    Graph-based databases are concerned with performance and flexibility. Most of the existing approaches used to design secure NoSQL databases are limited to the final implementation stage and do not address security and access control at higher abstraction levels. Ensuring security and access control for graph-based databases is difficult, as each approach differs significantly depending on the technology employed. In this paper, we propose the first technology-agnostic framework with which to design secure graph-based databases. Our proposal raises the abstraction level by using ontologies to model database and security requirements together. This is supported by the TITAN framework, which facilitates the way in which both aspects are dealt with. The great advantages of our approach are therefore that it: allows database designers to focus on the protection of security and data while ignoring the implementation details; facilitates secure design and the rapid migration of security rules by deriving specific security measures for each underlying technology; and enables database designers to employ ontology reasoning in order to verify whether the security rules are consistent. We show the applicability of our proposal by applying it to a case study based on hospital data access control. This work has been developed within the AETHER-UA (PID2020-112540RB-C43), AETHER-UMA (PID2020-112540RB-C41), AETHER-UCLM (PID2020-112540RB-C42), ALBA (TED2021-130355B-C31, TED2021-130355B-C33) and PRESECREL (PID2021-124502OB-C42) projects funded by the "Ministerio de Ciencia e Innovación", the Andalusian PAIDI program (grant P18-RT-2799), and the BALLADER project (PROMETEO/2021/088) funded by the "Consellería de Innovación, Universidades, Ciencia y Sociedad Digital", Generalitat Valenciana.
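
    As a loose illustration of the consistency verification only (the paper employs ontology reasoning supported by the TITAN framework, which this toy does not): access rules with the same subject, action, and resource but opposite effects can be flagged as contradictory.

        from itertools import combinations

        # Hypothetical security rules: (role, action, resource, effect)
        rules = [
            ("nurse",  "read",  "Patient",   "permit"),
            ("nurse",  "read",  "Patient",   "deny"),    # conflicts with the rule above
            ("doctor", "write", "Diagnosis", "permit"),
        ]

        def conflicting(rules):
            """Flag rule pairs that agree on (role, action, resource) but
            disagree on effect -- a crude stand-in for ontology reasoning."""
            return [(a, b) for a, b in combinations(rules, 2)
                    if a[:3] == b[:3] and a[3] != b[3]]

        for a, b in conflicting(rules):
            print("inconsistent:", a, "vs", b)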

    A Tool for Schema Extraction from Document-Oriented NoSQL Databases

    Undergraduate thesis (TCC) - Universidade Federal de Santa Catarina, Centro Tecnológico, Sistemas de Informação.
    NoSQL databases are becoming increasingly popular in application development due, among other features, to their ability to handle large data volumes and to their lack of an explicit data schema. Although most NoSQL databases are schemaless, information about the structural properties of the stored data is essential during application development. Without knowledge of these structural properties, application development and data analysis activities become costly and sometimes unfeasible. This work therefore proposes a tool that, given a collection of documents in JSON format stored in a document-oriented NoSQL database, extracts their schema, with the purpose of facilitating further data manipulation tasks such as data retrieval, validation, integration, and analysis. In the extraction phase, aggregation operations are applied to obtain one document for each distinct structure, and a global structure is proposed to group these structures in order to produce a single schema in JSON Schema format. Finally, experiments on real DBPedia, Foursquare, and GitHub datasets, as well as a hypothetical dataset, show that the processing time and the completeness of the generated schemas are comparable with the results found in state-of-the-art approaches.
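
    A minimal sketch of the extract-and-merge idea, assuming plain JSON documents: infer a schema fragment per document structure, then fold the fragments into one global schema in JSON Schema style (optionality and the tool's aggregation pipeline are ignored here).

        import json

        def infer(value):
            """Infer a minimal JSON-Schema-style fragment for one value."""
            if isinstance(value, dict):
                return {"type": "object",
                        "properties": {k: infer(v) for k, v in value.items()}}
            if isinstance(value, list):
                return {"type": "array"}
            if isinstance(value, bool):          # bool before int: bool is an int subtype
                return {"type": "boolean"}
            if isinstance(value, (int, float)):
                return {"type": "number"}
            return {"type": "null"} if value is None else {"type": "string"}

        def merge(a, b):
            """Merge two fragments; fields seen in only some documents simply
            join the union of properties."""
            if a.get("type") == b.get("type") == "object":
                props = dict(a["properties"])
                for k, v in b["properties"].items():
                    props[k] = merge(props[k], v) if k in props else v
                return {"type": "object", "properties": props}
            return a if a == b else {"anyOf": [a, b]}

        docs = [{"id": 1, "name": "x"}, {"id": 2, "tags": ["a"]}]
        schema = infer(docs[0])
        for d in docs[1:]:
            schema = merge(schema, infer(d))
        print(json.dumps(schema, indent=2))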

    JSONoid: Monoid-based Enrichment for Configurable and Scalable Data-Driven Schema Discovery

    Schema discovery is an important aspect of working with data in formats such as JSON. Unlike relational databases, JSON datasets often have no associated structural information, and consumers of such datasets are often left to browse through the data in an attempt to observe commonalities in structure across documents before they can write suitable data-processing code. This process is time-consuming and error-prone. Existing distributed approaches to schema mining offer a significant usability advantage, as they provide useful metadata for large data sources; however, depending on the data source, the ad hoc queries needed to estimate other properties that help in crafting an efficient data pipeline can be expensive. We propose JSONoid, a distributed schema discovery process augmented with additional metadata in the form of monoid data structures that are easily maintainable in a distributed setting. JSONoid subsumes several existing approaches to distributed schema discovery with similar performance. Our approach also adds significant, useful information about data values to discovered schemas, with linear scalability.
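
    To show why monoids suit distributed schema discovery (an illustrative toy, not JSONoid's API): per-path statistics with an associative combine and an identity element can be folded independently on each data partition and merged in any order, e.g. in a Spark-style reduce.

        from dataclasses import dataclass
        from functools import reduce

        @dataclass
        class StringStats:
            """A monoid of per-path string statistics: `combine` is associative
            and `StringStats()` is its identity, so partial results from any
            partitioning of the data merge to the same answer."""
            count: int = 0
            min_len: float = float("inf")
            max_len: float = float("-inf")

            @staticmethod
            def of(value):
                n = len(str(value))
                return StringStats(1, n, n)

            def combine(self, other):
                return StringStats(self.count + other.count,
                                   min(self.min_len, other.min_len),
                                   max(self.max_len, other.max_len))

        partition_a = [StringStats.of(v) for v in ("alice", "bob")]
        partition_b = [StringStats.of(v) for v in ("carol",)]
        merged = reduce(StringStats.combine, partition_a + partition_b, StringStats())
        print(merged)  # StringStats(count=3, min_len=3, max_len=5)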