
    Querying heterogeneous data in NoSQL document stores

    This thesis addresses the problem of querying heterogeneous data in document-oriented systems. Document-oriented "not-only SQL" (NoSQL) storage systems have undergone significant development in recent years due to their ability to manage large volumes of documents in a flexible and efficient manner. These systems rely on the "schema-less" principle, in which there is no requirement for a single schema over a set of data, called a collection of documents. This flexibility in data structures makes query formulation more complex: users need to know all the different schemas of the data manipulated when writing queries. The work developed in this thesis is carried out within the framework of the neOCampus project. It focuses on the manipulation and querying of structurally heterogeneous document collections, in particular the problem of variable schemas. We propose the construction of a data dictionary that makes it possible to find all the schemas of the documents. Each key, a dictionary entry, corresponds to an absolute or partial path existing in at least one document of the collection. This key is associated with all the corresponding absolute paths throughout the collection of heterogeneous documents. The dictionary is then exploited to automatically and transparently reformulate user queries. User queries are formulated using the dictionary keys (partial or absolute paths) and are automatically rewritten using the dictionary so as to take into account all the absolute paths existing in the documents of the collection. In this thesis, we conduct a state-of-the-art survey of work addressing the querying of structurally heterogeneous documents, and we propose a classification of these approaches. We then compare them according to criteria that make it possible to position and differentiate our contribution. We formally define the classical concepts related to document-oriented systems (document, collection, etc.), and we extend this formalisation with additional concepts: absolute and partial paths, document schemas, and the dictionary. For manipulating and querying heterogeneous documents, we define a closed minimal algebraic kernel composed of five operators: selection, projection, unnest, aggregation and join (left join). We define each operator and explain its classical evaluation by a native document query engine. We then establish the reformulation rules for each of these operators based on the dictionary. We define the query reformulation process, which produces a query that can be evaluated by most document query engines while preserving the semantics of the classical operators (non-existent paths, null values). We show how the reformulation of a query initially written with partial and/or absolute paths solves the problem of structural heterogeneity of documents.
    Finally, we conduct experiments to validate the formal concepts introduced throughout this thesis. We evaluate the construction and maintenance of the dictionary while varying the configuration in terms of the number of structures per collection and the collection size. We then evaluate the query reformulation engine by comparing it to query evaluation in a context without structural heterogeneity, and then in a multi-query context. All our experiments were conducted on synthetic collections with several levels of nesting, different numbers of structures per collection, and varying collection sizes. Recently, we deployed our contribution in the neOCampus project to query heterogeneous data from sensors installed in classrooms and the library on the campus of the University of Toulouse III-Paul Sabatier.
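    The dictionary-based reformulation can be illustrated with a short sketch. The snippet below builds a path dictionary over JSON-like documents and expands a user-supplied partial path into every matching absolute path. It assumes partial paths are suffixes of absolute paths; the function names (build_path_dictionary, rewrite_path) and the sample collection are illustrative, not the thesis's implementation.

```python
# Minimal sketch of a path dictionary over heterogeneous JSON-like documents.
# Keys are absolute and partial (suffix) paths; values are the sets of absolute
# paths observed anywhere in the collection. Names and data are illustrative.
from collections import defaultdict

def absolute_paths(doc, prefix=()):
    """Yield every absolute path (as a tuple of keys) present in a document."""
    for key, value in doc.items():
        path = prefix + (key,)
        yield path
        if isinstance(value, dict):
            yield from absolute_paths(value, path)

def build_path_dictionary(collection):
    """Map each absolute or partial (suffix) path to its absolute paths."""
    dictionary = defaultdict(set)
    for doc in collection:
        for path in absolute_paths(doc):
            for i in range(len(path)):
                partial = path[i:]          # every suffix is a partial path
                dictionary[".".join(partial)].add(".".join(path))
    return dictionary

def rewrite_path(dictionary, user_path):
    """Expand a user path into all matching absolute paths in the collection."""
    return sorted(dictionary.get(user_path, set()))

if __name__ == "__main__":
    collection = [
        {"sensor": {"temp": 21.5}},
        {"room": {"sensor": {"temp": 22.0}}},   # same field, different nesting
    ]
    d = build_path_dictionary(collection)
    print(rewrite_path(d, "sensor.temp"))
    # -> ['room.sensor.temp', 'sensor.temp']
```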

    Automatic physical database design : recommending materialized views

    This work discusses physical database design, focusing on the problem of selecting materialized views to improve the performance of a database system. We first address the satisfiability and implication problems for mixed arithmetic constraints; the results are used to support the construction of a search space for view selection problems. We propose an approach for constructing a search space based on identifying maximum commonalities among queries and on rewriting queries using views. These commonalities are used to define candidate views for materialization, from which an optimal or near-optimal set can be chosen as a solution to the view selection problem. Using a search space constructed this way, we address a specific instance of the view selection problem that aims at minimizing the maintenance cost of multiple materialized views using multi-query optimization techniques. Further, we study the same problem in the context of a commercial database management system in the presence of memory and time restrictions, and we suggest a heuristic approach for maintaining the views while guaranteeing that the restrictions are satisfied. Finally, we consider a dynamic version of the view selection problem in which the workload is a sequence of query and update statements and views can be created (materialized) and dropped during the execution of the workload. We have implemented our approaches to the dynamic view selection problem and performed extensive experimental testing. Our experiments show that, in most cases, our approaches outperform previous ones in terms of effectiveness and efficiency.
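    As a rough illustration of the view selection problem discussed above, the sketch below greedily picks candidate views under a storage budget by benefit-per-size ratio. This is a generic heuristic, not the approach described in the abstract, and the candidate views, sizes, and benefits are invented for the example.

```python
# Generic greedy sketch of view selection: pick materialized views under a
# storage budget so that the total benefit to the workload is large.
# Candidates, sizes, and per-query benefits are illustrative placeholders.

def greedy_view_selection(candidates, budget):
    """candidates: dict view_name -> (size, benefit). Returns chosen view names."""
    chosen, used = [], 0
    remaining = dict(candidates)
    # Repeatedly take the view with the best benefit-per-size ratio that still fits.
    while remaining:
        name, (size, benefit) = max(
            remaining.items(), key=lambda kv: kv[1][1] / kv[1][0]
        )
        del remaining[name]
        if used + size <= budget and benefit > 0:
            chosen.append(name)
            used += size
    return chosen

if __name__ == "__main__":
    candidates = {
        "v_sales_by_day":   (120, 300),   # (size in MB, estimated benefit)
        "v_sales_by_month": (15,  180),
        "v_top_customers":  (40,  90),
    }
    print(greedy_view_selection(candidates, budget=150))
    # -> ['v_sales_by_month', 'v_sales_by_day']
```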

    A comparison of statistical machine learning methods in heartbeat detection and classification

    In health care, patients with heart problems require quick responsiveness in a clinical setting or in the operating theatre. Towards that end, automated classification of heartbeats is vital, as some heartbeat irregularities are time-consuming to detect. Analysis of electrocardiogram (ECG) signals is therefore an active area of research. The methods proposed in the literature depend on the structure of a heartbeat cycle. In this paper, we use interval- and amplitude-based features, together with a few samples from the ECG signal, as a feature vector. We study a variety of classification algorithms, focusing especially on a type of arrhythmia known as the ventricular ectopic beat (VEB). We compare the performance of the classifiers against algorithms proposed in the literature and make recommendations regarding features, sampling rate, and the choice of classifier to apply in a real-time clinical setting. The extensive study is based on the MIT-BIH arrhythmia database. Our main contributions are the evaluation of existing classifiers over a range of sampling rates, the recommendation of a detection methodology to employ in a practical setting, and the extension of the notion of a mixture of experts to a larger class of algorithms.
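    The feature-vector idea described above can be sketched as follows: interval and amplitude features plus a handful of raw samples are stacked into a matrix and passed to an off-the-shelf classifier. The data below is synthetic noise standing in for annotated MIT-BIH beats, and the random forest is only one plausible choice among the many algorithms the paper compares; feature names and parameters are assumptions for the sketch.

```python
# Illustrative sketch: interval/amplitude features plus a few raw samples around
# the R peak form the feature vector for a heartbeat classifier. The data here
# is synthetic; in practice features come from annotated ECG recordings.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n_beats = 1000

# Placeholder features: RR intervals before/after the beat, R-peak amplitude,
# and 8 samples taken around the R peak (all synthetic for this sketch).
rr_prev = rng.normal(0.8, 0.1, n_beats)
rr_next = rng.normal(0.8, 0.1, n_beats)
r_amp   = rng.normal(1.0, 0.2, n_beats)
samples = rng.normal(0.0, 0.3, (n_beats, 8))
X = np.column_stack([rr_prev, rr_next, r_amp, samples])
y = rng.integers(0, 2, n_beats)            # 0 = normal beat, 1 = ventricular ectopic

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```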

    Comparing nested GPSJ queries in multidimensional databases

    A multidimensional database can be seen as a collection of multidimensional cubes, from which information is usually extracted by aggregation; aggregated data can be calculated either from cubes containing elemental data or from views in which partially aggregated data are stored. Thus, view materialization and run-time optimization through query rewriting become crucial issues in determining the overall performance. The capability of matching two queries is necessary to address both issues; unfortunately, most works in this field consider only simple categories of queries. In this paper we focus on a relevant class of queries, those modeled by Nested Generalized Projection / Selection / Join (NGPSJ) expressions, in which different aggregation functions may be applied in sequence to the same measure and selections may be formulated, at different granularities, on both dimensions and measures of the cube. Given two NGPSJ expressions, we show how to recursively compute their ancestor, ..
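    A small pandas example gives the flavour of an NGPSJ-style query: a first aggregation (SUM) is applied to a measure at a fine granularity, a selection is placed on the aggregated measure, and a second aggregation (AVG) is then applied in sequence to the same measure at a coarser granularity. The table, column names, and values are made up for the illustration and are not taken from the paper.

```python
# Pandas illustration of a nested GPSJ-style query: SUM(qty) per (month, day)
# with a selection on a dimension, then a selection on the aggregated measure,
# then AVG over the daily sums per month. Data and names are made up.
import pandas as pd

sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb"],
    "day":     [1, 1, 2, 3, 3, 4],
    "product": ["A", "B", "A", "A", "B", "B"],
    "qty":     [10, 5, 8, 12, 7, 6],
})

# Inner block: selection on a dimension, then SUM(qty) at (month, day) granularity.
daily = (
    sales[sales["product"] == "A"]
    .groupby(["month", "day"], as_index=False)["qty"].sum()
)

# Outer block: selection on the aggregated measure, then AVG at month granularity.
monthly_avg = (
    daily[daily["qty"] > 5]
    .groupby("month", as_index=False)["qty"].mean()
    .rename(columns={"qty": "avg_daily_qty"})
)
print(monthly_avg)
```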

    Frameshift mutations at the C-terminus of HIST1H1E result in a specific DNA hypomethylation signature

    BACKGROUND: We previously associated HIST1H1E mutations causing Rahman syndrome with a specific genome-wide methylation pattern. RESULTS: Methylome analysis from peripheral blood samples of six affected subjects led us to identify a specific hypomethylated profile. This "episignature" was enriched for genes involved in neuronal system development and function. A computational classifier yielded full sensitivity and specificity in detecting subjects with Rahman syndrome. Applying this model to a cohort of undiagnosed probands allowed us to reach diagnosis in one subject. CONCLUSIONS: We demonstrate an epigenetic signature in subjects with Rahman syndrome that can be used to reach molecular diagnosis.