233 research outputs found

    Why-Query Support in Graph Databases

    Get PDF
    In the last few decades, database management systems became powerful tools for storing large amount of data and executing complex queries over them. In addition to extended functionality, novel types of databases appear like triple stores, distributed databases, etc. Graph databases implementing the property-graph model belong to this development branch and provide a new way for storing and processing data in the form of a graph with nodes representing some entities and edges describing connections between them. This consideration makes them suitable for keeping data without a rigid schema for use cases like social-network processing or data integration. In addition to a flexible storage, graph databases provide new querying possibilities in the form of path queries, detection of connected components, pattern matching, etc. However, the schema flexibility and graph queries come with additional costs. With limited knowledge about data and little experience in constructing the complex queries, users can create such ones, which deliver unexpected results. Forced to debug queries manually and overwhelmed by the amount of query constraints, users can get frustrated by using graph databases. What is really needed, is to improve usability of graph databases by providing debugging and explaining functionality for such situations. We have to assist users in the discovery of what were the reasons of unexpected results and what can be done in order to fix them. The unexpectedness of result sets can be expressed in terms of their size or content. In the first case, users have to solve the empty-answer, too-many-, or too-few-answers problems. In the second case, users care about the result content and miss some expected answers or wonder about presence of some unexpected ones. Considering the typical problems of receiving no or too many results by querying graph databases, in this thesis we focus on investigating the problems of the first group, whose solutions are usually represented by why-empty, why-so-few, and why-so-many queries. Our objective is to extend graph databases with debugging functionality in the form of why-queries for unexpected query results on the example of pattern matching queries, which are one of general graph-query types. We present a comprehensive analysis of existing debugging tools in the state-of-the-art research and identify their common properties. From them, we formulate the following features of why-queries, which we discuss in this thesis, namely: holistic support of different cardinality-based problems, explanation of unexpected results and query reformulation, comprehensive analysis of explanations, and non-intrusive user integration. To support different cardinality-based problems, we develop methods for explaining no, too few, and too many results. To cover different kinds of explanations, we present two types: subgraph- and modification-based explanations. The first type identifies the reasons of unexpectedness in terms of query subgraphs and delivers differential graphs as answers. The second one reformulates queries in such a way that they produce better results. Considering graph queries to be complex structures with multiple constraints, we investigate different ways of generating explanations starting from the most general one that considers only a query topology through coarse-grained rewriting up to fine-grained modification that allows fine changes of predicates and topology. To provide a comprehensive analysis of explanations, we propose to compare them on three levels including a syntactic description, a content, and a size of a result set. In order to deliver user-aware explanations, we discuss two models for non-intrusive user integration in the generation process. With the techniques proposed in this thesis, we are able to provide fundamentals for debugging of pattern-matching queries, which deliver no, too few, or too many results, in graph databases implementing the property-graph model

    Automated intrusion recovery for web applications

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.Cataloged from PDF version of thesis.Includes bibliographical references (pages 93-97).In this dissertation, we develop recovery techniques for web applications and demonstrate that automated recovery from intrusions and user mistakes is practical as well as effective. Web applications play a critical role in users' lives today, making them an attractive target for attackers. New vulnerabilities are routinely found in web application software, and even if the software is bug-free, administrators may make security mistakes such as misconfiguring permissions; these bugs and mistakes virtually guarantee that every application will eventually be compromised. To clean up after a successful attack, administrators need to find its entry point, track down its effects, and undo the attack's corruptions while preserving legitimate changes. Today this is all done manually, which results in days of wasted effort with no guarantee that all traces of the attack have been found or that no legitimate changes were lost. To address this problem, we propose that automated intrusion recovery should be an integral part of web application platforms. This work develops several ideas-retroactive patching, automated UI replay, dependency tracking, patch-based auditing, and distributed repair-that together recover from past attacks that exploited a vulnerability, by retroactively fixing the vulnerability and repairing the system state to make it appear as if the vulnerability never existed. Repair tracks down and reverts effects of the attack on other users within the same application and on other applications, while preserving legitimate changes. Using techniques resulting from these ideas, an administrator can easily recover from past attacks that exploited a bug using nothing more than a patch fixing the bug, with no manual effort on her part to find the attack or track its effects. The same techniques can also recover from attacks that exploit past configuration mistakes-the administrator only has to point out the past request that resulted in the mistake. We built three prototype systems, WARP, POIROT, and AIRE, to explore these ideas. Using these systems, we demonstrate that we can recover from challenging attacks in real distributed web applications with little or no changes to application source code; that recovery time is a fraction of the original execution time for attacks with a few affected requests; and that support for recovery adds modest runtime overhead during the application's normal operation.by Ramesh Chandra.Ph.D

    A Survey of Prevent and Detect Access Control Vulnerabilities

    Full text link
    Broken access control is one of the most common security vulnerabilities in web applications. These vulnerabilities are the major cause of many data breach incidents, which result in privacy concern and revenue loss. However, preventing and detecting access control vulnerabilities proactively in web applications could be difficult. Currently, these vulnerabilities are actively detected by bug bounty hunters post-deployment, which creates attack windows for malicious access. To solve this problem proactively requires security awareness and expertise from developers, which calls for systematic solutions. This survey targets to provide a structured overview of approaches that tackle access control vulnerabilities. It firstly discusses the unique feature of access control vulnerabilities, then studies the existing works proposed to tackle access control vulnerabilities in web applications, which span the spectrum of software development from software design and implementation, software analysis and testing, and runtime monitoring. At last we discuss the open problem in this field

    Metadata-driven data integration

    Get PDF
    Cotutela: Universitat Politècnica de Catalunya i Université Libre de Bruxelles, IT4BI-DC programme for the joint Ph.D. degree in computer science.Data has an undoubtable impact on society. Storing and processing large amounts of available data is currently one of the key success factors for an organization. Nonetheless, we are recently witnessing a change represented by huge and heterogeneous amounts of data. Indeed, 90% of the data in the world has been generated in the last two years. Thus, in order to carry on these data exploitation tasks, organizations must first perform data integration combining data from multiple sources to yield a unified view over them. Yet, the integration of massive and heterogeneous amounts of data requires revisiting the traditional integration assumptions to cope with the new requirements posed by such data-intensive settings. This PhD thesis aims to provide a novel framework for data integration in the context of data-intensive ecosystems, which entails dealing with vast amounts of heterogeneous data, from multiple sources and in their original format. To this end, we advocate for an integration process consisting of sequential activities governed by a semantic layer, implemented via a shared repository of metadata. From an stewardship perspective, this activities are the deployment of a data integration architecture, followed by the population of such shared metadata. From a data consumption perspective, the activities are virtual and materialized data integration, the former an exploratory task and the latter a consolidation one. Following the proposed framework, we focus on providing contributions to each of the four activities. We begin proposing a software reference architecture for semantic-aware data-intensive systems. Such architecture serves as a blueprint to deploy a stack of systems, its core being the metadata repository. Next, we propose a graph-based metadata model as formalism for metadata management. We focus on supporting schema and data source evolution, a predominant factor on the heterogeneous sources at hand. For virtual integration, we propose query rewriting algorithms that rely on the previously proposed metadata model. We additionally consider semantic heterogeneities in the data sources, which the proposed algorithms are capable of automatically resolving. Finally, the thesis focuses on the materialized integration activity, and to this end, proposes a method to select intermediate results to materialize in data-intensive flows. Overall, the results of this thesis serve as contribution to the field of data integration in contemporary data-intensive ecosystems.Les dades tenen un impacte indubtable en la societat. La capacitat d’emmagatzemar i processar grans quantitats de dades disponibles és avui en dia un dels factors claus per l’èxit d’una organització. No obstant, avui en dia estem presenciant un canvi representat per grans volums de dades heterogenis. En efecte, el 90% de les dades mundials han sigut generades en els últims dos anys. Per tal de dur a terme aquestes tasques d’explotació de dades, les organitzacions primer han de realitzar una integració de les dades, combinantles a partir de diferents fonts amb l’objectiu de tenir-ne una vista unificada d’elles. Per això, aquest fet requereix reconsiderar les assumpcions tradicionals en integració amb l’objectiu de lidiar amb els requisits imposats per aquests sistemes de tractament massiu de dades. Aquesta tesi doctoral té com a objectiu proporcional un nou marc de treball per a la integració de dades en el context de sistemes de tractament massiu de dades, el qual implica lidiar amb una gran quantitat de dades heterogènies, provinents de múltiples fonts i en el seu format original. Per això, proposem un procés d’integració compost d’una seqüència d’activitats governades per una capa semàntica, la qual és implementada a partir d’un repositori de metadades compartides. Des d’una perspectiva d’administració, aquestes activitats són el desplegament d’una arquitectura d’integració de dades, seguit per la inserció d’aquestes metadades compartides. Des d’una perspectiva de consum de dades, les activitats són la integració virtual i materialització de les dades, la primera sent una tasca exploratòria i la segona una de consolidació. Seguint el marc de treball proposat, ens centrem en proporcionar contribucions a cada una de les quatre activitats. La tesi inicia proposant una arquitectura de referència de software per a sistemes de tractament massiu de dades amb coneixement semàntic. Aquesta arquitectura serveix com a planell per a desplegar un conjunt de sistemes, sent el repositori de metadades al seu nucli. Posteriorment, proposem un model basat en grafs per a la gestió de metadades. Concretament, ens centrem en donar suport a l’evolució d’esquemes i fonts de dades, un dels factors predominants en les fonts de dades heterogènies considerades. Per a l’integració virtual, proposem algorismes de rescriptura de consultes que usen el model de metadades previament proposat. Com a afegitó, considerem heterogeneïtat semàntica en les fonts de dades, les quals els algorismes de rescriptura poden resoldre automàticament. Finalment, la tesi es centra en l’activitat d’integració materialitzada. Per això proposa un mètode per a seleccionar els resultats intermedis a materialitzar un fluxes de tractament intensiu de dades. En general, els resultats d’aquesta tesi serveixen com a contribució al camp d’integració de dades en els ecosistemes de tractament massiu de dades contemporanisLes données ont un impact indéniable sur la société. Le stockage et le traitement de grandes quantités de données disponibles constituent actuellement l’un des facteurs clés de succès d’une entreprise. Néanmoins, nous assistons récemment à un changement représenté par des quantités de données massives et hétérogènes. En effet, 90% des données dans le monde ont été générées au cours des deux dernières années. Ainsi, pour mener à bien ces tâches d’exploitation des données, les organisations doivent d’abord réaliser une intégration des données en combinant des données provenant de sources multiples pour obtenir une vue unifiée de ces dernières. Cependant, l’intégration de quantités de données massives et hétérogènes nécessite de revoir les hypothèses d’intégration traditionnelles afin de faire face aux nouvelles exigences posées par les systèmes de gestion de données massives. Cette thèse de doctorat a pour objectif de fournir un nouveau cadre pour l’intégration de données dans le contexte d’écosystèmes à forte intensité de données, ce qui implique de traiter de grandes quantités de données hétérogènes, provenant de sources multiples et dans leur format d’origine. À cette fin, nous préconisons un processus d’intégration constitué d’activités séquentielles régies par une couche sémantique, mise en oeuvre via un dépôt partagé de métadonnées. Du point de vue de la gestion, ces activités consistent à déployer une architecture d’intégration de données, suivies de la population de métadonnées partagées. Du point de vue de la consommation de données, les activités sont l’intégration de données virtuelle et matérialisée, la première étant une tâche exploratoire et la seconde, une tâche de consolidation. Conformément au cadre proposé, nous nous attachons à fournir des contributions à chacune des quatre activités. Nous commençons par proposer une architecture logicielle de référence pour les systèmes de gestion de données massives et à connaissance sémantique. Une telle architecture consiste en un schéma directeur pour le déploiement d’une pile de systèmes, le dépôt de métadonnées étant son composant principal. Ensuite, nous proposons un modèle de métadonnées basé sur des graphes comme formalisme pour la gestion des métadonnées. Nous mettons l’accent sur la prise en charge de l’évolution des schémas et des sources de données, facteur prédominant des sources hétérogènes sous-jacentes. Pour l’intégration virtuelle, nous proposons des algorithmes de réécriture de requêtes qui s’appuient sur le modèle de métadonnées proposé précédemment. Nous considérons en outre les hétérogénéités sémantiques dans les sources de données, que les algorithmes proposés sont capables de résoudre automatiquement. Enfin, la thèse se concentre sur l’activité d’intégration matérialisée et propose à cette fin une méthode de sélection de résultats intermédiaires à matérialiser dans des flux des données massives. Dans l’ensemble, les résultats de cette thèse constituent une contribution au domaine de l’intégration des données dans les écosystèmes contemporains de gestion de données massivesPostprint (published version

    A research roadmap towards achieving scalability in model driven engineering

    Get PDF
    International audienceAs Model-Driven Engineering (MDE) is increasingly applied to larger and more complex systems, the current generation of modelling and model management technologies are being pushed to their limits in terms of capacity and eciency. Additional research and development is imperative in order to enable MDE to remain relevant with industrial practice and to continue delivering its widely recognised productivity , quality, and maintainability benefits. Achieving scalabil-ity in modelling and MDE involves being able to construct large models and domain-specific languages in a systematic manner, enabling teams of modellers to construct and refine large models in a collaborative manner, advancing the state of the art in model querying and transformations tools so that they can cope with large models (of the scale of millions of model elements), and providing an infrastructure for ecient storage, indexing and retrieval of large models. This paper attempts to provide a research roadmap for these aspects of scalability in MDE and outline directions for work in this emerging research area

    Sensei : enforcing secure coding guidelines in the integrated development environment

    Get PDF
    We discuss the potential benefits, requirements, and implementation challenges of a security-by-design approach in which an integrated development environment (IDE) plugin assists software developers to write code that complies with secure coding guidelines. We discuss how such a plugin can enable a company's policy-setting security experts and developers to pass their knowledge on to each other more efficiently, and to let developers more effectively put that knowledge into practice. This is achieved by letting the team members develop customized rule sets that formalize coding guidelines and by letting the plugin check the compliance of code being written to those rule sets in real time, similar to an as-you-type spell checker. Upon detected violations, the plugin suggests options to quickly fix them and offers additional information for the developer. We share our experience with proof-of-concept designs and implementations rolled out in multiple companies, and present some future research and development directions

    Scalable and Declarative Information Extraction in a Parallel Data Analytics System

    Get PDF
    Informationsextraktions (IE) auf sehr großen Datenmengen erfordert hochkomplexe, skalierbare und anpassungsfähige Systeme. Obwohl zahlreiche IE-Algorithmen existieren, ist die nahtlose und erweiterbare Kombination dieser Werkzeuge in einem skalierbaren System immer noch eine große Herausforderung. In dieser Arbeit wird ein anfragebasiertes IE-System für eine parallelen Datenanalyseplattform vorgestellt, das für konkrete Anwendungsdomänen konfigurierbar ist und für Textsammlungen im Terabyte-Bereich skaliert. Zunächst werden konfigurierbare Operatoren für grundlegende IE- und Web-Analytics-Aufgaben definiert, mit denen komplexe IE-Aufgaben in Form von deklarativen Anfragen ausgedrückt werden können. Alle Operatoren werden hinsichtlich ihrer Eigenschaften charakterisiert um das Potenzial und die Bedeutung der Optimierung nicht-relationaler, benutzerdefinierter Operatoren (UDFs) für Data Flows hervorzuheben. Anschließend wird der Stand der Technik in der Optimierung nicht-relationaler Data Flows untersucht und herausgearbeitet, dass eine umfassende Optimierung von UDFs immer noch eine Herausforderung ist. Darauf aufbauend wird ein erweiterbarer, logischer Optimierer (SOFA) vorgestellt, der die Semantik von UDFs mit in die Optimierung mit einbezieht. SOFA analysiert eine kompakte Menge von Operator-Eigenschaften und kombiniert eine automatisierte Analyse mit manuellen UDF-Annotationen, um die umfassende Optimierung von Data Flows zu ermöglichen. SOFA ist in der Lage, beliebige Data Flows aus unterschiedlichen Anwendungsbereichen logisch zu optimieren, was zu erheblichen Laufzeitverbesserungen im Vergleich mit anderen Techniken führt. Als Viertes wird die Anwendbarkeit des vorgestellten Systems auf Korpora im Terabyte-Bereich untersucht und systematisch die Skalierbarkeit und Robustheit der eingesetzten Methoden und Werkzeuge beurteilt um schließlich die kritischsten Herausforderungen beim Aufbau eines IE-Systems für sehr große Datenmenge zu charakterisieren.Information extraction (IE) on very large data sets requires highly complex, scalable, and adaptive systems. Although numerous IE algorithms exist, their seamless and extensible combination in a scalable system still is a major challenge. This work presents a query-based IE system for a parallel data analysis platform, which is configurable for specific application domains and scales for terabyte-sized text collections. First, configurable operators are defined for basic IE and Web Analytics tasks, which can be used to express complex IE tasks in the form of declarative queries. All operators are characterized in terms of their properties to highlight the potential and importance of optimizing non-relational, user-defined operators (UDFs) for dataflows. Subsequently, we survey the state of the art in optimizing non-relational dataflows and highlight that a comprehensive optimization of UDFs is still a challenge. Based on this observation, an extensible, logical optimizer (SOFA) is introduced, which incorporates the semantics of UDFs into the optimization process. SOFA analyzes a compact set of operator properties and combines automated analysis with manual UDF annotations to enable a comprehensive optimization of data flows. SOFA is able to logically optimize arbitrary data flows from different application areas, resulting in significant runtime improvements compared to other techniques. Finally, the applicability of the presented system to terabyte-sized corpora is investigated. Hereby, we systematically evaluate scalability and robustness of the employed methods and tools in order to pinpoint the most critical challenges in building an IE system for very large data sets
    • …
    corecore