
    Approximating expressive queries on graph-modeled data: The GeX approach

    We present GeX (Graph-eXplorer), an approach for the approximate matching of complex queries on graph-modeled data. GeX generalizes existing approaches and provides a highly expressive graph-based query language that supports queries ranging from keyword-based to fully structured ones. The GeX query answering model gracefully blends label approximation with structural relaxation, under the primary objective of delivering only meaningfully approximated results. GeX implements ad hoc data structures that are exploited by a top-k retrieval algorithm, which enhances the approximate matching of complex queries. An extensive experimental evaluation on real-world datasets demonstrates the efficiency of GeX query answering.
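
    As a rough illustration of how a blended score could drive top-k retrieval (a sketch only; the weighting scheme, names, and parameters below are illustrative assumptions, not GeX's actual model):

        import heapq

        def blended_score(label_sim, relax_penalty, alpha=0.5):
            # Blend label approximation with structural relaxation.
            # alpha is a hypothetical weighting knob, not from the paper.
            return alpha * label_sim - (1 - alpha) * relax_penalty

        def top_k_answers(candidates, k=10):
            # candidates: iterable of (label_sim, relax_penalty, answer)
            return heapq.nlargest(k, candidates,
                                  key=lambda c: blended_score(c[0], c[1]))

        print(top_k_answers([(0.9, 0.1, 'a1'), (0.6, 0.0, 'a2'),
                             (0.2, 0.8, 'a3')], k=2))  # keeps a1, a2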

    Physical Representation-based Predicate Optimization for a Visual Analytics Database

    Querying the content of images, video, and other non-textual data sources requires expensive content extraction methods. Modern extraction techniques are based on deep convolutional neural networks (CNNs) and can classify objects within images with astounding accuracy. Unfortunately, these methods are slow: processing a single image can take about 10 milliseconds on modern GPU-based hardware. As massive video libraries become ubiquitous, running a content-based query over millions of video frames is prohibitive. One promising approach to reduce the runtime cost of queries over visual content is to use a hierarchical model, such as a cascade, where simple cases are handled by an inexpensive classifier. Prior work has sought to design cascades that optimize the computational cost of inference by, for example, using smaller CNNs. However, we observe that there are critical factors besides inference time that dramatically impact the overall query time. Notably, by treating the physical representation of the input image as part of our query optimization---that is, by including image transforms, such as resolution scaling or color-depth reduction, within the cascade---we can optimize data handling costs and enable drastically more efficient classifier cascades. In this paper, we propose Tahoma, which generates and evaluates many potential classifier cascades that jointly optimize the CNN architecture and input data representation. Our experiments on a subset of ImageNet show that Tahoma's input transformations speed up cascades by up to 35 times. We also find up to a 98x speedup over the ResNet50 classifier with no loss in accuracy, and a 280x speedup if some accuracy is sacrificed.
    Comment: Camera-ready version of the paper, in Proceedings of the 35th IEEE International Conference on Data Engineering (ICDE 2019).
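
    A minimal sketch of the cascade idea described above (the stage structure, thresholds, and callables are assumptions for illustration, not Tahoma's implementation):

        def run_cascade(image, stages, fallback):
            # stages: list of (transform, model, lo, hi), tried cheapest first.
            # transform changes the physical representation (e.g., lower
            # resolution or reduced color depth); model returns P(match).
            for transform, model, lo, hi in stages:
                p = model(transform(image))
                if p >= hi:            # confidently positive: stop early
                    return True
                if p <= lo:            # confidently negative: stop early
                    return False
            return fallback(image)     # undecided: pay for the expensive CNN

    On this view, the planner's job is to pick the stage list (transforms, models, thresholds) that minimizes expected cost per image, counting data handling as well as inference.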

    Graph Processing in Main-Memory Column Stores

    More and more, novel and traditional business applications alike leverage the advantages of a graph data model, such as its schema flexibility and its explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS as the single source of truth and access. Existing solutions that perform graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. In the first approach, relying solely on SQL yields poor execution performance because of the functional mismatch between typical graph operations and the relational algebra. Worse, graph algorithms exhibit tremendous variety in structure and functionality, owing to their often domain-specific implementations, and can therefore hardly be integrated into a database management system other than through custom coding. Since the majority of these enterprise-critical applications run exclusively on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible: besides the maintenance overhead of keeping the systems in sync, combining graph and relational operations is hard to realize, as it requires data transfer across system boundaries.
    Traversal operations are a basic ingredient of graph queries and algorithms, and a fundamental component of any database management system that aims to store, manipulate, and query graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. Integrating graph traversals as an operator into a database management system requires tight integration with the existing database environment and the development of new components, such as a graph-topology-aware optimizer with accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and an accompanying graph query language.
    In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system built as part of an RDBMS, allowing graph data to be processed seamlessly together with relational data in the same system. We propose a columnar storage representation for graph data in order to leverage the existing, mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE we propose an execution engine based solely on set operations and graph traversals. Our design is driven by the observation that different graph topologies impose different algorithmic requirements on the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures that improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model for extending graph traversals with custom application logic at runtime.
    We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators. Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, and security and user management, effectively offering a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes.
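
    To give a flavor of the columnar angle, here is a generic sketch of a traversal over an array-based (CSR-style) adjacency layout, the kind of representation a column store holds naturally; this is an assumption for illustration, not GRAPHITE's actual operator:

        from collections import deque

        # Two plain arrays, as a column store would keep them:
        # offsets[v] .. offsets[v+1] delimits v's neighbors in targets.
        offsets = [0, 2, 3, 3, 4]
        targets = [1, 2, 3, 0]

        def traverse(start):
            # Level-synchronous BFS over the columnar adjacency arrays.
            seen, frontier, order = {start}, deque([start]), []
            while frontier:
                v = frontier.popleft()
                order.append(v)
                for u in targets[offsets[v]:offsets[v + 1]]:
                    if u not in seen:
                        seen.add(u)
                        frontier.append(u)
            return order

        print(traverse(0))  # [0, 1, 2, 3] on this toy graph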

    An efficient and scalable algorithm for clustering XML documents by structure


    Provenance in Collaborative Data Sharing

    This dissertation focuses on recording, maintaining and exploiting provenance information in Collaborative Data Sharing Systems (CDSS). These are systems that support data sharing across loosely-coupled, heterogeneous collections of relational databases related by declarative schema mappings. A fundamental challenge in a CDSS is to support the capability of update exchange --- which publishes a participant's updates and then translates others' updates to the participant's local schema and imports them --- while tolerating disagreement between participants and recording the provenance of exchanged data, i.e., information about the sources and mappings involved in their propagation. This provenance information can be useful during update exchange, e.g., to evaluate provenance-based trust policies. It can also be exploited after update exchange, to answer a variety of user queries about the quality, uncertainty or authority of the data, for applications such as trust assessment, ranking for keyword search over databases, or query answering in probabilistic databases. To address these challenges, in this dissertation we develop a novel model of provenance graphs that is informative enough to satisfy the needs of CDSS users and captures the semantics of query answering on various forms of annotated relations. We extend techniques from data integration, data exchange, incremental view maintenance and view update to define the formal semantics of unidirectional and bidirectional update exchange. We develop algorithms to perform update exchange incrementally while maintaining provenance information. We present strategies for implementing our techniques over an RDBMS and experimentally demonstrate their viability in the Orchestra prototype system. We define ProQL, a query language for provenance graphs that can be used by CDSS users to combine data querying with provenance testing, as well as to compute annotations for their data, based on their provenance, that are useful for a variety of applications. Finally, we develop a prototype implementation of ProQL over an RDBMS, along with indexing techniques to speed up provenance querying, and experimentally evaluate the performance of provenance querying and the benefits of our indexing techniques.
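
    A toy sketch of query answering over annotated relations, in the provenance-semiring style associated with this line of work (the relations, names, and string encoding of annotations are illustrative assumptions, not the dissertation's model):

        # A join multiplies the annotations of the tuples it combines;
        # alternative derivations of the same output tuple are added.
        R = [((1, 'a'), 'r1'), ((2, 'b'), 'r2')]
        S = [(('a', 10), 's1'), (('a', 20), 's2')]

        def join_with_provenance(R, S):
            out = {}
            for (x, y), p in R:
                for (y2, z), q in S:
                    if y == y2:
                        key = (x, y, z)
                        term = f"{p}*{q}"  # joint derivation: product
                        out[key] = f"{out[key]} + {term}" if key in out else term
            return out

        print(join_with_provenance(R, S))
        # {(1, 'a', 10): 'r1*s1', (1, 'a', 20): 'r1*s2'}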

    SHRuB: searching through heuristics for the better query-execution plan

    An important aspect to be considered for systems aiming at integrating similarity queries into an RDBMS is how to represent and optimize query plans that involve both traditional and complex predicates. Toward developing facilities for such integration, we developed a technique to extract a canonical query-plan command tree from a similarity-extended SQL expression. The SHRuB tool, presented in this paper, is able to interactively represent a query parse tree. We developed a catalog model that allows estimating execution cost and provides hints for optimizing the query plan by adopting a three-stage heuristic. Through a case study and initial experiments, we have demonstrated that the tool is able to find a local-minimum query-execution plan. Moreover, SHRuB can be plugged into existing frameworks that support similarity queries or employed as a courseware aid for database teaching.
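
    A sketch of the kind of catalog-driven cost estimate such a heuristic compares across alternative plans (a unary filter pipeline only; the node shape, per-tuple costs, and selectivities are illustrative assumptions, not SHRuB's catalog model):

        class PlanNode:
            # One operator in a canonical query-plan tree; similarity
            # predicates are modeled as expensive, traditional ones as cheap.
            def __init__(self, op, cost_per_tuple, selectivity, children=()):
                self.op, self.cost_per_tuple = op, cost_per_tuple
                self.selectivity, self.children = selectivity, children

        def estimate(node, input_card):
            # Bottom-up: accumulate child cost, then pay per surviving tuple.
            cost, card = 0.0, input_card
            for child in node.children:
                child_cost, card = estimate(child, card)
                cost += child_cost
            cost += card * node.cost_per_tuple
            return cost, card * node.selectivity

        scan = PlanNode('seq_scan', 1.0, 1.0)
        plan_a = PlanNode('similarity_filter', 50.0, 0.1,
                          (PlanNode('equality_filter', 0.1, 0.05, (scan,)),))
        plan_b = PlanNode('equality_filter', 0.1, 0.05,
                          (PlanNode('similarity_filter', 50.0, 0.1, (scan,)),))
        print(estimate(plan_a, 10_000))  # (36000.0, 50.0): cheap filter first
        print(estimate(plan_b, 10_000))  # (510100.0, 50.0): expensive first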

    Runtime Optimization of Relational Query Streams

    As part of this thesis, we have developed a Java library that enables efficient combinations of two or more queries into one, which is then submitted to the database in their place. Essentially, the library acts as a wrapper around the database vendor's JDBC driver. It operates in two modes. In the first phase it runs in a "training" mode: it observes the original queries and records them, together with additional metadata, in an n-ary tree that we call the "context tree". The first phase accounts for roughly 10% of the total running time of the business application. At the end of the first phase, the library decides which original queries can and should be combined with one another, judged by the resulting reduction in latency. The second phase, called "normal" operation and typically covering the remaining 90% of the time, uses the alternative, combined queries produced by the first phase and submits them to the database in place of the original simple queries. The rewriting happens on the fly while the system is in normal operation. Although the generated queries are more complex, they expose to the relational database's query processor more opportunities for optimization, which would otherwise remain hidden and unexploited inside the application.
    In addition, we have developed a model for costing all the proposed rewriting techniques, which lets us compare them with one another and take into account system parameters such as the average network latency. Costing the available alternatives and selecting the best rewriting scheme for a query stream takes place after the "training" phase ends and before "normal" operation begins. Finally, we ran an extensive series of benchmarks to measure the improvement in the network latency perceived by the end user. All the alternative rewriting strategies proved 2 to 4 times more efficient than the original query scheme. Nevertheless, the proposed strategies are not more efficient in every case: we found that on certain networks where the average network latency is very small, the alternative queries can be slower than the originals. This finding, along with many other conclusions, is presented and interpreted in detail in the chapter on experimental evaluation.
    Current multi-user applications submit streams of relational queries against a back-end database server. For many years the research community has focused its attention on developing sophisticated storage engines, more efficient query processors, scalable clustering systems, and in-memory key-value caching servers that alleviate throughput bottlenecks and allow for more concurrency. Query streams initiated by individual users have received some attention in the form of result-set caching; however, unless the same query is resubmitted, this remedy has not proved very effective at minimizing the latency perceived by the end user. In addition, improvements in network latency have lagged behind developments in bandwidth.
    In this thesis, we tackle the latency experienced by end users, which, in contrast to throughput, remains a serious issue, and no networking hardware improvement seems likely to alleviate it in the future. We studied a specific pattern of query streams that is quite often found in applications and is characterized by query correlations and deep nesting. We verified that these two factors result in excessive numbers of round-trips, which could be avoided either by manually rewriting the queries or by some form of runtime rewriting on the fly. Although manual rewriting could produce much more efficient queries, it is not recommended, as it is not always clear for which system configuration or application instance we should optimize; furthermore, good software engineering practices promote code modularity and encapsulation, with simpler queries. We have developed a prototype software library that allows runtime optimization of these query streams. We implemented a number of alternative query rewritings that are applied at run time and essentially submit a combined query to the back-end RDBMS. This combined query, although more complex, reveals to the RDBMS query processor more scope for optimization, which would otherwise remain hidden within the client application code. In addition, we developed an analytic cost model that allows us to compare the different alternatives while taking into account critical system properties such as network communication latency. We performed comprehensive benchmarking to measure the improvement in total latency seen by the end user. Finally, we present experimental results in which all the alternative strategies outperform the original queries by 2 to 4 times.
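
    The thesis describes a Java/JDBC library; the following Python sketch only illustrates the underlying rewriting and the round-trip arithmetic, with hypothetical table names and an assumed network latency:

        # Original correlated pattern: one outer query, then one inner query
        # per result row (N+1 round-trips over the network).
        outer = "SELECT id FROM orders WHERE customer_id = ?"
        inner = "SELECT * FROM order_items WHERE order_id = ?"

        # A combined rewriting a runtime library could submit instead: one
        # round-trip, letting the RDBMS optimizer see the whole correlation.
        combined = """
        SELECT o.id, i.*
        FROM orders o
        JOIN order_items i ON i.order_id = o.id
        WHERE o.customer_id = ?
        """

        ROUND_TRIP_MS = 5   # assumed average network latency
        N_ORDERS = 100      # assumed outer result size
        print("naive:", (1 + N_ORDERS) * ROUND_TRIP_MS, "ms in round-trips")
        print("combined:", 1 * ROUND_TRIP_MS, "ms in round-trips")

    With these assumed numbers the naive stream spends about 505 ms on round-trips versus 5 ms for the combined query; as the latency term shrinks toward zero, the extra complexity of the combined query can make it the slower option, matching the experimental finding above.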