7 research outputs found

    GraphFind: enhancing graph searching by low support data mining techniques

    Background: Biomedical and chemical databases are large and rapidly growing in size. Graphs naturally model such kinds of data. To fully exploit the wealth of information in these graph databases, a key role is played by systems that search for all exact or approximate occurrences of a query graph. To deal efficiently with graph searching, advanced methods for indexing, representation and matching of graphs have been proposed. Results: This paper presents GraphFind. The system implements efficient graph searching algorithms together with advanced filtering techniques that allow approximate search. It allows users to select candidate subgraphs rather than entire graphs. It implements an effective data storage based also on low-support data mining. Conclusions: GraphFind is compared with Frowns, GraphGrep and gIndex. Experiments show that GraphFind outperforms the compared systems on a very large collection of small graphs. The proposed low-support mining technique, which applies to any searching system, also allows a significant index space reduction.

    Towards Next Generation Business Process Model Repositories – A Technical Perspective on Loading and Processing of Process Models

    Business process management repositories manage large collections of process models, often numbering in the thousands. They also provide management functions such as mining, querying, merging and variant management for process models. However, most current business process management repositories are built on top of relational database management systems (RDBMS), although this leads to performance issues. These issues result from relational algebra, the mismatch between relational tables and object-oriented programming (impedance mismatch), and technological developments of the last 30 years, such as larger and cheaper disk and memory space, clusters and clouds. The goal of this paper is to present current paradigms to overcome the performance problems inherent in RDBMS. To this end, we combine research on data modeling and database technologies with research on algorithm design and parallelization for current technology paradigms. Based on these research streams, we show how the performance of business process management repositories can be improved in terms of the loading performance of processes (e.g. from disk) and the computation of management techniques, resulting in faster application of such techniques. Example applications of the compiled paradigms are presented to show their applicability.

    Application of kernel functions for accurate similarity search in large chemical databases

    Background: Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening, among others. It is widely believed that structure-based methods provide an efficient way to perform such queries. Recently, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases because of their high computational complexity and the difficulty of indexing similarity search for large databases. Results: To bridge graph kernel functions and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed in our team, to measure the similarity of graph-represented chemicals. In our method, we utilize a hash table to support a new graph kernel function definition, efficient storage and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases, with a smaller index size and faster query processing time than state-of-the-art indexing methods such as Daylight fingerprints, C-tree and GraphGrep. Conclusions: Designing an efficient similarity query processing method for large chemical databases is challenging, since we need to balance running time efficiency and similarity search accuracy. Our previous similarity search method, G-hash, provides a new way to perform similarity search in chemical databases. An experimental study validates the utility of G-hash in chemical databases.
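
    The general idea of backing a graph similarity measure with a hash table can be illustrated with a toy sketch. This is not the G-hash definition from the paper: the node descriptor (a node label plus the sorted labels of its neighbors), the dot-product score, and the names `node_descriptors`, `build_table` and `top_k` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# A graph is a pair (labels, adj): labels maps vertex -> label,
# adj maps vertex -> list of neighboring vertices (undirected).

def node_descriptors(graph):
    """Count hypothetical node descriptors: (label, sorted neighbor labels)."""
    labels, adj = graph
    return Counter(
        (labels[v], tuple(sorted(labels[u] for u in adj[v]))) for v in labels
    )

def build_table(database):
    """Hash table: descriptor -> {graph id: number of nodes with that descriptor}."""
    table = defaultdict(dict)
    for gid, graph in database.items():
        for desc, count in node_descriptors(graph).items():
            table[desc][gid] = count
    return table

def top_k(table, query, k=1):
    """Score each graph by the dot product of descriptor counts with the query."""
    scores = Counter()
    for desc, q_count in node_descriptors(query).items():
        # Only graphs sharing this descriptor are touched, via one table lookup.
        for gid, count in table.get(desc, {}).items():
            scores[gid] += q_count * count
    return scores.most_common(k)

# Two toy heavy-atom skeletons: a C-C-O chain and a C-O-C chain.
chain = {0: [1], 1: [0, 2], 2: [1]}
database = {
    "cco": ({0: "C", 1: "C", 2: "O"}, chain),
    "coc": ({0: "C", 1: "O", 2: "C"}, chain),
}
table = build_table(database)
query = ({0: "C", 1: "C", 2: "O"}, chain)
nearest = top_k(table, query, k=1)
```

    Because the score is a dot product of count vectors, it is a valid kernel, and a query only visits table entries for descriptors it actually contains, which is what makes this kind of lookup cheap on large collections.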

    SING: Subgraph search In Non-homogeneous Graphs

    Background: Finding the subgraphs of a graph database that are isomorphic to a given query graph has practical applications in several fields, from cheminformatics to image understanding. Since subgraph isomorphism is a computationally hard problem, indexing techniques have been intensively exploited to speed up the process. Such systems filter out those graphs which cannot contain the query, and apply a subgraph isomorphism algorithm to each residual candidate graph. The applicability of such systems is limited to databases of small graphs, because their filtering power degrades on large graphs. Results: In this paper, SING (Subgraph search In Non-homogeneous Graphs), a novel indexing system able to cope with large graphs, is presented. The method uses the notion of feature, which can be a small subgraph, subtree or path. Each graph in the database is annotated with the set of all its features. The key point is to make use of feature locality information. This idea is used to both improve the filtering performance and speed up the subgraph isomorphism task. Conclusions: Extensive tests on chemical compounds, biological networks and synthetic graphs show that the proposed system outperforms the most popular systems in query time over databases of medium and large graphs. Other specific tests show that the proposed system is effective for single large graphs.
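
    The filtering step described in this abstract can be sketched as follows, assuming label paths as the features. This is a minimal illustration of the generic filter-and-verify scheme, not SING's index: it ignores the feature locality information that the paper relies on, and the function names and graph representation are hypothetical.

```python
from collections import defaultdict

# A graph is a pair (labels, adj): labels maps vertex -> label,
# adj maps vertex -> list of neighboring vertices (undirected).

def label_paths(graph, max_len=3):
    """Label sequences of all simple paths with at most max_len vertices."""
    labels, adj = graph
    features = set()

    def extend(path):
        features.add(tuple(labels[v] for v in path))
        if len(path) == max_len:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:  # keep the path simple
                extend(path + [nxt])

    for start in labels:
        extend([start])
    return features

def build_index(database, max_len=3):
    """Inverted index: feature -> ids of the graphs that contain it."""
    index = defaultdict(set)
    for gid, graph in database.items():
        for feat in label_paths(graph, max_len):
            index[feat].add(gid)
    return index

def candidates(index, database, query, max_len=3):
    """Graphs containing every query feature; the rest cannot contain the query."""
    result = set(database)
    for feat in label_paths(query, max_len):
        result &= index.get(feat, set())
    return result

chain3 = {0: [1], 1: [0, 2], 2: [1]}
database = {
    "g1": ({0: "C", 1: "O", 2: "C"}, chain3),
    "g2": ({0: "C", 1: "C", 2: "N"}, chain3),
}
index = build_index(database)
query = ({0: "C", 1: "O"}, {0: [1], 1: [0]})
survivors = candidates(index, database, query)
```

    The filter never discards a graph that contains the query, because every simple path of the query maps to a simple path with the same labels in any graph that contains it. Surviving candidates still have to be checked with a subgraph isomorphism algorithm.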

    A Query Optimization Technique using Graph-Structural Information in Relational RDF Stores

    Doctoral thesis (Ph.D.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2014. 김형주. As the size of Resource Description Framework (RDF) graphs has grown rapidly, SPARQL query processing on large-scale RDF graphs has become a more challenging problem. For efficient SPARQL query processing, the handling of intermediate results is the most crucial element because it generally involves many join operators. To address this problem, we propose a triple filtering method that exploits the graph-structural information of RDF data. We design the RDF Path index (RP-index) and the RDF Graph index (RG-index) for the triple filtering. These two indices use the path information and the graph information of the RDF graph, respectively. However, these indices have a size problem due to the exponential number of indexed patterns. We address the size problem by indexing only the path and graph patterns that are effective for the triple filtering. The triple filtering is performed very efficiently by a relational operator called the RDF Filter (RFLT), with little overhead compared to the original query processing.
Through comprehensive experiments on large-scale RDF datasets, we demonstrate that our approaches can effectively and efficiently reduce the number of redundant intermediate results and improve the query performance.
    Contents of "A Query Optimization Technique using Graph-Structural Information in Relational RDF Stores":
    Chapter 1 Introduction
        1.1 Research Motivation; 1.2 Our Contributions; 1.3 Outline
    Chapter 2 Related Work
        2.1 RDF Stores (2.1.1 Summary of Existing Methods of Relation-based RDF Stores; 2.1.2 Overview of RDF-3X); 2.2 Handling the Intermediate Results; 2.3 Path-based and Graph Indices; 2.4 Frequent Graph Pattern Mining
    Chapter 3 Preliminaries
        3.1 RDF and SPARQL; 3.2 Path and Graph Pattern (3.2.1 Incoming Predicate Path; 3.2.2 k-neighborhood Subgraph); 3.3 Candidate Vertex Set
    Chapter 4 R3F: RDF Triple Filtering Framework using RP-index
        4.1 Motivating Example; 4.2 Overall Process of R3F; 4.3 RP-index Definition (4.3.1 Physical Structure of RP-index; 4.3.2 Discriminative and Frequent Predicate Paths; 4.3.3 Reverse Predicate; 4.3.4 Handling Other Types of Queries; 4.3.5 Determining RP-index Parameters); 4.4 Processing Triple Filtering (4.4.1 RFLT Operator); 4.5 Generating an Execution Plan with RFLT Operators (4.5.1 Filtering Effect of Vlists; 4.5.2 Cardinality of RFLT Operator; 4.5.3 Generating an Execution Plan); 4.6 RP-index Building (4.6.1 Complexity of building RP-index; 4.6.2 Parallel Building Methods; 4.6.3 Incremental Maintenance); 4.7 Experimental Results (4.7.1 RP-index Size; 4.7.2 Query Evaluation Performance; 4.7.3 Incremental Maintenance of RP-index)
    Chapter 5 RG-index: RDF Triple Filtering using the Graph Index
        5.1 Motivating Example; 5.2 Design of RG-index (5.2.1 Physical Structure of RG-index); 5.3 Handling the Size Problem of RG-index (5.3.1 Discriminative Patterns; 5.3.2 Frequent Patterns); 5.4 Building RG-index (5.4.1 Overview of gSpan; 5.4.2 RDF Graph Pattern Mining using gSpan; 5.4.3 Complexity of building RG-index); 5.5 Triple Filtering using RG-index (5.5.1 Generating an Execution Plan with RFLT Operators); 5.6 Experimental Results (5.6.1 RG-index Size; 5.6.2 Query Evaluation Performance; 5.6.3 Index Building Time)
    Chapter 6 Conclusion and Future Work
        6.1 Future Work
    Appendices
        Chapter A Related Open Source Projects (A.1 RDF-3X; A.2 gSpan); Chapter B Data Structure of RP-index and RG-index (B.1 RP-index; B.2 RG-index); Chapter C Query Sets
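
    The triple filtering idea in this abstract can be illustrated with a small sketch based on incoming predicate paths. It is an assumption-laden toy, not the thesis's RP-index or RFLT operator: it indexes every path up to a fixed length rather than only discriminative and frequent ones, it runs in memory instead of inside a relational RDF store, and the function names and example triples are hypothetical.

```python
from collections import defaultdict

def build_path_index(triples, max_len=2):
    """Map each incoming predicate path (up to max_len) to the vertices it reaches.

    A vertex v has incoming path (p1, ..., pn) if some chain of triples
    v0 -p1-> v1 -p2-> ... -pn-> v exists in the data.
    """
    index = defaultdict(set)
    frontier = defaultdict(set)
    for s, p, o in triples:
        frontier[(p,)].add(o)
    for length in range(1, max_len + 1):
        longer = defaultdict(set)
        for path, vertices in frontier.items():
            index[path] |= vertices
            if length < max_len:
                # Extend the path by one predicate from each reached vertex.
                for s, p, o in triples:
                    if s in vertices:
                        longer[path + (p,)].add(o)
        frontier = longer
    return index

def filter_triples(triples, predicate, subject_path, index):
    """Keep triples with this predicate whose subject has the given incoming path."""
    allowed = index.get(subject_path, set())
    return [t for t in triples if t[1] == predicate and t[0] in allowed]

triples = [
    ("alice", "advises", "bob"),
    ("bob", "authored", "paper1"),
    ("carol", "authored", "paper2"),
    ("paper1", "cites", "paper2"),
]
index = build_path_index(triples)

# Query: ?a advises ?b . ?b authored ?p
# Any binding of ?b must have the incoming path ("advises",), so triples
# with other subjects can be dropped before the join is evaluated.
kept = filter_triples(triples, "authored", ("advises",), index)
```

    In this example the triple about `carol` is removed before the join, which is the kind of redundant intermediate result the abstract refers to.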