353 research outputs found

    Towards an Efficient Discovery of the Topological Representative Subgraphs

    With the emergence of graph databases, the task of frequent subgraph discovery has been extensively addressed. Although the approaches proposed in the literature have made this task feasible, the number of discovered frequent subgraphs is still too high for them to be used efficiently in any further exploration. Feature selection for graph data is a way to reduce the high number of frequent subgraphs based on exact or approximate structural similarity. However, current structural similarity strategies are not efficient enough in many real-world applications; moreover, the combinatorial nature of graphs makes them computationally very costly. In order to select a smaller yet structurally non-redundant set of subgraphs, we propose a novel approach that mines the top-k topological representative subgraphs among the frequent ones. Our approach detects hidden structural similarities that existing approaches are unable to detect, such as the density or the diameter of the subgraph. In addition, it can be easily extended using any user-defined structural or topological attributes depending on the sought properties. Empirical studies on real and synthetic graph datasets show that our approach is fast and scalable.
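
    As a rough illustration of the two topological attributes named in this abstract, the sketch below computes the density and diameter of small undirected graphs and pairs them into a description tuple. The function names, the edge-list representation, and the choice of exactly these two attributes are assumptions made for illustration; the paper's actual attribute set and its top-k selection procedure are not given here.

```python
from collections import deque


def density(num_nodes, edges):
    """Density of a simple undirected graph: 2|E| / (|V| * (|V| - 1))."""
    if num_nodes < 2:
        return 0.0
    return 2 * len(edges) / (num_nodes * (num_nodes - 1))


def diameter(num_nodes, edges):
    """Longest shortest-path length; assumes the graph is connected."""
    adj = {v: [] for v in range(num_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    best = 0
    for src in adj:
        # Breadth-first search from src gives shortest hop distances.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best


def topological_description(num_nodes, edges):
    """Hypothetical attribute tuple: (density, diameter)."""
    return (density(num_nodes, edges), diameter(num_nodes, edges))


# Two 4-node subgraphs with the same number of edges.
path = [(0, 1), (1, 2), (2, 3)]
star = [(0, 1), (0, 2), (0, 3)]
path_desc = topological_description(4, path)
star_desc = topological_description(4, star)
```

    In this toy case the path and the star have the same density but different diameters, so an attribute-based comparison separates them without running an isomorphism test.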

    Towards an Efficient Discovery of Topological Representative Subgraphs

    National audience. Pattern selection based on exact or approximate structural similarity is a way to reduce the high number of frequent subgraphs. However, current structural similarity strategies are not effective in many real-world contexts. Moreover, the combinatorial nature of graphs makes exact or approximate isomorphism very costly. In this paper, we propose an approach that selects a subset of topological representative subgraphs among the frequent ones. The proposed approach avoids the costly exact or approximate isomorphism test by measuring overall structural similarity based on a set of considered topological attributes. It also detects hidden structural similarities (such as density, diameter, etc.) that existing approaches do not consider. In addition, the proposed approach is flexible and can easily be extended with user-defined attributes depending on the application. Experimental analyses on real and synthetic graph datasets show the effectiveness of our approach.

    Sparse Learning over Infinite Subgraph Features

    We present a supervised-learning algorithm for graph data (a set of graphs) that handles arbitrary twice-differentiable loss functions and sparse linear models over all possible subgraph features. To date, it has been shown that under all possible subgraph features, several types of sparse learning, such as Adaboost, LPBoost, LARS/LASSO, and sparse PLS regression, can be performed. Particular emphasis is placed on simultaneous learning of relevant features from an infinite set of candidates. We first generalize techniques used in all these preceding studies to derive a unifying bounding technique for arbitrary separable functions. We then carefully use this bounding to make block coordinate gradient descent feasible over infinite subgraph features, resulting in a fast-converging algorithm that can solve a wider class of sparse learning problems over graph data. We also empirically study the differences from the existing approaches in convergence property, selected subgraph features, and search-space sizes. We further discuss several unnoticed issues in sparse learning over all possible subgraph features. Comment: 42 pages, 24 figures, 4 tables

    A Scalable Graph-Coarsening Based Index for Dynamic Graph Databases

    The graph is a commonly used data structure for modeling complex data such as chemical molecules, images, social networks, and XML documents. Such complex data is stored as a set of graphs, known as a graph database D. To speed up query answering on graph databases, indexes are commonly used. State-of-the-art graph database indexes do not adapt or scale well to dynamic graph database use; they are static, and their ability to prune possible search responses to meet user needs worsens over time as databases change and grow. Users can re-mine indexes to gain some improvement, but this is time-consuming. Users must also tune numerous parameters on an ongoing basis to optimize performance and can inadvertently worsen the query response time if they do not choose parameters wisely. Recently, a one-pass algorithm has been developed to enhance the performance of these indexes, in part by using the algorithm to update them regularly. However, there are some drawbacks, most notably the need to make updates as the query workload changes. We propose a new index based on graph coarsening to speed up query answering time in dynamic graph databases. Our index is parameter-free, query-independent, scalable, small enough to store in main memory, and simpler and less costly to maintain under database updates. We conducted an extensive set of experiments on two types of databases, i.e., chemical and social network databases, to compare our graph-coarsening based index with hybrid indexes as follows. First, we considered no database updates or query workload changes (static graph databases) and compared the indexes according to query answering time and index size for different minSup values. Second, we compared the indexes in the case of dynamic graph databases, i.e., when graphs are added to or removed from the database. Third, we compared the indexes with regard to query workload changes. Fourth, we studied the scalability of our index vs. hybrid indexes.
    Experimental results show that our index outperforms hybrid indexes (i.e., indexes updated with one-pass) for query answering time in the case of social network databases, and is comparable with these indexes for frequent and infrequent queries on chemical databases. Our graph-coarsening index can be updated up to 60 times faster than one-pass on dynamic graph databases. Moreover, our index is independent of the query workload for index update and is up to 15 times better after hybrid indexes are attuned to the query workload for social network databases. This work was also published at the 26th ACM International Conference on Information and Knowledge Management (CIKM), held in Singapore [18].

    Mining and analysis of real-world graphs

    Networked systems are everywhere: the Internet, social networks, biological networks, transportation networks, power grid networks, etc. They can be very large and enormously complex. They can contain a lot of information, either open and transparent or hidden and coded. Such real-world systems can be modeled using graphs and be mined and analyzed through the lens of network analysis. Network analysis can be applied to recognize frequent patterns among the connected components in a large graph, such as a social network, where visual analysis is almost impossible. Frequent patterns illuminate statistically important subgraphs that are usually small enough to analyze visually. Graph mining has various practical applications in fraud detection, outlier detection, chemical molecules, etc., based on the necessity of extracting and understanding the information yielded. Network analysis can also be used to quantitatively evaluate and improve the resilience of infrastructure networks such as the Internet or power grids. Infrastructure networks directly affect the quality of people's lives, and a disastrous incident in these networks may lead to a cascading breakdown of the whole network and serious economic consequences. In essence, network analysis can help us gain actionable insights and make better data-driven decisions based on the networks. On that note, the objective of this dissertation is to improve upon existing tools for more accurate mining and analysis of real-world networks. --Abstract, page iv

    Homophily Outlier Detection in Non-IID Categorical Data

    Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications where the outlierness of different entities is dependent on each other and/or taken from different probability distributions (non-IID). This may lead to the failure to detect important outliers that are too subtle to be identified without considering the non-IID nature. The issue is intensified in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and two instances of it to identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines and incorporates distribution-sensitive outlier factors and their interdependence into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed here to capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection by two different existing detectors. Comment: To appear in Data Mining and Knowledge Discovery Journal

    Efficient Subgraph Similarity Search on Large Probabilistic Graph Databases

    Many studies have sought efficient solutions for subgraph similarity search over certain (deterministic) graphs due to its wide application in many fields, including bioinformatics, social network analysis, and Resource Description Framework (RDF) data management. All these works assume that the underlying data are certain. However, in reality, graphs are often noisy and uncertain due to various factors, such as errors in data extraction, inconsistencies in data integration, and privacy-preserving purposes. Therefore, in this paper, we study subgraph similarity search on large probabilistic graph databases. Unlike previous works, which assume that edges in an uncertain graph are independent of each other, we study uncertain graphs where edges' occurrences are correlated. We formally prove that subgraph similarity search over probabilistic graphs is #P-complete; thus, we employ a filter-and-verify framework to speed up the search. In the filtering phase, we develop tight lower and upper bounds of subgraph similarity probability based on a probabilistic matrix index, PMI. PMI is composed of discriminative subgraph features associated with tight lower and upper bounds of subgraph isomorphism probability. Based on PMI, we can filter out a large number of probabilistic graphs and maximize the pruning capability. During the verification phase, we develop an efficient sampling algorithm to validate the remaining candidates. The efficiency of our proposed solutions has been verified through extensive experiments. Comment: VLDB201
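
    The filter-and-verify idea described in this abstract can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the candidate names, the bound values, the `verify` callback, and the rule of accepting a candidate outright when its lower bound already meets the threshold are all hypothetical, and the PMI index and the sampling verifier are replaced by stand-in data.

```python
def filter_and_verify(candidates, threshold, verify):
    """Return the names of candidates whose probability meets the threshold.

    candidates: dict mapping a name to (lower_bound, upper_bound).
    verify: callable returning a probability estimate for one candidate;
            called only when the bounds alone cannot decide.
    """
    accepted, undecided = [], []
    for name, (lower, upper) in candidates.items():
        if upper < threshold:
            continue  # pruned: even the upper bound is too small
        if lower >= threshold:
            accepted.append(name)  # decided by the lower bound alone
        else:
            undecided.append(name)
    # Costly verification runs only on what the filter could not decide.
    accepted.extend(n for n in undecided if verify(n) >= threshold)
    return sorted(accepted)


# Hypothetical bounds, standing in for values an index would supply.
bounds = {
    "g1": (0.1, 0.3),
    "g2": (0.7, 0.9),
    "g3": (0.4, 0.8),
    "g4": (0.2, 0.6),
}
# Hypothetical estimates, standing in for a sampling-based verifier.
estimates = {"g3": 0.65, "g4": 0.35}
verified_calls = []


def stub_verify(name):
    verified_calls.append(name)
    return estimates[name]


result = filter_and_verify(bounds, 0.5, stub_verify)
```

    With these stand-in numbers, only two of the four candidates reach the verifier, which is the point of tightening the bounds in the filtering phase.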

    Geometric, Feature-based and Graph-based Approaches for the Structural Analysis of Protein Binding Sites : Novel Methods and Computational Analysis

    In this thesis, protein binding sites are considered. To enable the extraction of information from the space of protein binding sites, these binding sites must be mapped onto a mathematical space. This can be done by mapping binding sites onto vectors, graphs or point clouds. To impose a structure on the mathematical space, a distance measure is required, which is introduced in this thesis. This distance measure can then be used to extract information by means of data mining techniques.