11 research outputs found
Graph Hybrid Summarization
One solution to process and analysis of massive graphs is summarization. Generating a high quality summary is the main challenge of graph summarization. In the aims of generating a summary with a better quality for a given attributed graph, both structural and attribute similarities must be considered. There are two measures named density and entropy to evaluate the quality of structural and attribute-based summaries respectively. For an attributed graph, a high quality summary is one that covers both graph structure and its attributes with the user-specified degrees of importance. Recently two methods has been proposed for summarizing a graph based on both graph structure and attribute similarities. In this paper, a new method for hybrid summarization of a given attributed graph has proposed and the quality of the summary generated by this method has compared with the recently proposed method for this purpose. Experimental results showed that our proposed method generates a summary with a better quality
Boosting for multi-graph classification
© 2014 IEEE. In this paper, we formulate a novel graph-based learning problem, multi-graph classification (MGC), which aims to learn a classifier from a set of labeled bags each containing a number of graphs inside the bag. A bag is labeled positive, if at least one graph in the bag is positive, and negative otherwise. Such a multi-graph representation can be used for many real-world applications, such as webpage classification, where a webpage can be regarded as a bag with texts and images inside the webpage being represented as graphs. This problem is a generalization of multi-instance learning (MIL) but with vital differences, mainly because instances in MIL share a common feature space whereas no feature is available to represent graphs in a multi-graph bag. To solve the problem, we propose a boosting based multi-graph classification framework (bMGC). Given a set of labeled multi-graph bags, bMGC employs dynamic weight adjustment at both bag- and graph-levels to select one subgraph in each iteration as a weak classifier. In each iteration, bag and graph weights are adjusted such that an incorrectly classified bag will receive a higher weight because its predicted bag label conflicts to the genuine label, whereas an incorrectly classified graph will receive a lower weight value if the graph is in a positive bag (or a higher weight if the graph is in a negative bag). Accordingly, bMGC is able to differentiate graphs in positive and negative bags to derive effective classifiers to form a boosting model for MGC. Experiments and comparisons on real-world multi-graph learning tasks demonstrate the algorithm performance
Kaskade: Graph Views for Efficient Graph Analytics
Graphs are an increasingly popular way to model real-world entities and
relationships between them, ranging from social networks to data lineage graphs
and biological datasets. Queries over these large graphs often involve
expensive subgraph traversals and complex analytical computations. These
real-world graphs are often substantially more structured than a generic
vertex-and-edge model would suggest, but this insight has remained mostly
unexplored by existing graph engines for graph query optimization purposes.
Therefore, in this work, we focus on leveraging structural properties of graphs
and queries to automatically derive materialized graph views that can
dramatically speed up query evaluation. We present KASKADE, the first graph
query optimization framework to exploit materialized graph views for query
optimization purposes. KASKADE employs a novel constraint-based view
enumeration technique that mines constraints from query workloads and graph
schemas, and injects them during view enumeration to significantly reduce the
search space of views to be considered. Moreover, it introduces a graph view
size estimator to pick the most beneficial views to materialize given a query
set and to select the best query evaluation plan given a set of materialized
views. We evaluate its performance over real-world graphs, including the
provenance graph that we maintain at Microsoft to enable auditing, service
analytics, and advanced system optimizations. Our results show that KASKADE
substantially reduces the effective graph size and yields significant
performance speedups (up to 50X), in some cases making otherwise intractable
queries possible
Doctor of Philosophy
dissertationServing as a record of what happened during a scientific process, often computational, provenance has become an important piece of computing. The importance of archiving not only data and results but also the lineage of these entities has led to a variety of systems that capture provenance as well as models and schemas for this information. Despite significant work focused on obtaining and modeling provenance, there has been little work on managing and using this information. Using the provenance from past work, it is possible to mine common computational structure or determine differences between executions. Such information can be used to suggest possible completions for partial workflows, summarize a set of approaches, or extend past work in new directions. These applications require infrastructure to support efficient queries and accessible reuse. In order to support knowledge discovery and reuse from provenance information, the management of those data is important. One component of provenance is the specification of the computations; workflows provide structured abstractions of code and are commonly used for complex tasks. Using change-based provenance, it is possible to store large numbers of similar workflows compactly. This storage also allows efficient computation of differences between specifications. However, querying for specific structure across a large collection of workflows is difficult because comparing graphs depends on computing subgraph isomorphism which is NP-Complete. Graph indexing methods identify features that help distinguish graphs of a collection to filter results for a subgraph containment query and reduce the number of subgraph isomorphism computations. For provenance, this work extends these methods to work for more exploratory queries and collections with significant overlap. However, comparing workflow or provenance graphs may not require exact equality; a match between two graphs may allow paired nodes to be similar yet not equivalent. This work presents techniques to better correlate graphs to help summarize collections. Using this infrastructure, provenance can be reused so that users can learn from their own and others' history. Just as textual search has been augmented with suggested completions based on past or common queries, provenance can be used to suggest how computations can be completed or which steps might connect to a given subworkflow. In addition, provenance can help further science by accelerating publication and reuse. By incorporating provenance into publications, authors can more easily integrate their results, and readers can more easily verify and repeat results. However, reusing past computations requires maintaining stronger associations with any input data and underlying code as well as providing paths for migrating old work to new hardware or algorithms. This work presents a framework for maintaining data and code as well as supporting upgrades for workflow computations