935 research outputs found
Complex graph stream mining
University of Technology Sydney. Faculty of Engineering and Information Technology.Recent years have witnessed a dramatic increase of information due to the ever development of modern technologies. The large scale of information makes data analysis, particularly data mining and knowledge discovery tasks, unprecedentedly challenging. First, data is becoming more and more interconnected. In a variety of domains such as social networks, chemical compounds, and XML documents, data is no longer represented by a flat table with instance-feature format, but exhibits complex structures indicating dependency relationships. Second, data is evolving more and more dynamically. Emerging applications such as social networks continuously generate information over time. Third, the learning tasks in many real-life applications become more and more complicated in that there are various constraints on the number of labelled data, class distributions, misclassification costs, or the number of learning tasks etc.
Considering the above challenges, this research aims to investigate theoretical foundations, study new algorithm designs and system frameworks to enable the mining of complex graph streams from three aspects, including (1) Correlated Graph Stream Mining, (2) Graph Stream Classifications, and (3) Complex Task Graph Classification.
In particular, correlated graph stream mining intends to carry out structured pattern search and support the query of similar graphs from a graph stream. Due to the dynamic changing nature of the streaming data and the inherent complexity of the graph query process, treating graph streams as static datasets is computationally infeasible or ineffective. Therefore, we proposed a novel algorithm, CGStream, to identify correlated graphs from a data stream, by using a sliding window, which covers a number of consecutive batches of stream data records. Experimental results demonstrate that the proposed algorithm is several times, or even an order of magnitude, more efficient than the straightforward algorithms.
Graph stream classification aims to build effective and efficient classification models for graph streams with continuous growing volumes and dynamic changes. We proposed two methods for complex graph stream classification. Due to the inherent complexity of graph structure, labelling graph data is very expensive. To solve this problem, we proposed a gLSU algorithm, which aims to select discriminative subgraph features with minimum redundancy by using both labelled and unlabelled graphs for graph streams. The second approach handles graph streams with imbalanced class distributions and noise. Both frameworks use an instance weighting scheme to capture the underlying concept drifts of graph streams and achieve significant performance gain on benchmark graph streams.
Complex task graph classification aims to address the graph classification problems with complex constraints. We studied two complex task graph classification problems, cost-sensitive graph classification of large-scale graphs and multi-task graph classification. As in medical diagnosis the misclassification cost/risk for different classes is inherently different and large scale graph classification is highly demanded in real-life applications, we proposed a CogBoost algorithm for cost-sensitive classification of large scale graphs. To overcome the limitation of insufficient labelled graphs for a specific learning task, we further proposed effective algorithms to leverage multiple graph learning tasks to select subgraph features and regularize multiple tasks to achieve better generalization performance for all learning tasks
FLEET: Butterfly Estimation from a Bipartite Graph Stream
We consider space-efficient single-pass estimation of the number of
butterflies, a fundamental bipartite graph motif, from a massive bipartite
graph stream where each edge represents a connection between entities in two
different partitions. We present a space lower bound for any streaming
algorithm that can estimate the number of butterflies accurately, as well as
FLEET, a suite of algorithms for accurately estimating the number of
butterflies in the graph stream. Estimates returned by the algorithms come with
provable guarantees on the approximation error, and experiments show good
tradeoffs between the space used and the accuracy of approximation. We also
present space-efficient algorithms for estimating the number of butterflies
within a sliding window of the most recent elements in the stream. While there
is a significant body of work on counting subgraphs such as triangles in a
unipartite graph stream, our work seems to be one of the few to tackle the case
of bipartite graph streams.Comment: This is the author's version of the work. It is posted here by
permission of ACM for your personal use. Not for redistribution. The
definitive version was published in Seyed-Vahid Sanei-Mehri, Yu Zhang, Ahmet
Erdem Sariyuce and Srikanta Tirthapura. "FLEET: Butterfly Estimation from a
Bipartite Graph Stream". The 28th ACM International Conference on Information
and Knowledge Managemen
Subgraph Pattern Matching over Uncertain Graphs with Identity Linkage Uncertainty
There is a growing need for methods which can capture uncertainties and
answer queries over graph-structured data. Two common types of uncertainty are
uncertainty over the attribute values of nodes and uncertainty over the
existence of edges. In this paper, we combine those with identity uncertainty.
Identity uncertainty represents uncertainty over the mapping from objects
mentioned in the data, or references, to the underlying real-world entities. We
propose the notion of a probabilistic entity graph (PEG), a probabilistic graph
model that defines a distribution over possible graphs at the entity level. The
model takes into account node attribute uncertainty, edge existence
uncertainty, and identity uncertainty, and thus enables us to systematically
reason about all three types of uncertainties in a uniform manner. We introduce
a general framework for constructing a PEG given uncertain data at the
reference level and develop highly efficient algorithms to answer subgraph
pattern matching queries in this setting. Our algorithms are based on two novel
ideas: context-aware path indexing and reduction by join-candidates, which
drastically reduce the query search space. A comprehensive experimental
evaluation shows that our approach outperforms baseline implementations by
orders of magnitude
Loom: Query-aware Partitioning of Online Graphs
As with general graph processing systems, partitioning data over a cluster of
machines improves the scalability of graph database management systems.
However, these systems will incur additional network cost during the execution
of a query workload, due to inter-partition traversals. Workload-agnostic
partitioning algorithms typically minimise the likelihood of any edge crossing
partition boundaries. However, these partitioners are sub-optimal with respect
to many workloads, especially queries, which may require more frequent
traversal of specific subsets of inter-partition edges. Furthermore, they
largely unsuited to operating incrementally on dynamic, growing graphs.
We present a new graph partitioning algorithm, Loom, that operates on a
stream of graph updates and continuously allocates the new vertices and edges
to partitions, taking into account a query workload of graph pattern
expressions along with their relative frequencies.
First we capture the most common patterns of edge traversals which occur when
executing queries. We then compare sub-graphs, which present themselves
incrementally in the graph update stream, against these common patterns.
Finally we attempt to allocate each match to single partitions, reducing the
number of inter-partition edges within frequently traversed sub-graphs and
improving average query performance.
Loom is extensively evaluated over several large test graphs with realistic
query workloads and various orderings of the graph updates. We demonstrate
that, given a workload, our prototype produces partitionings of significantly
better quality than existing streaming graph partitioning algorithms Fennel and
LDG
A Formal Framework for Linguistic Annotation
`Linguistic annotation' covers any descriptive or analytic notations applied
to raw language data. The basic data may be in the form of time functions --
audio, video and/or physiological recordings -- or it may be textual. The added
notations may include transcriptions of all sorts (from phonetic features to
discourse structures), part-of-speech and sense tagging, syntactic analysis,
`named entity' identification, co-reference annotation, and so on. While there
are several ongoing efforts to provide formats and tools for such annotations
and to publish annotated linguistic databases, the lack of widely accepted
standards is becoming a critical problem. Proposed standards, to the extent
they exist, have focussed on file formats. This paper focuses instead on the
logical structure of linguistic annotations. We survey a wide variety of
existing annotation formats and demonstrate a common conceptual core, the
annotation graph. This provides a formal framework for constructing,
maintaining and searching linguistic annotations, while remaining consistent
with many alternative data structures and file formats.Comment: 49 page
Graph ensemble boosting for imbalanced noisy graph stream classification
© 2014 IEEE. Many applications involve stream data with structural dependency, graph representations, and continuously increasing volumes. For these applications, it is very common that their class distributions are imbalanced with minority (or positive) samples being only a small portion of the population, which imposes significant challenges for learning models to accurately identify minority samples. This problem is further complicated with the presence of noise, because they are similar to minority samples and any treatment for the class imbalance may falsely focus on the noise and result in deterioration of accuracy. In this paper, we propose a classification model to tackle imbalanced graph streams with noise. Our method, graph ensemble boosting, employs an ensemble-based framework to partition graph stream into chunks each containing a number of noisy graphs with imbalanced class distributions. For each individual chunk, we propose a boosting algorithm to combine discriminative subgraph pattern selection and model learning as a unified framework for graph classification. To tackle concept drifting in graph streams, an instance level weighting mechanism is used to dynamically adjust the instance weight, through which the boosting framework can emphasize on difficult graph samples. The classifiers built from different graph chunks form an ensemble for graph stream classification. Experiments on real-life imbalanced graph streams demonstrate clear benefits of our boosting design for handling imbalanced noisy graph stream
BibRank: a language-based model for co-ranking entities in bibliographic networks
International audienceBibliographic documents are basically associated with many entities including authors, venues, affiliations, etc. While bibliographic search engines addressed mainly relevant document ranking according to a query topic, ranking other related relevant bibliographic entities is still challenging. Indeed, document relevance is the primary level that allows inferring the relevance of the other entities regardless of the query topic. In this paper, we propose a novel integrated ranking model, called BibRank, that aims at ranking both document and author entities in bibliographic networks. The underlying algorithm propagates entity scores through the network by means of citation and authorship links. Moreover, we propose to weight these relationships using content-based indicators that estimate the topical relatedness between entities. In particular, we estimate the common similarity between homogeneous entities by analyzing marginal citations. We also compare document and author language models in order to evaluate the level of author's knowledge on the document topic and the document representativeness of author's knowledge. Experiment results on the representative CiteSeerX dataset show that BibRank model outperforms state-of-the-art ranking models with a significant improvement
Private Graph Data Release: A Survey
The application of graph analytics to various domains have yielded tremendous
societal and economical benefits in recent years. However, the increasingly
widespread adoption of graph analytics comes with a commensurate increase in
the need to protect private information in graph databases, especially in light
of the many privacy breaches in real-world graph data that was supposed to
preserve sensitive information. This paper provides a comprehensive survey of
private graph data release algorithms that seek to achieve the fine balance
between privacy and utility, with a specific focus on provably private
mechanisms. Many of these mechanisms fall under natural extensions of the
Differential Privacy framework to graph data, but we also investigate more
general privacy formulations like Pufferfish Privacy that can deal with the
limitations of Differential Privacy. A wide-ranging survey of the applications
of private graph data release mechanisms to social networks, finance, supply
chain, health and energy is also provided. This survey paper and the taxonomy
it provides should benefit practitioners and researchers alike in the
increasingly important area of private graph data release and analysis
- …