370 research outputs found
Inductive queries for a drug designing robot scientist
It is increasingly clear that machine learning algorithms need to be integrated in an iterative scientific discovery loop, in which data is queried repeatedly by means of inductive queries and where the computer provides guidance to the experiments that are being performed. In this chapter, we summarise several key challenges in achieving this integration of machine learning and data mining algorithms in methods for the discovery of Quantitative Structure Activity Relationships (QSARs). We introduce the concept of a robot scientist, in which all steps of the discovery process are automated; we discuss the representation of molecular data such that knowledge discovery tools can analyse it, and we discuss the adaptation of machine learning and data mining algorithms to guide QSAR experiments.
Sparse Learning over Infinite Subgraph Features
We present a supervised-learning algorithm from graph data (a set of graphs)
for arbitrary twice-differentiable loss functions and sparse linear models over
all possible subgraph features. To date, it has been shown that under all
possible subgraph features, several types of sparse learning, such as Adaboost,
LPBoost, LARS/LASSO, and sparse PLS regression, can be performed. Particular
emphasis is placed on simultaneous learning of relevant features from an
infinite set of candidates. We first generalize techniques used in all these
preceding studies to derive a unifying bounding technique for arbitrary
separable functions. We then carefully use this bounding to make block
coordinate gradient descent feasible over infinite subgraph features, resulting
in a fast converging algorithm that can solve a wider class of sparse learning
problems over graph data. We also empirically study the differences from the
existing approaches in convergence property, selected subgraph features, and
search-space sizes. We further discuss several unnoticed issues in sparse
learning over all possible subgraph features.
Comment: 42 pages, 24 figures, 4 tables
Frequent Subgraph Mining via Sampling with Rigorous Guarantees
Frequent subgraph mining is a fundamental task in the analysis of collections of graphs that aims at finding all the subgraphs that appear with more than a user-specified frequency in the dataset. While several exact approaches have been proposed to solve the task, it remains computationally challenging on large graph datasets due to the complexity of the subgraph isomorphism problem inherent in the task and the huge number of candidate patterns even for fairly small subgraphs.
In this thesis, we study two statistical learning measures of complexity, VC-dimension and Rademacher averages, for subgraphs, and derive efficiently computable bounds for both. We then show how such bounds can be applied to devise efficient sampling-based approaches for rigorously approximating the solutions of the frequent subgraph mining problem, providing sample sizes which are much tighter than what would be obtained by a straightforward application of Chernoff and union bounds. We also show that our bounds can be used for true frequent subgraph mining, which requires identifying subgraphs generated with probability above a given threshold using samples from an unknown generative process.
Moreover, we carried out an extensive experimental evaluation of our methods on real datasets, which shows that our bounds lead to efficiently computable and high-quality approximations for both applications.
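As a point of reference, the "straightforward" baseline mentioned in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the method of the thesis: it assumes a finite candidate set of `num_patterns` patterns (a hypothetical parameter) and uses Hoeffding's inequality, an additive Chernoff-type bound, together with a union bound. Under those assumptions, a sample of m >= ln(2 * num_patterns / delta) / (2 * epsilon^2) graphs is enough for every sampled frequency to lie within epsilon of its true frequency with probability at least 1 - delta.

```python
import math


def union_bound_sample_size(num_patterns: int, epsilon: float, delta: float) -> int:
    """Sample size from Hoeffding's inequality plus a union bound.

    Assumption (not from the thesis): the candidate patterns form a finite
    set of size num_patterns, and each pattern's frequency is estimated as
    the fraction of sampled graphs that contain it.
    """
    if num_patterns < 1:
        raise ValueError("num_patterns must be at least 1")
    if not 0.0 < epsilon < 1.0 or not 0.0 < delta < 1.0:
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    # Hoeffding: P(|estimate - truth| >= epsilon) <= 2 * exp(-2 * m * epsilon^2).
    # Union bound over all patterns, then solve 2 * N * exp(...) <= delta for m.
    return math.ceil(math.log(2.0 * num_patterns / delta) / (2.0 * epsilon ** 2))
```

The bound grows only logarithmically in the number of candidate patterns, but that number is itself enormous for subgraphs, which is the motivation the abstract gives for tighter, complexity-based bounds.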
Peregrine: A Pattern-Aware Graph Mining System
Graph mining workloads aim to extract structural properties of a graph by
exploring its subgraph structures. General purpose graph mining systems provide
a generic runtime to explore subgraph structures of interest with the help of
user-defined functions that guide the overall exploration process. However, the
state-of-the-art graph mining systems remain largely oblivious to the shape (or
pattern) of the subgraphs that they mine. This causes them to: (a) explore
unnecessary subgraphs; (b) perform expensive computations on the explored
subgraphs; and, (c) hold intermediate partial subgraphs in memory; all of which
affect their overall performance. Furthermore, their programming models are
often tied to their underlying exploration strategies, which makes it difficult
for domain users to express complex mining tasks.
In this paper, we develop Peregrine, a pattern-aware graph mining system that
directly explores the subgraphs of interest while avoiding exploration of
unnecessary subgraphs, and simultaneously bypassing expensive computations
throughout the mining process. We design a pattern-based programming model that
treats "graph patterns" as first class constructs and enables Peregrine to
extract the semantics of patterns, which it uses to guide its exploration. Our
evaluation shows that Peregrine outperforms state-of-the-art distributed and
single machine graph mining systems, and scales to complex mining tasks on
larger graphs, while retaining simplicity and expressivity with its
"pattern-first" programming approach.
Comment: This is the full version of the paper appearing in the European
Conference on Computer Systems (EuroSys), 202
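The general idea of exploring only the subgraphs that match a pattern can be illustrated with a small sketch. This is not Peregrine's programming model or API; it is a hypothetical, single-threaded example for one fixed pattern (the triangle), assuming an undirected graph given as a list of edges over comparable vertex ids. Instead of enumerating every 3-vertex subgraph and testing it, the loop only extends an edge through common neighbours, and the ordering u < v < w ensures each triangle is visited once.

```python
def count_triangles(edges):
    """Count triangles by extending edges through common neighbours.

    Illustrative only: a hand-written matcher for a single pattern,
    with vertex ordering used to avoid counting a triangle more than once.
    """
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    count = 0
    for u, neighbours in adjacency.items():
        for v in neighbours:
            if v <= u:
                continue  # consider each edge once, with u < v
            # Only vertices adjacent to both endpoints can close the triangle.
            for w in neighbours & adjacency[v]:
                if w > v:
                    count += 1
    return count
```

For example, `count_triangles([(0, 1), (1, 2), (0, 2), (2, 3)])` visits the single triangle on vertices 0, 1, 2 and never materialises the non-matching vertex triples that include vertex 3.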
Private Graph Data Release: A Survey
The application of graph analytics to various domains has yielded tremendous
societal and economic benefits in recent years. However, the increasingly
widespread adoption of graph analytics comes with a commensurate increase in
the need to protect private information in graph databases, especially in light
of the many privacy breaches in real-world graph data that was supposed to
preserve sensitive information. This paper provides a comprehensive survey of
private graph data release algorithms that seek to achieve the fine balance
between privacy and utility, with a specific focus on provably private
mechanisms. Many of these mechanisms fall under natural extensions of the
Differential Privacy framework to graph data, but we also investigate more
general privacy formulations like Pufferfish Privacy that can deal with the
limitations of Differential Privacy. A wide-ranging survey of the applications
of private graph data release mechanisms to social networks, finance, supply
chain, health and energy is also provided. This survey paper and the taxonomy
it provides should benefit practitioners and researchers alike in the
increasingly important area of private graph data release and analysis.
A Survey on Graph Kernels
Graph kernels have become an established and widely-used technique for
solving classification tasks on graphs. This survey gives a comprehensive
overview of techniques for kernel-based graph classification developed in the
past 15 years. We describe and categorize graph kernels based on properties
inherent to their design, such as the nature of their extracted graph features,
their method of computation and their applicability to problems in practice. In
an extensive experimental evaluation, we study the classification accuracy of a
large suite of graph kernels on established benchmarks as well as new datasets.
We compare the performance of popular kernels with several baseline methods and
study the effect of applying a Gaussian RBF kernel to the metric induced by a
graph kernel. In doing so, we find that simple baselines become competitive
after this transformation on some datasets. Moreover, we study the extent to
which existing graph kernels agree in their predictions (and prediction errors)
and obtain a data-driven categorization of kernels as a result. Finally, based on
our experimental results, we derive a practitioner's guide to kernel-based
graph classification.
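The transformation mentioned in this abstract, applying a Gaussian RBF kernel to the metric induced by a graph kernel, can be sketched as follows. This is a minimal illustration and not the survey's code. It assumes a precomputed Gram matrix `gram` of some base graph kernel k and a bandwidth parameter `gamma` (both names are hypothetical), and uses the standard kernel-induced squared distance d(G, H)^2 = k(G, G) + k(H, H) - 2 k(G, H).

```python
import math


def rbf_from_gram(gram, gamma):
    """Apply a Gaussian RBF kernel to the metric induced by a base kernel.

    gram is a square list of lists holding base-kernel values k(G_i, G_j);
    the result holds exp(-gamma * d(G_i, G_j)^2).
    """
    n = len(gram)
    transformed = []
    for i in range(n):
        row = []
        for j in range(n):
            squared_distance = gram[i][i] + gram[j][j] - 2.0 * gram[i][j]
            # Clamp tiny negative values caused by floating-point rounding.
            row.append(math.exp(-gamma * max(squared_distance, 0.0)))
        transformed.append(row)
    return transformed
```

The resulting matrix has ones on its diagonal and can be passed to any learner that accepts a precomputed kernel; `gamma` would normally be chosen by cross-validation.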
Labeled Subgraph Entropy Kernel
In recent years, kernel methods have become widespread in similarity
measurement tasks. In particular, graph kernels are widely used in
bioinformatics, chemistry and financial data analysis. However, existing
methods, especially entropy-based graph kernels, suffer from high
computational complexity and neglect node-level information. In this
paper, we propose a novel labeled subgraph entropy graph kernel, which performs
well in structural similarity assessment. We design a dynamic programming
subgraph enumeration algorithm, which effectively reduces the time complexity.
Specifically, we propose labeled subgraphs, which enrich substructure topology
with semantic information. By analogy with the cluster expansion of a gas
in statistical mechanics, we re-derive the partition function and
calculate the global graph entropy to characterize the network. To
test our method, we apply it to several real-world datasets and assess its effects in
different tasks. To capture more experimental details, we quantitatively and
qualitatively analyze the contribution of different topological structures.
Experimental results demonstrate the effectiveness of our method,
which outperforms several state-of-the-art methods.
Comment: 9 pages, 5 figures