Mining subjectively interesting attributed subgraphs
Community detection in graphs, data clustering, and local pattern mining
are three mature fields of data mining and machine learning.
In recent years, attributed subgraph mining has emerged as a new,
powerful data mining task at the intersection of these areas.
Given a graph and a set of attributes for each vertex,
attributed subgraph mining aims to find cohesive subgraphs
for which (a subset of) the attributes takes exceptional values in some sense.
While research on this task can borrow from the three abovementioned fields,
the principled integration of graph and attribute data poses two challenges:
the definition of a pattern language that is intuitive and lends itself to efficient search strategies,
and the formalization of the interestingness of such patterns.
We propose an integrated solution to both of these challenges.
The proposed pattern language improves upon prior work in being both highly flexible and intuitive.
We show how an effective and principled algorithm can enumerate patterns of this language.
The proposed approach for quantifying interestingness of patterns of this language
is rooted in information theory, and is able to account for prior knowledge on the data.
Prior work typically quantifies interestingness based on the cohesion of the subgraph
and on the exceptionality of its attributes separately,
combining these in a parameterized trade-off.
Instead, in our proposal this trade-off is implicitly handled in a principled, parameter-free manner.
Extensive empirical results confirm the proposed pattern syntax is intuitive,
and the interestingness measure aligns well with actual subjective interestingness.
HYPA: Efficient Detection of Path Anomalies in Time Series Data on Networks
The unsupervised detection of anomalies in time series data has important
applications in user behavioral modeling, fraud detection, and cybersecurity.
Anomaly detection has, in fact, been extensively studied in categorical
sequences. However, we often have access to time series data that represent
paths through networks. Examples include transaction sequences in financial
networks, click streams of users in networks of cross-referenced documents, or
travel itineraries in transportation networks. To reliably detect anomalies, we
must account for the fact that such data contain a large number of independent
observations of paths constrained by a graph topology. Moreover, the
heterogeneity of real systems rules out frequency-based anomaly detection
techniques, which do not account for highly skewed edge and degree statistics.
To address this problem, we introduce HYPA, a novel framework for the
unsupervised detection of anomalies in large corpora of variable-length
temporal paths in a graph. HYPA provides an efficient analytical method to
detect paths with anomalous frequencies that result from nodes being traversed
in unexpected chronological order. Comment: 11 pages with 8 figures and supplementary material. To appear at SIAM
Data Mining (SDM 2020)
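The idea of flagging paths whose frequency deviates from what the graph's edge statistics would predict can be illustrated with a much simpler stand-in. The sketch below is not HYPA's null model: it assumes a first-order Markov baseline and a binomial upper-tail test, and the names `binom_upper_tail` and `path_overrepresentation` are hypothetical. For every observed length-2 path (u, v, w) it asks how surprising the observed count is, given how often v is followed by w overall.

```python
from collections import Counter
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def path_overrepresentation(paths):
    """Score each observed length-2 path (u, v, w) against a first-order baseline.

    Assumed baseline (illustrative only): after traversing u -> v, the next
    node is w with probability count(v -> w) / count(v -> anything).
    A small score means the path occurs more often than this baseline predicts.
    """
    edge = Counter()    # count of each transition u -> v
    out = Counter()     # count of transitions leaving each node
    triple = Counter()  # count of each length-2 path u -> v -> w
    prefix = Counter()  # count of length-2 paths that start with u -> v
    for path in paths:
        for u, v in zip(path, path[1:]):
            edge[(u, v)] += 1
            out[u] += 1
        for u, v, w in zip(path, path[1:], path[2:]):
            triple[(u, v, w)] += 1
            prefix[(u, v)] += 1
    scores = {}
    for (u, v, w), observed in triple.items():
        p = edge[(v, w)] / out[v]
        scores[(u, v, w)] = binom_upper_tail(observed, prefix[(u, v)], p)
    return scores


# Toy corpus: after b, the next node depends entirely on where the walker came from.
paths = ([("a", "b", "c")] * 10 + [("d", "b", "e")] * 10
         + [("x", "y", "z")] * 5 + [("x", "y", "w")] * 5)
scores = path_overrepresentation(paths)
```

In this toy corpus the path a, b, c receives a small score because b is followed by c only half of the time overall, yet always after a; the paths through y receive unremarkable scores because the predecessor carries no extra information there.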
Graph Kernels
We present a unified framework to study graph kernels, special cases of which include the random
walk (Gärtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004;
Mahé et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time
complexity of kernel computation between unlabeled graphs with n vertices from O(n^6) to O(n^3).
We find a spectral decomposition approach even more efficient when computing entire kernel matrices.
For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn^3)
time per iteration, where d is the size of the label set. By extending the necessary linear algebra to
Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels,
and O(n^4) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n^2)
time per iteration in all cases. Experiments on graphs from bioinformatics and other application
domains show that these techniques can speed up computation of the kernel by an order of magnitude
or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when
specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to
R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment
kernel of Fröhlich et al. (2006) yet provably positive semi-definite.
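The fixed-point approach mentioned in the abstract can be illustrated for the simplest case: unlabeled graphs with uniform start and stop weights. The sketch below is not the authors' implementation; the function names, the decay parameter `lam`, and the uniform weights are assumptions made for illustration. It iterates X <- 1 + lam * A X B^T, which never forms the n^2 x n^2 adjacency matrix of the product graph, and it compares the result with an explicit weighted count of common walks. The iteration converges only when `lam` is smaller than the reciprocal of the product of the two spectral radii.

```python
def matmul(X, Y):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def random_walk_kernel(A, B, lam=0.1, tol=1e-12, max_iter=1000):
    """Geometric random walk kernel via the iteration X <- 1 + lam * A X B^T.

    A and B are adjacency matrices. Each iteration costs two dense matrix
    products instead of one product with the Kronecker matrix A (x) B.
    """
    n, m = len(A), len(B)
    Bt = [list(col) for col in zip(*B)]
    X = [[1.0] * m for _ in range(n)]
    for _ in range(max_iter):
        AXBt = matmul(matmul(A, X), Bt)
        X_new = [[1.0 + lam * AXBt[i][j] for j in range(m)] for i in range(n)]
        diff = max(abs(X_new[i][j] - X[i][j]) for i in range(n) for j in range(m))
        X = X_new
        if diff < tol:
            break
    return sum(map(sum, X))


def walk_count_series(A, B, lam=0.1, terms=200):
    """Reference value: sum over k of lam^k * (#walks of length k in A) * (same in B)."""
    n, m = len(A), len(B)
    Ak = [[int(i == j) for j in range(n)] for i in range(n)]
    Bk = [[int(i == j) for j in range(m)] for i in range(m)]
    total = 0.0
    for k in range(terms):
        total += lam**k * sum(map(sum, Ak)) * sum(map(sum, Bk))
        Ak, Bk = matmul(Ak, A), matmul(Bk, B)
    return total


triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
k_fixed_point = random_walk_kernel(triangle, path3)
k_reference = walk_count_series(triangle, path3)
```

The two values agree up to the iteration tolerance because the Kronecker product satisfies (A (x) B)^k = A^k (x) B^k, so walks in the product graph factor into pairs of walks in the two input graphs.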
LLM4DyG: Can Large Language Models Solve Problems on Dynamic Graphs?
In an era marked by the increasing adoption of Large Language Models (LLMs)
for various tasks, there is a growing focus on exploring LLMs' capabilities in
handling web data, particularly graph data. Dynamic graphs, which capture
temporal network evolution patterns, are ubiquitous in real-world web data.
Evaluating LLMs' competence in understanding spatial-temporal information on
dynamic graphs is essential for their adoption in web applications, yet it
remains unexplored in the literature. In this paper, we bridge the gap by
evaluating, to the best of our knowledge for the first time, LLMs'
spatial-temporal understanding abilities on dynamic graphs. Specifically, we
propose the LLM4DyG benchmark, which includes nine specially designed tasks
considering the capability evaluation of LLMs from both temporal and spatial
dimensions. Then, we conduct extensive experiments to analyze the impacts of
different data generators, data statistics, prompting techniques, and LLMs on
the model performance. Finally, we propose Disentangled Spatial-Temporal
Thoughts (DST2) for LLMs on dynamic graphs to enhance LLMs' spatial-temporal
understanding abilities. Our main observations are: 1) LLMs have preliminary
spatial-temporal understanding abilities on dynamic graphs, 2) dynamic graph
tasks become increasingly difficult for LLMs as the graph size and density
increase, while being insensitive to the time span and data generation mechanism,
3) the proposed DST2 prompting method can help to improve LLMs'
spatial-temporal understanding abilities on dynamic graphs for most tasks. The
data and code will be open-sourced at publication time.
Subgroup discovery for structured target concepts
The main object of study in this thesis is subgroup discovery, a theoretical framework for finding subgroups in data, i.e., named sub-populations, whose behaviour with respect to a specified target concept is exceptional when compared to the rest of the dataset. This is a powerful tool that conveys crucial information to a human audience, but despite past advances it has been limited to simple target concepts. In this work we propose algorithms that bring this framework to novel application domains. We introduce the concept of representative subgroups, which we use not only to ensure the fairness of a sub-population with regard to a sensitive trait, such as race or gender, but also to go beyond known trends in the data. For entities with additional relational information that can be encoded as a graph, we introduce a novel measure of robust connectedness which improves on established alternative measures of density; we then provide a method that uses this measure to discover which named sub-populations are better connected. Our contributions within subgroup discovery culminate in the introduction of kernelised subgroup discovery: a novel framework that enables the discovery of subgroups on i.i.d. target concepts with virtually any kind of structure. Importantly, our framework additionally provides a concrete and efficient tool that works out-of-the-box without any modification, apart from specifying the Gramian of a positive definite kernel. For use within kernelised subgroup discovery, but also in any other kind of kernel method, we additionally introduce a novel random walk graph kernel. Our kernel allows fine-tuning of the alignment between the vertices of the two compared graphs during the count of the random walks, and we also propose meaningful structure-aware vertex labels to utilise this new capability.
With these contributions we thoroughly extend the applicability of subgroup discovery and ultimately re-define it as a kernel method.
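The basic notion of a subgroup that is exceptional with respect to a target can be made concrete with a classical quality measure. The sketch below uses weighted relative accuracy (WRAcc) on a binary target; this is a standard measure from the subgroup discovery literature and is assumed here purely for illustration, not taken from the thesis. The function name `wracc`, the toy records, and the attribute names are hypothetical.

```python
def wracc(rows, in_subgroup, is_positive):
    """Weighted relative accuracy of a subgroup on a binary target.

    WRAcc = (subgroup size / dataset size) * (positive rate in subgroup
            - positive rate in the whole dataset).
    Positive values mean the subgroup is both sizeable and enriched in positives.
    """
    n = len(rows)
    members = [r for r in rows if in_subgroup(r)]
    if n == 0 or not members:
        return 0.0
    rate_all = sum(1 for r in rows if is_positive(r)) / n
    rate_sub = sum(1 for r in members if is_positive(r)) / len(members)
    return len(members) / n * (rate_sub - rate_all)


# Toy dataset: the named sub-population "young" buys more often than average.
rows = ([{"age": "young", "buys": True}] * 3
        + [{"age": "young", "buys": False}]
        + [{"age": "old", "buys": False}] * 4)

score_young = wracc(rows, lambda r: r["age"] == "young", lambda r: r["buys"])
score_old = wracc(rows, lambda r: r["age"] == "old", lambda r: r["buys"])
```

A search procedure would enumerate candidate subgroup descriptions and rank them by such a score; the thesis replaces the simple binary target with structured targets accessed through a kernel.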
Homophily Outlier Detection in Non-IID Categorical Data
Most existing outlier detection methods assume that the outlier factors
(i.e., outlierness scoring measures) of data entities (e.g., feature values and
data objects) are Independent and Identically Distributed (IID). This
assumption does not hold in real-world applications where the outlierness of
different entities is dependent on each other and/or taken from different
probability distributions (non-IID). This may lead to the failure of detecting
important outliers that are too subtle to be identified without considering the
non-IID nature. The issue is even intensified in more challenging contexts,
e.g., high-dimensional data with many noisy features. This work introduces a
novel outlier detection framework and its two instances to identify outliers in
categorical data by capturing non-IID outlier factors. Our approach first
defines and incorporates distribution-sensitive outlier factors and their
interdependence into a value-value graph-based representation. It then models
an outlierness propagation process in the value graph to learn the outlierness
of feature values. The learned value outlierness allows for either direct
outlier detection or outlying feature selection. The graph representation and
mining approach is employed here to capture the rich non-IID
characteristics. Our empirical results on 15 real-world data sets with
different levels of data complexities show that (i) the proposed outlier
detection methods significantly outperform five state-of-the-art methods at the
95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most
complex data sets; and (ii) the proposed feature selection methods
significantly outperform three competing methods in enabling subsequent outlier
detection by two different existing detectors. Comment: To appear in Data Mining and Knowledge Discovery Journal
Context Selection on Attributed Graphs for Outlier and Community Detection
Today's applications store large amounts of complex data that combine information of different types. Attributed graphs are an example of such a complex database, where each object is characterized by its relationships to other objects and its individual properties. Specifically, each node in an attributed graph may be characterized by a large number of attributes. In this thesis, we present different approaches for mining such high-dimensional attributed graphs.