Medoid Silhouette clustering with automatic cluster number selection
The evaluation of clustering results is difficult and highly dependent on the
evaluated data set and the perspective of the beholder. Many different
clustering quality measures try to provide a general way to validate
clustering results. A very popular measure is the Silhouette. We
discuss the efficient medoid-based variant of the Silhouette, perform a
theoretical analysis of its properties, provide two fast versions for the
direct optimization, and discuss its use for choosing the optimal number of
clusters. We combine ideas from the original Silhouette with the well-known PAM
algorithm and its latest improvement, FasterPAM. One of the versions guarantees
results equal to the original variant and provides a substantial speedup.
In experiments on real data with 30000 samples and k=100, we observed a
10464× speedup compared to the original PAMMEDSIL algorithm.
Additionally, we provide a variant to choose the optimal number of clusters
directly.
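To make the quantity concrete, here is a minimal NumPy sketch (an illustration only, not the paper's optimized algorithm) of the medoid Silhouette: for each point, s(i) = 1 - d1/d2, where d1 and d2 are the distances to its nearest and second-nearest medoid.

```python
import numpy as np

def medoid_silhouette(dist_to_medoids):
    """Average medoid Silhouette from an (n, k) matrix of point-to-medoid
    distances: s(i) = 1 - d1/d2, with d1 and d2 the distances of point i
    to its nearest and second-nearest medoid."""
    # A partial sort puts the two smallest distances into columns 0 and 1.
    part = np.partition(dist_to_medoids, 1, axis=1)
    d1, d2 = part[:, 0], part[:, 1]
    # Guard against d2 == 0 (duplicate medoids on top of the point).
    with np.errstate(divide="ignore", invalid="ignore"):
        s = np.where(d2 > 0, 1.0 - d1 / d2, 0.0)
    return float(s.mean())
```

Direct optimization then searches for the k medoids maximizing this average; the paper's contribution is doing so without recomputing the measure from scratch for every candidate swap.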
Data Aggregation for Hierarchical Clustering
Hierarchical Agglomerative Clustering (HAC) is likely the earliest and most
flexible clustering method, because it can be used with many distances,
similarities, and various linkage strategies. It is often used when the number
of clusters the data set forms is unknown and some sort of hierarchy in the
data is plausible. Most algorithms for HAC operate on a full distance matrix,
and therefore require quadratic memory. The standard algorithm also has cubic
runtime to produce a full hierarchy. Both memory and runtime are especially
problematic in the context of embedded or otherwise very resource-constrained
systems. In this section, we present how data aggregation with BETULA, a
numerically stable version of the well-known BIRCH data aggregation algorithm,
can be used to make HAC viable on systems with constrained resources with only
small losses on clustering quality, and hence allow exploratory data analysis
of very large data sets.
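The approach can be sketched in a few lines (a deliberately naive stand-in for BETULA cluster features, assuming Euclidean data and a fixed absorption radius, neither of which is part of the actual algorithm): aggregate the data into a small set of weighted representatives in one pass, then run ordinary HAC on the representatives only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def aggregate(points, radius):
    """One-pass greedy aggregation: each point is absorbed by the first
    representative within `radius`; count and mean are updated with a
    numerically stable running-mean formula."""
    counts, means = [], []
    for p in points:
        for i, m in enumerate(means):
            if np.linalg.norm(p - m) <= radius:
                counts[i] += 1
                means[i] = m + (p - m) / counts[i]  # stable incremental mean
                break
        else:  # no representative close enough: open a new one
            counts.append(1)
            means.append(np.asarray(p, dtype=float))
    return np.array(counts), np.array(means)

# HAC now only sees the m << n representatives, so the quadratic distance
# matrix and the cubic standard algorithm become affordable:
# Z = linkage(means, method="ward")
```

A faithful version would also carry per-representative weights and variances into the linkage criterion; this sketch drops them for brevity.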
Sparse Partitioning Around Medoids
Partitioning Around Medoids (PAM, k-Medoids) is a popular clustering
technique to use with arbitrary distance functions or similarities, where each
cluster is represented by its most central object, called the medoid or the
discrete median. In operations research, this family of problems is also known
as facility location problem (FLP). FastPAM recently introduced a speedup for
large k to make it applicable to larger problems, but the method still has a
runtime quadratic in N. In this chapter, we discuss a sparse and asymmetric
variant of this problem, to be used for example on graph data such as road
networks. By exploiting sparsity, we can avoid the quadratic runtime and memory
requirements, and make this method scalable to even larger problems, as long as
we are able to build a small enough graph of sufficient connectivity to perform
local optimization. Furthermore, we consider asymmetric cases, where the set of
medoids is not identical to the set of points to be covered (or in the
interpretation of facility location, where the possible facility locations are
not identical to the consumer locations). Because of sparsity, it may be
impossible to cover all points with just k medoids for too small k, which would
render the problem unsolvable, and this breaks common heuristics for finding a
good starting condition. We hence consider determining k as a part of the
optimization problem and propose to first construct a greedy initial solution
with a larger k, then to optimize the problem by alternating between PAM-style
"swap" operations where the result is improved by replacing medoids with better
alternatives, and "remove" operations to reduce the number of medoids k until neither
allows further improving the result quality. We demonstrate the usefulness of
this method on a problem from electrical engineering, with the input graph
derived from cartographic data.
LOSDD: Leave-Out Support Vector Data Description for Outlier Detection
Support Vector Machines have been successfully used for one-class
classification (OCSVM, SVDD) when trained on clean data, but they work much
worse on dirty data: outliers present in the training data tend to become
support vectors, and are hence considered "normal". In this article, we improve
the effectiveness to detect outliers in dirty training data with a leave-out
strategy: by temporarily omitting one candidate at a time, this point can be
judged using the remaining data only. We show that this is more effective at
scoring the outlierness of points than using the slack term of existing
SVM-based approaches. Identified outliers can then be removed from the data,
such that outliers hidden by other outliers can be identified, to reduce the
problem of masking. Naively, this approach would require training N individual
SVMs (and repeatedly retraining when iteratively removing the worst outliers
one at a time), which is prohibitively expensive. We will discuss that only
support vectors need to be considered in each step and that by reusing SVM
parameters and weights, this incremental retraining can be accelerated
substantially. By removing candidates in batches, we can further improve the
processing time, although it obviously remains more costly than training a
single SVM.
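In scikit-learn terms, the leave-out idea looks roughly like this (a from-scratch retraining sketch using `OneClassSVM`; the article's incremental reuse of SVM parameters and the batch removal are not reproduced here, and the `nu`/`gamma` values are arbitrary choices):

```python
import numpy as np
from sklearn.svm import OneClassSVM

def leave_out_scores(X, nu=0.1, gamma=0.5):
    """Leave-out outlier scores: each support-vector candidate is scored by
    a model trained without it, so an outlier cannot vouch for itself.
    Non-support vectors keep the full model's score, since omitting them
    leaves the solution unchanged."""
    full = OneClassSVM(nu=nu, gamma=gamma).fit(X)
    scores = full.decision_function(X)   # larger = more normal
    for i in full.support_:              # only SVs can hide as "normal"
        mask = np.ones(len(X), dtype=bool)
        mask[i] = False
        model = OneClassSVM(nu=nu, gamma=gamma).fit(X[mask])
        scores[i] = model.decision_function(X[i : i + 1])[0]
    return scores
```

Restricting the retraining to support vectors is what keeps the naive N-model cost in check, as discussed in the abstract.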
Generalized and efficient outlier detection for spatial, temporal, and high-dimensional data mining
Knowledge Discovery in Databases (KDD) is the process of extracting non-trivial patterns from large databases, with the goal that these patterns be novel, potentially useful, statistically valid, and understandable. The process involves multiple phases, including selection, preprocessing, evaluation, and the analysis step known as Data Mining. One of the key techniques of Data Mining is outlier detection, that is, the identification of observations that are unusual and seemingly inconsistent with the majority of the data set. Such rare observations can have various causes: they can be measurement errors, unusually extreme (but genuine) measurements, corrupted or even manipulated data. Over the previous years, various outlier detection algorithms have been proposed that often appear to differ only slightly from earlier ones, yet ``clearly outperform'' them in the experiments. A key focus of this thesis is to unify and modularize the various approaches into a common formalism, which makes the analysis of the actual differences easier and, at the same time, increases the flexibility of the approaches by allowing modules to be added or replaced to adapt the methods to different requirements and data types.
To show the benefits of the modularized structure,
(i) several existing algorithms are formalized within the new framework;
(ii) new modules are added that improve the robustness, efficiency, statistical validity, and usability of the scores, and that can be combined with existing methods;
(iii) modules are modified to allow existing and new algorithms to run on other, often more complex data types, including spatial, temporal, and high-dimensional data;
(iv) the combination of multiple algorithm instances into an ensemble method is discussed;
(v) the scalability to large data sets is improved using approximate as well as exact indexing.
The starting point is the Local Outlier Factor (LOF) algorithm, which is extended with slight modifications to increase robustness and the usability of the produced scores. In order to get the same benefits for other methods, these methods are abstracted into a general framework for local outlier detection. By abstracting from a single vector space, other data types that involve spatial and temporal relationships can be analyzed. The use of subspace and correlation neighborhoods allows the algorithms to detect new kinds of outliers in arbitrarily oriented subspaces. Improvements in the score normalization bring back a statistical intuition of probability to the outlier scores, which previously were only useful for ranking objects, while improved models also offer explanations of why an object was considered to be an outlier.
Subsequently, improved modules are presented for different parts of the framework. They allow, for example, running the same algorithms on significantly larger data sets -- in approximately linear instead of quadratic complexity -- by accepting approximate neighborhoods at little loss in precision and effectiveness. Additionally, multiple algorithms with different intuitions can be run at the same time, and their results combined into an ensemble method that is able to detect outliers of different types.
Finally, new outlier detection methods are constructed, customized for the specific problems of real data sets. These new methods yield insightful results that could not be obtained with the existing methods. Since they are constructed from the same building blocks, there is a strong and explicit connection to the previous approaches, and by using the indexing strategies introduced earlier, the algorithms can be executed efficiently even on large data sets.
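The starting point is easy to state precisely. Below is a compact NumPy version of plain LOF with brute-force neighbor search (the thesis replaces exactly this O(n²) search with index structures; ties in the k-distance are ignored here for brevity):

```python
import numpy as np

def lof(X, k):
    """Local Outlier Factor with brute-force k-nearest-neighbor search."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbor
    knn = np.argsort(D, axis=1)[:, :k]   # indices of the k nearest neighbors
    kdist = D[np.arange(len(X)), knn[:, -1]]            # k-distance
    # reachability distance: reach(p, o) = max(k-distance(o), d(p, o))
    reach = np.maximum(kdist[knn], D[np.arange(len(X))[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)       # local reachability density
    return lrd[knn].mean(axis=1) / lrd   # neighbor density relative to own
```

Scores near 1 indicate inliers; the larger the score, the more the point's local density falls below that of its neighbors.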
EmbAssi: Embedding Assignment Costs for Similarity Search in Large Graph Databases
The graph edit distance is an intuitive measure to quantify the dissimilarity
of graphs, but its computation is NP-hard and challenging in practice. We
introduce methods for answering nearest neighbor and range queries regarding
this distance efficiently for large databases with up to millions of graphs. We
build on the filter-verification paradigm, where lower and upper bounds are
used to reduce the number of exact computations of the graph edit distance.
Highly effective bounds for this involve solving a linear assignment problem
for each graph in the database, which is prohibitive in massive datasets.
Index-based approaches typically provide only weak bounds, leading to high
computational costs during verification. In this work, we derive novel lower bounds
for efficient filtering from restricted assignment problems, where the cost
function is a tree metric. This special case allows embedding the costs of
optimal assignments isometrically into ℓ1 space, rendering efficient
indexing possible. We propose several lower bounds of the graph edit distance
obtained from tree metrics reflecting the edit costs, which are combined for
effective filtering. Our method termed EmbAssi can be integrated into existing
filter-verification pipelines as a fast and effective pre-filtering step.
Empirically we show that for many real-world graphs our lower bounds are
already close to the exact graph edit distance, while our index construction
and search scale to very large databases.
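The filter-verification idea itself fits in a few lines. The sketch below uses a deliberately weak but provably valid lower bound (node-label multiset difference under uniform edit costs) rather than EmbAssi's tree-metric assignment bounds, and `exact_ged` is a stand-in for a real graph edit distance solver:

```python
from collections import Counter

def label_lower_bound(labels1, labels2):
    """ceil(diff / 2), where diff counts unmatched node labels between the
    two graphs: a relabeling fixes at most two unmatched labels, a node
    insertion or deletion at most one, so no edit script under uniform
    costs can be shorter."""
    c1, c2 = Counter(labels1), Counter(labels2)
    diff = sum(((c1 - c2) + (c2 - c1)).values())
    return (diff + 1) // 2

def range_query(query, db, threshold, exact_ged):
    """Filter-verification range query: prune candidates whose lower bound
    already exceeds the threshold; verify the rest with the (NP-hard)
    exact graph edit distance."""
    results = []
    for gid, labels in db.items():
        if label_lower_bound(query, labels) > threshold:
            continue                 # filtered without exact computation
        if exact_ged(query, labels) <= threshold:
            results.append(gid)
    return results
```

Because the lower bound is valid, pruning never discards a true result; stronger bounds, such as those derived from tree metrics in the paper, simply prune more candidates before verification.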