Advances in Unsupervised Learning and Application Areas: Subspace Clustering with Background Knowledge, Semantic Password Guessing, and Learned Index Structures
Over the past few years, advances in data science, machine learning and, in particular, unsupervised learning have enabled significant progress in many scientific fields and even in everyday life. Unsupervised learning methods are usually successful whenever they can be tailored to specific applications using appropriate requirements based on domain expertise. This dissertation shows how purely theoretical research can lead to circumstances that favor overly optimistic results, and what advantages application-oriented research based on specific background knowledge offers. These observations apply to traditional unsupervised learning problems such as clustering, anomaly detection and dimensionality reduction. Therefore, this thesis presents extensions of these classical problems, such as subspace clustering and principal component analysis, as well as several specific applications with relevant interfaces to machine learning. Examples include password guessing using semantic word embeddings and learning spatial index structures using statistical models. In essence, this thesis shows that application-oriented research has many advantages for current and future research.
Homophily Outlier Detection in Non-IID Categorical Data
Most existing outlier detection methods assume that the outlier factors
(i.e., outlierness scoring measures) of data entities (e.g., feature values and
data objects) are Independent and Identically Distributed (IID). This
assumption does not hold in real-world applications where the outlierness of
different entities is dependent on each other and/or taken from different
probability distributions (non-IID). This may lead to the failure of detecting
important outliers that are too subtle to be identified without considering the
non-IID nature. The issue intensifies in more challenging contexts,
e.g., high-dimensional data with many noisy features. This work introduces a
novel outlier detection framework and its two instances to identify outliers in
categorical data by capturing non-IID outlier factors. Our approach first
defines and incorporates distribution-sensitive outlier factors and their
interdependence into a value-value graph-based representation. It then models
an outlierness propagation process in the value graph to learn the outlierness
of feature values. The learned value outlierness allows for either direct
outlier detection or outlying feature selection. The graph representation and
mining approach is employed here to capture the rich non-IID
characteristics well. Our empirical results on 15 real-world data sets with
different levels of data complexities show that (i) the proposed outlier
detection methods significantly outperform five state-of-the-art methods at the
95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most
complex data sets; and (ii) the proposed feature selection methods
significantly outperform three competing methods in enabling subsequent outlier
detection of two different existing detectors. Comment: To appear in Data Mining and Knowledge Discovery Journal
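The idea of propagating outlierness over a value-value graph can be illustrated with a small sketch. This is not the authors' framework: it assumes a plain random walk with restart, uses value rarity as a stand-in for the distribution-sensitive outlier factors, and weights edges by raw co-occurrence counts. The function names `value_outlierness` and `object_scores` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def value_outlierness(rows, damping=0.85, iterations=50):
    """Score each (feature index, value) node by a random walk with restart.

    Assumption: initial outlierness is simple rarity (1 - relative frequency),
    a stand-in for the outlier factors described in the abstract.
    """
    n = len(rows)
    freq = Counter((j, v) for row in rows for j, v in enumerate(row))
    init = {node: 1.0 - count / n for node, count in freq.items()}
    total = sum(init.values()) or 1.0
    init = {node: s / total for node, s in init.items()}

    # Edge weight = number of rows in which two values co-occur.
    weight = defaultdict(Counter)
    for row in rows:
        for u, v in combinations(list(enumerate(row)), 2):
            weight[u][v] += 1
            weight[v][u] += 1

    score = dict(init)
    for _ in range(iterations):
        nxt = {node: (1.0 - damping) * init[node] for node in init}
        for u, nbrs in weight.items():
            out = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += damping * score[u] * w / out
        score = nxt
    return score


def object_scores(rows, value_scores):
    """Score a data object as the mean outlierness of its feature values."""
    return [
        sum(value_scores[(j, v)] for j, v in enumerate(row)) / len(row)
        for row in rows
    ]


rows = [
    ("a", "x", "p"),
    ("a", "x", "p"),
    ("a", "x", "p"),
    ("a", "y", "p"),
    ("b", "z", "q"),  # every value here is rare and co-occurs only with rare values
]
scores = value_outlierness(rows)
ranking = object_scores(rows, scores)
```

In this toy data the last row receives the highest object score, because its values keep the outlierness mass among themselves during propagation. The learned value scores could equally be aggregated per feature to rank features, which is the second use the abstract mentions.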
Automation of cleaning and ensembles for outliers detection in questionnaire data
This article focuses on the automatic detection of corrupted or inappropriate responses in questionnaire data using unsupervised outlier detection. Questionnaire surveys are often used in psychology research to collect self-report data, and their preprocessing takes considerable manual effort. Unlike numerical data, where distance-based outliers prevail, questionnaire records have to be assessed from several perspectives that are only weakly related. We identify the most frequent types of errors in questionnaires. For each of them, we suggest different outlier detection methods that rank the records using normalized scores. Considering the similarity between pairs of outlier scores (some are largely uncorrelated), we propose an ensemble based on the union of the outliers detected by the different methods. Our outlier detection framework includes some well-known algorithms, but we also propose novel approaches addressing the typical issues of questionnaires. The selected methods are based on distance, entropy, and probability. The experimental section describes the process of assembling the methods and selecting their parameters for the final model, which detects significant outliers in the real-world HBSC dataset. Web of Science 206 art. no. 11780
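A union-based ensemble over normalized scores can be sketched as follows. This is a minimal illustration under stated assumptions: min-max normalization, a single shared threshold, and made-up score lists standing in for the distance-, entropy-, and probability-based detectors. The article's actual normalization and parameter choices are not given in the abstract.

```python
def min_max(scores):
    """Rescale raw outlier scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def union_ensemble(score_lists, threshold=0.9):
    """Flag a record if ANY detector gives it a normalized score >= threshold."""
    flagged = set()
    for scores in score_lists:
        normalized = min_max(scores)
        flagged |= {i for i, s in enumerate(normalized) if s >= threshold}
    return sorted(flagged)


# Hypothetical raw scores from two detectors for four questionnaire records.
distance_scores = [0.1, 0.2, 5.0, 0.3]
entropy_scores = [1.0, 9.0, 1.2, 1.1]
flagged_records = union_ensemble([distance_scores, entropy_scores])
```

A union suits detectors whose scores are largely uncorrelated: each one targets a different error type, so averaging would dilute a strong signal from a single detector, whereas the union keeps it.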
Daphne: A tool for anomaly detection
This work presents a new tool for detecting and analyzing anomalies. It supports the study of any time series, whether univariate or multivariate. The tool consists of two parts: a "brain", which implements the anomaly detection methodologies and the tools for analyzing the detected anomalies, and an interface, which handles interaction with the user. The report details the implemented algorithms and tools. To demonstrate the tool's potential, a practical application case is also presented.
Rigid Transformations for Stabilized Lower Dimensional Space to Support Subsurface Uncertainty Quantification and Interpretation
Subsurface datasets inherently possess big data characteristics such as vast
volume, diverse features, and high sampling speeds, further compounded by the
curse of dimensionality from various physical, engineering, and geological
inputs. Among the existing dimensionality reduction (DR) methods, nonlinear
dimensionality reduction (NDR) methods, especially metric multidimensional
scaling (MDS), are preferred for subsurface datasets because of the datasets'
inherent complexity. While MDS retains intrinsic data structure and quantifies
uncertainty, its limitations include solutions that are unique only up to
Euclidean transformations, and therefore unstable across realizations, and
the absence of an out-of-sample points (OOSP)
extension. To enhance subsurface inferential and machine learning workflows,
datasets must be transformed into stable, reduced-dimension representations
that accommodate OOSP.
Our solution employs rigid transformations to obtain a stabilized,
Euclidean-invariant representation in the lower-dimensional space (LDS). By
computing an MDS input dissimilarity
matrix, and applying rigid transformations on multiple realizations, we ensure
transformation invariance and integrate OOSP. This process leverages a convex
hull algorithm and incorporates a loss function and normalized stress for
distortion quantification. We validate our approach with synthetic data,
varying distance metrics, and real-world wells from the Duvernay Formation.
Results confirm our method's efficacy in achieving consistent LDS
representations. Furthermore, our proposed "stress ratio" (SR) metric provides
insight into uncertainty, beneficial for model adjustments and inferential
analysis. Consequently, our workflow promises enhanced repeatability and
comparability in NDR for subsurface energy resource engineering and associated
big data workflows. Comment: 30 pages, 17 figures, submitted to Computational Geosciences Journal
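Aligning one MDS realization to a reference with a rigid transformation can be sketched in two dimensions as below. This is an illustrative sketch, not the authors' workflow: it covers rotation and translation only (reflections would need a separate check), omits the convex hull step and the proposed stress ratio, and uses one common normalization of stress (dividing by the sum of squared input dissimilarities), since the abstract does not say which variant is used. All function names are hypothetical.

```python
import math


def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def rigid_align(source, target):
    """Rotate and translate 2-D `source` points onto `target` (least squares)."""
    (sx, sy), (tx, ty) = centroid(source), centroid(target)
    src = [(x - sx, y - sy) for x, y in source]
    tgt = [(x - tx, y - ty) for x, y in target]
    # Closed-form optimal rotation angle for centered 2-D point sets.
    dot = sum(a * c + b * d for (a, b), (c, d) in zip(src, tgt))
    cross = sum(a * d - b * c for (a, b), (c, d) in zip(src, tgt))
    theta = math.atan2(cross, dot)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(cos_t * a - sin_t * b + tx, sin_t * a + cos_t * b + ty) for a, b in src]


def pairwise_distances(points):
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]


def normalized_stress(embedded, dissimilarities):
    """sqrt( sum (d_ij - delta_ij)^2 / sum delta_ij^2 ) over pairs i < j."""
    distances = pairwise_distances(embedded)
    num = sum((d - delta) ** 2 for d, delta in zip(distances, dissimilarities))
    den = sum(delta ** 2 for delta in dissimilarities)
    return math.sqrt(num / den)


reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
# The same configuration rotated by 90 degrees and shifted by (3, -1).
realization = [(3.0, -1.0), (3.0, 0.0), (1.0, -1.0)]
aligned = rigid_align(realization, reference)
stress = normalized_stress(aligned, pairwise_distances(reference))
```

Because rigid transformations preserve pairwise distances, alignment changes where the points sit without changing the stress, which is why it can stabilize the representation across realizations without adding distortion.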
Deep Weakly-supervised Anomaly Detection
Anomaly detection is typically posited as an unsupervised learning task in
the literature due to the prohibitive cost and difficulty of obtaining large-scale
labeled anomaly data, but this ignores the fact that a very small number
(e.g., a few dozen) of labeled anomalies can often be made available at
small or trivial cost in many real-world anomaly detection applications. To
leverage such labeled anomaly data, we study an important anomaly detection
problem termed weakly-supervised anomaly detection, in which, in addition to a
large amount of unlabeled data, a limited number of labeled anomalies are
available during modeling. Learning with the small labeled anomaly data enables
anomaly-informed modeling, which helps identify anomalies of interest and
address the notoriously high false-positive rates in unsupervised anomaly detection.
However, the problem is especially challenging, since (i) the limited amount of
labeled anomaly data often, if not always, cannot cover all types of anomalies
and (ii) the unlabeled data is often dominated by normal instances but has
anomaly contamination. We address the problem by formulating it as a pairwise
relation prediction task. Particularly, our approach defines a two-stream
ordinal regression neural network to learn the relation of randomly sampled
instance pairs, i.e., whether the instance pair contains two labeled anomalies,
one labeled anomaly, or just unlabeled data instances. The resulting model
effectively leverages both the labeled and unlabeled data to substantially
augment the training data and learn well-generalized representations of
normality and abnormality. Comprehensive empirical results on 40 real-world
datasets show that our approach (i) significantly outperforms four
state-of-the-art methods in detecting both known and previously unseen
anomalies and (ii) is substantially more data-efficient. Comment: Theoretical
results are refined and extended. Significantly more empirical results are
added, including results on detecting previously unknown anomalies
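The pairwise formulation can be made concrete with a sketch of the pair-sampling step alone. The ordinal targets (2, 1, 0), the uniform choice among the three pair types, and the function name `sample_pairs` are assumptions for illustration. The two-stream network and the values used by the authors are not specified in the abstract.

```python
import random

# Hypothetical ordinal targets: more labeled anomalies in a pair -> larger target.
PAIR_TARGETS = {
    "anomaly-anomaly": 2.0,
    "anomaly-unlabeled": 1.0,
    "unlabeled-unlabeled": 0.0,
}


def sample_pairs(labeled_anomalies, unlabeled, n_pairs, rng):
    """Draw random instance pairs with an ordinal regression target each.

    Requires at least two labeled anomalies and two unlabeled instances.
    """
    kinds = list(PAIR_TARGETS)
    pairs = []
    for _ in range(n_pairs):
        kind = rng.choice(kinds)
        if kind == "anomaly-anomaly":
            first, second = rng.sample(labeled_anomalies, 2)
        elif kind == "anomaly-unlabeled":
            first, second = rng.choice(labeled_anomalies), rng.choice(unlabeled)
        else:
            first, second = rng.sample(unlabeled, 2)
        pairs.append((first, second, PAIR_TARGETS[kind]))
    return pairs


rng = random.Random(0)
anomalies = ["a1", "a2", "a3"]
unlabeled = ["u1", "u2", "u3", "u4", "u5"]
pairs = sample_pairs(anomalies, unlabeled, 30, rng)
```

The number of distinct pairs grows roughly quadratically with the number of instances, so even a few dozen labeled anomalies yield many anomaly-containing training pairs. This is one way to read the abstract's claim that pairing substantially augments the training data.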
CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS
The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting has been organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization, whose aims are to further classification research
- …