
    Learning maximum excluding ellipsoids from imbalanced data with theoretical guarantees

    In this paper, we address the problem of learning from imbalanced data. We consider the scenario where the number of negative examples is much larger than the number of positive ones. We propose a theoretically founded method which learns a set of local ellipsoids centered at the minority-class examples while excluding the negative examples of the majority class. We address this task from a Mahalanobis-like metric learning point of view and derive generalization guarantees on the learned metric using the uniform stability framework. Our experimental evaluation on classic benchmarks and on a proprietary dataset in bank fraud detection shows the effectiveness of our approach, particularly when the class imbalance is severe.
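    To make the geometric idea concrete, the NumPy sketch below builds one exclusion ellipsoid around a single minority example: a local metric is taken from the covariance of the nearest negatives, and the radius is set so that every negative stays outside. This is only an illustration under these simplifying assumptions, not the paper's stability-based optimization; the names local_ellipsoid and inside are hypothetical.

```python
import numpy as np

def local_ellipsoid(x_pos, X_neg, k=20, eps=1e-6):
    """Toy sketch (not the paper's learned metric): build a Mahalanobis-like
    ellipsoid centered at a minority example x_pos, sized so that the
    majority (negative) examples are excluded."""
    diffs = X_neg - x_pos                        # offsets to all negatives
    d2 = np.einsum('ij,ij->i', diffs, diffs)     # squared Euclidean distances
    nearest = diffs[np.argsort(d2)[:k]]          # k closest negatives

    # Local covariance of the nearest negatives defines the metric M = cov^-1,
    # so the ellipsoid stretches along directions where negatives spread out.
    cov = np.cov(nearest, rowvar=False) + eps * np.eye(x_pos.size)
    M = np.linalg.inv(cov)

    # Radius: just below the smallest Mahalanobis distance to any negative,
    # so all negatives lie strictly outside the ellipsoid.
    maha = np.einsum('ij,jk,ik->i', diffs, M, diffs)
    radius2 = 0.99 * maha.min()
    return M, radius2

def inside(x, x_pos, M, radius2):
    """True if a query point falls inside the exclusion ellipsoid."""
    d = x - x_pos
    return float(d @ M @ d) <= radius2
```

    A set of such ellipsoids, one per minority example, then acts as the positive-class region; the paper instead learns the metric by solving a regularized optimization problem, which is what enables the uniform-stability guarantees.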

    Depth- and Potential-Based Supervised Learning

    The task of supervised learning is to define a data-based rule by which new objects are assigned to one of the classes. For this, a training data set is used that contains objects with known class membership. In this thesis, two procedures for supervised classification are introduced. The first procedure is based on potential functions. The potential of a class is defined as a kernel density estimate multiplied by the class's prior probability. The method transforms the data to a potential-potential (pot-pot) plot, where each data point is mapped to a vector of potentials, similarly to the DD-plot. Separation of the classes, as well as classification of new data points, is performed on this plot; thus the bias in the kernel density estimates due to insufficiently adapted multivariate kernels is compensated by a flexible classifier on the pot-pot plot. The proposed method has been implemented in the R package ddalpha, a software package designed to combine the practitioner's experience with recent theoretical and computational achievements in the area of data depth and depth-based classification. It implements various depth functions and classifiers for multivariate and functional data under one roof and is extensible with user-defined custom depth methods and separators. The second classification procedure focuses on the centers of the classes and is based on data depth. The classifier adds a depth term to the objective function of the Bayes classifier, so that the cost of misclassifying a point depends not only on its class membership but also on its centrality within that class. Correct classification of more central points is enforced, while outliers are down-weighted. The proposed objective function may also be used to evaluate the performance of other classifiers instead of the usual average misclassification rate. The thesis also contains a new algorithm for the exact calculation of the Oja median. It modifies the algorithm of Ronkainen, Oja and Orponen (2003) by employing bounded regions which contain the median, and it is faster and has lower complexity than its predecessor. The new algorithm has been implemented as part of the R package OjaNP.
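    As an illustration of the pot-pot construction, here is a minimal Python sketch (the actual implementation lives in the R package ddalpha; the class and method names below are hypothetical): each class contributes a potential equal to its prior times a kernel density estimate, each point is mapped to its vector of potentials, and a separate classifier is trained on that plot.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.linear_model import LogisticRegression

class PotPotClassifier:
    """Illustrative sketch of the pot-pot idea (not the ddalpha implementation):
    map each point to a vector of class potentials (prior times kernel density
    estimate) and train a flexible separator on that low-dimensional plot."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth

    def _potentials(self, X):
        # Potential of class c at x: prior(c) * KDE_c(x)
        dens = np.column_stack(
            [np.exp(kde.score_samples(X)) for kde in self.kdes_])
        return dens * self.priors_

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.kdes_, priors = [], []
        for c in self.classes_:
            Xc = X[y == c]
            self.kdes_.append(KernelDensity(bandwidth=self.bandwidth).fit(Xc))
            priors.append(len(Xc) / len(X))
        self.priors_ = np.array(priors)
        # Separator on the pot-pot plot; logistic regression stands in here
        # for the more flexible separators discussed in the thesis.
        self.sep_ = LogisticRegression().fit(self._potentials(X), y)
        return self

    def predict(self, X):
        return self.sep_.predict(self._potentials(X))
```

    With a Gaussian kernel the bandwidth is the main tuning parameter; a more flexible separator (for example k-NN or a depth-based rule) can be swapped in for the logistic regression, which is precisely the point of performing classification on the pot-pot plot rather than on the raw densities.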

    On large-scale probabilistic and statistical data analysis

    In this manuscript we develop and apply modern algorithmic data reduction techniques to tackle scalability issues and enable statistical data analysis of massive data sets. Our algorithms follow a general scheme, where a reduction technique is applied to the large-scale data to obtain a small summary of sublinear size to which a classical algorithm is applied. The techniques for obtaining these summaries depend on the problem that we want to solve. The size of the summaries is usually parametrized by an approximation parameter, expressing the trade-off between efficiency and accuracy. In some cases the data can be reduced to a size that has no or only negligible dependency on the initial number of data items. However, for other problems it turns out that sublinear summaries do not exist in the worst case. In such situations, we exploit statistical or geometric relaxations to obtain useful sublinear summaries under certain mildness assumptions. In particular, we present the data reduction methods known as coresets and subspace embeddings, together with several algorithmic techniques to construct them via random projections and sampling.
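    As a concrete, if simplified, instance of this reduce-then-solve scheme, the snippet below compresses a tall least-squares problem with a Gaussian random projection (a subspace embedding) and applies the classical solver to the small summary. The function name gaussian_sketch_lstsq and the chosen sketch size are illustrative assumptions, not taken from the manuscript; the sketch size plays the role of the approximation parameter trading efficiency against accuracy.

```python
import numpy as np

def gaussian_sketch_lstsq(A, b, sketch_rows, rng=None):
    """Sketch-and-solve least squares as a toy subspace-embedding example:
    a Gaussian matrix S with few rows approximately preserves the column
    span of [A | b], so solving the small sketched problem approximates
    the solution of the full one."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    S = rng.standard_normal((sketch_rows, n)) / np.sqrt(sketch_rows)
    # Classical solver applied to the sublinear summary (S @ A, S @ b).
    x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x_sketch

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20_000, 20))
    b = A @ rng.standard_normal(20) + 0.1 * rng.standard_normal(20_000)
    x_full, *_ = np.linalg.lstsq(A, b, rcond=None)
    x_fast = gaussian_sketch_lstsq(A, b, sketch_rows=500, rng=1)
    print("difference to full solution:", np.linalg.norm(x_full - x_fast))
```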

    Generalized and efficient outlier detection for spatial, temporal, and high-dimensional data mining

    Knowledge Discovery in Databases (KDD) is the process of extracting non-trivial patterns from large databases, with the aim that these patterns are novel, potentially useful, statistically valid, and understandable. The process involves multiple phases, including selection, preprocessing, evaluation, and the analysis step known as data mining. One of the key tasks in data mining is outlier detection: the identification of observations that are unusual and seemingly inconsistent with the majority of the data. Such rare observations can have various causes: measurement errors, unusually extreme (but valid) measurements, data corruption, or even manipulated data. In recent years, numerous outlier detection algorithms have been proposed that often appear to differ only slightly from previous ones, yet are presented experimentally as "clearly outperforming" them. A key focus of this thesis is to unify and modularize these approaches within a common formalism, which both simplifies the analysis of their actual differences and increases their flexibility, since modules can be added or replaced to adapt a method to different requirements and data types. To demonstrate the benefits of the modularized structure, (i) several existing algorithms are formalized within the new framework; (ii) new modules are added that improve robustness, efficiency, statistical validity, and score usability, and that can be combined with existing methods; (iii) modules are modified so that existing and new algorithms can run on other, often more complex data types, including spatial, temporal, and high-dimensional data; (iv) multiple algorithm instances are combined into ensemble methods to obtain better results; and (v) scalability to large data sets is improved using approximate as well as exact indexing.
    The starting point is the Local Outlier Factor (LOF) algorithm, which is first extended with small modifications to increase the robustness and usability of the produced scores. These methods are then abstracted into a general framework for local outlier detection, so that the same benefits become available to other algorithms. By abstracting from a single vector space to general data types, spatial and temporal relationships can also be analyzed. The use of subspace and correlation-based neighborhoods then allows the detection of new kinds of outliers in arbitrarily oriented projections. Improvements in score normalization make the scores interpretable with the statistical intuition of a probability, rather than merely producing an outlier ranking as before, and improved models also provide explanations of why an object was considered an outlier.
    Subsequently, improved modules are presented for several parts of the framework; for example, they allow the same algorithms to run on significantly larger data sets, in approximately linear rather than quadratic time, by accepting approximate neighborhoods at little loss in precision and effectiveness. Furthermore, multiple algorithms with different intuitions can be run simultaneously and their results combined into an ensemble method that is able to detect outliers of different types. Finally, new outlier detection methods are constructed for real data sets, customized to their specific problems. These new methods yield insightful results that could not be obtained with the existing methods, and since they are built from the same modular building blocks, there is a direct and explicit connection to the earlier approaches. By using the indexing structures introduced earlier, the algorithms can be executed efficiently even on large data sets.
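    To ground the discussion, here is a minimal baseline using scikit-learn's LocalOutlierFactor, an off-the-shelf LOF implementation rather than the modular framework developed in the thesis; the data and parameter choices are purely illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: a dense inlier cluster plus a few scattered rare points.
rng = np.random.default_rng(0)
X_inliers = rng.normal(0.0, 1.0, size=(200, 2))
X_outliers = rng.uniform(-6.0, 6.0, size=(10, 2))
X = np.vstack([X_inliers, X_outliers])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 marks detected outliers
scores = -lof.negative_outlier_factor_      # higher score = more outlying

# The raw LOF score is only useful for ranking objects, which is exactly
# the limitation addressed by the score-normalization modules above.
top = np.argsort(scores)[::-1][:10]
print("most outlying indices:", top)
```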

    Industry and Tertiary Sectors towards Clean Energy Transition

    The clean energy transition is the shift from nonrenewable to renewable energy sources and is part of the wider transition to sustainable economies through the use of renewable energy, the adoption of energy-saving measures, and sustainable development techniques. It is a long and complex process that will lead to an epochal change and will safeguard the health of the environment in the long term. Its success requires contributions from everyone, from individual citizens to SMEs and large multinationals, and national and international policies play a key role in paving the way. This Special Issue focuses on technical, financial, and policy-related aspects of the transition of the industrial and service sectors towards energy saving and decarbonization. These aspects are interrelated and have therefore been analyzed with an interdisciplinary approach, for example by combining economic and technical information. The collected papers focus on energy efficiency and key clean-energy technologies, renewable sources, energy management and monitoring systems, energy policies and regulations, and economic and financial aspects.