A Dissimilarity-Based Approach to Dataset Characterization
Dataset characterization remains a major obstacle in intelligent data analysis. Most approaches to this problem aggregate the information describing the individual attributes of the datasets, which entails a loss of information. We propose a dissimilarity-based approach that avoids this aggregation, and we study its value for characterizing the performance of classification algorithms and for solving meta-learning problems.
Revisiting Data Complexity Metrics Based on Morphology for Overlap and Imbalance: Snapshot, New Overlap Number of Balls Metrics and Singular Problems Prospect
Data Science and Machine Learning have become fundamental assets for
companies and research institutions alike. As one of its fields, supervised
classification allows for class prediction of new samples, learning from given
training data. However, some properties can cause datasets to be problematic to
classify.
In order to evaluate a dataset a priori, data complexity metrics have been
used extensively. They provide information regarding different intrinsic
characteristics of the data, which serve to evaluate classifier compatibility
and a course of action that improves performance. However, most complexity
metrics focus on just one characteristic of the data, which can be insufficient
to properly evaluate the dataset towards the classifiers' performance. In fact,
class overlap, a very detrimental feature for the classification process
(especially when imbalance among class labels is also present) is hard to
assess.
This research work focuses on revisiting complexity metrics based on data
morphology. In accordance with their nature, the premise is that they provide
both good estimates for class overlap and strong correlations with
classification performance. For that purpose, a novel family of metrics has
been developed. Being based on ball coverage by classes, they are named
Overlap Number of Balls. Finally, some prospects for the adaptation of this
family of metrics to singular (more complex) problems are discussed.
Comment: 23 pages, 9 figures, preprint
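To make the ball-coverage idea in the abstract above concrete, here is a minimal sketch. It is a simplified illustration and not the paper's exact metric definition: the function name `ball_cover_ratio`, the greedy selection rule, and the normalization by the number of points are all assumptions. Each candidate ball is centered on a data point and grows until it would touch a point of another class; the fewer balls needed to cover the data, the less the classes overlap.

```python
from math import dist


def ball_cover_ratio(points, labels):
    """Greedy pure-class ball cover (hypothetical helper, illustrative only).

    Returns balls_used / n_points: values near 0 suggest well-separated
    classes, values near 1 suggest heavy class overlap.
    """
    n = len(points)
    # Radius of each candidate ball: distance to the nearest point of another class.
    radius = [
        min(
            (dist(points[i], points[j]) for j in range(n) if labels[j] != labels[i]),
            default=float("inf"),  # single-class data: one ball covers everything
        )
        for i in range(n)
    ]
    # Points covered by each candidate ball: its own center plus every point
    # strictly inside the radius (all of these share the center's class).
    covers = [
        {i} | {j for j in range(n) if dist(points[i], points[j]) < radius[i]}
        for i in range(n)
    ]
    uncovered = set(range(n))
    balls = 0
    while uncovered:
        # Greedy step: pick the ball that covers the most still-uncovered points.
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        uncovered -= covers[best]
        balls += 1
    return balls / n
```

With two compact, distant clusters the ratio is low (one ball per class); with classes interleaved point by point every instance needs its own ball and the ratio reaches 1.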
Homophily Outlier Detection in Non-IID Categorical Data
Most existing outlier detection methods assume that the outlier factors
(i.e., outlierness scoring measures) of data entities (e.g., feature values and
data objects) are Independent and Identically Distributed (IID). This
assumption does not hold in real-world applications where the outlierness of
different entities is dependent on each other and/or taken from different
probability distributions (non-IID). This may lead to the failure of detecting
important outliers that are too subtle to be identified without considering the
non-IID nature. The issue is further intensified in more challenging contexts,
e.g., high-dimensional data with many noisy features. This work introduces a
novel outlier detection framework and its two instances to identify outliers in
categorical data by capturing non-IID outlier factors. Our approach first
defines and incorporates distribution-sensitive outlier factors and their
interdependence into a value-value graph-based representation. It then models
an outlierness propagation process in the value graph to learn the outlierness
of feature values. The learned value outlierness allows for either direct
outlier detection or outlying feature selection. The graph representation and
mining approach is employed here to capture the rich non-IID
characteristics. Our empirical results on 15 real-world data sets with
different levels of data complexities show that (i) the proposed outlier
detection methods significantly outperform five state-of-the-art methods at the
95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most
complex data sets; and (ii) the proposed feature selection methods
significantly outperform three competing methods in enabling subsequent outlier
detection by two different existing detectors.
Comment: To appear in Data Mining and Knowledge Discovery Journal
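The value-graph propagation described in the abstract above can be sketched roughly as follows. This is not the authors' framework: the initial outlier factor (one minus relative frequency), the co-occurrence edge weights, the damped-averaging update, and the names `value_outlierness` and `object_scores` are all assumptions chosen to keep the example small. Nodes are (feature, value) pairs, and each value's score is repeatedly blended with the scores of the values it co-occurs with.

```python
from collections import Counter
from itertools import combinations


def value_outlierness(rows, damping=0.85, iters=50):
    """Propagate outlierness over a value-value co-occurrence graph (illustrative)."""
    n = len(rows)
    # Nodes are (feature_index, value) pairs; count how often each occurs.
    freq = Counter((f, v) for row in rows for f, v in enumerate(row))
    # Assumed initial outlier factor: rarer values start with a higher score.
    base = {node: 1.0 - count / n for node, count in freq.items()}
    # Assumed edge weights: how often two values appear in the same object.
    neighbors = {node: Counter() for node in freq}
    for row in rows:
        for a, b in combinations(list(enumerate(row)), 2):
            neighbors[a][b] += 1
            neighbors[b][a] += 1
    score = dict(base)
    for _ in range(iters):
        updated = {}
        for node in freq:
            total = sum(neighbors[node].values())
            # Weighted average of the neighbors' current scores.
            spread = (
                sum(w * score[m] for m, w in neighbors[node].items()) / total
                if total
                else 0.0
            )
            updated[node] = (1 - damping) * base[node] + damping * spread
        score = updated
    return score


def object_scores(rows, score):
    """Score each object as the mean outlierness of its feature values."""
    return [sum(score[(f, v)] for f, v in enumerate(row)) / len(row) for row in rows]
```

Because a value's score depends on its neighbors' scores, a value that is only moderately rare but always appears alongside very rare values ends up with a higher score than frequency alone would give it, which is the kind of interdependence the abstract refers to.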
Trustworthiness in Mobile Cyber Physical Systems
Computing and communication capabilities are increasingly embedded in diverse objects and structures in the physical environment, linking the ‘cyberworld’ of computing and communications with the physical world. Such applications are called cyber physical systems (CPS). The increased involvement of real-world entities leads to a greater demand for trustworthy systems. Here, "system trustworthiness" refers to a system's ability to guarantee continuous service in the presence of internal errors or external attacks. Mobile CPS (MCPS) is a prominent subcategory of CPS in which the physical component has no permanent location. Mobile Internet devices already provide ubiquitous platforms for building novel MCPS applications. The objective of this Special Issue is to contribute to research in modern and future trustworthy MCPS, including design, modeling, simulation, and dependability. It is imperative to address the issues that are critical to their mobility, report significant advances in the underlying science, and discuss the challenges of development and implementation in various applications of MCPS.
A Study of Instance Selection Methods
This thesis presents a study of instance selection techniques, analyzing the state of the
art and developing new methods to cover some areas that had not received due
attention so far.
The first two chapters present new instance selection methods for regression, a
topic little studied in the literature to date. The third chapter examines whether
combining instance selection algorithms for regression yields better results than the
individual methods on their own.
The last chapter presents a novel idea: using locality-sensitive hash functions
to design two new instance selection algorithms for classification. The advantage
of this solution is that both algorithms have linear complexity.
The results of this thesis have been published in four articles in first-quartile JCR journals.
Ministerio de Economía, Industria y Competitividad, the Junta de Castilla y León, and the European
Regional Development Fund, projects TIN 2011-24046, TIN 2015-67534-P (MINECO/FEDER),
and BU085P17 (JCyL/FEDER)
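The abstract above does not describe the two hashing-based algorithms in detail, so the following is only a sketch of the general idea under stated assumptions. It uses random-projection hashing with a fixed bucket width and keeps the first instance of each class seen in each bucket; the function name `lsh_select` and its parameters `n_hashes`, `width`, and `seed` are hypothetical. The data is scanned once, so the cost grows linearly with the number of instances.

```python
import random
from math import floor


def lsh_select(X, y, n_hashes=4, width=1.0, seed=0):
    """Keep one instance per (hash bucket, class) pair (illustrative sketch).

    Returns the indices of the retained instances, in input order.
    """
    rng = random.Random(seed)
    dim = len(X[0])
    # Assumed hash family: random Gaussian projections with a random offset,
    # quantized into buckets of the given width.
    projections = [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
        for _ in range(n_hashes)
    ]
    seen = set()
    kept = []
    for idx, (x, label) in enumerate(zip(X, y)):
        # Nearby points tend to fall into the same bucket for every projection.
        bucket = tuple(
            floor((sum(a * v for a, v in zip(proj, x)) + offset) / width)
            for proj, offset in projections
        )
        if (bucket, label) not in seen:
            seen.add((bucket, label))
            kept.append(idx)
    return kept
```

A larger `width` or fewer hash functions gives coarser buckets and a stronger reduction; keeping the class label in the key ensures that instances of different classes sharing a bucket are both retained.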
Multikonferenz Wirtschaftsinformatik (MKWI) 2016: Technische Universität Ilmenau, 9-11 March 2016; Volume I
Overview of the sub-conferences in Volume I:
• 11th Conference on Mobility and Digitalization (MMS 2016)
• Automated Process and Service Management
• Business Intelligence, Analytics and Big Data
• Computational Mobility, Transportation and Logistics
• CSCW & Social Computing
• Cyber-Physical Systems and Digital Value Networks
• Digitalization and Privacy
• e-Commerce and e-Business
• E-Government: Information and Communication Technologies in the Public Sector
• E-Learning and Learning Service Engineering: Development, Use and Evaluation of Technology-Supported Teaching and Learning Processes