10 research outputs found

    Une approche par dissimilarité pour la caractérisation de jeux de données

    Dataset characterization remains a major bottleneck in intelligent data analysis. Most approaches to this problem aggregate the information describing the individual attributes of datasets, which entails a loss of information. We propose a dissimilarity-based approach that avoids this aggregation, and study its value in characterizing the performance of classification algorithms and in solving meta-learning problems.

    Revisiting Data Complexity Metrics Based on Morphology for Overlap and Imbalance: Snapshot, New Overlap Number of Balls Metrics and Singular Problems Prospect

    Data Science and Machine Learning have become fundamental assets for companies and research institutions alike. As one of their fields, supervised classification allows for class prediction of new samples, learning from given training data. However, some properties can make datasets problematic to classify. In order to evaluate a dataset a priori, data complexity metrics have been used extensively. They provide information regarding different intrinsic characteristics of the data, which serves to evaluate classifier compatibility and to choose a course of action that improves performance. However, most complexity metrics focus on just one characteristic of the data, which can be insufficient to properly evaluate the dataset with respect to classifier performance. In fact, class overlap, a very detrimental feature for the classification process (especially when imbalance among class labels is also present), is hard to assess. This research work focuses on revisiting complexity metrics based on data morphology. In accordance with their nature, the premise is that they provide both good estimates of class overlap and strong correlations with classification performance. For that purpose, a novel family of metrics has been developed. Being based on ball coverage by classes, they are named Overlap Number of Balls. Finally, some prospects for adapting this family of metrics to singular (more complex) problems are discussed. Comment: 23 pages, 9 figures, preprint

    Homophily Outlier Detection in Non-IID Categorical Data

    Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications where the outlierness of different entities is dependent on each other and/or taken from different probability distributions (non-IID). This may lead to failure to detect important outliers that are too subtle to be identified without considering the non-IID nature. The issue is further intensified in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and its two instances to identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines and incorporates distribution-sensitive outlier factors and their interdependence into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed here to capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection with two different existing detectors. Comment: To appear in the Data Mining and Knowledge Discovery journal

    Trustworthiness in Mobile Cyber Physical Systems

    Computing and communication capabilities are increasingly embedded in diverse objects and structures in the physical environment. They will link the ‘cyberworld’ of computing and communications with the physical world. These applications are called cyber physical systems (CPS). The increased involvement of real-world entities leads to a greater demand for trustworthy systems. Hence, we use the term "system trustworthiness" here for the ability to guarantee continuous service in the presence of internal errors or external attacks. Mobile CPS (MCPS) is a prominent subcategory of CPS in which the physical component has no permanent location. Mobile Internet devices already provide ubiquitous platforms for building novel MCPS applications. The objective of this Special Issue is to contribute to research in modern and future trustworthy MCPS, including design, modeling, simulation, dependability, and so on. It is imperative to address the issues that are critical to their mobility, report significant advances in the underlying science, and discuss the challenges of development and implementation in various applications of MCPS.

    Estudio de métodos de selección de instancias

    This thesis presents a study of instance selection techniques, analyzing the state of the art and developing new methods to cover some areas that had not yet received due attention. The first two chapters present new instance selection methods for regression, a topic little studied in the literature to date. The third chapter examines whether combining instance selection algorithms for regression yields better results than the individual methods on their own. The last chapter presents a novel idea: using locality-sensitive hashing functions to design two new instance selection algorithms for classification. The advantage of this solution is that both algorithms have linear complexity. The results of this thesis have been published in four articles in first-quartile JCR journals. Ministerio de Economía, Industria y Competitividad, the Junta de Castilla y León and the European Regional Development Fund, projects TIN 2011-24046, TIN 2015-67534-P (MINECO/FEDER) and BU085P17 (JCyL/FEDER)

    Multikonferenz Wirtschaftsinformatik (MKWI) 2016: Technische Universität Ilmenau, 09. - 11. März 2016; Band I

    Overview of the sub-conferences in Volume I: • 11th Conference on Mobility and Digitalization (MMS 2016) • Automated Process and Service Management • Business Intelligence, Analytics and Big Data • Computational Mobility, Transportation and Logistics • CSCW & Social Computing • Cyber-Physical Systems and Digital Value Creation Networks • Digitalization and Privacy • e-Commerce and e-Business • E-Government – Information and Communication Technologies in the Public Sector • E-Learning and Learning Service Engineering – Development, Use and Evaluation of Technology-Supported Teaching/Learning Processes