5,055 research outputs found

    A systematic review of data quality issues in knowledge discovery tasks

    Get PDF
    Hay un gran crecimiento en el volumen de datos porque las organizaciones capturan permanentemente la cantidad colectiva de datos para lograr un mejor proceso de toma de decisiones. El desafío mas fundamental es la exploración de los grandes volúmenes de datos y la extracción de conocimiento útil para futuras acciones por medio de tareas para el descubrimiento del conocimiento; sin embargo, muchos datos presentan mala calidad. Presentamos una revisión sistemática de los asuntos de calidad de datos en las áreas del descubrimiento de conocimiento y un estudio de caso aplicado a la enfermedad agrícola conocida como la roya del café.Large volume of data is growing because the organizations are continuously capturing the collective amount of data for better decision-making process. The most fundamental challenge is to explore the large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks, nevertheless many data has poor quality. We presented a systematic review of the data quality issues in knowledge discovery tasks and a case study applied to agricultural disease named coffee rust

    Towards a Hierarchical Approach for Outlier Detection in Industrial Production Settings

    Get PDF
    In the context of Industry 4.0, the degree of cross-linking between machines, sensors, and production lines increases rapidly.However, this trend also offers the potential for the improve-ment of outlier scores, especially by combining outlier detectioninformation between different production levels. The latter, in turn, offer various other useful aspects like different time series resolutions or context variables. When utilizing these aspects, valuable outlier information can be extracted, which can be then used for condition-based monitoring, alert management, or predictive maintenance. In this work, we compare different types of outlier detection methods and scores in the light of the aforementioned production levels with the goal to develop a modelfor outlier detection that incorporates these production levels.The proposed model, in turn, is basically inspired by a use casefrom the field of additive manufacturing, which is also known asindustrial 3D-printing. Altogether, our model shall improve the detection of outliers by the use of a hierarchical structure that utilizes production levels in industrial scenarios

    Understanding Graph Data Through Deep Learning Lens

    Get PDF
    Deep neural network models have established themselves as an unparalleled force in the domains of vision, speech and text processing applications in recent years. However, graphs have formed a significant component of data analytics including applications in Internet of Things, social networks, pharmaceuticals and bioinformatics. An important characteristic of these deep learning techniques is their ability to learn the important features which are necessary to excel at a given task, unlike traditional machine learning algorithms which are dependent on handcrafted features. However, there have been comparatively fewer e�orts in deep learning to directly work on graph inputs. Various real-world problems can be easily solved by posing them as a graph analysis problem. Considering the direct impact of the success of graph analysis on business outcomes, importance of studying these complex graph data has increased exponentially over the years. In this thesis, we address three contributions towards understanding graph data: (i) The first contribution seeks to find anomalies in graphs using graphical models; (ii) The second contribution uses deep learning with spatio-temporal random walks to learn representations of graph trajectories (paths) and shows great promise on standard graph datasets; and (iii) The third contribution seeks to propose a novel deep neural network that implicitly models attention to allow for interpretation of graph classification.

    Community-based Outlier Detection for Edge-attributed Graphs

    Get PDF
    The study of networks has emerged in diverse disciplines as a means of analyzing complex relationship data. Beyond graph analysis tasks like graph query processing, link analysis, influence propagation, there has recently been some work in the area of outlier detection for information network data. Although various kinds of outliers have been studied for graph data, there is not much work on anomaly detection from edge-attributed graphs. In this paper, we introduce a method that detects novel outlier graph nodes by taking into account the node data and edge data simultaneously to detect anomalies. We model the problem as a community detection task, where outliers form a separate community. We propose a method that uses a probabilistic graph model (Hidden Markov Random Field) for joint modeling of nodes and edges in the network to compute Holistic Community Outliers (HCOutliers). Thus, our model presents a natural setting for heterogeneous graphs that have multiple edges/relationships between two nodes. EM (Expectation Maximization) is used to learn model parameters, and infer hidden community labels. Experimental results on synthetic datasets and the DBLP dataset show the effectiveness of our approach for finding novel outliers from networks

    On the Nature and Types of Anomalies: A Review

    Full text link
    Anomalies are occurrences in a dataset that are in some way unusual and do not fit the general patterns. The concept of the anomaly is generally ill-defined and perceived as vague and domain-dependent. Moreover, despite some 250 years of publications on the topic, no comprehensive and concrete overviews of the different types of anomalies have hitherto been published. By means of an extensive literature review this study therefore offers the first theoretically principled and domain-independent typology of data anomalies, and presents a full overview of anomaly types and subtypes. To concretely define the concept of the anomaly and its different manifestations, the typology employs five dimensions: data type, cardinality of relationship, anomaly level, data structure and data distribution. These fundamental and data-centric dimensions naturally yield 3 broad groups, 9 basic types and 61 subtypes of anomalies. The typology facilitates the evaluation of the functional capabilities of anomaly detection algorithms, contributes to explainable data science, and provides insights into relevant topics such as local versus global anomalies.Comment: 38 pages (30 pages content), 10 figures, 3 tables. Preprint; review comments will be appreciated. Improvements in version 2: Explicit mention of fifth anomaly dimension; Added section on explainable anomaly detection; Added section on variations on the anomaly concept; Various minor additions and improvement

    Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications

    Get PDF
    Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. The real data are frequently affected by outliers, uncertain labels, and uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling an even more difficult task. In the present work, we introduce a cost-sensitive learning method (CSL) to deal with the classification of imperfect data. Typically, most traditional approaches for classification demonstrate poor performance in an environment with imperfect data. We propose the use of CSL with Support Vector Machine, which is a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the best performance measures to tackle imperfect data along with addressing real problems in quality control and business analytics
    corecore