
    A systematic review of data quality issues in knowledge discovery tasks

    The volume of data is growing rapidly because organizations continuously capture data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of this data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.

    On the role of pre and post-processing in environmental data mining

    The quality of discovered knowledge is highly dependent on data quality. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. The more complex the reality to be analyzed, the higher the risk of getting low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework to prepare data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. In this paper some details about how this can be achieved are provided. The role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems is discussed.

    Graph based Anomaly Detection and Description: A Survey

    Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, graph data has become ubiquitous, and techniques for structured graph data have recently come into focus. As objects in graphs have long-range correlations, a suite of novel techniques has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, for static vs. dynamic graphs, for attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the 'why', of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.
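    One of the simplest unsupervised settings the survey covers is anomaly detection on plain, static graphs. A minimal sketch, using only standard-library Python and a hypothetical robust scoring rule (median absolute deviation over node degrees; function name and threshold are illustrative, not from the survey):

    ```python
    from statistics import median

    def degree_anomalies(edges, threshold=3.0):
        """Flag nodes whose degree deviates strongly from the median degree.

        Each node is scored by its absolute deviation from the median
        degree, scaled by the median absolute deviation (MAD); nodes
        scoring above `threshold` are returned as anomalies.
        """
        degree = {}
        for u, v in edges:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        degs = list(degree.values())
        med = median(degs)
        mad = median(abs(d - med) for d in degs) or 1.0
        return {n: abs(d - med) / mad for n, d in degree.items()
                if abs(d - med) / mad > threshold}

    # A star graph: the hub is an obvious structural anomaly.
    edges = [("hub", f"n{i}") for i in range(10)]
    print(degree_anomalies(edges))  # only "hub" is flagged
    ```

    Real graph anomaly detectors use far richer structure (egonets, communities, attributes), but the unsupervised score-and-threshold pattern is the same.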

    Anomaly Handling in Visual Analytics

    Visual analytics is an emerging field which uses visual techniques to interact with users in the analytical reasoning process. Users can choose the most appropriate representation that conveys the important content of their data by acting upon different visual displays. The data itself has many features of interest, including clusters, trends (commonalities) and anomalies. Most visualization techniques currently focus on the discovery of trends and other relations, where uncommon phenomena are treated as outliers and are either removed from the datasets or de-emphasized on the visual displays. Much less work has been done on the visual analysis of outliers, or anomalies. In this thesis, I will introduce a method to identify the different levels of "outlierness" by using interactive selection and other approaches to process outliers after detection. In one approach, the values of these outliers will be estimated from the values of their k-Nearest Neighbors and replaced to increase the consistency of the whole dataset. Other approaches will leave users with the choice of removing the outliers from the graphs or highlighting the unusual patterns on the graphs if points of interest lie in these anomalous regions. I will develop and test these anomaly handling methods within the XMDV Tool.
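    The replacement idea described above can be sketched in a few lines: detect outliers, then re-estimate each one from its k nearest inlier neighbours. The function name, the z-score detection rule, and the thresholds below are illustrative assumptions, not taken from the XMDV Tool:

    ```python
    def knn_replace_outliers(values, k=3, z=2.0):
        """Replace outliers with the mean of their k nearest inliers.

        Points more than `z` standard deviations from the mean are
        treated as outliers and re-estimated from the `k` inlier values
        closest to them, increasing the consistency of the dataset.
        """
        n = len(values)
        mean = sum(values) / n
        std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1.0
        inliers = [v for v in values if abs(v - mean) <= z * std]
        cleaned = []
        for v in values:
            if abs(v - mean) > z * std:
                nearest = sorted(inliers, key=lambda w: abs(w - v))[:k]
                cleaned.append(sum(nearest) / len(nearest))
            else:
                cleaned.append(v)
        return cleaned

    data = [10.0, 11.0, 9.5, 10.5, 10.2, 95.0]  # 95.0 is an outlier
    print(knn_replace_outliers(data))
    ```

    In an interactive tool the detection step would instead come from the user's selection, and the neighbours would be found in the full multi-dimensional space rather than on a single column.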

    How to evaluate a subspace visual projection in interactive visual systems? A position paper

    This paper presents a position paper on subspace projection evaluation methods in interactive visual systems. We focus on how to evaluate real information rendered through the visual data projection for the mining of high-dimensional data sets. To do this, we investigate automatic techniques that select the best visual projection, and we discuss how they evaluate the projections to help the user before interactivity. When we deal with high-dimensional data sets, the number of potential projections exceeds the limit of human interpretation. To find the optimal subspace representation, there are two possibilities: the first one is to find the optimal subspace which reproduces what really exists in the original data, getting the existing clusters and/or outliers in the projection. The second possibility consists in researching subspaces according to the knowledge discovery process: discovering novel, but meaningful information, such as clusters and/or outliers, from the projection. The problem is that the visual projection may not be in agreement with the subspaces. In some cases, the visual projection can show things that do not really exist in the original data space (which can be considered an artifact). The mapping between the visual structure and the real data structure is as important as the efficiency and accuracy of the visualization. We examine and discuss the literature of the information visualization, visual analytics, high-dimensional data visualization, and interactive data mining and machine learning communities on how to evaluate the faithfulness of the visual projection information.
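    One common way to quantify the faithfulness of a projection is to check how well local neighbourhoods survive the mapping. A hedged sketch (the metric name and scoring are illustrative, not the paper's method): compare each point's k-nearest-neighbour set in the original space and in the 2-D visual space; a score of 1.0 means local structure is preserved, and lower scores signal possible artifacts.

    ```python
    def neighbour_preservation(original, projected, k=2):
        """Fraction of k-nearest-neighbour pairs preserved by a projection."""
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

        def knn(points, i):
            order = sorted(range(len(points)),
                           key=lambda j: dist(points[i], points[j]))
            return set(order[1:k + 1])  # skip the point itself

        n = len(original)
        overlap = sum(len(knn(original, i) & knn(projected, i))
                      for i in range(n))
        return overlap / (n * k)

    # 3-D points and a projection that drops the near-constant z axis:
    orig = [(0, 0, 0.1), (1, 0, 0.0), (0, 1, 0.1), (5, 5, 0.0)]
    proj = [(x, y) for x, y, _ in orig]
    print(neighbour_preservation(orig, proj, k=2))  # → 1.0
    ```

    A projection that collapsed distant points together would score lower, flagging exactly the "shows things that do not exist" artifact discussed above.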

    A survey on pre-processing techniques: relevant issues in the context of environmental data mining

    One of the important issues related to all types of data analysis, whether statistical data analysis, machine learning, data mining, data science or any other form of data-driven modeling, is data quality. The more complex the reality to be analyzed is, the higher the risk of getting low-quality data. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. Useless models will be obtained when built over incorrect or incomplete data. As a consequence, the quality of decisions made over these models also depends on data quality. This is why pre-processing is one of the most critical steps of data analysis in any of its forms. However, pre-processing has not been properly systematized yet, and little research is focused on this. In this paper a survey on the most popular pre-processing steps required in environmental data analysis is presented, together with a proposal to systematize it. Rather than providing technical details on specific pre-processing techniques, the paper focuses on providing general ideas to a non-expert user, who, after reading them, can decide which is the most suitable technique required to solve his/her problem.
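    Two of the most common pre-processing steps any such systematization covers are duplicate removal and missing-value imputation. A minimal sketch under assumed conventions (missing values encoded as `None`, mean imputation per column; real environmental pipelines would add outlier and noise treatment):

    ```python
    def preprocess(rows):
        """Drop exact duplicate rows, then impute None with the column mean."""
        # 1. Remove verbatim duplicates, keeping first occurrences.
        seen, deduped = set(), []
        for row in rows:
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                deduped.append(list(row))
        # 2. Impute missing values with the per-column mean of known values.
        for col in range(len(deduped[0])):
            known = [r[col] for r in deduped if r[col] is not None]
            mean = sum(known) / len(known)
            for r in deduped:
                if r[col] is None:
                    r[col] = mean
        return deduped

    raw = [[1.0, 2.0], [1.0, 2.0], [None, 4.0], [3.0, None]]
    print(preprocess(raw))  # → [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]
    ```

    The point of systematizing pre-processing is precisely that the order of such steps matters: imputing before de-duplicating, for instance, could turn distinct incomplete rows into spurious duplicates.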