12 research outputs found

    Screening tools for data quality and outlier detection applied to the Airbase ambient air pollution database

    Long-term, meso- to large-scale datasets of ambient air quality are an indispensable means of model calibration, evaluation and validation, and therefore of providing scientifically sound information for regulatory purposes and environmental impact assessment. However, collecting high-quality datasets with suitable spatial coverage for air pollution management and decision support poses many challenges. It is thus critical to establish expedient tools for the efficient assessment and data quality control of air pollution measurements in large-scale national and international monitoring networks. The European Environment Agency collects, in the air quality database named AirBase, measurements of ambient air pollution at more than 6000 monitoring stations in over 30 countries. The quality of these data depends on the measurement method chosen and on the QA/QC procedures applied by each country. We present a methodology to automatically screen the AirBase records for internal consistency and to detect spatio-temporal outliers nested in the data. We implemented a spatial-set outlier detection method, which considers both attribute values and spatial relationships. Specifically, we adapted the “Smooth Spatial Attribute method”, which was developed for the identification of outliers in traffic sensors. The method relies on the definition of a neighbourhood for each air pollutant measurement, corresponding to a spatio-temporal domain limited in time (+/- 1 day) and distance (+/- 1 degree) around location x. Within such a domain, the attribute values of neighbours are assumed to be related through the emission, transport and reaction of air pollutants, so outliers are detected as values that are extreme compared to those of their neighbours. The implemented method can be of interest as a data quality screening system when countries report their measurements to the European Environment Agency. Beyond this, it could also provide a simple solution to investigate the accuracy of station classification in AirBase. JRC.H.2-Air and Climat
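
    The neighbourhood screening described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it simply compares each measurement with the mean of its neighbours inside the +/- 1 day, +/- 1 degree window and standardises the differences. The `Obs` record, the z-score threshold of 3 and the minimum of 3 neighbours are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Obs:
    station: str   # hypothetical station identifier
    lon: float     # longitude in degrees
    lat: float     # latitude in degrees
    day: int       # day index of the measurement
    value: float   # pollutant concentration


def neighbours(obs, target, max_days=1, max_deg=1.0):
    """Observations within +/- max_days and +/- max_deg of target (target excluded)."""
    return [
        o for o in obs
        if o is not target
        and abs(o.day - target.day) <= max_days
        and abs(o.lon - target.lon) <= max_deg
        and abs(o.lat - target.lat) <= max_deg
    ]


def flag_outliers(obs, threshold=3.0, min_neighbours=3):
    """Flag observations whose difference from the neighbourhood mean is extreme.

    threshold and min_neighbours are illustrative assumptions, not values
    taken from the abstract.
    """
    diffs = {}
    for i, o in enumerate(obs):
        nb = neighbours(obs, o)
        if len(nb) >= min_neighbours:
            diffs[i] = o.value - mean(n.value for n in nb)
    if len(diffs) < 2:
        return []
    mu = mean(diffs.values())
    sigma = pstdev(diffs.values())
    if sigma == 0:
        return []  # all differences identical: nothing stands out
    # Standardise the differences and keep the extreme ones.
    return [obs[i] for i, d in diffs.items() if abs(d - mu) / sigma > threshold]
```

    Calling `flag_outliers` on a list of `Obs` records returns the records judged inconsistent with their spatio-temporal neighbourhood; a production screening tool would likely use a more robust statistic than the mean.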

    Outlier Mining Methods Based on Graph Structure Analysis

    Outlier detection in high-dimensional datasets is a fundamental and challenging problem across disciplines. It also has practical implications, as removing outliers from the training set improves the performance of machine learning algorithms. While many outlier mining algorithms have been proposed in the literature, they tend to be valid or efficient only for specific types of datasets (time series, images, videos, etc.). Here we propose two methods that can be applied to generic datasets, as long as there is a meaningful measure of distance between pairs of elements of the dataset. Both methods start by defining a graph in which the nodes are the elements of the dataset and the links are weighted by the distances between the nodes. The first method assigns an outlier score based on the percolation (i.e., the fragmentation) of the graph. The second method uses the popular IsoMap non-linear dimensionality reduction algorithm and assigns an outlier score by comparing the geodesic distances with the distances in the reduced space. We test these algorithms on real and synthetic datasets and show that they either outperform or perform on par with other popular outlier detection methods. A main advantage of the percolation method is that it is parameter-free and therefore does not require any training; the IsoMap method, on the other hand, has two integer parameters, and when they are appropriately selected it performs similarly to or better than all the other methods tested. Peer Reviewed. Postprint (published version
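
    The abstract does not say how the percolation score is computed, so the following Python sketch is only one possible reading: links are removed from the longest to the shortest, and each node is scored by the distance threshold at which it first detaches from the largest connected component. The function names and the use of Euclidean distance are assumptions.

```python
from itertools import combinations
from math import dist


def component_roots(n, edges):
    """Label connected components with union-find; returns one root id per node."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i, j in edges:
        parent[find(i)] = find(j)
    return [find(a) for a in range(n)]


def percolation_scores(points):
    """Score each point by the link length at which it leaves the largest component.

    Assumption: a complete graph weighted by Euclidean distance; any other
    meaningful distance could be substituted.
    """
    n = len(points)
    weighted = [(dist(points[i], points[j]), i, j)
                for i, j in combinations(range(n), 2)]
    scores = [0.0] * n
    scored = [False] * n
    # Lower the threshold step by step, keeping only links shorter than t.
    for t in sorted({w for w, _, _ in weighted}, reverse=True):
        roots = component_roots(n, [(i, j) for w, i, j in weighted if w < t])
        giant = max(set(roots), key=roots.count)
        for a in range(n):
            if not scored[a] and roots[a] != giant:
                scores[a] = t   # detached early means a high score
                scored[a] = True
    return scores
```

    Points that detach while long links are still being removed receive high scores, and points in the dense core receive low ones. The sketch recomputes the components at every threshold, which is only practical for small datasets.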

    A robust hierarchical clustering for georeferenced data

    The detection of spatially contiguous clusters is a relevant task in geostatistics, since nearby observations tend to have more similar features than distant ones. Spatially compact groups also make clustering results easier to interpret in terms of the detected subregions. In this paper, we propose a robust metric approach to neutralize the effect of possible outliers: an exponential transformation of a dissimilarity measure between each pair of locations, based on a non-parametric kernel estimator of the direct and cross variograms (Fouedjio, 2016) and on a different bandwidth identification. The metric is suitable for agglomerative hierarchical clustering techniques applied to data indexed by geographical coordinates. Simulation results are very promising, showing very good performance of the proposed metric with respect to the baseline ones. Finally, the new clustering approach is applied to two real-world data sets, both giving locations and topsoil heavy metal concentrations

    Research on Outlier Detection Algorithm in Data Mining

    Outlier detection is a branch of data mining whose task is to identify observations whose characteristics differ significantly from the rest of the data. In everyday social life and in nature, most events and objects are ordinary or commonplace. We cannot, however, ignore the possibility that many unusual or extraordinary objects exist among them. The events behind these objects may hold greater research value and have broad application prospects, which makes outlier detection a very meaningful research direction. Researchers have already proposed many outlier detection methods, including statistics-based, frequency-based, depth-based, distance-based and density-based methods. This thesis analyses the research background, significance and current state of research on outlier detection in China and abroad, and studies distance-based outlier detec... Degree: Master of Engineering. Department and major: School of Software, Computer Software and Theory. Student ID: 2432011115227

    Estimation of the Measurement Uncertainty of Ambient Air Pollution Datasets Using Geostatistical Analysis

    We developed a methodology to automatically estimate the measurement uncertainty in the air pollution data sets of AIRBase. The figures produced with this method were consistent with expectations from laboratory and field estimation of uncertainty and with the Data Quality Objectives of the European Directives. The proposed method, based on geostatistical analysis, cannot estimate the measurement uncertainty directly: it estimates the nugget effect together with a micro-scale variability, which must be minimized by careful selection of the station type. Based on the results obtained so far, it is likely that measurement uncertainty is best estimated using all background stations, whatever their area type. So far the methodology has been used to estimate uncertainty in 4 different countries independently. This work should be continued for the whole of Europe, or for background stations without regard to national borders. The method has also been shown to be useful for comparing the spatial continuity of air pollution in different countries, which seems to be influenced by the topography of each country. Moreover, it may be used to quantify the trend of measurement uncertainty over long periods, such as a decade, making it possible to demonstrate improvement in the data quality of AIRBase datasets. With the implemented outlier detection module, which would also be of interest as a warning system when Member States report their measurements to the European Environment Agency, we have proposed an easy solution to investigate wrongly classified stations in AIRBase. JRC.DDG.H.4-Transport and air qualit
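
    As a rough illustration of the nugget effect mentioned in this abstract, the Python sketch below computes an empirical semivariogram, gamma(h) = sum((z_i - z_j)^2) / (2 N(h)) over the N(h) station pairs in each distance bin, and extrapolates it linearly to zero lag. The abstract does not state which variogram model the authors fitted, so the linear extrapolation, the binning and the function names are assumptions.

```python
from itertools import combinations
from math import dist


def empirical_semivariogram(coords, values, bin_width):
    """Return [(mean lag, gamma)] for each non-empty distance bin."""
    bins = {}  # bin index -> [sum of lags, sum of squared differences, pair count]
    for i, j in combinations(range(len(coords)), 2):
        h = dist(coords[i], coords[j])
        acc = bins.setdefault(int(h / bin_width), [0.0, 0.0, 0])
        acc[0] += h
        acc[1] += (values[i] - values[j]) ** 2
        acc[2] += 1
    return [(s_h / n, s_sq / (2 * n))
            for _, (s_h, s_sq, n) in sorted(bins.items())]


def nugget_estimate(variogram, n_bins=3):
    """Least-squares line through the first n_bins points, evaluated at h = 0.

    A simplifying assumption: real studies usually fit a spherical or
    exponential model instead of a straight line.
    """
    pts = variogram[:n_bins]
    if len(pts) < 2:
        raise ValueError("need at least two variogram points")
    h_bar = sum(h for h, _ in pts) / len(pts)
    g_bar = sum(g for _, g in pts) / len(pts)
    sxx = sum((h - h_bar) ** 2 for h, _ in pts)
    if sxx == 0:
        raise ValueError("variogram points must have distinct lags")
    slope = sum((h - h_bar) * (g - g_bar) for h, g in pts) / sxx
    return max(0.0, g_bar - slope * h_bar)  # a variance cannot be negative
```

    The intercept returned by `nugget_estimate` mixes measurement error with micro-scale variability, which is why the abstract stresses selecting stations of a suitable type before interpreting it as measurement uncertainty.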

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    The term "outlier" can generally be defined as an observation that is significantly different from the other values in a data set. Outliers may be instances of error or may indicate events. Outlier detection aims to identify such observations in order to improve the analysis of data and to discover interesting and useful knowledge about unusual events in numerous application domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets, and we provide a comprehensive taxonomy framework and two decision trees for selecting the most suitable technique for a given data set. Furthermore, we highlight the advantages, disadvantages and performance issues of each class of outlier detection techniques under this taxonomy framework

    On the Nature and Types of Anomalies: A Review

    Anomalies are occurrences in a dataset that are in some way unusual and do not fit the general patterns. The concept of the anomaly is generally ill-defined and perceived as vague and domain-dependent. Moreover, despite some 250 years of publications on the topic, no comprehensive and concrete overviews of the different types of anomalies have hitherto been published. By means of an extensive literature review this study therefore offers the first theoretically principled and domain-independent typology of data anomalies, and presents a full overview of anomaly types and subtypes. To concretely define the concept of the anomaly and its different manifestations, the typology employs five dimensions: data type, cardinality of relationship, anomaly level, data structure and data distribution. These fundamental and data-centric dimensions naturally yield 3 broad groups, 9 basic types and 61 subtypes of anomalies. The typology facilitates the evaluation of the functional capabilities of anomaly detection algorithms, contributes to explainable data science, and provides insights into relevant topics such as local versus global anomalies. Comment: 38 pages (30 pages content), 10 figures, 3 tables. Preprint; review comments will be appreciated. Improvements in version 2: Explicit mention of fifth anomaly dimension; Added section on explainable anomaly detection; Added section on variations on the anomaly concept; Various minor additions and improvement

    Machine learning methods for the characterization and classification of complex data

    This thesis presents novel methods for the analysis and classification of medical images and, more generally, complex data. First, an unsupervised machine learning method is proposed to order anterior chamber OCT (Optical Coherence Tomography) images according to a patient's risk of developing angle-closure glaucoma. In a second study, two outlier detection techniques are proposed to improve the results of that algorithm; they are also shown to be applicable to a wide variety of data, including fraud detection in credit card transactions. In a third study, the topology of the vascular network of the retina, treated as a complex tree-like network, is analyzed, and structural differences are shown to reveal the presence of glaucoma and diabetic retinopathy. In a fourth study, a model of a laser with optical injection, which exhibits extreme events in its intensity time series, is used to evaluate machine learning methods for forecasting such extreme events. Postprint (published version

    Evaluating Spatial Outliers And Integrating Temporal Data In Air Pollution Models For The Detroit-Windsor Airshed

    The heterogeneous nature of urban air complicates human exposure estimates and creates a need for accurate, highly detailed spatiotemporal air contaminant models. The study expands on previous investigations by the Geospatial Determinants of Health Outcomes Consortium that examined relationships between air pollutant distributions and asthma exacerbations. Two approaches, the removal of spatial data outliers and the integration of spatial and temporal data, were used to refine air quality models in the Detroit and Windsor international airshed. The evaluation of associations between the resulting air quality models and asthma exacerbations in Detroit and Windsor revealed weaker correlations with spatial outliers removed but improved correlations with the addition of temporal data. Recommendations for future work include increasing the spatial and temporal resolution of the asthma datasets and incorporating Windsor NAPS data through temporal scaling to help confirm the findings of the Detroit temporal scaling