4,242 research outputs found

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    The term "outlier" can generally be defined as an observation that is significantly different from the other values in a data set. The outliers may be instances of error or indicate events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and further discover interesting and useful knowledge about unusual events within numerous applications domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees to select the most suitable technique based on data set. Furthermore, we highlight the advantages, disadvantages and performance issues of each class of outlier detection techniques under this taxonomy framework

    ์‹œ๊ณต๊ฐ„ ์ž๋ฃŒ์˜ ๋‹ค์ค‘์ฒ™๋„ ๋ถ„์„

    Ph.D. dissertation, Department of Statistics, College of Natural Sciences, Seoul National University, August 2019 (advisor: Hee-Seok Oh). This thesis presents a multiscale analysis of spatio-temporal data and consists of three main chapters. First, we propose an enhancement of the lifting scheme, one of the popular multiscale methods, using clustering-based network design. The proposed method was originally developed for graph signal data, and simulation and real-data results show that it reconstructs noisy data better than the conventional lifting scheme. Moreover, its advantage is not limited to graph signal denoising: the proposed neighborhood selection can also be combined with the lifting one coefficient at a time (LOCAAT) algorithm, a lifting scheme frequently used in signal denoising. Second, we propose a new lifting scheme for streamflow data. The original lifting scheme cannot be applied to streamflow data directly because of the data's complex, directed network structure; to adapt it, we introduce flow-adaptive neighborhood selection, flow-proportional weight generation, and flow-length-adaptive removal point selection. With the proposed method we can construct a multiscale analysis of streamflow data, and a simulation study shows that its denoising performance is competitive. In addition, the proposed method can visualize the multiscale structure of the network by adding or subtracting observations. Third, a multiscale analysis of particulate matter (PM10) data in Seoul is provided as a case study. We propose a novel combination of multiscale analysis and extreme value theory, starting from the idea that every climate event has a spatial or temporal event length. By varying the event area and duration, we estimate multiple extreme value parameters using the generalized extreme value (GEV) distribution, and we introduce a new property, called the piecewise scaling property (a modified, knotted form of the intensity-duration-frequency curves used in hydrology), to combine the multiple GEV estimators into a single equation. Using the proposed method, we can construct a return level map for an arbitrary duration and event area.
    Contents: Abstract; 1 Introduction; 2 Review: Multiscale analysis (2.1 Wavelets; 2.1.1 Haar transforms; 2.2 Multiresolution analysis; 2.3 Lifting scheme; 2.3.1 Lifting one coefficient at a time (LOCAAT); 2.3.2 Other lifting scheme methods); 3 Enhancement of lifting scheme on graph signal data via clustering-based network design (3.1 Graph notations; 3.2 Previous works; 3.3 The use of clustering under the piecewise generalized moving average model; 3.3.1 Piecewise generalized moving average model; 3.3.2 Optimal UPA assignment under the piecewise homogeneous model; 3.3.3 Extension to the spatio-temporal data; 3.4 Simulation study; 3.4.1 Stochastic block model; 3.4.2 Image data analysis; 3.4.3 Blocks signal denoising; 3.5 Real data analysis; 3.6 Summary and discussion); 4 Streamflow lifting scheme (4.1 Dataset; 4.2 Streamflow lifting scheme; 4.2.1 Neighborhood selection; 4.2.2 Prediction filter construction; 4.2.3 Removal point selection; 4.3 Simulation study; 4.4 Real data analysis; 4.5 Summary and further works); 5 Multiscale analysis for PM10 extremes in Seoul (5.1 Data description; 5.2 Temporal analysis of Seoul extreme PM10 data; 5.2.1 Temporal aggregation and conventional scale property; 5.2.2 Temporal multiscale modeling and modified scaling property; 5.2.3 Result 1: GEV parameter estimation via piecewise linear approximation; 5.2.4 Result 2: Return level map by the proposed modified scaling approach; 5.3 Spatio-temporal multiscale analysis of Seoul extreme PM10 data; 5.3.1 Spatio-temporal aggregation of Seoul extreme PM10 data; 5.3.2 Result: Areal aggregation of Seoul extreme PM10 data; 5.4 Summary and discussion); 6 Concluding remarks; A Generalized extreme value distribution; B Scaling property theory; Abstract (in Korean).
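    To make the predict/update pattern of the lifting scheme concrete, here is a minimal, illustrative LOCAAT-style step on a toy signal. The unweighted prediction and the simple update weights below are assumptions for illustration only; the thesis derives these from the network structure (e.g., clustering-based neighborhoods and flow-proportional weights):

```python
import numpy as np

def locaat_step(values, i, neighbors):
    """One lifting step in the spirit of LOCAAT (illustrative sketch).

    Predict: estimate the removed point i from its neighbors (here an
    unweighted average -- a placeholder for the learned prediction filter).
    Update: push part of the residual back onto the neighbors so coarse
    coefficients stay a smoothed version of the signal (update weights here
    are a simple illustrative choice, not the LOCAAT integral weights).
    """
    predicted = np.mean([values[j] for j in neighbors])
    detail = values[i] - predicted                   # detail (wavelet) coefficient
    for j in neighbors:                              # update step
        values[j] += detail / (2 * len(neighbors))
    return detail

# Toy signal on a path graph: lift out point 2 using its two neighbors.
signal = {0: 1.0, 1: 1.2, 2: 5.0, 3: 1.1, 4: 0.9}
d2 = locaat_step(signal, i=2, neighbors=[1, 3])
print(d2, signal)  # a large detail flags point 2 as rough relative to its neighbors
```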

    Airborne LiDAR for DEM generation: some critical issues

    Airborne LiDAR is one of the most effective and reliable means of terrain data collection. Using LiDAR data for DEM generation is becoming standard practice in spatially related fields. However, effectively processing the raw LiDAR data and generating an efficient, high-quality DEM remain big challenges. This paper reviews recent advances in airborne LiDAR systems and the use of LiDAR data for DEM generation, with special focus on LiDAR data filters, interpolation methods, DEM resolution, and LiDAR data reduction. Separating LiDAR points into ground and non-ground points is the most critical and difficult step in DEM generation from LiDAR data, and both commonly used and recently developed LiDAR filtering methods are presented. Interpolation methods and the choice of a suitable interpolator and DEM resolution for LiDAR DEM generation are discussed in detail. To reduce data redundancy and increase efficiency in terms of storage and manipulation, LiDAR data reduction is required in the process of DEM generation. Feature-specific elements such as breaklines contribute significantly to DEM quality; therefore, data reduction should be conducted so that critical elements are kept while less important elements are removed. Given the high-density characteristic of LiDAR data, breaklines can be extracted directly from LiDAR data. The extraction of breaklines and their integration into DEM generation are presented.
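    As a concrete illustration of the interpolation stage, the sketch below grids hypothetical ground returns with inverse distance weighting (IDW), one interpolator commonly compared for LiDAR-derived DEMs. The coordinates, elevations, and grid are made-up example values, and the paper compares several interpolators rather than prescribing this one:

```python
import numpy as np

def idw_dem(points, elev, grid_x, grid_y, power=2.0, eps=1e-12):
    """Interpolate filtered ground points (x, y) with elevations `elev`
    onto a regular grid using inverse distance weighting."""
    xx, yy = np.meshgrid(grid_x, grid_y)
    dem = np.empty(xx.shape)
    for r in range(xx.shape[0]):
        for c in range(xx.shape[1]):
            d = np.hypot(points[:, 0] - xx[r, c], points[:, 1] - yy[r, c])
            w = 1.0 / (d ** power + eps)   # nearer points dominate; eps avoids 0-division
            dem[r, c] = np.sum(w * elev) / np.sum(w)
    return dem

# Hypothetical ground returns remaining after ground/non-ground filtering.
pts = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]], dtype=float)
z = np.array([100.0, 101.0, 100.5, 102.0, 100.8])
dem = idw_dem(pts, z, grid_x=np.linspace(0, 10, 5), grid_y=np.linspace(0, 10, 5))
print(dem.round(2))
```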

    Structural Generative Descriptions for Temporal Data

    In data mining problems the representation or description of data plays a fundamental role, since it defines the set of essential properties for the extraction and characterisation of patterns. However, for temporal data such as time series and data streams, one outstanding issue when developing mining algorithms is finding an appropriate data description or representation. In this thesis, two novel domain-independent representation frameworks for temporal data, suitable for off-line and online mining tasks, are formulated. First, a domain-independent temporal data representation framework is developed based on a novel data description strategy that combines structural and statistical pattern recognition approaches. The key idea is to move the structural pattern recognition problem to the probability domain. This framework is composed of three general tasks: a) decomposing input temporal patterns into subpatterns in time or in any other transformed domain (for instance, the wavelet domain); b) mapping these subpatterns into the probability domain to find attributes of elemental probability subpatterns called primitives; and c) mining input temporal patterns according to the attributes of their corresponding probability-domain subpatterns. This framework is referred to as Structural Generative Descriptions (SGDs). Two off-line and two online algorithmic instantiations of the SGDs framework are then formulated. For the off-line case, the first instantiation is based on the Discrete Wavelet Transform (DWT) and Wavelet Density Estimators (WDE), while the second combines the DWT with Finite Gaussian Mixtures. For the online case, the first instantiation relies on an online implementation of the DWT and a recursive version of WDE (RWDE), whereas the second is based on a multi-resolution exponentially weighted moving average filter and RWDE. The proposed SGDs-based algorithms are evaluated empirically in the context of time series classification for the off-line algorithms, and in the context of change detection and clustering for the online algorithms, using synthetic and publicly available real-world data. Additionally, a novel framework for multidimensional data stream evolution diagnosis is formulated, incorporating RWDE into the context of Velocity Density Estimation (VDE). Changes in streaming data and in their correlation structure are characterised by means of local and global evolution coefficients as well as recursive correlation coefficients. The proposed VDE framework is evaluated using temperature data from the UK and air pollution data from Hong Kong.
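    A minimal sketch of tasks (a) and (b) of the SGDs framework, under simplifying assumptions: one level of the Haar DWT stands in for the wavelet decomposition, and a single Gaussian's mean and standard deviation stand in for the WDE/mixture primitives the thesis actually uses:

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar DWT: split a signal of even length into
    approximation (scaled local averages) and detail (scaled local
    differences) subpatterns -- task (a) in a transformed domain."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

def gaussian_primitive(coeffs):
    """Task (b), simplified: summarise a subpattern in the probability
    domain by the parameters of a single fitted Gaussian."""
    return float(np.mean(coeffs)), float(np.std(coeffs))

t = np.linspace(0, 1, 64)
series = np.sin(2 * np.pi * 4 * t) + 0.1 * np.random.default_rng(0).standard_normal(64)
approx, detail = haar_dwt(series)
print(gaussian_primitive(detail))  # attributes of one probability-domain primitive
```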

    Density-based algorithms for active and anytime clustering

    Data-intensive applications in biology, medicine, and neuroscience require effective and efficient data mining technologies, while advanced data acquisition methods produce data of constantly increasing volume and complexity. As a consequence, the need for new data mining technologies that can deal with complex data has emerged during the last decades. In this thesis, we focus on the data mining task of clustering, in which objects are separated into groups (clusters) such that objects inside a cluster are more similar to each other than to objects in different clusters. In particular, we consider density-based clustering algorithms and their applications in biomedicine. The core idea of the density-based clustering algorithm DBSCAN is that each object within a cluster must have a certain number of other objects inside its neighborhood. Compared with other clustering algorithms, DBSCAN has many attractive properties: for example, it can detect clusters of arbitrary shape and is robust to outliers. DBSCAN has therefore attracted substantial research interest during the last decades, with many extensions and applications. In the first part of this thesis, we develop new algorithms based on the DBSCAN paradigm to deal with two challenges of complex data: expensive distance measures and incomplete availability of the distance matrix. 
    Like many other clustering algorithms, DBSCAN suffers from poor performance when facing expensive distance measures for complex data. To tackle this problem, we propose a new algorithm based on the DBSCAN paradigm, called Anytime Density-based Clustering (A-DBSCAN), which works in an anytime scheme: in contrast to the original batch scheme of DBSCAN, A-DBSCAN first produces a quick approximation of the clustering result and then continuously refines it while running. Experts can interrupt the algorithm, examine the results, and choose between (1) stopping the algorithm at any time, whenever they are satisfied with the result, to save runtime, and (2) continuing the algorithm to achieve better results. Such an anytime scheme has proven in the literature to be a very useful technique for time-consuming problems. We also introduce an extended version of A-DBSCAN, called A-DBSCAN-XS, which is more efficient and effective than A-DBSCAN when dealing with expensive distance measures. 
    Since DBSCAN relies on the cardinality of the neighborhood of objects, it requires the full distance matrix. For complex data, these distances are usually expensive, time-consuming, or even impossible to acquire due to high cost, high time complexity, or noisy and missing data. Motivated by these difficulties, we propose another DBSCAN-based approach, called Active Density-based Clustering (Act-DBSCAN). Given a budget limitation B, Act-DBSCAN is allowed to use only up to B pairwise distances and ideally produces the same result as if it had the entire distance matrix at hand. The general idea of Act-DBSCAN is to actively select the most promising pairs of objects for which to calculate distances, trying with each distance calculation to approximate the desired clustering result as closely as possible. This scheme provides an efficient way to reduce the total cost of performing the clustering, and thus limits the potential weakness of DBSCAN when dealing with the distance sparseness problem of complex data. 
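    For reference, here is a minimal sketch of the underlying DBSCAN paradigm (illustrative only, not the thesis's A-DBSCAN or Act-DBSCAN code). Note how it materializes the full distance matrix up front, which is precisely the cost the anytime and active variants are designed to avoid:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: a point is a core point if at least `min_pts` points
    (itself included) lie within `eps`; clusters grow by expanding from core
    points, and unreachable points keep the label -1 (noise)."""
    n = len(X)
    # Full pairwise distance matrix -- the requirement A-DBSCAN/Act-DBSCAN relax.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)          # -1 = noise / not yet assigned
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                 # already assigned, or not a core point
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:              # expand the cluster from core points only
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2)), [[2.5, 2.5]]])
print(dbscan(X, eps=1.0, min_pts=5))  # two clusters plus one noise point
```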
As a fundamental data clustering algorithm, density-based clustering has many applications in diverse fields. In the second part of this thesis, we focus on an application of density-based clustering in neuroscience: the segmentation of white matter fiber tracts in the human brain acquired from Diffusion Tensor Imaging (DTI). We propose a model that evaluates the similarity between two fibers as a combination of structural similarity and connectivity-related similarity of fiber tracts. Various distance measure techniques from fields such as time-sequence mining are adapted to calculate the structural similarity of fibers, and density-based clustering is used as the segmentation algorithm. We show how A-DBSCAN and A-DBSCAN-XS serve as novel solutions for the segmentation of massive fiber datasets and provide unique features to assist experts during the fiber segmentation process.
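    A toy sketch of the shape of such a combined similarity, under loudly stated assumptions: the alpha-blend, the mean-closest-point structural distance, and the Jaccard overlap of hypothetical endpoint-region labels are illustrative stand-ins, not the thesis's actual model:

```python
import numpy as np

def mean_closest_point(f1, f2):
    """Symmetric mean-closest-point distance between two fibers, each an
    (n, 3) array of 3-D points -- one classic structural fiber distance."""
    d = np.linalg.norm(f1[:, None, :] - f2[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

def fiber_similarity(f1, f2, regions1, regions2, alpha=0.5, scale=10.0):
    """Blend structural similarity (geometry) with connectivity-related
    similarity (here: Jaccard overlap of hypothetical region labels)."""
    structural = np.exp(-mean_closest_point(f1, f2) / scale)   # map distance to (0, 1]
    connectivity = len(regions1 & regions2) / len(regions1 | regions2)
    return alpha * structural + (1 - alpha) * connectivity

fib_a = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0]], dtype=float)
fib_b = np.array([[0, 1, 0], [1, 1, 0], [2, 1, 0]], dtype=float)
print(fiber_similarity(fib_a, fib_b, {"M1", "thalamus"}, {"M1", "pons"}))
```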
    • โ€ฆ
    corecore