5,502 research outputs found

    Pattern mining under different conditions

    Get PDF
Modern applications place new requirements and demands on pattern mining that cannot be fulfilled by conventional methods. For example, in scientific research, scientists are more interested in unknown knowledge, which usually hides in significant but infrequent patterns, whereas existing itemset mining algorithms are designed for very frequent patterns. Furthermore, scientists need to repeat an experiment many times to ensure reproducibility, so a series of datasets is generated at once, waiting for clustering; these datasets can contain an unknown number of clusters with various densities and shapes. Using existing clustering algorithms is time-consuming because parameters must be tuned for each dataset. Many scientific datasets are also extremely noisy, containing considerably more noise points than in-cluster data points, while most existing clustering algorithms can only handle noise up to a moderate level. Temporal pattern mining is likewise important in scientific research, yet existing temporal pattern mining algorithms only consider point-based events, whereas most real-world activities are interval-based, with a starting and an ending timestamp. This thesis develops novel pattern mining algorithms for various data mining tasks under these different conditions. The first part investigates the problem of mining less frequent itemsets in transactional datasets. In contrast to existing frequent itemset mining algorithms, this part focuses on itemsets that occur less frequently. The algorithms NIIMiner, RaCloMiner, and LSCMiner are proposed to identify such itemsets efficiently. NIIMiner uses a negative itemset tree to extract, in a top-down depth-first manner, all patterns that occur less often than a given support threshold. RaCloMiner combines existing bottom-up frequent itemset mining algorithms with a top-down itemset mining algorithm to achieve better performance in mining less frequent patterns. LSCMiner addresses the problem of mining less frequent closed patterns. The second part studies interval-based temporal pattern mining in a streaming environment. Interval-based temporal patterns are sequential patterns in which each event carries both a starting and an ending timestamp. Existing approaches lack the ability to handle interval-based events and stream data; a novel interval-based temporal pattern mining algorithm for stream data is described in this part. The last part studies new problems in clustering numeric datasets. The first problem tackled is shape alternation adaptivity in clustering: in applications such as scientific data analysis, scientists deal with a series of datasets generated from one experiment in which cluster sizes and shapes differ. A kNN density-based clustering algorithm, kadaClus, is proposed to provide shape alternation adaptability so that users do not need to tune parameters for each dataset. The second problem is clustering in extremely noisy datasets, which contain considerably more noise points than in-cluster data points. A novel clustering algorithm, kenClus, is proposed to identify clusters of arbitrary shape in extremely noisy datasets. Both clustering algorithms are kNN-based and require only one parameter, k. In each part, the efficiency and effectiveness of the presented techniques are thoroughly analyzed, and intensive experiments on synthetic and real-world datasets demonstrate the benefits of the proposed algorithms over conventional approaches.
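    A minimal, hedged sketch of the core notion of a "less frequent" itemset: enumerate candidate itemsets and keep those whose support falls below a maximum-support threshold while still occurring at least once. The brute-force enumeration and parameter names here are illustrative only; they do not reproduce NIIMiner's negative itemset tree or the other algorithms described above.

```python
# Illustrative only: naive enumeration of itemsets that occur, but less often
# than a maximum-support threshold (NOT the NIIMiner/RaCloMiner strategy).
from itertools import combinations

def less_frequent_itemsets(transactions, max_support, min_count=1, max_size=3):
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            # keep itemsets that appear at least min_count times
            # but whose relative support stays below max_support
            if min_count <= count and count / n < max_support:
                result[candidate] = count
    return result

if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c"}]
    print(less_frequent_itemsets(data, max_support=0.5))
```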

    Genetic conservation strategies of endemic plants from edaphic habitat islands: The case of Jacobaea auricula (Asteraceae)

    Get PDF
This work was partially supported by the Spanish Ministerio de Economía y Competitividad through the projects CGL2010-16357 and CGL2012-32574. E. Salmerón-Sánchez was supported by the University of Almería through the projects 'Assessment, Monitoring and Applied Scientific Research for Ecological Restoration of Gypsum Mining Concessions (Majadas Viejas and Marylen) and Spreading of Results (ECORESGYP)', sponsored by the company EXPLOTACIONES RIO DE AGUAS S.L. (TORRALBA GROUP), and 'Provision of services, monitoring and evaluation of the environmental restoration of the mining concessions Los Yesares, Maria Morales and El Cigarron', sponsored by the company Saint Gobain Placo Iberica. We would like to thank M. Montserrat Martínez-Ortega for help with field work and initial analyses, and M. Montserrat Martínez-Ortega, Luz M. Muñoz-Centeno, Fabián Martínez-Hernández, Sara Barrios and Teresa Malvar for their participation in DNA extractions, molecular analyses and, in general, for the help provided. We also thank Sara Barrios, María Santos, Santiago Andrés, Blas Benito and Antonio Abad for their help in the collection of plant material. Finally, we are thankful to Francisco J. Pérez-García for his valuable comments concerning halogypsophyte species. Conservation genetics is a well-established and essential scientific field in the toolkit of conservation planning, management, and decision-making. Within its framework, phylogeography allows the definition of conservation strategies, especially for threatened endemic plants. Gypsum and salt-rich outcrops constitute a model example of an edaphic island-like habitat and contain rare and endemic species, many of them threatened. This is the case of Jacobaea auricula, an Iberian gypsohalophytic species of biological, ecological, and conservation interest. Genetic-based criteria were used to preserve the highest possible percentage of the species' genetic pool and to provide a set of genotypes for translocation and/or reinforcement planning of degraded populations. Relevant Genetic Units for Conservation (RGUCs) were selected for in situ conservation planning. As a complementary ex situ measure, the optimal contribution of the populations to maximizing the genetic pool within each genetic cluster was calculated. To preserve the maximum genetic diversity and the highest possible percentage of rare AFLP bands, eight RGUCs were selected; the ex situ conservation design included twenty-one populations, gathering all haplotypes and ribotypes. Our genetic conservation proposal for J. auricula would improve the implementation of future genetic conservation measures, with J. auricula serving as a model species for endemic plants from edaphic habitat islands.

    XTadGAN: Generative Adversarial Networks to Detect Extremely Rare Anomalies

    Get PDF
Anomaly detection in time series data is critical for identifying fraudulent activities, detecting process failures, and monitoring the health of complex systems. Generative Adversarial Networks (GANs) have recently shown promising results in this domain, outperforming traditional as well as more recent machine learning approaches. However, all of these methods struggle with extremely rare anomalies. This thesis aims to modify and extend the TadGAN model and to investigate the potential of this approach to better detect extremely rare anomalies (XTadGAN). Furthermore, we argue that there is no systematic methodology for assessing and comparing the performance of different anomaly detection methods, specifically with respect to their sensitivity to variations in the frequency of anomalies. Therefore, this thesis also explores the development of a sensitivity index for increasing orders of anomaly rarity, to be applied to our proposed extended model and to other benchmark methods. The developed work will make a valuable contribution to the field of anomaly detection by introducing a robust and accurate framework for comparing the performance of different approaches, hopefully filling a crucial gap in current research. The sensitivity index proposed in this study is significant in that it provides a robust metric that can be used to conduct standardized comparison tests, better understand the strengths and limitations of each model, and guide future research towards improved performance in real-world applications. Moreover, the proposed analysis will shed light on how GANs in particular, and other methods in general, can be optimized to more accurately detect extremely rare anomalies in time series data.
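    The dissertation's sensitivity index is not specified here, so the following is only a hypothetical sketch of the idea: evaluate a fitted detector (any object with a `predict` method returning 0/1 labels) at increasing levels of anomaly rarity and summarize the relative degradation in F1. The function name, rarity levels, and the index formula (mean relative F1 drop) are assumptions for illustration, not the thesis's actual definition.

```python
# Hypothetical rarity-sensitivity sketch; assumes X, y are NumPy arrays and
# `detector` exposes predict() -> 0/1 labels. Not the dissertation's formula.
import numpy as np
from sklearn.metrics import f1_score

def rarity_sensitivity(detector, X, y, rarity_levels=(0.05, 0.01, 0.002)):
    rng = np.random.default_rng(0)
    normal_idx = np.flatnonzero(y == 0)
    anom_idx = np.flatnonzero(y == 1)
    scores = []
    for rate in rarity_levels:
        # subsample anomalies so they make up roughly `rate` of the eval set
        n_anom = max(1, int(rate * len(normal_idx) / (1 - rate)))
        keep = rng.choice(anom_idx, size=min(n_anom, len(anom_idx)), replace=False)
        idx = np.concatenate([normal_idx, keep])
        scores.append(f1_score(y[idx], detector.predict(X[idx])))
    base = scores[0] if scores[0] > 0 else 1e-9
    # index: average relative F1 degradation w.r.t. the least-rare setting
    return float(np.mean([(base - s) / base for s in scores[1:]]))
```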

    Understanding and supporting pricing decisions using multicriteria decision analysis: an application to antique silver in South Africa

    Get PDF
This dissertation presents an application of multicriteria decision analysis to understand and support pricing decisions in fields where goods are unique and described by their characteristics. The specific application area of this research is antique silver objects, for which a complete iteration of the multicriteria decision process is performed. This includes two problem structurings using SODA, which provide rich detail on this application area. Multi-attribute additive models are constructed, with attribute partial value functions elicited using different methods: directly (bisection methods), indirectly (MACBETH and linear interpolation), and with discrete choice experiments. The applicability and advantages of each method are discussed. Additionally, an open-source R package to implement the design of discrete choice experiments is created. The multi-attribute models provide key insights into decision makers' reasoning about price, and contrasting different decision makers' models explains the market. A risk-averse relationship between multicriteria model score and price is characterised, and various inverse utility functions are investigated. Two decision support systems are fully developed to address the needs of Cape silver decision makers in South Africa.
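    As a rough illustration of the multi-attribute additive model family used above, the sketch below aggregates piecewise-linear partial value functions with weights. The attributes (age_years, weight_g, maker_score), breakpoints, and weights are hypothetical stand-ins, not the dissertation's elicited MACBETH or discrete-choice results.

```python
# Minimal additive value model sketch: v(item) = sum_i w_i * v_i(x_i).
# All attribute names, breakpoints, and weights below are invented examples.
import numpy as np

# partial value functions: attribute level -> value in [0, 1],
# represented as (breakpoints, values) and evaluated by linear interpolation
PARTIAL_VALUES = {
    "age_years":   ([0, 100, 200, 300], [0.0, 0.4, 0.8, 1.0]),
    "weight_g":    ([50, 500, 2000],    [0.0, 0.6, 1.0]),
    "maker_score": ([0, 1, 2, 3],       [0.0, 0.3, 0.7, 1.0]),
}
WEIGHTS = {"age_years": 0.4, "weight_g": 0.25, "maker_score": 0.35}

def overall_value(item):
    """Additive aggregation of weighted partial values."""
    total = 0.0
    for attr, (xs, vs) in PARTIAL_VALUES.items():
        total += WEIGHTS[attr] * np.interp(item[attr], xs, vs)
    return total

print(overall_value({"age_years": 250, "weight_g": 800, "maker_score": 2}))
```

    In practice the model score would then be related to observed prices, e.g. through the inverse utility functions the dissertation investigates.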

    A Comprehensive Survey on Rare Event Prediction

    Full text link
Rare event prediction involves identifying and forecasting events with a low probability using machine learning and data analysis. Because of imbalanced data distributions, where the frequency of common events vastly outweighs that of rare events, it requires specialized methods within each step of the machine learning pipeline, i.e., from data processing to algorithms to evaluation protocols. Predicting the occurrence of rare events is important for real-world applications, such as Industry 4.0, and is an active research area in statistics and machine learning. This paper comprehensively reviews the current approaches to rare event prediction along four dimensions: rare event data, data processing, algorithmic approaches, and evaluation approaches. Specifically, we consider 73 datasets from different modalities (i.e., numerical, image, text, and audio), four major categories of data processing, five major algorithmic groupings, and two broader evaluation approaches. This paper aims to identify gaps in the current literature and highlight the challenges of predicting rare events. It also suggests potential research directions, which can help guide practitioners and researchers. Comment: 44 pages
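    As a concrete, hedged illustration of two pipeline steps of the kind the survey covers, the sketch below trains a class-weighted classifier on a synthetic, heavily imbalanced dataset and reports precision/recall and PR-AUC rather than plain accuracy. It is a generic example and does not reproduce any of the surveyed datasets or benchmarks.

```python
# Generic imbalanced-learning sketch: class weighting at training time and
# rare-event-aware evaluation (precision/recall, PR-AUC) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score, classification_report

# synthetic data: roughly 0.5% positive (rare) class
X, y = make_classification(n_samples=20000, weights=[0.995, 0.005], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

print(classification_report(y_te, clf.predict(X_te), digits=3))
print("PR-AUC:", average_precision_score(y_te, proba))
```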

    Predicate based association rules mining with new interestingness measure

    Get PDF
Association Rule Mining (ARM) is one of the fundamental components of data mining: it discovers frequent itemsets and interesting relationships for predicting the associative and correlative behaviour of new data. However, traditional ARM techniques are based on the support-confidence framework, which discovers interesting association rules (ARs) using predefined minimum support (minsupp) and minimum confidence (minconf) thresholds. In addition, traditional AR techniques only consider frequent items while ignoring rare ones. Thus, a new parameter-less predicate-based ARM technique was proposed to address these limitations, enhanced to handle frequent and rare items at the same time. Furthermore, a new interestingness measure, called the g measure, was developed to select only highly interesting rules. In this proposed technique, interesting combinations were first selected by considering both frequent and rare items from a dataset. They were then mapped to pseudo-implications using predefined logical conditions. Inference rules were then used to validate the pseudo-implications and discover rules within the set of mapped pseudo-implications. The resulting set of interesting rules is referred to as the predicate-based association rules. Zoo, breast cancer, and car evaluation datasets were used for the experiments. The results were evaluated by comparison with various classification techniques, a traditional ARM technique, and the coherent rule mining technique. The predicate-based rule mining approach achieved an accuracy of 93.33%. In addition, the results of the g measure were compared with a state-of-the-art interestingness measure developed for coherent rule mining, called the h value. Predicate rules were discovered with an average confidence of 0.754 for the zoo dataset and 0.949 for the breast cancer dataset, while the average confidence of the predicate rules found in the car evaluation dataset was 0.582. The results show that a set of interesting and highly reliable rules was discovered, including frequent, rare, and negative association rules with high confidence values. This research produced a rule mining methodology that does not rely on the minsupp and minconf thresholds, and the proposed technique discovers a complete set of association rules. Finally, the interestingness-measure-based selection of combinations from datasets makes it possible to reduce the exponential search for rules.
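    For contrast with the parameter-less approach described above, here is a minimal sketch of the classical support-confidence computation that traditional ARM relies on. The proposed predicate mapping, pseudo-implications, and g measure are not reproduced; the toy transactions and the rule are invented for illustration.

```python
# Classical support and confidence for a candidate rule A -> B over a toy
# transactional dataset (the baseline the proposed technique moves beyond).
def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    supp_a = support(antecedent, transactions)
    return support(antecedent | consequent, transactions) / supp_a if supp_a else 0.0

transactions = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread"}, {"milk", "eggs"}]
rule = (frozenset({"milk"}), frozenset({"bread"}))   # candidate rule milk -> bread
print("support:", support(rule[0] | rule[1], transactions))
print("confidence:", confidence(*rule, transactions))
```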

    Geoinformatic methodologies and quantitative tools for detecting hotspots and for multicriteria ranking and prioritization: application on biodiversity monitoring and conservation

    Get PDF
Whoever is responsible for managing a conservation zone must not only be aware of its environmental problems but should also have at their disposal updated databases and appropriate methodological instruments to examine each individual case carefully. In effect, they have to arrange in advance the steps necessary to withstand foreseeable variations in the trends of human pressure on conservation zones. The essential objective of this thesis is methodological: to compare different multivariate statistical methods useful for environmental hotspot detection and for environmental prioritization and ranking. The general environmental goal is the conservation of the biodiversity patrimony. The identification, through multidimensional statistical tools, of habitats with top ecological priority is only the first basic step towards this aim. Ecological information integrated in the human context is an essential further step for making environmental evaluations and planning correct conservation actions. A wide series of data and information has been necessary to accomplish these environmental management tasks. Ecological data are provided by the Italian Ministry of the Environment and refer to the Map of Italian Nature Project database. The demographic data derive from the Italian Institute of Statistics (ISTAT). The data cover two Italian areas: Baganza Valley (Parma), and Oltrepò Pavese and the Ligurian-Emilian Apennine. The analysis has been carried out at two different spatial levels: ecological-naturalistic (habitat level) and administrative (Commune level). Correspondingly, the main results obtained are: 1. Habitat level: comparing two ranking and prioritization methods, Ideal Vector and Salience, through important ecological metrics such as Ecological Value (E.V.) and Ecological Sensitivity (E.S.), gives results that are not directly comparable. Not being based on a ranking process, the Ideal Vector method seems preferable in landscapes characterized by high spatial heterogeneity. In contrast, the Salience method is probably to be preferred in ecological landscapes with a low degree of heterogeneity, in the sense of not-too-large differences in habitat E.V. and E.S. 2. Commune level: habitats being only a naturalistic partition of a given territory, it is necessary, for management decisions, to move to the corresponding administrative units (Communes). From this point of view, the introduction of demography is an essential element of novelty in environmental analysis. In effect, demographic analysis makes the result of point 1 more realistic by introducing other dimensions (current human pressure and its trend) that allow the identification of environmentally fragile areas. Furthermore, this approach clearly identifies the environmental responsibility of each administrative body regarding biodiversity conservation. In effect, ranking the Communes according to environmental and demographic features clarifies the management responsibilities of each of them. 
A concrete application of this necessary and useful integration of ecological and demographic data has been developed in designing an Ecological Network (E.N.). The novelty of the obtained E.N. is that it is not "static" but "dynamic": its planning takes into account demographic pressure trends in order to identify probable future fragile points, and thus the points most critical to manage.
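    A hedged sketch of an ideal-vector style ranking over the two metrics named above, Ecological Value (E.V.) and Ecological Sensitivity (E.S.): habitats are ranked by the distance of their normalized scores to an ideal point that is best on every criterion. The normalization, the Euclidean distance, and the example scores are assumptions for illustration; the thesis's exact formulation may differ.

```python
# Illustrative ideal-vector ranking over (E.V., E.S.); not the thesis's exact method.
import numpy as np

def ideal_vector_ranking(scores):
    """scores: dict habitat -> (E.V., E.S.), higher meaning higher priority."""
    names = list(scores)
    X = np.array([scores[n] for n in names], dtype=float)
    # min-max normalize each criterion, then measure distance to the ideal point (1, 1)
    span = np.ptp(X, axis=0)
    X = (X - X.min(axis=0)) / np.where(span == 0, 1, span)
    dist = np.linalg.norm(1.0 - X, axis=1)
    return sorted(zip(names, dist), key=lambda p: p[1])  # closest to ideal first

habitats = {"H1": (0.9, 0.7), "H2": (0.4, 0.9), "H3": (0.2, 0.3)}  # invented scores
print(ideal_vector_ranking(habitats))
```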

    Exploring the law of text geographic information

    Full text link
Textual geographic information is indispensable and heavily relied upon in practical applications. The absence of a clear distribution poses challenges to effectively harnessing geographic information, which drives our exploration. We contend that geographic information is shaped by human behavior, cognition, expression, and thought processes, and, given our intuitive understanding of natural systems, we hypothesize that it conforms to the Gamma distribution. Through rigorous experiments on a diverse collection of 24 datasets spanning different languages and types, we substantiate this hypothesis, uncovering the underlying regularities governing the dimensions of quantity, length, and distance in geographic information. Furthermore, theoretical analyses and comparisons with Gaussian distributions and Zipf's law rule out the possibility that these laws are coincidental. Significantly, we estimate upper bounds on human utilization of geographic information, pointing towards the existence of uncharted territories. We also provide guidance for geographic information extraction, in the hope of lifting the veil of geographic information and revealing its true countenance. Comment: IP
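    As a hedged illustration of the kind of distribution comparison the abstract describes, the sketch below fits a Gamma and a Gaussian model to a synthetic positive-valued sample (a stand-in for, e.g., lengths of geographic expressions) and compares them by AIC. It does not reproduce the paper's 24-dataset experiments or its theoretical analysis.

```python
# Fit Gamma vs. Gaussian to a positive-valued sample and compare by AIC.
# The sample is synthetic; it merely stands in for observed geographic quantities.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=3.0, size=5000)

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood

a, loc, scale = stats.gamma.fit(sample, floc=0)   # shape and scale free, loc fixed at 0
mu, sigma = stats.norm.fit(sample)                # mean and std free

aic_gamma = aic(stats.gamma.logpdf(sample, a, loc, scale).sum(), k=2)
aic_norm = aic(stats.norm.logpdf(sample, mu, sigma).sum(), k=2)
print(f"AIC gamma={aic_gamma:.1f}  normal={aic_norm:.1f}  (lower is better)")
```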