181 research outputs found

    Prediction of peptides binding to MHC class I alleles by partial periodic pattern mining

    Get PDF
    MHC (Major Histocompatibility Complex) is a key player in the immune response of an organism. It is important to be able to predict which antigenic peptides will bind to a specific MHC allele and which will not, creating possibilities for controlling immune response and for the applications of immunotherapy. However, a problem for MHC class I is the presence of bulges and loops in the peptides, changing the total length. Most machine learning methods in use today require the sequences to be of same length to successfully mine the binding motifs. We propose the use of time-based data mining methods in motif mining to be able to mine motifs position-independently. Also, the information for both binding and non-binding peptides is used on the contrary to the other methods which only rely on binding peptides. The prediction results are between 60-95% for the tested alleles

    Prediction of peptides binding to MHC class I alleles by partial periodic pattern mining

    Get PDF
    MHC (Major Histocompatibility Complex) is a key player in the immune response of an organism. It is important to be able to predict which antigenic peptides will bind to a spe-cific MHC allele and which will not, creating possibilities for controlling immune response and for the applications of immunotherapy. However a problem encountered in the computational binding prediction methods for MHC class I is the presence of bulges and loops in the peptides, changing the total length. Most machine learning methods in use to-day require the sequences to be of same length to success-fully mine the binding motifs. We propose the use of time-based data mining methods in motif mining to be able to mine motifs position-independently. Also, the information for both binding and non-binding peptides are used on the contrary to the other methods which only rely on binding peptides. The prediction results are between 70-80% for the tested alleles

    Redundancy, Deduction Schemes, and Minimum-Size Bases for Association Rules

    Full text link
    Association rules are among the most widely employed data analysis methods in the field of Data Mining. An association rule is a form of partial implication between two sets of binary variables. In the most common approach, association rules are parameterized by a lower bound on their confidence, which is the empirical conditional probability of their consequent given the antecedent, and/or by some other parameter bounds such as "support" or deviation from independence. We study here notions of redundancy among association rules from a fundamental perspective. We see each transaction in a dataset as an interpretation (or model) in the propositional logic sense, and consider existing notions of redundancy, that is, of logical entailment, among association rules, of the form "any dataset in which this first rule holds must obey also that second rule, therefore the second is redundant". We discuss several existing alternative definitions of redundancy between association rules and provide new characterizations and relationships among them. We show that the main alternatives we discuss correspond actually to just two variants, which differ in the treatment of full-confidence implications. For each of these two notions of redundancy, we provide a sound and complete deduction calculus, and we show how to construct complete bases (that is, axiomatizations) of absolutely minimum size in terms of the number of rules. We explore finally an approach to redundancy with respect to several association rules, and fully characterize its simplest case of two partial premises.Comment: LMCS accepted pape

    Optimal constraint-based decision tree induction from itemset lattices

    No full text
    International audienceIn this article we show that there is a strong connection between decision tree learning and local pattern mining. This connection allows us to solve the computationally hard problem of finding optimal decision trees in a wide range of applications by post-processing a set of patterns: we use local patterns to construct a global model. We exploit the connection between constraints in pattern mining and constraints in decision tree induction to develop a framework for categorizing decision tree mining constraints. This framework allows us to determine which model constraints can be pushed deeply into the pattern mining process, and allows us to improve the state-of-the-art of optimal decision tree induction

    Failure prediction for high-performance computing systems

    Get PDF
    The failure rate in high-performance computing (HPC) systems continues to escalate as the number of components in these systems increases. This affects the scalability and the performance of parallel applications in large-scale HPC systems. Fault tolerance (FT) mechanisms help mitigating the impact of failures on parallel applications. However, utilizing such mechanisms requires additional overhead. Besides, the overuse of FT mechanisms results in unnecessarily large overhead in the parallel applications. Knowing when and where failures will occur can greatly reduce the excessive overhead. As such, failure prediction is critical in order to effectively utilize FT mechanisms. In addition, it also helps in system administration and management, as the predicted failure can be handled beforehand with limited impact to the running systems. This dissertation proposes new proficiency metrics for failure prediction based on failure impact in UPC environment that the existing proficiency metrics tire unable to reflect. Furthermore, an efficient log message clustering algorithm is proposed for system event log data preprocessing and analysis. Then, two novel association rule mining approaches are introduced and employed for HPC failure prediction. Finally, the performances of the existing and the proposed association rule mining methods are compared and analyzed

    Effective pattern discovery for text mining

    Get PDF
    Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase) based approaches should perform better than the term-based ones, but many experiments did not support this hypothesis. This paper presents an innovative technique, effective pattern discovery which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance

    Exploring fish purchasing behaviour using data analytics

    Get PDF
    Nas últimas décadas têm ocorrido mudanças significativas no setor do retalho resultantes da globalização, do aumento de competitividade e da transformação do comportamento de compra do consumidor. Esta mudança de paradigma também se aplica ao setor do peixe fresco, que tem sido alvo do interesse de investigadores internacionais por razões políticas e económicas. Tendo em conta este ambiente competitivo, que valoriza a qualidade e o serviço fornecido ao consumidor assente em custos aceitáveis, é necessário a adoção de estratégias focadas no cliente. Esta dissertação está integrada no projeto ValorMar, que nasceu do compromisso de um conjunto alargado de entidades, desde empresas até centros de investigação posicionados pela relevância da economia marítima na cadeia de valor do pescado. Assim, esta dissertação irá tentar compreender relações que se revelem críticas para a tomada de decisão dos consumidores no momento de compra de peixe fresco. Para tal, irão ser usados dados transacionais e técnicas de data mining adequadas ao problema.A metodologia proposta por esta dissertação tem como objetivo não só a identificação de clientes recorrendo a técnicas de segmentação, mas também uma análise ao carrinho de compras de um cliente de peixe fresco. Estas análises aos dados irão mostrar que a extração de conhecimento de grandes bases de dados permite melhorar as decisões estratégicas das empresas e a sua relação com os clientes.In the last decades there have been significant changes in the retail sector resulting from globalization, the increased competitiveness and transformation on consumer's purchasing behaviour. This paradigm shift also applies to the fish sector, that has been capturing the interest of researchers internationally for political and economic reasons. Taking this competitive environment into account, which values the quality and the service given to the customer based on acceptable costs, it is necessary to adopt customer focused strategies.This thesis is integrated in the ValorMar's project, which was born from the commitment of a broad spectrum of entities, from companies to research centers, positioned by the relevance of the sea economy in the fishery value chain. Thus, this dissertation will try to understand critical relations for the decision making of customers when buying fresh fish.For this, transactional data and data mining techniques appropriate to the problem will be used.The methodology proposed by this thesis aims not only to identify customers using clustering techniques, but also to analyze the market basket of a fresh fish customer. These data analyzis will show that the knowledge extraction from large databases allows to improve the companies strategic decisions and their relationship with customers

    WiFi Miner: An online apriori and sensor based wireless network Intrusion Detection System

    Get PDF
    This thesis proposes an Intrusion Detection System, WiFi Miner, which applies an infrequent pattern association rule mining Apriori technique to wireless network packets captured through hardware sensors for purposes of real time detection of intrusive or anomalous packets. Contributions of the proposed system includes effectively adapting an efficient data mining association rule technique to important problem of intrusion detection in a wireless network environment using hardware sensors, providing a solution that eliminates the need for hard-to-obtain training data in this environment, providing increased intrusion detection rate and reduction of false alarms. The proposed system, WiFi Miner, solution approach is to find frequent and infrequent patterns on pre-processed wireless connection records using infrequent pattern finding Apriori algorithm also proposed by this thesis. The proposed Online Apriori-Infrequent algorithm improves the join and prune step of the traditional Apriori algorithm with a rule that avoids joining itemsets not likely to produce frequent itemsets as their results, thereby improving efficiency and run times significantly. A positive anomaly score is assigned to each packet (record) for each infrequent pattern found while a negative anomaly score is assigned for each frequent pattern found. So, a record with final positive anomaly score is considered as anomaly based on the presence of more infrequent patterns than frequent patterns found

    Adaptive Learning and Mining for Data Streams and Frequent Patterns

    Get PDF
    Aquesta tesi està dedicada al disseny d'algorismes de mineria de dades per fluxos de dades que evolucionen en el temps i per l'extracció d'arbres freqüents tancats. Primer ens ocupem de cadascuna d'aquestes tasques per separat i, a continuació, ens ocupem d'elles conjuntament, desenvolupant mètodes de classificació de fluxos de dades que contenen elements que són arbres. En el model de flux de dades, les dades arriben a gran velocitat, i els algorismes que els han de processar tenen limitacions estrictes de temps i espai. En la primera part d'aquesta tesi proposem i mostrem un marc per desenvolupar algorismes que aprenen de forma adaptativa dels fluxos de dades que canvien en el temps. Els nostres mètodes es basen en l'ús de mòduls detectors de canvi i estimadors en els llocs correctes. Proposem ADWIN, un algorisme de finestra lliscant adaptativa, per la detecció de canvi i manteniment d'estadístiques actualitzades, i proposem utilitzar-lo com a caixa negra substituint els comptadors en algorismes inicialment no dissenyats per a dades que varien en el temps. Com ADWIN té garanties teòriques de funcionament, això obre la possibilitat d'ampliar aquestes garanties als algorismes d'aprenentatge i de mineria de dades que l'usin. Provem la nostre metodologia amb diversos mètodes d'aprenentatge com el Naïve Bayes, partició, arbres de decisió i conjunt de classificadors. Construïm un marc experimental per fer mineria amb fluxos de dades que varien en el temps, basat en el programari MOA, similar al programari WEKA, de manera que sigui fàcil pels investigadors de realitzar-hi proves experimentals. Els arbres són grafs acíclics connectats i són estudiats com vincles en molts casos. En la segona part d'aquesta tesi, descrivim un estudi formal dels arbres des del punt de vista de mineria de dades basada en tancats. A més, presentem algorismes eficients per fer tests de subarbres i per fer mineria d'arbres freqüents tancats ordenats i no ordenats. S'inclou una anàlisi de l'extracció de regles d'associació de confiança plena dels conjunts d'arbres tancats, on hem trobat un fenomen interessant: les regles que la seva contrapart proposicional és no trivial, són sempre certes en els arbres a causa de la seva peculiar combinatòria. I finalment, usant aquests resultats en fluxos de dades evolutius i la mineria d'arbres tancats freqüents, hem presentat algorismes d'alt rendiment per fer mineria d'arbres freqüents tancats de manera adaptativa en fluxos de dades que evolucionen en el temps. Introduïm una metodologia general per identificar patrons tancats en un flux de dades, utilitzant la Teoria de Reticles de Galois. Usant aquesta metodologia, desenvolupem un algorisme incremental, un basat en finestra lliscant, i finalment un que troba arbres freqüents tancats de manera adaptativa en fluxos de dades. Finalment usem aquests mètodes per a desenvolupar mètodes de classificació per a fluxos de dades d'arbres.This thesis is devoted to the design of data mining algorithms for evolving data streams and for the extraction of closed frequent trees. First, we deal with each of these tasks separately, and then we deal with them together, developing classification methods for data streams containing items that are trees. In the data stream model, data arrive at high speed, and the algorithms that must process them have very strict constraints of space and time. In the first part of this thesis we propose and illustrate a framework for developing algorithms that can adaptively learn from data streams that change over time. Our methods are based on using change detectors and estimator modules at the right places. We propose an adaptive sliding window algorithm ADWIN for detecting change and keeping updated statistics from a data stream, and use it as a black-box in place or counters or accumulators in algorithms initially not designed for drifting data. Since ADWIN has rigorous performance guarantees, this opens the possibility of extending such guarantees to learning and mining algorithms. We test our methodology with several learning methods as Naïve Bayes, clustering, decision trees and ensemble methods. We build an experimental framework for data stream mining with concept drift, based on the MOA framework, similar to WEKA, so that it will be easy for researchers to run experimental data stream benchmarks. Trees are connected acyclic graphs and they are studied as link-based structures in many cases. In the second part of this thesis, we describe a rather formal study of trees from the point of view of closure-based mining. Moreover, we present efficient algorithms for subtree testing and for mining ordered and unordered frequent closed trees. We include an analysis of the extraction of association rules of full confidence out of the closed sets of trees, and we have found there an interesting phenomenon: rules whose propositional counterpart is nontrivial are, however, always implicitly true in trees due to the peculiar combinatorics of the structures. And finally, using these results on evolving data streams mining and closed frequent tree mining, we present high performance algorithms for mining closed unlabeled rooted trees adaptively from data streams that change over time. We introduce a general methodology to identify closed patterns in a data stream, using Galois Lattice Theory. Using this methodology, we then develop an incremental one, a sliding-window based one, and finally one that mines closed trees adaptively from data streams. We use these methods to develop classification methods for tree data streams.Postprint (published version

    A geographic knowledge discovery approach to property valuation

    Get PDF
    This thesis involves an investigation of how knowledge discovery can be applied in the area Geographic Information Science. In particular, its application in the area of property valuation in order to reveal how different spatial entities and their interactions affect the price of the properties is explored. This approach is entirely data driven and does not require previous knowledge of the area applied. To demonstrate this process, a prototype system has been designed and implemented. It employs association rule mining and associative classification algorithms to uncover any existing inter-relationships and perform the valuation. Various algorithms that perform the above tasks have been proposed in the literature. The algorithm developed in this work is based on the Apriori algorithm. It has been however, extended with an implementation of a ‘Best Rule’ classification scheme based on the Classification Based on Associations (CBA) algorithm. For the modelling of geographic relationships a graph-theoretic approach has been employed. Graphs have been widely used as modelling tools within the geography domain, primarily for the investigation of network-type systems. In the current context, the graph reflects topological and metric relationships between the spatial entities depicting general spatial arrangements. An efficient graph search algorithm has been developed, based on the Djikstra shortest path algorithm that enables the investigation of relationships between spatial entities beyond first degree connectivity. A case study with data from three central London boroughs has been performed to validate the methodology and algorithms, and demonstrate its effectiveness for computer aided property valuation. In addition, through the case study, the influence of location in the value of properties in those boroughs has been examined. The results are encouraging as they demonstrate the effectiveness of the proposed methodology and algorithms, provided that the data is appropriately pre processed and is of high quality
    • …
    corecore