9 research outputs found

    Mining Big Sources using Efficient Data Mining Algorithms

    Get PDF
    Abstract: Data mining algorithms are widely used in the real world application in order to discover knowledge from large data sources. These algorithms work on historical data to analyze data in order to bring about trends or patterns. Association rule mining or frequent item set mining is very useful in applications like inductive databases, query expansion and others. A frequent itemset is the itemset when a set of records are repeated for specified number of times in a given dataset. When such frequent itemset is no present in other frequent itemset, it is named as maximal itemset. When it is not as part of other itemset, them it is called closed itemset. These itemsets are used to extract patterns or trends in the real world applications that support in decision making. Recently Uno et al. proposed data mining algorithms to discover maximal itemsets, closed itemsets and frequent itemsets. In this paper we practically explore those algorithms. We implement them in a prototype application and the empirical results reveal that they are very useful for many data mining solutions

    Réflexions sur l'extraction de motifs rares

    Get PDF
    National audienceLes études en fouille de données se sont surtout intéressées jusqu'à présent à l'extraction de motifs fréquents et à la génération de règles d'association à partir des motifs fréquents. L'algorithme le plus célèbre ayant permis d'atteindre ces objectifs est Apriori, qui a été suivi par toute une famille d'algorithmes mis au point par la suite et possédant tous la caractéristique d'extraire l'ensemble des motifs fréquents ou un sous-ensemble de ces motifs (motifs fermés fréquents, motifs fréquents maximaux, générateurs minimaux). Dans cet article, nous posons le problème de la recherche des motifs rares ou non fréquents, qui se trouvent dans le complémentaire de l'ensemble des motifs fréquents. Ce type de motif n'a jamais vraiment fait l'objet d'une étude systématique, malgré l'intérêt et la demande existant dans certains domaines d'application. Ainsi, en biologie ou en médecine, il peut se révéler très important pour un praticien de repérer des symptômes non habituels ou des effets indésirables exceptionnels se déclarant chez un patient pour une pathologie ou un traitement donnés

    Discovery of the D-basis in binary tables based on hypergraph dualization

    Get PDF
    Discovery of (strong) association rules, or implications, is an important task in data management, and it nds application in arti cial intelligence, data mining and the semantic web. We introduce a novel approach for the discovery of a speci c set of implications, called the D-basis, that provides a representation for a reduced binary table, based on the structure of its Galois lattice. At the core of the method are the D-relation de ned in the lattice theory framework, and the hypergraph dualization algorithm that allows us to e ectively produce the set of transversals for a given Sperner hypergraph. The latter algorithm, rst developed by specialists from Rutgers Center for Operations Research, has already found numerous applications in solving optimization problems in data base theory, arti cial intelligence and game theory. One application of the method is for analysis of gene expression data related to a particular phenotypic variable, and some initial testing is done for the data provided by the University of Hawaii Cancer Cente

    Diverse Rule Sets

    Full text link
    While machine-learning models are flourishing and transforming many aspects of everyday life, the inability of humans to understand complex models poses difficulties for these models to be fully trusted and embraced. Thus, interpretability of models has been recognized as an equally important quality as their predictive power. In particular, rule-based systems are experiencing a renaissance owing to their intuitive if-then representation. However, simply being rule-based does not ensure interpretability. For example, overlapped rules spawn ambiguity and hinder interpretation. Here we propose a novel approach of inferring diverse rule sets, by optimizing small overlap among decision rules with a 2-approximation guarantee under the framework of Max-Sum diversification. We formulate the problem as maximizing a weighted sum of discriminative quality and diversity of a rule set. In order to overcome an exponential-size search space of association rules, we investigate several natural options for a small candidate set of high-quality rules, including frequent and accurate rules, and examine their hardness. Leveraging the special structure in our formulation, we then devise an efficient randomized algorithm, which samples rules that are highly discriminative and have small overlap. The proposed sampling algorithm analytically targets a distribution of rules that is tailored to our objective. We demonstrate the superior predictive power and interpretability of our model with a comprehensive empirical study against strong baselines

    Achieving New Upper Bounds for the Hypergraph Duality Problem through Logic

    Get PDF
    The hypergraph duality problem DUAL is defined as follows: given two simple hypergraphs G\mathcal{G} and H\mathcal{H}, decide whether H\mathcal{H} consists precisely of all minimal transversals of G\mathcal{G} (in which case we say that G\mathcal{G} is the dual of H\mathcal{H}). This problem is equivalent to deciding whether two given non-redundant monotone DNFs are dual. It is known that non-DUAL, the complementary problem to DUAL, is in GC(log2n,PTIME)\mathrm{GC}(\log^2 n,\mathrm{PTIME}), where GC(f(n),C)\mathrm{GC}(f(n),\mathcal{C}) denotes the complexity class of all problems that after a nondeterministic guess of O(f(n))O(f(n)) bits can be decided (checked) within complexity class C\mathcal{C}. It was conjectured that non-DUAL is in GC(log2n,LOGSPACE)\mathrm{GC}(\log^2 n,\mathrm{LOGSPACE}). In this paper we prove this conjecture and actually place the non-DUAL problem into the complexity class GC(log2n,TC0)\mathrm{GC}(\log^2 n,\mathrm{TC}^0) which is a subclass of GC(log2n,LOGSPACE)\mathrm{GC}(\log^2 n,\mathrm{LOGSPACE}). We here refer to the logtime-uniform version of TC0\mathrm{TC}^0, which corresponds to FO(COUNT)\mathrm{FO(COUNT)}, i.e., first order logic augmented by counting quantifiers. We achieve the latter bound in two steps. First, based on existing problem decomposition methods, we develop a new nondeterministic algorithm for non-DUAL that requires to guess O(log2n)O(\log^2 n) bits. We then proceed by a logical analysis of this algorithm, allowing us to formulate its deterministic part in FO(COUNT)\mathrm{FO(COUNT)}. From this result, by the well known inclusion TC0LOGSPACE\mathrm{TC}^0\subseteq\mathrm{LOGSPACE}, it follows that DUAL belongs also to DSPACE[log2n]\mathrm{DSPACE}[\log^2 n]. Finally, by exploiting the principles on which the proposed nondeterministic algorithm is based, we devise a deterministic algorithm that, given two hypergraphs G\mathcal{G} and H\mathcal{H}, computes in quadratic logspace a transversal of G\mathcal{G} missing in H\mathcal{H}.Comment: Restructured the presentation in order to be the extended version of a paper that will shortly appear in SIAM Journal on Computin

    On the Complexity of Generating Maximal Frequent and Minimal Infrequent Sets

    No full text
    Let A be an mn binary matrix, t . . . , m} be a threshold, and # > 0 be a positive parameter. We show that given a family of O(n ) maximal t-frequent column sets for A, it is NP-complete to decide whether A has any further maximal t-frequent sets, or not, even when the number of such additional maximal t-frequent column sets may be exponentially large. In contrast, all minimal t-infrequent sets of columns of A can be enumerated in incremental quasi-polynomial time. The proof of the latter result follows from the inequality # t + 1)#, where # and # are respectively the numbers of all maximal t-frequent and all minimal t-infrequent sets of columns of the matrix A. We also discuss the complexity of generating all closed t-frequent column sets for a given binary matrix

    Periodic subgraph mining in dynamic networks

    Get PDF
    La tesi si prefigge di scoprire interazioni periodiche frequenti tra i membri di una popolazione il cui comportamento viene studiato in un certo arco di tempo. Le interazioni tra i membri della popolazione sono rappresentate da archi E tra vertici V di un grafo. Una rete dinamica consiste in una serie di T timestep per ciascuno dei quali esiste un grafo che rappresenta le interazioni attive in quel dato istante. Questa tesi presenta ListMiner, un algoritmo per il problema dell’estrazione di sottografi periodici. La complessità computazionale di tale algoritmo è O((V+E) T2 ln (T /σ)), dove σ è il minimo numero di ripetizioni periodiche necessarie per riportare il sottografo estratto in output. Questa complessità propone un miglioramento di un fattore T rispetto l’unico algoritmo noto in letteratura, PSEMiner. Nella tesi sono inoltre presenti un’analisi dei risultati ottenuti e una presentazione di una variante del problem
    corecore