9 research outputs found
Mining Big Sources using Efficient Data Mining Algorithms
Abstract: Data mining algorithms are widely used in real-world applications to discover knowledge in large data sources. These algorithms work on historical data, analyzing it to bring out trends or patterns. Association rule mining, or frequent itemset mining, is useful in applications such as inductive databases and query expansion. A frequent itemset is a set of items that occurs together in at least a specified number of records of a given dataset. A frequent itemset that is contained in no other frequent itemset is called a maximal itemset; one that has no superset with the same support is called a closed itemset. These itemsets are used to extract patterns or trends that support decision making in real-world applications. Recently Uno et al. proposed data mining algorithms to discover maximal itemsets, closed itemsets and frequent itemsets. In this paper we explore those algorithms in practice: we implement them in a prototype application, and the empirical results reveal that they are very useful for many data mining solutions.
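The three itemset families the abstract distinguishes can be illustrated with a brute-force sketch (this is not Uno et al.'s LCM-style algorithm, which avoids enumerating every candidate; the transaction data below is invented for illustration):

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Enumerate every frequent itemset with its support (brute force)."""
    items = sorted({i for t in transactions for i in t})
    support = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = sum(1 for t in transactions if set(cand) <= t)
            if s >= minsup:
                support[frozenset(cand)] = s
    return support

def closed_and_maximal(support):
    """Split frequent itemsets into closed and maximal ones."""
    closed, maximal = set(), set()
    for X, s in support.items():
        supersets = [Y for Y in support if X < Y]
        if all(support[Y] < s for Y in supersets):
            closed.add(X)    # no frequent proper superset with equal support
        if not supersets:
            maximal.add(X)   # no frequent proper superset at all
    return closed, maximal

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
support = frequent_itemsets(transactions, minsup=2)
closed, maximal = closed_and_maximal(support)
```

Here every frequent itemset happens to be closed, while only the three pairs are maximal, showing that maximal itemsets form a subset of the closed ones.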
Réflexions sur l'extraction de motifs rares
Studies in data mining have so far been concerned mainly with extracting frequent patterns and generating association rules from those frequent patterns. The best-known algorithm to achieve these goals is Apriori, which was followed by a whole family of algorithms, all sharing the characteristic of extracting the set of frequent patterns or a subset of it (frequent closed patterns, maximal frequent patterns, minimal generators). In this article, we pose the problem of mining rare, or infrequent, patterns, which lie in the complement of the set of frequent patterns. This type of pattern has never really been the subject of a systematic study, despite the interest and demand existing in certain application domains. In biology or medicine, for instance, it can be very important for a practitioner to spot unusual symptoms or exceptional adverse effects appearing in a patient for a given pathology or treatment.
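The rare patterns discussed above are the complement of the frequent ones; a natural compact representation is the set of *minimal* rare itemsets, those whose every proper subset is still frequent. A brute-force sketch (for illustration only, not one of the algorithms surveyed in the article):

```python
from itertools import combinations

def minimal_rare_itemsets(transactions, minsup):
    """Minimal rare itemsets: support below minsup while every proper
    subset is frequent. Exponential brute force, illustration only."""
    items = sorted({i for t in transactions for i in t})

    def sup(X):
        return sum(1 for t in transactions if set(X) <= t)

    rare = []
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            if sup(cand) < minsup and all(
                    sup(sub) >= minsup for sub in combinations(cand, k - 1)):
                rare.append(frozenset(cand))
    return rare

rare = minimal_rare_itemsets(
    [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], minsup=2)
```

On this toy dataset only {a, b, c} is minimal rare: it occurs once, while all of its pairs still meet the threshold.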
Discovery of the D-basis in binary tables based on hypergraph dualization
Discovery of (strong) association rules, or implications, is an important
task in data management, and it finds application in artificial intelligence,
data mining and the semantic web. We introduce a novel approach
for the discovery of a specific set of implications, called the D-basis, that provides
a representation for a reduced binary table, based on the structure of
its Galois lattice. At the core of the method are the D-relation defined in
the lattice theory framework, and the hypergraph dualization algorithm that
allows us to effectively produce the set of transversals for a given Sperner hypergraph.
The latter algorithm, first developed by specialists from the Rutgers
Center for Operations Research, has already found numerous applications in
solving optimization problems in database theory, artificial intelligence and
game theory. One application of the method is the analysis of gene expression
data related to a particular phenotypic variable, and some initial testing is
done for the data provided by the University of Hawaii Cancer Center.
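The minimal-transversal computation at the core of the method can be sketched with the classical Berge multiplication scheme (a simple baseline with worst-case exponential behavior, not the refined dualization algorithm the abstract refers to):

```python
def minimal_transversals(hypergraph):
    """Berge multiplication: fold in one edge at a time, extending each
    partial transversal by a vertex of the new edge and keeping only
    inclusion-minimal hitting sets."""
    transversals = [frozenset()]
    for edge in hypergraph:
        extended = {t | {v} for t in transversals for v in edge}
        # discard any set that strictly contains another candidate
        transversals = [t for t in extended
                        if not any(u < t for u in extended)]
    return set(transversals)

result = minimal_transversals([{"a", "b"}, {"b", "c"}])
```

For the path hypergraph {a, b}, {b, c} the minimal transversals are {b} and {a, c}, i.e., exactly the edges of its dual.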
Diverse Rule Sets
While machine-learning models are flourishing and transforming many aspects
of everyday life, the inability of humans to understand complex models poses
difficulties for these models to be fully trusted and embraced. Thus,
interpretability of models has been recognized as an equally important quality
as their predictive power. In particular, rule-based systems are experiencing a
renaissance owing to their intuitive if-then representation.
However, simply being rule-based does not ensure interpretability. For
example, overlapped rules spawn ambiguity and hinder interpretation. Here we
propose a novel approach of inferring diverse rule sets, by optimizing small
overlap among decision rules with a 2-approximation guarantee under the
framework of Max-Sum diversification. We formulate the problem as maximizing a
weighted sum of discriminative quality and diversity of a rule set.
In order to overcome an exponential-size search space of association rules,
we investigate several natural options for a small candidate set of
high-quality rules, including frequent and accurate rules, and examine their
hardness. Leveraging the special structure in our formulation, we then devise
an efficient randomized algorithm, which samples rules that are highly
discriminative and have small overlap. The proposed sampling algorithm
analytically targets a distribution of rules that is tailored to our objective.
We demonstrate the superior predictive power and interpretability of our
model with a comprehensive empirical study against strong baselines.
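The objective described above, a weighted sum of rule quality and pairwise diversity, can be sketched with a plain greedy selection (the paper's 2-approximation and its sampling algorithm are different and more refined; the rule representation below, a (quality, covered-row-set) pair, is an assumption made for illustration):

```python
def greedy_diverse_rules(rules, k, lam=1.0):
    """Greedily pick k rules maximizing quality plus diversity, where
    diversity is 1 minus the Jaccard overlap of the rows two rules cover.
    Illustrative baseline only, not the paper's algorithm."""
    def overlap(r, s):
        a, b = r[1], s[1]
        return len(a & b) / len(a | b) if a | b else 0.0

    chosen, remaining = [], list(rules)
    while remaining and len(chosen) < k:
        best = max(remaining,
                   key=lambda r: r[0] + lam * sum(1 - overlap(r, s)
                                                  for s in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen

rules = [(0.9, {1, 2, 3}), (0.8, {1, 2, 3}), (0.5, {4, 5, 6})]
chosen = greedy_diverse_rules(rules, k=2)
```

Note how the second pick skips the 0.8-quality rule, which fully overlaps the first pick, in favor of the weaker but non-overlapping rule, which is exactly the ambiguity-avoiding behavior the abstract motivates.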
Achieving New Upper Bounds for the Hypergraph Duality Problem through Logic
The hypergraph duality problem DUAL is defined as follows: given two simple
hypergraphs G and H, decide whether H
consists precisely of all minimal transversals of G (in which case
we say that H is the dual of G). This problem is
equivalent to deciding whether two given non-redundant monotone DNFs are dual.
It is known that non-DUAL, the complementary problem to DUAL, is in
GC(log² n, PTIME), where GC(f(n), C)
denotes the complexity class of all problems that after a nondeterministic
guess of O(f(n)) bits can be decided (checked) within complexity class
C. It was conjectured that non-DUAL is in GC(log² n, LOGSPACE). In this paper we prove this conjecture and actually
place the non-DUAL problem into the complexity class GC(log² n, TC⁰), which is a subclass of GC(log² n, LOGSPACE). We here refer to the logtime-uniform version of
TC⁰, which corresponds to FO(COUNT), i.e., first-order
logic augmented by counting quantifiers. We achieve the latter bound in two
steps. First, based on existing problem decomposition methods, we develop a new
nondeterministic algorithm for non-DUAL that requires to guess O(log² n)
bits. We then proceed by a logical analysis of this algorithm, allowing us to
formulate its deterministic part in FO(COUNT). From this result, by
the well-known inclusion FO(COUNT) ⊆ LOGSPACE, it follows
that DUAL belongs also to DSPACE[log² n]. Finally, by exploiting
the principles on which the proposed nondeterministic algorithm is based, we
devise a deterministic algorithm that, given two hypergraphs G and
H, computes in quadratic logspace a transversal of G
missing in H.
Comment: Restructured the presentation in order to be the extended version of
a paper that will shortly appear in SIAM Journal on Computing.
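The DUAL decision problem itself is easy to state operationally, even though doing it efficiently is the whole difficulty; a naive exponential check (nothing like the guess-and-check machinery of the paper) makes the definition concrete:

```python
from itertools import combinations

def is_dual(G, H):
    """Naive DUAL check: H is the dual of G iff H equals the set of all
    minimal transversals of G. Brute force over all vertex subsets,
    enumerated by increasing size so minimality is easy to test."""
    vertices = set().union(*G) | set().union(*H)

    def hits_all(s):
        return all(s & e for e in G)

    minimal = set()
    for r in range(len(vertices) + 1):
        for s in combinations(sorted(vertices), r):
            fs = frozenset(s)
            # fs is minimal iff it hits every edge and no smaller
            # transversal (already found) is strictly contained in it
            if hits_all(fs) and not any(t < fs for t in minimal):
                minimal.add(fs)
    return minimal == {frozenset(e) for e in H}
```

For example, the dual of the hypergraph with edges {a, b} and {b, c} has edges {b} and {a, c}; any other candidate H is rejected.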
On the Complexity of Generating Maximal Frequent and Minimal Infrequent Sets
Let A be an m × n binary matrix, t ∈ {1, . . . , m} be a threshold, and ε > 0 be a positive parameter. We show that given a family of O(n^ε) maximal t-frequent column sets for A, it is NP-complete to decide whether A has any further maximal t-frequent sets, or not, even when the number of such additional maximal t-frequent column sets may be exponentially large. In contrast, all minimal t-infrequent sets of columns of A can be enumerated in incremental quasi-polynomial time. The proof of the latter result follows from the inequality α ≤ (m − t + 1)β, where α and β are respectively the numbers of all maximal t-frequent and all minimal t-infrequent sets of columns of the matrix A. We also discuss the complexity of generating all closed t-frequent column sets for a given binary matrix.
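The inequality α ≤ (m − t + 1)β can be checked empirically on a small matrix by enumerating both families by brute force (the example matrix below is invented for illustration):

```python
from itertools import combinations

def check_bound(A, t):
    """Count maximal t-frequent (alpha) and minimal t-infrequent (beta)
    column sets of a binary matrix A given as a list of rows, and check
    alpha <= (m - t + 1) * beta. Exponential brute force."""
    m, n = len(A), len(A[0])

    def support(cols):
        return sum(1 for row in A if all(row[c] for c in cols))

    subsets = [frozenset(c) for r in range(n + 1)
               for c in combinations(range(n), r)]
    frequent = {s for s in subsets if support(s) >= t}
    infrequent = set(subsets) - frequent
    maximal = [s for s in frequent if not any(s < f for f in frequent)]
    minimal = [s for s in infrequent if not any(i < s for i in infrequent)]
    alpha, beta = len(maximal), len(minimal)
    return alpha, beta, alpha <= (m - t + 1) * beta

alpha, beta, holds = check_bound([[1, 1, 0], [1, 0, 1], [0, 1, 1]], t=2)
```

On this 3 × 3 matrix with t = 2, the three singletons are the maximal 2-frequent sets and the three pairs are the minimal 2-infrequent sets, so α = β = 3 and the bound 3 ≤ (3 − 2 + 1) · 3 holds.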
Periodic subgraph mining in dynamic networks
This thesis aims to discover frequent periodic interactions among the members of a population whose behavior is studied over a certain span of time. The interactions among the members of the population are represented by edges E between vertices V of a graph. A dynamic network consists of a series of T timesteps, for each of which there is a graph representing the interactions active at that instant. The thesis presents ListMiner, an algorithm for the periodic subgraph mining problem. The computational complexity of this algorithm is O((V+E) T² ln(T/σ)), where σ is the minimum number of periodic repetitions required for the mined subgraph to be reported in the output. This complexity improves by a factor of T on the only algorithm known in the literature, PSEMiner. The thesis also includes an analysis of the results obtained and presents a variant of the problem.
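The core notion, a subgraph recurring at regular intervals across the T snapshots, can be sketched per edge: find maximal arithmetic progressions of timesteps in which the edge appears, keeping those with at least σ repetitions (this is only an illustrative sketch, not the ListMiner algorithm):

```python
from collections import defaultdict

def periodic_edges(snapshots, sigma):
    """Report (edge, start, period, length) for each maximal arithmetic
    progression of timesteps containing the edge, with length >= sigma.
    snapshots is a list of T edge sets; brute force over support pairs."""
    occurrences = defaultdict(list)
    for time, edges in enumerate(snapshots):
        for e in edges:
            occurrences[frozenset(e)].append(time)

    result = []
    for edge, times in occurrences.items():
        tset = set(times)
        for i, start in enumerate(times):
            for nxt in times[i + 1:]:
                period = nxt - start
                # extend the progression start, start+period, ... greedily
                t, length = start, 1
                while t + period in tset:
                    t += period
                    length += 1
                # report only maximal progressions (cannot extend left)
                if length >= sigma and start - period not in tset:
                    result.append((edge, start, period, length))
    return result

snapshots = [{("a", "b")}, set(), {("a", "b")}, set(), {("a", "b")}]
found = periodic_edges(snapshots, sigma=3)
```

Here the edge (a, b) appears at timesteps 0, 2, 4, so the only reported pattern has start 0, period 2 and 3 repetitions; a full periodic-subgraph miner would additionally group edges sharing the same progression.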