87 research outputs found
Using and extending itemsets in data mining : query approximation, dense itemsets, and tiles
Frequent itemsets are one of the best known concepts in data mining, and there is active research in itemset mining algorithms. An itemset is frequent in a database if its items co-occur in sufficiently many records. This thesis addresses two questions related to frequent itemsets. The first question is raised by a method for approximating logical queries by an inclusion-exclusion sum truncated to the terms corresponding to the frequent itemsets: how good are the approximations thereby obtained? The answer is twofold: in theory, the worst-case bound for the algorithm is very large, and a construction is given that shows the bound to be tight; but in practice, the approximations tend to be much closer to the correct answer than in the worst case. While some other algorithms based on frequent itemsets yield even better approximations, they are not as widely applicable.
The second question concerns extending the definition of frequent itemsets to relax the requirement of perfect co-occurrence: highly correlated items may form an interesting set, even if they never co-occur in a single record. The problem is to formalize this idea in a way that still admits efficient mining algorithms. Two different approaches are used. First, dense itemsets are defined in a manner similar to the usual frequent itemsets and can be found using a modification of the original itemset mining algorithm. Second, tiles are defined in a different way so as to form a model for the whole data, unlike frequent and dense itemsets. A heuristic algorithm based on spectral properties of the data is given and some of its properties are explored.
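The two ideas in this abstract, frequent itemsets and a query count approximated by an inclusion-exclusion sum truncated to the frequent itemsets, can be illustrated with a minimal sketch. The function names, the toy transactions, and the choice of a disjunction query are assumptions made for illustration; this is not the thesis's own implementation.

```python
from itertools import combinations


def support(transactions, itemset):
    """Number of records that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Levelwise (Apriori-style) search; returns {frozenset: support count}."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    found = {}
    while level:
        size = len(level[0]) + 1
        kept = {}
        for candidate in level:
            n = support(transactions, candidate)
            if n >= min_support:
                kept[candidate] = n
        found.update(kept)
        # Join step: unions of two frequent sets that are exactly one item larger.
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        # Prune step: every subset that is one item smaller must be frequent.
        level = [
            c for c in candidates
            if all(frozenset(s) in kept for s in combinations(c, size - 1))
        ]
    return found


def approx_disjunction_count(found, query_items):
    """Inclusion-exclusion estimate of how many records contain at least one
    query item, keeping only the terms whose itemset is frequent."""
    total = 0
    for k in range(1, len(query_items) + 1):
        for subset in combinations(sorted(query_items), k):
            n = found.get(frozenset(subset))
            if n is not None:
                total += (-1) ** (k + 1) * n
    return total


# Hypothetical toy database of five records.
transactions = [
    frozenset("abc"), frozenset("ab"), frozenset("ac"),
    frozenset("b"), frozenset("cd"),
]
found = frequent_itemsets(transactions, min_support=2)
estimate_bc = approx_disjunction_count(found, {"b", "c"})  # 6, exact answer is 5
estimate_ab = approx_disjunction_count(found, {"a", "b"})  # 4, exact answer is 4
```

In this toy data the pair {b, c} occurs in only one record, so its term is dropped and the estimate for "b or c" overshoots by one, while the estimate for "a or b" is exact because every term it needs is frequent.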
Discovery of error-tolerant biclusters from noisy gene expression data
An important analysis performed on microarray gene-expression data is the discovery of biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters in these real-valued gene-expression data sets. However, these algorithms suffer from several limitations, such as an inability to explicitly handle errors or noise in the data; difficulty in discovering small biclusters due to their top-down approach; and the inability of some approaches to find overlapping biclusters, which is crucial because many genes participate in multiple biological processes. Association pattern mining also produces biclusters as its result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, whic
Extraction of High Utility Itemsets using Utility Pattern with Genetic Algorithm from OLTP System
Frequent pattern mining plays an important role in analysing vast amounts of data. In practice, however, it cannot meet the challenges of real-world problems because items differ in various measures. Hence an emerging technique called utility-based data mining is used. Utility mining considers not only the frequency of itemsets but also the utility associated with them. Its main objective is to extract the itemsets with high utility by considering user preferences such as profit, quantity, and cost from OLTP systems. In our proposed approach, we use UP-Growth with a Genetic Algorithm. The idea is that the UP-Growth algorithm generates Potentially High Utility Itemsets, and the Genetic Algorithm optimizes them to provide the High Utility Itemsets. Compared with the existing algorithm, the proposed approach performs better in terms of memory utilization.
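To make the notion of itemset utility concrete, here is a minimal brute-force sketch. It is neither UP-Growth nor a genetic algorithm; it only shows how utility combines quantity and unit profit. The function names, the data layout (one dict of item quantities per transaction), and the numbers are all assumptions for illustration.

```python
from itertools import combinations


def itemset_utility(transactions, profit, itemset):
    """Sum of quantity * unit profit over the transactions that contain
    every item of `itemset`."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, profit, min_utility):
    """Exhaustive search, exponential in the number of items; fine for a
    toy example only."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for k in range(1, len(items) + 1):
        for itemset in combinations(items, k):
            u = itemset_utility(transactions, profit, itemset)
            if u >= min_utility:
                result[itemset] = u
    return result


# Hypothetical data: each transaction maps item -> purchased quantity.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
]
profit = {"a": 5, "b": 2, "c": 1}  # assumed unit profits
hui = high_utility_itemsets(transactions, profit, min_utility=10)
# hui == {("a",): 15, ("a", "c"): 11}
```

Unlike support, utility is not anti-monotone: a superset can have a higher utility than one of its subsets. That is one reason dedicated pruning strategies are used instead of plain levelwise search.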
DOI: 10.17762/ijritcc2321-8169.15039
Frequent itemset mining in big data with effective single scan algorithms
© 2013 IEEE. This paper considers frequent itemset mining in transactional databases. It introduces a new accurate single-scan approach for frequent itemset mining (SSFIM), a heuristic as an alternative approach (EA-SSFIM), as well as a parallel implementation on Hadoop clusters (MR-SSFIM). EA-SSFIM and MR-SSFIM target sparse and big databases, respectively. The proposed approach (in all its variants) requires only one scan to extract the candidate itemsets, and it has the advantage of generating a fixed number of candidate itemsets independently of the value of the minimum support. This accelerates the scan process compared with existing approaches when dealing with sparse and big databases. Numerical results show that SSFIM outperforms the state-of-the-art FIM approaches when dealing with medium and large databases. Moreover, EA-SSFIM provides similar performance to SSFIM while considerably reducing the runtime for large databases. The results also reveal the superiority of MR-SSFIM compared with the existing HPC-based solutions for FIM on sparse and big databases.
New Spark solutions for distributed frequent itemset and association rule mining algorithms
Funding for open access publishing: Universidad de Granada/CBUA. The research reported in this paper was partially supported by the BIGDATAMED project, which has received funding from the Andalusian Government (Junta de Andalucía) under grant agreement No P18-RT-1765, by Grants PID2021-123960OB-I00 and TED2021-129402B-C21 funded by Ministerio de Ciencia e Innovación and by ERDF A way of making Europe and by the European Union NextGenerationEU. In addition, this work has been partially supported by the Ministry of Universities through the EU-funded Margarita Salas programme NextGenerationEU.
The large amount of data generated every day makes it necessary to re-implement methods so that they can handle massive data efficiently. This is the case for Association Rules, an unsupervised data mining tool capable of extracting information in the form of IF-THEN patterns. Although several methods have been proposed for the extraction of frequent itemsets (the phase that precedes mining association rules) in very large databases, high computational cost and lack of memory remain major problems when processing large data. Therefore, the aim of this paper is threefold: (1) to review existing algorithms for frequent itemset and association rule mining, (2) to develop new efficient frequent itemset Big Data algorithms using distributed computation, as well as a new association rule mining algorithm in Spark, and (3) to compare the proposed algorithms with existing proposals while varying the number of transactions and the number of items. For this purpose, we have used the Spark platform, which has been demonstrated to outperform existing distributed algorithmic implementations.
Universidad de Granada/CBUA; Junta de Andalucía P18-RT-1765; Ministry of Science and Innovation, Spain (MICINN); Instituto de Salud Carlos III; Spanish Government PID2021-123960OB-I00, TED2021-129402B-C21; ERDF A way of making Europe; European Union NextGenerationEU; Ministry of Universities through the E
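The step from frequent itemsets to IF-THEN association rules can be sketched in a few lines. This is a single-machine illustration in plain Python, not the Spark algorithms proposed in the paper. The function name and the toy support counts are assumptions, and the confidence of a rule is taken in the usual way as the support of the whole itemset divided by the support of the IF part.

```python
from itertools import combinations


def association_rules(supports, min_confidence):
    """Derive rules lhs -> rhs from a {frozenset: support count} table.

    Every itemset is split into a non-empty IF part and a non-empty THEN
    part; a rule is kept when its confidence reaches `min_confidence`.
    """
    rules = []
    for itemset, count in supports.items():
        for k in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), k):
                lhs = frozenset(lhs_items)
                if lhs not in supports:
                    continue  # support of the IF part is unknown, so skip
                confidence = count / supports[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


# Hypothetical support counts, as a frequent itemset miner might return them.
supports = {
    frozenset("a"): 3,
    frozenset("b"): 3,
    frozenset("ab"): 2,
}
rules = association_rules(supports, min_confidence=0.6)
# Two rules survive: IF a THEN b and IF b THEN a, each with confidence 2/3.
```

Raising `min_confidence` to 0.7 would discard both rules in this example, since each has confidence 2/3.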
Scalable And Efficient Outlier Detection In Large Distributed Data Sets With Mixed-type Attributes
An important problem that appears often when analyzing data involves identifying irregular or abnormal data points called outliers. This problem broadly arises under two scenarios: when outliers are to be removed from the data before analysis, and when useful information or knowledge can be extracted from the outliers themselves. Outlier detection in the context of the second scenario is a research field that has attracted significant attention in a broad range of useful applications. For example, in credit card transaction data, outliers might indicate potential fraud; in network traffic data, outliers might represent potential intrusion attempts. The basis for deciding whether a data point is an outlier is often some measure or notion of dissimilarity between the data point under consideration and the rest. Traditional outlier detection methods assume numerical or ordinal data and compute pair-wise distances between data points. However, the notion of distance or similarity for categorical data is more difficult to define. Moreover, the size of currently available data sets dictates the need for fast and scalable outlier detection methods, thus precluding distance computations. Additionally, these methods must be applicable to data that might be distributed among different locations. In this work, we propose novel strategies to efficiently deal with large distributed data containing mixed-type attributes. Specifically, we first propose a fast and scalable algorithm for categorical data (AVF), and its parallel version based on MapReduce (MR-AVF). We extend AVF and introduce a fast outlier detection algorithm for large distributed data with mixed-type attributes (ODMAD). Finally, we modify ODMAD in order to deal with very high-dimensional categorical data. Experiments with large real-world and synthetic data show that the proposed methods exhibit large performance gains and high scalability compared to the state-of-the-art, while achieving similar detection accuracy.
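A distance-free outlier score for categorical data can be sketched as follows. The abstract does not spell out how AVF scores a point, so the rule used here (the average frequency of a record's attribute values, with low scores flagged as outliers) is an assumption in the spirit of a frequency-based method, and the function name and data are hypothetical.

```python
from collections import Counter


def value_frequency_scores(rows):
    """Score each record by the mean frequency of its attribute values.

    Records built from rare values get low scores. No pair-wise distances
    are computed; two passes over the data suffice.
    """
    n_attributes = len(rows[0])
    # First pass: count how often each value occurs in each attribute.
    counts = [Counter(row[j] for row in rows) for j in range(n_attributes)]
    # Second pass: average the counts of the values each record holds.
    return [
        sum(counts[j][row[j]] for j in range(n_attributes)) / n_attributes
        for row in rows
    ]


# Hypothetical categorical records: (colour, size).
rows = [
    ("red", "s"),
    ("red", "s"),
    ("red", "m"),
    ("blue", "l"),
]
scores = value_frequency_scores(rows)  # [2.5, 2.5, 2.0, 1.0]
most_outlying = min(range(len(rows)), key=scores.__getitem__)  # index 3
```

Because the score depends only on per-attribute value counts, those counts could be computed independently on separate data partitions and then summed, which fits the distributed setting the abstract describes.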
Extraction of constrained biclusters in noisy contexts
Bicluster extraction, which consists of searching for a group of attributes that show coherent behavior for a subset of observations in a data matrix, is an important task in various fields, such as biology. We propose a new system, COBIC, which combines graph algorithms with data mining methods for an efficient search for relevant and possibly overlapping biclusters. COBIC is based on maximum-flow/minimum-cut algorithms and can take into account the knowledge of a base expressed in the form of a classification, through a weight-adaptation mechanism during the iterative extraction of dense regions. The evaluation of COBIC on real data and the comparison with efficient biclustering methods show that COBIC performs very well, particularly when the quality of the biclusters is assessed according to the significance of the enrichment of the computed clusters with the cellular functions described in the GO ontology.