87 research outputs found

    Using and extending itemsets in data mining : query approximation, dense itemsets, and tiles

    Frequent itemsets are one of the best-known concepts in data mining, and there is active research in itemset mining algorithms. An itemset is frequent in a database if its items co-occur in sufficiently many records. This thesis addresses two questions related to frequent itemsets. The first question is raised by a method for approximating logical queries by an inclusion-exclusion sum truncated to the terms corresponding to the frequent itemsets: how good are the approximations thereby obtained? The answer is twofold: in theory, the worst-case bound for the algorithm is very large, and a construction is given that shows the bound to be tight; but in practice, the approximations tend to be much closer to the correct answer than in the worst case. While some other algorithms based on frequent itemsets yield even better approximations, they are not as widely applicable. The second question concerns extending the definition of frequent itemsets to relax the requirement of perfect co-occurrence: highly correlated items may form an interesting set, even if they never all co-occur in a single record. The problem is to formalize this idea in a way that still admits efficient mining algorithms. Two different approaches are used. First, dense itemsets are defined in a manner similar to the usual frequent itemsets and can be found using a modification of the original itemset mining algorithm. Second, tiles are defined in a different way so as to form a model for the whole data, unlike frequent and dense itemsets. A heuristic algorithm based on spectral properties of the data is given and some of its properties are explored.
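The truncated inclusion-exclusion idea from the abstract above can be illustrated with a minimal sketch. It assumes the query is a disjunction ("records containing at least one of the query items") and recomputes supports by brute force rather than taking them from a mining algorithm; the function names and the toy data are hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def support(itemset, records):
    """Number of records that contain every item of `itemset`."""
    return sum(1 for r in records if itemset <= r)


def disjunction_count(query_items, records, min_support=0):
    """Estimate how many records contain at least one query item.

    Inclusion-exclusion over all non-empty subsets of the query items,
    keeping only the subsets whose support reaches `min_support`
    (the "frequent" ones). With min_support=0 no term is dropped,
    so the result is exact.
    """
    items = sorted(query_items)
    total = 0
    for k in range(1, len(items) + 1):
        sign = 1 if k % 2 == 1 else -1
        for subset in combinations(items, k):
            s = support(frozenset(subset), records)
            if s >= min_support:
                total += sign * s
    return total


# Hypothetical toy database of six records.
records = [frozenset(r) for r in (
    {"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}, {"c"}, {"d"},
)]
exact = disjunction_count({"a", "b", "c"}, records)                  # all terms kept
approx = disjunction_count({"a", "b", "c"}, records, min_support=3)  # truncated sum
```

On this toy data the exact count is 5, while the sum truncated at a support threshold of 3 keeps only the three singleton terms and gives 9, which shows how dropping infrequent terms can distort the estimate.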

    Discovery of error-tolerant biclusters from noisy gene expression data

    An important analysis performed on microarray gene-expression data is the discovery of biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters in these real-valued gene-expression data sets. However, these algorithms suffer from several limitations: an inability to explicitly handle errors or noise in the data; difficulty in discovering small biclusters due to their top-down approach; and, for some approaches, an inability to find overlapping biclusters, which is crucial because many genes participate in multiple biological processes. Association pattern mining also produces biclusters as its result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, which ...

    Extraction of High Utility Itemsets using Utility Pattern with Genetic Algorithm from OLTP System

    Frequent pattern mining plays an important role in analysing vast amounts of data. In practice, however, frequent pattern mining cannot meet the challenges of real-world problems, because items differ in various measures. Hence an emerging technique called utility-based data mining is used. Utility mining considers not only the frequency of itemsets but also the utility associated with them. Its main objective is to extract itemsets with high utility from OLTP systems by considering user preferences such as profit, quantity and cost. Our proposed approach uses UP-Growth with a Genetic Algorithm: the UP-Growth algorithm generates Potentially High Utility Itemsets, and the Genetic Algorithm optimizes them to provide the High Utility Itemsets. Compared with the existing algorithm, the proposed approach performs better in terms of memory utilization. DOI: 10.17762/ijritcc2321-8169.15039
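To make the notion of itemset utility concrete, here is a minimal sketch that assumes utility is quantity times unit profit, summed over the transactions containing the whole itemset. It enumerates itemsets by brute force; it is not UP-Growth and uses no genetic algorithm, and the item names, profits and threshold are hypothetical.

```python
from itertools import combinations

# Hypothetical unit profits and transactions (item -> purchased quantity).
profit = {"bread": 1, "milk": 2, "wine": 10}
transactions = [
    {"bread": 2, "milk": 1},
    {"bread": 1, "wine": 1},
    {"milk": 3, "wine": 2},
]


def itemset_utility(itemset, transactions, profit):
    """Sum of quantity * profit over transactions containing the whole itemset."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, profit, min_utility):
    """Brute-force enumeration of itemsets whose utility reaches min_utility."""
    items = sorted(profit)
    result = {}
    for k in range(1, len(items) + 1):
        for itemset in combinations(items, k):
            u = itemset_utility(itemset, transactions, profit)
            if u >= min_utility:
                result[itemset] = u
    return result


hui = high_utility_itemsets(transactions, profit, min_utility=10)
```

Note that ("bread", "wine") qualifies although "bread" alone does not: utility is not downward closed the way support is, which is one reason utility mining needs techniques beyond standard frequent pattern mining.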

    Frequent itemset mining in big data with effective single scan algorithms

    © 2013 IEEE. This paper considers frequent itemset mining in transactional databases. It introduces a new accurate single-scan approach for frequent itemset mining (SSFIM), a heuristic alternative (EA-SSFIM), and a parallel implementation on Hadoop clusters (MR-SSFIM). EA-SSFIM and MR-SSFIM target sparse and big databases, respectively. The proposed approach (in all its variants) requires only one scan to extract the candidate itemsets, and it has the advantage of generating a fixed number of candidate itemsets independently of the value of the minimum support. This accelerates the scan process compared with existing approaches when dealing with sparse and big databases. Numerical results show that SSFIM outperforms state-of-the-art FIM approaches on medium and large databases. Moreover, EA-SSFIM provides performance similar to SSFIM while considerably reducing the runtime for large databases. The results also reveal the superiority of MR-SSFIM over existing HPC-based solutions for FIM on sparse and big databases.
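One way a single scan can yield candidates that do not depend on the minimum support is sketched below. This is an assumption about the general idea, not the paper's SSFIM implementation: every transaction contributes all of its non-empty subsets to a shared counter, and the threshold is applied only after the scan. The function name and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def single_scan_frequent_itemsets(transactions, min_support):
    """Count every subset of every transaction in one pass, then filter."""
    counts = Counter()
    for t in transactions:  # the only pass over the database
        items = sorted(set(t))
        for k in range(1, len(items) + 1):
            counts.update(combinations(items, k))
    # min_support is used only here, after all candidates have been counted.
    return {itemset: c for itemset, c in counts.items() if c >= min_support}


transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"]]
frequent = single_scan_frequent_itemsets(transactions, min_support=2)
```

The number of subsets grows exponentially with transaction length, so a sketch like this is only practical for short transactions.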

    New Spark solutions for distributed frequent itemset and association rule mining algorithms

    Funding for open access publishing: Universidad de Granada/CBUA. The research reported in this paper was partially supported by the BIGDATAMED project, which has received funding from the Andalusian Government (Junta de Andalucía) under grant agreement No P18-RT-1765, by Grants PID2021-123960OB-I00 and TED2021-129402B-C21 funded by Ministerio de Ciencia e Innovación, by "ERDF A way of making Europe" and by the European Union NextGenerationEU. In addition, this work has been partially supported by the Ministry of Universities through the EU-funded Margarita Salas programme NextGenerationEU. Funding for open access charge: Universidad de Granada/CBUA.
    The large amount of data generated every day makes necessary the re-implementation of methods capable of handling massive data efficiently. This is the case of Association Rules, an unsupervised data mining tool capable of extracting information in the form of IF-THEN patterns. Although several methods have been proposed for the extraction of frequent itemsets (the phase that precedes mining association rules) in very large databases, high computational cost and lack of memory remain a major problem when processing large data. Therefore, the aim of this paper is threefold: (1) to review existing algorithms for frequent itemset and association rule mining, (2) to develop new efficient frequent itemset Big Data algorithms using distributed computation, as well as a new association rule mining algorithm in Spark, and (3) to compare the proposed algorithms with the existing proposals while varying the number of transactions and the number of items. To this end, we have used the Spark platform, which has been shown to outperform existing distributed implementations.
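The two-phase structure mentioned above (frequent itemsets first, IF-THEN rules second) can be illustrated with a small single-machine sketch of the second phase. It is plain Python rather than Spark and is not the paper's algorithm; the support counts are hypothetical and would normally come from a frequent itemset miner.

```python
from itertools import combinations


def association_rules(supports, min_confidence):
    """Derive IF-THEN rules from itemset support counts.

    `supports` maps frozenset itemsets to support counts. The confidence
    of the rule A -> B is support(A | B) / support(A).
    """
    rules = []
    for itemset, supp in supports.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                a = frozenset(antecedent)
                if a not in supports:
                    continue  # antecedent support unknown, skip the rule
                confidence = supp / supports[a]
                if confidence >= min_confidence:
                    rules.append((a, itemset - a, confidence))
    return rules


# Hypothetical support counts.
supports = {
    frozenset({"a"}): 3,
    frozenset({"b"}): 3,
    frozenset({"c"}): 2,
    frozenset({"a", "b"}): 2,
    frozenset({"a", "c"}): 2,
}
rules = association_rules(supports, min_confidence=0.9)
```

With these counts only the rule IF c THEN a reaches the 0.9 confidence threshold, because every record containing c also contains a.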

    Scalable And Efficient Outlier Detection In Large Distributed Data Sets With Mixed-type Attributes

    An important problem that often appears when analyzing data is identifying irregular or abnormal data points, called outliers. This problem broadly arises in two scenarios: when outliers are to be removed from the data before analysis, and when useful information or knowledge can be extracted from the outliers themselves. Outlier detection in the context of the second scenario is a research field that has attracted significant attention in a broad range of applications. For example, in credit card transaction data, outliers might indicate potential fraud; in network traffic data, outliers might represent potential intrusion attempts. The basis for deciding whether a data point is an outlier is often some measure or notion of dissimilarity between the data point under consideration and the rest. Traditional outlier detection methods assume numerical or ordinal data and compute pairwise distances between data points. However, the notion of distance or similarity for categorical data is more difficult to define. Moreover, the size of currently available data sets dictates the need for fast and scalable outlier detection methods, thus precluding distance computations. Additionally, these methods must be applicable to data that might be distributed among different locations. In this work, we propose novel strategies to deal efficiently with large distributed data containing mixed-type attributes. Specifically, we first propose a fast and scalable algorithm for categorical data (AVF) and its parallel version based on MapReduce (MR-AVF). We extend AVF and introduce a fast outlier detection algorithm for large distributed data with mixed-type attributes (ODMAD). Finally, we modify ODMAD in order to deal with very high-dimensional categorical data. Experiments with large real-world and synthetic data show that the proposed methods exhibit large performance gains and high scalability compared to the state of the art, while achieving similar detection accuracy.
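A frequency-based score is one way to flag categorical outliers without pairwise distance computations. The sketch below assumes that AVF refers to an attribute-value-frequency score, where each row is scored by the average frequency of its attribute values and the lowest-scoring rows are treated as outlier candidates. That reading, the function names and the toy data are assumptions, not details confirmed by the abstract.

```python
from collections import Counter


def avf_scores(rows):
    """Average frequency of each row's attribute values across the data set."""
    n_attrs = len(rows[0])
    # One frequency table per attribute (column).
    freq = [Counter(row[j] for row in rows) for j in range(n_attrs)]
    return [
        sum(freq[j][row[j]] for j in range(n_attrs)) / n_attrs
        for row in rows
    ]


def lowest_scoring(rows, k):
    """Indices of the k rows with the lowest scores (outlier candidates)."""
    scores = avf_scores(rows)
    return sorted(range(len(rows)), key=lambda i: scores[i])[:k]


# Hypothetical categorical data: (colour, size).
rows = [
    ("red", "small"), ("red", "small"), ("red", "large"),
    ("blue", "small"), ("green", "huge"),
]
scores = avf_scores(rows)
candidates = lowest_scoring(rows, k=1)
```

The scoring needs two passes over the data (one to count values, one to score rows) and no distance matrix, which is consistent with the scalability argument made in the abstract.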

    Extraction de biclusters contraints dans des contextes bruités

    Bicluster extraction, which consists of searching for a group of attributes that show coherent behavior for a subset of observations in a data matrix, is an important task in various fields, such as biology. We propose a new system, COBIC, which combines graph algorithms with data mining methods for an efficient search for relevant and possibly overlapping biclusters. COBIC is based on maximum-flow/minimum-cut algorithms and is able to take into account background knowledge expressed as a classification, through a weight-adaptation mechanism applied during the iterative extraction of dense regions. The evaluation of COBIC on real data and its comparison with efficient biclustering methods show that COBIC performs very well, in particular when the quality of the biclusters is assessed by the significance of the enrichment of the computed clusters with the cellular functions described in the GO ontology.