107 research outputs found

    Approximate Parallel High Utility Itemset Mining

    High utility itemset mining discovers itemsets whose utility is above a given threshold, where utility measures the importance of an itemset. When the dataset is very large, memory and time limitations cause scalability issues. This thesis addresses the problem by proposing a distributed parallel algorithm, PHUI-Miner, and a sampling strategy, which can be used either separately or together. PHUI-Miner parallelizes the state-of-the-art high utility itemset mining algorithm HUI-Miner. The sampling strategy investigates the sample size required to achieve a given accuracy. We also propose an approach combining sampling with PHUI-Miner, which provides better time performance. In our experiments, we show that PHUI-Miner has high performance and outperforms the state-of-the-art non-parallel algorithm. The sampling strategy achieves accuracies much higher than the guarantee. Extensive experiments are also conducted to compare the time performance of PHUI-Miner with and without sampling.
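To make the notion of utility concrete, here is a minimal brute-force sketch. It is not PHUI-Miner or HUI-Miner; the toy transactions, the `unit_profit` table, the function names, and the `>=` threshold convention are all assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical toy data: each transaction maps item -> purchased quantity.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"a": 1, "b": 1, "c": 3},
]
# Hypothetical per-unit profit of each item.
unit_profit = {"a": 5, "b": 2, "c": 1}


def utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit over every transaction containing all items."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * unit_profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, unit_profit, min_utility):
    """Enumerate every itemset (exponential in the number of items)."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for k in range(1, len(items) + 1):
        for itemset in combinations(items, k):
            u = utility(itemset, transactions, unit_profit)
            if u >= min_utility:  # assumed convention: threshold is inclusive
                result[itemset] = u
    return result


print(high_utility_itemsets(transactions, unit_profit, 15))
```

In this toy data the itemset `("a", "b")` reaches the threshold while its subset `("b",)` does not, so utility is not anti-monotone the way support is. That is one reason plain frequent-itemset pruning does not carry over directly.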

    Discovery of error-tolerant biclusters from noisy gene expression data

    An important analysis performed on microarray gene-expression data is to discover biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters from these real-valued gene-expression data sets. However, these algorithms suffer from several limitations, such as inability to explicitly handle errors/noise in the data; difficulty in discovering small biclusters due to their top-down approach; and inability of some of the approaches to find overlapping biclusters, which is crucial as many genes participate in multiple biological processes. Association pattern mining also produces biclusters as its result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, whic

    Using and extending itemsets in data mining : query approximation, dense itemsets, and tiles

    Frequent itemsets are one of the best-known concepts in data mining, and there is active research in itemset mining algorithms. An itemset is frequent in a database if its items co-occur in sufficiently many records. This thesis addresses two questions related to frequent itemsets. The first question is raised by a method for approximating logical queries by an inclusion-exclusion sum truncated to the terms corresponding to the frequent itemsets: how good are the approximations thereby obtained? The answer is twofold: in theory, the worst-case bound for the algorithm is very large, and a construction is given that shows the bound to be tight; but in practice, the approximations tend to be much closer to the correct answer than in the worst case. While some other algorithms based on frequent itemsets yield even better approximations, they are not as widely applicable. The second question concerns extending the definition of frequent itemsets to relax the requirement of perfect co-occurrence: highly correlated items may form an interesting set, even if they never co-occur in a single record. The problem is to formalize this idea in a way that still admits efficient mining algorithms. Two different approaches are used. First, dense itemsets are defined in a manner similar to the usual frequent itemsets and can be found using a modification of the original itemset mining algorithm. Second, tiles are defined in a different way so as to form a model for the whole data, unlike frequent and dense itemsets. A heuristic algorithm based on spectral properties of the data is given and some of its properties are explored.
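The truncated inclusion-exclusion idea can be sketched on a toy database. This is only an illustration, not the thesis's algorithm: the choice of a disjunction query, the records, the `min_freq` parameter, and the function names are assumptions.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy database: each record is a set of items.
records = [
    {"a", "b"},
    {"a", "c"},
    {"a", "b", "c"},
    {"b"},
    {"d"},
    {"b", "c"},
]


def frequency(itemset, records):
    """Fraction of records in which all items of `itemset` co-occur."""
    hits = sum(1 for r in records if set(itemset) <= r)
    return Fraction(hits, len(records))


def disjunction_frequency(items, records, min_freq=Fraction(0)):
    """Inclusion-exclusion estimate of the fraction of records containing
    at least one of `items`. Terms whose itemset frequency is below
    `min_freq` are dropped, mimicking a sum restricted to frequent itemsets.
    """
    total = Fraction(0)
    for k in range(1, len(items) + 1):
        sign = 1 if k % 2 == 1 else -1
        for subset in combinations(items, k):
            f = frequency(subset, records)
            if f >= min_freq:
                total += sign * f
    return total


query = ("a", "b", "c")
print(disjunction_frequency(query, records))                  # full sum
print(disjunction_frequency(query, records, Fraction(1, 3)))  # truncated sum
```

With every term included, the sum equals the true fraction of matching records (5 of 6 here). With the threshold at 1/3, the term for `{a, b, c}` is dropped and the estimate falls to 2/3, which shows on a small scale how truncation introduces error.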

    Mining approximate multi-relational patterns


    Hybrid ASP-based Approach to Pattern Mining

    Detecting small sets of relevant patterns from a given dataset is a central challenge in data mining. The relevance of a pattern is based on user-provided criteria; typically, all patterns that satisfy certain criteria are considered relevant. Rule-based languages like Answer Set Programming (ASP) seem well-suited for specifying such criteria in the form of constraints. Although progress has been made, on the one hand, on solving individual mining problems and, on the other hand, on developing generic mining systems, the existing methods focus either on scalability or on generality. In this paper we make steps towards combining local (frequency, size, cost) and global (various condensed representations like maximal, closed, skyline) constraints in a generic and efficient way. We present a hybrid approach for itemset, sequence and graph mining which exploits dedicated highly optimized mining systems to detect frequent patterns and then filters the results using declarative ASP. To further demonstrate the generic nature of our hybrid framework we apply it to a problem of approximately tiling a database. Experiments on real-world datasets show the effectiveness of the proposed method and computational gains for itemset, sequence and graph mining, as well as approximate tiling. Under consideration in Theory and Practice of Logic Programming (TPLP). Comment: 29 pages, 7 figures, 5 tables
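The two-stage shape of such a pipeline (mine frequent patterns first, then filter by global constraints) can be sketched as follows. This is a stand-in only: the paper uses dedicated miners and declarative ASP, whereas this sketch uses a brute-force miner and plain Python filters, and all names and data are hypothetical.

```python
from itertools import combinations

# Hypothetical toy transactions.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]


def frequent_itemsets(transactions, min_support):
    """Stage 1 (stand-in for a dedicated miner): brute-force support counting."""
    items = sorted({i for t in transactions for i in t})
    found = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                found[frozenset(candidate)] = support
    return found


def closed(patterns):
    """Stage 2 filter: keep patterns with no proper superset of equal support."""
    return {
        p: s
        for p, s in patterns.items()
        if not any(p < q and s == sq for q, sq in patterns.items())
    }


def maximal(patterns):
    """Stage 2 filter: keep patterns with no frequent proper superset."""
    return {p: s for p, s in patterns.items() if not any(p < q for q in patterns)}


frequent = frequent_itemsets(transactions, min_support=2)
print(closed(frequent))
print(maximal(frequent))
```

Keeping the filters separate from the miner is the point of the sketch: the first stage can be swapped for any fast frequent-pattern miner without touching the constraint logic.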