159 research outputs found

    Discovery of error-tolerant biclusters from noisy gene expression data

    An important analysis performed on microarray gene-expression data is the discovery of biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters in these real-valued gene-expression data sets. However, these algorithms suffer from several limitations: an inability to explicitly handle errors/noise in the data; difficulty in discovering small biclusters due to their top-down approach; and the inability of some approaches to find overlapping biclusters, which is crucial because many genes participate in multiple biological processes. Association pattern mining also produces biclusters as its result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, whic

    Efficient Discovery of Interesting Patterns Based on Strong Closedness


    New approaches for clustering high dimensional data

    Clustering is one of the most effective methods for analyzing datasets that contain a large number of objects with numerous attributes. Clustering seeks to identify groups, or clusters, of similar objects. In low dimensional space, the similarity between objects is often evaluated by summing the differences across all of their attributes. High dimensional data, however, may contain irrelevant attributes which mask the existence of clusters. The discovery of groups of objects that are highly similar within some subsets of relevant attributes becomes an important but challenging task. My thesis focuses on various models and algorithms for this task. We first present a flexible clustering model, namely OP-Cluster (Order Preserving Cluster). Under this model, two objects are similar on a subset of attributes if the values of these two objects induce the same relative ordering of these attributes. The OP-Clustering algorithm has been shown to be useful for identifying co-regulated genes in gene expression data. We also propose a semi-supervised approach to discover biologically meaningful OP-Clusters by incorporating existing gene function classifications into the clustering process. This semi-supervised algorithm yields only OP-Clusters that are significantly enriched by genes from specific functional categories. Real datasets are often noisy. We propose a noise-tolerant clustering algorithm for mining frequently occurring itemsets, called approximate frequent itemsets (AFI). Both the theoretical and experimental results demonstrate that our AFI mining algorithm has higher recoverability of real clusters than any other existing itemset mining approach. Pair-wise dissimilarities are often derived from original data to reduce the complexities of high dimensional data. Traditional clustering algorithms that take pair-wise dissimilarities as input often generate disjoint clusters. It is well known that the classification model represented by disjoint clusters is inconsistent with many real classifications, such as gene function classifications. We develop a Poclustering algorithm, which generates overlapping clusters from pair-wise dissimilarities. We prove that by allowing overlapping clusters, Poclustering fully preserves the information of any dissimilarity matrix, while traditional partitioning algorithms may cause significant information loss.
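The order-preserving similarity described in the abstract above can be illustrated with a minimal Python sketch. It is not the thesis's OP-Clustering algorithm: it only checks whether two objects induce the same relative ordering on a chosen attribute subset. The function names, the dictionary representation, and the toy expression values are assumptions made for illustration, and ties are broken by the order in which attributes are listed.

```python
def induced_order(values, attrs):
    """Return the attributes in `attrs` sorted by this object's values on them.

    Hypothetical helper: `values` maps attribute name -> numeric value.
    Ties keep the order given in `attrs` (Python's sort is stable).
    """
    return tuple(sorted(attrs, key=lambda a: values[a]))


def same_ordering(x, y, attrs):
    """True if objects x and y rank the attributes in `attrs` identically."""
    return induced_order(x, attrs) == induced_order(y, attrs)


# Toy data (made up): two genes measured under four conditions.
g1 = {"c1": 0.2, "c2": 1.5, "c3": 0.9, "c4": 3.0}
g2 = {"c1": 10.0, "c2": 40.0, "c3": 25.0, "c4": 5.0}

# On {c1, c2, c3} both genes rank the conditions c1 < c3 < c2.
print(same_ordering(g1, g2, ["c1", "c2", "c3"]))
# Adding c4 breaks the agreement: it is the highest for g1 but the lowest for g2.
print(same_ordering(g1, g2, ["c1", "c2", "c3", "c4"]))
```

The two genes differ greatly in magnitude, so a distance summed over all attributes would keep them apart, yet they agree on the ordering of three of the four conditions. That is the kind of subspace similarity the abstract describes.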

    Mining approximate multi-relational patterns


    Using and extending itemsets in data mining : query approximation, dense itemsets, and tiles

    Frequent itemsets are one of the best-known concepts in data mining, and there is active research in itemset mining algorithms. An itemset is frequent in a database if its items co-occur in sufficiently many records. This thesis addresses two questions related to frequent itemsets. The first question is raised by a method for approximating logical queries by an inclusion-exclusion sum truncated to the terms corresponding to the frequent itemsets: how good are the approximations thereby obtained? The answer is twofold: in theory, the worst-case bound for the algorithm is very large, and a construction is given that shows the bound to be tight; but in practice, the approximations tend to be much closer to the correct answer than in the worst case. While some other algorithms based on frequent itemsets yield even better approximations, they are not as widely applicable. The second question concerns extending the definition of frequent itemsets to relax the requirement of perfect co-occurrence: highly correlated items may form an interesting set, even if they never co-occur in a single record. The problem is to formalize this idea in a way that still admits efficient mining algorithms. Two different approaches are used. First, dense itemsets are defined in a manner similar to the usual frequent itemsets and can be found using a modification of the original itemset mining algorithm. Second, tiles are defined in a different way so as to form a model for the whole data, unlike frequent and dense itemsets. A heuristic algorithm based on spectral properties of the data is given and some of its properties are explored.
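The frequent-itemset definition and the truncated inclusion-exclusion idea from the abstract above can be sketched in a few lines of Python. This is an illustrative reading, not the thesis's algorithm: it estimates the fraction of records containing at least one of the queried items, and simply drops every inclusion-exclusion term whose itemset falls below the support threshold. The function names, the `min_support` parameter, and the toy records are assumptions.

```python
from itertools import combinations


def support(itemset, records):
    """Fraction of records that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for record in records if items <= record) / len(records)


def truncated_disjunction(items, records, min_support):
    """Estimate the fraction of records containing at least one of `items`.

    Inclusion-exclusion sum in which terms for infrequent itemsets
    (support below `min_support`) are omitted, i.e. treated as zero.
    """
    estimate = 0.0
    for size in range(1, len(items) + 1):
        sign = 1 if size % 2 == 1 else -1
        for subset in combinations(items, size):
            freq = support(subset, records)
            if freq >= min_support:
                estimate += sign * freq
    return estimate


# Toy database (made up): each record is a set of items.
records = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}, {"c"}]

exact = truncated_disjunction(["a", "b", "c"], records, min_support=0.0)
rough = truncated_disjunction(["a", "b", "c"], records, min_support=0.5)
print(exact, rough)
```

With a threshold of 0.0 every term is kept and the sum recovers the true fraction, 1.0, up to floating-point rounding. With a threshold of 0.5 only the three singletons survive, so the estimate rises to about 1.8, a small example of how far truncation can drift in unfavorable cases.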