5 research outputs found

    Unsupervised Discretization by Two-dimensional MDL-based Histogram

    Unsupervised discretization is a crucial step in many knowledge discovery tasks. The state-of-the-art method for one-dimensional data infers locally adaptive histograms using the minimum description length (MDL) principle, but the multi-dimensional case is far less studied: current methods consider the dimensions one at a time (if not independently), which results in discretizations based on rectangular cells of adaptive size. Unfortunately, this approach is unable to adequately characterize dependencies among dimensions and/or results in discretizations consisting of more cells (or bins) than is desirable. To address this problem, we propose an expressive model class that allows for far more flexible partitions of two-dimensional data. We extend the state of the art for the one-dimensional case to obtain a model selection problem based on the normalised maximum likelihood, a form of refined MDL. As the flexibility of our model class comes at the cost of a vast search space, we introduce a heuristic algorithm, named PALM, which partitions each dimension alternately and then merges neighbouring regions, all using the MDL principle. Experiments on synthetic data show that PALM 1) accurately reveals ground truth partitions that are within the model class (i.e., the search space), given a large enough sample size; 2) approximates a wide range of partitions outside the model class well; 3) converges, in contrast to its closest competitor IPD; and 4) is self-adaptive with regard to both sample size and local density structure of the data despite being parameter-free. Finally, we apply our algorithm to two geographic datasets to demonstrate its real-world potential. Comment: 30 pages, 9 figures

    An efficient approach to pruning regression trees using a modified Bayesian information criterion

    By identifying relationships between regression tree construction and change-point detection, we show that it is possible to prune a regression tree efficiently using properly modified information criteria. We prove that one of the proposed pruning approaches, which uses a modified Bayesian information criterion, consistently recovers the true tree structure provided that the true regression function can be represented as a subtree of a full tree. In practice, we obtain simplified trees whose prediction accuracy can be comparable to that of trees obtained using standard cost-complexity pruning. We briefly discuss an extension to random forests that prunes trees adaptively in order to prevent excessive variance, building upon the work of other authors.
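To make the idea of pruning with an information criterion concrete, here is a minimal sketch of the decision at a single split. It uses the standard Gaussian BIC as a stand-in; the abstract does not state the form of the modified criterion, so this is not the authors' method. The function names (`rss`, `bic`, `keep_split`) and the parameter counts are illustrative assumptions.

```python
import math


def rss(values):
    """Residual sum of squares of the values around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def bic(rss_total, n, n_params):
    """Standard Gaussian BIC up to an additive constant (assumed stand-in).

    The small floor on the RSS only avoids log(0) on perfectly fitted data.
    """
    return n * math.log(max(rss_total, 1e-12) / n) + n_params * math.log(n)


def keep_split(left, right):
    """Keep a split only if it lowers BIC compared with a single leaf.

    Assumed parameter counts: 1 for a leaf (one mean),
    3 for a split (two means plus the split location).
    """
    n = len(left) + len(right)
    bic_leaf = bic(rss(left + right), n, 1)
    bic_split = bic(rss(left) + rss(right), n, 3)
    return bic_split < bic_leaf


# A clear level shift between the two children: the split is kept.
shifted = keep_split([0.0, 0.1, -0.1, 0.05, -0.05], [5.0, 5.1, 4.9, 5.05, 4.95])

# Both children look like the same noise: the split is pruned.
noise = keep_split([0.1, -0.1, 0.05], [-0.05, 0.0, 0.08])
```

Applying the same test bottom-up over all internal nodes would give a simple pruning pass; the link to change-point detection is that each split acts like a change point in the response along the split variable.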

    Finding correlations and independences in omics data

    Biological studies across all omics fields generate vast amounts of data. To understand these complex data, biologically motivated data mining techniques are indispensable. Evaluation of high-throughput measurements usually relies on identifying underlying signals as well as shared or outstanding characteristics. To that end, methods have been developed to recover the source signals of given datasets, to reveal objects that are more similar to each other than to other objects, and to detect observations that contrast with the background dataset. Each biological problem was addressed individually with computer science solutions suited to its needs. The study of protein-protein interactions (interactome) focuses on the identification of clusters, i.e., sub-graphs of graphs: a parameter-free graph clustering algorithm based on the concept of graph compression was developed to find sets of highly interlinked proteins sharing similar characteristics. The study of lipids (lipidome) calls for co-regulation analyses: to reveal lipids that respond similarly to biological factors, partial correlations were generated with differential Gaussian Graphical Models while accounting solely for disease-specific correlations. The study at the single-cell level (cytomics) aims to understand cellular systems, often with the help of microscopy techniques: a novel noise-robust source separation technique made it possible to reliably extract independent components describing protein behaviors from microscopy images. The study of peptides (peptidomics) often requires the detection of outstanding observations: by assessing regularities in the dataset, an outlier detection algorithm was implemented based on the compression efficacy of independent components of the dataset. The developed algorithms had to fulfill highly diverse constraints in each omics field, but these were met with methods derived from standard correlation and dependency analyses.

    Inductive learning of tree-based regression models

    Doctoral dissertation in Computer Science presented to the Faculdade de Ciências da Universidade do Porto. This thesis explores different aspects of the methodology for inducing regression trees from data samples. The main goal of this study is to improve the predictive ability of regression trees while preserving, as far as possible, their comprehensibility and computational efficiency. Our study of this type of regression model is divided into three main parts. The first part describes in detail two methodologies for growing regression trees: one that minimizes the mean squared error, and another that minimizes the mean absolute deviation. The analysis presented focuses primarily on the computational efficiency of the tree-growing process. Several new algorithms are presented that yield significant gains in computational efficiency. Finally, an experimental comparison of the two alternative methodologies is presented, clearly showing the different practical goals of each. Pruning regression trees is a standard procedure in this type of methodology, whose main purpose is to provide a better trade-off between the simplicity and comprehensibility of the trees and their predictive ability. The second part of this dissertation describes a series of new pruning techniques based on a selection process over a set of alternative pruned trees. We also present an extensive set of experiments comparing different methods for pruning regression trees. The results of this comparison, carried out on a large set of problems, show that our pruning techniques achieve results, in terms of predictive ability, significantly better than those obtained by current state-of-the-art methods.
The final part of this dissertation presents a new type of tree, which we call local regression trees. These hybrid models result from integrating regression trees with modelling techniques ..
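The two growth criteria named in this abstract imply different leaf predictions: the mean of the targets in a leaf minimizes the mean squared error, while the median minimizes the mean absolute deviation. The short sketch below illustrates only that general fact on made-up data; the helper names and the sample values are hypothetical and do not come from the thesis.

```python
from statistics import mean, median


def mean_squared_error(values, prediction):
    """Average squared distance between the targets and a constant prediction."""
    return sum((v - prediction) ** 2 for v in values) / len(values)


def mean_absolute_deviation(values, prediction):
    """Average absolute distance between the targets and a constant prediction."""
    return sum(abs(v - prediction) for v in values) / len(values)


# Hypothetical targets falling into one leaf, including one outlier.
leaf_targets = [1.0, 2.0, 3.0, 4.0, 100.0]

ls_prediction = mean(leaf_targets)     # least-squares leaf value
lad_prediction = median(leaf_targets)  # least-absolute-deviation leaf value

# Each constant is the better one under its own criterion.
mean_wins_on_mse = (
    mean_squared_error(leaf_targets, ls_prediction)
    < mean_squared_error(leaf_targets, lad_prediction)
)
median_wins_on_mad = (
    mean_absolute_deviation(leaf_targets, lad_prediction)
    < mean_absolute_deviation(leaf_targets, ls_prediction)
)
```

The outlier pulls the mean far from the bulk of the targets while the median stays put, which is one way to read the abstract's remark that the two methodologies serve different practical goals.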