
    Efficient Frequent Subtree Mining Beyond Forests

    A common paradigm in distance-based learning is to embed the instance space into some appropriately chosen feature space equipped with a metric, and to define the dissimilarity between instances as the distance between their images in the feature space. If the instances are graphs, then frequent connected subgraphs are a well-suited pattern language for defining such feature spaces. Identifying the set of frequent connected subgraphs and subsequently computing embeddings for graph instances, however, is computationally intractable. As a result, existing frequent subgraph mining algorithms either restrict the structural complexity of the instance graphs or require exponential delay between the output of subsequent patterns. Hence distance-based learners lack an efficient way to operate on arbitrary graph data. To resolve this problem, in this thesis we present a mining system that gives up the demand for completeness of the pattern set in exchange for a guaranteed polynomial delay between subsequent patterns. Complementing this, we devise efficient methods to compute the embedding of arbitrary graphs into the Hamming space spanned by our pattern set. The result is a system that makes it possible to apply distance-based learning methods efficiently to arbitrary graph databases.

    To overcome the computational intractability of the mining step, we consider only frequent subtrees for arbitrary graph databases. This restriction alone, however, does not suffice to make the problem tractable. We therefore reduce the mining problem from arbitrary graphs to forests by replacing each graph with a polynomially sized forest obtained from a random sample of its spanning trees. This results in an incomplete mining algorithm, but we prove that the probability of missing a frequent subtree pattern is low, and we show empirically that this holds in practice even for very small forests. As a result, our algorithm is able to mine frequent subtrees in a range of graph databases where state-of-the-art exact frequent subgraph mining systems fail to produce patterns in reasonable time, or at all. Furthermore, the predictive performance of our patterns is comparable to that of exact frequent connected subgraphs, where available.

    The above method considers polynomially many spanning trees for the forest, while many graphs have exponentially many spanning trees; this exponential gap can reduce the number of patterns found by our mining algorithm. We hence propose a method that can (implicitly) consider forests of exponential size while remaining computationally tractable, which results in a higher recall for our incomplete mining algorithm. Furthermore, these methods extend the known positive results on the tractability of exact frequent subtree mining to a novel class of transaction graphs. We conjecture that the next natural extension of our results to a larger class of transaction graphs is at least as difficult as resolving whether P = NP.

    Regarding the graph embedding step, we apply a strategy similar to the one used in the mining step: we represent a novel graph by a forest of its spanning trees and decide whether the frequent trees from the mining step are subgraph-isomorphic to this forest. As a result, the embedding computation has one-sided error with respect to the exact subgraph isomorphism test, but is computationally tractable. Furthermore, we show that we can leverage a partial order on the pattern set; this structure can be used to reduce the runtime of the embedding computation dramatically.
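    As a rough illustration of the forest construction in the mining step, the following Python sketch samples spanning trees by running Kruskal's algorithm over random edge permutations. This is a hedged stand-in only: the function names are invented, the sampling is not uniform over spanning trees (uniform sampling would require, e.g., Wilson's algorithm), and the thesis' actual sampling scheme is not reproduced here.

import random

def random_spanning_tree(nodes, edges):
    # Kruskal over a random edge permutation; each kept edge merges
    # two components tracked by a union-find structure.
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    shuffled = list(edges)
    random.shuffle(shuffled)
    tree = []
    for u, v in shuffled:
        ru, rv = find(u), find(v)
        if ru != rv:  # edge connects two components: keep it
            parent[ru] = rv
            tree.append((u, v))
    return tree

def spanning_forest(nodes, edges, k):
    # Replace a graph by a forest of k sampled spanning trees.
    return [random_spanning_tree(nodes, edges) for _ in range(k)]

# A 4-cycle with a chord has several spanning trees; a few samples
# already cover much of its subtree structure.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(spanning_forest(nodes, edges, k=3))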
    For the special case of the Jaccard similarity between graph embeddings, a further substantial reduction of the runtime can be achieved using min-hashing: the Jaccard distance can be approximated by small sketch vectors that can be computed quickly, again using the partial order on the tree patterns.
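    The min-hashing step can be illustrated with the standard MinHash construction below, a minimal Python sketch assuming features are integer pattern indices. The hash parameters and function names are invented, and the order-aware sketch computation over the tree-pattern poset described above is not reproduced.

import random

def make_minhash(num_hashes, seed=0):
    # num_hashes random affine hash functions over a prime field;
    # this is the standard MinHash construction.
    rng = random.Random(seed)
    p = 2_147_483_647  # Mersenne prime, larger than any pattern index
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]

    def sketch(feature_set):
        # One minimum per hash function over the set's elements.
        return [min((a * x + b) % p for x in feature_set) for a, b in coeffs]

    return sketch

def estimate_jaccard(s1, s2):
    # The fraction of agreeing positions estimates |A ∩ B| / |A ∪ B|.
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)

# Two binary embeddings, written as sets of pattern indices.
A, B = {1, 2, 3, 5, 8}, {2, 3, 5, 7}
sketch = make_minhash(num_hashes=256)
print(estimate_jaccard(sketch(A), sketch(B)))  # close to 3/6 = 0.5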

    Non-parametric Methods for Correlation Analysis in Multivariate Data with Applications in Data Mining

    In this thesis, we develop novel methods for correlation analysis in multivariate data, with a special focus on mining correlated subspaces. Our methods address major open challenges that arise when combining correlation analysis with subspace mining. Beyond traditional correlation analysis, we explore interaction-preserving discretization of multivariate data and causality analysis. We conduct experiments on a variety of real-world data sets; the results validate the benefits of our methods.

    Maximal frequent sequences applied to drug-drug interaction extraction

    A drug-drug interaction (DDI) occurs when the effects of a drug are modified by the presence of other drugs. DDIs can decrease the therapeutic benefit or efficacy of treatments, which can have very harmful consequences for the patient's health and may even cause the patient's death. Since knowing the interactions between prescribed drugs is of great clinical importance, it is essential to keep databases up to date with respect to newly reported DDIs. In this thesis we aim to build a system that assists healthcare professionals in staying up to date with published drug-drug interactions. The goal of this thesis is to study a method based on maximal frequent sequences (MFS) and machine learning techniques in order to automatically detect interactions between drugs in the pharmacological and medical literature. With these methods, the IT community can help the healthcare community update their drug interaction databases in a fast and semi-automatic way. In a first solution, we classify pharmacological sentences depending on whether or not they describe a drug-drug interaction, which makes it possible to automatically find sentences containing drug-drug interactions. This solution is based entirely on maximal frequent sequences (MFS) extracted from a set of test documents. In a second solution, based on machine learning, we go further and perform DDI extraction, determining whether two specific drugs appearing in a sentence interact or not. This can be used as an assisting tool to populate databases with drug-drug interactions. The machine learning classifier, a Random Forest, is trained with several features: bag of words, word categories, MFS, token- and character-level features, and drug-level features. This system was submitted to the DDIExtraction 2011 competition and reached 6th position. Finally, we introduce Maximal Frequent Discriminative Sequences (MFDS), a novel method of sequential pattern discovery that extends the concept of MFS to adapt it to classification tasks.

    García Blasco, S. (2012). Maximal frequent sequences applied to drug-drug interaction extraction. http://hdl.handle.net/10251/15342
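    As a rough sketch of the second solution, the following Python example trains a Random Forest on bag-of-words features with scikit-learn. The sentences and labels are invented, and the MFS, token- and character-level, and drug-level features of the actual system are omitted.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Invented toy sentences; label 1 means the sentence describes a DDI.
sentences = [
    "Aspirin increases the anticoagulant effect of warfarin.",
    "Ibuprofen was administered twice daily.",
    "Ketoconazole raises plasma concentrations of midazolam.",
    "The patient reported mild headaches.",
]
labels = [1, 0, 1, 0]

# Bag-of-words (unigrams and bigrams) feeding a Random Forest.
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(sentences, labels)
print(clf.predict(["Rifampicin reduces the efficacy of oral contraceptives."]))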

    Proceedings of the International Workshop "What can FCA do for Artificial Intelligence?" (FCA4AI 2014)

    This is the third edition of the FCA4AI workshop, whose first edition was organized at the ECAI 2012 conference (Montpellier, August 2012) and whose second edition was organized at the IJCAI 2013 conference (Beijing, August 2013, see http://www.fca4ai.hse.ru/). Formal Concept Analysis (FCA) is a mathematically well-founded theory aimed at data analysis and classification that can be used for many purposes, especially for Artificial Intelligence (AI) needs. The objective of the workshop is to investigate two main issues: how FCA can support various AI activities (knowledge discovery, knowledge representation and reasoning, learning, data mining, NLP, information retrieval), and how FCA can be extended in order to help AI researchers solve new and complex problems in their domains.

    Methods for Redescription Mining

    In scientific investigations, data oftentimes have different natures. For instance, they might originate from distinct sources or be cast in separate terminologies. In order to gain insight into the phenomenon of interest, a natural task is to identify the correspondences that exist between these different aspects. This is the motivating idea of redescription mining, the data analysis task studied in this thesis. Redescription mining aims to find distinct common characterizations of the same objects and, vice versa, to identify sets of objects that admit multiple shared descriptions. A practical example in biology consists in finding geographical areas that admit two characterizations, one in terms of their climatic profile and one in terms of the occupying species. For instance, Scandinavia and the Baltic regions share similar temperature and precipitation conditions, and the moose is a species common to both areas. Discovering such redescriptions can contribute to a better understanding of the influence of climate on species distribution. Besides biology, applications of redescription mining can be envisaged in medicine or sociology, among other fields; in medicine, for example, redescriptions could relate patients' background information to their symptoms and diagnoses, potentially improving our understanding of the diseases themselves. Previously, redescription mining was restricted to propositional queries over Boolean attributes. However, many conditions, like the aforementioned climate, cannot be expressed naturally in this limited formalism. In this thesis, we consider more general query languages and propose algorithms to find the corresponding redescriptions, making the task relevant to a broader range of domains and problems. Specifically, we start by extending redescription mining to non-Boolean attributes; in other words, we propose an algorithm to handle nominal and real-valued attributes natively. We then extend redescription mining to the relational setting, where the aim is to find corresponding connection patterns that relate almost the same object tuples in a network. We also study approaches for selecting high-quality redescriptions to be output by the mining process. The first approach relies on an interface for mining and visualizing redescriptions interactively and allows the analyst to tailor the selection of results to his needs. The second approach, rooted in information theory, is a parameter-free, compression-based method for mining small sets of associations from two-view datasets. In summary, we take redescription mining outside the Boolean world and show its potential as a powerful exploratory method relevant in a broad range of domains.
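    The core notion, two queries over two views of the same objects scored by the overlap of their supports, can be illustrated with a toy Python sketch. All attribute names, thresholds, and data below are invented; the Jaccard-style scoring is the usual accuracy measure in the redescription mining literature, used here for illustration only.

# View 1: climatic profile per geographical area (invented values).
climate = {
    "area1": {"t_max": 18.0, "precip": 620},
    "area2": {"t_max": 17.5, "precip": 700},
    "area3": {"t_max": 31.0, "precip": 90},
}
# View 2: species occupying each area (invented values).
species = {
    "area1": {"moose", "lynx"},
    "area2": {"moose", "wolf"},
    "area3": {"camel"},
}

# Query over view 1: real-valued conditions (the non-Boolean extension).
q1 = {a for a, v in climate.items() if v["t_max"] <= 20 and v["precip"] >= 500}
# Query over view 2: a propositional condition.
q2 = {a for a, s in species.items() if "moose" in s}

# A redescription is good when both queries select nearly the same objects.
jaccard = len(q1 & q2) / len(q1 | q2)
print(q1, q2, jaccard)  # both describe {area1, area2}; Jaccard = 1.0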

    Advanced Document Description, a Sequential Approach

    To be able to perform efficient document processing, information systems need to use simple models of documents that can be processed in a small number of operations. This problem of document representation is not trivial. For decades, researchers have tried to combine relevant document representations with efficient processing. Documents are commonly represented by vectors in which each dimension corresponds to a word of the document. This approach is termed “bag of words”, as it entirely ignores the relative positions of the words. One natural improvement over this representation is the extraction and use of cohesive word sequences. In this dissertation, we consider the problem of the extraction, selection and exploitation of word sequences, with a particular focus on the applicability of our work to domain-independent document collections written in any language.
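    The contrast between the bag-of-words representation and cohesive word sequences can be sketched in a few lines of Python. The frequent contiguous n-grams below are a simple, invented stand-in for the richer sequence extraction and selection criteria studied in the dissertation.

from collections import Counter

def bag_of_words(tokens):
    # Order-free representation: token -> count.
    return Counter(tokens)

def frequent_sequences(docs, n=2, min_support=2):
    # Contiguous n-grams occurring in at least min_support documents,
    # a simple stand-in for cohesive word sequences.
    support = Counter()
    for tokens in docs:
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        support.update(grams)
    return {g for g, c in support.items() if c >= min_support}

docs = [
    "information retrieval systems process documents".split(),
    "modern information retrieval relies on simple models".split(),
]
print(bag_of_words(docs[0]))
print(frequent_sequences(docs))  # {('information', 'retrieval')}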

    Data Structures for Efficient String Algorithms

    This thesis deals with data structures that are mostly useful in the area of string matching and string mining. Our main result is an O(n)-time preprocessing scheme for an array of n numbers such that subsequent queries asking for the position of a minimum element in a specified interval can be answered in constant time (so-called RMQs, for Range Minimum Queries). The space for this data structure is 2n+o(n) bits, which is shown to be asymptotically optimal in a general setting. This improves all previous results on this problem. The main techniques for deriving this result rely on combinatorial properties of arrays and so-called Cartesian Trees. For compressible input arrays we show that further space can be saved without affecting the time bounds. For the two-dimensional variant of the RMQ problem we give a preprocessing scheme with quasi-optimal time bounds, but with an asymptotic increase in space consumption by a factor of log(n).

    It is well known that algorithms for answering RMQs in constant time are useful for many different algorithmic tasks (e.g., the computation of lowest common ancestors in trees); in the second part of this thesis we give several new applications of the RMQ problem. We show that our preprocessing scheme for RMQ (and a variant thereof) leads to improvements in the space and time consumption of the Enhanced Suffix Array, a collection of arrays that can be used for many tasks in pattern matching. In particular, we will see that, in conjunction with the suffix and LCP arrays, 2n+o(n) bits of additional space (coming from our RMQ scheme) are sufficient to find all occ occurrences of a (usually short) pattern of length m in a (usually long) text of length n in O(m*s+occ) time, where s denotes the size of the alphabet. This is certainly optimal if the size of the alphabet is constant; for non-constant alphabets we can improve this to O(m*log(s)+occ) locating time, replacing our original scheme with a data structure of size approximately 2.54n bits. Again using RMQs, we then show how to solve frequency-related string mining tasks in optimal time.

    In a final chapter we propose a space- and time-optimal algorithm for computing suffix arrays on texts that are logically divided into words, if one is just interested in finding all word-aligned occurrences of a pattern. Apart from the theoretical improvements made in this thesis, most of our algorithms are also of practical value; we underline this fact by empirical tests and comparisons on real-world problem instances. In most cases our algorithms outperform previous approaches in all respects.
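    To illustrate the RMQ query semantics, here is a classical sparse-table sketch in Python with O(n log n) preprocessing and constant-time queries. It is a textbook structure shown for orientation only, not the 2n+o(n)-bit scheme of the thesis, whose Cartesian-tree machinery is not reproduced here.

class SparseTableRMQ:
    def __init__(self, a):
        self.a = a
        n = len(a)
        # table[j][i] = index of a minimum element in a[i : i + 2**j]
        self.table = [list(range(n))]
        j = 1
        while (1 << j) <= n:
            prev = self.table[j - 1]
            half = 1 << (j - 1)
            row = [prev[i] if a[prev[i]] <= a[prev[i + half]] else prev[i + half]
                   for i in range(n - (1 << j) + 1)]
            self.table.append(row)
            j += 1

    def query(self, l, r):
        # Index of a minimum element in a[l..r] (inclusive): cover the
        # interval with two possibly overlapping power-of-two blocks.
        j = (r - l + 1).bit_length() - 1
        left = self.table[j][l]
        right = self.table[j][r - (1 << j) + 1]
        return left if self.a[left] <= self.a[right] else right

rmq = SparseTableRMQ([3, 1, 4, 1, 5, 9, 2, 6])
print(rmq.query(2, 6))  # 3, the position of the minimum value 1 in a[2..6]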

    A Comprehensive Geospatial Knowledge Discovery Framework for Spatial Association Rule Mining

    Continuous advances in modern data collection techniques help spatial scientists gain access to massive, high-resolution spatial and spatio-temporal data. There is thus an urgent need for effective and efficient methods to find unknown and useful information embedded in datasets of unprecedentedly large size (e.g., millions of observations), high dimensionality (e.g., hundreds of variables), and complexity (e.g., heterogeneous data sources, space-time dynamics, multivariate connections, explicit and implicit spatial relations and interactions). Responding to this line of development, this research focuses on the utilization of association rule (AR) mining for the geospatial knowledge discovery process. Prior attempts have sidestepped the complexity of the spatial dependence structure embedded in the studied phenomenon, which makes adopting association rule mining in spatial analysis rather problematic. Interestingly, a very similar predicament afflicts spatial regression analysis with a spatial weight matrix that is assigned a priori, without validation for the specific domain of application. Moreover, a dependable geospatial knowledge discovery process necessitates algorithms supporting automatic, robust, and accurate procedures for the evaluation of mined results, which has received surprisingly little attention in the context of spatial association rule mining. To remedy these deficiencies, the foremost goal of this research is to construct a comprehensive geospatial knowledge discovery framework using spatial association rule mining for the detection of spatial patterns embedded in geospatial databases, and to demonstrate its application within the domain of crime analysis. It is the first attempt at delivering a complete geospatial knowledge discovery framework using spatial association rule mining.
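    As a toy illustration of the spatial association rule idea, the following Python sketch treats each areal unit as a transaction whose items mix its own attributes with spatial-relation predicates, and scores simple one-item rules by the usual support and confidence. All item names, data, and thresholds are invented.

from itertools import combinations

# Each areal unit becomes a transaction; items like "near(bar)" encode
# spatial relations alongside the unit's own attributes.
transactions = [
    {"high_crime", "near(bar)", "low_income"},
    {"high_crime", "near(bar)"},
    {"low_crime", "near(park)", "high_income"},
    {"high_crime", "low_income"},
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Enumerate simple rules {a} -> {b} above support/confidence thresholds.
items = set().union(*transactions)
for a, b in combinations(sorted(items), 2):
    for lhs, rhs in ((a, b), (b, a)):
        s = support({lhs, rhs})
        if s >= 0.5 and s / support({lhs}) >= 0.8:
            print(f"{lhs} -> {rhs}  support={s:.2f} "
                  f"confidence={s / support({lhs}):.2f}")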