    Decomposable families of itemsets

    The problem of selecting a small yet high-quality subset of patterns from a larger collection of itemsets has recently attracted a lot of research. Here we discuss an approach to this problem using the notion of decomposable families of itemsets. Such itemset families define a probabilistic model for the data from which the original collection of itemsets was derived. Furthermore, they induce a special tree structure, called a junction tree, familiar from the theory of Markov Random Fields. The method has several advantages. The junction trees provide an intuitive representation of the mining results. From the computational point of view, the model provides leverage for problems that could be intractable using the entire collection of itemsets. We provide an efficient algorithm to build decomposable itemset families, and give an application example with frequency bound querying using the model. An empirical study shows that our algorithm yields high-quality results.
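
To make the junction tree notion concrete, the sketch below checks the running intersection property from the graphical-models literature: for every item, the tree nodes (itemsets) that contain it must form a connected subtree. This is an illustrative check only, not the construction algorithm of the paper; the function name `has_running_intersection` and the toy itemsets are assumptions made for the example.

```python
def has_running_intersection(cliques, tree_edges):
    """Check the junction-tree property for a tree whose nodes are itemsets.

    `cliques` is a list of sets; `tree_edges` is a list of index pairs that
    is assumed to already form a tree over those indices.
    """
    for item in set().union(*cliques):
        nodes = {i for i, clique in enumerate(cliques) if item in clique}
        # Count the tree edges whose two endpoints both contain the item.
        inner_edges = sum(1 for u, v in tree_edges if u in nodes and v in nodes)
        # A sub-forest with n nodes is connected exactly when it has n - 1 edges.
        if inner_edges != len(nodes) - 1:
            return False
    return True


# Hypothetical toy family of itemsets over the items a, b, c, d.
cliques = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
chain = [(0, 1), (1, 2)]  # ab - bc - cd
star = [(0, 2), (1, 2)]   # ab - cd - bc

print(has_running_intersection(cliques, chain))
print(has_running_intersection(cliques, star))
```

The chain arrangement satisfies the property, while the star arrangement does not, because the two itemsets containing `b` are separated by an itemset that lacks it.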

    Advances in Mining Binary Data: Itemsets as Summaries

    Mining frequent itemsets is one of the most popular topics in data mining. Itemsets are local patterns, representing frequently co-occurring sets of variables. This thesis studies the use of itemsets to give information about the whole dataset. We show how to use itemsets for answering queries, that is, finding out the number of transactions satisfying some given formula. While this is a simple procedure given the original data, the task becomes computationally infeasible if we seek the solution using only the itemsets. By making some assumptions about the structure of the itemsets and applying techniques from the theory of Markov Random Fields, we are able to reduce the computational burden of query answering. We can also use the known itemsets to predict the unknown itemsets. The difference between the prediction and the actual value can be used for ranking itemsets. In fact, this method can be seen as a generalisation of ranking itemsets based on their deviation from the independence model, an approach commonly used in the data mining literature. The next contribution is to use itemsets to define a distance between datasets. We achieve this by computing the difference between the frequencies of the itemsets. We take into account the fact that the itemset frequencies may be correlated, and by removing the correlation we show that our distance transforms into the Euclidean distance between the frequencies of parity formulae. The last contribution concerns calculating the effective dimension of binary data. We apply fractal dimension, a known concept that works well with real-valued data. Applying fractal dimension directly is problematic because of the unique nature of binary data. We propose a solution to this problem by introducing a new concept called normalised correlation dimension. 
We study our approach theoretically and empirically by comparing it against other methods.
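
As an illustration of the baseline that the abstract generalises, the following sketch ranks itemsets by their deviation from the independence model: the observed frequency of an itemset minus the product of the frequencies of its individual items. The transactions and the helper names `frequency` and `independence_deviation` are assumptions made for this example; the thesis's own prediction method is not reproduced here.

```python
from itertools import combinations


def frequency(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def independence_deviation(itemset, transactions):
    """Observed frequency minus the product of the single-item frequencies."""
    expected = 1.0
    for item in itemset:
        expected *= frequency(frozenset([item]), transactions)
    return frequency(itemset, transactions) - expected


# Hypothetical toy dataset: each transaction is a set of items.
transactions = [frozenset(t) for t in (
    {"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"c"}, {"a"}, {"b", "c"},
)]

items = sorted(set().union(*transactions))
pairs = [frozenset(p) for p in combinations(items, 2)]

# Rank pairs by the absolute size of the deviation, largest first.
ranked = sorted(
    pairs,
    key=lambda s: abs(independence_deviation(s, transactions)),
    reverse=True,
)
for itemset in ranked:
    print(sorted(itemset), round(independence_deviation(itemset, transactions), 4))
```

In this toy dataset the pair `{a, c}` occurs less often than independence predicts and is ranked first, while `{b, c}` matches the independence prediction and is ranked last.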

    Mining complex data and biclustering using formal concept analysis

    Knowledge discovery in databases (KDD) is a process applied to possibly large volumes of data to discover patterns that can be significant and useful. In this thesis, we are interested in the data transformation and data mining steps of knowledge discovery applied to complex data, and we present several experiments related to different approaches and different data types. The first part of this thesis focuses on the task of biclustering using formal concept analysis (FCA) and pattern structures. FCA is naturally related to biclustering, where the objective is to simultaneously group rows and columns that satisfy some regularities. Pattern structures are a generalization of FCA that works on more complex data. Partition pattern structures were proposed to discover constant-column biclusters, while interval pattern structures were studied for similar-column biclustering. Here we extend these approaches to enumerate other types of biclusters: additive, multiplicative, order-preserving, and coherent-sign-changes. The second part of this thesis focuses on two experiments in mining complex data. First, we present a contribution related to the CrossCult project, where we analyze a dataset of visitor trajectories in a museum. We apply sequence clustering and FCA-based sequential pattern mining to discover patterns in the dataset and to classify these trajectories. This analysis can be used within the CrossCult project to build recommendation systems for future visitors. Second, we present our work related to the task of antibacterial drug discovery. The dataset for this task is generally a numerical matrix with molecules as rows and features/attributes as columns. The huge number of features makes molecule classification harder for any classifier. 
Here we study a feature selection approach based on log-linear analysis which discovers associations among features. As a synthesis, this thesis presents a series of different experiments in the mining of complex real-world data.
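
The bicluster types named in the abstract can be illustrated with two simple membership checks. The sketch below assumes the usual definitions from the biclustering literature: a constant-column bicluster has identical values within each selected column, and an additive bicluster has rows that differ from one another by a row-specific offset. It only tests a given submatrix and is not the FCA-based enumeration studied in the thesis; the matrix and function names are hypothetical.

```python
def is_constant_column(matrix, rows, cols):
    """True if, within every selected column, all selected rows share one value."""
    return all(len({matrix[r][c] for r in rows}) == 1 for c in cols)


def is_additive(matrix, rows, cols):
    """True if every selected row equals the first one plus a single offset."""
    base = rows[0]
    for r in rows[1:]:
        offsets = {matrix[r][c] - matrix[base][c] for c in cols}
        if len(offsets) != 1:
            return False
    return True


# Hypothetical integer matrix (rows are objects, columns are attributes).
matrix = [
    [1, 5, 3, 9],
    [1, 5, 3, 2],
    [4, 8, 6, 7],
    [0, 5, 3, 9],
]

print(is_constant_column(matrix, [0, 1], [0, 1, 2]))  # rows 0 and 1 agree on columns 0-2
print(is_additive(matrix, [0, 2], [0, 1, 2]))         # row 2 is row 0 plus 3 on columns 0-2
print(is_constant_column(matrix, [0, 2], [0, 1, 2]))  # but those rows are not identical
```

Under these definitions a constant-column bicluster is the special case of an additive bicluster in which every offset is zero.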

    Document analysis by means of data mining techniques

    The huge amount of textual data produced every day by scientists, journalists and Web users makes it possible to investigate many different aspects of the information stored in published documents. Data mining and information retrieval techniques are exploited to manage and extract information from huge amounts of unstructured textual data. Text mining, also known as text data mining, is the process of extracting high-quality information (in terms of relevance, novelty and interestingness) from text by identifying patterns. It typically involves structuring the input text, by means of parsing and other linguistic features or sometimes by removing extraneous data, and then finding patterns in the structured data. Finally, the patterns are evaluated and the output is interpreted to accomplish the desired task. Recently, text mining has attracted attention in several fields, such as security (analysis of Internet news), commerce (search and indexing) and academia (query answering). Beyond searching for documents that contain the words given in a user query, text mining may provide a direct answer to the user through Semantic Web techniques that take into account the meaning of the content and its context. It can also act as an intelligence analyst, and it is used in some email spam filters to filter out unwanted material. Text mining usually includes tasks such as clustering, categorization, sentiment analysis, entity recognition, entity relation modeling and document summarization. In particular, summarization approaches are suitable for identifying relevant sentences that describe the main concepts presented in a document dataset. Furthermore, the knowledge contained in the most informative sentences can be employed to improve the understanding of user and/or community interests. Different approaches have been proposed to extract summaries from unstructured text documents. 
Some of them are based on the statistical analysis of linguistic features by means of supervised machine learning or data mining methods, such as Hidden Markov models, neural networks and Naive Bayes methods. An appealing research field is the extraction of summaries tailored to the major user interests. In this context, extracting useful information according to domain knowledge related to the user interests is a challenging task. The main topics of this work have been the study and design of novel data representations and data mining algorithms useful for managing and extracting knowledge from unstructured documents. This thesis describes an effort to investigate the application of data mining approaches that are firmly established for transactional data (e.g., frequent itemset mining) to textual documents. Frequent itemset mining is a widely used exploratory technique to discover hidden correlations that frequently occur in the source data. Although its application to transactional data is well established, the usage of frequent itemsets in textual document summarization has never been investigated so far. We exploit frequent itemsets for the purpose of multi-document summarization and present a novel multi-document summarizer, namely ItemSum (Itemset-based Summarizer), which relies on an itemset-based model, i.e., a framework composed of frequent itemsets extracted from the document collection. Highly representative and non-redundant sentences are selected for the summary by considering both a sentence relevance score, based on tf-idf statistics, and sentence coverage with respect to a concise and highly informative itemset-based model. To evaluate the performance of ItemSum, a suite of experiments on a collection of news articles has been performed. The results show that ItemSum significantly outperforms the most widely used previous summarizers in terms of precision, recall, and F-measure. 
We also validated our approach against a large number of approaches on the DUC’04 document collection. Performance comparisons, in terms of precision, recall, and F-measure, have been performed by means of the ROUGE toolkit. In most cases, ItemSum significantly outperforms the considered competitors. Furthermore, the impact of both the main algorithm parameters and the adopted model coverage strategy on the summarization performance is investigated as well. In some cases, the soundness and readability of the generated summaries are unsatisfactory, because the summaries do not effectively cover all the semantically relevant data facets. A step forward towards the generation of more accurate summaries has been made by semantics-based summarizers. Such approaches combine the use of general-purpose summarization strategies with ad hoc linguistic analysis. The key idea is to also consider the semantics behind the document content to overcome the limitations of general-purpose strategies in differentiating between sentences based on their actual meaning and context. Most of the previously proposed approaches perform the semantics-based analysis as a preprocessing step that precedes the main summarization process. Therefore, the generated summaries may not entirely reflect the actual meaning and context of the key document sentences. In contrast, we aim at tightly integrating the ontology-based document analysis into the summarization process in order to take the semantic meaning of the document content into account during the sentence evaluation and selection processes. With this in mind, we propose a new multi-document summarizer, namely the Yago-based Summarizer, that integrates an established ontology-based entity recognition and disambiguation step. Named Entity Recognition based on the Yago ontology is used for the task of text summarization. The Named Entity Recognition (NER) task is concerned with marking occurrences of a specific object being mentioned. 
These mentions are then classified into a set of predefined categories. Standard categories include “person”, “location”, “geo-political organization”, “facility”, “organization”, and “time”. The use of NER in text summarization improved the summarization process by increasing the rank of informative sentences. To demonstrate the effectiveness of the proposed approach, we compared its performance on the DUC’04 benchmark document collections with that of a large number of state-of-the-art summarizers. Furthermore, we also performed a qualitative evaluation of the soundness and readability of the generated summaries and a comparison with the results that were produced by the most effective summarizers. A parallel effort has been devoted to integrating semantics-based models and the knowledge acquired from social networks into a document summarization model named SociONewSum. The effort addresses the sentence-based generic multi-document summarization problem, which can be formulated as follows: given a collection of news articles ranging over the same topic, the goal is to extract a concise yet informative summary, which consists of the most salient document sentences. An established ontological model has been used to improve summarization performance by integrating a textual entity recognition and disambiguation step. Furthermore, the analysis of the user-generated content (UGC) coming from Twitter has been exploited to discover current social trends and improve the appeal of the generated summaries. An experimental evaluation of the SociONewSum performance was conducted on real English-written news article collections and Twitter posts. The achieved results demonstrate the effectiveness of the proposed summarizer, in terms of different ROUGE scores, compared to state-of-the-art open source summarizers as well as to a baseline version of the SociONewSum summarizer that does not perform any UGC analysis. 
Furthermore, the readability of the generated summaries has also been analyzed.
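
The idea behind an itemset-based summarizer such as ItemSum, as described above, can be sketched in a few lines: mine frequent term sets from the sentences, then greedily pick sentences that cover as many of those term sets as possible, using a relevance score to break ties. The sketch below is a simplified illustration and not the ItemSum algorithm itself. It assumes brute-force mining of itemsets up to size two, a relevance score that sums idf weights over the terms of a sentence (a simplification of tf-idf), and hypothetical function names and toy sentences.

```python
import math
from itertools import combinations


def frequent_itemsets(sentences, min_support, max_size=2):
    """Brute-force mining of term sets found in at least `min_support` sentences."""
    terms = sorted(set().union(*sentences))
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(terms, size):
            itemset = frozenset(combo)
            if sum(1 for s in sentences if itemset <= s) >= min_support:
                found.append(itemset)
    return found


def relevance(sentence, sentences):
    """Sum of idf weights of the sentence's terms (each term counted once)."""
    n = len(sentences)
    return sum(math.log(n / sum(1 for s in sentences if t in s)) for t in sentence)


def summarize(sentences, min_support=2, max_sentences=2):
    """Greedily choose sentence indices that cover the frequent itemsets."""
    uncovered = set(frequent_itemsets(sentences, min_support))
    chosen = []
    limit = min(max_sentences, len(sentences))
    while uncovered and len(chosen) < limit:
        candidates = [i for i in range(len(sentences)) if i not in chosen]
        # Prefer the largest coverage gain; break ties by relevance.
        best = max(candidates, key=lambda i: (
            sum(1 for it in uncovered if it <= sentences[i]),
            relevance(sentences[i], sentences),
        ))
        gain = {it for it in uncovered if it <= sentences[best]}
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical toy collection: each sentence is reduced to its set of terms.
sentences = [
    frozenset({"market", "stocks", "fell"}),
    frozenset({"stocks", "fell", "sharply"}),
    frozenset({"weather", "sunny"}),
    frozenset({"market", "rally", "weather"}),
]
print(summarize(sentences))
```

The greedy choice keeps the summary non-redundant, because a sentence that only repeats already covered itemsets yields no gain and is never selected.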