16 research outputs found

    5th international workshop on knowledge discovery in inductive databases (KDID2006): Workshop report

    No full text
    This report presents a review of the 5th International Workshop on Knowledge Discovery in Inductive Databases (KDID'06), which was organized by the authors and held in Berlin, Germany, on September 18, 2006, in conjunction with ECML/PKDD'06, the 17th European Conference on Machine Learning and the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases. The goal of the workshop was to bring together the researchers that are interested in the area of inductive databases, inductive queries, constraint-based data mining, and data mining query languages.status: publishe

    Abduction and Anonymity in Data Mining

    Get PDF
    This thesis investigates two new research problems that arise in modern data mining: reasoning on data mining results, and privacy implication of data mining results. Most of the data mining algorithms rely on inductive techniques, trying to infer information that is generalized from the input data. But very often this inductive step on raw data is not enough to answer the user questions, and there is the need to process data again using other inference methods. In order to answer high level user needs such as explanation of results, we describe an environment able to perform abductive (hypothetical) reasoning, since often the solutions of such queries can be seen as the set of hypothesis that satisfy some requirements. By using cost-based abduction, we show how classification algorithms can be boosted by performing abductive reasoning over the data mining results, improving the quality of the output. Another growing research area in data mining is the one of privacy-preserving data mining. Due to the availability of large amounts of data, easily collected and stored via computer systems, new applications are emerging, but unfortunately privacy concerns make data mining unsuitable. We study the privacy implications of data mining in a mathematical and logical context, focusing on the anonymity of people whose data are analyzed. A formal theory on anonymity preserving data mining is given, together with a number of anonymity-preserving algorithms for pattern mining. The post-processing improvement on data mining results (w.r.t. utility and privacy) is the central focus of the problems we investigated in this thesis

    Motifs séquentiels pour la description de séries temporelles d'images satellitaires et la prévision d'événements

    Get PDF
    Les travaux prĂ©sentĂ©s concernent l’extraction de connaissances dans les donnĂ©es Ă  des ïŹns de description et d’infĂ©rence. Comment dĂ©crire des SĂ©ries Temporelles d’Images Satellitaire (STIS) en mode non supervisĂ© ? Comment prĂ©voir des Ă©vĂ©nements tels que des pannes dans des systĂšmes complexes ? Des rĂ©ponses originales s’appuyant sur des techniques de fouille de donnĂ©es extrayant des motifs locaux, les motifs sĂ©quentiels, sont dĂ©veloppĂ©es. Ainsi, de nouveaux motifs, les motifs SĂ©quentiels FrĂ©quents GroupĂ©s (motifs SFG), sont-ils proposĂ©s aïŹn d’extraire d’une STIS des groupes de pixels faisant sens spatialement et temporellement. Une technique originale permettant de pousser les contraintes associĂ©es Ă  ces motifs au sein du processus d’extraction est Ă©galement dĂ©taillĂ©e. Des expĂ©riences sur des donnĂ©es optiques et radar, Ă  des rĂ©solutions diïŹ€Ă©rentes, conïŹrment leur potentiel. Un classement de ces motifs basĂ© sur l’information mutuelle et la swap-randomization est par ailleurs proposĂ© aïŹn de mettre en avant les motifs ayant peu de chances d’apparaĂźtre dans un jeu de donnĂ©es alĂ©atoires oĂč les frĂ©quences sont conservĂ©es, exprimant des changements et progressant dans l’espace. Quant Ă  la prĂ©vision d’évĂ©nements, une approche de type leave-one-out est proposĂ©e pour sĂ©lectionner des motifs sĂ©quentiels, les FLM-rĂšgles, gĂ©nĂ©riques et dĂ©clenchant le moins possible de fausses alarmes. Une mĂ©thode de prĂ©vision au plus tĂŽt tirant parti de ces motifs est Ă©galement avancĂ©e et validĂ©e sur des donnĂ©es rĂ©elles provenant de systĂšmes mĂ©caniques complexes. Les expĂ©riences menĂ©es montrent qu’il est possible de prĂ©voir des dĂ©faillances pour lesquelles l’expertise technique est insuïŹƒsante. Cette mĂ©thode de prĂ©vision est aujourd’hui brevetĂ©e

    FCAIR 2012 Formal Concept Analysis Meets Information Retrieval Workshop co-located with the 35th European Conference on Information Retrieval (ECIR 2013) March 24, 2013, Moscow, Russia

    Get PDF
    International audienceFormal Concept Analysis (FCA) is a mathematically well-founded theory aimed at data analysis and classifiation. The area came into being in the early 1980s and has since then spawned over 10000 scientific publications and a variety of practically deployed tools. FCA allows one to build from a data table with objects in rows and attributes in columns a taxonomic data structure called concept lattice, which can be used for many purposes, especially for Knowledge Discovery and Information Retrieval. The Formal Concept Analysis Meets Information Retrieval (FCAIR) workshop collocated with the 35th European Conference on Information Retrieval (ECIR 2013) was intended, on the one hand, to attract researchers from FCA community to a broad discussion of FCA-based research on information retrieval, and, on the other hand, to promote ideas, models, and methods of FCA in the community of Information Retrieval

    Summarization Techniques for Pattern Collections in Data Mining

    Get PDF
    Discovering patterns from data is an important task in data mining. There exist techniques to find large collections of many kinds of patterns from data very efficiently. A collection of patterns can be regarded as a summary of the data. A major difficulty with patterns is that pattern collections summarizing the data well are often very large. In this dissertation we describe methods for summarizing pattern collections in order to make them also more understandable. More specifically, we focus on the following themes: 1) Quality value simplifications. 2) Pattern orderings. 3) Pattern chains and antichains. 4) Change profiles. 5) Inverse pattern discovery.Comment: PhD Thesis, Department of Computer Science, University of Helsink

    Exploiting semantic web knowledge graphs in data mining

    Full text link
    Data Mining and Knowledge Discovery in Databases (KDD) is a research field concerned with deriving higher-level insights from data. The tasks performed in that field are knowledge intensive and can often benefit from using additional knowledge from various sources. Therefore, many approaches have been proposed in this area that combine Semantic Web data with the data mining and knowledge discovery process. Semantic Web knowledge graphs are a backbone of many information systems that require access to structured knowledge. Such knowledge graphs contain factual knowledge about real word entities and the relations between them, which can be utilized in various natural language processing, information retrieval, and any data mining applications. Following the principles of the Semantic Web, Semantic Web knowledge graphs are publicly available as Linked Open Data. Linked Open Data is an open, interlinked collection of datasets in machine-interpretable form, covering most of the real world domains. In this thesis, we investigate the hypothesis if Semantic Web knowledge graphs can be exploited as background knowledge in different steps of the knowledge discovery process, and different data mining tasks. More precisely, we aim to show that Semantic Web knowledge graphs can be utilized for generating valuable data mining features that can be used in various data mining tasks. Identifying, collecting and integrating useful background knowledge for a given data mining application can be a tedious and time consuming task. Furthermore, most data mining tools require features in propositional form, i.e., binary, nominal or numerical features associated with an instance, while Linked Open Data sources are usually graphs by nature. Therefore, in Part I, we evaluate unsupervised feature generation strategies from types and relations in knowledge graphs, which are used in different data mining tasks, i.e., classification, regression, and outlier detection. As the number of generated features grows rapidly with the number of instances in the dataset, we provide a strategy for feature selection in hierarchical feature space, in order to select only the most informative and most representative features for a given dataset. Furthermore, we provide an end-to-end tool for mining the Web of Linked Data, which provides functionalities for each step of the knowledge discovery process, i.e., linking local data to a Semantic Web knowledge graph, integrating features from multiple knowledge graphs, feature generation and selection, and building machine learning models. However, we show that such feature generation strategies often lead to high dimensional feature vectors even after dimensionality reduction, and also, the reusability of such feature vectors across different datasets is limited. In Part II, we propose an approach that circumvents the shortcomings introduced with the approaches in Part I. More precisely, we develop an approach that is able to embed complete Semantic Web knowledge graphs in a low dimensional feature space, where each entity and relation in the knowledge graph is represented as a numerical vector. Projecting such latent representations of entities into a lower dimensional feature space shows that semantically similar entities appear closer to each other. We use several Semantic Web knowledge graphs to show that such latent representation of entities have high relevance for different data mining tasks. Furthermore, we show that such features can be easily reused for different datasets and different tasks. In Part III, we describe a list of applications that exploit Semantic Web knowledge graphs, besides the standard data mining tasks, like classification and regression. We show that the approaches developed in Part I and Part II can be used in applications in various domains. More precisely, we show that Semantic Web graphs can be exploited for analyzing statistics, building recommender systems, entity and document modeling, and taxonomy induction. %In Part III, we focus on semantic annotations in HTML pages, which are another realization of the Semantic Web vision. Semantic annotations are integrated into the code of HTML pages using markup languages, like Microformats, RDFa, and Microdata. While such data covers various domains and topics, and can be useful for developing various data mining applications, additional steps of cleaning and integrating the data need to be performed. In this thesis, we describe a set of approaches for processing long literals and images extracted from semantic annotations in HTML pages. We showcase the approaches in the e-commerce domain. Such approaches contribute in building and consuming Semantic Web knowledge graphs

    A framework for trend mining with application to medical data

    Get PDF
    This thesis presents research work conducted in the field of knowledge discovery. It presents an integrated trend-mining framework and SOMA, which is the application of the trend-mining framework in diabetic retinopathy data. Trend mining is the process of identifying and analysing trends in the context of the variation of support of the association/classification rules that have been extracted from longitudinal datasets. The integrated framework concerns all major processes from data preparation to the extraction of knowledge. At the pre-process stage, data are cleaned, transformed if necessary, and sorted into time-stamped datasets using logic rules. At the next stage, time-stamp datasets are passed through the main processing, in which the ARM technique of matrix algorithm is applied to identify frequent rules with acceptable confidence. Mathematical conditions are applied to classify the sequences of support values into trends. Afterwards, interestingness criteria are applied to obtain interesting knowledge, and a visualization technique is proposed that maps how objects are moving from the previous to the next time stamp. A validation and verification (external and internal validation) framework is described that aims to ensure that the results at the intermediate stages of the framework are correct and that the framework as a whole can yield results that demonstrate causality. To evaluate the thesis, SOMA was developed. The dataset is, in itself, also of interest, as it is very noisy (in common with other similar medical datasets) and does not feature a clear association between specific time stamps and subsets of the data. The Royal Liverpool University Hospital has been a major centre for retinopathy research since 1991. Retinopathy is a generic term used to describe damage to the retina of the eye, which can, in the long term, lead to visual loss. Diabetic retinopathy is used to evaluate the framework, to determine whether SOMA can extract knowledge that is already known to the medics. The results show that those datasets can be used to extract knowledge that can show causality between patients’ characteristics such as the age of patient at diagnosis, type of diabetes, duration of diabetes, and diabetic retinopathy

    Process Mining Handbook

    Get PDF
    This is an open access book. This book comprises all the single courses given as part of the First Summer School on Process Mining, PMSS 2022, which was held in Aachen, Germany, during July 4-8, 2022. This volume contains 17 chapters organized into the following topical sections: Introduction; process discovery; conformance checking; data preprocessing; process enhancement and monitoring; assorted process mining topics; industrial perspective and applications; and closing
    corecore