48 research outputs found

    Categorization of interestingness measures for knowledge extraction

    Finding interesting association rules is an important and active research field in data mining. The algorithms of the Apriori family are based on two rule extraction measures, support and confidence. Although these two measures have the virtue of being algorithmically fast, they generate a prohibitive number of rules, most of which are redundant or irrelevant. Further measures are therefore needed to filter out uninteresting rules. Several survey studies have examined interestingness measures from various points of view, and several reported studies have sought to identify "good" properties of rule extraction measures; these properties have been assessed on 61 measures. The purpose of this paper is twofold: first, to extend the number of measures and properties studied, in addition to formalizing the properties proposed in the literature; second, in the light of this formal study, to categorize the studied measures. The paper thus identifies categories of measures in order to help users efficiently select one or more appropriate measures during the knowledge extraction process. Evaluating the properties of the 61 measures enabled us to identify 7 classes of measures, obtained using two different clustering techniques. Comment: 34 pages, 4 figures
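The support/confidence scheme described above can be sketched in a few lines; the transactions and item names below are purely hypothetical, and lift is shown only as one example of an additional measure that can filter rules confidence alone would keep:

```python
# Hedged sketch of the support/confidence framework of the Apriori family,
# plus lift as one extra filtering measure. Data is purely illustrative.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item of `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Estimated P(consequent | antecedent).
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    # Confidence normalised by the consequent's base rate; values below 1
    # flag rules whose high confidence merely reflects a frequent consequent.
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

a, b = frozenset({"bread"}), frozenset({"milk"})
s = support(a | b, transactions)    # 0.6
c = confidence(a, b, transactions)  # 0.75
l = lift(a, b, transactions)        # 0.9375: "bread -> milk" is slightly anti-correlated
```

This illustrates the paper's starting point: a rule can pass a support/confidence filter while a further measure, here lift, still marks it as uninteresting.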

    Choosing a good quality measure: a condition for the success of a data mining process

    Our discussion concerns supervised and unsupervised learning by rule induction. Data mining succeeds when we manage to extract from the data knowledge that is new, valid, actionable, etc. (Fayyad et al. (1996), Kodratoff et al. (2001)). One key to success is, of course, the choice of an algorithm well suited to the characteristics of the data and to the type of knowledge sought: for example, association rules in the unsupervised setting; decision trees, class association rules, and naive Bayes in the supervised setting. Success, however, also depends on other factors, notably data preparation (data representation, outliers, redundant variables) and the choice of a good measure for evaluating the quality of the extracted knowledge, both while the algorithm runs and in the final evaluation of the results. It is this last factor that we address here. By way of introduction, we briefly discuss the problem of data representation. Then, after recalling the principles of mining association rules (Agrawal and Srikant (1994)) and interesting class association rules (Liu et al. (1998)), we show, through a few examples, how the results vary with the chosen interestingness measure, whether by comparing the resulting preorders or by computing the best rules (Vaillant et al., 2004). These examples illustrate that no measure is intrinsically good; rather, different measures, depending on their properties, are more or less well suited to the user's goal. A measure favours a particular type of knowledge, which constitutes a learning bias that we illustrate with the Jaccard measure (Plasse et al. (2007)).

We then propose a synthesis of the work on quality measures for association rules, presenting the main criteria for evaluating measures and showing concretely the role each criterion plays in the behaviour of the measures (e.g. Lenca et al. (2003), Tan et al. (2004), Geng and Hamilton (2006), Lenca et al. (2008), Suzuki (2008), Guillaume et al. (2010), Lerman and Guillaume (2010), Gras and Couturier (2010); we also refer the reader to the volumes edited by Guillet and Hamilton (2007) and Zhao et al. (2009)). We illustrate the link between the measures' properties on the retained criteria and their behaviour on a number of rule bases (Vaillant et al., 2004). Beyond these criteria, which benchmark the properties of measures, we present other very important selection criteria. First, we consider the algorithmic properties of measures, so that interesting patterns can be extracted by working directly on the measure under consideration, without fixing a support threshold, which gives access to nuggets of knowledge (Wang et al. (2001), Xiong et al. (2003), Li (2006), Le Bras et al. (2009), Le Bras et al. (2009), Le Bras et al. (2010)). We exhibit algebraic conditions on a measure's formula that guarantee that a pruning criterion can be associated with the measure. We then address the problem of evaluating the robustness of rules under the chosen measure (Azé and Kodratoff (2002), Cadot (2005), Gras et al. (2007), Le Bras et al. (2010)). Finally, we treat the case of imbalanced data (Weiss and Provost (2003)) in tree-based learning (Chawla (2003)) and show how choosing an appropriate measure yields an algorithmic solution to this problem that significantly improves the overall error rate, precision, and recall (Zighed et al. (2007), Lenca et al. (2008)). If the minority class is to be favoured, this solution can be further improved by introducing, into the label-assignment procedure operating on each leaf of the tree, a suitable interestingness measure that replaces the majority rule (Ritschard et al. (2007), Pham et al. (2008)). A discussion of quality measures for rule bases is presented in (Holena, 2009). Ultimately, how can we help users choose the measure best suited to their project? We propose an assistance procedure that, once the user has specified the properties expected of a measure, returns the most appropriate measures (Lenca et al. (2008))
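The claim that no measure is intrinsically good can be made concrete with a small numeric example; the counts below are hypothetical and simply chosen so that confidence and the Jaccard measure rank the same two rules in opposite orders:

```python
# Two rules A -> B summarised by their counts: n_a transactions match A,
# n_b match B, n_ab match both. Counts are illustrative only.

def confidence(n_a, n_ab):
    # Estimated P(B | A).
    return n_ab / n_a

def jaccard(n_a, n_b, n_ab):
    # |A and B| / |A or B| over the supporting transaction sets:
    # penalises rules whose consequent occurs mostly without the antecedent.
    return n_ab / (n_a + n_b - n_ab)

r1 = dict(n_a=100, n_b=800, n_ab=90)   # rare antecedent, very common consequent
r2 = dict(n_a=400, n_b=420, n_ab=320)  # antecedent and consequent largely overlap

conf1 = confidence(r1["n_a"], r1["n_ab"])  # 0.90
conf2 = confidence(r2["n_a"], r2["n_ab"])  # 0.80
jac1 = jaccard(**r1)                       # 90 / 810  ~ 0.11
jac2 = jaccard(**r2)                       # 320 / 500 = 0.64
```

Confidence prefers the first rule, Jaccard the second: exactly the kind of preorder disagreement, and the learning bias of the Jaccard measure, that the talk illustrates.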

    Exploratory search in time-oriented primary data

    In a variety of research fields, primary data that describes scientific phenomena in an original condition is obtained. Time-oriented primary data, in particular, is an indispensable data type, derived from complex measurements depending on time. Today, time-oriented primary data is collected at rates that exceed the domain experts' abilities to seek valuable information undiscovered in the data. It is widely accepted that the magnitudes of uninvestigated data will disclose tremendous knowledge in data-driven research, provided that domain experts are able to gain insight into the data. Domain experts involved in data-driven research therefore urgently require analytical capabilities. In scientific practice, the predominant activities are the generation and validation of hypotheses; in analytical terms, these activities are often expressed as confirmatory and exploratory data analysis. Ideally, analytical support would combine the strengths of both types of activity. Exploratory search (ES) is a concept that seamlessly includes information-seeking behaviors ranging from search to exploration. ES supports domain experts both in gaining an understanding of huge and potentially unknown data collections and in drilling down to relevant subsets, e.g., to validate hypotheses. As such, ES combines the predominant tasks of domain experts applied to data-driven research. For the design of useful and usable ES systems (ESS), data scientists have to incorporate different sources of knowledge and technology. Of particular importance is the state of the art in interactive data visualization and data analysis. Research on these factors is at the heart of Information Visualization (IV) and Visual Analytics (VA). Approaches in IV and VA provide meaningful visualization and interaction designs, allowing domain experts to perform the information-seeking process in an effective and efficient way. Today, best-practice ESS exist almost exclusively for textual data content, e.g., put into practice in digital libraries to facilitate the reuse of digital documents. For time-oriented primary data, ES mainly remains at a theoretical state.

Motivation and Problem Statement. This thesis is motivated by two main assumptions. First, we expect that ES will have a tremendous impact on data-driven research in many research fields. In this thesis, we focus on time-oriented primary data as a complex and important data type for data-driven research. Second, we assume that research conducted in IV and VA will particularly facilitate ES. For time-oriented primary data, however, novel concepts and techniques are required that enhance the design and application of ESS. In particular, we observe a lack of methodological research in ESS for time-oriented primary data. In addition, the size, complexity, and quality of time-oriented primary data hamper content-based access, as well as the design of visual interfaces for gaining an overview of the data content. Furthermore, the question arises how ESS can incorporate techniques for seeking relations between data content and metadata to foster data-driven research. Overarching challenges for data scientists are to create usable and useful designs, urgently requiring the involvement of the targeted user group, and to provide support techniques for choosing meaningful algorithmic models and model parameters. Throughout this thesis, we resolve these challenges from conceptual, technical, and systemic perspectives. In turn, domain experts can benefit from novel ESS as powerful analytical support for conducting data-driven research.

Concepts for Exploratory Search Systems (Chapter 3). We postulate concepts for ES in time-oriented primary data. Based on a survey of analysis tasks supported in IV and VA research, we present a comprehensive selection of tasks and techniques relevant for search and exploration activities. The assembly guides data scientists in the choice of meaningful techniques presented in IV and VA. Furthermore, we present a reference workflow for the design and application of ESS for time-oriented primary data. The workflow divides the data processing and transformation process into four steps, and thus divides the complexity of the design space into manageable parts. In addition, the reference workflow describes how users can be involved in the design. The reference workflow is the framework for the technical contributions of this thesis.

Visual-Interactive Preprocessing of Time-Oriented Primary Data (Chapter 4). We present a visual-interactive system that enables users to construct workflows for preprocessing time-oriented primary data. In this way, we introduce a means of providing content-based access. Based on a rich set of preprocessing routines, users can create individual solutions for data cleansing, normalization, segmentation, and other preprocessing tasks. In addition, the system supports the definition of time series descriptors and time series distance measures. Guidance concepts support users in assessing workflow generalizability, which is important for large data sets. Executing a workflow transforms time-oriented primary data into feature vectors, which can subsequently be used for downstream search and exploration techniques. We demonstrate the applicability of the system in usage scenarios and case studies.

Content-Based Overviews (Chapter 5). We introduce novel guidelines and techniques for the design of content-based overviews. The three key factors are the creation of meaningful data aggregates, the visual mapping of these aggregates into the visual space, and the view transformation providing layouts of these aggregates in the display space. For each of these steps, we characterize important visualization and interaction design parameters allowing the involvement of users. We introduce guidelines supporting data scientists in choosing meaningful solutions. In addition, we present novel visual-interactive quality assessment techniques enhancing the choice of algorithmic models and model parameters. Finally, we present visual interfaces enabling users to formulate visual queries of the time-oriented data content. In this way, we provide means of combining content-based exploration with content-based search.

Relation Seeking Between Data Content and Metadata (Chapter 6). We present novel visual interfaces enabling domain experts to seek relations between data content and metadata. These interfaces can be integrated into ESS to bridge analytical gaps between the data content and attached metadata. In three different approaches, we focus on different types of relations and define algorithmic support to guide users towards the most interesting relations. Furthermore, each of the three approaches comprises individual visualization and interaction designs, enabling users to explore both the data and the relations in an efficient and effective way. We demonstrate the applicability of our interfaces with usage scenarios, each conducted together with domain experts. The results confirm that our techniques are beneficial for seeking relations between data content and metadata, particularly for data-centered research.

Case Studies - Exploratory Search Systems (Chapter 7). In two case studies, we put our concepts and techniques into practice. We present two ESS constructed in design studies with real users, real ES tasks, and real time-oriented primary data collections. The web-based VisInfo ESS is a digital library system facilitating visual access to time-oriented primary data content. A content-based overview enables users to explore large collections of time series measurements and serves as a baseline for content-based queries by example. In addition, VisInfo provides a visual interface for querying time-oriented data content by sketch. A result visualization combines different views of the data content and metadata with faceted search functionality. The MotionExplorer ESS supports domain experts in human motion analysis. Two content-based overviews enhance the exploration of large collections of human motion capture data from two perspectives. MotionExplorer provides a search interface allowing domain experts to query human motion sequences by example. Retrieval results are depicted in a visual-interactive view enabling the exploration of variations of human motions. Field study evaluations performed for both ESS confirm the applicability of the systems in the environment of the involved user groups. The systems yield a significant improvement in both the effectiveness and the efficiency of the domain experts' day-to-day work. As such, both ESS demonstrate how large collections of time-oriented primary data can be reused to enhance data-centered research.

In essence, our contributions cover the entire time series analysis process, from accessing raw time-oriented primary data, through processing and transforming time series data, to visual-interactive analysis of time series. We present visual search interfaces providing content-based access to time-oriented primary data. In a series of novel exploration-support techniques, we facilitate both gaining an overview of large and complex time-oriented primary data collections and seeking relations between data content and metadata. Throughout this thesis, we introduce VA as a means of designing effective and efficient visual-interactive systems. Our VA techniques empower data scientists to choose appropriate models and model parameters, as well as to involve users in the design. With both principles, we support the design of usable and useful interfaces that can be included in ESS. In this way, our contributions bridge the gap between search systems requiring exploration support and exploratory data analysis systems requiring visual querying capability. In the ESS presented in the two case studies, we show that our techniques and systems support data-driven research in an efficient and effective way
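The Chapter-4 idea of turning raw time-oriented primary data into feature vectors for downstream search can be sketched roughly as follows; the normalization, segmentation, and descriptor choices here are illustrative assumptions, not the thesis's actual preprocessing routines:

```python
import statistics

# Hypothetical preprocessing workflow: normalize a raw time series, split it
# into windows, and describe each window with a few summary features.
# The resulting fixed-length vector could feed distance-based search.

def z_normalize(series):
    # Remove offset and scale so series from different sensors are comparable.
    mu = statistics.fmean(series)
    sd = statistics.pstdev(series) or 1.0  # guard against constant series
    return [(x - mu) / sd for x in series]

def segment(series, window):
    # Non-overlapping windows; an incomplete tail is dropped.
    return [series[i:i + window] for i in range(0, len(series) - window + 1, window)]

def describe(window):
    # A tiny descriptor: level, spread, and net trend of the window.
    return [statistics.fmean(window), statistics.pstdev(window), window[-1] - window[0]]

def to_feature_vector(series, window=4):
    norm = z_normalize(series)
    return [f for w in segment(norm, window) for f in describe(w)]

raw = [20.1, 20.4, 21.0, 21.3, 22.0, 21.8, 21.5, 21.2]
features = to_feature_vector(raw)  # 2 windows x 3 descriptors = 6 values
```

Each preprocessing step maps onto one stage of the reference workflow: content-based access requires exactly this kind of transformation before any overview or query-by-example technique can operate.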

    Localizing the media, locating ourselves: a critical comparative analysis of socio-spatial sorting in locative media platforms (Google and Flickr 2009-2011)

    In this thesis I explore media geocoding (i.e., geotagging or georeferencing), the process of inscribing media with geographic information, a process that enables distinct forms of producing, storing, and distributing information based on location. Historically, geographic information technologies have served a biopolitical function, producing knowledge of populations. In their current guise as locative media platforms, these systems build rich databases of places, facilitated by user-generated geocoded media. These geoindexes, this thesis argues, render places, and the users of these services, subject to novel forms of computational modelling and economic capture. Thus, the possibility of tying information, people, and objects to location sets the conditions for the emergence of new communicative practices as well as new forms of governmentality (the management of populations). This project is an attempt to develop an understanding of the socio-economic forces and media regimes structuring contemporary forms of location-aware communication by carrying out a comparative analysis of two of the main current location-enabled platforms: Google and Flickr. Drawing on the medium-specific approach to media analysis characteristic of the subfield of Software Studies, together with the methodological apparatus of Cultural Analytics (data mining and visualization methods), the thesis focuses on examining how social space is coded and computed in these systems. In particular, it looks at the databases' underlying ontologies supporting the platforms' geocoding capabilities and their respective algorithmic logics. In the final analysis, the thesis argues that the way social space is translated in the form of POIs (Points of Interest) and business-biased categorizations, as well as the geodemographical ordering underpinning the way it is computed, is pivotal if we are to understand what kind of socio-spatial relations are actualized in these systems, and what modalities of governing urban mobility are enabled

    Mixture Model Clustering in the Analysis of Complex Diseases

    The topic of this thesis is the analysis of complex diseases, and specifically the use of certain clustering methods for it. We concern ourselves mostly with the modeling of the complex phenotypes of diseases: the symptoms and signs of diseases, and the multiple other co-phenotypes that accompany them. The two related questions we seek answers to are: 1) how can we use these clustering methods to summarize complex, multivariate phenotype data, for example to be used as a simple phenotype in genetic analyses, and 2) how can we use these clustering methods to find subgroups of sufferers of a particular disease that might share the same causal factors of the disease? Current methods for studies in medical genetics ideally call for a single, or at most a handful of, univariate phenotypes to be compared to genetic markers. Multidimensional phenotypes cannot be handled by the standard methods, and treating each variable as independent and testing a hundred phenotypes with an unclear true dependency structure against thousands of markers results in problems with both running times and multiple testing correction. In this work, clustering is utilized to summarize a multi-dimensional phenotype into something that can then be used in association studies of both genetic and other types of potential causes. I describe a clustering process and some clustering methods used in this work, with comments on practical issues and references to the relevant literature. After some experiments on artificial data to gain insight into the properties of these methods, I present four case studies on real data, highlighting both ways to successfully use these methods and problems that can arise in the process.

This dissertation examines the modelling of so-called complex diseases by means of mixture model clustering. Many genetic and other epidemiological methods in current use assume a univariate phenotype (for example, a person either has or does not have a given disease), but the manifestations of complex diseases are usually more complicated. The main subject of this dissertation is the modelling of the phenotypes of these complex diseases (symptoms, findings, and other co-occurring traits) using mixture model clustering methods. The aim is either to find simple descriptions of complicated diseases or to separate from the patients subgroups within which the clinical picture is very similar. This information can then be used in investigating the causes of the diseases. The dissertation charts the behaviour of these mixture model clustering methods in various situations, using as test material artificial data whose properties resemble real medical data. In addition, the application of the methods to four real medical data sets is described, illustrating both the strengths and weaknesses of this kind of research
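Mixture model clustering of the kind discussed above can be illustrated with a minimal EM fit of a two-component one-dimensional Gaussian mixture; real phenotype data is multivariate and the thesis's models are richer, so this is only a sketch of the mechanics of soft assignment and parameter updates:

```python
import math

# Minimal EM for a two-component 1-D Gaussian mixture. Data and the
# two "phenotype subgroups" it suggests are purely illustrative.

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, iters=50):
    # Crude initialisation from the data range.
    mu = [min(data), max(data)]
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            p = [pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            z = sum(p)
            resp.append([pk / z for pk in p])
        # M-step: re-estimate weights, means, and variances from responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = math.sqrt(max(var, 1e-6))  # floor keeps components alive
    return pi, mu, sigma

# Two well-separated clusters standing in for disease subgroups.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
pi, mu, sigma = em_two_gaussians(data)  # component means converge near 1.0 and 5.0
```

The responsibilities computed in the E-step are exactly the soft subgroup memberships that can then serve as a summarized phenotype in downstream association analyses.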

    Benefits of the application of web-mining methods and techniques for the field of analytical customer relationship management of the marketing function in a knowledge management perspective

    Web Mining (WM) remains a relatively little-known technology. Used appropriately, however, it proves very useful for identifying the profiles and behaviours of prospective and existing customers in an internet context. The technical advances of WM greatly improve the analytical side of Customer Relationship Management (CRM). This study follows an exploratory approach to determine whether WM alone achieves all the fundamental objectives of CRM or whether it should instead be used jointly with traditional marketing research and the classical methods of analytical CRM (aCRM) to optimise CRM, and hence marketing, in an internet context. The knowledge obtained through WM can then be managed within the organisation in a Knowledge Management (KM) framework, in order to optimise relationships with new and/or existing customers, improve their customer experience, and ultimately deliver better value to them. Within this exploratory framework, semi-structured in-depth interviews were conducted to obtain the views of several (web) data mining experts. The study revealed that WM is well suited to segmenting prospective and existing customers, to understanding the online transactional behaviour of existing and prospective customers, and to determining the loyalty (or defection) status of existing customers. As such, it is a formidably effective predictive tool through classification and estimation, and a descriptive one through segmentation and association. On the other hand, WM performs less well in understanding the underlying, less obvious dimensions of customer behaviour.

WM is less appropriate for objectives involving the description of how existing or prospective customers develop loyalty, satisfaction, defection, or attachment towards a brand on the internet. This exercise is all the more difficult because the multichannel communication in which consumers operate strongly influences the relationships they develop with a brand; online behaviour may thus be merely a transposition, or at least an extension, of the consumer's offline behaviour. WM is also a relatively incomplete tool for identifying the development of defection to and from competitors, as well as the development of loyalty towards them. WM still needs to be complemented by traditional marketing research in order to achieve these more difficult but essential objectives of aCRM. Finally, the conclusions of this research are directed mainly at firms and managers rather than at online customers, since the former, more than the latter, possess the resources and processes needed to implement the WM research projects described.

AUTHOR KEYWORDS: Web mining, Knowledge management, Customer relationship management, Internet data, Consumer behaviour, Data mining, Consumer knowledge