37 research outputs found

    Learning what matters - Sampling interesting patterns

    In the field of exploratory data mining, local structure in data can be described by patterns and discovered by mining algorithms. Although many solutions have been proposed to address the redundancy problems in pattern mining, most of them either provide succinct pattern sets or take the interests of the user into account, but not both. Consequently, the analyst has to invest substantial effort in identifying those patterns that are relevant to her specific interests and goals. To address this problem, we propose a novel approach that combines pattern sampling with interactive data mining. In particular, we introduce the LetSIP algorithm, which builds upon recent advances in 1) weighted sampling in SAT and 2) learning to rank in interactive pattern mining. Specifically, it exploits user feedback to directly learn the parameters of the sampling distribution that represents the user's interests. We compare the performance of the proposed algorithm to the state-of-the-art in interactive pattern mining by emulating the interests of a user. The resulting system allows efficient and interleaved learning and sampling, and thus user-specific, anytime data exploration. Finally, LetSIP demonstrates favourable trade-offs concerning both quality-diversity and exploitation-exploration when compared to existing methods. Comment: PAKDD 2017, extended version.

    Robust subgroup discovery

    We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) are non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, and that includes traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, as finding optimal subgroup lists is NP-hard, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. This criterion is shown to be equivalent to a Bayesian one-sample proportions, multinomial, or t-test between the subgroup and dataset marginal target distributions, plus a multiple hypothesis testing penalty. We empirically show on 54 datasets that SSD++ outperforms previous subgroup set discovery methods in terms of quality and subgroup list size. Comment: For associated code, see https://github.com/HMProenca/RuleList ; submitted to Data Mining and Knowledge Discovery Journal.

    Interactive data analysis and its applications on multi-structured datasets

    Ph.D. (Doctor of Philosophy)

    Elements About Exploratory, Knowledge-Based, Hybrid, and Explainable Knowledge Discovery

    Knowledge Discovery in Databases (KDD) and especially pattern mining can be interpreted along several dimensions, namely data, knowledge, problem-solving and interactivity. These dimensions are not disconnected and have a direct impact on the quality, applicability, and efficiency of KDD. Accordingly, we discuss some objectives of KDD based on these dimensions, namely exploration, knowledge orientation, hybridization, and explanation. The data space and the pattern space can be explored in several ways, depending on specific evaluation functions and heuristics, possibly related to domain knowledge. Furthermore, numerical data are complex and supervised numerical machine learning methods are usually the best candidates for efficiently mining such data. However, the work and output of numerical methods are most of the time hard to understand, while symbolic methods are usually more intelligible. This calls for hybridization, combining numerical and symbolic mining methods to improve the applicability and interpretability of KDD. Moreover, suitable explanations about the operating models and possible subsequent decisions should complete KDD, and this is far from being the case at the moment. For illustrating these dimensions and objectives, we analyze a concrete case about the mining of biological data, where we characterize these dimensions and their connections. We also discuss dimensions and objectives in the framework of Formal Concept Analysis and we draw some perspectives for future research.

    Parallel Processing Algorithms for Efficient Skyline Query Processing on Big Data

    Doctoral thesis, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2017. Kyuseok Shim (심규석).

    The skyline operator and its variants, such as the dynamic skyline, reverse skyline and probabilistic skyline operators, have attracted considerable attention recently due to their broad applications. However, computing a skyline is challenging today since we have to deal with big data. For data-intensive applications, the MapReduce framework has been widely used recently. In this dissertation, we propose efficient parallel algorithms for processing skyline, dynamic skyline, reverse skyline and probabilistic skyline queries using MapReduce. For the skyline, dynamic skyline and reverse skyline queries, we first build quadtree-based histograms to prune out non-skyline points. We next partition data based on the regions divided by the histograms and compute candidate skyline points for each partition using MapReduce. Finally, in every partition, we check whether each skyline candidate point is actually a skyline point or not using MapReduce. For the probabilistic skyline query, we first introduce three filtering techniques to prune out points that are not probabilistic skyline points. Then, we build a quadtree-based histogram and split data into partitions according to the regions divided by the quadtree. We finally compute the probabilistic skyline points for each partition using MapReduce. We also develop workload balancing methods so that the estimated execution times of all available machines are similar. We conducted experiments to compare our algorithms with the state-of-the-art algorithms using MapReduce and confirmed the effectiveness as well as the scalability of our proposed skyline algorithms.

    Contents:
    1 Introduction: 1.1 Motivation; 1.2 Contributions of This Dissertation; 1.3 Dissertation Overview
    2 Related Work: 2.1 Skyline Queries; 2.2 Reverse Skyline Queries; 2.3 Probabilistic Skyline Queries
    3 Background: 3.1 Skyline and Its Variants; 3.2 MapReduce Framework
    4 Parallel Skyline Query Processing: 4.1 SKY-MR: Our Skyline Computation Algorithm; 4.1.1 SKY-QTREE: The Sky-Quadtree Building Algorithm; 4.1.2 L-SKY-MR: The Local Skyline Computation Algorithm; 4.1.3 G-SKY-MR: The Global Skyline Computation Algorithm; 4.2 Experiment; 4.2.1 Performance Results for Skylines; 4.2.2 Performance Results in Other Environments
    5 Parallel Reverse Skyline Query Processing: 5.1 RSKY-MR: Our Reverse Skyline Computation Algorithm; 5.1.1 RSKY-QTREE: The Rsky-Quadtree Building Algorithm; 5.1.2 Computations of Reverse Skylines using Rsky-Quadtrees; 5.1.3 L-RSKY-MR: The Local Reverse Skyline Computation Algorithm; 5.1.4 G-RSKY-MR: The Global Reverse Skyline Computation Algorithm; 5.2 Experiment; 5.2.1 Performance Results for Reverse Skylines
    6 Parallel Probabilistic Skyline Query Processing: 6.1 Early Pruning Techniques; 6.1.1 Upper-bound Filtering; 6.1.2 Zero-probability Filtering; 6.1.3 Dominance-Power Filtering; 6.2 Utilization of a PS-QTREE for Pruning; 6.2.1 Generating a PS-QTREE; 6.2.2 Exploiting a PS-QTREE for Filtering; 6.2.3 Partitioning Objects by a PS-QTREE; 6.3 PS-QPF-MR: Our Algorithm with Quadtree Partitioning and Filtering; 6.3.1 Optimizations of PS-QPF-MR; 6.3.2 Sample Size and Split Threshold of a PSQtree; 6.4 PS-BRF-MR: Our Algorithm with Random Partitioning and Filtering; 6.5 Experiments; 6.5.1 Performance Results for Probabilistic Skylines
    7 Conclusion
    Bibliography
    Abstract (In Korean)

    Degree: Doctor
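
The local-then-global structure described in this abstract can be illustrated with a minimal sketch. It is not the dissertation's SKY-MR algorithm: it builds no quadtree histogram, runs no MapReduce job, partitions points round-robin, and assumes that smaller values are better in every dimension. The names `dominates`, `skyline`, and `partitioned_skyline` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """True if p is no worse than q in every dimension and strictly
    better in at least one (assumption: smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Naive quadratic skyline: keep points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def partitioned_skyline(points: Sequence[Point], num_partitions: int) -> List[Point]:
    """Two-phase skyline: local skylines per partition, then a global pass.

    Phase 1 stands in for a per-partition (map-side) computation,
    phase 2 for the global check over the surviving candidates.
    """
    # Round-robin partitioning (an assumption; the thesis partitions by quadtree regions).
    partitions = [points[i::num_partitions] for i in range(num_partitions)]
    # Phase 1: every global skyline point survives in its own partition.
    candidates = [p for part in partitions for p in skyline(part)]
    # Phase 2: remove candidates dominated by candidates from other partitions.
    return skyline(candidates)


if __name__ == "__main__":
    data = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4), (2, 6)]
    print(sorted(partitioned_skyline(data, num_partitions=2)))
```

The two-phase result matches the single-pass skyline because dominance is transitive: any point discarded locally is dominated by a point that either survives or is itself dominated by a survivor.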

    Anytime Discovery of a Diverse Set of Patterns with Monte Carlo Tree Search

    The discovery of patterns that accurately discriminate one class label from another remains a challenging data mining task. Subgroup discovery (SD) is one of the frameworks that makes it possible to elicit such interesting patterns from labeled data. A question remains fairly open: how to select an accurate heuristic search technique when exhaustive enumeration of the pattern space is infeasible? Existing approaches make use of beam search, sampling, and genetic algorithms for discovering a pattern set that is non-redundant and of high quality w.r.t. a pattern quality measure. We argue that such approaches produce pattern sets that lack diversity: only a few patterns that are of high quality and different enough are discovered. Our main contribution is then to formally define pattern mining as a game and to solve it with Monte Carlo tree search (MCTS). It can be seen as an exhaustive search guided by random simulations which can be stopped early (limited budget) by virtue of its best-first search property. We show through a comprehensive set of experiments how MCTS enables the anytime discovery of a diverse pattern set of high quality. It outperforms other approaches when dealing with a large pattern search space and for different quality measures. Thanks to its genericity, our MCTS approach can be used for SD but also for many other pattern mining tasks.

    A portfolio theory approach to ease navigation task of users

    How users interact with Information Retrieval (IR) systems is a topic of interest in the fields of Human-Computer Interaction (HCI) and IR. With the ever-increasing amount of information on the web, users are often lost in the vast information space. Navigating this complex information space to find the required information is often a difficult task for users. One of the reasons is the difficulty of designing systems that present the user with an optimal set of navigation options to support varying information needs. As a solution to the navigation problem, in this thesis we propose a method referred to as interaction portfolio theory, based on Markowitz's 'Modern Portfolio Theory', a theory of finance. It provides users with N optimal interaction options in each iteration, taking into account not only the user's goal expressed via interaction during the task, but also the risk related to a potentially suboptimal choice made by the user. In each iteration, the proposed method learns the relevant interaction options from user behaviour interactively and optimizes relevance and diversity to allow the user to accomplish the task in a shorter interaction sequence. This theory can be applied to any IR system to help users retrieve the required information efficiently.
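
The trade-off between relevance and risk named in this abstract can be sketched with a generic mean-variance selection in the spirit of Markowitz. This is an illustration only, not the thesis's method: the greedy strategy, the equal weighting of selected options, and the names `select_options`, `relevance`, `covariance`, and `risk_aversion` are all assumptions. The sketch greedily grows a set S that maximizes the sum of relevance scores minus `risk_aversion` times the sum of all pairwise covariances within S.

```python
from typing import List, Sequence


def select_options(
    relevance: Sequence[float],
    covariance: Sequence[Sequence[float]],
    n: int,
    risk_aversion: float = 1.0,
) -> List[int]:
    """Greedily pick up to n option indices under a mean-variance objective.

    Assumes a symmetric covariance matrix. The marginal gain of adding
    option i to the already selected set S is
        relevance[i] - risk_aversion * (cov[i][i] + 2 * sum(cov[i][j] for j in S)).
    """
    selected: List[int] = []
    remaining = set(range(len(relevance)))
    while remaining and len(selected) < n:

        def gain(i: int) -> float:
            penalty = covariance[i][i] + 2 * sum(covariance[i][j] for j in selected)
            return relevance[i] - risk_aversion * penalty

        # Sorting makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    relevance = [0.9, 0.85, 0.5]
    # Options 0 and 1 are highly correlated (similar), option 2 is independent.
    covariance = [
        [0.1, 0.4, 0.0],
        [0.4, 0.1, 0.0],
        [0.0, 0.0, 0.1],
    ]
    print(select_options(relevance, covariance, n=2))
    print(select_options(relevance, covariance, n=2, risk_aversion=0.0))
```

With a positive `risk_aversion`, the second pick skips the option that is strongly correlated with the first, which is how a covariance penalty promotes diversity. With `risk_aversion=0.0` the selection reduces to ranking by relevance alone.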

    Listening to Museums: Sounds as objects of culture and curatorial care

    This practice-based project begins with an exploration of the acoustic environments of a variety of contemporary museums via field recording and sound mapping. Through a critical listening practice, this mapping leads to a central question: can sounds act as objects analogous to physical objects within museum practice – and if so, what is at stake in creating a museum that only exhibits sounds? Given the interest in collection and protection of intangible culture within contemporary museum practice, as well as the evolving anthropological view of sound as an object of human culture, this project suggests that a re-definition of Pierre Schaeffer’s oft-debated term ‘sound object’ within the context of museum practice may be of use in re-imagining how sounds might be able to function within traditionally object-based museum exhibition practices. Furthermore, the longstanding notion of ‘soundmarks’ – sounds that reoccur within local communities which help to define their unique cultural identity – is explored as a means by which post-industrial sounds, such as traffic signals for the visually impaired and those made by public transport, may be considered deserving of protection by museum practitioners. These ideas are then tested via creative practice by establishing an experimental curatorial project, The Museum of Portable Sound (MOPS), an institution dedicated to collecting, preserving, and exhibiting sounds as objects of culture and human agency. MOPS displays sounds, collected via the author’s field recording practice, as museological objects that, like the physical objects described by Stephen Greenblatt, ‘resonate’ with the outside world – but also with each other, via their careful selection and sequencing that calls back to the mix tape culture of the late twentieth century. The unconventional form of MOPS – digital audio files on a single mobile phone accompanied by a museum ‘map’ and Gallery Guide – emphasizes social connections between the virtual and the physical. The project presents a viable format via which sounds may be displayed as culture while also interrogating what a museum can be in the twenty-first century.

    Development as a Battlefield

    Development as a Battlefield is an innovative exploration of conflict and development, phenomena that are often regarded as ostensibly antagonistic. It invites readers to reconsider socio-political and economic developments in the MENA region and beyond. Readership: Academic libraries and institutional libraries, scholars and post-graduate students, development and policy specialists and practitioners interested in the development-conflict nexus, global security, international relations, development cycles, development policy and practice, citizens’ role in society, anthropology, history, political sociology and political economy, globalisation, migration, women, the Middle East and North Africa