1,681 research outputs found

    Interactive Constrained Association Rule Mining

    Full text link
    We investigate ways to support interactive mining sessions, in the setting of association rule mining. In such sessions, users specify conditions (queries) on the associations to be generated. Our approach is a combination of the integration of querying conditions inside the mining phase, and the incremental querying of already generated associations. We present several concrete algorithms and compare their performance.Comment: A preliminary report on this work was presented at the Second International Conference on Knowledge Discovery and Data Mining (DaWaK 2000

    Enabling Graph Analysis Over Relational Databases

    Get PDF
    Complex interactions and systems can be modeled by analyzing the connections between underlying entities or objects described by a dataset. These relationships form networks (graphs), the analysis of which has been shown to provide tremendous value in areas ranging from retail to many scientific domains. This value is obtained by using various methodologies from network science-- a field which focuses on studying network representations in the real world. In particular "graph algorithms", which iteratively traverse a graph's connections, are often leveraged to gain insights. To take advantage of the opportunity presented by graph algorithms, there have been a variety of specialized graph data management systems, and analysis frameworks, proposed in recent years, which have made significant advances in efficiently storing and analyzing graph-structured data. Most datasets however currently do not reside in these specialized systems but rather in general-purpose relational database management systems (RDBMS). A relational or similarly structured system is typically governed by a schema of varying strictness that implements constraints and is meticulously designed for the specific enterprise. Such structured datasets contain many relationships between the entities therein, that can be seen as latent or "hidden" graphs that exist inherently inside the datasets. However, these relationships can only typically be traversed via conducting expensive JOINs using SQL or similar languages. Thus, in order for users to efficiently traverse these latent graphs to conduct analysis, data needs to be transformed and migrated to specialized systems. This creates barriers that hinder and discourage graph analysis; our vision is to break these barriers. In this dissertation we investigate the opportunities and challenges involved in efficiently leveraging relationships within data stored in structured databases. First, we present GraphGen, a lightweight software layer that is independent from the underlying database, and provides interfaces for graph analysis of data in RDBMSs. GraphGen is the first such system that introduces an intuitive high-level language for specifying graphs of interest, and utilizes in-memory graph representations to tackle the problems associated with analyzing graphs that are hidden inside structured datasets. We show GraphGen can analyze such graphs in orders of magnitude less memory, and often computation time, while eliminating manual Extract-Transform-Load (ETL) effort. Second, we examine how in-memory graph representations of RDBMS data can be used to enhance relational query processing. We present a novel, general framework for executing GROUP BY aggregation over conjunctive queries which avoids materialization of intermediate JOIN results, and wrap this framework inside a multi-way relational operator called Join-Agg. We show that Join-Agg can compute aggregates over a class of relational and graph queries using orders of magnitude less memory and computation time

    Portinari: A Data Exploration Tool to Personalize Cervical Cancer Screening

    Full text link
    Socio-technical systems play an important role in public health screening programs to prevent cancer. Cervical cancer incidence has significantly decreased in countries that developed systems for organized screening engaging medical practitioners, laboratories and patients. The system automatically identifies individuals at risk of developing the disease and invites them for a screening exam or a follow-up exam conducted by medical professionals. A triage algorithm in the system aims to reduce unnecessary screening exams for individuals at low-risk while detecting and treating individuals at high-risk. Despite the general success of screening, the triage algorithm is a one-size-fits all approach that is not personalized to a patient. This can easily be observed in historical data from screening exams. Often patients rely on personal factors to determine that they are either at high risk or not at risk at all and take action at their own discretion. Can exploring patient trajectories help hypothesize personal factors leading to their decisions? We present Portinari, a data exploration tool to query and visualize future trajectories of patients who have undergone a specific sequence of screening exams. The web-based tool contains (a) a visual query interface (b) a backend graph database of events in patients' lives (c) trajectory visualization using sankey diagrams. We use Portinari to explore diverse trajectories of patients following the Norwegian triage algorithm. The trajectories demonstrated variable degrees of adherence to the triage algorithm and allowed epidemiologists to hypothesize about the possible causes.Comment: Conference paper published at ICSE 2017 Buenos Aires, at the Software Engineering in Society Track. 10 pages, 5 figure

    Spatial Data Quality in the IoT Era:Management and Exploitation

    Get PDF
    Within the rapidly expanding Internet of Things (IoT), growing amounts of spatially referenced data are being generated. Due to the dynamic, decentralized, and heterogeneous nature of the IoT, spatial IoT data (SID) quality has attracted considerable attention in academia and industry. How to invent and use technologies for managing spatial data quality and exploiting low-quality spatial data are key challenges in the IoT. In this tutorial, we highlight the SID consumption requirements in applications and offer an overview of spatial data quality in the IoT setting. In addition, we review pertinent technologies for quality management and low-quality data exploitation, and we identify trends and future directions for quality-aware SID management and utilization. The tutorial aims to not only help researchers and practitioners to better comprehend SID quality challenges and solutions, but also offer insights that may enable innovative research and applications

    A Comprehensive Survey on Graph Summarization with Graph Neural Networks

    Full text link
    As large-scale graphs become more widespread, more and more computational challenges with extracting, processing, and interpreting large graph data are being exposed. It is therefore natural to search for ways to summarize these expansive graphs while preserving their key characteristics. In the past, most graph summarization techniques sought to capture the most important part of a graph statistically. However, today, the high dimensionality and complexity of modern graph data are making deep learning techniques more popular. Hence, this paper presents a comprehensive survey of progress in deep learning summarization techniques that rely on graph neural networks (GNNs). Our investigation includes a review of the current state-of-the-art approaches, including recurrent GNNs, convolutional GNNs, graph autoencoders, and graph attention networks. A new burgeoning line of research is also discussed where graph reinforcement learning is being used to evaluate and improve the quality of graph summaries. Additionally, the survey provides details of benchmark datasets, evaluation metrics, and open-source tools that are often employed in experimentation settings, along with a discussion on the practical uses of graph summarization in different fields. Finally, the survey concludes with a number of open research challenges to motivate further study in this area.Comment: 20 pages, 4 figures, 3 tables, Journal of IEEE Transactions on Artificial Intelligenc

    Probabilistic Inductive Querying Using ProbLog

    Get PDF
    We study how probabilistic reasoning and inductive querying can be combined within ProbLog, a recent probabilistic extension of Prolog. ProbLog can be regarded as a database system that supports both probabilistic and inductive reasoning through a variety of querying mechanisms. After a short introduction to ProbLog, we provide a survey of the different types of inductive queries that ProbLog supports, and show how it can be applied to the mining of large biological networks.Peer reviewe
    corecore