
    Subgroup discovery for structured target concepts

    The main object of study in this thesis is subgroup discovery, a theoretical framework for finding subgroups in data—i.e., named sub-populations—whose behaviour with respect to a specified target concept is exceptional when compared to the rest of the dataset. This is a powerful tool that conveys crucial information to a human audience, but despite past advances it has been limited to simple target concepts. In this work we propose algorithms that bring this framework to novel application domains. We introduce the concept of representative subgroups, which we use not only to ensure the fairness of a sub-population with regard to a sensitive trait, such as race or gender, but also to go beyond known trends in the data. For entities with additional relational information that can be encoded as a graph, we introduce a novel measure of robust connectedness which improves on established alternative measures of density; we then provide a method that uses this measure to discover which named sub-populations are better connected. Our contributions within subgroup discovery culminate in the introduction of kernelised subgroup discovery: a novel framework that enables the discovery of subgroups on i.i.d. target concepts with virtually any kind of structure. Importantly, our framework additionally provides a concrete and efficient tool that works out of the box without any modification, apart from specifying the Gramian of a positive definite kernel. For use within kernelised subgroup discovery, but also in any other kernel method, we additionally introduce a novel random walk graph kernel. Our kernel allows fine-tuning the alignment between the vertices of the two compared graphs while counting the random walks, and we also propose meaningful structure-aware vertex labels that exploit this new capability. With these contributions we thoroughly extend the applicability of subgroup discovery and ultimately redefine it as a kernel method.
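
    For intuition about the kind of machinery involved, the sketch below computes the classic geometric random walk kernel on a label-matched direct product graph, in which vertex labels control which vertex pairs are allowed to align; the thesis's novel kernel refines exactly this alignment step. This is our own illustrative simplification, not the thesis's kernel or code, and all function names are assumptions.

        import numpy as np

        def product_adjacency(A1, labels1, A2, labels2):
            # Direct product graph: vertices are label-matched pairs (u, v);
            # (u, v) ~ (u2, v2) iff u ~ u2 in G1 and v ~ v2 in G2.
            pairs = [(u, v) for u in range(len(labels1))
                            for v in range(len(labels2))
                            if labels1[u] == labels2[v]]
            Ax = np.zeros((len(pairs), len(pairs)))
            for i, (u, v) in enumerate(pairs):
                for j, (u2, v2) in enumerate(pairs):
                    Ax[i, j] = A1[u][u2] * A2[v][v2]
            return Ax

        def geometric_rw_kernel(A1, labels1, A2, labels2, lam=0.05):
            # Counts common walks of every length k, discounted by lam**k:
            # K = 1^T (I - lam*Ax)^(-1) 1, valid while lam < 1/spectral_radius(Ax).
            Ax = product_adjacency(A1, labels1, A2, labels2)
            if Ax.size == 0:
                return 0.0  # no label-compatible vertex pairs, hence no common walks
            one = np.ones(Ax.shape[0])
            return float(one @ np.linalg.solve(np.eye(Ax.shape[0]) - lam * Ax, one))

    A Gram matrix assembled from such pairwise kernel values is precisely the kind of input that a kernelised subgroup discovery framework, as described above, would consume.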

    Fraction-score: a generalized support measure for weighted and maximal co-location pattern mining

    Co-location patterns, which capture the phenomenon that objects with certain labels are often located in close geographic proximity, are defined based on a support measure that quantifies the prevalence of a pattern candidate in the form of a label set. Existing support measures share the idea of counting the number of instances of a given label set C as its support, where an instance of C is an object set whose objects collectively carry all labels in C and are located close to one another. However, they suffer from various weaknesses: for example, they fail to capture all possible instances, or they overlook cases where multiple instances overlap. In this paper, we propose a new measure called Fraction-Score, which counts instances fractionally when they overlap. Fraction-Score captures all possible instances and handles overlapping instances appropriately, so that the resulting supports are more meaningful and anti-monotonic. We develop efficient algorithms to solve the co-location pattern mining problem defined with Fraction-Score. Furthermore, to obtain representative patterns, we develop an efficient algorithm for mining maximal co-location patterns, i.e., those patterns without proper superset patterns. We conduct extensive experiments using real and synthetic datasets, which verify the superiority of our proposals.
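
    To make the fractional-counting idea concrete, here is a deliberately simplified sketch, our own illustration rather than the paper's exact Fraction-Score definition: all instances of a label set are enumerated, and an instance that shares objects with others is discounted by its most-reused object, so overlapping instances no longer each count as a full occurrence.

        from itertools import product
        from collections import Counter

        def instances(objects, labels, close):
            # An instance: one object per label in `labels`, pairwise close.
            # `objects` are (id, label) tuples; `close` is a proximity predicate.
            pools = [[o for o in objects if o[1] == lab] for lab in labels]
            for combo in product(*pools):
                if all(close(a, b) for i, a in enumerate(combo)
                                   for b in combo[i + 1:]):
                    yield combo

        def fractional_support(objects, labels, close):
            insts = list(instances(objects, labels, close))
            uses = Counter(o for inst in insts for o in inst)
            # An instance counts fully only if none of its objects is shared;
            # otherwise it is discounted by its most-reused object.
            return sum(1.0 / max(uses[o] for o in inst) for inst in insts)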

    Visual Analytics of Co-Occurrences to Discover Subspaces in Structured Data

    We present an approach that shows all relevant subspaces of categorical data condensed in a single picture. We model the categorical values of the attributes as co-occurrences with data partitions generated from structured data using pattern mining. We show that these co-occurrences satisfy the a-priori property, allowing us to greatly reduce the search space and effectively generate the condensed picture where conventional approaches filter out several subspaces as insignificant. The task of identifying interesting subspaces is common but difficult due to exponential search spaces and the curse of dimensionality. One application of such a task might be identifying a cohort of patients, defined by attributes such as gender, age, and diabetes type, who share a common patient history modeled as event sequences. Filtering the data by these attributes is common but cumbersome and often does not allow a comparison of subspaces. We contribute a powerful multi-dimensional pattern exploration approach (MDPE-approach), agnostic to the structured data type, that models multiple attributes and their characteristics as co-occurrences, allowing the user to identify and compare thousands of subspaces of interest in a single picture. In our MDPE-approach, we introduce two methods to dramatically reduce the search space, outputting only its boundaries in the form of two tables. We implement the MDPE-approach in an interactive visual interface (MDPE-vis) that provides a scalable, pixel-based visualization design allowing the identification, comparison, and sense-making of subspaces in structured data. Our case studies, using a gold-standard dataset and external domain experts, confirm the applicability of our approach and its implementation. A third use case sheds light on the scalability of our approach, and a user study with 15 participants underlines its usefulness and power.
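
    The search-space reduction rests on the same anti-monotonicity that powers Apriori: once an attribute-value combination is insignificant, every superset of it can be skipped. The generic levelwise sketch below illustrates that pruning only; it is our own simplification (with `support` a caller-supplied scoring function), not the MDPE-approach's two boundary tables.

        from itertools import combinations

        def levelwise(items, support, min_sup):
            # Apriori-style enumeration: size-(k+1) candidates are generated
            # only from surviving size-k sets; anti-monotonicity guarantees
            # that supersets of a pruned set can never clear the threshold.
            frequent = {frozenset([i]) for i in items
                        if support(frozenset([i])) >= min_sup}
            result, k = set(frequent), 1
            while frequent:
                candidates = {a | b for a in frequent for b in frequent
                              if len(a | b) == k + 1}
                frequent = {c for c in candidates
                            if all(frozenset(s) in result
                                   for s in combinations(c, k))
                            and support(c) >= min_sup}
                result |= frequent
                k += 1
            return result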

    Measuring the impact of COVID-19 on hospital care pathways

    Care pathways in hospitals around the world reported significant disruption during the recent COVID-19 pandemic, but measuring the actual impact is more problematic. Process mining can be useful for hospital management to measure the conformance of real-life care to what might be considered normal operations. In this study, we aim to demonstrate that process mining can be used to investigate process changes associated with complex disruptive events. We studied perturbations to accident and emergency (A&E) and maternity pathways in a UK public hospital during the COVID-19 pandemic. Coincidentally, the hospital had implemented a Command Centre approach for patient-flow management, affording an opportunity to study both the planned improvement and the disruption due to the pandemic. Our study proposes and demonstrates a method for measuring and investigating the impact of such planned and unplanned disruptions on hospital care pathways. We found that during the pandemic, both A&E and maternity pathways showed measurable reductions in the mean length of stay and a measurable drop in the percentage of pathways conforming to normative models. There were no distinctive patterns in the monthly mean values of length of stay or conformance throughout the phases of the installation of the hospital’s new Command Centre approach. Due to a deficit in the available A&E data, the findings for A&E pathways could not be interpreted.
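
    As a minimal sketch of the two measurements reported, assuming each pathway is an event trace with admission and discharge timestamps and reducing the normative model to a set of allowed activity transitions (real process-mining conformance, e.g. alignment-based fitness, is considerably richer than this):

        def conforms(trace, allowed):
            # A trace conforms if every consecutive activity pair is a
            # transition permitted by the normative model.
            return all((a, b) in allowed for a, b in zip(trace, trace[1:]))

        def pathway_metrics(episodes, allowed):
            # episodes: list of (trace, admit_time, discharge_time) tuples,
            # with datetime timestamps supplied by the caller.
            stays = [(out - inn).total_seconds() / 3600.0
                     for _, inn, out in episodes]
            mean_los = sum(stays) / len(stays)
            conformance = (100.0 * sum(conforms(t, allowed)
                                       for t, _, _ in episodes)
                           / len(episodes))
            return mean_los, conformance  # mean stay (hours), % conforming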

    Mitigating the Cold-Start Problem by Leveraging Category Level Associations

    Recommender systems model user preferences by exploiting their profiles, historical transactions, and ratings of items. The quality of the recommendations heavily relies on the availability of the data. While typical recommendation methods such as collaborative and content-based filtering can be effective in a wide range of online shopping and e-commerce applications, they suffer from the cold-start problem in settings where new users enter the system and ratings are sparse for new or low-volume items. To this end, we present a pairwise association-rule-based recommendation algorithm that builds a model of collective user preferences by utilizing mined associations at both the item and the category level. At the same time, the model allows an individual user’s in-session activities to be integrated at the category level to further improve recommendation quality. Experimental results show that the proposed method improves recommendation performance compared to similar approaches.
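
    A minimal sketch of the two-level idea, under our own simplifying assumptions (confidence-ranked pairwise rules, plain dictionaries as the model, and a caller-supplied `category` lookup); the paper's actual algorithm and session integration are not reproduced here.

        from collections import Counter
        from itertools import permutations

        def pairwise_rules(baskets, min_conf=0.1):
            # Confidence of every pairwise rule a -> b over co-occurrence baskets.
            single, pair = Counter(), Counter()
            for basket in baskets:
                items = set(basket)
                single.update(items)
                pair.update(permutations(items, 2))
            return {(a, b): n / single[a] for (a, b), n in pair.items()
                    if n / single[a] >= min_conf}

        def recommend(item, item_rules, cat_rules, category):
            # Back off to category-level rules when the item is too new or
            # too sparse to have item-level associations (the cold-start case).
            hits = [(b, conf) for (a, b), conf in item_rules.items() if a == item]
            if not hits:
                hits = [(b, conf) for (a, b), conf in cat_rules.items()
                        if a == category(item)]
            return [b for b, _ in sorted(hits, key=lambda h: -h[1])]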

    FootApp: An AI-powered system for football match annotation

    In recent years, scientific and industrial research has shown a growing interest in acquiring large annotated datasets for training artificial intelligence algorithms in different domains. In this context, the market for football data has also grown substantially. The analysis of football matches relies on the annotation of both individual players’ and team actions, as well as the athletic performance of players; annotating football events at this fine-grained level is therefore a very expensive and error-prone task. Most existing semi-automatic tools for football match annotation rely on cameras and computer vision. However, those tools fall short in capturing team dynamics and in extracting data about players who are not visible in the camera frame. To address these issues, in this manuscript we present FootApp, an AI-based system for football match annotation. First, our system relies on an advanced, mixed user interface that exploits both vocal and touch interaction. Second, the motor performance of players is captured and processed by applying machine learning algorithms to data collected from inertial sensors worn by the players. Artificial intelligence techniques are then used to check the consistency of the generated labels, including those regarding the physical activity of players, to automatically recognize annotation errors. Notably, we implemented a full prototype of the proposed system and performed experiments showing its effectiveness in a real-world adoption scenario.

    Abductive Design of BDI Agent-based Digital Twins of Organizations

    For a Digital Twin - a precise, virtual representation of a physical counterpart - of a human-like system to be faithful and complete, it must appeal to a notion of anthropomorphism (i.e., attributing human behaviour to non-human entities) to imitate (1) the externally visible behaviour and (2) the internal workings of that system. Although the Belief-Desire-Intention (BDI) paradigm was not developed for this purpose, it has been used successfully in human-modeling applications. In this sense, we introduce in this thesis the notion of abductive design of BDI agent-based Digital Twins of organizations, which builds on two powerful reasoning disciplines: reverse engineering (to recreate the visible behaviour of the target system) and goal-driven eXplainable Artificial Intelligence (XAI) (to view the behaviour of the target system through the lens of BDI agents). More precisely, the overall problem we address in this thesis is to “find a BDI agent program that best explains (in the sense of formal abduction) the behaviour of a target system based on its past experiences”. To do so, we propose three goal-driven XAI techniques: (1) abductive design of BDI agents, (2) leveraging imperfect explanations, and (3) mining belief-based explanations. The resulting approach suggests that using goal-driven XAI to generate Digital Twins of organizations in the form of BDI agents can be effective, even in a setting with limited information about the target system’s behaviour.

    Diversification and fairness in top-k ranking algorithms

    Given a user query, typical user interfaces such as search engines and recommender systems return only a small number of results to the user. Hence, determining the top-k results is an important task in information retrieval, as it helps to ensure that the most relevant results are presented. There exists an extensive body of research on how to score records and return the top-k to the user, and researchers have identified an extensive set of criteria for selecting those results; result diversification is one of them. Diversifying the top-k result ensures that the returned result set is both relevant and representative of the entire set of answers to the user query, and it is highly relevant in the context of search, recommendation, and data exploration. The goal of this dissertation is twofold. The first part focuses on adapting existing popular diversification algorithms and studying how to expedite them without losing the accuracy of the answers. This work addresses the scalability challenge by designing a generic framework that produces the same results as the original algorithms yet runs significantly faster; the proposed approach also handles scenarios where data change over time and studies how to adapt the framework to accommodate such changes. The second part studies how existing top-k algorithms can lead to inequitable exposure of records that are qualitatively equivalent. This is especially important for long-tail data, where many records have similar utility but an existing top-k algorithm surfaces only some of them, and the rest are never returned to the user. Both problems are studied analytically, along with their hardness. The contributions of this dissertation lie in (a) formalizing the principal problems and studying them analytically, (b) designing scalable algorithms with theoretical guarantees, and (c) evaluating the efficacy and scalability of the designed solutions against state-of-the-art solutions over large-scale datasets.
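
    For intuition, one popular diversification algorithm of the kind the first part of the dissertation accelerates is the greedy max-marginal-relevance heuristic. The sketch below is a generic version of that heuristic only, not the dissertation's framework (which reproduces such algorithms' outputs while running faster); the function names and the balance parameter lam are our own.

        def diversified_topk(candidates, relevance, distance, k, lam=0.5):
            # Greedily trade off a record's relevance against its distance
            # to the records already selected (max-marginal-relevance).
            selected, pool = [], set(candidates)
            while pool and len(selected) < k:
                best = max(pool, key=lambda c: lam * relevance(c) + (1 - lam) *
                           min((distance(c, s) for s in selected), default=0.0))
                selected.append(best)
                pool.remove(best)
            return selected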

    Information-Theoretic Model Diagnostics (InfoMoD)

    Model validation is a critical step in the development, deployment, and governance of machine learning models. During the validation process, the predictive power of a model is measured on unseen datasets with a variety of metrics, such as accuracy and F1-score for classification tasks. Although the most commonly used metrics are easy to implement and understand, they are aggregate measures over all segments of heterogeneous datasets, and therefore they do not identify the variation in a model’s performance among different data segments. The lack of insight into how the model performs over segments of unseen datasets raises significant challenges in deploying machine learning models into production environments. The unstable performance is especially concerning for critical applications such as credit risk models, cancer detection, and self-driving cars, which have significant impacts on users. In this dissertation, we leverage the notion of information-theoretic explanations to measure the performance of binary classifiers over various segments of data. We provide the following contributions: (1) a distributed implementation of the explanation framework that outperforms a single-node baseline; (2) an application of the framework to summarize the model’s performance over various segments of training and testing data, in terms of overall accuracy as well as false-positive and false-negative patterns; and (3) a further application of the framework to annotate test instances with expected performance indicators at inference time. In addition to assisting machine learning engineers with model tuning and data augmentation decisions, the proposed tools can also identify potential model bias and unfairness with respect to protected attributes and data segments.
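
    As a simple illustration of per-segment diagnostics, the sketch below aggregates accuracy and false-positive/false-negative counts per segment; it shows only the bookkeeping, not the information-theoretic explanation framework itself, and all names are our own assumptions.

        from collections import defaultdict

        def segment_report(rows, attrs, y_true, y_pred):
            # Accuracy and FP/FN counts per segment, where a segment is a
            # value combination over the chosen attributes (e.g. age band,
            # diabetes type) and labels are 0/1.
            stats = defaultdict(lambda: [0, 0, 0, 0])  # n, correct, fp, fn
            for row, t, p in zip(rows, y_true, y_pred):
                s = stats[tuple(row[a] for a in attrs)]
                s[0] += 1
                s[1] += int(t == p)
                s[2] += int(p == 1 and t == 0)
                s[3] += int(p == 0 and t == 1)
            return {seg: {"n": n, "accuracy": c / n, "fp": fp, "fn": fn}
                    for seg, (n, c, fp, fn) in stats.items()}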