    Robust subgroup discovery

    We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) are non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, and that includes traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, as finding optimal subgroup lists is NP-hard, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration; this is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions, plus a multiple hypothesis testing penalty. We empirically show on 54 datasets that SSD++ outperforms previous subgroup set discovery methods in terms of quality and subgroup list size. Comment: For associated code, see https://github.com/HMProenca/RuleList ; submitted to the Data Mining and Knowledge Discovery Journal.
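    The greedy loop at the heart of such a method can be pictured with a short sketch. The following is a minimal, hypothetical stand-in for SSD++, not its actual implementation: it scores candidate subgroups by coverage-weighted KL divergence from the dataset's marginal target distribution minus a crude constant complexity penalty, in place of the paper's MDL gain with NML/Bayesian encodings.

```python
# Minimal sketch of a greedy subgroup-list loop in the spirit of SSD++.
# The scoring rule below is a simplified stand-in, not the paper's criterion.
from collections import Counter
from math import log

def dist(labels):
    """Empirical distribution of a list of nominal labels."""
    n = len(labels)
    return {k: v / n for k, v in Counter(labels).items()}

def kl(p, q):
    """KL divergence between two discrete distributions (dicts)."""
    return sum(p[k] * log(p[k] / q[k]) for k in p if p[k] > 0)

def greedy_subgroup_list(rows, labels, candidates, penalty=2.0, min_size=10):
    """rows: list of dicts; candidates: list of (attribute, value) conditions."""
    marginal = dist(labels)
    uncovered = set(range(len(rows)))
    subgroup_list = []
    while True:
        best, best_gain = None, 0.0
        for attr, value in candidates:
            idx = [i for i in uncovered if rows[i].get(attr) == value]
            if len(idx) < min_size:            # skip tiny subgroups
                continue
            local = dist([labels[i] for i in idx])
            # coverage-weighted deviation from the marginal, minus a penalty
            gain = len(idx) * kl(local, marginal) - penalty
            if gain > best_gain:
                best, best_gain = ((attr, value), idx), gain
        if best is None:                       # no worthwhile subgroup left
            break
        cond, idx = best
        subgroup_list.append(cond)
        uncovered -= set(idx)                  # later subgroups cover the rest
    return subgroup_list
```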

    More than the sum of its parts – pattern mining, neural networks, and how they complement each other

    In this thesis we explore pattern mining and deep learning. Often seen as orthogonal, we show that these fields complement each other and propose to combine them to gain from each other’s strengths. We, first, show how to efficiently discover succinct and non-redundant sets of patterns that provide insight into data beyond conjunctive statements. We leverage the interpretability of such patterns to unveil how and which information flows through neural networks, as well as what characterizes their decisions. Conversely, we show how to combine continuous optimization with pattern discovery, proposing a neural network that directly encodes discrete patterns, which allows us to apply pattern mining at a scale orders of magnitude larger than previously possible. Large neural networks are, however, exceedingly expensive to train for which ‘lottery tickets’ – small, well-trainable sub-networks in randomly initialized neural networks – offer a remedy. We identify theoretical limitations of strong tickets and overcome them by equipping these tickets with the property of universal approximation. To analyze whether limitations in ticket sparsity are algorithmic or fundamental, we propose a framework to plant and hide lottery tickets. With novel ticket benchmarks we then conclude that the limitation is likely algorithmic, encouraging further developments for which our framework offers means to measure progress.In dieser Arbeit befassen wir uns mit Mustersuche und Deep Learning. Oft als gegensĂ€tzlich betrachtet, verbinden wir diese Felder, um von den StĂ€rken beider zu profitieren. Wir zeigen erst, wie man effizient prĂ€gnante Mengen von Mustern entdeckt, die Einsichten ĂŒber konjunktive Aussagen hinaus geben. Wir nutzen dann die Interpretierbarkeit solcher Muster, um zu verstehen wie und welche Information durch neuronale Netze fließen und was ihre Entscheidungen charakterisiert. Umgekehrt verbinden wir kontinuierliche Optimierung mit Mustererkennung durch ein neuronales Netz welches diskrete Muster direkt abbildet, was Mustersuche in einigen GrĂ¶ĂŸenordnungen höher erlaubt als bisher möglich. Das Training großer neuronaler Netze ist jedoch extrem teuer, fĂŒr das ’Lotterietickets’ – kleine, gut trainierbare Subnetzwerke in zufĂ€llig initialisierten neuronalen Netzen – eine Lösung bieten. Wir zeigen theoretische EinschrĂ€nkungen von starken Tickets und wie man diese ĂŒberwindet, indem man die Tickets mit der Eigenschaft der universalen Approximierung ausstattet. Um zu beantworten, ob EinschrĂ€nkungen in TicketgrĂ¶ĂŸe algorithmischer oder fundamentaler Natur sind, entwickeln wir ein Rahmenwerk zum Einbetten und Verstecken von Tickets, die als ModellfĂ€lle dienen. Basierend auf unseren Ergebnissen schließen wir, dass die EinschrĂ€nkungen algorithmische Ursachen haben, was weitere Entwicklungen begĂŒnstigt, fĂŒr die unser Rahmenwerk Fortschrittsevaluierungen ermöglicht
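    To make the "strong ticket" idea concrete, here is a minimal sketch, assuming nothing from the thesis itself: a subnetwork is selected from a frozen, randomly initialized network purely by a binary mask, with no weight training at all. The magnitude heuristic used to pick the mask is a hypothetical stand-in for learned score-based selection.

```python
# Minimal sketch of a "strong lottery ticket": a subnetwork chosen from a
# frozen, randomly initialized network by a binary mask; weights are never
# trained. Mask selection by magnitude is a crude illustrative heuristic.
import numpy as np

rng = np.random.default_rng(0)

def random_layer(n_in, n_out):
    return rng.normal(0, np.sqrt(2 / n_in), size=(n_in, n_out))

def magnitude_mask(w, keep=0.3):
    """Keep the largest-magnitude fraction of weights, zero the rest."""
    threshold = np.quantile(np.abs(w), 1 - keep)
    return (np.abs(w) >= threshold).astype(w.dtype)

def forward(x, layers, masks):
    for w, m in zip(layers[:-1], masks[:-1]):
        x = np.maximum(x @ (w * m), 0)       # ReLU on masked weights
    return x @ (layers[-1] * masks[-1])      # linear output layer

layers = [random_layer(16, 64), random_layer(64, 64), random_layer(64, 1)]
masks = [magnitude_mask(w) for w in layers]
y = forward(rng.normal(size=(8, 16)), layers, masks)
print(y.shape)  # (8, 1): the masked subnetwork is a usable predictor
```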

    Automated identification of patient subgroups: A case-study on mortality of COVID-19 patients admitted to the ICU

    BACKGROUND: Subgroup discovery (SGD) is the automated splitting of data into complex subgroups. Various SGD methods have been applied to the medical domain, but none have been extensively evaluated. We assess the numerical and clinical quality of SGD methods. METHOD: We applied the improved Subgroup Set Discovery (SSD++), Patient Rule Induction Method (PRIM) and APRIORI Subgroup Discovery (APRIORI-SD) algorithms to obtain patient subgroups on observational data of 14,548 COVID-19 patients admitted to 73 Dutch intensive care units. Hospital mortality was the clinical outcome. Numerical significance of the subgroups was assessed with information-theoretic measures. Clinical significance of the subgroups was assessed by comparing variable importance at the population and subgroup levels and by expert evaluation. RESULTS: The tested algorithms varied widely in the total number of discovered subgroups (5–62), the number of selected variables, and the predictive value of the subgroups. Qualitative assessment showed that the found subgroups make clinical sense. SSD++ found the most subgroups (n = 62), which added predictive value and generally showed high potential for clinical use. APRIORI-SD and PRIM found fewer subgroups (n = 5 and 6), which did not add predictive value and were clinically less relevant. CONCLUSION: Automated SGD methods find clinical subgroups that are relevant when assessed quantitatively (they yield added predictive value) and qualitatively (intensivists consider the subgroups significant). Different methods yield different subgroups with varying degrees of predictive performance and clinical quality. External validation is needed to generalize the results to other populations, and future research should explore which algorithm performs best in other settings.
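    One standard way to quantify whether a candidate subgroup adds predictive value for a binary outcome such as hospital mortality is weighted relative accuracy (WRAcc), a common subgroup-discovery quality measure. The sketch below is illustrative only; the patient records and the example condition are hypothetical and not drawn from the study.

```python
# Weighted relative accuracy (WRAcc) of a subgroup for a binary outcome:
# coverage times the lift of the outcome rate inside the subgroup.
def wracc(in_subgroup, died):
    """in_subgroup, died: parallel lists of booleans over all patients."""
    n = len(died)
    n_sub = sum(in_subgroup)
    if n_sub == 0:
        return 0.0
    p_overall = sum(died) / n
    p_sub = sum(d for s, d in zip(in_subgroup, died) if s) / n_sub
    return (n_sub / n) * (p_sub - p_overall)

# Hypothetical example: condition "age >= 70 and mechanically ventilated"
patients = [
    {"age": 75, "ventilated": True,  "died": True},
    {"age": 40, "ventilated": False, "died": False},
    {"age": 82, "ventilated": True,  "died": True},
    {"age": 55, "ventilated": True,  "died": False},
]
member = [p["age"] >= 70 and p["ventilated"] for p in patients]
outcome = [p["died"] for p in patients]
print(wracc(member, outcome))  # positive: mortality is enriched in the subgroup
```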

    Subgroup discovery for structured target concepts

    The main object of study in this thesis is subgroup discovery, a theoretical framework for finding subgroups in data—i.e., named sub-populations—whose behaviour with respect to a specified target concept is exceptional when compared to the rest of the dataset. This is a powerful tool that conveys crucial information to a human audience, but despite past advances it has been limited to simple target concepts. In this work we propose algorithms that bring this framework to novel application domains. We introduce the concept of representative subgroups, which we use not only to ensure the fairness of a sub-population with regard to a sensitive trait, such as race or gender, but also to go beyond known trends in the data. For entities with additional relational information that can be encoded as a graph, we introduce a novel measure of robust connectedness which improves on established alternative measures of density; we then provide a method that uses this measure to discover which named sub-populations are better connected. Our contributions within subgroup discovery culminate in the introduction of kernelised subgroup discovery: a novel framework that enables the discovery of subgroups on i.i.d. target concepts with virtually any kind of structure. Importantly, our framework additionally provides a concrete and efficient tool that works out of the box without any modification, apart from specifying the Gramian of a positive definite kernel. For use within kernelised subgroup discovery, but also in any other kind of kernel method, we additionally introduce a novel random walk graph kernel. Our kernel allows fine-tuning of the alignment between the vertices of the two compared graphs while counting the random walks, and we also propose meaningful structure-aware vertex labels to utilise this new capability. With these contributions we thoroughly extend the applicability of subgroup discovery and ultimately re-define it as a kernel method.
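    For orientation, here is the textbook form of a random walk graph kernel, computed on the direct product of two graphs; this is a minimal baseline sketch, not the vertex-alignment kernel proposed in the thesis.

```python
# k-step random walk graph kernel: count (decayed) walks of length up to k
# shared by two graphs, via the adjacency matrix of their direct product.
import numpy as np

def random_walk_kernel(a1, a2, k=3, decay=0.5):
    ax = np.kron(a1, a2)                  # direct (tensor) product graph
    n = ax.shape[0]
    total, power = 0.0, np.eye(n)
    for step in range(1, k + 1):
        power = power @ ax                # walks of exactly `step` steps
        total += (decay ** step) * power.sum()
    return total

# Two small graphs: a triangle and a path on three vertices
triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
print(random_walk_kernel(triangle, path))
```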

    AVATAR - Machine Learning Pipeline Evaluation Using Surrogate Model

    © 2020, The Author(s). The evaluation of machine learning (ML) pipelines is essential during automatic ML pipeline composition and optimisation. Previous methods, such as the Bayesian-based and genetic-based optimisation implemented in Auto-Weka, Auto-sklearn and TPOT, evaluate pipelines by executing them. Pipeline composition and optimisation with these methods therefore requires a tremendous amount of time, which prevents them from exploring complex pipelines to find better predictive models. To further explore this research challenge, we have conducted experiments showing that many of the generated pipelines are invalid, and that it is unnecessary to execute them to find out whether they are good pipelines. To address this issue, we propose a novel method to evaluate the validity of ML pipelines using a surrogate model (AVATAR). AVATAR accelerates automatic ML pipeline composition and optimisation by quickly discarding invalid pipelines. Our experiments show that AVATAR evaluates complex pipelines more efficiently than traditional evaluation approaches that require executing them.
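    The core idea can be sketched in a few lines: describe each pipeline component by what it accepts and produces, then verify that the chained requirements are satisfiable without ever executing the pipeline. The capability vocabulary below is a hypothetical simplification, not AVATAR's actual schema.

```python
# Minimal sketch of a surrogate validity check for ML pipelines: type-level
# simulation of the data flowing through the steps, with no execution.
STEPS = {
    "OneHotEncoder":  {"accepts": {"categorical"}, "produces": {"numeric"}},
    "StandardScaler": {"accepts": {"numeric"},     "produces": {"numeric"}},
    "LinearSVC":      {"accepts": {"numeric"},     "produces": {"prediction"}},
}

def pipeline_is_valid(pipeline, data_types):
    """pipeline: list of step names; data_types: set describing the raw data."""
    current = set(data_types)
    for name in pipeline:
        step = STEPS[name]
        if not current <= step["accepts"]:   # some column type is unsupported
            return False
        current = set(step["produces"])
    return "prediction" in current

print(pipeline_is_valid(["OneHotEncoder", "LinearSVC"], {"categorical"}))  # True
print(pipeline_is_valid(["LinearSVC"], {"categorical"}))                   # False
```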

    Data quality assurance for strategic decision making in Abu Dhabi's public organisations

    “A thesis submitted to the University of Bedfordshire, in partial fulfilment of the requirements for the degree of Master of Philosophy.” Data quality is an important aspect of an organisation’s strategies for supporting decision makers in reaching the best decisions possible and consequently attaining the organisation’s objectives. In the case of public organisations, decisions ultimately concern the public, and hence further diligence is required to make sure that these decisions do, for instance, preserve economic resources, maintain public health, and provide national security. The decision making process requires a wealth of information in order to achieve efficient results. Public organisations typically acquire great amounts of data generated by public services. However, the vast amount of data stored in public organisations’ databases may be one of the main reasons for inefficient decisions made by public organisations. Processing vast amounts of data and extracting accurate information are not easy tasks. Although technology helps in this respect, for example through the use of decision support systems, it is not sufficient for improving decisions to a significant level of assurance. The research proposed using data mining to improve the results obtained by decision support systems. However, more than the merely technological aspects needs to be considered. The research argues that a complete data quality framework is needed in order to improve data quality and consequently the decision making process in public organisations. A series of surveys conducted in seven public organisations in the Abu Dhabi Emirate of the United Arab Emirates contributed to the design of a data quality framework. The framework comprises seven elements, ranging from technical to human-based, found necessary to attain the quality of data reaching decision makers, taking Abu Dhabi public organisations as the case. The interaction and integration of these elements contributes to the quality of data reaching decision makers and hence to the efficiency of decisions made by public organisations. The framework suggests that public organisations may need to adopt a methodological basis to support the decision making process. This includes more training courses and supportive organisational units, such as decision support centres, information security and strategic management. The framework also underscores the importance of acknowledging the human and cultural factors involved in the decision making process. Such factors have implications for how training and awareness raising are implemented so as to lead to effective methods of system development.

    Network-driven strategies to integrate and exploit biomedical data

    In the quest for understanding complex biological systems, the scientific community has been delving into protein, chemical and disease biology, populating biomedical databases with a wealth of data and knowledge. The field of biomedicine has now entered a Big Data era, in which computation-driven research can largely benefit from existing knowledge to better understand and characterize biological and chemical entities. And yet, the heterogeneity and complexity of biomedical data trigger the need for a proper integration and representation of this knowledge, so that it can be effectively and efficiently exploited. In this thesis, we aim at developing new strategies to leverage current biomedical knowledge, so that meaningful information can be extracted and fused into downstream applications. To this goal, we have capitalized on network analysis algorithms to integrate and exploit biomedical data in a wide variety of scenarios, providing a better understanding of pharmaco-omics experiments while helping accelerate the drug discovery process. More specifically, we have (i) devised an approach to identify functional gene sets associated with drug response mechanisms of action, (ii) created a resource of biomedical descriptors able to anticipate cellular drug response and identify new drug repurposing opportunities, (iii) designed a tool to annotate biomedical support for a given set of experimental observations, and (iv) reviewed different chemical and biological descriptors relevant for drug discovery, illustrating how they can be used to provide solutions to current challenges in biomedicine.
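    A representative network-analysis primitive in this line of work is random walk with restart (network propagation), which ranks genes in an interaction network by proximity to a seed set. The sketch below is a generic illustration; the toy network and gene names are hypothetical, not taken from the thesis.

```python
# Random walk with restart over a gene interaction network: scores diffuse
# from seed genes along edges, with a restart term keeping them anchored.
import numpy as np

def random_walk_with_restart(adj, seeds, restart=0.5, iters=100):
    """adj: symmetric adjacency matrix; seeds: indices of query genes."""
    w = adj / adj.sum(axis=0, keepdims=True)   # column-normalize transitions
    p0 = np.zeros(adj.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(iters):
        p = (1 - restart) * (w @ p) + restart * p0
    return p

genes = ["TP53", "MDM2", "EGFR", "KRAS", "BRAF"]
adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 0, 0, 0],
                [1, 0, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [0, 0, 0, 1, 0]], float)
scores = random_walk_with_restart(adj, seeds=[0])   # propagate from TP53
print(sorted(zip(genes, scores.round(3)), key=lambda t: -t[1]))
```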

    Prediction of Airport Arrival Rates Using Data Mining Methods

    This research sought to establish and utilize relationships between environmental variable inputs and airport efficiency estimates by data mining archived weather and airport performance data at ten geographically and climatologically different airports. Several meaningful relationships were discovered using various statistical modeling methods within an overarching data mining protocol, and the developed models were tested using historical data. Additionally, a selected model was deployed using real-time predictive weather information to estimate airport efficiency as a demonstration of potential operational usefulness. This work employed SAS¼ Enterprise Miner™ data mining and modeling software to train and validate decision tree, neural network, and linear regression models to estimate the importance of weather input variables in predicting Airport Arrival Rates (AAR) using the FAA’s Aviation System Performance Metric (ASPM) database. The ASPM database contains airport performance statistics and limited weather variables archived at 15-minute and hourly intervals, and these data formed the foundation of this study. To add more weather parameters to the data mining environment, National Oceanic and Atmospheric Administration (NOAA) National Centers for Environmental Information (NCEI) meteorological hourly station data were merged with the ASPM data, bringing more environmental variables (e.g., precipitation type and amount) into the analyses. Using SAS¼ Enterprise Miner™, three different types of models were created, compared, and scored at the following ten airports: a) Hartsfield-Jackson Atlanta International Airport (ATL), b) Los Angeles International Airport (LAX), c) O’Hare International Airport (ORD), d) Dallas/Fort Worth International Airport (DFW), e) John F. Kennedy International Airport (JFK), f) Denver International Airport (DEN), g) San Francisco International Airport (SFO), h) Charlotte-Douglas International Airport (CLT), i) LaGuardia Airport (LGA), and j) Newark Liberty International Airport (EWR). At each location, weather inputs were used to estimate AARs as a metric of efficiency easily interpreted by FAA airspace managers. To estimate Airport Arrival Rates, three data sets were used: a) 15-minute and b) hourly ASPM data, along with c) a merged ASPM and meteorological hourly station data set. For all three data sets, the models were trained and validated using data from 2014 and 2015, and then tested using 2016 data. Additionally, a selected airport model was deployed using National Weather Service (NWS) Localized Aviation MOS (Model Output Statistics) Program (LAMP) weather guidance as the input variables over a 24-hour period as a test. The resulting AAR output predictions were then compared with the real-world AARs observed. Based on model scoring using 2016 data, LAX, ATL, and EWR demonstrated useful predictive performance that could potentially be applied to estimate real-world AARs. Marginal but perhaps operationally useful AAR predictions might be gleaned at LGA, SFO, and DFW, as the number of successfully scored cases falls loosely within one standard deviation of acceptable model performance, arbitrarily set at ten percent of the airport’s maximum AAR. The remaining models studied (DEN, CLT, ORD, and JFK) appeared to have little useful operational application based on the 2016 model scoring results.
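    The study's model-comparison step was done in SAS¼ Enterprise Miner™; as a rough open-source stand-in, the following sketch trains and scores the same three model types in Python with scikit-learn. The feature names and synthetic data are hypothetical placeholders for the ASPM/NCEI weather inputs actually used.

```python
# Compare decision tree, neural network, and linear regression models for
# predicting a synthetic Airport Arrival Rate from toy weather features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.uniform(0, 10, n),     # visibility (mi)
    rng.uniform(0, 40, n),     # wind speed (kt)
    rng.integers(0, 2, n),     # precipitation flag
])
# Toy ground truth: arrivals drop with wind and precipitation, rise with visibility
y = 60 + 2 * X[:, 0] - 0.8 * X[:, 1] - 10 * X[:, 2] + rng.normal(0, 3, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=6, random_state=0),
    "neural network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{name}: MAE = {mae:.2f} arrivals/hour")
```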

    Front-Line Physicians' Satisfaction with Information Systems in Hospitals

    Day-to-day operations management in hospital units is difficult due to continuously varying situations, the several actors involved, and the vast number of information systems in use. The aim of this study was to describe front-line physicians’ satisfaction with the existing information systems needed to support day-to-day operations management in hospitals. A cross-sectional survey was used, and data were collected in nine hospitals using stratified random sampling. Data were analyzed with descriptive and inferential statistical methods. The response rate was 65% (n = 111). The physicians reported that information systems support their decision making to some extent, but that they do not improve access to information, nor are they tailored for physicians. The respondents also reported that they need to use several information systems to support decision making and that they would prefer a single information system for accessing important information. Improved information access would better support physicians’ decision making, and has the potential to improve the quality of decisions and speed up the decision making process.
    • 

    corecore