    Robust subgroup discovery

    We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) are non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, and that includes traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, as finding optimal subgroup lists is NP-hard, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration; this is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions, plus a multiple hypothesis testing penalty. We empirically show on 54 datasets that SSD++ outperforms previous subgroup set discovery methods in terms of quality and subgroup list size. Comment: For associated code, see https://github.com/HMProenca/RuleList ; submitted to the Data Mining and Knowledge Discovery Journal.
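    The greedy loop at the heart of such a method can be pictured with a short sketch. The following is a minimal, hypothetical stand-in for SSD++, not its actual implementation: it scores candidate subgroups by coverage-weighted KL divergence from the dataset's marginal target distribution minus a crude constant complexity penalty, in place of the paper's MDL gain with NML/Bayesian encodings.

```python
# Minimal sketch of a greedy subgroup-list loop in the spirit of SSD++.
# The scoring rule below is a simplified stand-in, not the paper's criterion.
from collections import Counter
from math import log

def dist(labels):
    """Empirical distribution of a list of nominal labels."""
    n = len(labels)
    return {k: v / n for k, v in Counter(labels).items()}

def kl(p, q):
    """KL divergence between two discrete distributions (dicts)."""
    return sum(p[k] * log(p[k] / q[k]) for k in p if p[k] > 0)

def greedy_subgroup_list(rows, labels, candidates, penalty=2.0, min_size=10):
    """rows: list of dicts; candidates: list of (attribute, value) conditions."""
    marginal = dist(labels)
    uncovered = set(range(len(rows)))
    subgroup_list = []
    while True:
        best, best_gain = None, 0.0
        for attr, value in candidates:
            idx = [i for i in uncovered if rows[i].get(attr) == value]
            if len(idx) < min_size:            # skip tiny subgroups
                continue
            local = dist([labels[i] for i in idx])
            # coverage-weighted deviation from the marginal, minus a penalty
            gain = len(idx) * kl(local, marginal) - penalty
            if gain > best_gain:
                best, best_gain = ((attr, value), idx), gain
        if best is None:                       # no worthwhile subgroup left
            break
        cond, idx = best
        subgroup_list.append(cond)
        uncovered -= set(idx)                  # later subgroups cover the rest
    return subgroup_list
```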

    More than the sum of its parts – pattern mining, neural networks, and how they complement each other

    In this thesis we explore pattern mining and deep learning. Often seen as orthogonal, we show that these fields complement each other and propose to combine them to gain from each other’s strengths. We, first, show how to efficiently discover succinct and non-redundant sets of patterns that provide insight into data beyond conjunctive statements. We leverage the interpretability of such patterns to unveil how and which information flows through neural networks, as well as what characterizes their decisions. Conversely, we show how to combine continuous optimization with pattern discovery, proposing a neural network that directly encodes discrete patterns, which allows us to apply pattern mining at a scale orders of magnitude larger than previously possible. Large neural networks are, however, exceedingly expensive to train for which ‘lottery tickets’ – small, well-trainable sub-networks in randomly initialized neural networks – offer a remedy. We identify theoretical limitations of strong tickets and overcome them by equipping these tickets with the property of universal approximation. To analyze whether limitations in ticket sparsity are algorithmic or fundamental, we propose a framework to plant and hide lottery tickets. With novel ticket benchmarks we then conclude that the limitation is likely algorithmic, encouraging further developments for which our framework offers means to measure progress.In dieser Arbeit befassen wir uns mit Mustersuche und Deep Learning. Oft als gegensĂ€tzlich betrachtet, verbinden wir diese Felder, um von den StĂ€rken beider zu profitieren. Wir zeigen erst, wie man effizient prĂ€gnante Mengen von Mustern entdeckt, die Einsichten ĂŒber konjunktive Aussagen hinaus geben. Wir nutzen dann die Interpretierbarkeit solcher Muster, um zu verstehen wie und welche Information durch neuronale Netze fließen und was ihre Entscheidungen charakterisiert. Umgekehrt verbinden wir kontinuierliche Optimierung mit Mustererkennung durch ein neuronales Netz welches diskrete Muster direkt abbildet, was Mustersuche in einigen GrĂ¶ĂŸenordnungen höher erlaubt als bisher möglich. Das Training großer neuronaler Netze ist jedoch extrem teuer, fĂŒr das ’Lotterietickets’ – kleine, gut trainierbare Subnetzwerke in zufĂ€llig initialisierten neuronalen Netzen – eine Lösung bieten. Wir zeigen theoretische EinschrĂ€nkungen von starken Tickets und wie man diese ĂŒberwindet, indem man die Tickets mit der Eigenschaft der universalen Approximierung ausstattet. Um zu beantworten, ob EinschrĂ€nkungen in TicketgrĂ¶ĂŸe algorithmischer oder fundamentaler Natur sind, entwickeln wir ein Rahmenwerk zum Einbetten und Verstecken von Tickets, die als ModellfĂ€lle dienen. Basierend auf unseren Ergebnissen schließen wir, dass die EinschrĂ€nkungen algorithmische Ursachen haben, was weitere Entwicklungen begĂŒnstigt, fĂŒr die unser Rahmenwerk Fortschrittsevaluierungen ermöglicht
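    To make the "strong ticket" idea concrete, here is a minimal sketch, assuming nothing from the thesis itself: a subnetwork is selected from a frozen, randomly initialized network purely by a binary mask, with no weight training at all. The magnitude heuristic used to pick the mask is a hypothetical stand-in for learned score-based selection.

```python
# Minimal sketch of a "strong lottery ticket": a subnetwork chosen from a
# frozen, randomly initialized network by a binary mask; weights are never
# trained. Mask selection by magnitude is a crude illustrative heuristic.
import numpy as np

rng = np.random.default_rng(0)

def random_layer(n_in, n_out):
    return rng.normal(0, np.sqrt(2 / n_in), size=(n_in, n_out))

def magnitude_mask(w, keep=0.3):
    """Keep the largest-magnitude fraction of weights, zero the rest."""
    threshold = np.quantile(np.abs(w), 1 - keep)
    return (np.abs(w) >= threshold).astype(w.dtype)

def forward(x, layers, masks):
    for w, m in zip(layers[:-1], masks[:-1]):
        x = np.maximum(x @ (w * m), 0)       # ReLU on masked weights
    return x @ (layers[-1] * masks[-1])      # linear output layer

layers = [random_layer(16, 64), random_layer(64, 64), random_layer(64, 1)]
masks = [magnitude_mask(w) for w in layers]
y = forward(rng.normal(size=(8, 16)), layers, masks)
print(y.shape)  # (8, 1): the masked subnetwork is a usable predictor
```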

    Automated identification of patient subgroups: A case-study on mortality of COVID-19 patients admitted to the ICU

    BACKGROUND: Subgroup discovery (SGD) is the automated splitting of data into complex subgroups. Various SGD methods have been applied to the medical domain, but none have been extensively evaluated. We assess the numerical and clinical quality of SGD methods. METHOD: We applied the improved Subgroup Set Discovery (SSD++), Patient Rule Induction Method (PRIM) and APRIORI Subgroup Discovery (APRIORI-SD) algorithms to obtain patient subgroups on observational data of 14,548 COVID-19 patients admitted to 73 Dutch intensive care units. Hospital mortality was the clinical outcome. Numerical significance of the subgroups was assessed with information-theoretic measures. Clinical significance of the subgroups was assessed by comparing variable importance at the population and subgroup levels and by expert evaluation. RESULTS: The tested algorithms varied widely in the total number of discovered subgroups (5–62), the number of selected variables, and the predictive value of the subgroups. Qualitative assessment showed that the found subgroups make clinical sense. SSD++ found the most subgroups (n = 62), which added predictive value and generally showed high potential for clinical use. APRIORI-SD and PRIM found fewer subgroups (n = 5 and 6), which did not add predictive value and were clinically less relevant. CONCLUSION: Automated SGD methods find clinical subgroups that are relevant when assessed quantitatively (they yield added predictive value) and qualitatively (intensivists consider the subgroups significant). Different methods yield different subgroups with varying degrees of predictive performance and clinical quality. External validation is needed to generalize the results to other populations, and future research should explore which algorithm performs best in other settings.
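    One standard way to quantify whether a candidate subgroup adds predictive value for a binary outcome such as hospital mortality is weighted relative accuracy (WRAcc), a common subgroup-discovery quality measure. The sketch below is illustrative only; the patient records and the example condition are hypothetical and not drawn from the study.

```python
# Weighted relative accuracy (WRAcc) of a subgroup for a binary outcome:
# coverage times the lift of the outcome rate inside the subgroup.
def wracc(in_subgroup, died):
    """in_subgroup, died: parallel lists of booleans over all patients."""
    n = len(died)
    n_sub = sum(in_subgroup)
    if n_sub == 0:
        return 0.0
    p_overall = sum(died) / n
    p_sub = sum(d for s, d in zip(in_subgroup, died) if s) / n_sub
    return (n_sub / n) * (p_sub - p_overall)

# Hypothetical example: condition "age >= 70 and mechanically ventilated"
patients = [
    {"age": 75, "ventilated": True,  "died": True},
    {"age": 40, "ventilated": False, "died": False},
    {"age": 82, "ventilated": True,  "died": True},
    {"age": 55, "ventilated": True,  "died": False},
]
member = [p["age"] >= 70 and p["ventilated"] for p in patients]
outcome = [p["died"] for p in patients]
print(wracc(member, outcome))  # positive: mortality is enriched in the subgroup
```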

    Subgroup discovery for structured target concepts

    The main object of study in this thesis is subgroup discovery, a theoretical framework for finding subgroups in data—i.e., named sub-populations—whose behaviour with respect to a specified target concept is exceptional when compared to the rest of the dataset. This is a powerful tool that conveys crucial information to a human audience, but despite past advances it has been limited to simple target concepts. In this work we propose algorithms that bring this framework to novel application domains. We introduce the concept of representative subgroups, which we use not only to ensure the fairness of a sub-population with regard to a sensitive trait, such as race or gender, but also to go beyond known trends in the data. For entities with additional relational information that can be encoded as a graph, we introduce a novel measure of robust connectedness which improves on established alternative measures of density; we then provide a method that uses this measure to discover which named sub-populations are better connected. Our contributions within subgroup discovery culminate in the introduction of kernelised subgroup discovery: a novel framework that enables the discovery of subgroups on i.i.d. target concepts with virtually any kind of structure. Importantly, our framework additionally provides a concrete and efficient tool that works out of the box without any modification, apart from specifying the Gramian of a positive definite kernel. For use within kernelised subgroup discovery, but also in any other kind of kernel method, we additionally introduce a novel random walk graph kernel. Our kernel allows fine-tuning of the alignment between the vertices of the two compared graphs while counting the random walks, and we also propose meaningful structure-aware vertex labels to utilise this new capability. With these contributions we thoroughly extend the applicability of subgroup discovery and ultimately re-define it as a kernel method.
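    For orientation, here is the textbook form of a random walk graph kernel, computed on the direct product of two graphs; this is a minimal baseline sketch, not the vertex-alignment kernel proposed in the thesis.

```python
# k-step random walk graph kernel: count (decayed) walks of length up to k
# shared by two graphs, via the adjacency matrix of their direct product.
import numpy as np

def random_walk_kernel(a1, a2, k=3, decay=0.5):
    ax = np.kron(a1, a2)                  # direct (tensor) product graph
    n = ax.shape[0]
    total, power = 0.0, np.eye(n)
    for step in range(1, k + 1):
        power = power @ ax                # walks of exactly `step` steps
        total += (decay ** step) * power.sum()
    return total

# Two small graphs: a triangle and a path on three vertices
triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
print(random_walk_kernel(triangle, path))
```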

    AVATAR - Machine Learning Pipeline Evaluation Using Surrogate Model

    © 2020, The Author(s). The evaluation of machine learning (ML) pipelines is essential during automatic ML pipeline composition and optimisation. Previous methods, such as the Bayesian-based and genetic-based optimisation implemented in Auto-Weka, Auto-sklearn and TPOT, evaluate pipelines by executing them. Pipeline composition and optimisation with these methods therefore requires a tremendous amount of time, which prevents them from exploring complex pipelines to find better predictive models. To further explore this research challenge, we have conducted experiments showing that many of the generated pipelines are invalid, and that it is unnecessary to execute them to find out whether they are good pipelines. To address this issue, we propose a novel method to evaluate the validity of ML pipelines using a surrogate model (AVATAR). AVATAR accelerates automatic ML pipeline composition and optimisation by quickly discarding invalid pipelines. Our experiments show that AVATAR evaluates complex pipelines more efficiently than traditional evaluation approaches that require executing them.
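    The core idea can be sketched in a few lines: describe each pipeline component by what it accepts and produces, then verify that the chained requirements are satisfiable without ever executing the pipeline. The capability vocabulary below is a hypothetical simplification, not AVATAR's actual schema.

```python
# Minimal sketch of a surrogate validity check for ML pipelines: type-level
# simulation of the data flowing through the steps, with no execution.
STEPS = {
    "OneHotEncoder":  {"accepts": {"categorical"}, "produces": {"numeric"}},
    "StandardScaler": {"accepts": {"numeric"},     "produces": {"numeric"}},
    "LinearSVC":      {"accepts": {"numeric"},     "produces": {"prediction"}},
}

def pipeline_is_valid(pipeline, data_types):
    """pipeline: list of step names; data_types: set describing the raw data."""
    current = set(data_types)
    for name in pipeline:
        step = STEPS[name]
        if not current <= step["accepts"]:   # some column type is unsupported
            return False
        current = set(step["produces"])
    return "prediction" in current

print(pipeline_is_valid(["OneHotEncoder", "LinearSVC"], {"categorical"}))  # True
print(pipeline_is_valid(["LinearSVC"], {"categorical"}))                   # False
```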

    Data quality assurance for strategic decision making in Abu Dhabi's public organisations

    “A thesis submitted to the University of Bedfordshire, in partial fulfilment of the requirements for the degree of Master of Philosophy.” Data quality is an important aspect of an organisation’s strategies for supporting decision makers in reaching the best decisions possible and consequently attaining the organisation’s objectives. In the case of public organisations, decisions ultimately concern the public, and hence further diligence is required to make sure that these decisions do, for instance, preserve economic resources, maintain public health, and provide national security. The decision making process requires a wealth of information in order to achieve efficient results. Public organisations typically acquire great amounts of data generated by public services. However, the vast amount of data stored in public organisations’ databases may be one of the main reasons for inefficient decisions made by public organisations. Processing vast amounts of data and extracting accurate information are not easy tasks. Although technology helps in this respect, for example through the use of decision support systems, it is not sufficient for improving decisions to a significant level of assurance. The research proposed using data mining to improve the results obtained by decision support systems. However, more than the merely technological aspects needs to be considered. The research argues that a complete data quality framework is needed in order to improve data quality and consequently the decision making process in public organisations. A series of surveys conducted in seven public organisations in the Abu Dhabi Emirate of the United Arab Emirates contributed to the design of a data quality framework. The framework comprises seven elements, ranging from technical to human-based, found necessary to attain the quality of data reaching decision makers, taking Abu Dhabi public organisations as the case. The interaction and integration of these elements contributes to the quality of data reaching decision makers and hence to the efficiency of decisions made by public organisations. The framework suggests that public organisations may need to adopt a methodological basis to support the decision making process. This includes more training courses and supportive organisational units, such as decision support centres, information security and strategic management. The framework also underscores the importance of acknowledging the human and cultural factors involved in the decision making process. Such factors have implications for how training and awareness raising are implemented so as to lead to effective methods of system development.

    Network-driven strategies to integrate and exploit biomedical data

    In the quest for understanding complex biological systems, the scientific community has been delving into protein, chemical and disease biology, populating biomedical databases with a wealth of data and knowledge. The field of biomedicine has now entered a Big Data era, in which computation-driven research can largely benefit from existing knowledge to better understand and characterize biological and chemical entities. And yet, the heterogeneity and complexity of biomedical data trigger the need for a proper integration and representation of this knowledge, so that it can be effectively and efficiently exploited. In this thesis, we aim at developing new strategies to leverage current biomedical knowledge, so that meaningful information can be extracted and fused into downstream applications. To this goal, we have capitalized on network analysis algorithms to integrate and exploit biomedical data in a wide variety of scenarios, providing a better understanding of pharmaco-omics experiments while helping accelerate the drug discovery process. More specifically, we have (i) devised an approach to identify functional gene sets associated with drug response mechanisms of action, (ii) created a resource of biomedical descriptors able to anticipate cellular drug response and identify new drug repurposing opportunities, (iii) designed a tool to annotate biomedical support for a given set of experimental observations, and (iv) reviewed different chemical and biological descriptors relevant for drug discovery, illustrating how they can be used to provide solutions to current challenges in biomedicine.
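    A representative network-analysis primitive in this line of work is random walk with restart (network propagation), which ranks genes in an interaction network by proximity to a seed set. The sketch below is a generic illustration; the toy network and gene names are hypothetical, not taken from the thesis.

```python
# Random walk with restart over a gene interaction network: scores diffuse
# from seed genes along edges, with a restart term keeping them anchored.
import numpy as np

def random_walk_with_restart(adj, seeds, restart=0.5, iters=100):
    """adj: symmetric adjacency matrix; seeds: indices of query genes."""
    w = adj / adj.sum(axis=0, keepdims=True)   # column-normalize transitions
    p0 = np.zeros(adj.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(iters):
        p = (1 - restart) * (w @ p) + restart * p0
    return p

genes = ["TP53", "MDM2", "EGFR", "KRAS", "BRAF"]
adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 0, 0, 0],
                [1, 0, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [0, 0, 0, 1, 0]], float)
scores = random_walk_with_restart(adj, seeds=[0])   # propagate from TP53
print(sorted(zip(genes, scores.round(3)), key=lambda t: -t[1]))
```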

    Prediction of Airport Arrival Rates Using Data Mining Methods

    This research sought to establish and utilize relationships between environmental variable inputs and airport efficiency estimates by data mining archived weather and airport performance data at ten geographically and climatologically different airports. Several meaningful relationships were discovered using various statistical modeling methods within an overarching data mining protocol, and the developed models were tested using historical data. Additionally, a selected model was deployed using real-time predictive weather information to estimate airport efficiency as a demonstration of potential operational usefulness. This work employed SAS¼ Enterprise Miner™ data mining and modeling software to train and validate decision tree, neural network, and linear regression models to estimate the importance of weather input variables in predicting Airport Arrival Rates (AAR) using the FAA’s Aviation System Performance Metric (ASPM) database. The ASPM database contains airport performance statistics and limited weather variables archived at 15-minute and hourly intervals, and these data formed the foundation of this study. To add more weather parameters to the data mining environment, National Oceanic and Atmospheric Administration (NOAA) National Centers for Environmental Information (NCEI) meteorological hourly station data were merged with the ASPM data, bringing more environmental variables (e.g., precipitation type and amount) into the analyses. Using SAS¼ Enterprise Miner™, three different types of models were created, compared, and scored at the following ten airports: a) Hartsfield-Jackson Atlanta International Airport (ATL), b) Los Angeles International Airport (LAX), c) O’Hare International Airport (ORD), d) Dallas/Fort Worth International Airport (DFW), e) John F. Kennedy International Airport (JFK), f) Denver International Airport (DEN), g) San Francisco International Airport (SFO), h) Charlotte-Douglas International Airport (CLT), i) LaGuardia Airport (LGA), and j) Newark Liberty International Airport (EWR). At each location, weather inputs were used to estimate AARs as a metric of efficiency easily interpreted by FAA airspace managers. To estimate Airport Arrival Rates, three data sets were used: a) 15-minute and b) hourly ASPM data, along with c) a merged ASPM and meteorological hourly station data set. For all three data sets, the models were trained and validated using data from 2014 and 2015, and then tested using 2016 data. Additionally, a selected airport model was deployed using National Weather Service (NWS) Localized Aviation MOS (Model Output Statistics) Program (LAMP) weather guidance as the input variables over a 24-hour period as a test. The resulting AAR output predictions were then compared with the real-world AARs observed. Based on model scoring using 2016 data, LAX, ATL, and EWR demonstrated useful predictive performance that could potentially be applied to estimate real-world AARs. Marginal but perhaps operationally useful AAR predictions might be gleaned at LGA, SFO, and DFW, as the number of successfully scored cases falls loosely within one standard deviation of acceptable model performance, arbitrarily set at ten percent of the airport’s maximum AAR. The remaining models studied (DEN, CLT, ORD, and JFK) appeared to have little useful operational application based on the 2016 model scoring results.
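    The study's model-comparison step was done in SAS¼ Enterprise Miner™; as a rough open-source stand-in, the following sketch trains and scores the same three model types in Python with scikit-learn. The feature names and synthetic data are hypothetical placeholders for the ASPM/NCEI weather inputs actually used.

```python
# Compare decision tree, neural network, and linear regression models for
# predicting a synthetic Airport Arrival Rate from toy weather features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.uniform(0, 10, n),     # visibility (mi)
    rng.uniform(0, 40, n),     # wind speed (kt)
    rng.integers(0, 2, n),     # precipitation flag
])
# Toy ground truth: arrivals drop with wind and precipitation, rise with visibility
y = 60 + 2 * X[:, 0] - 0.8 * X[:, 1] - 10 * X[:, 2] + rng.normal(0, 3, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=6, random_state=0),
    "neural network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{name}: MAE = {mae:.2f} arrivals/hour")
```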

    Front-Line Physicians' Satisfaction with Information Systems in Hospitals

    Day-to-day operations management in hospital units is difficult due to continuously varying situations, the several actors involved, and the vast number of information systems in use. The aim of this study was to describe front-line physicians’ satisfaction with the existing information systems needed to support day-to-day operations management in hospitals. A cross-sectional survey was used, and data were collected in nine hospitals using stratified random sampling. Data were analyzed with descriptive and inferential statistical methods. The response rate was 65% (n = 111). The physicians reported that information systems support their decision making to some extent, but that they do not improve access to information, nor are they tailored for physicians. The respondents also reported that they need to use several information systems to support decision making and that they would prefer a single information system for accessing important information. Improved information access would better support physicians’ decision making, and has the potential to improve the quality of decisions and speed up the decision making process.
    • 

    corecore