258 research outputs found

    Data- og ekspertdreven variabelseleksjon for prediktive modeller i helsevesenet : mot økt tolkbarhet i underbestemte maskinlæringsproblemer

    Get PDF
    Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Some concrete examples are electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional heterogeneous patient cohort data where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may overfit easily on the training data. An important approach to overcome this issue is feature selection, i.e., determining a subset of informative features from the original set of features with respect to the target variable. While potentially raising the predictive performance, feature selection fosters model interpretability by identifying a low number of relevant model parameters to better understand the underlying biological processes that lead to health issues. Interpretability requires that feature selection is stable, i.e., small changes in the dataset do not lead to changes in the selected feature set. A concept to address instability is ensemble feature selection, i.e. the process of repeating the feature selection multiple times on subsets of samples of the original dataset and aggregating results in a meta-model. This thesis presents two approaches for ensemble feature selection, which are tailored towards high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic net regularized models, UBayFS is a general framework for ensembles with the capabilities to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice. Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) that were acquired from multiple data sources, as those mentioned above. Importance quantification of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. Such information on feature block importance may lead to positive effects on the use of human, technical, and financial resources if systematically integrated into the planning of patient treatment by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings. This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented approaches for feature selection. Experimental evaluations demonstrate favorable properties of both predictive performance, stability, as well as interpretability of results, which carries a high potential for better data-driven decision support in clinical practice.Moderne datainnsamlingsteknikker i helsevesenet genererer store datamengder fra flere kilder, som for eksempel nye diagnose- og behandlingsmetoder. Noen konkrete eksempler er elektroniske helsejournalsystemer, genomikk og medisinske bilder. Slike pasientkohortdata er ofte ustrukturerte, høydimensjonale og heterogene og hvor klassiske statistiske metoder ikke er tilstrekkelige for optimal utnyttelse av dataene og god informasjonsbasert beslutningstaking. Derfor kan det være lovende å analysere slike datastrukturer ved bruk av moderne maskinlæringsteknikker for å øke forståelsen av pasientenes helseproblemer og for å gi klinikerne en bedre plattform for informasjonsbasert beslutningstaking. Sentrale krav til dette formålet inkluderer (a) tilstrekkelig nøyaktige prediksjoner og (b) modelltolkbarhet. Å oppnå begge aspektene samtidig er vanskelig, spesielt for datasett med få pasienter, noe som er vanlig for data i helsevesenet. I slike tilfeller må maskinlæringsmodeller håndtere matematisk underbestemte systemer og dette kan lett føre til at modellene overtilpasses treningsdataene. Variabelseleksjon er en viktig tilnærming for å håndtere dette ved å identifisere en undergruppe av informative variabler med hensyn til responsvariablen. Samtidig som variabelseleksjonsmetoder kan lede til økt prediktiv ytelse, fremmes modelltolkbarhet ved å identifisere et lavt antall relevante modellparametere. Dette kan gi bedre forståelse av de underliggende biologiske prosessene som fører til helseproblemer. Tolkbarhet krever at variabelseleksjonen er stabil, dvs. at små endringer i datasettet ikke fører til endringer i hvilke variabler som velges. Et konsept for å adressere ustabilitet er ensemblevariableseleksjon, dvs. prosessen med å gjenta variabelseleksjon flere ganger på en delmengde av prøvene i det originale datasett og aggregere resultater i en metamodell. Denne avhandlingen presenterer to tilnærminger for ensemblevariabelseleksjon, som er skreddersydd for høydimensjonale data i helsevesenet: "Repeated Elastic Net Technique for feature selection" (RENT) og "User-Guided Bayesian Framework for feature selection" (UBayFS). Mens RENT er datadrevet og bygger på elastic net-regulariserte modeller, er UBayFS et generelt rammeverk for ensembler som muliggjør inkludering av ekspertkunnskap i variabelseleksjonsprosessen gjennom forhåndsbestemte vekter og sidebegrensninger. En case-studie som modellerer overlevelsen av kreftpasienter sammenligner disse nye variabelseleksjonsmetodene og demonstrerer deres potensiale i klinisk praksis. Utover valg av enkelte variabler gjør UBayFS det også mulig å velge blokker eller grupper av variabler som representerer de ulike datakildene som ble nevnt over. Kvantifisering av viktigheten av variabelgrupper spiller en nøkkelrolle for forståelsen av hvorvidt datakildene er viktige for responsvariablen. Tilgang til slik informasjon kan føre til at bruken av menneskelige, tekniske og økonomiske ressurser kan forbedres dersom informasjonen integreres systematisk i planleggingen av pasientbehandlingen. Slik kan man redusere innsamling av ikke-informative variabler. Siden generaliseringen av viktighet av variabelgrupper ikke er triviell, undersøkes og sammenlignes også tilnærminger for rangering av viktigheten til disse variabelgruppene. Denne avhandlingen viser at høydimensjonale datasett fra flere datakilder fra det medisinske domenet effektivt kan håndteres ved bruk av variabelseleksjonmetodene som er presentert i avhandlingen. Eksperimentene viser at disse kan ha positiv en effekt på både prediktiv ytelse, stabilitet og tolkbarhet av resultatene. Bruken av disse variabelseleksjonsmetodene bærer et stort potensiale for bedre datadrevet beslutningsstøtte i klinisk praksis

    Efficient algorithms for simulation and analysis of many-body systems

    Get PDF
    This thesis introduces methods to efficiently generate and analyze time series data of many-body systems. While we have a strong focus on biomolecular processes, the presented methods can also be applied more generally. Due to limitations of microscope resolution in both space and time, biomolecular processes are especially hard to observe experimentally. Computer models offer an opportunity to work around these limitations. However, as these models are bound by computational effort, careful selection of the model as well as its efficient implementation play a fundamental role in their successful sampling and/or estimation. Especially for high levels of resolution, computer simulations can produce vast amounts of high-dimensional data and in general it is not straightforward to visualize, let alone to identify the relevant features and processes. To this end, we cover tools for projecting time series data onto important processes, finding over time geometrically stable features in observable space, and identifying governing dynamics. We introduce the novel software library deeptime with two main goals: (1) making methods which were developed in different communities (such as molecular dynamics and fluid dynamics) accessible to a broad user base by implementing them in a general-purpose way, and (2) providing an easy to install, extend, and maintain library by employing a high degree of modularity and introducing as few hard dependencies as possible. We demonstrate and compare the capabilities of the provided methods based on numerical examples. Subsequently, the particle-based reaction-diffusion simulation software package ReaDDy2 is introduced. It can simulate dynamics which are more complicated than what is usually analyzed with the methods available in deeptime. It is a significantly more efficient, feature-rich, flexible, and user-friendly version of its predecessor ReaDDy. As such, it enables---at the simulation model's resolution---the possibility to study larger systems and to cover longer timescales. In particular, ReaDDy2 is capable of modeling complex processes featuring particle crowding, space exclusion, association and dissociation events, dynamic formation and dissolution of particle geometries on a mesoscopic scale. The validity of the ReaDDy2 model is asserted by several numerical studies which are compared to analytically obtained results, simulations from other packages, or literature data. Finally, we present reactive SINDy, a method that can detect reaction networks from concentration curves of chemical species. It extends the SINDy method---contained in deeptime---by introducing coupling terms over a system of ordinary differential equations in an ansatz reaction space. As such, it transforms an ordinary linear regression problem to a linear tensor regression. The method employs a sparsity-promoting regularization which leads to especially simple and interpretable models. We show in biologically motivated example systems that the method is indeed capable of detecting the correct underlying reaction dynamics and that the sparsity regularization plays a key role in pruning otherwise spuriously detected reactions

    Quantum-Inspired Machine Learning: a Survey

    Full text link
    Quantum-inspired Machine Learning (QiML) is a burgeoning field, receiving global attention from researchers for its potential to leverage principles of quantum mechanics within classical computational frameworks. However, current review literature often presents a superficial exploration of QiML, focusing instead on the broader Quantum Machine Learning (QML) field. In response to this gap, this survey provides an integrated and comprehensive examination of QiML, exploring QiML's diverse research domains including tensor network simulations, dequantized algorithms, and others, showcasing recent advancements, practical applications, and illuminating potential future research avenues. Further, a concrete definition of QiML is established by analyzing various prior interpretations of the term and their inherent ambiguities. As QiML continues to evolve, we anticipate a wealth of future developments drawing from quantum mechanics, quantum computing, and classical machine learning, enriching the field further. This survey serves as a guide for researchers and practitioners alike, providing a holistic understanding of QiML's current landscape and future directions.Comment: 56 pages, 13 figures, 8 table

    Patient Dropout Prediction in Virtual Health: A Multimodal Dynamic Knowledge Graph and Text Mining Approach

    Full text link
    Virtual health has been acclaimed as a transformative force in healthcare delivery. Yet, its dropout issue is critical that leads to poor health outcomes, increased health, societal, and economic costs. Timely prediction of patient dropout enables stakeholders to take proactive steps to address patients' concerns, potentially improving retention rates. In virtual health, the information asymmetries inherent in its delivery format, between different stakeholders, and across different healthcare delivery systems hinder the performance of existing predictive methods. To resolve those information asymmetries, we propose a Multimodal Dynamic Knowledge-driven Dropout Prediction (MDKDP) framework that learns implicit and explicit knowledge from doctor-patient dialogues and the dynamic and complex networks of various stakeholders in both online and offline healthcare delivery systems. We evaluate MDKDP by partnering with one of the largest virtual health platforms in China. MDKDP improves the F1-score by 3.26 percentage points relative to the best benchmark. Comprehensive robustness analyses show that integrating stakeholder attributes, knowledge dynamics, and compact bilinear pooling significantly improves the performance. Our work provides significant implications for healthcare IT by revealing the value of mining relations and knowledge across different service modalities. Practically, MDKDP offers a novel design artifact for virtual health platforms in patient dropout management

    Joint learning from multiple information sources for biological problems

    Get PDF
    Thanks to technological advancements, more and more biological data havebeen generated in recent years. Data availability offers unprecedented opportunities to look at the same problem from multiple aspects. It also unveils a more global view of the problem that takes into account the intricated inter-play between the involved molecules/entities. Nevertheless, biological datasets are biased, limited in quantity, and contain many false-positive samples. Such challenges often drastically downgrade the performance of a predictive model on unseen data and, thus, limit its applicability in real biological studies. Human learning is a multi-stage process in which we usually start with simple things. Through the accumulated knowledge over time, our cognition ability extends to more complex concepts. Children learn to speak simple words before being able to formulate sentences. Similarly, being able to speak correct sentences supports our learning to speak correct and meaningful paragraphs, etc. Generally, knowledge acquired from related learning tasks would help boost our learning capability in the current task. Motivated by such a phenomenon, in this thesis, we study supervised machine learning models for bioinformatics problems that can improve their performance through exploiting multiple related knowledge sources. More specifically, we concern with ways to enrich the supervised models’ knowledge base with publicly available related data to enhance the computational models’ prediction performance. Our work shares commonality with existing works in multimodal learning, multi-task learning, and transfer learning. Nevertheless, there are certain differences in some cases. Besides the proposed architectures, we present large-scale experiment setups with consensus evaluation metrics along with the creation and release of large datasets to showcase our approaches’ superiority. Moreover, we add case studies with detailed analyses in which we place no simplified assumptions to demonstrate the systems’ utilities in realistic application scenarios. Finally, we develop and make available an easy-to-use website for non-expert users to query the model’s generated prediction results to facilitate field experts’ assessments and adaptation. We believe that our work serves as one of the first steps in bridging the gap between “Computer Science” and “Biology” that will open a new era of fruitful collaboration between computer scientists and biological field experts

    Applications

    Get PDF
    Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine for risk modelling, diagnosis, and treatment selection for diseases in electronics, steel production and milling for quality control during manufacturing processes in traffic, logistics for smart cities and for mobile communications

    Large Scale Kernel Methods for Fun and Profit

    Get PDF
    Kernel methods are among the most flexible classes of machine learning models with strong theoretical guarantees. Wide classes of functions can be approximated arbitrarily well with kernels, while fast convergence and learning rates have been formally shown to hold. Exact kernel methods are known to scale poorly with increasing dataset size, and we believe that one of the factors limiting their usage in modern machine learning is the lack of scalable and easy to use algorithms and software. The main goal of this thesis is to study kernel methods from the point of view of efficient learning, with particular emphasis on large-scale data, but also on low-latency training, and user efficiency. We improve the state-of-the-art for scaling kernel solvers to datasets with billions of points using the Falkon algorithm, which combines random projections with fast optimization. Running it on GPUs, we show how to fully utilize available computing power for training kernel machines. To boost the ease-of-use of approximate kernel solvers, we propose an algorithm for automated hyperparameter tuning. By minimizing a penalized loss function, a model can be learned together with its hyperparameters, reducing the time needed for user-driven experimentation. In the setting of multi-class learning, we show that – under stringent but realistic assumptions on the separation between classes – a wide set of algorithms needs much fewer data points than in the more general setting (without assumptions on class separation) to reach the same accuracy. The first part of the thesis develops a framework for efficient and scalable kernel machines. This raises the question of whether our approaches can be used successfully in real-world applications, especially compared to alternatives based on deep learning which are often deemed hard to beat. The second part aims to investigate this question on two main applications, chosen because of the paramount importance of having an efficient algorithm. First, we consider the problem of instance segmentation of images taken from the iCub robot. Here Falkon is used as part of a larger pipeline, but the efficiency afforded by our solver is essential to ensure smooth human-robot interactions. In the second instance, we consider time-series forecasting of wind speed, analysing the relevance of different physical variables on the predictions themselves. We investigate different schemes to adapt i.i.d. learning to the time-series setting. Overall, this work aims to demonstrate, through novel algorithms and examples, that kernel methods are up to computationally demanding tasks, and that there are concrete applications in which their use is warranted and more efficient than that of other, more complex, and less theoretically grounded models

    Flexible estimation of temporal point processes and graphs

    Get PDF
    Handling complex data types with spatial structures, temporal dependencies, or discrete values, is generally a challenge in statistics and machine learning. In the recent years, there has been an increasing need of methodological and theoretical work to analyse non-standard data types, for instance, data collected on protein structures, genes interactions, social networks or physical sensors. In this thesis, I will propose a methodology and provide theoretical guarantees for analysing two general types of discrete data emerging from interactive phenomena, namely temporal point processes and graphs. On the one hand, temporal point processes are stochastic processes used to model event data, i.e., data that comes as discrete points in time or space where some phenomenon occurs. Some of the most successful applications of these discrete processes include online messages, financial transactions, earthquake strikes, and neuronal spikes. The popularity of these processes notably comes from their ability to model unobserved interactions and dependencies between temporally and spatially distant events. However, statistical methods for point processes generally rely on estimating a latent, unobserved, stochastic intensity process. In this context, designing flexible models and consistent estimation methods is often a challenging task. On the other hand, graphs are structures made of nodes (or agents) and edges (or links), where an edge represents an interaction or relationship between two nodes. Graphs are ubiquitous to model real-world social, transport, and mobility networks, where edges can correspond to virtual exchanges, physical connections between places, or migrations across geographical areas. Besides, graphs are used to represent correlations and lead-lag relationships between time series, and local dependence between random objects. Graphs are typical examples of non-Euclidean data, where adequate distance measures, similarity functions, and generative models need to be formalised. In the deep learning community, graphs have become particularly popular within the field of geometric deep learning. Structure and dependence can both be modelled by temporal point processes and graphs, although predominantly, the former act on the temporal domain while the latter conceptualise spatial interactions. Nonetheless, some statistical models combine graphs and point processes in order to account for both spatial and temporal dependencies. For instance, temporal point processes have been used to model the birth times of edges and nodes in temporal graphs. Moreover, some multivariate point processes models have a latent graph parameter governing the pairwise causal relationships between the components of the process. In this thesis, I will notably study such a model, called the Hawkes model, as well as graphs evolving in time. This thesis aims at designing inference methods that provide flexibility in the contexts of temporal point processes and graphs. This manuscript is presented in an integrated format, with four main chapters and two appendices. Chapters 2 and 3 are dedicated to the study of Bayesian nonparametric inference methods in the generalised Hawkes point process model. While Chapter 2 provides theoretical guarantees for existing methods, Chapter 3 also proposes, analyses, and evaluates a novel variational Bayes methodology. The other main chapters introduce and study model-free inference approaches for two estimation problems on graphs, namely spectral methods for the signed graph clustering problem in Chapter 4, and a deep learning algorithm for the network change point detection task on temporal graphs in Chapter 5. Additionally, Chapter 1 provides an introduction and background preliminaries on point processes and graphs. Chapter 6 concludes this thesis with a summary and critical thinking on the works in this manuscript, and proposals for future research. Finally, the appendices contain two supplementary papers. The first one, in Appendix A, initiated after the COVID-19 outbreak in March 2020, is an application of a discrete-time Hawkes model to COVID-related deaths counts during the first wave of the pandemic. The second work, in Appendix B, was conducted during an internship at Amazon Research in 2021, and proposes an explainability method for anomaly detection models acting on multivariate time series

    Causal Modeling with Stationary Diffusions

    Full text link
    We develop a novel approach towards causal inference. Rather than structural equations over a causal graph, we learn stochastic differential equations (SDEs) whose stationary densities model a system's behavior under interventions. These stationary diffusion models do not require the formalism of causal graphs, let alone the common assumption of acyclicity. We show that in several cases, they generalize to unseen interventions on their variables, often better than classical approaches. Our inference method is based on a new theoretical result that expresses a stationarity condition on the diffusion's generator in a reproducing kernel Hilbert space. The resulting kernel deviation from stationarity (KDS) is an objective function of independent interest

    PAC-Bayesian Bandit Algorithms With Guarantees

    Get PDF
    PAC-Bayes is a mathematical framework that can be used to provide performance guarantees for machine learning algorithms, explain why specific machine learning algorithms work well, and design new machine learning algorithms. Since the first PAC-Bayesian theorems were proven in the late 1990's, several impressive milestones have been achieved. PAC-Bayes generalisation bounds have been used to prove tight error bounds for deep neural networks. In addition, PAC-Bayes bounds have been used to explain why machine learning principles such as large margin classification and preference for flat minima of a loss function work well. However, these milestones were achieved in simple supervised learning problems. In this thesis, inspired by the success of the PAC-Bayes framework in supervised learning settings, we investigate the potential of the PAC-Bayes framework as a tool for designing and analysing bandit algorithms. First, we provide a comprehensive overview of PAC-Bayes bounds for bandit problems and an experimental comparison of these bounds. Previous works focused on PAC-Bayes bounds for martingales and their application to importance sampling-based estimates of the reward or regret of a policy. On the one hand, we found that these PAC-Bayes bounds are a useful tool for designing offline policy search algorithms with performance guarantees. In our experiments, a PAC-Bayesian offline policy search algorithm was able to learn randomised neural network polices with competitive expected reward and non-vacuous performance guarantees. On the other hand, the PAC-Bayesian online policy search algorithms that we tested had underwhelming performance and loose cumulative regret bounds. Next, we present novel PAC-Bayes-style algorithms with worst-case regret bounds for linear bandit problems. We combine PAC-Bayes bounds with the "optimism in the face of uncertainty" principle, which reduces a stochastic bandit problem to the construction of a confidence sequence for the unknown reward function. We use a novel PAC-Bayes-style tail bound for adaptive martingale mixtures to construct convex PAC-Bayes-style confidence sequences for (sparse) linear bandits. We show that (sparse) linear bandit algorithms based on our PAC-Bayes-style confidence sequences are guaranteed to achieve competitive worst-case regret. We also show that our confidence sequences yield confidence bounds that are tighter than competitors, both empirically and theoretically. Finally, we demonstrate that our tighter PAC-Bayes-style confidence bounds result in bandit algorithms with improved cumulative regret
    corecore