27 research outputs found

    Assessment of the Physiological Network in Sleep Apnea

    Objective: Machine Learning models, in particular Artificial Neural Networks, have been shown to be applicable in clinical research, for example for tumor detection and sleep phase classification. Applications in systems medicine and biology, for example in Physiological Networks, could benefit from the ability of these methods to recognize patterns in high-dimensional data, but the decisions of an Artificial Neural Network cannot be interpreted from the model itself. In a medical context this is an undesirable characteristic, because hidden age, gender or other data biases negatively impact model quality. If insights are based on a biased model, the ability of an independent study to come to similar conclusions is limited, and an essential property of scientific experiments, known as results reproducibility, is violated. Besides results reproducibility, methods reproducibility allows others to reproduce the exact outputs of computational experiments, but requires data, code and runtime environments to be available. These challenges in interpretability and reproducibility are addressed as part of an assessment of the Physiological Network in Obstructive Sleep Apnea. Approach: A research platform is developed that connects medical data, code and environments to enable methods reproducibility. The platform employs a compute cluster or cloud to accelerate the demanding model training. Artificial Neural Networks are trained on the Physiological Network data of a healthy control group to predict age and gender, in order to verify the influence of these biases. In a subsequent study, an Artificial Neural Network is trained to distinguish the Physiological Networks of an Obstructive Sleep Apnea group and a healthy control group. The state-of-the-art interpretation method DeepLift is applied to explain model predictions. Results: An existing collaboration platform has been extended for sleep research data, and modern container technologies are used to distribute training environments in compute clusters. Artificial Neural Network models predict the age of healthy subjects to within one decade and classify gender with 91% accuracy. Because these biases were confirmed, a matched dataset is created for the classification of Obstructive Sleep Apnea. The classification accuracy reaches 87%, and DeepLift identifies biomarkers as significant indicators for or against the disorder. Analysis of misclassified samples points to potential Obstructive Sleep Apnea phenotypes in deep sleep. Significance: The presented platform is extensible for future use cases and focuses on the reproducibility of computational experiments, a concern across many disciplines. Machine learning approaches solve analysis tasks on high-dimensional data, and novel interpretation techniques provide the transparency required for medical applications.
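
    As a rough illustration of the classify-then-interpret pipeline described above, the following is a minimal sketch of a feed-forward classifier on hypothetical physiological-network features, using gradient-times-input attribution as a simplified stand-in for DeepLift; the feature dimension, layer sizes and names are assumptions made for illustration, not the models from the thesis.

        # Illustrative sketch only: a small feed-forward classifier on hypothetical
        # physiological-network features, with gradient*input attribution standing in
        # for DeepLift. Feature count, layer sizes and names are assumptions.
        import torch
        import torch.nn as nn

        N_FEATURES = 64  # e.g. pairwise couplings between physiological signals (assumed)

        class OsaClassifier(nn.Module):
            def __init__(self, n_features: int = N_FEATURES):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(n_features, 32), nn.ReLU(),
                    nn.Linear(32, 16), nn.ReLU(),
                    nn.Linear(16, 1),  # single logit: OSA vs. healthy control
                )

            def forward(self, x):
                return self.net(x)

        def attribute(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
            """Gradient*input attribution for one sample (simplified surrogate for DeepLift)."""
            x = x.clone().requires_grad_(True)
            model(x).sum().backward()
            return x.grad * x  # per-feature contribution towards the OSA logit

        model = OsaClassifier()
        sample = torch.randn(1, N_FEATURES)    # placeholder for one subject's network features
        print(attribute(model, sample).shape)  # torch.Size([1, 64])

    DeepLift itself compares activations against a reference input rather than relying on raw gradients; the surrogate above only conveys the idea of per-feature contributions.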

    Computer Science 2019 APR Self-Study & Documents

    UNM Computer Science APR self-study report and review team report for Spring 2019, fulfilling the requirements of the Higher Learning Commission.

    Pacific Symposium on Biocomputing 2023

    The Pacific Symposium on Biocomputing (PSB) 2023 is an international, multidisciplinary conference for the presentation and discussion of current research in the theory and application of computational methods in problems of biological significance. Presentations are rigorously peer reviewed and are published in an archival proceedings volume. PSB 2023 will be held on January 3-7, 2023 in Kohala Coast, Hawaii. Tutorials and workshops will be offered prior to the start of the conference. PSB 2023 will bring together top researchers from the US, the Asian Pacific nations, and around the world to exchange research results and address open issues in all aspects of computational biology. It is a forum for the presentation of work in databases, algorithms, interfaces, visualization, modeling, and other computational methods, as applied to biological problems, with emphasis on applications in data-rich areas of molecular biology. The PSB has been designed to be responsive to the need for critical mass in sub-disciplines within biocomputing. For that reason, it is the only meeting whose sessions are defined dynamically each year in response to specific proposals. PSB sessions are organized by leaders of research in biocomputing's 'hot topics.' In this way, the meeting provides an early forum for serious examination of emerging methods and approaches in this rapidly changing field.

    Workflow models for heterogeneous distributed systems

    The role of data in modern scientific workflows is becoming more and more crucial. The unprecedented amount of data available in the digital era, combined with recent advancements in Machine Learning and High-Performance Computing (HPC), has let computers surpass human performance in a wide range of fields, such as Computer Vision, Natural Language Processing and Bioinformatics. However, a solid data management strategy is crucial for key aspects like performance optimisation, privacy preservation and security. Most modern programming paradigms for Big Data analysis adhere to the principle of data locality: moving computation closer to the data to remove transfer-related overheads and risks. Still, there are scenarios in which it is worthwhile, or even unavoidable, to transfer data between different steps of a complex workflow. The contribution of this dissertation is twofold. First, it defines a novel methodology for distributed modular applications, allowing topology-aware scheduling and data management while separating business logic, data dependencies, parallel patterns and execution environments. In addition, it introduces computational notebooks as a high-level and user-friendly interface to this new kind of workflow, aiming to flatten the learning curve and improve the adoption of such a methodology. Each of these contributions is accompanied by a full-fledged, Open Source implementation, which has been used for evaluation purposes and allows the interested reader to experience the related methodology first-hand. The validity of the proposed approaches has been demonstrated on a total of five real scientific applications in the domains of Deep Learning, Bioinformatics and Molecular Dynamics Simulation, executing them on large-scale mixed cloud-HPC infrastructures.
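
    As a loose illustration of the separation the dissertation argues for (business logic kept apart from data dependencies and execution locations), here is a small hypothetical sketch of a workflow-step abstraction with a naive dependency-driven scheduler; the class and field names are invented for illustration and are not the dissertation's actual API.

        # Minimal sketch of separating business logic, data dependencies and execution
        # locations in a workflow description. Names and fields are illustrative assumptions.
        from dataclasses import dataclass, field
        from typing import Callable, Dict, List

        @dataclass
        class Step:
            name: str
            run: Callable[[Dict[str, object]], Dict[str, object]]  # business logic only
            inputs: List[str] = field(default_factory=list)        # named data dependencies
            outputs: List[str] = field(default_factory=list)
            location: str = "local"                                 # e.g. "hpc", "cloud" (assumed labels)

        def execute(steps: List[Step]) -> Dict[str, object]:
            """Naive scheduler: run each step once all of its inputs are available."""
            data: Dict[str, object] = {}
            pending = list(steps)
            while pending:
                ready = [s for s in pending if all(i in data for i in s.inputs)]
                if not ready:
                    raise RuntimeError("unsatisfiable data dependencies")
                for s in ready:
                    print(f"running {s.name} on {s.location}")      # placement decision would go here
                    data.update(s.run({k: data[k] for k in s.inputs}))
                    pending.remove(s)
            return data

        # Toy example: preprocessing on a cloud node feeding training on an HPC node.
        steps = [
            Step("preprocess", lambda d: {"features": [1, 2, 3]}, outputs=["features"], location="cloud"),
            Step("train", lambda d: {"model": sum(d["features"])}, inputs=["features"],
                 outputs=["model"], location="hpc"),
        ]
        print(execute(steps))

    A real implementation would use the declared locations and dependencies to drive topology-aware placement and data transfers; the sketch only prints them.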

    New Algorithms for Large Datasets and Distributions

    In this dissertation, we make progress on certain algorithmic problems broadly over two computational models: the streaming model for large datasets and the distribution testing model for large probability distributions. First we consider the streaming model, where a large sequence of data items arrives one by one. The computer needs to make one pass over this sequence, processing every item quickly, in a limited space. In Chapter 2, motivated by a bioinformatics application, we consider the problem of estimating the number of low-frequency items in a stream, which has received only limited theoretical attention so far. We give an efficient streaming algorithm for this problem and show that its complexity is almost optimal. In Chapter 3 we consider a distributed variation of the streaming model, where each item of the data sequence arrives arbitrarily at one of a set of computers, which together need to compute certain functions over the entire stream. In such scenarios, combining the data at a single computer is infeasible due to the large communication overhead. We give the first algorithm for k-median clustering in this model. Moreover, we give new algorithms for frequency moments and clustering functions in the distributed sliding window model, where the computation is limited to the most recent W items as they arrive in the stream. In Chapter 5 we turn to the identity testing problem: given two distributions P (unknown; only samples are obtained) and Q (known) over a common sample space of exponential size, we need to distinguish P = Q (output ‘yes’) from P being far from Q (output ‘no’). In general, this problem requires an exponential number of samples. To circumvent this lower bound, the problem was recently studied under certain structural assumptions. In particular, optimally efficient testers were given assuming P and Q are product distributions. For such product distributions, we give the first tolerant testers, which output ‘yes’ not only when P = Q but also when P is close to Q. Likewise, we study the tolerant closeness testing problem for such product distributions, where Q too is accessed only through samples. Adviser: Vinodchandran N. Variya
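
    As a toy illustration of the one-pass, limited-space setting, the sketch below estimates the number of low-frequency items by hashing-based subsampling of the item space, keeping exact counts only for sampled items; it is a simplified stand-in under arbitrary assumptions (sampling rate, threshold), not the algorithm from Chapter 2.

        # Toy one-pass sketch: estimate the number of "rare" items (frequency <= threshold)
        # by subsampling the item space with a hash function. Illustrative only; the
        # sampling rate and threshold are arbitrary assumptions.
        import random
        from collections import Counter

        def estimate_rare_items(stream, threshold=2, sample_rate=0.1, seed=0):
            """Estimate how many distinct items occur at most `threshold` times."""
            salt = random.Random(seed).getrandbits(64)
            counts = Counter()                     # exact counts, but only for sampled items
            for item in stream:                    # single pass over the data
                if (hash((salt, item)) % 10_000) < sample_rate * 10_000:
                    counts[item] += 1
            rare_in_sample = sum(1 for c in counts.values() if c <= threshold)
            return rare_in_sample / sample_rate    # scale up by the inverse sampling rate

        # Synthetic stream: one heavy item plus many items that appear only once.
        stream = ["heavy"] * 500 + [f"rare-{i}" for i in range(1000)]
        print(estimate_rare_items(stream))         # roughly 1000, plus sampling noise

    The sketch's memory grows only with the number of sampled distinct items rather than the full stream, which is the kind of space/accuracy trade-off streaming algorithms formalise.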

    Understanding patient experience from online medium

    Improving patient experience at hospitals leads to better health outcomes. To improve it, we must first understand and interpret patients' written feedback. Patient-generated texts, such as patient reviews on RateMD or posts in online health forums on WebMD, are venues where patients write about their experiences. Due to the massive amount of patient-generated text that exists online, an automated approach to identifying the topics of a patient experience taxonomy is the only realistic option for analyzing these texts. However, not only is there a lack of annotated taxonomy for these media, but word usage is also colloquial, making it challenging to apply standardized NLP techniques to identify the topics present in patient-generated texts. Furthermore, patients may describe multiple topics in a single text, which drastically increases the complexity of the task. In this thesis, we address the challenges in comprehensively and automatically understanding the patient experience from patient-generated texts. We first built a set of rich semantic features to represent the corpus, which helps capture meanings that may not typically be captured by the bag-of-words (BOW) model. Unlike the BOW model, the semantic feature representation captures the context and in-depth meaning behind each word in the corpus. To the best of our knowledge, no existing work on understanding patient experience from patient-generated texts examines which semantic features help capture the characteristics of the corpus. Furthermore, patients generally talk about multiple topics when they write patient-generated texts, and these topics are frequently interdependent. There are two types of topic interdependencies: those that are semantically similar and those that are not. We built a constraint-based deep neural network classifier to capture the two types of topic interdependencies and empirically show its classification performance improvement over baseline approaches. Past research has also indicated that patient experiences differ across patient segments [1-4]. The segments can be based on demographics, for instance race, gender, or geographical location. Similarly, the segments can be based on health status, for example whether or not the patient is taking medication, whether or not the patient has a particular disease, or whether or not the patient is readmitted to the hospital. To better understand patient experiences, we built an automated approach to identify patient segments, with a focus on whether or not a person has stopped taking their medication. The technique used to identify the patient segment is general enough that we envision the approach being applicable to other types of patient segments. With a comprehensive understanding of patient experiences, we envision an application system where clinicians can directly read the most relevant patient-generated texts that pertain to their interests. The system can capture topics from the patient experience taxonomy that are of interest to each clinician or designated expert, and we believe the system is one of many approaches that can ultimately help improve the patient experience.
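
    To make the idea of constraint-based topic classification more concrete, here is a small hypothetical sketch of a multi-label classifier whose loss adds a soft penalty when an assumed pairwise topic dependency is violated; the topic list, the constraint and the architecture are invented for illustration and do not reflect the thesis model.

        # Hypothetical sketch: multi-label topic classifier with a soft constraint penalty.
        # Topics, the assumed "implies" relation and sizes are illustrative assumptions.
        import torch
        import torch.nn as nn

        TOPICS = ["staff", "wait_time", "billing", "medication"]   # assumed taxonomy labels
        # Assumed soft constraint: texts about medication tend to also involve staff.
        IMPLIES = [(TOPICS.index("medication"), TOPICS.index("staff"))]

        class TopicClassifier(nn.Module):
            def __init__(self, n_features=300, n_topics=len(TOPICS)):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                         nn.Linear(64, n_topics))

            def forward(self, x):
                return self.net(x)                                  # one logit per topic

        def constrained_loss(logits, targets, weight=0.1):
            """BCE plus a penalty when an implied topic is predicted weaker than its source."""
            bce = nn.functional.binary_cross_entropy_with_logits(logits, targets)
            probs = torch.sigmoid(logits)
            penalty = sum(torch.relu(probs[:, a] - probs[:, b]).mean() for a, b in IMPLIES)
            return bce + weight * penalty

        model = TopicClassifier()
        x = torch.randn(8, 300)                                     # e.g. 8 reviews as semantic feature vectors
        y = torch.randint(0, 2, (8, len(TOPICS))).float()           # multi-label targets
        loss = constrained_loss(model(x), y)
        loss.backward()                                             # trainable end-to-end
        print(float(loss))

    Raising the penalty weight trades raw label accuracy for consistency with the declared topic interdependencies.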
