9,058 research outputs found
Machine Learning Approaches for the Prioritisation of Cardiovascular Disease Genes Following Genome-wide Association Study
Genome-wide association studies (GWAS) have revealed thousands of genetic loci, establishing the method as a valuable tool for unravelling the complex biology of many diseases. Yet even as GWAS have grown in size and improved in study design, identifying truly causal signals and disentangling them from highly correlated markers linked by linkage disequilibrium (LD) remains challenging. This has severely limited the interpretation of GWAS findings and brought the method's value into question. Although thousands of disease susceptibility loci have been reported, the causal variants and genes at these loci remain elusive. Post-GWAS analysis aims to dissect this heterogeneity of variant and gene signals. In recent years, machine learning (ML) models have been developed for post-GWAS prioritisation, ranging from logistic regression to more complex ensemble models such as random forests and gradient boosting, as well as deep learning models (i.e., neural networks). Combined with functional validation, these methods have yielded important translational insights, providing a strong evidence-based approach to direct post-GWAS research. However, ML approaches are still in their infancy across biological applications, and as they continue to evolve an evaluation of their robustness for GWAS prioritisation is needed. Here, I investigate the ML landscape across selected models, input features, bias risk, and model performance, with a focus on building a prioritisation framework that is applied to blood pressure GWAS results and tested by re-application to blood lipid traits.
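As an illustration of the simplest kind of prioritisation model surveyed here, the following is a minimal logistic-regression ranking over hypothetical variant annotations. All feature meanings, labels, and data are invented for the example; this is not the thesis's framework.

```python
# Hedged sketch: logistic-regression prioritisation of GWAS variants,
# implemented in plain NumPy. Features and labels are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4
# Toy annotation matrix: columns might stand for conservation, distance
# to nearest gene, eQTL evidence, and an LD-adjusted association score.
X = rng.normal(size=(n, d))
# Toy labels: 1 = curated causal variant, 0 = background variant.
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n) > 1).astype(float)

w = np.zeros(d)
for _ in range(500):                       # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted causal probability
    w -= 0.1 * X.T @ (p - y) / n           # logistic-loss gradient step

scores = 1.0 / (1.0 + np.exp(-X @ w))
ranking = np.argsort(scores)[::-1]         # highest-priority variants first
print(ranking[:5])
```

In practice the ensemble and deep models mentioned above replace the linear scorer, but the output contract is the same: a per-variant score used to rank loci for follow-up.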
Using machine learning to predict pathogenicity of genomic variants throughout the human genome
More than 6,000 diseases are estimated to be caused by genomic variants. This can happen in many possible ways: a variant may stop the translation of a protein, interfere with gene regulation, or alter splicing of the transcribed mRNA into an unwanted isoform. It is necessary to investigate all of these processes in order to evaluate which variant may be causal for the deleterious phenotype. A great help in this regard are variant effect scores. Implemented as machine learning classifiers, they integrate annotations from different resources to rank genomic variants in terms of pathogenicity.
Developing a variant effect score requires multiple steps: annotation of the training data, feature selection, model training, benchmarking, and finally deployment for the model's application. Here, I present a generalized workflow of this process. It makes it simple to configure how information is converted into model features, enabling the rapid exploration of different annotations. The workflow further implements hyperparameter optimization, model validation and ultimately deployment of a selected model via genome-wide scoring of genomic variants.
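The staged workflow above might be sketched, in heavily simplified form, as follows. Every function body is a toy stand-in (ridge regression instead of the actual classifier, random "annotations"), not the CADD implementation.

```python
# Hedged sketch of the described workflow stages: annotation, training,
# hyperparameter optimisation, validation, and genome-wide scoring.
import numpy as np

rng = np.random.default_rng(1)

def annotate(variants):
    """Stand-in annotation step: one toy feature vector per variant."""
    return rng.normal(size=(len(variants), 3))

def train(X, y, l2):
    """Ridge-regularised least squares as a toy scoring model."""
    return np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ y)

variants = [f"chr1:{pos}" for pos in range(100, 400)]
X = annotate(variants)
y = (X[:, 0] > 0).astype(float)            # toy proxy-pathogenic labels

# Hyperparameter optimisation: keep the l2 penalty that validates best
# on a held-out half of the training data.
split = len(variants) // 2
best_l2 = min((np.mean((X[split:] @ train(X[:split], y[:split], l2)
                        - y[split:]) ** 2), l2)
              for l2 in (0.01, 0.1, 1.0))[1]

w = train(X, y, best_l2)                   # final model
scores = X @ w                             # "genome-wide" scoring pass
print(best_l2, scores.shape)
```

The point of the sketch is the configurability: swapping `annotate` or the hyperparameter grid changes the model without touching the rest of the pipeline, which is the property the workflow is built around.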
The workflow is applied to train Combined Annotation Dependent Depletion (CADD), a variant effect model that scores SNVs and InDels genome-wide. I show that the workflow can be quickly adapted to novel annotations by porting CADD to the genome reference GRCh38. Further, I demonstrate the integration of deep neural network scores as features into a new CADD model, improving the annotation of RNA splicing events. Finally, I apply the workflow to train multiple variant effect models from training data based on variants selected by allele frequency.
In conclusion, the developed workflow presents a flexible and scalable method to train variant effect scores. All software and developed scores are freely available from cadd.gs.washington.edu and cadd.bihealth.org.
Intelligent architecture to support second generation general accounting
Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Information Analysis and Management.
This study aimed to innovate the world of accounting software. After so many years, accountants are faced with an enormous amount of work that is not always productive, effective, or efficient for either the accountant or the company that provided the data required to carry out the accounting. Accounting software already exists with various automation processes, from ornamentation to profitability analysis and management reporting. There is also software that is updated in accordance with accounting law, i.e., the platform changes its mechanisms as the law changes.
Despite the existence of this software, manual work remains, and the amount of information accountants face is still very large. With so much information and data, it is difficult for accountants to do a fully reliable job. One of the most common problems in the accounting world is undoubtedly the miscalculation, or omission, of financial or non-financial data in accounting operations (income statements, balance sheets, etc.). To render accounting operations efficient, effective, productive, error-free, and fully reliable, an intelligent architecture has been developed to support second-generation general accounting. This architecture was designed to make existing software smarter with the help of artificial intelligence.
A study was carried out on key accounting concepts and on AI and the main process-automation techniques needed to build the model. These studies were intended to capture all possible requirements for the creation of the architecture. Towards the end of the thesis, the model is validated.
Unsupervised Learning of Distributional Properties can Supplement Human Labeling and Increase Active Learning Efficiency in Anomaly Detection
Exfiltration of data via email is a serious cybersecurity threat for many organizations. Detecting data exfiltration (anomaly) patterns typically requires labeling, most often done by a human annotator, to reduce the high number of false alarms. Active Learning (AL) is a promising approach for labeling data efficiently, but it needs to choose an efficient order in which cases are to be labeled, and it is uncertain what scoring procedure should be used to prioritize cases for labeling, especially when detecting rare cases of interest is crucial. We propose an adaptive AL sampling strategy that leverages the underlying prior data distribution, as well as model uncertainty, to produce batches of cases to be labeled that contain instances of rare anomalies. We show that (1) the classifier benefits from a batch of representative and informative instances of both normal and anomalous examples, and (2) unsupervised anomaly detection plays a useful role in building the classifier in the early stages of training, when relatively little labeling has been done. Our approach to AL for anomaly detection outperformed existing AL approaches on three highly unbalanced UCI benchmarks and on one real-world redacted email data set.
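A minimal sketch of the batch-selection idea follows, with toy stand-ins for both the classifier (a fixed sigmoid on one feature) and the unsupervised detector (distance from the data centroid); the paper's actual scoring procedures are not reproduced here.

```python
# Hedged sketch: rank unlabeled cases by a blend of classifier
# uncertainty and an unsupervised anomaly score, then label the top batch.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
X[:10] += 4.0                              # a few injected "anomalies"

p = 1.0 / (1.0 + np.exp(-X[:, 0]))         # stand-in classifier probability
uncertainty = 1.0 - np.abs(p - 0.5) * 2    # 1 near the decision boundary

center = X.mean(axis=0)
anomaly = np.linalg.norm(X - center, axis=1)   # distance-based outlier score
anomaly = anomaly / anomaly.max()

# Early in training, weight the unsupervised signal heavily; an adaptive
# scheme would shift this weight toward uncertainty as labels accumulate.
alpha = 0.8
priority = alpha * anomaly + (1 - alpha) * uncertainty
batch = np.argsort(priority)[::-1][:20]    # next 20 cases to send for labels
print(batch[:5])
```

With this weighting the injected outliers dominate the first batch, which mirrors the abstract's claim that unsupervised detection is most useful before much labeling has been done.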
Queues, Planes and Games: Algorithms for Scheduling Passengers, and Decision Making in Stackelberg Games
In this dissertation, I present three theoretical results with real-world applications related to scheduling and distributionally-robust games, important fields in discrete optimization, and computer science.
The first chapter provides simple, technology-free interventions to manage elevator queues in high-rise buildings when passenger demand far exceeds the capacity of the elevator system. The problem was motivated by the need to manage passengers safely in light of reduced elevator capacities during the COVID-19 pandemic. We use mathematical modeling, epidemiological expertise, and simulation to design and evaluate our algorithmic solutions. The key idea is to explicitly or implicitly group passengers that are going to the same floor into the same elevator as much as possible, substantiated theoretically using a technique from queuing theory known as stability analysis. This chapter is joint work with Charles Branas, Adam Elmachtoub, Clifford Stein, and Yeqing Zhou, directly in collaboration with the New York City Mayor’s Office of the Chief Technology Officer and the Department of Citywide Administrative Services.
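The grouping idea can be sketched with a toy boarding policy; the capacity and destination floors below are illustrative, and the actual interventions were designed and evaluated with queueing-theoretic analysis rather than this heuristic.

```python
# Hedged sketch: fill each (reduced-capacity) elevator with passengers
# heading to the same floor where possible, so each trip serves as few
# distinct floors as it can.
from collections import defaultdict

def group_boardings(passengers, capacity):
    """passengers: list of destination floors; returns boarding groups
    (lists of passenger indices), filled one destination at a time."""
    by_floor = defaultdict(list)
    for i, floor in enumerate(passengers):
        by_floor[floor].append(i)
    groups, current = [], []
    for floor in sorted(by_floor):         # drain one floor at a time
        for i in by_floor[floor]:
            current.append(i)
            if len(current) == capacity:
                groups.append(current)
                current = []
    if current:
        groups.append(current)
    return groups

groups = group_boardings([7, 3, 7, 7, 3, 12, 7], capacity=4)
print(groups)
```

Each elevator trip then stops at at most a couple of floors, which is the mechanism behind the stability gains the chapter establishes formally.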
The second chapter proposes new algorithms for recomputing passenger itineraries for airlines during major disruptions, when carefully planned schedules are thrown into disarray. An airline network is a massive temporal graph, often with tight regulatory and operational constraints. When disruptions propagate through an airline network, the objective is to recover from the disruption within a given time frame, meaning we replan the schedules it affects such that the new schedules match the originally planned schedules after the time frame. We aim to solve the large-scale airline recovery problem with quick, user-independent, consistent, and near-optimal algorithms. We provide new algorithms for the passenger recovery problem, given recovered flight and crew solutions. We build a preprocessing step and construct an Integer Program as well as a network-based approach based on solving multiple-label shortest path problems. Experiments show the tractability of our proposed algorithms on airline data sets with heavy flight disruptions. This chapter is joint work with Clifford Stein, stemming from an internship and collaboration with the Machine Learning team (Artificial Intelligence organization) of GE Global Research, Niskayuna, New York.
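The network-based flavour of passenger recovery can be illustrated with a toy earliest-arrival search over a handful of recovered flights; the airports, times, and minimum-connection rule here are invented for the example, and the chapter's actual algorithms use multiple labels and airline-scale graphs.

```python
# Hedged sketch: re-routing a disrupted passenger as a shortest-path
# (earliest-arrival) query on a tiny flight-connection graph.
import heapq

# (origin, destination, departure, arrival) for the recovered flights
flights = [("JFK", "ORD", 9, 11), ("JFK", "ATL", 10, 12),
           ("ORD", "LAX", 12, 14), ("ATL", "LAX", 13, 15),
           ("ORD", "LAX", 11, 13)]         # too tight to connect from JFK

def earliest_arrival(src, dst, ready, min_connect=1):
    """Label-setting search for the earliest feasible arrival time."""
    heap = [(ready, src)]
    best = {src: ready}
    while heap:
        t, airport = heapq.heappop(heap)
        if airport == dst:
            return t
        for o, d, dep, arr in flights:
            # No connection buffer is needed at the passenger's origin.
            if o == airport and dep >= t + (min_connect if airport != src else 0):
                if arr < best.get(d, float("inf")):
                    best[d] = arr
                    heapq.heappush(heap, (arr, d))
    return None                            # passenger cannot be recovered

print(earliest_arrival("JFK", "LAX", ready=9))
```

The infeasible 11:00 ORD-LAX leg shows why temporal feasibility, not just connectivity, drives the recovery graph's structure.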
The third chapter is about computing distributionally-robust strategies for a popular game theory model called Stackelberg games, in which one player, called the leader, commits to a strategy first, assuming the other player(s), called the follower(s), will best respond to it. In many real-world applications of Stackelberg games, parameters such as the payoffs of the follower(s) are not known with certainty. Distributionally-robust optimization places a distribution over the possible model parameters, where this distribution itself comes from a set of possible distributions. The goal for the leader is to maximize their expected utility with respect to the worst-case distribution from the set. We initiate the study of distributionally-robust models for Stackelberg games, show that a distributionally-robust Stackelberg equilibrium always exists across a wide array of uncertainty models, and provide tractable algorithms for some general settings, with experimental results. This chapter is joint work with Christian Kroer.
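The leader's problem described above can be written, in one common formulation (the notation here is illustrative, not necessarily the chapter's), as a max-min over the uncertainty set of parameter distributions:

```latex
% Leader mixed strategy x, uncertainty set \mathcal{D} of distributions
% over follower payoff parameters \theta, best response b_\theta(x).
\max_{x \in \Delta} \; \inf_{D \in \mathcal{D}} \;
    \mathbb{E}_{\theta \sim D}\!\left[ u_\ell\big(x,\, b_\theta(x)\big) \right],
\qquad
b_\theta(x) \in \arg\max_{a} \; u_f^{\theta}(x, a).
```

Choosing $\mathcal{D}$ as a singleton recovers the Bayesian Stackelberg game, while larger sets interpolate toward the fully robust (worst-case) model.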
Improved Deep Neural Networks for Generative Robotic Grasping
This thesis provides a thorough evaluation of current state-of-the-art robotic grasping methods and contributes to a subset of data-driven grasp estimation approaches, termed generative models. These models aim to directly generate grasp region proposals from a given image without the need for a separate analysis and ranking step, which can be computationally expensive. This approach allows for fully end-to-end training of a model and quick closed-loop operation of a robot arm.
A number of limitations within these generative models are identified and addressed, with contributions that directly target each stage of the training pipeline to help form accurate grasp proposals and generalise better to unseen objects. Firstly, inspired by theories of object manipulation within the mammalian visual system, the use of multi-task learning in existing generative architectures is evaluated. This aims to improve the performance of grasping algorithms when presented with impoverished colour (RGB) data by training models to perform simultaneous tasks such as object categorisation, saliency detection, and depth reconstruction. Secondly, a novel loss function is introduced which improves overall performance by rewarding the network for focusing only on learning grasps at suitable positions. This reduces overall training times and yields better performance from fewer training examples. The final contribution analyses the problems with the most common metric used for evaluating and comparing offline performance between grasping models and algorithms. To this end, a Gaussian method of representing ground-truth labelled grasps is put forward, and the resulting optimal grasp locations are tested in a simulated grasping environment.
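The Gaussian ground-truth representation can be sketched as a quality map that peaks at a labelled grasp centre; the image size, centre coordinates, and spread below are illustrative, not the thesis's parameters.

```python
# Hedged sketch: encode a labelled grasp centre as a 2-D Gaussian quality
# map rather than a binary rectangle, giving a smooth training target.
import numpy as np

def gaussian_grasp_map(h, w, cy, cx, sigma=5.0):
    """Quality in [0, 1] for every pixel, peaking at (cy, cx)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

qmap = gaussian_grasp_map(64, 64, cy=20, cx=40)
peak = np.unravel_index(qmap.argmax(), qmap.shape)
print(peak)        # the map's maximum sits at the labelled grasp centre
```

A smooth target like this penalises near-misses less than distant predictions, which is what makes it attractive as an alternative to the rectangle-overlap metric.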
The combination of these novel additions to generative models results in improved grasp success, accuracy, and performance on common benchmark datasets compared to previous approaches. The efficacy of these contributions is also tested on a physical robotic arm, demonstrating the ability to effectively grasp previously unseen 3D-printed objects of varying complexity and difficulty without the need for domain adaptation. Finally, future directions are discussed for generative convolutional models within the overall field of robotic grasping.
Proceedings of the 33rd Annual Workshop of the Psychology of Programming Interest Group
This is the Proceedings of the 33rd Annual Workshop of the Psychology of Programming Interest Group (PPIG). This was the first PPIG to be held in person since 2019, following the two online-only PPIGs of 2020 and 2021 during the Covid pandemic. It was also the first PPIG conference designed specifically for hybrid attendance. Reflecting the theme, it was hosted by the Music Computing Lab at the Open University in Milton Keynes.
Bayesian inference and learning in switching biological systems
This thesis is concerned with the stochastic modeling of and inference for switching biological systems. Motivated by the great variety of data obtainable from such systems by wet-lab experiments or computer simulations, continuous-time as well as discrete-time frameworks are devised. Similarly, different latent state-space configurations - both hybrid continuous-discrete and purely discrete state spaces - are considered. These models enable Bayesian inferences about the temporal system dynamics as well as the respective parameters.
Starting with the exact model formulations, principled approximations are derived using sampling and variational techniques, enabling computationally tractable algorithms. The resulting frameworks are evaluated under the modeling assumption and subsequently applied to common benchmark problems and real-world biological data.
These developments are divided into three scientific contributions:
First, a Markov chain Monte Carlo method for continuous-time and continuous-discrete state-space hybrid processes is derived. These hybrid processes are formulated as Markov-switching stochastic differential equations, for which the exact evolution equation is also presented. A Gibbs sampling scheme is then derived which enables tractable inference both for the system dynamics and the system parameters. This approach is validated under the modeling assumption as well as applied to data from a wet-lab gene-switching experiment.
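As background to this inference problem, a Markov-switching diffusion of the kind described can be simulated forward with Euler-Maruyama; all rates and drift parameters below are toy choices, and this is the generative model only, not the Gibbs sampler.

```python
# Hedged sketch: forward simulation of a two-regime Markov-switching SDE.
# The discrete regime z flips at a constant rate (telegraph process) and
# selects the drift target of an Ornstein-Uhlenbeck-style diffusion x.
import numpy as np

rng = np.random.default_rng(3)
dt, T = 0.01, 2000                # step size and number of steps
rate = 0.5                        # switching rate between the two regimes
mu = {0: -1.0, 1: 1.0}            # regime-dependent drift targets
sigma = 0.3                       # diffusion strength

z, x = 0, 0.0
zs, xs = [], []
for _ in range(T):
    if rng.random() < rate * dt:  # discrete regime switch
        z = 1 - z
    # Euler-Maruyama step: drift toward the active regime's target
    x += (mu[z] - x) * dt + sigma * np.sqrt(dt) * rng.normal()
    zs.append(z); xs.append(x)

print(len(xs), zs[-1])
```

The inference task is the reverse direction: given a trajectory like `xs`, recover the hidden regime path `zs` and the parameters, which is what the Gibbs sampling scheme addresses.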
Second, a variational approach to the same problem is taken to speed up the inference procedure. To this end, a mixture of Gaussian processes serves as the variational measure. The method is derived starting from the Kullback-Leibler divergence between two true switching stochastic differential equations, and it is shown in which regime the Gaussian mixture approximation is valid. It is then benchmarked on the same ground-truth data as the Gibbs sampler and applied to model systems from computational structural biology.
Third, a nonparametric inference framework is laid out for conformational molecule switching. Here, a purely discrete latent state space is assumed, where each latent state corresponds to one molecular structure. Utilizing variational techniques again, a method is presented to identify the number of conformations present in the data.
This method generalizes the framework of Markov state models, which is well established in the field of computational structural biology. An observation likelihood model tailored to structural molecule data is introduced, along with a suitable approximation enabling tractable inference. This framework, too, is first evaluated on data generated under the model assumption and then applied to common problems in the field.