
    Behavioral policies, evidence, and expertise


    Machine Learning Approaches for the Prioritisation of Cardiovascular Disease Genes Following Genome-wide Association Study

    Genome-wide association studies (GWAS) have revealed thousands of genetic loci, establishing themselves as a valuable method for unravelling the complex biology of many diseases. Yet even as GWAS have grown in size and improved in study design, identifying truly causal signals and disentangling them from other highly correlated markers in linkage disequilibrium (LD) remains challenging. This has severely limited the interpretation of GWAS findings and brought the method's value into question. Although thousands of disease susceptibility loci have been reported, the causal variants and genes at these loci remain elusive. Post-GWAS analysis aims to dissect the heterogeneity of variant and gene signals. In recent years, machine learning (ML) models have been developed for post-GWAS prioritisation, ranging from logistic regression to more complex ensemble models such as random forests and gradient boosting, as well as deep learning models (i.e., neural networks). When combined with functional validation, these methods have yielded important translational insights, providing a strong evidence-based approach to direct post-GWAS research. However, ML approaches are still in their infancy across biological applications, and as they continue to evolve, an evaluation of their robustness for GWAS prioritisation is needed. Here, I investigate the landscape of ML across selected models, input features, bias risk, and output model performance, with a focus on building a prioritisation framework that is applied to blood pressure GWAS results and tested on re-application to blood lipid traits.
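
    As a rough illustration of the kind of prioritisation model described above, the following sketch trains a gradient-boosted classifier on a hypothetical table of per-gene annotation features and ranks candidate genes by predicted probability of being causal; the file name, feature columns, and labels are illustrative assumptions, not the data used in the thesis.

        # Sketch: rank candidate genes from GWAS loci with a gradient-boosted classifier.
        # The CSV, feature columns, and "known_causal" label are illustrative assumptions.
        import pandas as pd
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.model_selection import cross_val_score

        genes = pd.read_csv("candidate_genes.csv")  # one row per candidate gene
        features = genes[["distance_to_lead_snp", "eqtl_evidence",
                          "coding_consequence", "pathway_score"]]
        labels = genes["known_causal"]  # 1 = curated causal gene, 0 = other

        # Cross-validated performance estimate before fitting on all data.
        model = GradientBoostingClassifier(random_state=0)
        print("CV AUC:", cross_val_score(model, features, labels,
                                         scoring="roc_auc", cv=5).mean())

        # Fit and rank genes by predicted causal probability.
        model.fit(features, labels)
        genes["priority"] = model.predict_proba(features)[:, 1]
        print(genes.sort_values("priority", ascending=False).head(10))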

    Using machine learning to predict pathogenicity of genomic variants throughout the human genome

    More than 6,000 diseases are estimated to be caused by genomic variants. This can happen in many ways: a variant may stop the translation of a protein, interfere with gene regulation, or alter splicing of the transcribed mRNA into an unwanted isoform. All of these processes must be investigated in order to evaluate which variant may be causal for the deleterious phenotype. Variant effect scores are a great help in this regard. Implemented as machine learning classifiers, they integrate annotations from different resources to rank genomic variants in terms of pathogenicity. Developing a variant effect score requires multiple steps: annotation of the training data, feature selection, model training, benchmarking, and finally deployment of the model. Here, I present a generalized workflow for this process. It makes it simple to configure how information is converted into model features, enabling the rapid exploration of different annotations. The workflow further implements hyperparameter optimization, model validation, and ultimately deployment of a selected model via genome-wide scoring of genomic variants. The workflow is applied to train Combined Annotation Dependent Depletion (CADD), a variant effect model that scores SNVs and InDels genome-wide. I show that the workflow can be quickly adapted to novel annotations by porting CADD to the genome reference GRCh38. Further, I demonstrate the integration of deep neural network scores as features into a new CADD model, improving the annotation of RNA splicing events. Finally, I apply the workflow to train multiple variant effect models from training data based on variants selected by allele frequency. In conclusion, the developed workflow presents a flexible and scalable method to train variant effect scores. All software and developed scores are freely available from cadd.gs.washington.edu and cadd.bihealth.org.
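
    To make the workflow stages concrete, here is a minimal sketch of an annotate-train-validate-score pipeline; the input files, feature columns, and proxy labels are assumptions made for illustration, and this is not the actual CADD implementation.

        # Sketch of a variant-effect-score workflow: train, tune, validate, then
        # score variants genome-wide. File and column names are assumptions.
        import pandas as pd
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import GridSearchCV

        train = pd.read_csv("annotated_training_variants.tsv", sep="\t")
        feature_cols = ["conservation", "regulatory_overlap", "protein_impact"]
        X, y = train[feature_cols], train["proxy_deleterious"]  # 1 = proxy-deleterious

        # Hyperparameter optimisation with a cross-validated grid search.
        search = GridSearchCV(LogisticRegression(max_iter=1000),
                              {"C": [0.01, 0.1, 1.0, 10.0]},
                              scoring="roc_auc", cv=5)
        search.fit(X, y)
        print("best params:", search.best_params_, "validation AUC:", search.best_score_)

        # Deployment: score all pre-annotated variants with the selected model.
        variants = pd.read_csv("annotated_genome_variants.tsv", sep="\t")
        variants["score"] = search.best_estimator_.predict_proba(variants[feature_cols])[:, 1]
        variants.to_csv("variant_scores.tsv", sep="\t", index=False)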

    Intelligent architecture to support second generation general accounting

    Dissertation presented as a partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Information Analysis and Management. This study aimed to innovate the world of accounting software. After so many years, accountants are faced with an enormous amount of work, which is not always productive, effective, and efficient for either the accountant or the company that provided the data required to carry out the accounting. Accounting software already exists with various automation processes, from ornamentation to profitability analysis and management reporting. There is also software that is updated in accordance with accounting laws, i.e., the platform changes its mechanisms according to changes in the law. Despite the existence of this software, manual work remains, and the amount of information accountants face is still very large. It is difficult for accountants to do a 100% reliable job with so much information and data in hand. One of the most common situations in the accounting world is undoubtedly the miscalculation or omission of some financial or non-financial data in accounting operations (income statements, balance sheets, etc.). To render accounting operations efficient, effective, productive, error-free, and 100% reliable, an intelligent architecture has been developed to support second generation general accounting. This architectural design was developed with a view to making existing software smarter with the help of artificial intelligence. A study was carried out on key accounting concepts, AI, and the main process automation techniques needed to build the model. These studies were intended to capture all possible requirements for the creation of the architecture. Towards the end of the thesis, the model was validated.

    Unsupervised Learning of Distributional Properties can Supplement Human Labeling and Increase Active Learning Efficiency in Anomaly Detection

    Exfiltration of data via email is a serious cybersecurity threat for many organizations. Detecting data exfiltration (anomaly) patterns typically requires labeling, most often done by a human annotator, to reduce the high number of false alarms. Active Learning (AL) is a promising approach for labeling data efficiently, but it needs to choose an efficient order in which cases are labeled, and it is unclear which scoring procedure should be used to prioritize cases for labeling, especially when detecting rare cases of interest is crucial. We propose an adaptive AL sampling strategy that leverages the underlying prior data distribution, as well as model uncertainty, to produce batches of cases to be labeled that contain instances of rare anomalies. We show that (1) the classifier benefits from a batch of representative and informative instances of both normal and anomalous examples, and (2) unsupervised anomaly detection plays a useful role in building the classifier in the early stages of training, when relatively little labeling has been done. Our approach to AL for anomaly detection outperformed existing AL approaches on three highly unbalanced UCI benchmarks and on one real-world redacted email data set.
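
    A minimal sketch of this general idea is given below: an unsupervised anomaly score over the unlabeled pool is blended with the current classifier's uncertainty to select the next batch for labeling. The specific models and the blending weight are illustrative assumptions, not the exact strategy evaluated in the paper.

        # Sketch: choose an active-learning batch by blending unsupervised anomaly
        # scores with classifier uncertainty. Models and weights are illustrative.
        import numpy as np
        from sklearn.ensemble import IsolationForest, RandomForestClassifier

        def select_batch(X_pool, X_labeled, y_labeled, batch_size=20, alpha=0.5):
            # Unsupervised prior over the pool: higher value = more anomalous.
            iso = IsolationForest(random_state=0).fit(X_pool)
            anomaly = -iso.score_samples(X_pool)

            # Model uncertainty: entropy of the current classifier's predictions.
            clf = RandomForestClassifier(random_state=0).fit(X_labeled, y_labeled)
            proba = clf.predict_proba(X_pool)
            entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)

            # Normalise both signals, blend them, and pick the top-priority cases.
            norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-12)
            priority = alpha * norm(anomaly) + (1 - alpha) * norm(entropy)
            return np.argsort(priority)[::-1][:batch_size]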

    k-Means


    Improved Deep Neural Networks for Generative Robotic Grasping

    This thesis provides a thorough evaluation of current state-of-the-art robotic grasping methods and contributes to a subset of data-driven grasp estimation approaches, termed generative models. These models aim to generate grasp region proposals directly from a given image, without the need for a separate analysis and ranking step, which can be computationally expensive. This approach allows for fully end-to-end training of a model and quick closed-loop operation of a robot arm. A number of limitations within these generative models are identified and addressed. Contributions are proposed that directly target each stage of the training pipeline, helping to form accurate grasp proposals and to generalise better to unseen objects. Firstly, inspired by theories of object manipulation within the mammalian visual system, the use of multi-task learning in existing generative architectures is evaluated. This aims to improve the performance of grasping algorithms when presented with impoverished colour (RGB) data, by training models to perform simultaneous tasks such as object categorisation, saliency detection, and depth reconstruction. Secondly, a novel loss function is introduced which improves overall performance by rewarding the network for focusing only on learning grasps at suitable positions. This reduces overall training times and results in better performance on fewer training examples. The last contribution analyses the problems with the most common metric used for evaluating and comparing offline performance between different grasping models and algorithms. To this end, a Gaussian method of representing ground-truth labelled grasps is put forward, with optimal grasp locations tested in a simulated grasping environment. The combination of these novel additions to generative models results in improved grasp success, accuracy, and performance on common benchmark datasets compared to previous approaches. Furthermore, the efficacy of these contributions is tested when transferred to a physical robotic arm, demonstrating the ability to effectively grasp previously unseen 3D-printed objects of varying complexity and difficulty without the need for domain adaptation. Finally, future directions for generative convolutional models within the overall field of robotic grasping are discussed.
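
    To illustrate the idea of rewarding the network only at suitable grasp positions, the sketch below restricts a pixel-wise grasp-quality loss to annotated grasp regions; the tensor names and the exact formulation are assumptions for illustration rather than the loss proposed in the thesis.

        # Sketch: pixel-wise grasp-quality loss computed only inside labelled
        # grasp regions. Tensor shapes and names are illustrative assumptions.
        import torch
        import torch.nn.functional as F

        def masked_grasp_loss(pred_logits, target_quality, grasp_mask):
            # Per-pixel binary cross-entropy over the whole quality map...
            per_pixel = F.binary_cross_entropy_with_logits(
                pred_logits, target_quality, reduction="none")
            # ...kept only where the ground-truth mask marks a suitable grasp region.
            masked = per_pixel * grasp_mask
            return masked.sum() / grasp_mask.sum().clamp(min=1.0)

        # Toy example with random tensors standing in for a batch of quality maps.
        pred = torch.randn(4, 1, 224, 224)
        target = torch.rand(4, 1, 224, 224)
        mask = (torch.rand(4, 1, 224, 224) > 0.9).float()
        print(masked_grasp_loss(pred, target, mask))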

    Bayesian inference and learning in switching biological systems

    This thesis is concerned with the stochastic modeling of and inference for switching biological systems. Motivated by the great variety of data obtainable from such systems by wet-lab experiments or computer simulations, continuous-time as well as discrete-time frameworks are devised. Similarly, different latent state-space configurations - both hybrid continuous-discrete and purely discrete state spaces - are considered. These models enable Bayesian inferences about the temporal system dynamics as well as the respective parameters. Starting with the exact model formulations, principled approximations are derived using sampling and variational techniques, enabling computationally tractable algorithms. The resulting frameworks are evaluated under the modeling assumption and subsequently applied to common benchmark problems and real-world biological data. These developments are divided into three scientific contributions: First, a Markov chain Monte Carlo method for continuous-time and continuous-discrete state-space hybrid processes is derived. These hybrid processes are formulated as Markov-switching stochastic differential equations, for which the exact evolution equation is also presented. A Gibbs sampling scheme is then derived which enables tractable inference both for the system dynamics and the system parameters. This approach is validated under the modeling assumption and applied to data from a wet-lab gene-switching experiment. Second, a variational approach to the same problem is taken to speed up the inference procedure. To this end, a mixture of Gaussian processes serves as the variational measure. The method is derived starting from the Kullback-Leibler divergence between two true switching stochastic differential equations, and it is shown in which regime the Gaussian mixture approximation is valid. It is then benchmarked on the same ground-truth data as the Gibbs sampler and applied to model systems from computational structural biology. Third, a nonparametric inference framework is laid out for conformational molecule switching. Here, a purely discrete latent state space is assumed, where each latent state corresponds to one molecular structure. Utilizing variational techniques again, a method is presented to identify the number of conformations present in the data. This method generalizes the framework of Markov state models, which is well-established in the field of computational structural biology. An observation likelihood model tailored to structural molecule data is introduced, along with a suitable approximation enabling tractable inference. This framework, too, is first evaluated on data generated under the model assumption and then applied to common problems in the field.
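
    As a concrete picture of the hybrid continuous-discrete models discussed above, the sketch below simulates a Markov-switching stochastic differential equation with an Euler-Maruyama scheme; the two-state switching rates and the drift and diffusion parameters are chosen purely for illustration.

        # Sketch: Euler-Maruyama simulation of a Markov-switching SDE
        #   dx = -theta[z] * (x - mu[z]) dt + sigma[z] dW,
        # where z is a two-state continuous-time Markov chain. Parameters are illustrative.
        import numpy as np

        rng = np.random.default_rng(0)
        dt, n_steps = 0.01, 5000
        mu = np.array([0.0, 2.0])           # state-dependent means
        theta = np.array([1.0, 1.5])        # state-dependent reversion rates
        sigma = np.array([0.3, 0.3])        # state-dependent noise levels
        switch_rate = np.array([0.5, 0.2])  # rate of leaving state 0 and state 1

        x, z = 0.0, 0
        xs, zs = [], []
        for _ in range(n_steps):
            # Latent discrete state switches with probability ~ rate * dt.
            if rng.random() < switch_rate[z] * dt:
                z = 1 - z
            # Continuous state advances by one Euler-Maruyama step.
            x += -theta[z] * (x - mu[z]) * dt + sigma[z] * np.sqrt(dt) * rng.normal()
            xs.append(x)
            zs.append(z)

        print("fraction of time in state 1:", np.mean(zs), "mean of x:", np.mean(xs))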