2,489 research outputs found
Using machine learning to predict pathogenicity of genomic variants throughout the human genome
Geschätzt mehr als 6.000 Erkrankungen werden durch Veränderungen im Genom verursacht. Ursachen gibt es viele: Eine genomische Variante kann die Translation eines Proteins stoppen, die Genregulation stören oder das Spleißen der mRNA in eine andere Isoform begünstigen. All diese Prozesse müssen überprüft werden, um die zum beschriebenen Phänotyp passende Variante zu ermitteln. Eine Automatisierung dieses Prozesses sind Varianteneffektmodelle. Mittels maschinellem Lernen und Annotationen aus verschiedenen Quellen bewerten diese Modelle genomische Varianten hinsichtlich ihrer Pathogenität.
Die Entwicklung eines Varianteneffektmodells erfordert eine Reihe von Schritten: Annotation der Trainingsdaten, Auswahl von Features, Training verschiedener Modelle und Selektion eines Modells. Hier präsentiere ich ein allgemeines Workflow dieses Prozesses. Dieses ermöglicht es den Prozess zu konfigurieren, Modellmerkmale zu bearbeiten, und verschiedene Annotationen zu testen. Der Workflow umfasst außerdem die Optimierung von Hyperparametern, Validierung und letztlich die Anwendung des Modells durch genomweites Berechnen von Varianten-Scores.
Der Workflow wird in der Entwicklung von Combined Annotation Dependent Depletion (CADD), einem Varianteneffektmodell zur genomweiten Bewertung von SNVs und InDels, verwendet. Durch Etablierung des ersten Varianteneffektmodells für das humane Referenzgenome GRCh38 demonstriere ich die gewonnenen Möglichkeiten Annotationen aufzugreifen und neue Modelle zu trainieren. Außerdem zeige ich, wie Deep-Learning-Scores als Feature in einem CADD-Modell die Vorhersage von RNA-Spleißing verbessern. Außerdem werden Varianteneffektmodelle aufgrund eines neuen, auf Allelhäufigkeit basierten, Trainingsdatensatz entwickelt.
Diese Ergebnisse zeigen, dass der entwickelte Workflow eine skalierbare und flexible Möglichkeit ist, um Varianteneffektmodelle zu entwickeln. Alle entstandenen Scores sind unter cadd.gs.washington.edu und cadd.bihealth.org frei verfügbar.More than 6,000 diseases are estimated to be caused by genomic variants. This can happen in many possible ways: a variant may stop the translation of a protein, interfere with gene regulation, or alter splicing of the transcribed mRNA into an unwanted isoform. It is necessary to investigate all of these processes in order to evaluate which variant may be causal for the deleterious phenotype. A great help in this regard are variant effect scores. Implemented as machine learning classifiers, they integrate annotations from different resources to rank genomic variants in terms of pathogenicity.
Developing a variant effect score requires multiple steps: annotation of the training data, feature selection, model training, benchmarking, and finally deployment for the model's application. Here, I present a generalized workflow of this process. It makes it simple to configure how information is converted into model features, enabling the rapid exploration of different annotations. The workflow further implements hyperparameter optimization, model validation and ultimately deployment of a selected model via genome-wide scoring of genomic variants.
The workflow is applied to train Combined Annotation Dependent Depletion (CADD), a variant effect model that is scoring SNVs and InDels genome-wide. I show that the workflow can be quickly adapted to novel annotations by porting CADD to the genome reference GRCh38. Further, I demonstrate the integration of deep-neural network scores as features into a new CADD model, improving the annotation of RNA splicing events. Finally, I apply the workflow to train multiple variant effect models from training data that is based on variants selected by allele frequency.
In conclusion, the developed workflow presents a flexible and scalable method to train variant effect scores. All software and developed scores are freely available from cadd.gs.washington.edu and cadd.bihealth.org
The Omics basis of human health: investigating plasma proteins and their genetic effects on complex traits
Over the past decade, the advancements in technology and the growing amount of identified genetic variants have led to a high number of important discoveries in the field of precision medicine concerning human biology and pathophysiology. However, it became evident that genomics alone could not properly explain the onset and regulation of the specific molecular mechanisms of certain phenotypes. Studying omics helped complement this gap in genetic research, providing detailed information on the quantification of molecules that are involved in structural and functional processes in the organism. Specifically, protein production, levels, and regulation are dynamic and change during the course of one’s lifetime. This information has proven fundamental to understanding how certain proteins affect complex phenotypes such as neurological and psychiatric disorders.
In this thesis, I describe the three groups of analyses I conducted over the course of my doctoral programme on different sets of blood plasma proteins and over a broad range of neurological, psychiatric, cardiovascular, and electrophysiology phenotypes. The underlying mechanisms that trigger the onset of psychiatric and neurological conditions are often not limited to the nervous system, but rather stem from multi-system molecular triggers. The first part of the work I carried out aims at investigating the frequent co-occurrence and comorbidity of neurological and cardiovascular phenotypes by conducting a genome-wide association (GWA) meta-analysis of 183 neurology-related blood proteins on data from over 12000 individuals. The second part concerns the bivariate and multivariate analyses conducted on 276 cardiology and inflammatory proteins, while the third illustrates the contribution to consortia focussed on heart rate and electrophysiology. Results from the second and third parts of the work provided information that played an important role in understanding a part of the genetic mechanisms of the complex traits of interest.
Overall, the results presented in this thesis strongly support the notion
that proteomics is an important tool to be used to study complex traits and drug discovery and development should focus on targeting protein synthesis and regulation. Furthermore, the results also support the notion that complex diseases involve more than one biological system, and in order to gain a better understanding of human pathology, it is fundamental to study the causes and effects across the entire organism
Modeling and Simulation in Engineering
The Special Issue Modeling and Simulation in Engineering, belonging to the section Engineering Mathematics of the Journal Mathematics, publishes original research papers dealing with advanced simulation and modeling techniques. The present book, “Modeling and Simulation in Engineering I, 2022”, contains 14 papers accepted after peer review by recognized specialists in the field. The papers address different topics occurring in engineering, such as ferrofluid transport in magnetic fields, non-fractal signal analysis, fractional derivatives, applications of swarm algorithms and evolutionary algorithms (genetic algorithms), inverse methods for inverse problems, numerical analysis of heat and mass transfer, numerical solutions for fractional differential equations, Kriging modelling, theory of the modelling methodology, and artificial neural networks for fault diagnosis in electric circuits. It is hoped that the papers selected for this issue will attract a significant audience in the scientific community and will further stimulate research involving modelling and simulation in mathematical physics and in engineering
Applications of Molecular Dynamics simulations for biomolecular systems and improvements to density-based clustering in the analysis
Molecular Dynamics simulations provide a powerful tool to study biomolecular systems with atomistic detail. The key to better understand the function and behaviour of these molecules can often be found in their structural variability. Simulations can help to expose this information that is otherwise experimentally hard or impossible to attain. This work covers two application examples for which a sampling and a characterisation of the conformational ensemble could reveal the structural basis to answer a topical research question. For the fungal toxin phalloidin—a small bicyclic peptide—observed product ratios in different cyclisation reactions could be rationalised by assessing the conformational pre-organisation of precursor fragments. For the C-type lectin receptor langerin, conformational changes induced by different side-chain protonations could deliver an explanation
of the pH-dependency in the protein’s calcium-binding. The investigations were accompanied by the continued development of a density-based clustering protocol into a respective software package, which is generally well applicable for the use case of extracting conformational states from Molecular Dynamics data
Recommended from our members
Proceedings of the 33rd Annual Workshop of the Psychology of Programming Interest Group
This is the Proceedings of the 33rd Annual Workshop of the Psychology of Programming Interest Group (PPIG). This was the first PPIG to be held physically since 2019, following the two online-only PPIGs in 2020 and 2021, both during the Covid pandemic. It was also the first PPIG conference to be designed specifically for hybrid attendance. Reflecting the theme, it was hosted by Music Computing Lab at the Open University in Milton Keynes
Brain Computations and Connectivity [2nd edition]
This is an open access title available under the terms of a CC BY-NC-ND 4.0 International licence. It is free to read on the Oxford Academic platform and offered as a free PDF download from OUP and selected open access locations.
Brain Computations and Connectivity is about how the brain works. In order to understand this, it is essential to know what is computed by different brain systems; and how the computations are performed.
The aim of this book is to elucidate what is computed in different brain systems; and to describe current biologically plausible computational approaches and models of how each of these brain systems computes.
Understanding the brain in this way has enormous potential for understanding ourselves better in health and in disease. Potential applications of this understanding are to the treatment of the brain in disease; and to artificial intelligence which will benefit from knowledge of how the brain performs many of its extraordinarily impressive functions.
This book is pioneering in taking this approach to brain function: to consider what is computed by many of our brain systems; and how it is computed, and updates by much new evidence including the connectivity of the human brain the earlier book: Rolls (2021) Brain Computations: What and How, Oxford University Press.
Brain Computations and Connectivity will be of interest to all scientists interested in brain function and how the brain works, whether they are from neuroscience, or from medical sciences including neurology and psychiatry, or from the area of computational science including machine learning and artificial intelligence, or from areas such as theoretical physics
A Robust Unified Graph Model Based on Molecular Data Binning for Subtype Discovery in High-dimensional Spaces
Machine learning (ML) is a subfield of artificial intelligence (AI) that has already revolutionised the world around us. It is a widely employed process for discovering patterns and groups within datasets. It has a wide range of applications including disease subtyping, which aims to discover intrinsic subtypes of disease in large-scale unlabelled data. Whilst the groups discovered in multi-view high-dimensional data by ML algorithms are promising, their capacity to identify pertinent and meaningful groups is limited by the presence of data variability and outliers. Since outlier values represent potential but unlikely outcomes, they are statistically and philosophically fascinating.
Therefore, the primary aim of this thesis was to propose a robust approach that discovers meaningful groups while considering the presence of data variability and outliers in the data. To achieve this aim, a novel robust approach (ROMDEX) was developed that utilised the proposed intermediate graph models (IMGs) for robust computation of proximity between observations in the data. Finally, a robust multi-view graph-based clustering approach was developed based on ROMDEX that improved the discovery of meaningful groups that were hidden behind the noise in the data.
The proposed approach was validated on real-world, and synthetic data for disease subtyping. Additionally, the stability of the approach was assessed by evaluating its performance across different levels of noise in clustering data. The results were evaluated through Kaplan-Meier survival time analysis for disease subtyping. Also, the concordance index (CI) and normalised mutual information (NMI) are used to evaluate the predictive ability of the proposed clustering model. Additionally, the accuracy, Kappa statistic and rand index are computed to evaluate the clustering stability against various levels of Gaussian noise. The proposed approach outperformed the existing state-of-the-art approaches MRGC, PINS, SNF, Consensus Clustering, and Icluster+ on these datasets. The findings for all datasets were outstanding, demonstrating the predictive ability of the proposed unsupervised graph-based clustering approach
Comparing sentiment analysis tools on gitHub project discussions
Mestrado de dupla diplomação com a UTFPR - Universidade Tecnológica Federal do ParanáThe context of this work is situated in the rapidly evolving sphere of Natural Language Processing (NLP) within the scope of software engineering, focusing on sentiment analysis in software repositories. Sentiment analysis, a subfield of NLP, provides a potent method to parse, understand, and categorize these sentiments expressed in text. By applying sentiment analysis to software repositories, we can decode developers’ opinions and sentiments, providing key insights into team dynamics, project health, and potential areas of conflict or collaboration. However, the application of sentiment analysis in software engineering comes with its unique set of challenges. Technical jargon, code-specific ambiguities, and the brevity of software-related communications demand tailored NLP tools for effective analysis.
The study unfolds in two primary phases. In the initial phase, we embarked on a meticulous investigation into the impacts of expanding the training sets of two prominent sentiment analysis tools, namely, SentiCR and SentiSW. The objective was to delineate the correlation between the size of the training set and the resulting tool performance, thereby revealing any potential enhancements in performance. The subsequent phase of the research encapsulates a practical application of the enhanced tools. We employed these tools to categorize discussions drawn from issue tickets within a varied array of Open-Source projects. These projects span an extensive range, from relatively small repositories to large, well-established repositories, thus providing a rich and diverse sampling ground.O contexto deste trabalho situa-se na esfera em rápida evolução do Processamento de Linguagem Natural (PLN) no âmbito da engenharia de software, com foco na análise de sentimentos em repositórios de software. A análise de sentimentos, um subcampo do PLN, fornece um método poderoso para analisar, compreender e categorizar os sentimentos expressos em texto. Ao aplicar a análise de sentimentos aos repositórios de software, podemos decifrar as opiniões e sentimentos dos desenvolvedores, fornecendo informações importantes sobre a dinâmica da equipe, a saúde do projeto e áreas potenciais de conflito
ou colaboração. No entanto, a aplicação da análise de sentimentos na engenharia de software apresenta desafios únicos. Jargão técnico, ambiguidades específicas do código e a breviedade das comunicações relacionadas ao software exigem ferramentas de PLN personalizadas para uma análise eficaz. O estudo se desenvolve em duas fases principais. Na fase inicial, embarcamos em uma investigação meticulosa sobre os impactos da expansão dos conjuntos de treinamento de duas ferramentas proeminentes de análise de sentimentos, nomeadamente, SentiCR e SentiSW. O objetivo foi delinear a correlação entre o tamanho do conjunto de treinamento e o desempenho da ferramenta resultante, revelando assim possíveis aprimoramentos no desempenho.
A fase subsequente da pesquisa engloba uma aplicação prática das ferramentas aprimoradas. Utilizamos essas ferramentas para categorizar discussões retiradas de bilhetes de problemas em uma variedade diversificada de projetos de código aberto. Esses projetos abrangem uma ampla gama, desde repositórios relativamente pequenos até repositórios grandes e bem estabelecidos, fornecendo assim um campo de amostragem rico e diversificado
- …