4,339 research outputs found
Using machine learning to predict pathogenicity of genomic variants throughout the human genome
Geschätzt mehr als 6.000 Erkrankungen werden durch Veränderungen im Genom verursacht. Ursachen gibt es viele: Eine genomische Variante kann die Translation eines Proteins stoppen, die Genregulation stören oder das Spleißen der mRNA in eine andere Isoform begünstigen. All diese Prozesse müssen überprüft werden, um die zum beschriebenen Phänotyp passende Variante zu ermitteln. Eine Automatisierung dieses Prozesses sind Varianteneffektmodelle. Mittels maschinellem Lernen und Annotationen aus verschiedenen Quellen bewerten diese Modelle genomische Varianten hinsichtlich ihrer Pathogenität.
Die Entwicklung eines Varianteneffektmodells erfordert eine Reihe von Schritten: Annotation der Trainingsdaten, Auswahl von Features, Training verschiedener Modelle und Selektion eines Modells. Hier präsentiere ich ein allgemeines Workflow dieses Prozesses. Dieses ermöglicht es den Prozess zu konfigurieren, Modellmerkmale zu bearbeiten, und verschiedene Annotationen zu testen. Der Workflow umfasst außerdem die Optimierung von Hyperparametern, Validierung und letztlich die Anwendung des Modells durch genomweites Berechnen von Varianten-Scores.
Der Workflow wird in der Entwicklung von Combined Annotation Dependent Depletion (CADD), einem Varianteneffektmodell zur genomweiten Bewertung von SNVs und InDels, verwendet. Durch Etablierung des ersten Varianteneffektmodells für das humane Referenzgenome GRCh38 demonstriere ich die gewonnenen Möglichkeiten Annotationen aufzugreifen und neue Modelle zu trainieren. Außerdem zeige ich, wie Deep-Learning-Scores als Feature in einem CADD-Modell die Vorhersage von RNA-Spleißing verbessern. Außerdem werden Varianteneffektmodelle aufgrund eines neuen, auf Allelhäufigkeit basierten, Trainingsdatensatz entwickelt.
Diese Ergebnisse zeigen, dass der entwickelte Workflow eine skalierbare und flexible Möglichkeit ist, um Varianteneffektmodelle zu entwickeln. Alle entstandenen Scores sind unter cadd.gs.washington.edu und cadd.bihealth.org frei verfügbar.More than 6,000 diseases are estimated to be caused by genomic variants. This can happen in many possible ways: a variant may stop the translation of a protein, interfere with gene regulation, or alter splicing of the transcribed mRNA into an unwanted isoform. It is necessary to investigate all of these processes in order to evaluate which variant may be causal for the deleterious phenotype. A great help in this regard are variant effect scores. Implemented as machine learning classifiers, they integrate annotations from different resources to rank genomic variants in terms of pathogenicity.
Developing a variant effect score requires multiple steps: annotation of the training data, feature selection, model training, benchmarking, and finally deployment for the model's application. Here, I present a generalized workflow of this process. It makes it simple to configure how information is converted into model features, enabling the rapid exploration of different annotations. The workflow further implements hyperparameter optimization, model validation and ultimately deployment of a selected model via genome-wide scoring of genomic variants.
The workflow is applied to train Combined Annotation Dependent Depletion (CADD), a variant effect model that is scoring SNVs and InDels genome-wide. I show that the workflow can be quickly adapted to novel annotations by porting CADD to the genome reference GRCh38. Further, I demonstrate the integration of deep-neural network scores as features into a new CADD model, improving the annotation of RNA splicing events. Finally, I apply the workflow to train multiple variant effect models from training data that is based on variants selected by allele frequency.
In conclusion, the developed workflow presents a flexible and scalable method to train variant effect scores. All software and developed scores are freely available from cadd.gs.washington.edu and cadd.bihealth.org
Recommended from our members
Rigorous Experimentation For Reinforcement Learning
Scientific fields make advancements by leveraging the knowledge created by others to push the boundary of understanding. The primary tool in many fields for generating knowledge is empirical experimentation. Although common, generating accurate knowledge from empirical experiments is often challenging due to inherent randomness in execution and confounding variables that can obscure the correct interpretation of the results. As such, researchers must hold themselves and others to a high degree of rigor when designing experiments. Unfortunately, most reinforcement learning (RL) experiments lack this rigor, making the knowledge generated from experiments dubious. This dissertation proposes methods to address central issues in RL experimentation.
Evaluating the performance of an RL algorithm is the most common type of experiment in RL literature. Most performance evaluations are often incapable of answering a specific research question and produce misleading results. Thus, the first issue we address is how to create a performance evaluation procedure that holds up to scientific standards.
Despite the prevalence of performance evaluation, these types of experiments produce limited knowledge, e.g., they can only show how well an algorithm worked and not why, and they require significant amounts of time and computational resources. As an alternative, this dissertation proposes that scientific testing, the process of conducting carefully controlled experiments designed to further the knowledge and understanding of how an algorithm works, should be the primary form of experimentation.
Lastly, this dissertation provides a case study using policy gradient methods, showing how scientific testing can replace performance evaluation as the primary form of experimentation. As a result, this dissertation can motivate others in the field to adopt more rigorous experimental practices
Modelling, Monitoring, Control and Optimization for Complex Industrial Processes
This reprint includes 22 research papers and an editorial, collected from the Special Issue "Modelling, Monitoring, Control and Optimization for Complex Industrial Processes", highlighting recent research advances and emerging research directions in complex industrial processes. This reprint aims to promote the research field and benefit the readers from both academic communities and industrial sectors
Exploring QCD matter in extreme conditions with Machine Learning
In recent years, machine learning has emerged as a powerful computational
tool and novel problem-solving perspective for physics, offering new avenues
for studying strongly interacting QCD matter properties under extreme
conditions. This review article aims to provide an overview of the current
state of this intersection of fields, focusing on the application of machine
learning to theoretical studies in high energy nuclear physics. It covers
diverse aspects, including heavy ion collisions, lattice field theory, and
neutron stars, and discuss how machine learning can be used to explore and
facilitate the physics goals of understanding QCD matter. The review also
provides a commonality overview from a methodology perspective, from
data-driven perspective to physics-driven perspective. We conclude by
discussing the challenges and future prospects of machine learning applications
in high energy nuclear physics, also underscoring the importance of
incorporating physics priors into the purely data-driven learning toolbox. This
review highlights the critical role of machine learning as a valuable
computational paradigm for advancing physics exploration in high energy nuclear
physics.Comment: 146 pages,53 figure
Novel approaches for hierarchical classification with case studies in protein function prediction
A very large amount of research in the data mining, machine learning, statistical pattern recognition and related research communities has focused on flat classification problems. However, many problems in the real world such as hierarchical protein function prediction have their classes naturally organised into hierarchies. The task of hierarchical classification, however, needs to be better defined as researchers into one application domain are often unaware of similar efforts developed in other research areas.
The first contribution of this thesis is to survey the task of hierarchical classification across different application domains and present an unifying framework for the task. After clearly defining the problem, we explore novel approaches to the task.
Based on the understanding gained by surveying the task of hierarchical classification, there are three major approaches to deal with hierarchical classification problems. The first approach is to use one of the many existing flat classification algorithms to predict only the leaf classes in the hierarchy. Note that, in the training phase, this approach completely ignores the hierarchical class relationships, i.e. the parent-child and sibling class relationships, but in the testing phase the ancestral classes of an instance can be inferred from its predicted leaf classes. The second approach is to build a set of local models, by training one flat classification algorithm for each local view of the hierarchy. The two main variations of this approach are: (a) training a local flat multi-class classifier at each non-leaf class node, where each classifier discriminates among the child classes of its associated class; or (b) training a local fiat binary classifier at each node of the class hierarchy, where each classifier predicts whether or not a new instance has the classifier’s associated class. In both these variations, in the testing phase a procedure is used to combine the predictions of the set of local classifiers in a coherent way, avoiding inconsistent predictions. The third approach is to use a global-model hierarchical classification algorithm, which builds one single classification model by taking into account all the hierarchical class relationships in the training phase. In the context of this categorization of hierarchical classification approaches, the other contributions of this thesis are as follows.
The second contribution of this thesis is a novel algorithm which is based on the local classifier per parent node approach. The novel algorithm is the selective representation approach that automatically selects the best protein representation to use at each non-leaf class node.
The third contribution is a global-model hierarchical classification extension of the well known naive Bayes algorithm. Given the good predictive performance of the global-model hierarchical-classification naive Bayes algorithm, we relax the Naive Bayes’ assumption that attributes are independent from each other given the class by using the concept of k dependencies. Hence, we extend the flat classification /¿-Dependence Bayesian network classifier to the task of hierarchical classification, which is the fourth contribution of this thesis.
Both the proposed global-model hierarchical classification Naive Bayes and the proposed global-model hierarchical /Âż-Dependence Bayesian network classifier have achieved predictive accuracies that were, overall, significantly higher than the predictive accuracies obtained by their corresponding local hierarchical classification versions, across a number of datasets for the task of hierarchical protein function prediction
Computational Intelligence for Cooperative Swarm Control
Over the last few decades, swarm intelligence (SI) has shown significant benefits in many practical applications. Real-world applications of swarm intelligence include disaster response and wildlife conservation. Swarm robots can collaborate to search for survivors, locate victims, and assess damage in hazardous environments during an earthquake or natural disaster. They can coordinate their movements and share data in real-time to increase their efficiency and effectiveness while guiding the survivors. In addition to tracking animal movements and behaviour, robots can guide animals to or away from specific areas. Sheep herding is a significant source of income in Australia that could be significantly enhanced if the human shepherd could be supported by single or multiple robots.
Although the shepherding framework has become a popular SI mechanism, where a leading agent (sheepdog) controls a swarm of agents (sheep) to complete a task, controlling a swarm of agents is still not a trivial task, especially in the presence of some practical constraints. For example, most of the existing shepherding literature assumes that each swarm member has an unlimited sensing range to recognise all other members’ locations. However, this is not practical for physical systems. In addition, current approaches do not consider shepherding as a distributed system where an agent, namely a central unit, may observe the environment and commu- nicate with the shepherd to guide the swarm. However, this brings another hurdle when noisy communication channels between the central unit and the shepherd af- fect the success of the mission. Also, the literature lacks shepherding models that can cope with dynamic communication systems. Therefore, this thesis aims to design a multi-agent learning system for effective shepherding control systems in a partially observable environment under communication constraints.
To achieve this goal, the thesis first introduces a new methodology to guide agents whose sensing range is limited. In this thesis, the sheep are modelled as an induced network to represent the sheep’s sensing range and propose a geometric method for finding a shepherd-impacted subset of sheep. The proposed swarm optimal herding point uses a particle swarm optimiser and a clustering mechanism to find the sheepdog’s near-optimal herding location while considering flock cohesion. Then, an improved version of the algorithm (named swarm optimal modified centroid push) is proposed to estimate the sheepdog’s intermediate waypoints to the herding point considering the sheep cohesion. The approaches outperform existing shepherding methods in reducing task time and increasing the success rate for herding.
Next, to improve shepherding in noisy communication channels, this thesis pro- poses a collaborative learning-based method to enhance communication between the central unit and the herding agent. The proposed independent pre-training collab- orative learning technique decreases the transmission mean square error by half in 10% of the training time compared to existing approaches. The algorithm is then ex- tended so that the sheepdog can read the modulated herding points from the central unit. The results demonstrate the efficiency of the new technique in time-varying noisy channels.
Finally, the central unit is modelled as a mobile agent to lower the time-varying noise caused by the sheepdog’s motion during the task. So, I propose a Q-learning- based incremental search to increase transmission success between the shepherd and the central unit. In addition, two unique reward functions are presented to ensure swarm guidance success with minimal energy consumption. The results demonstrate an increase in the success rate for shepherding
Acoustic modelling, data augmentation and feature extraction for in-pipe machine learning applications
Gathering measurements from infrastructure, private premises, and harsh environments can be difficult and expensive. From this perspective, the development of
new machine learning algorithms is strongly affected by the availability of training
and test data. We focus on audio archives for in-pipe events. Although several
examples of pipe-related applications can be found in the literature, datasets of
audio/vibration recordings are much scarcer, and the only references found relate
to leakage detection and characterisation. Therefore, this work proposes a methodology to relieve the burden of data collection for acoustic events in deployed pipes.
The aim is to maximise the yield of small sets of real recordings and demonstrate
how to extract effective features for machine learning. The methodology developed
requires the preliminary creation of a soundbank of audio samples gathered with
simple weak annotations. For practical reasons, the case study is given by a range
of appliances, fittings, and fixtures connected to pipes in domestic environments.
The source recordings are low-reverberated audio signals enhanced through a
bespoke spectral filter and containing the desired audio fingerprints. The soundbank is then processed to create an arbitrary number of synthetic augmented
observations. The data augmentation improves the quality and the quantity of
the metadata and automatically creates strong and accurate annotations that
are both machine and human-readable. Besides, the implemented processing
chain allows precise control of properties such as signal-to-noise ratio, duration
of the events, and the number of overlapping events. The inter-class variability
is expanded by recombining source audio blocks and adding simulated artificial
reverberation obtained through an acoustic model developed for the purpose.
Finally, the dataset is synthesised to guarantee separability and balance. A few
signal representations are optimised to maximise the classification performance,
and the results are reported as a benchmark for future developments. The contribution to the existing knowledge concerns several aspects of the processing chain
implemented. A novel quasi-analytic acoustic model is introduced to simulate
in-pipe reverberations, adopting a three-layer architecture particularly convenient
for batch processing. The first layer includes two algorithms: one for the numerical
calculation of the axial wavenumbers and one for the separation of the modes. The
latter, in particular, provides a workaround for a problem not explicitly treated in the
literature and related to the modal non-orthogonality given by the solid-liquid interface in the analysed domain. A set of results for different waveguides is reported
to compare the dispersive behaviour against different mechanical configurations.
Two more novel solutions are also included in the second layer of the model and
concern the integration of the acoustic sources. Specifically, the amplitudes of the
non-orthogonal modal potentials are obtained using either a distance minimisation
objective function or by solving an analytical decoupling problem. In both cases,
results show that sources sufficiently smooth can be approximated with a limited
number of modes keeping the error below 1%. The last layer proposes a bespoke
approach for the integration of the acoustic model into the synthesiser as a reverberation simulator. Additional elements of novelty relate to the other blocks of the
audio synthesiser. The statistical spectral filter, for instance, is a batch-processing
solution for the attenuation of the background noise of the source recordings. The
signal-to-noise ratio analysis for both moderate and high noise levels indicates
a clear improvement of several decibels against the closest filter example in the
literature. The recombination of the audio blocks and the system of fully tracked
annotations are also novel extensions of similar approaches recently adopted in
other contexts. Moreover, a bespoke synthesis strategy is proposed to guarantee
separable and balanced datasets. The last contribution concerns the extraction
of convenient sets of audio features. Elements of novelty are introduced for the
optimisation of the filter banks of the mel-frequency cepstral coefficients and the
scattering wavelet transform. In particular, compared to the respective standard
definitions, the average F-score performance of the optimised features is roughly
6% higher in the first case and 2.5% higher for the latter. Finally, the soundbank,
the synthetic dataset, and the fundamental blocks of the software library developed
are publicly available for further research
What Matters in Model Training to Transfer Adversarial Examples
Despite state-of-the-art performance on natural data, Deep Neural Networks (DNNs) are highly vulnerable to adversarial examples, i.e., imperceptible, carefully crafted perturbations of inputs applied at test time. Adversarial examples can transfer: an adversarial example against one model is likely to be adversarial against another independently trained model. This dissertation investigates the characteristics of the surrogate weight space that lead to the transferability of adversarial examples. Our research covers three complementary aspects of the weight space exploration: the multimodal exploration to obtain multiple models from different vicinities, the local exploration to obtain multiple models in the same vicinity, and the point selection to obtain a single transferable representation.
First, from a probabilistic perspective, we argue that transferability is fundamentally related to uncertainty. The unknown weights of the target DNN can be treated as random variables. Under a specified threat model, deep ensemble can produce a surrogate by sampling from the distribution of the target model. Unfortunately, deep ensembles are computationally expensive. We propose an efficient alternative by approximately sampling surrogate models from the posterior distribution using cSGLD, a state-of-the-art Bayesian deep learning technique. Our extensive experiments show that our approach improves and complements four attacks, three transferability techniques, and five more training methods significantly on ImageNet, CIFAR-10, and MNIST (up to 83.2 percentage points), while reducing training computations from 11.6 to 2.4 exaflops compared to deep ensemble on ImageNet.
Second, we propose transferability from Large Geometric Vicinity (LGV), a new technique based on the local exploration of the weight space. LGV starts from a pretrained model and collects multiple weights in a few additional training epochs with a constant and high learning rate. LGV exploits two geometric properties that we relate to transferability. First, we show that LGV explores a flatter region of the weight space and generates flatter adversarial examples in the input space. We present the surrogate-target misalignment hypothesis to explain why flatness could increase transferability. Second, we show that the LGV weights span a dense weight subspace whose geometry is intrinsically connected to transferability. Through extensive experiments, we show that LGV alone outperforms all (combinations of) four established transferability techniques by 1.8 to 59.9 percentage points.
Third, we investigate how to train a transferable representation, that is, a single model for transferability. First, we refute a common hypothesis from previous research to explain why early stopping improves transferability. We then establish links between transferability and the exploration dynamics of the weight space, in which early stopping has an inherent effect. More precisely, we observe that transferability peaks when the learning rate decays, which is also the time at which the sharpness of the loss significantly drops. This leads us to propose RFN, a new approach to transferability that minimises the sharpness of the loss during training. We show that by searching for large flat neighbourhoods, RFN always improves over early stopping (by up to 47 points of success rate) and is competitive to (if not better than) strong state-of-the-art baselines.
Overall, our three complementary techniques provide an extensive and practical method to obtain highly transferable adversarial examples from the multimodal and local exploration of flatter vicinities in the weight space. Our probabilistic and geometric approaches demonstrate that the way to train the surrogate model has been overlooked, although both the training noise and the flatness of the loss landscape are important elements of transfer-based attacks
- …