4,339 research outputs found

    Using machine learning to predict pathogenicity of genomic variants throughout the human genome

    More than 6,000 diseases are estimated to be caused by genomic variants. This can happen in many ways: a variant may stop the translation of a protein, interfere with gene regulation, or alter splicing of the transcribed mRNA into an unwanted isoform. All of these processes must be investigated in order to evaluate which variant may be causal for the deleterious phenotype. Variant effect scores are a great help in this regard. Implemented as machine learning classifiers, they integrate annotations from different resources to rank genomic variants in terms of pathogenicity. Developing a variant effect score requires multiple steps: annotation of the training data, feature selection, model training, benchmarking, and finally deployment of the model. Here, I present a generalized workflow of this process. It makes it simple to configure how information is converted into model features, enabling the rapid exploration of different annotations. The workflow further implements hyperparameter optimization, model validation and ultimately deployment of a selected model via genome-wide scoring of genomic variants. The workflow is applied to train Combined Annotation Dependent Depletion (CADD), a variant effect model that scores SNVs and InDels genome-wide. I show that the workflow can be quickly adapted to novel annotations by porting CADD to the genome reference GRCh38. Further, I demonstrate the integration of deep neural network scores as features into a new CADD model, improving the annotation of RNA splicing events. Finally, I apply the workflow to train multiple variant effect models from training data based on variants selected by allele frequency. In conclusion, the developed workflow presents a flexible and scalable method to train variant effect scores.
All software and developed scores are freely available from cadd.gs.washington.edu and cadd.bihealth.org.

    Modelling, Monitoring, Control and Optimization for Complex Industrial Processes

    This reprint includes 22 research papers and an editorial, collected from the Special Issue "Modelling, Monitoring, Control and Optimization for Complex Industrial Processes", highlighting recent research advances and emerging research directions in complex industrial processes. It aims to promote the research field and to benefit readers from both academic communities and industrial sectors.

    Exploring QCD matter in extreme conditions with Machine Learning

    In recent years, machine learning has emerged as a powerful computational tool and novel problem-solving perspective for physics, offering new avenues for studying the properties of strongly interacting QCD matter under extreme conditions. This review article aims to provide an overview of the current state of this intersection of fields, focusing on the application of machine learning to theoretical studies in high energy nuclear physics. It covers diverse aspects, including heavy ion collisions, lattice field theory, and neutron stars, and discusses how machine learning can be used to explore and facilitate the physics goals of understanding QCD matter. The review also gives a methodological overview of what these applications have in common, ranging from data-driven to physics-driven perspectives. We conclude by discussing the challenges and future prospects of machine learning applications in high energy nuclear physics, underscoring the importance of incorporating physics priors into the purely data-driven learning toolbox. This review highlights the critical role of machine learning as a valuable computational paradigm for advancing physics exploration in high energy nuclear physics. Comment: 146 pages, 53 figures

    Novel approaches for hierarchical classification with case studies in protein function prediction

    A very large amount of research in the data mining, machine learning, statistical pattern recognition and related research communities has focused on flat classification problems. However, many real-world problems, such as hierarchical protein function prediction, have their classes naturally organised into hierarchies. The task of hierarchical classification, however, needs to be better defined, as researchers in one application domain are often unaware of similar efforts developed in other research areas. The first contribution of this thesis is to survey the task of hierarchical classification across different application domains and present a unifying framework for the task. After clearly defining the problem, we explore novel approaches to the task. Based on the understanding gained by surveying the task of hierarchical classification, there are three major approaches to deal with hierarchical classification problems. The first approach is to use one of the many existing flat classification algorithms to predict only the leaf classes in the hierarchy. Note that, in the training phase, this approach completely ignores the hierarchical class relationships, i.e. the parent-child and sibling class relationships, but in the testing phase the ancestral classes of an instance can be inferred from its predicted leaf classes. The second approach is to build a set of local models, by training one flat classification algorithm for each local view of the hierarchy. The two main variations of this approach are: (a) training a local flat multi-class classifier at each non-leaf class node, where each classifier discriminates among the child classes of its associated class; or (b) training a local flat binary classifier at each node of the class hierarchy, where each classifier predicts whether or not a new instance has the classifier's associated class.
In both these variations, in the testing phase a procedure is used to combine the predictions of the set of local classifiers in a coherent way, avoiding inconsistent predictions. The third approach is to use a global-model hierarchical classification algorithm, which builds one single classification model by taking into account all the hierarchical class relationships in the training phase. In the context of this categorization of hierarchical classification approaches, the other contributions of this thesis are as follows. The second contribution of this thesis is a novel algorithm based on the local classifier per parent node approach. The novel algorithm is the selective representation approach, which automatically selects the best protein representation to use at each non-leaf class node. The third contribution is a global-model hierarchical classification extension of the well-known naive Bayes algorithm. Given the good predictive performance of the global-model hierarchical classification naive Bayes algorithm, we relax the naive Bayes assumption that attributes are independent of each other given the class by using the concept of k dependencies. Hence, we extend the flat classification k-Dependence Bayesian network classifier to the task of hierarchical classification, which is the fourth contribution of this thesis. Both the proposed global-model hierarchical classification naive Bayes and the proposed global-model hierarchical k-Dependence Bayesian network classifier have achieved predictive accuracies that were, overall, significantly higher than the predictive accuracies obtained by their corresponding local hierarchical classification versions, across a number of datasets for the task of hierarchical protein function prediction.
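The first two approaches described in this abstract can be illustrated with a small sketch. It is not the thesis's implementation: the toy hierarchy, the feature names, and the lambda "classifiers" are all hypothetical stand-ins chosen for illustration. It shows how ancestral classes can be inferred from a predicted leaf class, and how a local classifier per parent node might be applied top-down at test time.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical toy class hierarchy (child -> parent); "root" has no parent.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "enzyme": "root",
    "transporter": "root",
    "hydrolase": "enzyme",
    "transferase": "enzyme",
}


def ancestors(label: str) -> List[str]:
    """Return the ancestral classes of `label`, nearest first, excluding the root."""
    path: List[str] = []
    parent = PARENT[label]
    while parent is not None and parent != "root":
        path.append(parent)
        parent = PARENT[parent]
    return path


def predict_top_down(x: dict, local: Dict[str, Callable[[dict], str]]) -> List[str]:
    """Local classifier per parent node: start at the root and let the
    classifier stored at each non-leaf node choose one of its child classes."""
    node = "root"
    path: List[str] = []
    while node in local:  # a node without a local classifier is treated as a leaf
        node = local[node](x)
        path.append(node)
    return path


# Hypothetical stand-ins for trained local classifiers (one per non-leaf node).
LOCAL: Dict[str, Callable[[dict], str]] = {
    "root": lambda x: "enzyme" if x["catalytic"] else "transporter",
    "enzyme": lambda x: "hydrolase" if x["uses_water"] else "transferase",
}
```

Because each prediction only moves from a node to one of its children, the returned path is consistent with the hierarchy by construction, which is one simple way to avoid the inconsistent predictions mentioned in the abstract.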

    Computational Intelligence for Cooperative Swarm Control

    Over the last few decades, swarm intelligence (SI) has shown significant benefits in many practical applications. Real-world applications of swarm intelligence include disaster response and wildlife conservation. Swarm robots can collaborate to search for survivors, locate victims, and assess damage in hazardous environments during an earthquake or other natural disaster. They can coordinate their movements and share data in real time to increase their efficiency and effectiveness while guiding the survivors. In addition to tracking animal movements and behaviour, robots can guide animals to or away from specific areas. Sheep herding is a significant source of income in Australia that could be significantly enhanced if the human shepherd could be supported by single or multiple robots. Although the shepherding framework has become a popular SI mechanism, where a leading agent (sheepdog) controls a swarm of agents (sheep) to complete a task, controlling a swarm of agents is still not a trivial task, especially in the presence of some practical constraints. For example, most of the existing shepherding literature assumes that each swarm member has an unlimited sensing range to recognise all other members' locations. However, this is not practical for physical systems. In addition, current approaches do not consider shepherding as a distributed system where an agent, namely a central unit, may observe the environment and communicate with the shepherd to guide the swarm. However, this brings another hurdle when noisy communication channels between the central unit and the shepherd affect the success of the mission. Also, the literature lacks shepherding models that can cope with dynamic communication systems. Therefore, this thesis aims to design a multi-agent learning system for effective shepherding control systems in a partially observable environment under communication constraints.
To achieve this goal, the thesis first introduces a new methodology to guide agents whose sensing range is limited. The sheep are modelled as an induced network that represents their sensing range, and a geometric method is proposed for finding a shepherd-impacted subset of sheep. The proposed swarm optimal herding point uses a particle swarm optimiser and a clustering mechanism to find the sheepdog's near-optimal herding location while considering flock cohesion. Then, an improved version of the algorithm (named swarm optimal modified centroid push) is proposed to estimate the sheepdog's intermediate waypoints to the herding point considering the sheep cohesion. The approaches outperform existing shepherding methods in reducing task time and increasing the success rate for herding. Next, to improve shepherding in noisy communication channels, this thesis proposes a collaborative learning-based method to enhance communication between the central unit and the herding agent. The proposed independent pre-training collaborative learning technique decreases the transmission mean square error by half in 10% of the training time compared to existing approaches. The algorithm is then extended so that the sheepdog can read the modulated herding points from the central unit. The results demonstrate the efficiency of the new technique in time-varying noisy channels. Finally, the central unit is modelled as a mobile agent to lower the time-varying noise caused by the sheepdog's motion during the task. To this end, I propose a Q-learning-based incremental search to increase transmission success between the shepherd and the central unit. In addition, two unique reward functions are presented to ensure swarm guidance success with minimal energy consumption. The results demonstrate an increase in the success rate for shepherding.
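One plausible reading of modelling the sheep as an induced network under a limited sensing range is sketched below. It assumes 2-D positions and a Euclidean sensing radius; the function name `sensing_network` and the example coordinates are hypothetical, and the thesis may construct its network differently.

```python
import math
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]


def sensing_network(positions: List[Point], sensing_range: float) -> Dict[int, Set[int]]:
    """Build an undirected graph in which sheep i and j are linked when each
    lies within the other's sensing range (Euclidean distance, an assumption)."""
    graph: Dict[int, Set[int]] = {i: set() for i in range(len(positions))}
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) <= sensing_range:
                graph[i].add(j)
                graph[j].add(i)
    return graph
```

With positions `[(0, 0), (1, 0), (5, 0)]` and a sensing range of 1.5, the first two sheep sense each other while the third is isolated, so any guidance method relying on global flock knowledge would not apply to it directly.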

    Acoustic modelling, data augmentation and feature extraction for in-pipe machine learning applications

    Gathering measurements from infrastructure, private premises, and harsh environments can be difficult and expensive. From this perspective, the development of new machine learning algorithms is strongly affected by the availability of training and test data. We focus on audio archives for in-pipe events. Although several examples of pipe-related applications can be found in the literature, datasets of audio/vibration recordings are much scarcer, and the only references found relate to leakage detection and characterisation. Therefore, this work proposes a methodology to relieve the burden of data collection for acoustic events in deployed pipes. The aim is to maximise the yield of small sets of real recordings and demonstrate how to extract effective features for machine learning. The methodology developed requires the preliminary creation of a soundbank of audio samples gathered with simple weak annotations. For practical reasons, the case study is given by a range of appliances, fittings, and fixtures connected to pipes in domestic environments. The source recordings are low-reverberated audio signals enhanced through a bespoke spectral filter and containing the desired audio fingerprints. The soundbank is then processed to create an arbitrary number of synthetic augmented observations. The data augmentation improves the quality and the quantity of the metadata and automatically creates strong and accurate annotations that are both machine and human-readable. Besides, the implemented processing chain allows precise control of properties such as signal-to-noise ratio, duration of the events, and the number of overlapping events. The inter-class variability is expanded by recombining source audio blocks and adding simulated artificial reverberation obtained through an acoustic model developed for the purpose. Finally, the dataset is synthesised to guarantee separability and balance. 
A few signal representations are optimised to maximise the classification performance, and the results are reported as a benchmark for future developments. The contribution to the existing knowledge concerns several aspects of the processing chain implemented. A novel quasi-analytic acoustic model is introduced to simulate in-pipe reverberations, adopting a three-layer architecture particularly convenient for batch processing. The first layer includes two algorithms: one for the numerical calculation of the axial wavenumbers and one for the separation of the modes. The latter, in particular, provides a workaround for a problem not explicitly treated in the literature and related to the modal non-orthogonality given by the solid-liquid interface in the analysed domain. A set of results for different waveguides is reported to compare the dispersive behaviour against different mechanical configurations. Two more novel solutions are also included in the second layer of the model and concern the integration of the acoustic sources. Specifically, the amplitudes of the non-orthogonal modal potentials are obtained either by using a distance minimisation objective function or by solving an analytical decoupling problem. In both cases, results show that sufficiently smooth sources can be approximated with a limited number of modes while keeping the error below 1%. The last layer proposes a bespoke approach for the integration of the acoustic model into the synthesiser as a reverberation simulator. Additional elements of novelty relate to the other blocks of the audio synthesiser. The statistical spectral filter, for instance, is a batch-processing solution for the attenuation of the background noise of the source recordings. The signal-to-noise ratio analysis for both moderate and high noise levels indicates a clear improvement of several decibels against the closest filter example in the literature.
The recombination of the audio blocks and the system of fully tracked annotations are also novel extensions of similar approaches recently adopted in other contexts. Moreover, a bespoke synthesis strategy is proposed to guarantee separable and balanced datasets. The last contribution concerns the extraction of convenient sets of audio features. Elements of novelty are introduced for the optimisation of the filter banks of the mel-frequency cepstral coefficients and the scattering wavelet transform. In particular, compared to the respective standard definitions, the average F-score performance of the optimised features is roughly 6% higher in the first case and 2.5% higher in the second. Finally, the soundbank, the synthetic dataset, and the fundamental blocks of the software library developed are publicly available for further research.
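The abstract mentions precise control of the signal-to-noise ratio when synthesising observations. A minimal sketch of one standard way to do this is given below: the noise is rescaled so that the ratio of mean powers matches a target value in decibels. The function names are hypothetical, and this is not the processing chain of the work itself.

```python
import math
from typing import List, Sequence


def mean_power(signal: Sequence[float]) -> float:
    """Mean squared amplitude of a sampled signal."""
    return sum(s * s for s in signal) / len(signal)


def snr_db(event: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in decibels, computed from mean powers."""
    return 10.0 * math.log10(mean_power(event) / mean_power(noise))


def mix_at_snr(event: Sequence[float], noise: Sequence[float], target_db: float) -> List[float]:
    """Add `noise` to `event` after scaling the noise to reach `target_db`."""
    if len(event) != len(noise):
        raise ValueError("event and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must have non-zero power")
    # Choose gain g so that mean_power(event) / (g^2 * noise_power) = 10^(target_db / 10).
    gain = math.sqrt(mean_power(event) / (noise_power * 10.0 ** (target_db / 10.0)))
    return [e + gain * n for e, n in zip(event, noise)]
```

Subtracting the clean event from the mixture recovers the scaled noise, which makes it easy to verify that the requested ratio was actually reached.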

    What Matters in Model Training to Transfer Adversarial Examples

    Despite state-of-the-art performance on natural data, Deep Neural Networks (DNNs) are highly vulnerable to adversarial examples, i.e., imperceptible, carefully crafted perturbations of inputs applied at test time. Adversarial examples can transfer: an adversarial example against one model is likely to be adversarial against another independently trained model. This dissertation investigates the characteristics of the surrogate weight space that lead to the transferability of adversarial examples. Our research covers three complementary aspects of the weight space exploration: the multimodal exploration to obtain multiple models from different vicinities, the local exploration to obtain multiple models in the same vicinity, and the point selection to obtain a single transferable representation. First, from a probabilistic perspective, we argue that transferability is fundamentally related to uncertainty. The unknown weights of the target DNN can be treated as random variables. Under a specified threat model, deep ensemble can produce a surrogate by sampling from the distribution of the target model. Unfortunately, deep ensembles are computationally expensive. We propose an efficient alternative by approximately sampling surrogate models from the posterior distribution using cSGLD, a state-of-the-art Bayesian deep learning technique. Our extensive experiments show that our approach improves and complements four attacks, three transferability techniques, and five more training methods significantly on ImageNet, CIFAR-10, and MNIST (up to 83.2 percentage points), while reducing training computations from 11.6 to 2.4 exaflops compared to deep ensemble on ImageNet. Second, we propose transferability from Large Geometric Vicinity (LGV), a new technique based on the local exploration of the weight space. LGV starts from a pretrained model and collects multiple weights in a few additional training epochs with a constant and high learning rate. 
LGV exploits two geometric properties that we relate to transferability. First, we show that LGV explores a flatter region of the weight space and generates flatter adversarial examples in the input space. We present the surrogate-target misalignment hypothesis to explain why flatness could increase transferability. Second, we show that the LGV weights span a dense weight subspace whose geometry is intrinsically connected to transferability. Through extensive experiments, we show that LGV alone outperforms all (combinations of) four established transferability techniques by 1.8 to 59.9 percentage points. Third, we investigate how to train a transferable representation, that is, a single model for transferability. First, we refute a common hypothesis from previous research to explain why early stopping improves transferability. We then establish links between transferability and the exploration dynamics of the weight space, in which early stopping has an inherent effect. More precisely, we observe that transferability peaks when the learning rate decays, which is also the time at which the sharpness of the loss significantly drops. This leads us to propose RFN, a new approach to transferability that minimises the sharpness of the loss during training. We show that by searching for large flat neighbourhoods, RFN always improves over early stopping (by up to 47 points of success rate) and is competitive with (if not better than) strong state-of-the-art baselines. Overall, our three complementary techniques provide an extensive and practical method to obtain highly transferable adversarial examples from the multimodal and local exploration of flatter vicinities in the weight space. Our probabilistic and geometric approaches demonstrate that the way to train the surrogate model has been overlooked, although both the training noise and the flatness of the loss landscape are important elements of transfer-based attacks.
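The weight-collection step that the abstract attributes to LGV (start from a pretrained model, continue training with a constant and high learning rate, and keep several weight snapshots) can be illustrated on a toy problem. The sketch below is only an assumption-laden illustration: it uses a noisy quadratic loss instead of a neural network, the names `make_noisy_quadratic_grad` and `collect_weights` are hypothetical, and it does not craft adversarial examples.

```python
import random
from typing import Callable, List

Weights = List[float]


def make_noisy_quadratic_grad(seed: int, noise_std: float) -> Callable[[Weights], Weights]:
    """Gradient of the toy loss sum(w_i^2) plus Gaussian noise that stands in
    for mini-batch noise (a hypothetical substitute for a real training loss)."""
    rng = random.Random(seed)

    def grad(w: Weights) -> Weights:
        return [2.0 * wi + rng.gauss(0.0, noise_std) for wi in w]

    return grad


def collect_weights(
    pretrained: Weights,
    grad_fn: Callable[[Weights], Weights],
    lr: float,
    epochs: int,
    steps_per_epoch: int,
    collect_every: int,
) -> List[Weights]:
    """Run SGD with a constant learning rate from `pretrained` and save a
    snapshot of the weights every `collect_every` steps within each epoch."""
    if collect_every <= 0:
        raise ValueError("collect_every must be positive")
    w = list(pretrained)  # copy so the pretrained weights are left untouched
    collected: List[Weights] = []
    for _ in range(epochs):
        for step in range(1, steps_per_epoch + 1):
            g = grad_fn(w)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
            if step % collect_every == 0:
                collected.append(list(w))
    return collected
```

In a transfer-based attack, the collected snapshots would then act together as the surrogate; how the abstract's method combines them is not specified here and is left out of the sketch.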