
    Deep learning of the dynamics of complex systems with its applications to biochemical molecules

    Recent advancements in deep learning have revolutionized method development in several scientific fields and beyond. One central application is the extraction of equilibrium structures and long-timescale kinetics from molecular dynamics simulations, i.e. the well-known sampling problem. Previous state-of-the-art methods employed a multi-step handcrafted data processing pipeline resulting in Markov state models (MSM), which can be understood as an approximation of the underlying Koopman operator. However, this approach demands choosing a set of features characterizing the molecular structure, methods and their parameters for dimension reduction to collective variables and clustering, and estimation strategies for MSMs throughout the processing pipeline. As this requires specific expertise, the approach is ultimately inaccessible to a broader community. In this thesis we apply deep learning techniques to approximate the Koopman operator in an end-to-end learning framework by employing the variational approach for Markov processes (VAMP). Thereby, the framework bypasses the multi-step process and automates the pipeline while yielding a model similar to a coarse-grained MSM. We further transfer advanced techniques from the MSM field to the deep learning framework, making it possible to (i) include experimental evidence into the model estimation, (ii) enforce reversibility, and (iii) perform coarse-graining. At this stage, post-analysis tools from MSMs can be borrowed to estimate rates of relevant rare events. Finally, we extend this approach to decompose a system into its (almost) independent subsystems and simultaneously estimate dynamical models for each of them, making it much more data-efficient and enabling applications to larger proteins. Although our results solely focus on protein dynamics, the application to climate, weather, and ocean-current data is an intriguing possibility with potential to yield new insights and improve predictive power in these fields.
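    As a rough illustration of the variational idea (not the thesis's network architecture), the following sketch estimates the VAMP-2 score of a featurized trajectory from time-lagged covariance matrices; the feature values, lag time, and regularization are illustrative assumptions.

```python
import numpy as np

def vamp2_score(chi_t, chi_tau, eps=1e-10):
    """VAMP-2 score of time-lagged feature pairs (chi(x_t), chi(x_{t+tau})).

    chi_t, chi_tau: (T, d) arrays of feature vectors; returns
    ||C00^{-1/2} C0t Ctt^{-1/2}||_F^2, the quantity maximized in training.
    """
    chi_t = chi_t - chi_t.mean(axis=0)        # mean-free features
    chi_tau = chi_tau - chi_tau.mean(axis=0)
    T = chi_t.shape[0]
    c00 = chi_t.T @ chi_t / T                 # instantaneous covariance
    c0t = chi_t.T @ chi_tau / T               # time-lagged cross-covariance
    ctt = chi_tau.T @ chi_tau / T

    def inv_sqrt(c):
        w, v = np.linalg.eigh(c)
        w = np.maximum(w, eps)                # regularize tiny eigenvalues
        return v @ np.diag(w ** -0.5) @ v.T

    k = inv_sqrt(c00) @ c0t @ inv_sqrt(ctt)   # half-weighted Koopman matrix
    return float(np.sum(k ** 2))

# toy usage: random "features" of a trajectory, lag time tau = 10 frames
rng = np.random.default_rng(0)
traj = rng.standard_normal((1000, 5))
score = vamp2_score(traj[:-10], traj[10:])
```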

    A fast classifier-based approach to credit card fraud detection

    This thesis aims at addressing the problem of anomaly detection in the context of credit card fraud detection with machine learning. Specifically, the goal is to apply a new approach to two-sample testing based on classifiers recently developed for new physics searches in high-energy physics. This strategy allows one to compare batches of incoming data with a control sample of standard transactions in a statistically sound way without prior knowledge of the type of fraudulent activity. The learning algorithm at the basis of this approach is a modern implementation of kernel methods that allows for fast online training and high flexibility. This work is the first attempt to export this method to a real-world use case outside the domain of particle physics.
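    To illustrate the general classifier-based two-sample-testing idea (a toy stand-in, not the kernel-based implementation the thesis builds on), the sketch below trains a logistic-regression classifier to separate a control sample from an incoming batch and calibrates the test statistic by permutation; all names and parameters are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def classifier_two_sample_test(control, batch, n_perm=200, seed=0):
    """Does `batch` follow the same distribution as `control`?

    Trains a classifier to separate the two samples and uses its
    cross-validated accuracy as the test statistic; the p-value is
    calibrated by permuting the labels.
    """
    rng = np.random.default_rng(seed)
    X = np.vstack([control, batch])
    y = np.r_[np.zeros(len(control)), np.ones(len(batch))]

    def statistic(labels):
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X, labels, cv=5).mean()  # ~0.5 under the null

    t_obs = statistic(y)
    t_null = np.array([statistic(rng.permutation(y)) for _ in range(n_perm)])
    p_value = (1 + np.sum(t_null >= t_obs)) / (1 + n_perm)
    return t_obs, p_value
```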

    Computational Investigations of Biomolecular Mechanisms in Genomic Replication, Repair and Transcription

    High fidelity maintenance of the genome is imperative to ensuring stability and proliferation of cells. The genetic material (DNA) of a cell faces a constant barrage of metabolic and environmental assaults throughout its lifetime, ultimately leading to DNA damage. Left unchecked, DNA damage can result in genomic instability, inviting a cascade of mutations that initiate cancer and other aging disorders. Thus, a large area of focus has been dedicated to understanding how DNA is damaged, repaired, expressed and replicated. At the heart of these processes lie complex macromolecular dynamics coupled with intricate protein-DNA interactions. Through advanced computational techniques it has become possible to probe these mechanisms at the atomic level, providing a physical basis to describe biomolecular phenomena. To this end, we have performed studies aimed at elucidating the dynamics and interactions intrinsic to the functionality of biomolecules critical to maintaining genomic integrity: modeling the DNA editing mechanism of DNA polymerase III, uncovering the DNA damage recognition/repair mechanism of thymine DNA glycosylase, and linking genetic disease to the functional dynamics of the pre-initiation complex transcription machinery. Collectively, our results elucidate the dynamic interplay between proteins and DNA, further broadening our understanding of these complex processes involved in genomic maintenance.

    Cross Entropy-based Analysis of Spacecraft Control Systems

    Space missions increasingly require sophisticated guidance, navigation and control algorithms, the development of which relies on verification and validation (V&V) techniques to ensure mission safety and success. A crucial element of V&V is the assessment of control system robust performance in the presence of uncertainty. In addition to estimating average performance under uncertainty, it is critical to determine the worst-case performance. Industrial V&V approaches typically employ mu-analysis in the early control design stages, and Monte Carlo simulations on high-fidelity full engineering simulators at advanced stages of the design cycle. While highly capable, such techniques present a critical gap between pessimistic worst-case estimates found using analytical methods and the optimistic outlook often presented by Monte Carlo runs. Conservative worst-case estimates are problematic because they can demand a controller redesign, which is not justified if the poor performance is unlikely to occur. Gaining insight into the probability associated with the worst-case performance is valuable in bridging this gap. Due to the complexity of industrial-scale systems, V&V techniques must be capable of efficiently analysing non-linear models in the presence of significant uncertainty, and they must be computationally tractable. It is also desirable that such techniques demand little engineering effort before each analysis, so that they can be applied widely to industrial systems. Motivated by these factors, this thesis proposes and develops an efficient algorithm based on the cross entropy simulation method. The proposed algorithm efficiently estimates the probabilities associated with a range of performance levels, from nominal performance up to degraded performance values. The resulting curve is termed the probability profile of performance (PPoP), and is introduced as a tool that offers insight into a control system's performance, principally the probability associated with the worst-case performance. The cross entropy-based robust performance analysis is implemented here on various industrial systems in European Space Agency-funded research projects. The implementation on autonomous rendezvous and docking models for the Mars Sample Return mission constitutes the core of the thesis. The proposed technique is also implemented on high-fidelity models of the Vega launcher, as well as on a generic long-coasting launcher upper stage. In summary, this thesis (a) develops an algorithm based on the cross entropy simulation method to estimate the probability associated with the worst case, (b) proposes the cross entropy-based PPoP tool to gain insight into system performance, (c) presents results of the robust performance analysis of three space industry systems using the proposed technique in conjunction with existing methods, and (d) proposes an integrated template for conducting robust performance analysis of linearised aerospace systems.
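    A rough sketch of the cross entropy simulation idea for rare-event probability estimation (not the thesis's algorithm): iteratively tilt a Gaussian sampling distribution toward the region of degraded performance, then correct the bias with importance weights. The performance function and all parameters below are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ce_rare_event_prob(perf, gamma, dim, n=2000, rho=0.1, iters=20, seed=0):
    """Cross-entropy estimate of p = P(perf(X) >= gamma) for X ~ N(0, I).

    perf: scalar performance index of an uncertain parameter vector
    (placeholder). The sampling mean is shifted toward the rare region
    over several iterations; the final estimate uses importance weights.
    """
    rng = np.random.default_rng(seed)
    nominal = multivariate_normal(np.zeros(dim), np.eye(dim))
    mu = np.zeros(dim)                                   # tilted sampling mean
    for _ in range(iters):
        x = rng.multivariate_normal(mu, np.eye(dim), size=n)
        s = np.apply_along_axis(perf, 1, x)
        level = min(gamma, np.quantile(s, 1 - rho))      # elite threshold
        elite = x[s >= level]
        w = nominal.pdf(elite) / multivariate_normal(mu, np.eye(dim)).pdf(elite)
        mu = (w[:, None] * elite).sum(axis=0) / w.sum()  # weighted CE update
        if level >= gamma:
            break
    x = rng.multivariate_normal(mu, np.eye(dim), size=n)
    s = np.apply_along_axis(perf, 1, x)
    w = nominal.pdf(x) / multivariate_normal(mu, np.eye(dim)).pdf(x)
    return float(np.mean(w * (s >= gamma)))              # importance-sampling estimate
```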

    State Estimation for Distributed Systems with Stochastic and Set-membership Uncertainties

    State estimation techniques for centralized, distributed, and decentralized systems are studied. An easy-to-implement state estimation concept is introduced that generalizes and combines basic principles of Kalman filter theory and ellipsoidal calculus. By means of this method, stochastic and set-membership uncertainties can be taken into consideration simultaneously. Different solutions for implementing these estimation algorithms in distributed networked systems are presented.
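    The stochastic part of such a scheme rests on the standard Kalman filter recursion; a minimal sketch of one linear predict/update step is shown below (the ellipsoidal set-membership part is not sketched, and the model matrices are placeholders).

```python
import numpy as np

def kalman_step(x, P, z, A, H, Q, R):
    """One predict/update cycle of a linear Kalman filter.

    x, P : prior state estimate and covariance
    z    : new measurement
    A, H : state-transition and measurement matrices (placeholders)
    Q, R : process- and measurement-noise covariances
    """
    x_pred = A @ x                               # predict
    P_pred = A @ P @ A.T + Q
    S = H @ P_pred @ H.T + R                     # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)        # update with measurement z
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```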

    Design Methods for Reducing Failure Probabilities with Examples from Electrical Engineering

    This thesis addresses the quantification of uncertainty and optimization under uncertainty. We focus on uncertainties in the manufacturing process of devices, e.g. those caused by manufacturing imperfections, natural material deviations or environmental influences. These uncertainties may lead to deviations in the geometry or the materials, which may in turn cause deviations in the operation of the device. The term yield refers to the fraction of realizations in a manufacturing process under uncertainty that fulfill all performance requirements. It is the counterpart of the failure probability (yield = 1 - failure probability) and serves as a measure of (un)certainty. The main goal of this work is to efficiently estimate and maximize the yield. In this way, we increase the reliability of designs, which reduces rejects of devices due to malfunction and hence saves resources, money and time. One main challenge in the field of yield estimation is reducing the computing effort while maintaining high accuracy. In this work we propose two hybrid yield estimation methods. Both are sampling-based and evaluate most of the sample points on a surrogate model, while only a small subset of so-called critical sample points is evaluated on the original high-fidelity model. The SC-Hybrid approach is based on stochastic collocation and adjoint error indicators. The non-intrusive GPR-Hybrid approach uses Gaussian process regression and allows surrogate model updates on the fly. For efficient yield optimization we propose the adaptive Newton-Monte-Carlo (Newton-MC) method, where the sample size is adaptively increased. Another topic is the optimization of problems with mixed gradient information, i.e., problems where the derivatives of the objective function are available with respect to some optimization variables, but not all. The use of gradient-based solvers like the adaptive Newton-MC would require the costly approximation of the missing derivatives. We propose two methods for this case: the Hermite least squares and the Hermite BOBYQA optimization. Both are modifications of the originally derivative-free BOBYQA (Bound constrained Optimization BY Quadratic Approximation) method, but are able to handle derivative information and use least squares regression instead of interpolation. In addition, an advantage of the Hermite-type approaches is their robustness in the case of noisy objective functions. The global convergence of these methods is proven. In the context of yield optimization the case of mixed gradient information is particularly relevant if, besides Gaussian-distributed uncertain optimization variables, there are deterministic or non-Gaussian-distributed uncertain optimization variables. The proposed methods can be applied to any design process affected by uncertainties. However, in this work we focus on the design of electrotechnical devices. We evaluate the approaches on two benchmark problems, a rectangular waveguide and a permanent magnet synchronous machine (PMSM). Significant savings in computing effort can be observed in yield estimation and in single- and multi-objective yield optimization. This allows the application of design optimization under uncertainty in industry.
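    The basic sampling idea behind a hybrid yield estimator can be sketched as follows: evaluate all Monte Carlo samples on a cheap surrogate and re-evaluate only the critical ones, i.e. those close to the performance threshold, on the expensive high-fidelity model. The surrogate, threshold, and tolerance below are placeholders, not the SC-Hybrid or GPR-Hybrid algorithms themselves.

```python
def hybrid_yield(samples, surrogate, high_fidelity, threshold, tol):
    """Estimate yield = fraction of samples with performance <= threshold.

    surrogate, high_fidelity: callables mapping a parameter vector to a
    scalar performance value (the surrogate is cheap but approximate).
    Samples whose surrogate value lies within `tol` of the threshold are
    treated as critical and re-evaluated on the high-fidelity model.
    """
    accepted = 0
    for x in samples:
        y = surrogate(x)
        if abs(y - threshold) < tol:       # too close to call on the surrogate
            y = high_fidelity(x)
        accepted += y <= threshold
    return accepted / len(samples)
```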

    Measuring primate gene expression evolution using high throughput transcriptomics and massively parallel reporter assays

    A key question in biology is how one genome sequence can lead to the great cellular diversity present in multicellular organisms. Enabled by the sequencing revolution, RNA sequencing (RNA-seq) has emerged as a central tool to measure transcriptome-wide gene expression levels. More recently, single cell RNA-seq was introduced and is becoming a feasible alternative to the more established bulk sequencing. While many different methods have been proposed, a thorough optimisation of established protocols can lead to improvements in robustness, sensitivity, scalability and cost effectiveness. Towards this goal, I contributed to optimizing the single cell RNA-seq method "Single Cell RNA Barcoding and sequencing" (SCRB-seq) and to publishing an improved version that uses optimized reaction conditions and molecular crowding (mcSCRB-seq). mcSCRB-seq achieves higher sensitivity at lower cost per cell and shows the highest RNA capture rate when compared with other published methods. We next sought a direct comparison to other scRNA-seq protocols within the Human Cell Atlas (HCA) benchmarking effort. Here we used mcSCRB-seq to profile a common reference sample that included heterogeneous cell populations from different sources. Transferring the acquired knowledge on single cell RNA sequencing methods to bulk RNA-seq led to the development of the prime-seq protocol, a sensitive, robust and cost-efficient bulk RNA-seq protocol that can be performed in any molecular biology laboratory. We compared data generated with the prime-seq protocol to the gold-standard TruSeq method using power simulations and found that the statistical power to detect differentially expressed genes is comparable, at 40-fold lower cost. While gene expression is an informative phenotype, the regulation that leads to the different phenotypes is still poorly understood. A state-of-the-art method to measure the activity of cis-regulatory elements (CREs) in a high-throughput fashion is the Massively Parallel Reporter Assay (MPRA), which can measure the activity of thousands of CREs in parallel. A good way to approach the genotype-to-phenotype conundrum is to use evolutionary information. Cross-species comparisons of closely related species can help understand how particular diverging phenotypes emerged and how conserved gene regulatory programs are encoded in the genome. Cell lines, particularly induced pluripotent stem cells (iPSCs), are a very useful tool for comparative studies. iPSCs can be reprogrammed from different primary somatic cells and are by definition pluripotent, meaning they can be differentiated into cells of all three germ layers. A main challenge for primate research is obtaining primary cells. To this end I contributed to establishing a protocol to generate iPSCs from a non-invasive source of primary cells, namely urine. Using prime-seq we characterized the primary urine-derived stem cells (UDSCs) and the reprogrammed iPSCs. Finally, I used an MPRA to measure the activity of putative regulatory elements of the gene TRNP1 across the mammalian phylogeny. We found co-evolution of one particular CRE with brain folding in Old World monkeys. To validate this finding we looked for transcription factor binding sites within the identified CRE and intersected the list with transcription factors confirmed to be expressed in the cellular system using prime-seq. In addition, we found that changes in the protein-coding sequence of TRNP1 and neural stem cell proliferation induced by TRNP1 orthologs correlate with brain size. In summary, within my doctorate I developed methods that enable measuring gene expression and gene regulation in a comparative genomics setting. I further applied these methods in a cross-mammalian study of the regulatory sequences of the gene TRNP1 and its association with brain phenotypes.
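    A toy sketch of a power simulation for a single differentially expressed gene (illustrative only; the actual comparison used dedicated RNA-seq analysis tools, and the counts, dispersion, fold change, and test below are assumptions):

```python
import numpy as np
from scipy import stats

def de_power(n_per_group, mean, fold_change, dispersion=0.2,
             alpha=0.05, n_sim=1000, seed=0):
    """Fraction of simulations in which one differentially expressed gene
    is detected at level alpha (Wilcoxon rank-sum test, illustrative)."""
    rng = np.random.default_rng(seed)

    def nb_counts(mu, size):
        # negative binomial with mean mu and var = mu + dispersion * mu**2
        r = 1.0 / dispersion
        return rng.negative_binomial(r, r / (r + mu), size=size)

    hits = 0
    for _ in range(n_sim):
        a = nb_counts(mean, n_per_group)
        b = nb_counts(mean * fold_change, n_per_group)
        hits += stats.mannwhitneyu(a, b).pvalue < alpha
    return hits / n_sim

# e.g. de_power(8, 100, 2.0) -> estimated power for an 8 vs 8 comparison
```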

    Predicting Flavonoid UGT Regioselectivity with Graphical Residue Models and Machine Learning.

    Machine learning is applied to a challenging and biologically significant protein classification problem: the prediction of flavonoid UGT acceptor regioselectivity from primary protein sequence. Novel indices characterizing graphical models of protein residues are introduced. The indices are compared with existing amino acid indices and found to cluster residues appropriately. A variety of models employing the indices are then investigated by examining their performance when analyzed using nearest neighbor, support vector machine, and Bayesian neural network classifiers. Improvements over nearest neighbor classifications relying on standard alignment similarity scores are reported.
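    A minimal sketch of the classification setup: encode each sequence as a fixed-length vector of per-residue index values and cross-validate a nearest neighbor classifier. The lookup table, padding length, and data handling here are placeholders, not the graphical residue indices introduced in the paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical per-residue index (one value per amino acid); the paper's
# graphical residue indices would replace this simple lookup table.
INDEX = {aa: i / 20.0 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def featurize(seq, length=300):
    """Map an (aligned or truncated/padded) sequence to a fixed-length vector."""
    vec = [INDEX.get(aa, 0.0) for aa in seq[:length]]
    return np.pad(vec, (0, length - len(vec)))

def knn_accuracy(seqs, labels, k=3):
    """Cross-validated accuracy of a k-nearest-neighbor classifier on
    regioselectivity labels (placeholder evaluation)."""
    X = np.array([featurize(s) for s in seqs])
    clf = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(clf, X, labels, cv=5).mean()
```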

    ANALYSIS AND SIMULATION OF TANDEM MASS SPECTROMETRY DATA

    This dissertation focuses on improvements to data analysis in mass spectrometry-based proteomics, which is the study of an organism’s full complement of proteins. One of the biggest surprises from the Human Genome Project was the relatively small number of genes (~20,000) encoded in our DNA. Since genes code for proteins, scientists expected more genes would be necessary to produce a diverse set of proteins to cover the many functions that support the complexity of life. Thus, there is intense interest in studying proteomics, including post-translational modifications (how proteins change after translation from their genes) and their interactions (e.g. proteins binding together to form complex molecular machines), to fill the void in molecular diversity. The goal of mass spectrometry in proteomics is to determine the abundance and amino acid sequence of every protein in a biological sample. A mass spectrometer can determine mass/charge ratios and abundance for fragments of short peptides (which are subsequences of a protein); sequencing algorithms determine which peptides are most likely to have generated the fragmentation patterns observed in the mass spectrum, and protein identity is inferred from the peptides. My work improves the computational tools for mass spectrometry by removing limitations on present algorithms, simulating mass spectrometry instruments to facilitate algorithm development, and creating algorithms that approximate isotope distributions, deconvolve chimeric spectra, and predict protein-protein interactions. While most sequencing algorithms attempt to identify a single peptide per mass spectrum, multiple peptides are often fragmented together. Here, I present a method to deconvolve these chimeric mass spectra into their individual peptide components by examining the isotopic distributions of their fragments. First, I derived the equation to calculate the theoretical isotope distribution of a peptide fragment. Next, for cases where elemental compositions are not known, I developed methods to approximate the isotope distributions. Ultimately, I created a non-negative least squares model that deconvolved chimeric spectra and increased peptide-spectrum matches by 15-30%. To improve the operation of mass spectrometer instruments, I developed software that simulates liquid chromatography-mass spectrometry data and the subsequent execution of custom data acquisition algorithms. The software provides an opportunity for researchers to test, refine, and evaluate novel algorithms prior to implementation on a mass spectrometer. Finally, I created a logistic regression classifier for predicting protein-protein interactions defined by affinity purification and mass spectrometry (APMS). The classifier increased the area under the receiver operating characteristic curve by 16% compared to previous methods. Furthermore, I created a web application to facilitate APMS data scoring within the scientific community.
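    The deconvolution step can be illustrated with a small non-negative least squares sketch: given theoretical isotope-distribution templates for candidate peptide fragments on a common m/z grid, solve for the non-negative abundances that best reconstruct the observed chimeric spectrum. The templates and intensities below are made-up toy values.

```python
import numpy as np
from scipy.optimize import nnls

def deconvolve(observed, templates):
    """Non-negative abundances of candidate fragments in a chimeric spectrum.

    observed  : (m,) binned intensities of the observed spectrum
    templates : (m, k) theoretical isotope distributions, one column per
                candidate peptide fragment, on the same m/z grid
    Solves min ||templates @ a - observed|| subject to a >= 0.
    """
    abundances, residual = nnls(templates, observed)
    return abundances, residual

# toy usage: two overlapping isotope patterns mixed with abundances 3 and 1.5
t1 = np.array([0.6, 0.3, 0.1, 0.0, 0.0, 0.0])
t2 = np.array([0.0, 0.0, 0.5, 0.3, 0.15, 0.05])
observed = 3.0 * t1 + 1.5 * t2
a, _ = deconvolve(observed, np.column_stack([t1, t2]))   # a ~= [3.0, 1.5]
```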