
    Computational Methods for Protein Inference in Shotgun Proteomics Experiments

    Since the beginning of this millennium, the advent of high-throughput methods in numerous fields of the life sciences has led to a shift in paradigms. A broad variety of technologies emerged that allow comprehensive quantification of molecules involved in biological processes. Simultaneously, a major increase in data volume has been recorded with these techniques through enhanced instrumentation and other technical advances. By supplying computational methods that automatically process raw data to obtain biological information, the field of bioinformatics plays an increasingly important role in the analysis of this ever-growing mass of data. Computational mass spectrometry in particular is a bioinformatics field of research which provides means to gather, analyze and visualize data from high-throughput mass spectrometric experiments. For the study of the entirety of proteins in a cell or an environmental sample, even current techniques reach limitations that need to be circumvented by simplifying the samples subjected to the mass spectrometer. These pre-digested (so-called bottom-up) proteomics experiments then pose an even bigger computational burden during analysis, since the digestion of proteins into peptides introduces complex ambiguities that need to be resolved during protein inference, grouping and quantification. In this thesis, we present several developments in pursuit of our goal to provide means for a fully automated analysis of complex and large-scale bottom-up proteomics experiments. Firstly, to reduce the prohibitive computational complexity of discrete Bayesian protein inference, so-called convolution trees have recently been employed; however, they so far lacked an exact and numerically stable way to perform max-product inference. We therefore describe a refined, more stable technique for performing inference on sums of random variables, based on a piecewise, extrapolating scheme, that enables this variation of standard Bayesian inference for the problem. Secondly, building on the integration of this technique into a co-developed library for Bayesian inference, we present an OpenMS tool that performs efficient protein inference on a discrete Bayesian network using a loopy belief propagation algorithm. Despite its strictly probabilistic formulation, the method outperforms most established approaches in computational efficiency, and its interface offers unique input and output options, such as regularizing the number of proteins per group, protein-specific priors, and recalibrated peptide posteriors. Finally, we present a complete, easy-to-use, yet scalable workflow for protein inference and quantification built around the new tool. The pipeline is implemented in nextflow and is part of a set of standardized, well-tested, and community-maintained workflows by the nf-core collective. Our workflow runs on large-scale data with complex experimental designs and allows a one-command (re-)analysis of local and publicly available data sets with state-of-the-art accuracy and excellent performance on various high-performance computing environments or the cloud.
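    The core operation inside such a convolution tree is the convolution of count distributions coming from the different peptide branches of the Bayesian network; the max-product variant addressed above replaces the sum over splits by a maximum. For orientation only, here is a brute-force Python sketch of that elementary operation (names and the peptide-count interpretation are illustrative; the thesis contributes a fast and numerically stable piecewise/extrapolation scheme, which this sketch deliberately does not reproduce):

        import numpy as np

        def max_product_convolve(p, q):
            # p[i], q[j]: nonnegative (unnormalized) beliefs that i, resp. j, of the
            # peptides hanging off two branches are present. r[k] is the best (max)
            # product over all splits i + j = k. A convolution tree chains this
            # operation over many branches; here it is spelled out naively in
            # O(len(p) * len(q)) time.
            r = np.zeros(len(p) + len(q) - 1)
            for i, pi in enumerate(p):
                for j, qj in enumerate(q):
                    r[i + j] = max(r[i + j], pi * qj)
            return r

    The sum-product counterpart used for marginal inference is simply np.convolve(p, q); the max-product version above has no equally direct fast transform, which is the gap the piecewise/extrapolating method described in the abstract targets.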

    Better approximations for Tree Sparsity in Nearly-Linear Time

    The Tree Sparsity problem is defined as follows: given a node-weighted tree of size n and an integer k, output a rooted subtree of size k with maximum weight. The best known algorithm solves this problem in time O(kn), i.e., quadratic in the size of the input tree for k = Θ(n). In this work, we design (1+ε)-approximation algorithms for the Tree Sparsity problem that run in nearly-linear time. Unlike prior algorithms for this problem, our results offer single-criterion approximations, i.e., they do not increase the sparsity of the output solution, and they work for arbitrary trees (not only balanced trees). We also provide further algorithms for this problem with different runtime vs. approximation trade-offs. Finally, we show that if the exact version of the Tree Sparsity problem can be solved in strongly subquadratic time, then the (min, +) convolution problem can be solved in strongly subquadratic time as well. The latter is a well-studied problem for which no strongly subquadratic time algorithm is known.
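    For orientation, the exact O(kn) baseline mentioned above is the standard tree-knapsack dynamic program, which merges the children of each node via a (max,+)-style combination of their partial tables. Below is a minimal Python sketch of that baseline, under the usual convention that the output subtree must contain the root; all names are illustrative and not taken from the paper.

        import math

        def tree_sparsity_exact(children, weight, root, k):
            # Exact O(n*k) dynamic program for Tree Sparsity (illustrative baseline).
            # children[v]: list of children of v; weight[v]: weight of node v.
            # Returns the maximum total weight of a connected subtree that contains
            # `root` and has exactly k nodes (-inf if no such subtree exists).
            NEG = -math.inf

            def solve(v):
                # f[j] = best weight of a connected, v-rooted subtree with exactly j
                # nodes; f[0] = 0 encodes "take nothing from this branch".
                f = [0.0, weight[v]] + [NEG] * (k - 1)
                size = 1  # nodes of v's subtree seen so far
                for c in children[v]:
                    fc, csize = solve(c)
                    new_size = min(size + csize, k)
                    g = [NEG] * (new_size + 1)
                    g[0] = 0.0
                    # (max,+)-combination; j >= 1 keeps v in the solution, so the
                    # chosen nodes always stay connected through v.
                    for j in range(1, min(size, k) + 1):
                        if f[j] == NEG:
                            continue
                        for t in range(min(csize, k - j) + 1):
                            if fc[t] != NEG and f[j] + fc[t] > g[j + t]:
                                g[j + t] = f[j] + fc[t]
                    f = g + [NEG] * (k - new_size)
                    size = new_size
                return f, size

            return solve(root)[0][k]

    For example, on a three-node path rooted at node 0 with unit weights, tree_sparsity_exact({0: [1], 1: [2], 2: []}, {0: 1.0, 1: 1.0, 2: 1.0}, 0, 2) returns 2.0. The size caps in the two inner loops are what keep the total work at O(kn) rather than O(nk²); for very deep trees the recursion would need to be replaced by an explicit stack.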

    Sparse Bayesian mass-mapping with uncertainties: hypothesis testing of structure

    A crucial aspect of mass-mapping, via weak lensing, is quantification of the uncertainty introduced during the reconstruction process. Properly accounting for these errors has largely been ignored to date. We present results from a new method that reconstructs maximum a posteriori (MAP) convergence maps by formulating an unconstrained Bayesian inference problem with Laplace-type ℓ1-norm sparsity-promoting priors, which we solve via convex optimization. Approaching mass-mapping in this manner allows us to exploit recent developments in probability concentration theory to infer theoretically conservative uncertainties for our MAP reconstructions, without relying on assumptions of Gaussianity. For the first time these methods allow us to perform hypothesis testing of structure, from which it is possible to distinguish between physical objects and artifacts of the reconstruction. Here we present this new formalism and demonstrate the method on illustrative examples, before applying the developed formalism to two observational datasets of the Abell 520 cluster. In our Bayesian framework it is found that neither Abell 520 dataset can conclusively determine the physicality of individual local massive substructure at significant confidence. However, in both cases the recovered MAP estimators are consistent with both sets of data.
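    Concretely, the reconstruction described above amounts to a convex, analysis-form MAP problem of the following shape (the symbols are the generic ones for this setting rather than quoted from the paper: γ is the measured shear field, Φ the forward lensing/masking operator mapping convergence to shear, Ψ† a wavelet analysis operator, σ the noise level and λ the regularisation strength):

        \hat{\kappa}_{\mathrm{MAP}}
          = \arg\min_{\kappa} \Big\{ \lambda \,\| \Psi^{\dagger} \kappa \|_{1}
            + \tfrac{1}{2\sigma^{2}} \,\| \gamma - \Phi \kappa \|_{2}^{2} \Big\}

    Both terms are convex, so proximal first-order solvers apply, and the convexity of the negative log-posterior (i.e. a log-concave posterior) is exactly what the probability concentration results mentioned above require.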

    Frontiers in Nonparametric Statistics

    The goal of this workshop was to discuss recent developments in nonparametric statistical inference. A particular focus was on high-dimensional statistics, semiparametrics, adaptation, nonparametric Bayesian statistics, shape-constrained estimation and statistical inverse problems. The close interaction of these topics with optimization, machine learning and inverse problems was addressed as well.

    Cleaning large correlation matrices: tools from random matrix theory

    This review covers recent results concerning the estimation of large covariance matrices using tools from Random Matrix Theory (RMT). We introduce several RMT methods and analytical techniques, such as the Replica formalism and Free Probability, with an emphasis on the Marchenko-Pastur equation that provides information on the resolvent of multiplicatively corrupted noisy matrices. Special care is devoted to the statistics of the eigenvectors of the empirical correlation matrix, which turn out to be crucial for many applications. We show in particular how these results can be used to build consistent "Rotationally Invariant" estimators (RIE) for large correlation matrices when there is no prior on the structure of the underlying process. The last part of this review is dedicated to some real-world applications within financial markets as a case in point. We establish empirically the efficacy of the RIE framework, which is found to be superior in this case to all previously proposed methods. The case of additively (rather than multiplicatively) corrupted noisy matrices is also dealt with in a special Appendix. Several open problems and interesting technical developments are discussed throughout the paper.
    Comment: 165 pages; submitted to Physics Reports
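    As a concrete point of reference for the "cleaning" discussed above, the simplest RMT recipe is eigenvalue clipping at the Marchenko-Pastur bulk edge; the rotationally invariant estimators reviewed in the text refine this with eigenvalue-dependent shrinkage. A minimal sketch of the clipping baseline, assuming T observations of N variables with T > N (function name and conventions are illustrative, not from the review):

        import numpy as np

        def mp_clipped_correlation(returns):
            # returns: (T, N) array of observations (T samples, N variables).
            # Eigenvalues of the sample correlation matrix below the Marchenko-Pastur
            # bulk edge are treated as noise and flattened to their average (which
            # preserves the trace); the sample eigenvectors are kept unchanged.
            T, N = returns.shape
            q = N / T                                # aspect ratio of the data matrix
            X = (returns - returns.mean(0)) / returns.std(0)
            C = X.T @ X / T                          # sample correlation matrix
            lam, V = np.linalg.eigh(C)               # ascending eigenvalues
            lam_plus = (1.0 + np.sqrt(q)) ** 2       # upper edge of the pure-noise bulk
            noise = lam < lam_plus
            if noise.any():
                lam = lam.copy()
                lam[noise] = lam[noise].mean()
            C_clean = (V * lam) @ V.T                # rebuild with the cleaned spectrum
            d = np.sqrt(np.diag(C_clean))
            return C_clean / np.outer(d, d)          # re-normalize the diagonal to 1

    Keeping the sample eigenvectors and modifying only the eigenvalues is precisely the "rotationally invariant" restriction; the RIE results surveyed in the review replace this crude flatten-the-bulk rule with an eigenvalue-by-eigenvalue nonlinear shrinkage.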

    Bayesian Variational Regularisation for Dark Matter Reconstruction with Uncertainty Quantification

    Despite the great wealth of cosmological knowledge accumulated since the early 20th century, the nature of dark-matter, which accounts for ~85% of the matter content of the universe, remains elusive. Unfortunately, though dark-matter is scientifically interesting, with implications for our fundamental understanding of the Universe, it cannot be directly observed. Instead, dark-matter may be inferred from e.g. the optical distortion (lensing) of distant galaxies which, at linear order, manifests as a perturbation to the apparent magnitude (convergence) and ellipticity (shearing). Ensemble observations of the shear are collected and leveraged to construct estimates of the convergence, which can directly be related to the universal dark-matter distribution. Imminent stage IV surveys are forecast to accrue an unprecedented quantity of cosmological information; a discriminative partition of which is accessible through the convergence, and is disproportionately concentrated at high angular resolutions, where the echoes of cosmological evolution under gravity are most apparent. Capitalising on advances in probability concentration theory, this thesis merges the paradigms of Bayesian inference and optimisation to develop hybrid convergence inference techniques which are scalable, statistically principled, and operate over the Euclidean plane, celestial sphere, and 3-dimensional ball. Such techniques can quantify the plausibility of inferences at one-millionth the computational overhead of competing sampling methods. These Bayesian techniques are applied to the hotly debated Abell-520 merging cluster, concluding that observational catalogues contain insufficient information to determine the existence of dark-matter self-interactions. Further, these techniques were applied to all public lensing catalogues, recovering the then-largest global dark-matter mass-map. The primary methodological contributions of this thesis depend only on posterior log-concavity, paving the way towards a, potentially revolutionary, complete hybridisation with artificial intelligence techniques. These next-generation techniques are the first to operate over the full 3-dimensional ball, laying the foundations for statistically principled universal dark-matter cartography, and the cosmological insights such advances may provide.
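    The claimed reduction in computational overhead comes from replacing posterior sampling with a MAP estimate plus an analytic approximation of the highest-posterior-density credible region, which only requires the negative log-posterior f to be convex (the log-concavity the abstract refers to). In the commonly used form of this approximation (stated here from Pereyra's general result rather than from the thesis, so the exact constants are an assumption about the variant used), with N the dimension of the convergence map κ:

        \hat{C}_{\alpha} = \{\, \kappa : f(\kappa) \le \hat{\gamma}_{\alpha} \,\},
        \qquad
        \hat{\gamma}_{\alpha} = f(\hat{\kappa}_{\mathrm{MAP}}) + \sqrt{16 N \log(3/\alpha)} + N .

    Hypothesis testing of structure then proceeds by removing a candidate feature from the MAP map (e.g. by inpainting the region) and checking whether the objective of the surrogate exceeds the threshold: if it stays below the threshold, the data cannot rule out that the feature is a reconstruction artifact, consistent with the inconclusive Abell-520 findings reported above.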