Computational Methods for Protein Inference in Shotgun Proteomics Experiments
In recent decades, the use of high-throughput methods has increased significantly across the natural sciences, amounting to a veritable paradigm shift. A large number of new technologies have been developed to advance and accelerate the quantification of molecules involved in all kinds of biological processes. This has been accompanied by a considerable increase in the amount of data generated by the improved methods. By providing computational methods for analyzing this mass of raw data, the field of bioinformatics plays an ever-growing role in the extraction of biological insights.
Computational mass spectrometry in particular supports the processing, analysis, and visualization of data from high-throughput mass spectrometry experiments. When studying the entirety of proteins in a cell or another sample of biological material, even the latest methods reach their limits. Many laboratories therefore digest the sample before it enters the mass spectrometer in order to reduce the complexity of the molecules to be measured. These so-called bottom-up proteomics experiments, however, make the subsequent computational analysis considerably harder: because the proteins are digested into peptides, complex ambiguities have to be taken into account and/or resolved during protein inference, protein grouping, and protein quantification.
In this dissertation, we present several developments intended to enable an efficient and fully automated analysis of complex and large-scale bottom-up proteomics experiments.
To reduce the prohibitive complexity of discrete Bayesian protein inference methods, so-called convolution trees have recently been employed. So far, however, they have offered no accurate and at the same time numerically stable way to perform max-product inference. This dissertation therefore first describes a new method that achieves this by means of a piecewise, extrapolating scheme.
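Not part of the original abstract, but a minimal Python sketch may clarify the operation at the heart of convolution trees: obtaining the distribution of a sum of independent discrete random variables (e.g., how many of a protein's peptides are present) by pairwise convolution of probability mass functions. All names here are illustrative. In sum-product semantics this is an ordinary convolution; the max-product variant tracks the best attainable configuration instead, and it is this step for which the dissertation describes an accurate, numerically stable scheme.

```python
import numpy as np

def convolve_pmf(p, q):
    """Sum-product case: PMF of X + Y for independent X, Y."""
    return np.convolve(p, q)

def max_convolve(p, q):
    """Max-product case: best attainable score for each value of X + Y.
    Naive O(len(p) * len(q)); a fast, stable approximation of this
    step is the subject of the first contribution above."""
    r = np.zeros(len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] = max(r[i + j], pi * qj)
    return r

# Three peptides, each detected with probability 0.7 if the protein exists:
# the PMF over "number of detected peptides" is a 3-fold convolution.
pep = np.array([0.3, 0.7])
print(convolve_pmf(convolve_pmf(pep, pep), pep))  # [0.027 0.189 0.441 0.343]
```

A convolution tree arranges such pairwise merges in a balanced binary tree, so messages about a sum of n variables cost roughly O(n log² n) in the sum-product case instead of enumerating 2^n configurations; no exact strongly subquadratic algorithm is known for max-convolution, hence the piecewise, extrapolating approximation.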
Building on the integration of this method into a co-developed library for Bayesian inference, an OpenMS tool for protein inference is then presented. The tool performs efficient protein inference on a discrete Bayesian network using a loopy belief propagation algorithm. Despite the strictly probabilistic formulation of the problem, our method outperforms most established approaches in computational efficiency. The algorithm's interface additionally offers unique input and output options, such as regularizing the number of proteins in a group, protein-specific priors, or recalibrated peptide posteriors.
Finally, this work presents a complete, easy-to-use, yet scalable workflow for protein inference and quantification built around the new tool. The pipeline is implemented in nextflow and is part of a set of standardized, regularly tested, community-maintained workflows bundled under the nf-core project. Our workflow can process even large datasets with complicated experimental designs. With a single command, it enables the (re-)analysis of local or publicly available datasets with competitive accuracy and excellent performance on a wide range of high-performance computing environments or in the cloud.
Better Approximations for Tree Sparsity in Nearly-Linear Time
The Tree Sparsity problem is defined as follows: given a node-weighted tree of size n and an integer k, output a rooted subtree of size k with maximum weight. The best known algorithm solves this problem in time O(kn), i.e., quadratic in the size of the input tree for k = Θ(n).
In this work, we design (1+ε)-approximation algorithms for the Tree Sparsity problem that run in nearly-linear time. Unlike prior algorithms for this problem, our results offer single criterion approximations, i.e., they do not increase the sparsity of the output solution, and work for arbitrary trees (not only balanced trees). We also provide further algorithms for this problem with different runtime vs approximation trade-offs.
Finally, we show that if the exact version of the Tree Sparsity problem can be solved in strongly subquadratic time, then the (min, +) convolution problem can be solved in strongly subquadratic time as well. The latter is a well-studied problem for which no strongly subquadratic time algorithm is known.
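To make the problem statement concrete, here is a short Python sketch (mine, not the paper's) of the exact O(kn) dynamic program that the abstract cites as the best known algorithm. Note that the merge over a node's children is precisely a (max, +) convolution, which is the source of the conditional hardness stated in the last paragraph.

```python
import sys
sys.setrecursionlimit(100000)  # plain recursion; fine for a sketch

def tree_sparsity(children, weight, root, k):
    """dp[v][s] = max weight of a connected subtree that contains v and
    has exactly s nodes (dp[v][0] = 0 encodes taking nothing)."""
    def solve(v):
        dp = [0.0, weight[v]]
        for c in children[v]:
            dpc = solve(c)
            size = min(len(dp) + len(dpc) - 1, k + 1)
            new = [float("-inf")] * size
            new[0] = 0.0
            for i in range(1, len(dp)):       # i nodes kept at/around v
                for j in range(len(dpc)):     # j nodes taken from child c
                    if i + j < size:          # (max,+)-convolution merge
                        new[i + j] = max(new[i + j], dp[i] + dpc[j])
            dp = new
        return dp

    dp_root = solve(root)
    return dp_root[k] if k < len(dp_root) else float("-inf")

# Example: node 0 has children 1 and 3; node 1 has child 2.
children = {0: [1, 3], 1: [2], 2: [], 3: []}
weight = {0: 1.0, 1: 5.0, 2: 2.0, 3: 4.0}
print(tree_sparsity(children, weight, root=0, k=3))  # 10.0, i.e. {0, 1, 3}
```

A standard small-to-large argument bounds the total merge work by O(kn); the paper's contribution is to replace the exact merge with approximate, nearly-linear-time machinery while keeping the output subtree at exactly k nodes.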
Sparse Bayesian mass-mapping with uncertainties: hypothesis testing of structure
A crucial aspect of mass-mapping via weak lensing is the quantification of the uncertainty introduced during the reconstruction process. Properly accounting for these errors has been largely ignored to date. We present results from a new method that reconstructs maximum a posteriori (MAP) convergence maps by formulating an unconstrained Bayesian inference problem with Laplace-type ℓ1-norm sparsity-promoting priors, which we solve via convex optimization. Approaching mass-mapping in this manner allows us to exploit recent developments in probability concentration theory to infer theoretically conservative uncertainties for our MAP reconstructions, without relying on assumptions of Gaussianity. For the first time, these methods allow us to perform hypothesis testing of structure, from which it is possible to distinguish between physical objects and artifacts of the reconstruction. Here we present this new formalism and demonstrate the method on illustrative examples, before applying the developed formalism to two observational datasets of the Abell-520 cluster. In our Bayesian framework it is found that neither Abell-520 dataset can conclusively determine the physicality of individual local massive substructure at significant confidence. However, in both cases the recovered MAP estimators are consistent with both sets of data.
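As a hedged illustration (not the authors' code, with a generic measurement matrix standing in for the lensing forward model, and with sparsity imposed directly on x rather than under a transform), the MAP problem described above has the generic convex form min_x ½‖Φx − y‖² + μ‖x‖₁, which proximal-gradient (ISTA) iterations with soft-thresholding solve:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_map(Phi, y, mu, n_iter=200):
    """MAP estimate of  min_x 0.5*||Phi @ x - y||^2 + mu*||x||_1
    via proximal gradient descent (ISTA)."""
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ x - y)         # gradient of the data fidelity
        x = soft_threshold(x - grad / L, mu / L)
    return x

# Toy example: recover a sparse signal from noisy compressed measurements.
rng = np.random.default_rng(0)
x_true = np.zeros(100); x_true[[5, 42, 77]] = [3.0, -2.0, 4.0]
Phi = rng.normal(size=(60, 100)) / np.sqrt(60)
y = Phi @ x_true + 0.01 * rng.normal(size=60)
x_map = ista_map(Phi, y, mu=0.05)
print(np.flatnonzero(np.abs(x_map) > 0.5))   # should recover [5, 42, 77]
```

The convexity of this objective is what makes the subsequent uncertainty quantification tractable: probability concentration bounds for log-concave posteriors can be evaluated at the MAP point alone, without sampling.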
Frontiers in Nonparametric Statistics
The goal of this workshop was to discuss recent developments in nonparametric statistical inference. Particular foci were high-dimensional statistics, semiparametrics, adaptation, nonparametric Bayesian statistics, shape-constrained estimation, and statistical inverse problems. The close interaction of these topics with optimization, machine learning, and inverse problems was addressed as well.
Cleaning large correlation matrices: tools from random matrix theory
This review covers recent results concerning the estimation of large covariance matrices using tools from Random Matrix Theory (RMT). We introduce several RMT methods and analytical techniques, such as the Replica formalism and Free Probability, with an emphasis on the Marchenko-Pastur equation that provides information on the resolvent of multiplicatively corrupted noisy matrices. Special care is devoted to the statistics of the eigenvectors of the empirical correlation matrix, which turn out to be crucial for many applications. We show in particular how these results can be used to build consistent "Rotationally Invariant" estimators (RIE) for large correlation matrices when there is no prior on the structure of the underlying process. The last part of this review is dedicated to some real-world applications within financial markets as a case in point. We establish empirically the efficacy of the RIE framework, which is found to be superior in this case to all previously proposed methods. The case of additively (rather than multiplicatively) corrupted noisy matrices is also dealt with in a special Appendix. Several open problems and interesting technical developments are discussed throughout the paper.
Comment: 165 pages, article submitted to Physics Reports
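A hedged sketch, not taken from the review itself: the simplest RMT-inspired cleaning scheme, eigenvalue clipping, treats all empirical eigenvalues below the Marchenko-Pastur bulk edge (1 + √q)², with q = N/T, as noise and flattens them while preserving the trace. The review's rotationally invariant estimators are more refined, but build on the same ingredients (a spectral decomposition plus the Marchenko-Pastur law).

```python
import numpy as np

def clip_correlation(C, T):
    """Eigenvalue clipping of an N x N correlation matrix C estimated
    from T observations: eigenvalues below the Marchenko-Pastur edge
    are treated as noise and flattened, preserving the trace N."""
    N = C.shape[0]
    q = N / T
    lam_plus = (1.0 + np.sqrt(q)) ** 2       # upper edge of the MP bulk
    eigval, eigvec = np.linalg.eigh(C)
    signal = eigval > lam_plus
    cleaned = np.where(signal, eigval, 0.0)
    # Distribute the remaining trace uniformly over the noise eigenvalues.
    n_noise = np.count_nonzero(~signal)
    if n_noise:
        cleaned[~signal] = (N - cleaned[signal].sum()) / n_noise
    C_clean = eigvec @ np.diag(cleaned) @ eigvec.T
    # Re-normalize the diagonal so the result is again a correlation matrix.
    d = np.sqrt(np.diag(C_clean))
    return C_clean / np.outer(d, d)

# Example: N = 200 assets, T = 500 days of i.i.d. returns (pure noise),
# where clipping should collapse the spectrum to the identity.
rng = np.random.default_rng(1)
R = rng.normal(size=(500, 200))
C = np.corrcoef(R, rowvar=False)
print(np.linalg.cond(C), np.linalg.cond(clip_correlation(C, T=500)))
```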
Bayesian Variational Regularisation for Dark Matter Reconstruction with Uncertainty Quantification
Despite the great wealth of cosmological knowledge accumulated since the early 20th century, the nature of dark matter, which accounts for ~85% of the matter content of the universe, remains elusive. Unfortunately, though dark matter is scientifically interesting, with implications for our fundamental understanding of the Universe, it cannot be directly observed. Instead, dark matter may be inferred from, e.g., the optical distortion (lensing) of distant galaxies which, at linear order, manifests as a perturbation to the apparent magnitude (convergence) and ellipticity (shearing). Ensemble observations of the shear are collected and leveraged to construct estimates of the convergence, which can directly be related to the universal dark-matter distribution. Imminent stage IV surveys are forecast to accrue an unprecedented quantity of cosmological information, a discriminative portion of which is accessible through the convergence and is disproportionately concentrated at high angular resolutions, where the echoes of cosmological evolution under gravity are most apparent. Capitalising on advances in probability concentration theory, this thesis merges the paradigms of Bayesian inference and optimisation to develop hybrid convergence inference techniques which are scalable, statistically principled, and operate over the Euclidean plane, the celestial sphere, and the 3-dimensional ball. Such techniques can quantify the plausibility of inferences at one-millionth of the computational overhead of competing sampling methods. These Bayesian techniques are applied to the hotly debated Abell-520 merging cluster, concluding that the observational catalogues contain insufficient information to determine the existence of dark-matter self-interactions. Further, these techniques were applied to all public lensing catalogues, recovering what was then the largest global dark-matter mass map. The primary methodological contributions of this thesis depend only on posterior log-concavity, paving the way towards a potentially revolutionary, complete hybridisation with artificial intelligence techniques. These next-generation techniques are the first to operate over the full 3-dimensional ball, laying the foundations for statistically principled universal dark-matter cartography, and for the cosmological insights such advances may provide.
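The abstract does not spell out the mechanism behind the quoted speedup, but a key result from the probability concentration literature it builds on (Pereyra 2017; quoted here from memory, so treat the constants as indicative) states that for a log-concave posterior p(x|y) ∝ exp(−f(x)) on ℝ^N, the highest-posterior-density credible region C_α = {x : f(x) ≤ γ_α} at level 1 − α admits the analytic approximation

```latex
\[
  \hat{\gamma}_{\alpha} \;=\; f(\hat{x}_{\mathrm{MAP}}) + \sqrt{N}\,\tau_{\alpha} + N,
  \qquad
  \tau_{\alpha} = \sqrt{16 \log(3/\alpha)} ,
\]
```

so the plausibility of a structure can be tested by removing it from the MAP reconstruction and checking whether the surrogate's objective exceeds this threshold, at the cost of a single convex optimisation rather than an MCMC run.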