163 research outputs found

    The role of artificial intelligence-driven soft sensors in advanced sustainable process industries: a critical review

    Get PDF
    With the predicted depletion of natural resources and alarming environmental issues, sustainable development has become a popular as well as a much-needed concept in modern process industries. Hence, manufacturers are quite keen on adopting novel process monitoring techniques to enhance product quality and process efficiency while minimizing possible adverse environmental impacts. Hardware sensors are employed in process industries to aid process monitoring and control, but they are associated with many limitations such as disturbances to the process flow, measurement delays, frequent need for maintenance, and high capital costs. As a result, soft sensors have become an attractive alternative for predicting quality-related parameters that are ‘hard-to-measure’ using hardware sensors. Due to their promising features over hardware counterparts, they have been employed across different process industries. This article attempts to explore the state-of-the-art artificial intelligence (Al)-driven soft sensors designed for process industries and their role in achieving the goal of sustainable development. First, a general introduction is given to soft sensors, their applications in different process industries, and their significance in achieving sustainable development goals. AI-based soft sensing algorithms are then introduced. Next, a discussion on how AI-driven soft sensors contribute toward different sustainable manufacturing strategies of process industries is provided. This is followed by a critical review of the most recent state-of-the-art AI-based soft sensors reported in the literature. Here, the use of powerful AI-based algorithms for addressing the limitations of traditional algorithms, that restrict the soft sensor performance is discussed. Finally, the challenges and limitations associated with the current soft sensor design, application, and maintenance aspects are discussed with possible future directions for designing more intelligent and smart soft sensing technologies to cater the future industrial needs

    Current Challenges in the Application of Algorithms in Multi-institutional Clinical Settings

    Get PDF
    The Coronavirus disease pandemic has highlighted the importance of artificial intelligence in multi-institutional clinical settings. Particularly in situations where the healthcare system is overloaded, and a lot of data is generated, artificial intelligence has great potential to provide automated solutions and to unlock the untapped potential of acquired data. This includes the areas of care, logistics, and diagnosis. For example, automated decision support applications could tremendously help physicians in their daily clinical routine. Especially in radiology and oncology, the exponential growth of imaging data, triggered by a rising number of patients, leads to a permanent overload of the healthcare system, making the use of artificial intelligence inevitable. However, the efficient and advantageous application of artificial intelligence in multi-institutional clinical settings faces several challenges, such as accountability and regulation hurdles, implementation challenges, and fairness considerations. This work focuses on the implementation challenges, which include the following questions: How to ensure well-curated and standardized data, how do algorithms from other domains perform on multi-institutional medical datasets, and how to train more robust and generalizable models? Also, questions of how to interpret results and whether there exist correlations between the performance of the models and the characteristics of the underlying data are part of the work. Therefore, besides presenting a technical solution for manual data annotation and tagging for medical images, a real-world federated learning implementation for image segmentation is introduced. Experiments on a multi-institutional prostate magnetic resonance imaging dataset showcase that models trained by federated learning can achieve similar performance to training on pooled data. Furthermore, Natural Language Processing algorithms with the tasks of semantic textual similarity, text classification, and text summarization are applied to multi-institutional, structured and free-text, oncology reports. The results show that performance gains are achieved by customizing state-of-the-art algorithms to the peculiarities of the medical datasets, such as the occurrence of medications, numbers, or dates. In addition, performance influences are observed depending on the characteristics of the data, such as lexical complexity. The generated results, human baselines, and retrospective human evaluations demonstrate that artificial intelligence algorithms have great potential for use in clinical settings. However, due to the difficulty of processing domain-specific data, there still exists a performance gap between the algorithms and the medical experts. In the future, it is therefore essential to improve the interoperability and standardization of data, as well as to continue working on algorithms to perform well on medical, possibly, domain-shifted data from multiple clinical centers

    Transfer Learning of Deep Learning Models for Cloud Masking in Optical Satellite Images

    Get PDF
    Los satélites de observación de la Tierra proporcionan una oportunidad sin precedentes para monitorizar nuestro planeta a alta resolución tanto espacial como temporal. Sin embargo, para procesar toda esta cantidad creciente de datos, necesitamos desarrollar modelos rápidos y precisos adaptados a las características específicas de los datos de cada sensor. Para los sensores ópticos, detectar las nubes en la imagen es un primer paso inevitable en la mayoría de aplicaciones tanto terrestres como oceánicas. Aunque detectar nubes brillantes y opacas es relativamente fácil, identificar automáticamente nubes delgadas semitransparentes o diferenciar nubes de nieve o superficies brillantes es mucho más difícil. Además, en el escenario actual, donde el número de sensores en el espacio crece constantemente, desarrollar metodologías para transferir modelos que funcionen con datos de nuevos satélites es una necesidad urgente. Por tanto, los objetivos de esta tesis son desarrollar modelos precisos de detección de nubes que exploten las diferentes propiedades de las imágenes de satélite y desarrollar metodologías para transferir esos modelos a otros sensores. La tesis está basada en cuatro trabajos los cuales proponen soluciones a estos problemas. En la primera contribución, "Multitemporal cloud masking in the Google Earth Engine", implementamos un modelo de detección de nubes multitemporal que se ejecuta en la plataforma Google Earth Engine y que supera los modelos operativos de Landsat-8. La segunda contribución, "Transferring deep learning models for Cloud Detection between Landsat-8 and Proba-V", es un caso de estudio de transferencia de un algoritmo de detección de nubes basado en aprendizaje profundo de Landsat-8 (resolución 30m, 12 bandas espectrales y muy buena calidad radiométrica) a Proba-V, que tiene una resolución de 333m, solo cuatro bandas y una calidad radiométrica peor. El tercer artículo, "Cross sensor adversarial domain adaptation of Landsat-8 and Proba-V images for cloud detection", propone aprender una transformación de adaptación de dominios que haga que las imágenes de Proba-V se parezcan a las tomadas por Landsat-8 con el objetivo de transferir productos diseñados con datos de Landsat-8 a Proba-V. Finalmente, la cuarta contribución, "Towards global flood mapping onboard low cost satellites with machine learning", aborda simultáneamente la detección de inundaciones y nubes con un único modelo de aprendizaje profundo, implementado para que pueda ejecutarse a bordo de un CubeSat (ϕSat-I) con un chip acelerador de aplicaciones de inteligencia artificial. El modelo está entrenado en imágenes Sentinel-2 y demostramos cómo transferir este modelo a la cámara del ϕSat-I. Este modelo se lanzó en junio de 2021 a bordo de la misión WildRide de D-Orbit para probar su funcionamiento en el espacio.Remote sensing sensors onboard Earth observation satellites provide a great opportunity to monitor our planet at high spatial and temporal resolutions. Nevertheless, to process all this ever-growing amount of data, we need to develop fast and accurate models adapted to the specific characteristics of the data acquired by each sensor. For optical sensors, detecting the clouds present in the image is an unavoidable first step for most of the land and ocean applications. Although detecting bright and opaque clouds is relatively easy, automatically identifying thin semi-transparent clouds or distinguishing clouds from snow or bright surfaces is much more challenging. In addition, in the current scenario where the number of sensors in orbit is constantly growing, developing methodologies to transfer models across different satellite data is a pressing need. Henceforth, the overreaching goal of this Thesis is to develop accurate cloud detection models that exploit the different properties of the satellite images, and to develop methodologies to transfer those models across different sensors. The four contributions of this Thesis are stepping stones in that direction. In the first contribution,"Multitemporal cloud masking in the Google Earth Engine", we implemented a lightweight multitemporal cloud detection model that runs on the Google Earth Engine platform and which outperforms the operational models for Landsat-8. The second contribution, "Transferring deep learning models for Cloud Detection between Landsat-8 and Proba-V", is a case-study of transferring a deep learning based cloud detection algorithm from Landsat-8 (30m resolution, 12 spectral bands and very good radiometric quality) to Proba-V, which has a lower{333m resolution, only four bands and a less accurate radiometric quality. The third paper, "Cross sensor adversarial domain adaptation of Landsat-8 and Proba-V images for cloud detection", proposes a learning-based domain adaptation transformation of Proba-V images to resemble those taken by Landsat-8, with the objective of transferring products designed on Landsat-8 to Proba-V. Finally, the fourth contribution, "Towards global flood mapping onboard low cost satellites with machine learning", tackles simultaneously cloud and flood water detection with a single deep learning model, which was implemented to run onboard a CubeSat (ϕSat-I) with an AI accelerator chip. In this case, the model is trained on Sentinel-2 and transferred to theϕSat-I camera. This model was launched in June 2021 onboard the Wild Ride D-Orbit mission in order to test its performance in space

    Discovering structure without labels

    Get PDF
    The scarcity of labels combined with an abundance of data makes unsupervised learning more attractive than ever. Without annotations, inductive biases must guide the identification of the most salient structure in the data. This thesis contributes to two aspects of unsupervised learning: clustering and dimensionality reduction. The thesis falls into two parts. In the first part, we introduce Mod Shift, a clustering method for point data that uses a distance-based notion of attraction and repulsion to determine the number of clusters and the assignment of points to clusters. It iteratively moves points towards crisp clusters like Mean Shift but also has close ties to the Multicut problem via its loss function. As a result, it connects signed graph partitioning to clustering in Euclidean space. The second part treats dimensionality reduction and, in particular, the prominent neighbor embedding methods UMAP and t-SNE. We analyze the details of UMAP's implementation and find its actual loss function. It differs drastically from the one usually stated. This discrepancy allows us to explain some typical artifacts in UMAP plots, such as the dataset size-dependent tendency to produce overly crisp substructures. Contrary to existing belief, we find that UMAP's high-dimensional similarities are not critical to its success. Based on UMAP's actual loss, we describe its precise connection to the other state-of-the-art visualization method, t-SNE. The key insight is a new, exact relation between the contrastive loss functions negative sampling, employed by UMAP, and noise-contrastive estimation, which has been used to approximate t-SNE. As a result, we explain that UMAP embeddings appear more compact than t-SNE plots due to increased attraction between neighbors. Varying the attraction strength further, we obtain a spectrum of neighbor embedding methods, encompassing both UMAP- and t-SNE-like versions as special cases. Moving from more attraction to more repulsion shifts the focus of the embedding from continuous, global to more discrete and local structure of the data. Finally, we emphasize the link between contrastive neighbor embeddings and self-supervised contrastive learning. We show that different flavors of contrastive losses can work for both of them with few noise samples

    Deep Neural Networks and Tabular Data: Inference, Generation, and Explainability

    Get PDF
    Over the last decade, deep neural networks have enabled remarkable technological advancements, potentially transforming a wide range of aspects of our lives in the future. It is becoming increasingly common for deep-learning models to be used in a variety of situations in the modern life, ranging from search and recommendations to financial and healthcare solutions, and the number of applications utilizing deep neural networks is still on the rise. However, a lot of recent research efforts in deep learning have focused primarily on neural networks and domains in which they excel. This includes computer vision, audio processing, and natural language processing. It is a general tendency for data in these areas to be homogeneous, whereas heterogeneous tabular datasets have received relatively scant attention despite the fact that they are extremely prevalent. In fact, more than half of the datasets on the Google dataset platform are structured and can be represented in a tabular form. The first aim of this study is to provide a thoughtful and comprehensive analysis of deep neural networks' application to modeling and generating tabular data. Apart from that, an open-source performance benchmark on tabular data is presented, where we thoroughly compare over twenty machine and deep learning models on heterogeneous tabular datasets. The second contribution relates to synthetic tabular data generation. Inspired by their success in other homogeneous data modalities, deep generative models such as variational autoencoders and generative adversarial networks are also commonly applied for tabular data generation. However, the use of Transformer-based large language models (which are also generative) for tabular data generation have been received scant research attention. Our contribution to this literature consists of the development of a novel method for generating tabular data based on this family of autoregressive generative models that, on multiple challenging benchmarks, outperformed the current state-of-the-art methods for tabular data generation. Another crucial aspect for a deep-learning data system is that it needs to be reliable and trustworthy to gain broader acceptance in practice, especially in life-critical fields. One of the possible ways to bring trust into a data-driven system is to use explainable machine-learning methods. In spite of this, the current explanation methods often fail to provide robust explanations due to their high sensitivity to the hyperparameter selection or even changes of the random seed. Furthermore, most of these methods are based on feature-wise importance, ignoring the crucial relationship between variables in a sample. The third aim of this work is to address both of these issues by offering more robust and stable explanations, as well as taking into account the relationships between variables using a graph structure. In summary, this thesis made a significant contribution that touched many areas related to deep neural networks and heterogeneous tabular data as well as the usage of explainable machine learning methods

    Deep generative modelling of the imaged human brain

    Get PDF
    Human-machine symbiosis is a very promising opportunity for the field of neurology given that the interpretation of the imaged human brain is a trivial feat for neither entity. However, before machine learning systems can be used in real world clinical situations, many issues with automated analysis must first be solved. In this thesis I aim to address what I consider the three biggest hurdles to the adoption of automated machine learning interpretative systems. For each issue, I will first elucidate the reader on its importance given the overarching narratives of both neurology and machine learning, and then showcase my proposed solutions to these issues through the use of deep generative models of the imaged human brain. First, I start by addressing what is an uncontroversial and universal sign of intelligence: the ability to extrapolate knowledge to unseen cases. Human neuroradiologists have studied the anatomy of the healthy brain and can therefore, with some success, identify most pathologies present on an imaged brain, even without having ever been previously exposed to them. Current discriminative machine learning systems require vast amounts of labelled data in order to accurately identify diseases. In this first part I provide a generative framework that permits machine learning models to more efficiently leverage unlabelled data for better diagnoses with either none or small amounts of labels. Secondly, I address a major ethical concern in medicine: equitable evaluation of all patients, regardless of demographics or other identifying characteristics. This is, unfortunately, something that even human practitioners fail at, making the matter ever more pressing: unaddressed biases in data will become biases in the models. To address this concern I suggest a framework through which a generative model synthesises demographically counterfactual brain imaging to successfully reduce the proliferation of demographic biases in discriminative models. Finally, I tackle the challenge of spatial anatomical inference, a task at the centre of the field of lesion-deficit mapping, which given brain lesions and associated cognitive deficits attempts to discover the true functional anatomy of the brain. I provide a new Bayesian generative framework and implementation that allows for greatly improved results on this challenge, hopefully, paving part of the road towards a greater and more complete understanding of the human brain

    Generative adversarial networks review in earthquake-related engineering fields

    Get PDF
    Within seismology, geology, civil and structural engineering, deep learning (DL), especially via generative adversarial networks (GANs), represents an innovative, engaging, and advantageous way to generate reliable synthetic data that represent actual samples' characteristics, providing a handy data augmentation tool. Indeed, in many practical applications, obtaining a significant number of high-quality information is demanding. Data augmentation is generally based on artificial intelligence (AI) and machine learning data-driven models. The DL GAN-based data augmentation approach for generating synthetic seismic signals revolutionized the current data augmentation paradigm. This study delivers a critical state-of-art review, explaining recent research into AI-based GAN synthetic generation of ground motion signals or seismic events, and also with a comprehensive insight into seismic-related geophysical studies. This study may be relevant, especially for the earth and planetary science, geology and seismology, oil and gas exploration, and on the other hand for assessing the seismic response of buildings and infrastructures, seismic detection tasks, and general structural and civil engineering applications. Furthermore, highlighting the strengths and limitations of the current studies on adversarial learning applied to seismology may help to guide research efforts in the next future toward the most promising directions

    A Comprehensive Survey on Deep Graph Representation Learning

    Full text link
    Graph representation learning aims to effectively encode high-dimensional sparse graph-structured data into low-dimensional dense vectors, which is a fundamental task that has been widely studied in a range of fields, including machine learning and data mining. Classic graph embedding methods follow the basic idea that the embedding vectors of interconnected nodes in the graph can still maintain a relatively close distance, thereby preserving the structural information between the nodes in the graph. However, this is sub-optimal due to: (i) traditional methods have limited model capacity which limits the learning performance; (ii) existing techniques typically rely on unsupervised learning strategies and fail to couple with the latest learning paradigms; (iii) representation learning and downstream tasks are dependent on each other which should be jointly enhanced. With the remarkable success of deep learning, deep graph representation learning has shown great potential and advantages over shallow (traditional) methods, there exist a large number of deep graph representation learning techniques have been proposed in the past decade, especially graph neural networks. In this survey, we conduct a comprehensive survey on current deep graph representation learning algorithms by proposing a new taxonomy of existing state-of-the-art literature. Specifically, we systematically summarize the essential components of graph representation learning and categorize existing approaches by the ways of graph neural network architectures and the most recent advanced learning paradigms. Moreover, this survey also provides the practical and promising applications of deep graph representation learning. Last but not least, we state new perspectives and suggest challenging directions which deserve further investigations in the future

    Cell type identification, differential expression analysis and trajectory inference in single-cell transcriptomics

    Get PDF
    Single-cell RNA-sequencing (scRNA-seq) is a cutting-edge technology that enables to quantify the transcriptome, the set of expressed RNA transcripts, of a group of cells at the single-cell level. It represents a significant upgrade from bulk RNA-seq, which measures the combined signal of thousands of cells. Measuring gene expression by bulk RNA-seq is an invaluable tool for biomedical researchers who want to understand how cells alter their gene expression due to an illness, differentiation, ternal stimulus, or other events. Similarly, scRNA-seq has become an essential method for biomedical researchers, and it has brought several new applications previously unavailable with bulk RNA-seq. scRNA-seq has the same applications as bulk RNA-seq. However, the single-cell resolution also enables cell annotation based on gene markers of clusters, that is, cell populations that have been identified based on machine learning to be, on average, dissimilar at the transcriptomic level. Researchers can use the cell clusters to detect cell-type-specific gene expression changes between conditions such as case and control groups. Clustering can sometimes even discover entirely new cell types. Besides the cluster-level representation, the single-cell resolution also enables to model cells as a trajectory, representing how the cells are related at the cell level and what is the dynamic differentiation process that the cells undergo in a tissue. This thesis introduces new computational methods for cell type identification and trajectory inference from scRNA-seq data. A new cell type identification method (ILoReg) was proposed, which enables high-resolution clustering of cells into populations with subtle transcriptomic differences. In addition, two new trajectory inference methods were developed: scShaper, which is an accurate and robust method for inferring linear trajectories; and Totem, which is a user-friendly and flexible method for inferring tree-shaped trajectories. In addition, one of the works benchmarked methods for detecting cell-type-specific differential states from scRNA-seq data with multiple subjects per comparison group, requiring tailored methods to confront false discoveries. KEYWORDS: Single-cell RNA sequencing, transcriptome, cell type identification, trajectory inference, differential expressionYksisoluinen RNA-sekvensointi on huipputeknologia, joka mahdollistaa transkriptomin eli ilmentyneiden RNA-transkriptien laskennallisen määrittämisen joukolle soluja yhden solun tarkkuudella, ja sen kehittäminen oli merkittävä askel eteenpäin perinteisestä bulkki-RNA-sekvensoinnista, joka mittaa tuhansien solujen yhteistä signaalia. Bulkki-RNA-sekvensointi on tärkeä työväline biolääketieteen tutkijoille, jotka haluavat ymmärtää miten solut muuttavat geenien ilmentymistä sairauden, erilaistumisen, ulkoisen ärsykkeen tai muun tapahtuman seurauksena. Yksisoluisesta RNA-sekvensoinnista on vastaavasti kehittynyt tärkeä työväline tutkijoille, ja se on tuonut useita uusia sovelluksia. Yksisoluisella RNA-sekvensoinnilla on samat sovellukset kuin bulkki-RNA-sekvensoinnilla, mutta sen lisäksi se mahdollistaa solujen tunnistamisen geenimarkkerien perusteella. Geenimarkkerit etsitään tilastollisin menetelmin solupopulaatioille, joiden on tunnistettu koneoppimisen menetelmin muodostavan transkriptomitasolla keskenään erilaisia joukkoja eli klustereita. Tutkijat voivat hyödyntää soluklustereita tutkimaan geeniekspressioeroja solutyyppien sisällä esimerkiksi sairaiden ja terveiden välillä, ja joskus klusterointi voi jopa tunnistaa uusia solutyyppejä. Yksisolutason mittaukset mahdollistavat myös solujen mallintamisen trajektorina, joka esittää kuinka solut kehittyvät dynaamisesti toisistaan geenien ilmentymistä vaativien prosessien aikana. Tämä väitöskirja esittelee uusia laskennallisia menetelmiä solutyyppien ja trajektorien tunnistamiseen yksisoluisesta RNA-sekvensointidatasta. Väitöskirja esittelee uuden solutyyppitunnistusmenetelmän (ILoReg), joka mahdollistaa hienovaraisia geeniekspressioeroja sisältävien solutyyppien tunnistamisen. Sen lisäksi väitöskirjassa kehitettiin kaksi uutta trajektorin tunnistusmenetelmää: scShaper, joka on tarkka ja robusti menetelmä lineaaristen trajektorien tunnistamiseen, sekä Totem, joka on käyttäjäystävällinen ja joustava menetelmä puumallisten trajektorien tunnistamiseen. Lopuksi väitöskirjassa vertailtiin menetelmiä solutyyppien sisäisten geeniekspressioerojen tunnistamiseen ryhmien välillä, joissa on useita koehenkilöitä tai muita biologisia replikaatteja, mikä vaatii erityisiä menetelmiä väärien positiivisten löydösten vähentämiseen. ASIASANAT: yksisoluinen RNA-sekvensointi, klusterointi, trajektorin tunnistus, geeniekspressi

    Computational Methods for Protein Inference in Shotgun Proteomics Experiments

    Get PDF
    In den letzten Jahrzehnten kam es zu einem signifikanten Anstiegs des Einsatzes von Hochdurchsatzmethoden in verschiedensten Bereichen der Naturwissenschaften, welche zu einem regelrechten Paradigmenwechsel führte. Eine große Anzahl an neuen Technologien wurde entwickelt um die Quantifizierung von Molekülen, die in verschiedenste biologische Prozesse involviert sind, voranzutreiben und zu beschleunigen. Damit einhergehend konnte eine beträchtliche Steigerung an Daten festgestellt werden, die durch diese verbesserten Methoden generiert wurden. Durch die Bereitstellung von computergestützten Verfahren zur Analyse eben dieser Masse an Rohdaten, spielt der Forschungsbereich der Bioinformatik eine immer größere Rolle bei der Extraktion biologischer Erkenntnisse. Im Speziellen hilft die computergestützte Massenspektrometrie bei der Prozessierung, Analyse und Visualisierung von Daten aus massenspektrometrischen Hochdursatzexperimenten. Bei der Erforschung der Gesamtheit aller Proteine einer Zelle oder einer anderweitigen Probe biologischen Materials, kommen selbst neueste Methoden an ihre Grenzen. Deswegen greifen viele Labore zu einer, dem Massenspektrometer vorgeschalteten, Verdauung der Probe um die Komplexität der zu messenden Moleküle zu verringern. Diese sogenannten "Bottom-up"-Proteomikexperimente mit Massenspektrometern führen allerdings zu einer erhöhten Schwierigkeit bei der anschließenden computergestützen Analyse. Durch die Verdauung von Proteinen zu Peptiden müssen komplexe Mehrdeutigkeiten während Proteininferenz, Proteingruppierung und Proteinquantifizierung berücksichtigt und/oder aufgelöst werden. Im Rahmen dieser Dissertation stellen wir mehrere Entwicklungen vor, die dabei helfen sollen eine effiziente und vollständig automatisierte Analyse von komplexen und umfangreichen \glqq Bottom-up\grqq{}-Proteomikexperimenten zu ermöglichen. Um die hinderliche Komplexität diskreter, Bayes'scher Proteininferenzmethoden zu verringern, wird neuerdings von sogenannten Faltungsbäumen (engl. "convolution trees") Gebrauch gemacht. Diese bieten bis jetzt jedoch keine genaue und gleichzeitig numerisch stabile Möglichkeit um "max-product"-Inferenz zu betreiben. Deswegen wird in dieser Dissertation zunächst eine neue Methode beschrieben die das mithilfe eines stückweisen bzw. extrapolierendem Verfahren ermöglicht. Basierend auf der Integration dieser Methode in eine mitentwickelte Bibliothek für Bayes'sche Inferenz, wird dann ein OpenMS-Tool für Proteininferenz präsentiert. Dieses Tool ermöglicht effiziente Proteininferenz auf Basis eines diskreten Bayes'schen Netzwerks mithilfe eines "loopy belief propagation" Algorithmus'. Trotz der streng probabilistischen Formulierung des Problems übertrifft unser Verfahren die meisten etablierten Methoden in Recheneffizienz. Das Interface des Algorithmus' bietet außerdem einzigartige Eingabe- und Ausgabeoptionen, wie z.B. das Regularisieren der Anzahl von Proteinen in einer Gruppe, proteinspezifische "Priors", oder rekalibrierte "Posteriors" der Peptide. Schließlich zeigt diese Arbeit einen kompletten, einfach zu benutzenden, aber trotzdem skalierenden Workflow für Proteininferenz und -quantifizierung, welcher um das neue Tool entwickelt wurde. Die Pipeline wurde in nextflow implementiert und ist Teil einer Gruppe von standardisierten, regelmäßig getesteten und von einer Community gepflegten Standardworkflows gebündelt unter dem Projekt nf-core. Unser Workflow ist in der Lage selbst große Datensätze mit komplizierten experimentellen Designs zu prozessieren. Mit einem einzigen Befehl erlaubt er eine (Re-)Analyse von lokalen oder öffentlich verfügbaren Datensätzen mit kompetetiver Genauigkeit und ausgezeichneter Performance auf verschiedensten Hochleistungsrechenumgebungen oder der Cloud.Since the beginning of this millennium, the advent of high-throughput methods in numerous fields of the life sciences led to a shift in paradigms. A broad variety of technologies emerged that allow comprehensive quantification of molecules involved in biological processes. Simultaneously, a major increase in data volume has been recorded with these techniques through enhanced instrumentation and other technical advances. By supplying computational methods that automatically process raw data to obtain biological information, the field of bioinformatics plays an increasingly important role in the analysis of the ever-growing mass of data. Computational mass spectrometry in particular, is a bioinformatics field of research which provides means to gather, analyze and visualize data from high-throughput mass spectrometric experiments. For the study of the entirety of proteins in a cell or an environmental sample, even current techniques reach limitations that need to be circumvented by simplifying the samples subjected to the mass spectrometer. These pre-digested (so-called bottom-up) proteomics experiments then pose an even bigger computational burden during analysis since complex ambiguities need to be resolved during protein inference, grouping and quantification. In this thesis, we present several developments in the pursuit of our goal to provide means for a fully automated analysis of complex and large-scale bottom-up proteomics experiments. Firstly, due to prohibitive computational complexities in state-of-the-art Bayesian protein inference techniques, a refined, more stable technique for performing inference on sums of random variables was developed to enable a variation of standard Bayesian inference for the problem. nextflow and part of a set of standardized, well-tested, and community-maintained workflows by the nf-core collective. Our workflow runs on large-scale data with complex experimental designs and allows a one-command analysis of local and publicly available data sets with state-of-the-art accuracy on various high-performance computing environments or the cloud
    corecore