156 research outputs found

    New Statistical Algorithms for the Analysis of Mass Spectrometry Time-Of-Flight Mass Data with Applications in Clinical Diagnostics

    Mass spectrometry (MS) based techniques have emerged as a standard for large-scale protein analysis. Ongoing progress in more sensitive machines and improved data analysis algorithms has led to a constant expansion of its fields of application. Recently, MS was introduced into clinical proteomics with the prospect of early disease detection using proteomic pattern matching. Analyzing biological samples (e.g. blood) by mass spectrometry generates mass spectra that represent the components (molecules) contained in a sample as masses and their respective relative concentrations. In this work, we are interested in those components that are constant within a group of individuals but differ markedly between individuals of two distinct groups. These distinguishing components, which depend on a particular medical condition, are generally called biomarkers. Since not all biomarkers found by the algorithms are of equal (discriminating) quality, we are only interested in a small biomarker subset that, as a combination, can be used as a fingerprint for a disease. Once a fingerprint for a particular disease (or medical condition) is identified, it can be used in clinical diagnostics to classify unknown spectra. In this thesis we have developed new algorithms for the automatic extraction of disease-specific fingerprints from mass spectrometry data. Special emphasis has been put on designing methods that are highly sensitive with respect to signal detection. Thanks to our statistically based approach, our methods are able to detect signals, such as those of hormones, even below the noise level inherent in data acquired by common MS machines. To give collaborating groups access to these new classes of algorithms, we have created a web-based analysis platform that provides all necessary interfaces for data transfer, data analysis and result inspection. To prove the platform's practical relevance, it has been utilized in several clinical studies, two of which are presented in this thesis. In these studies it could be shown that our platform is superior to commercial systems with respect to fingerprint identification. As an outcome of these studies, several fingerprints for different cancer types (bladder, kidney, testicle, pancreas, colon and thyroid) have been detected and validated. The clinical partners emphasize that these results would have been impossible with a less sensitive analysis tool (such as the currently available systems). In addition to the issue of reliably finding and handling signals in noise, we faced the problem of handling very large amounts of data, since an average dataset of an individual is about 2.5 Gigabytes in size and we have data from hundreds to thousands of persons. To cope with these large datasets, we developed a new framework for a heterogeneous (quasi) ad-hoc Grid - an infrastructure that allows the integration of thousands of computing resources (e.g. Desktop Computers, Computing Clusters or specialized hardware, such as IBM's Cell Processor in a Playstation 3)
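The selection criterion described in this abstract (components that are constant within a group but differ between two groups) can be illustrated with a small sketch. The abstract does not state which statistic the thesis uses, so the Fisher-style score below is a swapped-in, commonly used stand-in, and the names `fisher_score` and `rank_peaks` are hypothetical. It assumes the spectra have already been aligned so that column j is the same peak in every spectrum.

```python
from statistics import mean, pvariance
from typing import List, Sequence


def fisher_score(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Squared gap between group means divided by the summed within-group variance.

    Assumption: a generic discriminative score, not the thesis's own statistic.
    """
    gap = (mean(group_a) - mean(group_b)) ** 2
    spread = pvariance(group_a) + pvariance(group_b)
    if spread == 0:
        # Both groups are perfectly constant: infinitely separable if the means differ.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def rank_peaks(
    spectra_a: List[Sequence[float]],
    spectra_b: List[Sequence[float]],
    top_k: int = 3,
) -> List[int]:
    """Return the indices of the top_k peaks that best separate the two groups.

    spectra_x[i][j] is the intensity of peak j in spectrum i of group x.
    """
    n_peaks = len(spectra_a[0])
    scores = [
        fisher_score([s[j] for s in spectra_a], [s[j] for s in spectra_b])
        for j in range(n_peaks)
    ]
    return sorted(range(n_peaks), key=lambda j: scores[j], reverse=True)[:top_k]
```

A small subset of the highest-ranked peaks would then play the role of the "fingerprint" in this simplified picture; the thesis's detection of signals below the noise level is not modelled here.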

    Application of machine learning and deep learning for proteomics data analysis


    Deriving statistical inference from the application of artificial neural networks to clinical metabolomics data

    Metabolomics data are complex, with a high degree of multicollinearity. As such, multivariate linear projection methods, such as partial least squares discriminant analysis (PLS-DA), have become standard. Non-linear projection methods, typified by Artificial Neural Networks (ANNs), may be more appropriate for modelling potential nonlinear latent covariance; however, they are not widely used because of the difficulty of deriving statistical inference, and thus biological interpretation, from them. Herein, we illustrate the utility of ANNs for clinical metabolomics using publicly available data sets and develop an open framework for deriving and visualising statistical inference from ANNs equivalent to standard PLS-DA methods.

    Knowledge Management Approaches for predicting Biomarker and Assessing its Impact on Clinical Trials

    The recent success of companion diagnostics, along with increasing regulatory pressure for better identification of the target population, has created an unprecedented incentive for drug discovery companies to invest in novel strategies for stratified biomarker discovery. In keeping with this trend, trials with stratified biomarkers in drug development have quadrupled in the last decade, but they still represent a small part of all interventional trials, reflecting the multiple challenges of co-developing therapeutic compounds and companion diagnostics. To overcome these challenges, varied knowledge management and systems biology approaches are adopted in the clinic to analyze and interpret an ever-increasing collection of OMICS data. By semi-automatic screening of more than 150,000 trials, we filtered trials with stratified biomarkers to analyse their therapeutic focus and major drivers, and elucidated the impact of stratified biomarker programs on trial duration and completion. The analysis clearly shows that cancer is the major focus for trials with stratified biomarkers. However, targeted therapies in cancer require more accurate stratification of the patient population. This can be augmented by a fresh approach: selecting a new class of biomolecules, i.e. miRNA, as candidate stratification biomarkers. miRNA plays an important role in tumorigenesis by regulating the expression of oncogenes and tumor suppressors, thus affecting cell proliferation, differentiation, apoptosis, invasion and angiogenesis. miRNAs are potential biomarkers in different cancers. However, the relationship between the response of cancer patients to targeted therapy and the resulting modifications of the miRNA transcriptome in pathway regulation is poorly understood. The ever-increasing number of pathway and miRNA-mRNA interaction databases, together with freely available mRNA and miRNA expression data for multiple cancer therapies, has created an unprecedented opportunity to decipher the role of miRNAs in the early prediction of therapeutic efficacy in diseases. We present a novel SMARTmiR algorithm to predict the role of miRNA as a therapeutic biomarker for treatment with an anti-EGFR monoclonal antibody, i.e. cetuximab, in colorectal cancer (CRC). An optimised and fully automated version of the algorithm has the potential to be used as a clinical decision support tool. Moreover, this research will provide the scientific community with a comprehensive and valuable knowledge map demonstrating functional biomolecular interactions in colorectal cancer. This research also detected seven miRNAs, i.e. hsa-miR-145, hsa-miR-27a, hsa-miR-155, hsa-miR-182, hsa-miR-15a, hsa-miR-96 and hsa-miR-106a, as top stratified biomarker candidates for cetuximab therapy in CRC, which were not reported previously. Finally, a prospective plan for the future of biomarker research in cancer drug development has been drawn up, focusing on reducing the risk of the most expensive phase III drug failures.

    Metabolome-based studies of virulence factors in Pseudomonas aeruginosa

    Pseudomonas aeruginosa is an opportunistic pathogen and an important causative agent of potentially life-threatening nosocomial infections in predisposed patients. The Gram-negative bacterium produces a large and diverse repertoire of small-molecule secondary metabolites that serve as regulators and effectors of its virulence. In this study, a range of mass spectrometry-based bacterial metabolomics approaches was used to investigate these small-molecule virulence factors and their interplay with pseudomonal metabolism as well as with phenotypic traits related to virulence. The groundwork was laid by exploring the metabolite inventory of P. aeruginosa and improving the coverage of its metabolome by the application of a custom software named CluMSID, that clusters analytes based on similarities of their MS² spectra. CluMSID led to the annotation of, i.a., 27 novel members of the class of alkylquinolone quorum sensing signalling molecules, which represent crucial players in the highly complex network that regulates pseudomonal virulence. The tool was developed towards a versatile and user-friendly R package hosted on Bioconductor, whose functionalities and benefits are described in detail. The new findings on the alkylquinolone chemodiversity led to further studies with a mechanistic focus that probed the substrate specificity of the enzyme complex PqsBC. It was demonstrated that PqsBC accepts different medium-chain acyl-coenzyme A substrates for the condensation with 2-aminobenzoylacetate and thereby produces alkylquinolones with various side chain lengths, whose distribution is a function of substrate specificity and substrate availability. Moreover, it was shown that PqsBC also synthesises alkylquinolones with unsaturated side chains. 
The focus was further broadened from metabolite and pathway-centred questions to a more global perspective on pseudomonal virulence and metabolism, which directed attention at PrmC, an enzyme with a partially unknown function indispensable for in vivo virulence. An untargeted metabolomics experiment yielded insights into the role of PrmC and its influence on the pseudomonal endo- and exometabolome. Finally, clinical P. aeruginosa strains with different virulence phenotypes were examined by untargeted metabolomics in order to disclose metabolic variation and interconnections between virulence and metabolism. The analysis resulted in the discovery of a putative virulence biomarker and enabled the construction of a random forest classification model for certain virulence phenotypes based only on metabolomics data. In summary, this study demonstrated the potential of metabolomics for the investigation of P. aeruginosa virulence factors and thereby contributed towards the comprehension of the complex interplay of metabolism and virulence in this important pathogen.
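The abstract says only that CluMSID clusters analytes by the similarity of their MS² spectra, without naming the similarity measure. As a minimal sketch of what such a comparison can look like, the Python below computes a cosine similarity between two spectra after binning their m/z values. The choice of cosine similarity, the bin width, and the function names are assumptions for illustration; this is not the CluMSID R API.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

# A spectrum is a list of (m/z, intensity) pairs.
Spectrum = List[Tuple[float, float]]


def bin_spectrum(spectrum: Spectrum, bin_width: float = 0.01) -> Dict[int, float]:
    """Sum intensities of peaks that fall into the same m/z bin."""
    binned: Dict[int, float] = defaultdict(float)
    for mz, intensity in spectrum:
        binned[round(mz / bin_width)] += intensity
    return dict(binned)


def cosine_similarity(a: Spectrum, b: Spectrum, bin_width: float = 0.01) -> float:
    """Cosine of the angle between two binned spectra (0 = no shared peaks, 1 = same shape)."""
    va, vb = bin_spectrum(a, bin_width), bin_spectrum(b, bin_width)
    dot = sum(value * vb.get(key, 0.0) for key, value in va.items())
    norm_a = math.sqrt(sum(value * value for value in va.values()))
    norm_b = math.sqrt(sum(value * value for value in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty spectrum is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

A matrix of pairwise similarities computed this way could be turned into distances (for example 1 minus the similarity) and passed to any standard clustering routine, so that structurally related analytes with similar fragmentation patterns end up in the same group.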

    Algorithms for complex systems in the life sciences: AI for gene fusion prioritization and multi-omics data integration

    Due to the continuous increase in the number and complexity of the genomics and biological data, new computer science techniques are needed to analyse these data and provide valuable insights into the main features. The thesis research topic consists of designing and developing bioinformatics methods for complex systems in life sciences to provide informative models about biological processes. The thesis is divided into two main sub-topics. The first sub-topic concerns machine and deep learning techniques applied to the analysis of aberrant genetic sequences like, for instance, gene fusions. The second one is the development of statistics and deep learning techniques for heterogeneous biological and clinical data integration. Referring to the first sub-topic, a gene fusion is a biological event in which two distinct regions in the DNA create a new fused gene. Gene fusions are a relevant issue in medicine because many gene fusions are involved in cancer, and some of them can even be used as cancer predictors. However, not all of them are necessarily oncogenic. The first part of this thesis is devoted to the automated recognition of oncogenic gene fusions, a very open and challenging problem in cancer development analysis. In this context, an automated model for the recognition of oncogenic gene fusions relying exclusively on the amino acid sequence of the resulting proteins has been developed. The main contributions consist of: 1. creation of a proper database used to train and test the model; 2. development of the methodology through the design and the implementation of a predictive model based on a Convolutional Neural Network (CNN) followed by a bidirectional Long Short Term Memory (LSTM) network; 3. extensive comparative analysis with other reference tools in the literature; 4. engineering of the developed method through the implementation and release of an automated tool for gene fusions prioritization downstream of gene fusion detection tools. 
Since the previous approach does not consider post-transcriptional regulation effects, new biological features have been considered (e.g., microRNA data, gene ontologies, and transcription factors) to improve the overall performance, and a new integrated approach based on a multi-layer perceptron (MLP) has been explicitly designed. Finally, extensive comparisons with other methods in the literature have been made. These contributions led to an improved model that outperforms the previous ones and competes with state-of-the-art tools. The rationale behind the second sub-topic of this thesis is the following: due to the widespread adoption of Next Generation Sequencing (NGS) technologies, a large amount of heterogeneous complex data related to several diseases and to healthy individuals is now available (e.g., RNA-seq, gene expression data, miRNA expression data, methylation sequencing data, and many others). Each of these data types is called an omic, and their integrative study is called multi-omics. In this context, the aim is to integrate multi-omics data involving thousands of features (genes, microRNAs) and to identify which of them are relevant for a specific biological process. From a computational point of view, finding the best strategies for multi-omics analysis and relevant feature identification is a very open challenge. The first chapter dedicated to this second sub-topic focuses on the integrative analysis of gene expression and connectivity data of mouse brains using machine learning techniques. The rationale behind this study is to explore whether the degree of physical connection between brain regions can be evaluated starting from their gene expression data. Many studies have considered the functional connection of two or more brain areas (which areas are activated in response to a specific stimulus), whereas analyzing physical connections (i.e., axon bundles) starting from gene expression data is still an open problem. 
Although this study is scientifically very relevant for a deeper understanding of human brain function, ethical reasons strongly limit the availability of samples. For this reason, several studies have been carried out on the mouse brain, which is anatomically similar to the human one. The neuronal connection data (obtained by viral tracers) of mouse brains were processed to identify physically connected brain regions, which were then evaluated together with these areas’ gene expression data. A multi-layer perceptron was applied to classify regions as connected or unconnected, taking gene expression data as input. Furthermore, a second model was created to infer the degree of connection between distinct brain regions. The implemented models successfully executed the binary classification task (connected regions against unconnected regions) and distinguished the intensity of the connection as low, medium, or high. A second chapter describes a statistical method to reveal pathology-determining microRNA targets in multi-omic datasets. In this work, two multi-omics datasets are used: a breast cancer dataset and a medulloblastoma dataset. Both datasets are composed of miRNA, mRNA, and proteomics data from the same patients. The main computational contribution to the field consists of designing and implementing an algorithm based on conditional probability to infer the impact of miRNA post-transcriptional regulation on target genes, exploiting the protein expression values. The developed methodology allowed a more in-depth understanding and identification of target genes. The identified targets also proved to be significantly enriched in three well-known databases (miRDB, TargetScan, and miRTarBase), leading to relevant biological insights. Another chapter deals with the classification of multi-omics samples. 
The main approaches in the literature either integrate all the features available for each sample upstream of the classifier (early integration approach) or create separate classifiers for each omic and subsequently define a set of consensus rules (late integration approach). In this context, the main contribution consists of introducing the concept of probability by creating a model based on Bayesian and MLP networks to achieve a consensus guided by the class label and its probability. This approach has shown that a probabilistic late integration classification is more specific than an early integration approach and can identify samples outside the training domain. Class labels could be helpful for providing new molecular profiles and patient categorization. However, they are not always available. The resulting need to cluster samples based on their intrinsic characteristics is addressed in a specific chapter. Multi-omic clustering in the literature is mainly addressed by creating graphs or by methods based on multidimensional data reduction. The main contribution to this field is a model based on deep learning techniques, implemented as an MLP with a specifically designed loss function. The loss represents the input samples in a reduced dimensional space by calculating the intra-cluster and inter-cluster distances at each epoch. This approach reported performance comparable to that of the most cited methods in the literature, while avoiding pre-processing steps for either feature selection or dimensionality reduction. Moreover, it has no limitations on the number of omics to integrate.
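The late integration idea in this abstract, one classifier per omic followed by a probability-guided consensus, can be sketched in a few lines. The thesis builds its consensus with Bayesian and MLP networks; the sketch below swaps in plain averaging of per-omic class probabilities, which is a deliberate simplification. The function name `late_integration`, the `reject_below` threshold, and the omic keys are hypothetical. Returning no label when the consensus is weak is one simple way to flag samples that may lie outside the training domain.

```python
from typing import Dict, List, Optional, Tuple


def late_integration(
    per_omic_probs: Dict[str, List[float]],
    reject_below: float = 0.6,
) -> Tuple[Optional[int], List[float]]:
    """Combine class-probability vectors from separate per-omic classifiers.

    per_omic_probs maps an omic name to that classifier's probabilities,
    one entry per class. Assumption: consensus = unweighted mean.
    Returns (predicted class index or None if rejected, consensus vector).
    """
    vectors = list(per_omic_probs.values())
    if not vectors:
        raise ValueError("at least one omic is required")
    n_classes = len(vectors[0])
    if any(len(v) != n_classes for v in vectors):
        raise ValueError("all omics must report the same number of classes")

    consensus = [sum(v[c] for v in vectors) / len(vectors) for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: consensus[c])
    # Reject the sample when no class reaches the confidence threshold.
    label = best if consensus[best] >= reject_below else None
    return label, consensus


label, consensus = late_integration(
    {"mrna": [0.9, 0.1], "mirna": [0.8, 0.2], "protein": [0.7, 0.3]}
)
print(label, consensus)
```

By contrast, an early integration approach would concatenate the features of all omics into one vector and train a single classifier on it, so there would be no per-omic probabilities to reconcile.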

    A Semantic Framework for Declarative and Procedural Knowledge

    In any scientific domain, the full set of data and programs has reached an -ome status, i.e. it has grown massively. The original article on the Semantic Web describes the evolution of a Web of actionable information, i.e. information derived from data through a semantic theory for interpreting the symbols. In a Semantic Web, methodologies are studied for describing, managing and analyzing both resources (domain knowledge) and applications (operational knowledge) - without any restriction on what and where they are respectively suitable and available in the Web - as well as for realizing automatic and semantic-driven workflows of Web applications elaborating Web resources. This thesis attempts to provide a synthesis among Semantic Web technologies, Ontology Research, Knowledge and Workflow Management. Such a synthesis is represented by Resourceome, a Web-based framework consisting of two components which strictly interact with each other: an ontology-based and domain-independent knowledge manager system (Resourceome KMS) - relying on a knowledge model where resource and operational knowledge are contextualized in any domain - and a semantic-driven workflow editor, manager and agent-based execution system (Resourceome WMS). The Resourceome KMS and the Resourceome WMS are exploited in order to realize semantic-driven formulations of workflows, where activities are semantically linked to any involved resource. On the whole, combining the use of domain ontologies and workflow techniques, Resourceome provides a flexible domain and operational knowledge organization, a powerful engine for semantic-driven workflow composition, and a distributed, automatic and transparent environment for workflow execution.