578 research outputs found

    Novel methods for the analysis of small molecule fragmentation mass spectra

    The identification of small molecules, such as metabolites, in a high-throughput manner plays an important role in many research areas. Mass spectrometry (MS) is one of the predominant analysis technologies and is much more sensitive than nuclear magnetic resonance spectroscopy. Fragmentation of the molecules is used to obtain information beyond their mass. Gas chromatography-MS is one of the oldest and most widespread techniques for the analysis of small molecules. Commonly, the molecule is fragmented using electron ionization (EI). Using this technique, the molecular ion peak is often barely visible in the mass spectrum or even absent. We present a method to calculate fragmentation trees from high mass accuracy EI spectra, which annotate the peaks in the mass spectrum with molecular formulas of fragments and explain relevant fragmentation pathways. Fragmentation trees enable the identification of the molecular ion and its molecular formula if the molecular ion is present in the spectrum. The method works even if the molecular ion is of very low abundance. MS experts confirm that the calculated trees correspond very well to known fragmentation mechanisms. Using pairwise local alignments of fragmentation trees, structural and chemical similarities to already-known molecules can be determined. In order to compare a fragmentation tree of an unknown metabolite to a huge database of fragmentation trees, fast algorithms for solving the tree alignment problem are required. Unfortunately, the alignment of unordered trees, such as fragmentation trees, is NP-hard. We present three exact algorithms for the problem. Evaluation of our methods showed that thousands of alignments can be computed in a matter of minutes. Both the computation and the comparison of fragmentation trees are rule-free approaches that require no chemical knowledge about the unknown molecule and thus will be very helpful in the automated analysis of metabolites that are not included in common libraries.

    Optimized data processing algorithms for biomarker discovery by LC-MS

    This thesis reports techniques and optimization of algorithms to analyse label-free LC-MS data sets for clinical proteomics studies with an emphasis on time alignment algorithms and feature selection methods. The presented work is intended to support ongoing medical and biomarker research. The thesis starts with a review of important steps in a data processing pipeline of label-free Liquid Chromatography – Mass Spectrometry (LC-MS) data. The first part of the thesis discusses an optimization strategy for aligning complex LC-MS chromatograms. It explains the combination of time alignment algorithms (Correlation Optimized Warping, Parametric Time Warping and Dynamic Time Warping) with a Component Detection Algorithm to overcome limitations of the original methods that use Total Ion Chromatograms when applied to highly complex data. A novel reference selection method to facilitate the pre-alignment process and an approach to globally compare the quality of time alignment using overlapping peak area are introduced and used in the study. The second part of this thesis highlights an ongoing challenge faced in the field of biomarker discovery, where improvements in instrument resolution coupled with low sample numbers have led to a large discrepancy between the number of measurements and the number of measured variables. A comparative study of various commonly used feature selection methods for tackling this problem is presented. These methods are applied to spiked urine data sets with variable sample size and class separation to mimic typical conditions of biomarker research. Finally, the thesis closes with a summary and a discussion of the remaining challenges in the data processing field.
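
Of the time alignment algorithms named in this abstract, Dynamic Time Warping is the simplest to illustrate. The following is a minimal textbook sketch of the DTW distance between two one-dimensional intensity traces, not the thesis's implementation: the function name `dtw_distance`, the absolute-difference local cost, and the toy traces are all assumptions made for illustration, and the combination with a Component Detection Algorithm described above is not shown.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two 1-D sequences.

    Illustrative sketch only: uses |x - y| as the local cost and no
    warping-window constraint (both are assumptions, not from the thesis).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Toy traces: the second is the first with a delayed peak.
reference = [0, 1, 2, 1, 0]
shifted = [0, 0, 1, 2, 1, 0]
print(dtw_distance(reference, shifted))
```

Because the warping path may repeat samples, a peak that is merely delayed in retention time can be matched at zero cost, which is the property that makes this family of methods attractive for chromatogram alignment.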

    Improving the Performance and Precision of Bioinformatics Algorithms

    Recent advances in biotechnology have enabled scientists to generate and collect huge amounts of biological experimental data. Software tools for analyzing both genomic (DNA) and proteomic (protein) data with high speed and accuracy have thus become very important in modern biological research. This thesis presents several techniques for improving the performance and precision of bioinformatics algorithms used by biologists. Improvements in both the speed and cost of automated DNA sequencers have allowed scientists to sequence the DNA of an increasing number of organisms. One way biologists can take advantage of this genomic DNA data is to use it in conjunction with expressed sequence tag (EST) and cDNA sequences to find genes and their splice sites. This thesis describes ESTmapper, a tool designed to use an eager write-only top-down (WOTD) suffix tree to efficiently align DNA sequences against known genomes. Experimental results show that ESTmapper can be much faster than previous techniques for aligning and clustering DNA sequences, and produces alignments of comparable or better quality. Peptide identification by tandem mass spectrometry (MS/MS) is becoming the dominant high-throughput proteomics workflow for protein characterization in complex samples. Biologists currently rely on protein database search engines to identify peptides producing experimentally observed mass spectra. This thesis describes two approaches for improving peptide identification precision using statistical machine learning. HMMatch (HMM MS/MS Match) is a hidden Markov model approach to spectral matching, in which many examples of a peptide fragmentation spectrum are summarized in a generative probabilistic model that captures the consensus and variation of each peak's intensity. Experimental results show that HMMatch can identify many peptides missed by traditional spectral matching and search engines. 
PepArML (Peptide Identification Arbiter by Machine Learning) is a machine learning based framework for improving the precision of peptide identification. It uses classification algorithms to effectively utilize spectral features and scores from multiple search engines in a single model-free framework that can be trained in an unsupervised manner. Experimental results show that PepArML can improve the sensitivity of peptide identification for several synthetic protein mixtures compared with individual search engines.

    An in vitro biochemical investigation into the conformation, binding and E3-ubiquitin ligase activity of mammalian UHRF1 with reconstituted chromatin

    In the eukaryotic genome, DNA and histone modifications regulate chromatin function and mediate basic processes such as gene transcription, DNA repair and DNA replication. Maintaining chromatin modifications after DNA replication is essential for chromatin homeostasis, especially for regions of the genome that need to be kept silenced such as repetitive elements. The maintenance DNA methyltransferase, DNMT1, is responsible for ensuring that cytosine methylation at CpG dinucleotides, and thus proper transcriptional programmes, are propagated to the daughter cells. DNMT1 is specifically recruited to newly replicated, hemi-methylated DNA, and the E3-ubiquitin ligase UHRF1 (Ubiquitin-like containing PHD- and RING-finger domains protein 1) plays a critical role in this. The mechanisms of the recruitment of DNMT1 to chromatin via UHRF1 are currently an area of active investigation. Several studies using modified nucleosomes, histone peptides and DNA oligonucleotides have shown that UHRF1 binds to hemi-methylated CpG dinucleotides and to histone H3 di- or tri-methylated at Lys-9. Since UHRF1 was also found to interact with DNMT1, it was postulated that UHRF1 acts as an adapter that directly recruits DNMT1 to newly replicated DNA. Additionally, it has recently been reported that the E3-ubiquitin ligase activity of its C-terminal RING-finger is required for the recruitment of DNMT1 to replication forks. Ubiquitylation of either K18 or K23 on histone H3, which is recognised by a ubiquitin-interacting motif within DNMT1, appears to be critical for DNMT1 targeting, but the recruitment mechanism has so far not been completely elucidated. This study has investigated the binding and E3-ubiquitin ligase activity of UHRF1 in the context of physiologically relevant chromatin substrates. Using a fully reconstituted system, the chromatin binding and enzymatic activity of UHRF1 and how this is linked to its intra-molecular arrangement have been elucidated. 
In the context of modified nucleosome substrates, we observe an increase in binding of recombinant UHRF1 in the presence of hemi-methylated DNA, whilst with histone H3K9me2/3 only a small increase in binding is detected. We also provide evidence that binding to nucleosome core particles is enhanced by a basic region between the SRA-domain and the RING-finger. This so-called polybasic region, or PBR, has previously been implicated in the regulation of UHRF1 binding to H3K9me2/3 marks. Our findings therefore suggest that binding of UHRF1 to physiological chromatin substrates is more complex than previously thought. In-solution crosslinking/mass spectrometry experiments using the full-length protein confirm that UHRF1 exhibits complex intra-molecular contacts that can potentially regulate its interaction with chromatin or other factors. In addition to reported contacts between the PBR and the Tandem-Tudor domain and between the PHD-finger and the SRA-domain, the UBL-domain also makes extensive contacts to other regions within UHRF1. These appear to be weak and dynamic. Crucially, removal of the UBL-domain does not affect nucleosome binding but does result in a strong reduction in UHRF1 E3-ubiquitin ligase activity. Further experiments suggest that the UBL-domain is involved in establishing the enzyme/substrate complex between the E2-conjugating enzyme and the chromatin substrate and in stimulating the transfer of ubiquitin from the E2~Ub complex to histone H3. In summary, by combining a crosslinking/mass spectrometry approach to interrogate the intra-molecular arrangement of UHRF1 with fully reconstituted enzyme and chromatin-binding assays using physiologically relevant substrates, we have identified a function for the UBL-domain of UHRF1. 
Our results suggest that the UBL is highly flexible in solution and that it forms transient contacts with other parts of UHRF1 and the E2-conjugating enzyme that are required for the formation of the E2/E3/substrate complex in allosterically activating ubiquitin transfer from the E2~Ub to the histone target substrate. These findings assign, for the first time, a function to the UBL-domain and pave the way for further investigation of the involvement of this domain in the physiological role of UHRF1.

    Unipept: computational exploration of metaproteome data


    Messenger RNA and protein profiles in familial chronic lymphocytic leukaemia

    An inherited risk for developing chronic lymphocytic leukaemia (CLL) is well documented in genetic studies, and familial aggregation of CLL cases has consistently been demonstrated in large registry-based studies. However, genetic linkage studies of CLL families have not detected any high-risk susceptibility genes against a background of numerous low-risk genes. To detect patterns of multiple low-risk loci, genome-wide association studies (GWAS) have used large numbers of cases and controls and dense-coverage single nucleotide polymorphism (SNP) arrays. These studies have identified risk loci that account for ≈19% of the heritability of CLL, suggesting that some of the remaining CLL risk may be associated with non-DNA sequence modifications, including inherited epigenetic changes, which regulate oncogenes and tumour suppressor genes. In this CLL kindred study, high-resolution DNA microarrays and mass spectrometry (MS) were used to identify differentially abundant mRNA and proteins in cases of familial CLL (F-CLL) and monoclonal B lymphocytosis (F-MBL) compared with unaffected relatives, sporadic CLL (S-CLL) and controls. In addition, mRNA and protein levels were studied in familial and sporadic CLL patients with mutated and unmutated immunoglobulin heavy chain variable genes (IGH). Key findings were that mRNA and protein profiles clearly segregated clonal B lymphocytes in S-CLL from clonal B lymphocytes in F-MBL and F-CLL (combined as familial-lymphoproliferative disease; F-LPD). These profiles were distinct from those found in normal B lymphocytes in unaffected family members and unrelated controls. Furthermore, increasing upregulation or downregulation of both F-LPD specific genes and genes common to S-CLL occurred in association with progression from normal familial B lymphocytes through F-MBL to F-CLL.

    New Statistical Algorithms for the Analysis of Mass Spectrometry Time-Of-Flight Mass Data with Applications in Clinical Diagnostics

    Mass spectrometry (MS) based techniques have emerged as a standard for large-scale protein analysis. The ongoing progress in terms of more sensitive machines and improved data analysis algorithms has led to a constant expansion of its fields of application. Recently, MS was introduced into clinical proteomics with the prospect of early disease detection using proteomic pattern matching. Analyzing biological samples (e.g. blood) by mass spectrometry generates mass spectra that represent the components (molecules) contained in a sample as masses and their respective relative concentrations. In this work, we are interested in those components that are constant within a group of individuals but differ markedly between individuals of two distinct groups. These distinguishing components that depend on a particular medical condition are generally called biomarkers. Since not all biomarkers found by the algorithms are of equal (discriminating) quality, we are only interested in a small biomarker subset that, in combination, can be used as a fingerprint for a disease. Once a fingerprint for a particular disease (or medical condition) is identified, it can be used in clinical diagnostics to classify unknown spectra. In this thesis we have developed new algorithms for automatic extraction of disease-specific fingerprints from mass spectrometry data. Special emphasis has been put on designing highly sensitive methods with respect to signal detection. Thanks to our statistically based approach, our methods are able to detect signals, such as those of hormones, even below the noise level inherent in data acquired by common MS machines. To provide access to these new classes of algorithms to collaborating groups, we have created a web-based analysis platform that provides all necessary interfaces for data transfer, data analysis and result inspection. To prove the platform's practical relevance, it has been utilized in several clinical studies, two of which are presented in this thesis. 
In these studies it could be shown that our platform is superior to commercial systems with respect to fingerprint identification. As an outcome of these studies, several fingerprints for different cancer types (bladder, kidney, testicle, pancreas, colon and thyroid) have been detected and validated. The clinical partners in fact emphasize that these results would be impossible with a less sensitive analysis tool (such as the currently available systems). In addition to the issue of reliably finding and handling signals in noise, we faced the problem of handling very large amounts of data, since an average dataset of an individual is about 2.5 Gigabytes in size and we have data of hundreds to thousands of persons. To cope with these large datasets, we developed a new framework for a heterogeneous (quasi) ad-hoc Grid, an infrastructure that allows the integration of thousands of computing resources (e.g. desktop computers, computing clusters or specialized hardware, such as IBM's Cell Processor in a Playstation 3).

    Searching for novel gene functions in yeast : identification of thousands of novel molecular interactions by protein-fragment complementation assay followed by automated gene function prediction and high-throughput lipidomics

    Understanding complex biological processes requires sophisticated experimental and computational approaches. 
Advances in functional genomics strategies provide powerful tools for collecting diverse types of information on the interconnectivity of genes, proteins and small molecules for studying organizational principles of cellular networks. Integration of that knowledge into a systems biology framework enables prediction of novel functions of uncharacterized genes. For performing such predictions on a genome-wide scale in the yeast Saccharomyces cerevisiae, we have developed a novel strategy that combines a high-throughput interactomics screen for protein-protein interactions, in silico gene function prediction, and validation of predictions with high-throughput lipidomics. We started by performing a large-scale screen for protein-protein interactions using a protein-fragment complementation assay. The method allowed us to monitor interactions in vivo between proteins expressed from their natural promoters. Furthermore, the method did not suffer from bias against membrane interactions compared with established genome-wide techniques for detecting protein interactions. As a result, we detected many novel interactions and increased coverage of an interactome of lipid homeostasis that has not yet been comprehensively explored. Next, we applied a machine learning algorithm to identify eight previously uncharacterized genes with a potential role in lipid metabolism. Finally, we investigated whether these genes and a set of distinct transcriptional regulators, not previously implicated with lipids, have a role in lipid homeostasis. For that purpose, we analyzed the lipidome of deletion mutants of the selected genes. In order to probe a large number of strains, we have developed a high-throughput platform for high-content lipidomic screening of yeast mutant libraries that consists of high-resolution Orbitrap mass spectrometry and a dedicated data processing framework to support lipid phenotyping across hundreds of Saccharomyces cerevisiae mutants. 
Lipidomics experiments confirmed functional predictions by demonstrating differences in the lipid metabolic phenotypes of deletion mutants lacking the YBR141C and YJR015W genes predicted to be involved in lipid metabolism. An altered lipid phenotype was also observed for a deletion mutant of the transcription factor KAR4 that has not previously been linked with lipid metabolism. These results demonstrate that a workflow that integrates the acquisition of novel molecular interactions, computational gene function prediction and a novel high-throughput shotgun lipidomics platform is a valuable contribution to the arsenal of methods for systems biology. The developments of functional genomic methods and lipidomics technologies provide means to study biological networks of higher eukaryotes, including mammals. Therefore, the presented workflow has the potential to find applications in more complex organisms.