6 research outputs found

    The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching

    Open access article. Background: The Chemistry Development Kit (CDK) is a widely used open source cheminformatics toolkit, providing data structures to represent chemical concepts along with methods to manipulate such structures and perform computations on them. The library implements a wide variety of cheminformatics algorithms ranging from chemical structure canonicalization to molecular descriptor calculations and pharmacophore perception. It is used in drug discovery, metabolomics, and toxicology. Over the last 10 years, however, the code base has grown significantly, resulting in many complex interdependencies among components and poor performance of many algorithms. Results: We report improvements to the CDK v2.0 since the v1.2 release series, specifically addressing the increased functional complexity and poor performance. We first summarize the addition of new functionality, such as atom typing and molecular formula handling, and improvements to existing functionality that have led to significantly better performance for substructure searching, molecular fingerprints, and rendering of molecules. Second, we outline how the CDK has evolved with respect to quality control and the approaches we have adopted to ensure stability, including a code review mechanism. Conclusions: This paper highlights our continued efforts to provide a community driven, open source cheminformatics library, and shows that such collaborative projects can thrive over extended periods of time, resulting in a high-quality and performant library. By taking advantage of community support and contributions, we show that an open source cheminformatics project can act as a peer reviewed publishing platform for scientific computing software.

    Determining the parent and associated fragment formulae in mass spectrometry via the parent subformula graph

    Background: Identifying the molecular formula and fragmentation reactions of an unknown compound from its mass spectrum is crucial in areas such as natural product chemistry and metabolomics. We propose a method for identifying the correct candidate formula of an unidentified natural product from its mass spectrum. The method involves scoring the plausibility of parent candidate formulae based on a parent subformula graph (PSG), and two possible metrics relating to the number of edges in the PSG. This method is applicable to both electron-impact mass spectrometry (EI-MS) and tandem mass spectrometry (MS/MS) data. Additionally, this work introduces the two-dimensional fragmentation plot (2DFP) for visualizing PSGs. Results: Our results suggest that incorporating information about the edges of the PSG improves the identification of correct parent formulae, in comparison to the more well-accepted "MS/MS score", on the 2016 Computational Assessment of Small Molecule Identification (CASMI 2016) data set (76.3% vs 58.9% correct formula identification) and the Research Centre for Toxic Compounds in the Environment (RECETOX) data set (66.2% vs 59.4% correct formula identification). When the method was extended to identify the correct candidate formula from complex EI-MS data of semiochemicals, it again performed better than the MS/MS score (correct formula appearing in the top 4 candidates in 20/23 vs 7/23 cases), and it enables the rapid identification of both the correct parent ion mass and the correct parent formula with minimal expert intervention. Conclusion: Our method reliably identifies the correct parent formula even when the mass information is ambiguous. Furthermore, when parent formula identification is successful, the majority of associated fragment formulae can also be correctly identified. Our method can also identify the parent ion and its associated fragments in EI-MS spectra where the identity of the parent ion is unclear due to low quantities and overlapping compounds. Finally, our method does not inherently require empirical fitting of parameters or statistical learning, meaning it is easy to implement and extend. Scientific contribution: Developed, implemented and tested new metrics for assessing plausibility of candidate molecular formulae obtained from HR-MS data.
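The abstract does not spell out how the PSG is built, so the following Python sketch is only one plausible reading, not the authors' method. It assumes that nodes are fragment formulae that are subformulae of a candidate parent, and that an edge links two nodes when one is a subformula of the other. The function names (`is_subformula`, `psg_edge_count`) and the ethanol-like example formulae are hypothetical.

```python
from itertools import combinations

def is_subformula(small, big):
    """True if every element count in `small` is covered by `big`."""
    return all(big.get(element, 0) >= n for element, n in small.items())

def psg_edge_count(parent, fragments):
    """Count edges of an assumed parent subformula graph.

    Assumption: nodes are the fragment formulae that fit inside `parent`;
    two distinct nodes are linked when one is a subformula of the other.
    """
    nodes = [f for f in fragments if is_subformula(f, parent)]
    edges = 0
    for a, b in combinations(nodes, 2):
        if a != b and (is_subformula(a, b) or is_subformula(b, a)):
            edges += 1
    return edges

# Hypothetical fragment formulae, written as element -> count mappings.
fragments = [
    {"C": 2, "H": 5, "O": 1},
    {"C": 1, "H": 3, "O": 1},
    {"C": 2, "H": 5},
    {"C": 1, "H": 3},
]
candidate_a = {"C": 2, "H": 6, "O": 1}  # explains all four fragments
candidate_b = {"C": 2, "H": 6}          # cannot explain oxygen-containing fragments
```

Under these assumptions, a candidate parent that supports more subformula relations among the observed fragments receives a higher edge count, which is the kind of edge-based signal the abstract describes.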

    Prediction of Post-translational Modifications of Proteins from 2-DE/MS Data

    The living cell is a complex entity consisting of nucleic acids, proteins, and other biomolecules that form an interrelated and dynamic network. The unraveling of this network is of great interest for scientists of different disciplines. With the sequencing of the genome, a step was made toward understanding the fundamental elements of the cell: the genes. In humans, approximately 20,000 to 25,000 genes exist, which encode more than one million proteins. This complexity at the protein level is a result of alternative splicing and co- and post-translational modifications producing several protein species per transcript. Modifications are essential to the regulation of cellular processes and account for the activation or deactivation of enzymes and whole signaling pathways. The entirety of all proteins present in a cell at a fixed point in time and under particular biological conditions is called the proteome, and its analysis is proteomics. One particular area of interest in proteomics is the identification of proteins and their post-translational modifications. Peptide mass fingerprinting is an established method and has proved useful to identify proteins by their amino acid sequence using mass spectrometry and protein sequence databases. This method relies on the idea of comparing experimental (measured) mass peaks to theoretical (calculated) masses, the latter being generated from a protein in a sequence database. As the mass of a modified protein differs from the mass of its unmodified counterpart, this mass distance must be considered when detecting protein modifications with peptide mass fingerprinting. In the work described here, a novel algorithm was developed and implemented that allows for the identification of protein modifications from data derived by peptide mass fingerprinting. The algorithm transforms the process of predicting protein modifications into an extended Money Changing Problem: finding suitable combinations of modifications that explain the observed peak mass distances. Unlike common computational approaches, the algorithm presented here is not restricted in the number of modifications to be considered. Furthermore, the algorithm is efficient because it calculates the combinations of modifications for a given list of modifications only once, independent of the number of queries. Although hardly any data on the frequencies of protein modifications exist, which makes validating the results very difficult, this novel approach is a promising step towards the unraveling of protein complexity.
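The "compute once, query many times" idea can be illustrated with a classic coin-change table. This is a minimal sketch and not the thesis algorithm: it assumes integer (nominal) mass shifts, and the function name `count_combinations` and the three shifts used (methylation +14, acetylation +42, phosphorylation +80) are illustrative choices.

```python
def count_combinations(shifts, max_distance):
    """Build a table where ways[d] is the number of multisets of `shifts`
    that sum exactly to the mass distance d (classic coin-change counting).

    Assumption: shifts are positive integers (nominal masses).
    """
    ways = [0] * (max_distance + 1)
    ways[0] = 1  # the empty combination explains a distance of zero
    for shift in shifts:
        for d in range(shift, max_distance + 1):
            ways[d] += ways[d - shift]
    return ways

# Built once for a fixed list of modifications (illustrative nominal shifts).
table = count_combinations([14, 42, 80], max_distance=200)

# Each later query is a single lookup, independent of how many queries follow.
n_explanations = table[56]  # 4 x 14, or 14 + 42
```

Real peak distances are non-integer and measured with error, so a practical version would need scaling and a tolerance window; those details are left out here.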

    Molecular Formula Identification using High Resolution Mass Spectrometry: Algorithms and Applications in Metabolomics and Proteomics

    We investigate several theoretical and practical aspects of identifying the molecular formula of biomolecules using high-resolution mass spectrometry. Owing to recent advances in instrumentation, mass spectrometry (MS) has become one of the key technologies for the analysis of biomolecules in proteomics and metabolomics. It measures the masses of the molecules in a sample with high accuracy and is well suited to high-throughput data acquisition. One of the core tasks in MS-based proteomics and metabolomics is the identification of the molecules in the sample. In metabolomics, metabolites are subjected to structure elucidation, starting with the molecular formula of a molecule, i.e., the number of atoms of each element. This is the decisive step in identifying an unknown metabolite, because the established formula reduces the number of possible molecular structures to a much smaller set, which can be analyzed further with automated structure elucidation methods. After preprocessing, the output of a mass spectrometer is a list of peaks corresponding to the molecular masses and their intensities, i.e., the number of molecules with a given mass. In principle, the molecular formulae of small molecules can be identified from precise masses alone. However, it has been found that, because of the large number of chemically legitimate formulae in the upper mass range, excellent mass accuracy alone is not sufficient for identification. High-resolution MS allows molecular masses and intensities to be determined with outstanding accuracy. In this work, we develop several algorithms and applications that use this information to identify the molecular formulae of biomolecules.

    Efficient Mass Decomposition

    We study the problem of decomposing a positive integer M over a (fixed and finite) weighted alphabet Σ: We want to find non-negative integers c_i such that M = c_1 a_1 + ... + c_k a_k, where the a_i are the positive integer weights of the individual characters and |Σ| = k. We refer to the vector (c_1, ..., c_k) as a witness (of M over Σ), and denote by γ(M) the number of distinct witnesses of M. We present a data structure of size O(k a_1) that allows finding all witnesses of any query M in time O(k a_1 · γ(M)). To the best of our knowledge, this is the first algorithm for the problem with runtime independent of the size of the query M. Construction of the data structure requires O(k a_1) time and constant additional space, and is very easy to implement. The problem is motivated by mass spectrometry experiments, where peaks need to be mapped to sample molecules whose mass they could represent. Our simulations show that the algorithm presented performs well on relevant applications.
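As an illustration of witness enumeration, here is a minimal Python sketch. It assumes a residue-table construction in the spirit of the abstract: for each prefix of the alphabet, the table stores the smallest decomposable number in each residue class modulo a_1. The enumeration is a plain pruned recursion and does not claim the O(k a_1 · γ(M)) bound; the function names are hypothetical.

```python
from math import gcd, inf

def extended_residue_table(weights):
    """table[i][r] = smallest number congruent to r (mod a_1) that is
    decomposable over weights[0..i], or inf if no such number exists."""
    a1 = weights[0]
    col = [inf] * a1
    col[0] = 0
    table = [col[:]]
    for ai in weights[1:]:
        d = gcd(a1, ai)
        for p in range(d):
            # Start from the smallest known value in this residue cycle.
            n = min(col[q] for q in range(p, a1, d))
            if n == inf:
                continue
            # Walk the cycle by repeatedly adding a_i, relaxing each entry.
            for _ in range(a1 // d - 1):
                n += ai
                r = n % a1
                n = min(n, col[r])
                col[r] = n
        table.append(col[:])
    return table

def witnesses(mass, weights):
    """Yield every tuple (c_1, ..., c_k) with sum(c_i * a_i) == mass."""
    a1 = weights[0]
    table = extended_residue_table(weights)
    counts = [0] * len(weights)

    def rec(m, i):
        if i == 0:
            # Pruning guarantees m is a multiple of a_1 here.
            counts[0] = m // a1
            yield tuple(counts)
            return
        ai = weights[i]
        for c in range(m // ai + 1):
            rest = m - c * ai
            # Recurse only if `rest` is decomposable over weights[0..i-1].
            if table[i - 1][rest % a1] <= rest:
                counts[i] = c
                yield from rec(rest, i - 1)

    if table[-1][mass % a1] <= mass:
        yield from rec(mass, len(weights) - 1)
```

For example, `list(witnesses(15, [3, 5]))` enumerates the two ways of writing 15 over the weights 3 and 5, while a query such as 7 is rejected by a single table lookup.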

    Efficient mass decomposition

    Böcker S, Lipták Z. Efficient mass decomposition. In: Liebrock LM, ed. Proceedings of the 2005 ACM symposium on Applied computing (SAC '05). New York, NY: ACM; 2005: 151-157