1,116 research outputs found
Engineering polymer informatics: Towards the computer-aided design of polymers
The computer-aided design of polymers is one of the holy grails of modern chemical
informatics and of significant interest for a number of communities in polymer
science. The paper outlines a vision for the in silico design of polymers and presents
an information model for polymers based on modern semantic web technologies, thus
laying the foundations for achieving the vision
CML: Evolution and design.
A retrospective view of the design and evolution of Chemical Markup Language (CML) is presented by its original authors.RIGHTS : This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may : copy, distribute, and display the work; make derivative works; or make commercial use of the work - under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are
Recommended from our members
Information extraction from chemical patents
The automated extraction of semantic chemical data from the existing literature is demonstrated. For reasons of copyright, the work is focused on the patent literature, though the methods are expected to apply equally to other areas of the chemical literature.
Hearst Patterns are applied to the patent literature in order to discover hyponymic relations describing chemical species. The acquired relations are manually validated to determine the precision of the determined hypernyms (85.0%) and of the asserted hyponymic relations (94.3%). It is demonstrated that the system acquires relations that are not present in the ChEBI ontology, suggesting that it could function as a valuable aid to the ChEBI curators. The relations discovered by this process are formalised using the Web Ontology Language (OWL) to enable re-use.
PatentEye – an automated system for the extraction of reactions from chemical patents and their conversion to Chemical Markup Language (CML) – is presented. Chemical patents published by the European Patent Office over a ten-week period are used to demonstrate the capability of PatentEye – 4444 reactions are extracted with a precision of 78% and recall of 64% with regards to determining the identity and amount of reactants employed and an accuracy of 92% with regards to product identification. NMR spectra are extracted from the text using OSCAR3, which is developed to greatly increase recall. The resulting system is presented as a significant advancement towards the large-scale and automated extraction of high-quality reaction information.
Extended Polymer Markup Language (EPML), a CML dialect for the description of Markush structures as they are presented in the literature, is developed. Software to exemplify and to enable substructure searching of EPML documents is presented. Further work is recommended to refine the language and code to publication-quality before they are presented to the community.Unileve
The semantics of Chemical Markup Language (CML) for computational chemistry : CompChem.
: This paper introduces a subdomain chemistry format for storing computational chemistry data called CompChem. It has been developed based on the design, concepts and methodologies of Chemical Markup Language (CML) by adding computational chemistry semantics on top of the CML Schema. The format allows a wide range of ab initio quantum chemistry calculations of individual molecules to be stored. These calculations include, for example, single point energy calculation, molecular geometry optimization, and vibrational frequency analysis. The paper also describes the supporting infrastructure, such as processing software, dictionaries, validation tools and database repositories. In addition, some of the challenges and difficulties in developing common computational chemistry dictionaries are discussed. The uses of CompChem are illustrated by two practical applications.RIGHTS : This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may : copy, distribute, and display the work; make derivative works; or make commercial use of the work - under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are
The semantics of Chemical Markup Language (CML): dictionaries and conventions
RIGHTS : This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may : copy, distribute, and display the work; make derivative works; or make commercial use of the work - under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.Abstract The semantic architecture of CML consists of conventions, dictionaries and units. The conventions conform to a top-level specification and each convention can constrain compliant documents through machine-processing (validation). Dictionaries conform to a dictionary specification which also imposes machine validation on the dictionaries. Each dictionary can also be used to validate data in a CML document, and provide human-readable descriptions. An additional set of conventions and dictionaries are used to support scientific units. All conventions, dictionaries and dictionary elements are identifiable and addressable through unique URIs.Peer Reviewe
XMPP for cloud computing in bioinformatics supporting discovery and invocation of asynchronous web services
Background: Life sciences make heavily use of the web for both data provision and analysis. However, the increasing amount of available data and the diversity of analysis tools call for machine accessible interfaces in order to be effective. HTTP-based Web service technologies, like the Simple Object Access Protocol (SOAP) and REpresentational State Transfer (REST) services, are today the most common technologies for this in bioinformatics. However, these methods have severe drawbacks, including lack of discoverability, and the inability for services to send status notifications. Several complementary workarounds have been proposed, but the results are ad-hoc solutions of varying quality that can be difficult to use. Results: We present a novel approach based on the open standard Extensible Messaging and Presence Protocol (XMPP), consisting of an extension (IO Data) to comprise discovery, asynchronous invocation, and definition of data types in the service. That XMPP cloud services are capable of asynchronous communication implies that clients do not have to poll repetitively for status, but the service sends the results back to the client upon completion. Implementations for Bioclipse and Taverna are presented, as are various XMPP cloud services in bio- and cheminformatics. Conclusion: XMPP with its extensions is a powerful protocol for cloud services that demonstrate several advantages over traditional HTTP-based Web services: 1) services are discoverable without the need of an external registry, 2) asynchronous invocation eliminates the need for ad-hoc solutions like polling, and 3) input and output types defined in the service allows for generation of clients on the fly without the need of an external semantics description. The many advantages over existing technologies make XMPP a highly interesting candidate for next generation online services in bioinformatics
DRIVER Technology Watch Report
This report is part of the Discovery Workpackage (WP4) and is the third report out of four deliverables. The objective of this report is to give an overview of the latest technical developments in the world of digital repositories, digital libraries and beyond, in order to serve as theoretical and practical input for the technical DRIVER developments, especially those focused on enhanced publications. This report consists of two main parts, one part focuses on interoperability standards for enhanced publications, the other part consists of three subchapters, which give a landscape picture of current and surfacing technologies and communities crucial to DRIVER. These three subchapters contain the GRID, CRIS and LTP communities and technologies. Every chapter contains a theoretical explanation, followed by case studies and the outcomes and opportunities for DRIVER in this field
Analysis and Synthesis of Metadata Goals for Scientific Data
The proliferation of discipline-specific metadata schemes contributes to artificial barriers that can impede interdisciplinary and transdisciplinary research. The authors considered this problem by examining the domains, objectives, and architectures of nine metadata schemes used to document scientific data in the physical, life, and social sciences. They used a mixed-methods content analysis and Greenberg’s (2005) metadata objectives, principles, domains, and architectural layout (MODAL) framework, and derived 22 metadata-related goals from textual content describing each metadata scheme. Relationships are identified between the domains (e.g., scientific discipline and type of data) and the categories of scheme objectives. For each strong correlation (\u3e0.6), a Fisher’s exact test for nonparametric data was used to determine significance (p \u3c .05).
Significant relationships were found between the domains and objectives of the schemes. Schemes describing observational data are more likely to have “scheme harmonization” (compatibility and interoperability with related schemes) as an objective; schemes with the objective “abstraction” (a conceptual model exists separate from the technical implementation) also have the objective “sufficiency” (the scheme defines a minimal amount of information to meet the needs of the community); and schemes with the objective “data publication” do not have the objective “element refinement.” The analysis indicates that many metadata-driven goals expressed by communities are independent of scientific discipline or the type of data, although they are constrained by historical community practices and workflows as well as the technological environment at the time of scheme creation. The analysis reveals 11 fundamental metadata goals for metadata documenting scientific data in support of sharing research data across disciplines and domains. The authors report these results and highlight the need for more metadata-related research, particularly in the context of recent funding agency policy changes
Recommended from our members
Automatic analysis and validation of open polymer data
A system to automatically extract, analyse, validate and model polymer data has been produced. This system is called the Polymer Informatics Knowledge System (PIKS).
Methods of storing polymer data electronically are examined. The majority of data-formats are only capable of representing an idealised structure of a macromolecule rather than the actual distribution of structures present in the polymer. Polymer markup language (PML) is the only data-format capable of storing this information. A novel extension to the PML language, allowing copolymers produced with a depletion of reactants is introduced. Without the extension only Markov-chains can be produced.
An informatics analysis of Unilever data of cleaning efficacy of polymers is performed. A representative macromolecule was produced for each polymer sample. Descriptors were calculated over these and used for machine learning to predict the cleaning efficacy. From these models a monomer was identified which was very strongly correlated with good cleaning performance. The monomer in question cannot be revealed as it is a trade secret.
Polymer data from the PoLyInfo database are extracted and converted into XML. A summary of the data available in the PoLyInfo Database is presented. The PIKS tools were used to automatically validate this data for internal consistency, as well as against another data source. The monomers and polymers were analysed for consistency, as well as CML reactions being produced for the polymerisation reactions in the database which were also checked for constancy. The error in the structures was found to be 5.8% for the monomers, 7.3% for the polymers and 2.9% for the reactions. Some of the causes of the discrepancies are presented.
The property data from the PoLyInfo database was then used for machine learning. Support Vector Regression (SVR) models of the glass transition temperature were produced both with and without the inclusion of sample characterisation data. Both methods performed similarly, with the model without producing an RMS error of 19.1K (r^2=0.96), while the model with produced an RMS error of 20.1K (r^2=0.96). This means that more sample characterisation data is required than the M_w and M_w/M_n.This work was supported by Unileve
- …