4 research outputs found

    Automatic Assignment of EC Numbers

    A wide range of research areas in molecular biology and medical biochemistry require a reliable enzyme classification system, e.g., drug design, metabolic network reconstruction and systems biology. When research scientists in the above-mentioned areas wish to refer unambiguously to an enzyme and its function, they use the EC number introduced by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB). However, each of these applications depends critically on the consistency and reliability of the underlying data. We have developed tools for the validation of the EC number classification scheme. In this paper, we present validated data of 3788 enzymatic reactions including 229 sub-subclasses of the EC classification system. Over 80% agreement was found between our assignment and the EC classification. For 61 (i.e., only 2.5%) reactions we found that their assignment was inconsistent with the rules of the nomenclature committee; they have to be transferred to other sub-subclasses. We demonstrate that our validation results can be used to initiate corrections and improvements to the EC number classification scheme.
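The agreement figure above is reported at the level of EC sub-subclasses, i.e. the first three fields of an EC number. The abstract does not describe the authors' tooling, so the following is only a minimal sketch of how such a comparison could be scored; the function names and the toy EC pairs are hypothetical and not taken from the paper.

```python
from typing import Iterable


def sub_subclass(ec: str) -> tuple[str, ...]:
    """Return the first three fields of an EC number ('EC 1.1.1.1' -> ('1', '1', '1'))."""
    parts = ec.strip().removeprefix("EC").strip().split(".")
    if len(parts) < 3:
        raise ValueError(f"EC number needs at least three fields: {ec!r}")
    return tuple(parts[:3])


def agreement(pairs: Iterable[tuple[str, str]]) -> float:
    """Fraction of (official, computed) pairs that share a sub-subclass."""
    pairs = list(pairs)
    hits = sum(sub_subclass(a) == sub_subclass(b) for a, b in pairs)
    return hits / len(pairs)


def inconsistent(pairs: Iterable[tuple[str, str]]) -> list[tuple[str, str]]:
    """Pairs whose computed sub-subclass differs from the official one."""
    return [(a, b) for a, b in pairs if sub_subclass(a) != sub_subclass(b)]


# Hypothetical toy data: (official EC number, automatically assigned EC number).
toy_pairs = [
    ("1.1.1.1", "1.1.1.2"),
    ("2.7.1.1", "2.7.1.40"),
    ("3.4.21.4", "3.4.22.1"),
    ("EC 4.1.1.1", "4.1.1.39"),
]
```

Comparing only the first three fields means that two reactions with different serial numbers still count as agreeing, which matches the sub-subclass granularity stated in the abstract.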

    Structure generation and de novo design using reaction networks

    This project is concerned with de novo molecular design whereby novel molecules are built in silico and evaluated against properties relevant to biological activity, such as physicochemical properties and structural similarity to active compounds. The aim is to encourage cost-effective compound design by reducing the number of molecules requiring synthesis and analysis. One of the main issues in de novo design is ensuring that the molecules generated are synthesisable. In this project, a method is developed that enables virtual synthesis using rules derived from reaction sequences. Individual reactions taken from reaction databases were connected to form reaction networks. Reaction sequences were then extracted by tracing paths through the network and used to create ‘reaction sequence vectors’ (RSVs) which encode the differences between the start and end points of the sequences. RSVs can be applied to molecules to generate virtual products which are based on literature precedents. The RSVs were applied to structure-activity relationship (SAR) exploration using examples taken from the literature. They were shown to be effective in expanding the chemical space that is accessible from the given starting materials. Furthermore, each virtual product is associated with a potential synthetic route. They were then applied in de novo design scenarios with the aim of generating molecules that are predicted to be active using SAR models. Using a collection of RSVs with a set of small molecules as starting materials for de novo design showed that the method was capable of producing many useful, synthesisable compounds worthy of future study. The RSV method was then compared with a previously published method that is based on individual reactions (reaction vectors or RVs). The RSV approach was shown to be considerably faster than de novo design using RVs; however, the diversity of products was more limited.
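The idea of tracing paths through a reaction network and encoding the difference between a sequence's start and end points can be illustrated with a toy sketch. Everything here is an assumption made for illustration: molecules are reduced to hand-written feature counts (the abstract does not specify the descriptors), the network is a plain dictionary, and no structure is rebuilt from the resulting counts, which a real implementation would have to do.

```python
from collections import Counter


def find_paths(network, source, max_steps):
    """Enumerate simple paths of 1..max_steps reactions starting at source (depth-first)."""
    paths = []

    def walk(path):
        if len(path) > 1:
            paths.append(list(path))
        if len(path) - 1 == max_steps:
            return
        for nxt in network.get(path[-1], []):
            if nxt not in path:  # keep paths simple (no revisits)
                walk(path + [nxt])

    walk([source])
    return paths


def difference_vector(start, end):
    """Signed feature counts: features of the end point minus those of the start point."""
    diff = Counter(end)
    diff.subtract(start)
    return {k: v for k, v in diff.items() if v != 0}


def apply_vector(features, vector):
    """Apply a difference vector; return None if it removes features the molecule lacks."""
    result = Counter(features)
    for key, delta in vector.items():
        if result[key] + delta < 0:
            return None
        result[key] += delta
    return Counter({k: v for k, v in result.items() if v > 0})


# Hypothetical toy data: each molecule is a bag of bond-type features.
features = {
    "acid": Counter({"C=O": 1, "C-OH": 1}),
    "acyl_chloride": Counter({"C=O": 1, "C-Cl": 1}),
    "amide": Counter({"C=O": 1, "C-N": 1}),
}
# Directed edges: reactant -> products of known single reactions.
network = {"acid": ["acyl_chloride"], "acyl_chloride": ["amide"]}

two_step = [p for p in find_paths(network, "acid", 2) if len(p) == 3][0]
rsv = difference_vector(features[two_step[0]], features[two_step[-1]])
```

In this sketch the two-step path from the acid to the amide collapses into a single vector, so applying it to a new starting material skips the intermediate. That is consistent with the speed advantage and reduced product diversity reported in the abstract, although the sketch itself demonstrates neither.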

    Enhancing Reaction-based de novo Design using Machine Learning

    De novo design is a branch of chemoinformatics that is concerned with the rational design of molecular structures with desired properties, which specifically aims at achieving suitable pharmacological and safety profiles when applied to drug design. Scoring, construction, and search methods are the main components that are exploited by de novo design programs to explore the chemical space to encourage the cost-effective design of new chemical entities. In particular, construction methods are concerned with providing strategies for compound generation to address issues such as drug-likeness and synthetic accessibility. Reaction-based de novo design consists of combining building blocks according to transformation rules that are extracted from collections of known reactions, intending to restrict the enumerated chemical space into a manageable number of synthetically accessible structures. The reaction vector is an example of a representation that encodes topological changes occurring in reactions, which has been integrated within a structure generation algorithm to increase the chances of generating molecules that are synthesisable. The general aim of this study was to enhance reaction-based de novo design by developing machine learning approaches that exploit publicly available data on reactions. A series of algorithms for reaction standardisation, fingerprinting, and reaction vector database validation were introduced and applied to generate new data on which the entirety of this work relies. First, these collections were applied to the validation of a new ligand-based design tool. The tool was then used in a case study to design compounds which were eventually synthesised using very similar procedures to those suggested by the structure generator. A reaction classification model and a novel hierarchical labelling system were then developed to introduce the possibility of applying transformations by class. The model was augmented with an algorithm for confidence estimation, and was used to classify two datasets from industry and the literature. Results from the classification suggest that the model can be used effectively to gain insights on the nature of reaction collections. Classified reactions were further processed to build a reaction class recommendation model capable of suggesting appropriate reaction classes to apply to molecules according to their fingerprints. The model was validated, then integrated within the reaction vector-based design framework, which was assessed on its performance against the baseline algorithm. Results from the de novo design experiments indicate that the use of the recommendation model leads to a higher synthetic accessibility and a more efficient management of computational resources.