101 research outputs found
Chemoinformatics Research at the University of Sheffield: A History and Citation Analysis
This paper reviews the work of the Chemoinformatics Research Group in the Department of Information Studies at the University of Sheffield, focusing particularly on the work carried out in the period 1985-2002. Four major research areas are discussed, these involving the development of methods for: substructure searching in databases of three-dimensional structures, including both rigid and flexible molecules; the representation and searching of the Markush structures that occur in chemical patents; similarity searching in databases of both two-dimensional and three-dimensional structures; and compound selection and the design of combinatorial libraries. An analysis of citations to 321 publications from the Group shows that it attracted a total of 3725 residual citations during the period 1980-2002. These citations appeared in 411 different journals, and involved 910 different citing organizations from 54 different countries, thus demonstrating the widespread impact of the Group's work.
Information extraction from chemical patents
The automated extraction of semantic chemical data from the existing literature is demonstrated. For reasons of copyright, the work is focused on the patent literature, though the methods are expected to apply equally to other areas of the chemical literature.
Hearst Patterns are applied to the patent literature in order to discover hyponymic relations describing chemical species. The acquired relations are manually validated to determine the precision of the determined hypernyms (85.0%) and of the asserted hyponymic relations (94.3%). It is demonstrated that the system acquires relations that are not present in the ChEBI ontology, suggesting that it could function as a valuable aid to the ChEBI curators. The relations discovered by this process are formalised using the Web Ontology Language (OWL) to enable re-use.
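To illustrate the kind of lexical pattern referred to above, the following is a minimal sketch of one classic Hearst pattern ("X such as A, B and C") applied to a sentence. It is not the implementation used in the work described: the regular expression, the single-word treatment of noun phrases, and the function name `extract_hyponyms` are all assumptions made for illustration. A real system would use a noun-phrase chunker and chemical named-entity recognition instead of bare word tokens.

```python
import re

# Assumption: each noun phrase is approximated by a single word token.
WORD = r"[A-Za-z][A-Za-z0-9\-]*"

# One classic Hearst pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
SUCH_AS = re.compile(
    rf"(?P<hypernym>{WORD}),? such as "
    rf"(?P<hyponyms>{WORD}(?:, (?!and |or ){WORD})*(?:,? (?:and|or) {WORD})?)"
)


def extract_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group("hypernym")
        # Split the enumerated list on commas and on the final "and"/"or".
        for hyponym in re.split(r",? (?:and|or) |, ", match.group("hyponyms")):
            pairs.append((hyponym, hypernym))
    return pairs


if __name__ == "__main__":
    text = "The reaction was run in solvents such as ethanol, methanol and toluene."
    for hyponym, hypernym in extract_hyponyms(text):
        print(f"{hyponym} is-a {hypernym}")
```

Each extracted pair is only a candidate relation; as the abstract notes, the acquired relations still need validation before being asserted in an ontology.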
PatentEye – an automated system for the extraction of reactions from chemical patents and their conversion to Chemical Markup Language (CML) – is presented. Chemical patents published by the European Patent Office over a ten-week period are used to demonstrate the capability of PatentEye – 4444 reactions are extracted with a precision of 78% and recall of 64% with regard to determining the identity and amount of reactants employed and an accuracy of 92% with regard to product identification. NMR spectra are extracted from the text using OSCAR3, which is developed to greatly increase recall. The resulting system is presented as a significant advancement towards the large-scale and automated extraction of high-quality reaction information.
Extended Polymer Markup Language (EPML), a CML dialect for the description of Markush structures as they are presented in the literature, is developed. Software to exemplify and to enable substructure searching of EPML documents is presented. Further work is recommended to refine the language and code to publication quality before they are presented to the community.
ClassyFire: automated chemical classification with a comprehensive, computable taxonomy
Additional file 5. Use cases. Text-based search on the ClassyFire web server. (A) Building the query. (B) Sparteine, one of the returned compounds
Development of deep learning applications for the automated extraction of chemical information from scientific literature
This dissertation focuses on developing deep learning applications for extracting chemical information from scientific literature, particularly targeting the automated recognition of molecular structures in images. DECIMER Segmentation, a novel application, employs a Mask Region-based Convolutional Neural Network (MRCNN) model to segment chemical structures in documents, aided by a mask expansion algorithm, marking a significant advancement in processing chemical literature. The Optical Chemical Structure Recognition (OCSR) tool DECIMER Image Transformer uses an encoder-decoder architecture to convert chemical structure depictions into the machine-readable SMILES format. The model has been trained on over 450 million pairs of images and SMILES representations. Its ability to interpret various depiction styles, including hand-drawn structures, sets a new standard in OCSR. To artificially generate large and diverse OCSR training datasets using multiple cheminformatics toolkits, RanDepict was developed. The diversification of training data ensures robust model generalisation across different chemical structure depictions. A unique dataset of hand-drawn molecule images was created to evaluate the model's performance in interpreting these challenging depictions. This dataset further contributes to the understanding of automated structure recognition from diverse styles. The integration of these technologies led to the creation of DECIMER.ai, an open-source web application that combines segmentation and interpretation tools, allowing users to extract and process chemical information from literature efficiently. The work concludes with a discussion on the significance of open data in advancing molecular informatics, highlighting its potential for broader chemical research domains. By adhering to FAIR data standards and open-source principles, the tools developed for this dissertation are designed for adaptability and future development within the community.
Patent Database: Their Importance in Prior Art Documentation and Patent Search
In knowledge-based economies, a nation’s economic status depends on the production, distribution and use of knowledge and information. The recent trend in the economic growth of nations is mainly determined by the innovative technological know-how of individuals. Intellectual property has gained attention in this era of knowledge. The vast amount of data generated through the application of intellectual assets is managed with the help of various in silico tools. In recent days, patent databases have gained importance due to the detailed information available on granted patents and other details, such as the legal status of patent applications, which are not available through any other literature search. This review paper attempts to describe the different types of patent databases available, their unique features, strengths, weaknesses and their major purpose. This paper details how to access a patent database, the relevance of patent information obtained from these databases in prior art search and patent analysis, and the drawbacks present in these patent databases.
Information retrieval and text mining technologies for chemistry
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
Science Inside Law: The Making of a New Patent Class in the International Patent Classification
Recent studies of patents have argued that the very materiality and techniques of legal media, such as the written patent document, are vital for the legal construction of a patentable invention. Developing the centrality placed on patent documents further, it becomes important to understand how these documents are ordered and mobilized. Patent classification answers the necessity of making the virtual nature of textual claims practicable by linking written inscription to bureaucracy. Here, the epistemological organization of documents overlaps with the grid of patent administration. How are scientific inventions represented in such a process? If we examine the process of creating a new patent category within the International Patent Classification (IPC), it becomes clear that disagreements about the substance of the novel inventive subject matter have been resolved by computer simulations of patent documents in draft classifications. The practical needs of patent examiners were the most important concerns in the making of a new category. Such a lack of epistemological mediation between the scientific and legal identities of an invention depicts a legal understanding that science is already inside patent law. From an internal legal perspective, the self-referential introduction of the new patent category may make practical sense; however, it becomes problematic from a technological and scientific standpoint as the remit of the patent classification also affects other social contexts and practice.
Towards Inference of a Biochemical Ontology From a Metabolic Database
In order to predict the metabolic fate of an arbitrary compound based solely on structure, it is useful to be able to identify substructural ‘functional groups’ that are biochemically reactive. These functional groups are the substructural elements that can be removed and replaced to transform one compound into another. This problem of identifying functional groups is related to the problem of classifying compounds. The research presented here discusses the state of the art in biochemical databases and how these sources may be applied to the problem of classifying compounds based solely on structure. We describe a biochemical informatics system for processing molecular data and describe how 100 255 compositional (hasA) relationships are inferred between 835 abstractions and 9500 metabolites from the KEGG Ligand database. Specifically, we focus on the identification of amino acids and consider ways in which the inference of biochemical ontologies for metabolites will be improved in the future.
Chemical Information Bulletin
Created as a supplement for "the regular journals of the American Chemical Society," this publication contains annotated bibliographies of chemical documentation literature as well as information about meetings, conferences, awards, scholarships, and other news from the American Chemical Society (ACS) Division of Chemical Information (CINF).