2,614 research outputs found

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.

    A.V. and M.K. acknowledge funding from the European Community's Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.

    Web Server for Protein Interaction Searching

    This thesis examines ways of retrieving data from bioinformatics databases that hold protein interaction data. It first gives context on the emergence of bioinformatics from the merging of two fields, biology and informatics, then introduces the importance of protein interactions and the available ways of retrieving protein interaction data from interaction databases. It also explains the motivation behind the IMEx consortium, its goals, and the means by which it achieves them. IMEx has produced many standards and data formats that facilitate access to its members' data and the exchange of data between them. One of its products is the PSICQUIC web service, which was designed in the REST architectural style and is also accessible via the SOAP protocol. Both the REST and SOAP approaches are studied and compared in this thesis, and on the basis of that research an application is implemented for retrieving protein interaction data from the databases of PSICQUIC members.

    Design of a Structure Search Engine for Chemical Compound Database

    The search for structural fragments (substructures) of compounds is very important in medicinal chemistry, QSAR, spectroscopy, and many other fields. In the last decade, with the development of hardware and the evolution of database technologies, more and more chemical compound database applications have been developed, along with interfaces for searching for targets based on user input. Due to the algorithmic complexity of structure comparison, essentially a graph isomorphism problem, current applications mainly work by approximating the comparison problem based on certain chemical perceptions, and their search interfaces are often e-mail based. The approximation procedure usually invokes subjective assumptions. The accuracy of the search is therefore undermined, which may not be acceptable to researchers because, in time-consuming drug design, accuracy is always the first priority. In this dissertation, a design of a search engine for chemical compound databases is presented. The design focuses on providing a solution for developing an accurate and fast search engine without sacrificing performance. The solution is comprehensive in that a series of related problems are addressed throughout the dissertation with proposed methods. Based on the design, a flexible computing model for a compound search engine can be established, and the model can easily be applied to other applications as well. To verify the solution in a practical manner, an implementation based on the presented solution was developed. The implementation clarifies the coupling between theoretical design and technical development. In addition, a workable implementation can be deployed to test the efficiency and effectiveness of the design under a variety of experimental data.
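
    To make the graph-matching view of substructure search concrete, the following is a minimal sketch of exact matching by backtracking. It is not the dissertation's design: the graph representation, the function names (`make_graph`, `find_substructure`), and the decision to ignore bond orders and hydrogens are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# A molecule is modelled as a labelled graph: atom index -> element symbol,
# plus an adjacency map. Bond orders and hydrogens are ignored (assumption).
Graph = Tuple[Dict[int, str], Dict[int, Set[int]]]


def make_graph(atoms: List[str], bonds: List[Tuple[int, int]]) -> Graph:
    """Build an undirected labelled graph from an atom list and bond pairs."""
    labels = dict(enumerate(atoms))
    adj: Dict[int, Set[int]] = {i: set() for i in labels}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return labels, adj


def find_substructure(query: Graph, target: Graph) -> Optional[Dict[int, int]]:
    """Return one mapping query atom -> target atom, or None if no match."""
    q_labels, q_adj = query
    t_labels, t_adj = target
    order = sorted(q_labels)  # fixed order in which query atoms are matched

    def extend(mapping: Dict[int, int]) -> Optional[Dict[int, int]]:
        if len(mapping) == len(order):
            return dict(mapping)
        q = order[len(mapping)]
        used = set(mapping.values())
        for t in t_labels:
            if t in used or t_labels[t] != q_labels[q]:
                continue
            # Every already-mapped neighbour of q must be bonded to t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]  # backtrack and try the next candidate
        return None

    return extend({})


# Heavy atoms of ethanol (C-C-O) as the target; two small query fragments.
ethanol = make_graph(["C", "C", "O"], [(0, 1), (1, 2)])
c_o = make_graph(["C", "O"], [(0, 1)])
n_o = make_graph(["N", "O"], [(0, 1)])

print(find_substructure(c_o, ethanol))
print(find_substructure(n_o, ethanol))
```

    The search is exact but exponential in the worst case, which illustrates why practical systems often resort to prefiltering or approximation before running a full comparison.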

    Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry

    BACKGROUND: Structure elucidation of unknown small molecules by mass spectrometry is a challenge despite advances in instrumentation. The first crucial step is to obtain correct elemental compositions. In order to automatically constrain the thousands of possible candidate structures, rules need to be developed to select the most likely and chemically correct molecular formulas.
    RESULTS: An algorithm for filtering molecular formulas is derived from seven heuristic rules: (1) restrictions for the number of elements, (2) LEWIS and SENIOR chemical rules, (3) isotopic patterns, (4) hydrogen/carbon ratios, (5) element ratio of nitrogen, oxygen, phosphorus, and sulphur versus carbon, (6) element ratio probabilities and (7) presence of trimethylsilylated compounds. Formulas are ranked according to their isotopic patterns and subsequently constrained by presence in public chemical databases. The seven rules were developed on 68,237 existing molecular formulas and were validated in four experiments. First, 432,968 formulas covering five million PubChem database entries were checked for consistency. Only 0.6% of these compounds did not pass all rules. Next, the rules were shown to effectively reduce the eight billion theoretically possible C, H, N, S, O, P-formulas up to 2000 Da to only 623 million most probable elemental compositions. Third, 6,000 pharmaceutical, toxic and natural compounds were selected from the DrugBank, TSCA and DNP databases. The correct formulas were retrieved as top hit at 80–99% probability when assuming data acquisition with complete resolution of unique compounds, 5% absolute isotope ratio deviation and 3 ppm mass accuracy. Last, some exemplary compounds were analyzed by Fourier transform ion cyclotron resonance mass spectrometry and by gas chromatography-time of flight mass spectrometry. In each case, the correct formula was ranked as top hit when combining the seven rules with database queries.
    CONCLUSION: The seven rules enable an automatic exclusion of molecular formulas which are either wrong or which contain an unlikely high or low number of elements. The correct molecular formula is assigned with a probability of 98% if the formula exists in a compound database. For truly novel compounds that are not present in databases, the correct formula is found in the first three hits with a probability of 65–81%. Corresponding software and supplemental data are available for download from the authors' website.
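
    As an illustration of what rules (2) and (4) involve, here is a minimal sketch of a Senior-style valence check and a hydrogen/carbon ratio helper. It is not the authors' software: the valence table is an assumption chosen for illustration (several of these elements can take higher valences), the function names are hypothetical, and no ratio thresholds from the paper are reproduced.

```python
import re
from typing import Dict

# Assumed valences for illustration only; S and P in particular can take
# higher valence states, which a real implementation would have to handle.
VALENCE = {"C": 4, "H": 1, "N": 3, "O": 2, "S": 2, "P": 3}


def parse_formula(formula: str) -> Dict[str, int]:
    """Parse a simple formula such as 'C2H6O' into element counts."""
    counts: Dict[str, int] = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def passes_senior(formula: str) -> bool:
    """Check three graph-theoretic conditions on the sum of valences."""
    counts = parse_formula(formula)
    total_atoms = sum(counts.values())
    valence_sum = sum(VALENCE[el] * n for el, n in counts.items())
    max_valence = max(VALENCE[el] for el in counts)
    return (
        valence_sum % 2 == 0                       # every bond uses two valences
        and valence_sum >= 2 * max_valence         # most-connected atom can be saturated
        and valence_sum >= 2 * (total_atoms - 1)   # enough bonds for a connected graph
    )


def hc_ratio(formula: str) -> float:
    """Hydrogen/carbon ratio; the formula must contain carbon."""
    counts = parse_formula(formula)
    return counts.get("H", 0) / counts["C"]


print(passes_senior("C2H6O"), passes_senior("C2H7O"), passes_senior("CH6"))
print(hc_ratio("C2H6O"))
```

    A filter in this style would reject a candidate as soon as one check fails, leaving the more expensive isotopic-pattern ranking for the formulas that survive.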

    Broadening the horizon – level 2.5 of the HUPO-PSI format for molecular interactions

    BACKGROUND: Molecular interaction information is a key resource in modern biomedical research. Publicly available data have previously been provided in a broad array of diverse formats, making the data very difficult to access. The publication and wide implementation of the Human Proteome Organisation Proteomics Standards Initiative Molecular Interactions (HUPO PSI-MI) format in 2004 was a major step towards the establishment of a single, unified format by which molecular interactions should be presented, but it focused purely on protein-protein interactions.
    RESULTS: The HUPO-PSI has further developed the PSI-MI XML schema to enable the description of interactions between a wider range of molecular types, for example nucleic acids, chemical entities, and molecular complexes. Extensive details about each supported molecular interaction can now be captured, including the biological role of each molecule within that interaction, a detailed description of interacting domains, and the kinetic parameters of the interaction. The format is supported by data management and analysis tools and has been adopted by major interaction data providers. Additionally, a simpler, tab-delimited format, MITAB2.5, has been developed for the benefit of users who require only minimal information in an easy-to-access configuration.
    CONCLUSION: The PSI-MI XML2.5 and MITAB2.5 formats have been jointly developed by interaction data producers and providers from both the academic and commercial sectors, and are already widely implemented and well supported by an active development community. PSI-MI XML2.5 enables the description of highly detailed molecular interaction data and facilitates data exchange between databases and users without loss of information. MITAB2.5 is a simpler format appropriate for fast Perl parsing or loading into Microsoft Excel.
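
    Because MITAB2.5 is a plain tab-delimited format, a record can be read with very little code. The sketch below assumes the 15-column MITAB2.5 layout, with `-` for an empty field and `|` separating multiple values. The short field names and the sample line are invented for illustration, and the naive split on `|` would mishandle a pipe inside a quoted value.

```python
from typing import Dict, List

# Short names chosen here for the 15 MITAB2.5 columns (names are assumptions).
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_methods", "first_authors", "publication_ids",
    "taxid_a", "taxid_b", "interaction_types", "source_dbs",
    "interaction_ids", "confidence",
]


def parse_mitab25_line(line: str) -> Dict[str, List[str]]:
    """Split one MITAB2.5 line into named fields, each a list of values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(MITAB25_COLUMNS):
        raise ValueError(
            f"expected {len(MITAB25_COLUMNS)} columns, got {len(fields)}"
        )
    record: Dict[str, List[str]] = {}
    for name, raw in zip(MITAB25_COLUMNS, fields):
        # '-' marks an empty field; '|' separates multiple values.
        record[name] = [] if raw == "-" else raw.split("|")
    return record


# Made-up sample record; identifiers do not refer to real database entries.
sample = "\t".join([
    "uniprotkb:P00001", "uniprotkb:P00002", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Doe et al. (2000)", "pubmed:0000000",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)', 'psi-mi:"MI:0469"(IntAct)',
    "intact:EBI-0000000|imex:IM-00000", "-",
])

record = parse_mitab25_line(sample)
print(record["id_a"], record["interaction_ids"])
```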

    ClassyFire: automated chemical classification with a comprehensive, computable taxonomy

    Additional file 5. Use cases. Text-based search on the ClassyFire web server. (A) Building the query. (B) Sparteine, one of the returned compounds

    Domain-specific ChatBots for Science using Embeddings

    Large language models (LLMs) have emerged as powerful machine-learning systems capable of handling a myriad of tasks. Tuned versions of these systems have been turned into chatbots that can respond to user queries on a vast diversity of topics, providing informative and creative replies. However, their application to physical science research remains limited owing to their incomplete knowledge in these areas, contrasted with the needs of rigor and sourcing in science domains. Here, we demonstrate how existing methods and software tools can be easily combined to yield a domain-specific chatbot. The system ingests scientific documents in existing formats, and uses text embedding lookup to provide the LLM with domain-specific contextual information when composing its reply. We similarly demonstrate that existing image embedding methods can be used for search and retrieval across publication figures. These results confirm that LLMs are already suitable for use by physical scientists in accelerating their research efforts. Comment: 12 pages, 5 figures
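
    The embedding lookup step described in this abstract can be sketched as a nearest-neighbour search over precomputed vectors. This is a toy illustration, not the paper's system: the three-dimensional vectors and chunk texts are made up, the helper names are hypothetical, and a real pipeline would obtain its vectors from an embedding model and pass the prompt to an LLM.

```python
import math
from typing import List, Sequence, Tuple

Chunk = Tuple[str, Sequence[float]]  # (text, embedding vector)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], chunks: List[Chunk], k: int = 2) -> List[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context: List[str]) -> str:
    """Prepend the retrieved chunks to the question as context for the model."""
    joined = "\n".join(f"- {text}" for text in context)
    return f"Context:\n{joined}\n\nQuestion: {question}"


# Made-up chunks and vectors standing in for an embedded document collection.
chunks: List[Chunk] = [
    ("X-ray scattering measures structure.", [0.9, 0.1, 0.0]),
    ("Grant deadlines are in May.", [0.0, 0.2, 0.9]),
    ("Scattering data need calibration.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded user question

context = top_k(query_vec, chunks, k=2)
print(build_prompt("How is scattering data used?", context))
```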