5,849 research outputs found

    Information retrieval and text mining technologies for chemistry

    Get PDF
    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo Garciá -Yoldi for useful feedback and discussions during the preparation of the manuscript.info:eu-repo/semantics/publishedVersio

    Efficient identification of Tanimoto nearest neighbors; All Pairs Similarity Search Using the Extended Jaccard Coefficient

    Get PDF
    Tanimoto, or extended Jaccard, is an important similarity measure which has seen prominent use in fields such as data mining and chemoinformatics. Many of the existing state-of-the-art methods for market basket analysis, plagiarism and anomaly detection, compound database search, and ligand-based virtual screening rely heavily on identifying Tanimoto nearest neighbors. Given the rapidly increasing size of data that must be analyzed, new algorithms are needed that can speed up nearest neighbor search, while at the same time providing reliable results. While many search algorithms address the complexity of the task by retrieving only some of the nearest neighbors, we propose a method that finds all of the exact nearest neighbors efficiently by leveraging recent advances in similarity search filtering. We provide tighter filtering bounds for the Tanimoto coefficient and show that our method, TAPNN, greatly outperforms existing baselines across a variety of real-world datasets and similarity thresholds

    Nanotechnology Publications and Patents: A Review of Social Science Studies and Search Strategies

    Get PDF
    This paper provides a comprehensive review of more than 120 social science studies in nanoscience and technology, all of which analyze publication and patent data. We conduct a comparative analysis of bibliometric search strategies that these studies use to harvest publication and patent data related to nanoscience and technology. We implement these strategies on 2006 publication data and find that Mogoutov and Kahane (2007), with their evolutionary lexical query search strategy, extract the highest number of records from the Web of Science. The strategies of Glanzel et al. (2003), Noyons et al. (2003), Porter et al. (2008) and Mogoutov and Kahane (2007) produce very similar ranking tables of the top ten nanotechnology subject areas and the top ten most prolific countries and institutions.nanotechnology, research and development, productivity, publications, patents, bibliometric analysis, search strategy

    Using natural language processing techniques to inform research on nanotechnology

    Get PDF
    Literature in the field of nanotechnology is exponentially increasing with more and more engineered nanomaterials being created, characterized, and tested for performance and safety. With the deluge of published data, there is a need for natural language processing approaches to semi-automate the cataloguing of engineered nanomaterials and their associated physico-chemical properties, performance, exposure scenarios, and biological effects. In this paper, we review the different informatics methods that have been applied to patent mining, nanomaterial/device characterization, nanomedicine, and environmental risk assessment. Nine natural language processing (NLP)-based tools were identified: NanoPort, NanoMapper, TechPerceptor, a Text Mining Framework, a Nanodevice Analyzer, a Clinical Trial Document Classifier, Nanotoxicity Searcher, NanoSifter, and NEIMiner. We conclude with recommendations for sharing NLP-related tools through online repositories to broaden participation in nanoinformatics

    Scientometric mapping as a strategic intelligence tool for the governance of emerging technologies

    Get PDF
    How can scientometric mapping function as a tool of ’strategic intelligence’ to aid the governance of emerging technologies? The present paper aims to address this question by focusing on a set of recently developed scientometric techniques, namely overlay mapping. We examine the potential these techniques have to inform, in a timely manner, analysts and decision-makers about relevant dynamics of technical emergence. We investigate the capability of overlay mapping in generating informed perspectives about emergence across three spaces: geographical, social, and cognitive. Our analysis relies on three empirical studies of emerging technologies in the biomedical domain: RNA interference (RNAi), Human Papilloma Virus (HPV) testing technologies for cervical cancer, and Thiopurine Methyltransferase (TPMT) genetic testing. The case-studies are analysed and mapped longitudinally by using publication and patent data. Results show the variety of ’intelligence’ inputs overlay mapping can produce for the governance of emerging technologies. Overlay mapping also confers to the investigation of emergence flexibility and granularity in terms of adaptability to different sources of data and selection of the levels of the analysis, respectively. These features make possible the integration and comparison of results from different contexts and cases, thus providing possibilities for a potentially more ’distributed’ strategic intelligence. The generated perspectives allow triangulation of findings, which is important given the complexity featuring in technical emergence and the limitations associated with the use of single scientometric approaches
    corecore