
    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys across the chemical disciplines. Retrieval of chemical information in most cases starts with finding the documents relevant to a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in text, which commonly involves extracting the entire list of chemicals mentioned in a document together with any associated information. In this Review, we provide a comprehensive and in-depth description of the fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges that assess system performance, in particular the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Given the growing interest in automatically annotated chemical knowledge bases that integrate chemical information with biological data, we also present cheminformatics approaches for mapping extracted chemical names onto chemical structures and annotating them, together with text mining applications that link chemistry to biological information. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
    A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by the Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
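    The entity-recognition step that such retrieval pipelines start from can be sketched as a minimal dictionary-plus-pattern tagger. The gazetteer entries and suffix patterns below are illustrative assumptions, not the CHEMDNER systems the Review surveys:

```python
import re

# Toy chemical named-entity tagger: a small gazetteer plus a few
# common chemical-name suffixes (illustrative assumptions only).
GAZETTEER = {"aspirin", "ethanol", "caffeine"}
SUFFIX = re.compile(r"\w+(?:ol|ine|ate|ide|ane)")

def find_chemicals(text):
    """Return candidate chemical mentions found in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t in GAZETTEER or SUFFIX.fullmatch(t)]

# Both the dictionary rule and the suffix rule fire here.
print(find_chemicals("Aspirin was dissolved in ethanol."))
```

Real systems replace the gazetteer with curated chemical dictionaries and the suffix rule with learned sequence models, but the lookup-plus-pattern structure is the same.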

    Using rule-based natural language processing to improve disease normalization in biomedical text

    Background and objective: For computers to extract useful information from unstructured text, a concept normalization system is needed that links relevant concepts in a text to sources containing further information about them. Popular concept normalization tools in the biomedical field are dictionary-based. In this study we investigate the usefulness of natural language processing (NLP) as an adjunct to dictionary-based concept normalization. Methods: We compared the performance of two biomedical concept normalization systems, MetaMap and Peregrine, on the Arizona Disease Corpus, with and without a rule-based NLP module. Performance was assessed for exact and inexact boundary matching of the system annotations against those of the gold standard, and for concept-identifier matching. Results: Without the NLP module, MetaMap and Peregrine attained F-scores of 61.0% and 63.9%, respectively, for exact boundary matching, and 55.1% and 56.9% for concept-identifier matching. With the NLP module, the F-scores of MetaMap and Peregrine improved to 73.3% and 78.0% for boundary matching, and to 66.2% and 69.8% for concept-identifier matching. For inexact boundary matching, performance further increased to 85.5% and 85.4%, and to 73.6% and 73.3% for concept-identifier matching. Conclusions: We have shown the added value of NLP for the recognition and normalization of diseases with MetaMap and Peregrine. The NLP module is general and can be applied in combination with any concept normalization system. Whether its use is equally advantageous for concept types other than disease remains to be investigated.
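    The dictionary-based normalization that the NLP module augments can be sketched as a lookup preceded by small rule-based rewrites. The concept IDs, dictionary entries, and rewrite rules below are illustrative assumptions, not MetaMap or Peregrine:

```python
# Toy concept normalizer: dictionary lookup preceded by rule-based
# rewrites (lowercasing, comma inversion, crude plural stripping).
CONCEPTS = {
    "breast cancer": "C0006142",       # illustrative concept IDs
    "diabetes mellitus": "C0011849",
}

def rewrite(mention):
    m = mention.lower().strip()
    if "," in m:  # "cancer, breast" -> "breast cancer"
        head, mod = [p.strip() for p in m.split(",", 1)]
        m = f"{mod} {head}"
    if m.endswith("s") and m[:-1] in CONCEPTS:  # crude plural stripping
        m = m[:-1]
    return m

def normalize(mention):
    """Map a disease mention to a concept ID, or None if unknown."""
    return CONCEPTS.get(rewrite(mention))

print(normalize("Cancer, breast"))  # dictionary hit after inversion
```

The paper's point is exactly this division of labour: the rewrites (its NLP module) are independent of which dictionary system performs the final lookup.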

    charmm2gmx: An Automated Method to Port the CHARMM Additive Force Field to GROMACS

    CHARMM is one of the most widely used biomolecular force fields. Although developed in close connection with a dedicated molecular simulation engine of the same name, it is also usable with other codes. GROMACS is a well-established, highly optimized, multipurpose molecular dynamics package, versatile enough to accommodate many different force-field potential functions and the associated algorithms. Due to conceptual differences in software design and the large amount of numeric data inherent in residue topologies and parameter sets, conversion from one software format to another is not straightforward. Here, we present an automated and validated means of porting the CHARMM force field to a format read by the GROMACS engine, harmonizing the different capabilities of the two codes in a self-documenting and reproducible way with a bare minimum of user interaction. Because it is based entirely on the upstream data files, the presented approach does not involve any hard-coded data, in contrast with previous attempts to solve the same problem. The heuristic approach used for perceiving the local internal geometry is directly applicable to analogous transformations of other force fields.
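    One piece of bookkeeping such a converter must get right is units and functional forms: CHARMM bond terms are K(b − b0)² in kcal/mol/Å², while GROMACS uses ½k(b − b0)² in kJ/mol/nm². A minimal sketch of that single conversion (the example parameter values are illustrative, not taken from the CHARMM distribution):

```python
# CHARMM bond energy: K * (b - b0)^2, K in kcal/mol/A^2, b0 in A.
# GROMACS bond energy: 0.5 * k * (b - b0)^2, k in kJ/mol/nm^2, b0 in nm.
KCAL_TO_KJ = 4.184   # thermochemical calorie
A2_TO_NM2 = 100.0    # 1/A^2 -> 1/nm^2

def convert_bond(k_charmm, b0_angstrom):
    """Convert one CHARMM bond parameter to GROMACS conventions."""
    k_gmx = 2.0 * k_charmm * KCAL_TO_KJ * A2_TO_NM2  # factor 2 absorbs the 1/2
    return k_gmx, b0_angstrom * 0.1

k, b0 = convert_bond(340.0, 1.09)  # illustrative C-H-like bond term
print(round(k, 1), round(b0, 4))   # → 284512.0 0.109
```

Angles, dihedrals, and nonbonded terms each need analogous (and less trivial) harmonization, which is why the paper automates the whole pipeline from the upstream data files.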

    Novel concepts for lipid identification from shotgun mass spectra using a customized query language

    Lipids are the main component of semipermeable cell membranes and are linked to several important physiological processes. Shotgun lipidomics relies on the direct infusion of total lipid extracts from cells, tissues, or organisms into the mass spectrometer and is a powerful tool for elucidating their molecular composition. Despite the technical advances in modern mass spectrometry, currently available software underperforms in several aspects of the lipidomics pipeline. This thesis addresses these issues by presenting a new concept for lipid identification that uses a customized query language for mass spectra in combination with efficient spectra-alignment algorithms, implemented in the open-source kit “LipidXplorer”.
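    A core operation behind any such query language is matching a theoretical m/z value against measured peaks within a mass tolerance. A minimal sketch, assuming a sorted peak list and a ppm tolerance (the values are illustrative; this is not LipidXplorer's MFQL engine):

```python
import bisect

def match_peak(peaks_mz, target, tol_ppm=10.0):
    """Return the measured m/z closest to `target` within `tol_ppm`,
    or None. `peaks_mz` must be sorted in ascending order."""
    tol = target * tol_ppm * 1e-6
    i = bisect.bisect_left(peaks_mz, target - tol)  # first candidate
    best = None
    while i < len(peaks_mz) and peaks_mz[i] <= target + tol:
        if best is None or abs(peaks_mz[i] - target) < abs(best - target):
            best = peaks_mz[i]
        i += 1
    return best

peaks = [760.5851, 786.6007, 810.6007]   # illustrative measured peaks
print(match_peak(peaks, 786.6010))       # within 10 ppm of 786.6007
```

A query language then composes many such matches (precursor and fragment masses, intensity constraints) into a single declarative identification rule.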

    Acta Cybernetica: Volume 18, Number 3.


    Semantic Biclustering

    This thesis focuses on the problem of finding interpretable and predictive patterns, expressed in the form of biclusters, with an orientation to biological data. The presented methods are collectively called semantic biclustering, a subfield of data mining. The term is used because the process both finds coherent subsets of rows and columns in a 2-dimensional binary matrix and simultaneously takes into account the mutual semantic meaning of the elements in such biclusters. Although motivated by biological data, the developed algorithms are generally applicable to any other research field; the only limitations concern the format of the input data. The thesis introduces two novel, and in that context basic, approaches for finding semantic biclusters: Bicluster enrichment analysis and Rule and tree learning. Since these methods do not exploit the native hierarchical ordering of terms in the input ontologies, their run-time is generally long and an induced hypothesis may contain redundant terms. For this reason, a new refinement operator was devised. It was incorporated into the well-known CN2 algorithm and introduces two reduction procedures, Redundant Generalization and Redundant Non-potential, both of which dramatically prune the rule space and consequently speed up the entire rule-induction process in comparison with the traditional refinement operator of CN2. The algorithm, together with the reduction procedures, was published as an R package that we called sem1R. To demonstrate the practical use of semantic biclustering on real biological problems, the thesis also describes and specifically adapts the sem1R algorithm for two tasks. First, we study its application in an analysis of E3 ubiquitin ligase in the gastrointestinal tract with respect to tissue-regeneration potential. Second, besides discovering biclusters in gene expression data, we adapt sem1R to find potentially pathogenic genetic variants in a cohort of patients.
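    The intuition behind pruning redundant terms can be sketched directly: adding an ontology term that is an ancestor of a term already in a rule cannot change the rule's coverage, so the candidate is discarded. The toy ontology below is an illustrative assumption, not the sem1R implementation:

```python
# Toy ontology: child -> parent (illustrative, e.g. GO-like terms).
PARENT = {
    "apoptotic process": "cell death",
    "cell death": "biological process",
}

def ancestors(term):
    """Collect all ancestors of `term` by walking the parent links."""
    out = set()
    while term in PARENT:
        term = PARENT[term]
        out.add(term)
    return out

def is_redundant_generalization(rule_terms, candidate):
    """True if `candidate` generalizes a term already in the rule,
    i.e. it is an ancestor of one of `rule_terms`."""
    return any(candidate in ancestors(t) for t in rule_terms)

print(is_redundant_generalization({"apoptotic process"}, "cell death"))
```

Skipping such candidates shrinks the refinement search space without losing any rule coverage, which is the source of the speed-up over plain CN2 refinement.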

    Automatic Population of Structured Reports from Narrative Pathology Reports

    There are a number of advantages to using structured pathology reports: they can ensure the accuracy and completeness of pathology reporting, and it is easier for referring doctors to glean pertinent information from them. The goal of this thesis is to extract pertinent information from free-text pathology reports, automatically populate structured reports for cancer diseases, and identify the commonalities and differences in processing principles needed to obtain maximum accuracy. Three pathology corpora were annotated with entities and the relationships between them in this study: the melanoma corpus, the colorectal cancer corpus, and the lymphoma corpus. A supervised machine-learning-based approach, utilising conditional random fields learners, was developed to recognise medical entities in the corpora. Through feature engineering, the best feature configurations were attained, which boosted the F-scores significantly, by 4.2% to 6.8%, on the training sets. Without proper negation and uncertainty detection, the quality of the structured reports would be diminished, so negation and uncertainty detection modules were built to handle this problem. These modules obtained overall F-scores ranging from 76.6% to 91.0% on the test sets. A relation extraction system was presented to extract four relations from the lymphoma corpus. The system achieved very good performance on the training set, with a 100% F-score obtained by the rule-based module and a 97.2% F-score attained by the support vector machines classifier. Rule-based approaches were used to generate the structured outputs and populate predefined templates; the rule-based system attained over 97% F-scores on the training sets. A pipeline system was implemented as an assembly of all the components described above. It achieved promising results in the end-to-end evaluations, with 86.5%, 84.2%, and 78.9% F-scores on the melanoma, colorectal cancer, and lymphoma test sets, respectively.
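    The feature-engineering step for a conditional-random-fields recogniser typically turns each token into a dictionary of features drawn from the token itself and its neighbours. A minimal sketch of such a feature extractor (the feature set is an illustrative assumption, not the thesis's exact configuration):

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i], a typical CRF input unit."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),       # normalized surface form
        "is_title": tok.istitle(),  # capitalization cue
        "is_digit": tok.isdigit(),  # numeric measurements
        "suffix3": tok[-3:],        # morphological cue
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Example sentence fragment from a hypothetical melanoma report.
print(token_features(["Breslow", "thickness", "1.2", "mm"], 0))
```

Varying which of these (and richer) features are included is what the "feature configurations" compared in the thesis amount to.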

    Novel approaches for bond order assignment and NMR shift prediction

    Molecular modelling is one of the cornerstones of modern biological and pharmaceutical research. Accurate modelling approaches easily become computationally overwhelming, and thus different levels of approximation are typically employed. In this work, we develop such approximation approaches for problems arising in structural bioinformatics. A fundamental approximation of molecular physics is the classification of chemical bonds, usually in the form of integer bond orders. Many input data sets lack this information, but several problems render automated bond order assignment highly challenging. For this task, we develop the BOA Constructor method, which accounts for the non-uniqueness of solutions and allows simple extensibility. Testing our method on large evaluation sets, we demonstrate how it improves on the state of the art. Besides traditional applications, bond orders yield valuable input for the approximation of molecular quantities by statistical means. One such problem is the prediction of NMR chemical shifts of protein atoms. We present our pipeline NightShift for automated model generation, use it to create a new prediction model called Spinster, and demonstrate that it outperforms established, manually developed approaches. Combining Spinster and BOA Constructor, we create the Liops model, which for the first time allows the influence of non-protein atoms to be included efficiently. Finally, we describe our work on manual modelling techniques, including molecular visualization and novel input paradigms.
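    Bond order assignment can be viewed as a constraint-satisfaction problem: choose integer orders so that every atom's summed bond order equals an allowed valence. A brute-force sketch for tiny molecules (illustrative only, not the BOA Constructor algorithm, which also scores and ranks the non-unique solutions):

```python
from itertools import product

# Allowed total valences per element (simplified, no charges).
VALENCE = {"C": {4}, "O": {2}, "H": {1}}

def assign_bond_orders(atoms, bonds):
    """atoms: {id: element}; bonds: list of (i, j) pairs.
    Returns {(i, j): order} for the first valid assignment, or None."""
    for orders in product((1, 2, 3), repeat=len(bonds)):
        total = {a: 0 for a in atoms}
        for (i, j), o in zip(bonds, orders):
            total[i] += o
            total[j] += o
        if all(total[a] in VALENCE[atoms[a]] for a in atoms):
            return dict(zip(bonds, orders))
    return None

# Formaldehyde skeleton: a carbon bonded to O, H, H.
atoms = {1: "C", 2: "O", 3: "H", 4: "H"}
bonds = [(1, 2), (1, 3), (1, 4)]
print(assign_bond_orders(atoms, bonds))  # finds the C=O double bond
```

The exponential search space and the existence of several chemically plausible solutions are exactly the difficulties the abstract refers to.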

    Text Augmentation: Inserting markup into natural language text with PPM Models

    This thesis describes a new optimisation and new heuristics for automatically marking up XML documents. These are implemented in CEM, using PPM models. CEM is significantly more general than previous systems, marking up large numbers of hierarchical tags, using n-gram models for large n and a variety of escape methods. Four corpora are discussed, including the bibliography corpus of 14,682 bibliographies laid out in seven standard styles using the BibTeX system and marked up in XML with every field from the original BibTeX. The other corpora are the ROCLING Chinese text segmentation corpus, the Computists’ Communique corpus, and the Reuters’ corpus. A detailed examination is presented of methods for evaluating markup algorithms, including computational complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning, and information theory. A new taxonomy of markup complexities is established, and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and optimisation is examined using the four corpora.
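    The order-n context modelling that PPM rests on can be sketched as follows: predict the next character from the longest previously seen context, escaping to shorter contexts when the current one is novel. This is an illustrative simplification; real PPM also assigns probability mass to the escape event itself rather than just falling back:

```python
from collections import defaultdict

def train(text, order=2):
    """Count next-character frequencies for all contexts up to `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text)):
        for k in range(order + 1):
            if i - k >= 0:
                counts[text[i - k:i]][text[i]] += 1
    return counts

def predict(counts, context, order=2):
    """Most likely next character, escaping to shorter contexts."""
    for k in range(min(order, len(context)), -1, -1):
        ctx = context[len(context) - k:]
        if ctx in counts and counts[ctx]:
            return max(counts[ctx], key=counts[ctx].get)
    return None

model = train("the theme, the thesis", order=2)
print(predict(model, "th"))  # 'e' follows "th" everywhere in training
```

Markup insertion then amounts to asking such models which tag boundary makes the observed character stream most probable at each position.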