Ambiguity of human gene symbols in LocusLink and MEDLINE: creating an inventory and a disambiguation test collection
Genes are discovered almost on a daily basis and new names have to be
found. Although there are guidelines for gene nomenclature, the naming
process is highly creative. Human genes are often named with a gene symbol
and a longer, more descriptive term; the short form is very often an
abbreviation of the long form. Abbreviations in biomedical language are
highly ambiguous, i.e., one gene symbol often refers to more than one
gene. Using an existing abbreviation expansion algorithm, we explore MEDLINE
for the use of human gene symbols derived from LocusLink. It turns out
that just over 40% of these symbols occur in MEDLINE; however, many of
these occurrences are not related to genes. In the process of making an
inventory, a disambiguation test collection is constructed automatically
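
The idea of a symbol inventory can be illustrated with a minimal sketch. This is not the authors' procedure: the record layout (pairs of gene ID and symbol), the case folding, and the function name `ambiguous_symbols` are assumptions made only for illustration.

```python
from collections import defaultdict

def ambiguous_symbols(records):
    """Return symbols that are attached to more than one gene ID.

    `records` is an iterable of (gene_id, symbol) pairs (hypothetical layout).
    """
    by_symbol = defaultdict(set)
    for gene_id, symbol in records:
        # Assumption: symbols are compared case-insensitively.
        by_symbol[symbol.upper()].add(gene_id)
    # Keep only symbols shared by two or more distinct genes.
    return {s: ids for s, ids in by_symbol.items() if len(ids) > 1}
```

A symbol that appears in the result would need disambiguation in text; a symbol that maps to a single gene can still be ambiguous with non-gene abbreviations, which this sketch does not cover.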
Electronic data sources for kinetic models of cell signaling
Functional understanding of signaling pathways requires detailed information about the constituent molecules and their interactions. Simulations of signaling pathways therefore build upon a great deal of data from various sources. We first survey electronic data resources for cell signaling modeling and then broadly classify them into five groups based on the type of data representation. None of the data sources surveyed provides all required data in a ready-to-be-modeled fashion. We then put forward a wish list of desired attributes for an ideal modeling-centric database. Finally, we close with perspectives on how electronic data sources for cell signaling modeling have developed. We suggest that future directions in such data sources are largely model-driven and hinge on the interoperability of data sources
Do peers see more in a paper than its authors?
Recent years have shown a gradual shift in the content of biomedical publications that is freely accessible, from titles and abstracts to full text. This has enabled new forms of automatic text analysis and has given rise to some interesting questions: How informative is the abstract compared to the full text? What important information in the full text is not present in the abstract? What should a good summary contain that is not already in the abstract? Do authors and peers see an article differently? We answer these questions by comparing the information content of the abstract to that in citances, i.e., sentences containing citations to that article. We contrast the important points of an article as judged by its authors versus as seen by peers. Focusing on the area of molecular interactions, we perform manual and automatic analysis, and we find that the set of all citances to a target article not only covers most information (entities, functions, experimental methods, and other biological concepts) found in its abstract, but also contains 20% more concepts. We further present a detailed summary of the differences across information types, and we examine the effects other citations and time have on the content of citances
LAITOR - Literature Assistant for Identification of Terms co-Occurrences and Relationships
Background: Biological knowledge is represented in scientific literature that often describes the function of genes/proteins (bioentities) in terms of their interactions (biointeractions). Such bioentities are often related to biological concepts of interest that are specific to a particular research field. Therefore, the study of the current literature about a selected topic deposited in public databases facilitates the generation of novel hypotheses associating a set of bioentities to a common context. Results: We created a text mining system (LAITOR: Literature Assistant for Identification of Terms co-Occurrences and Relationships) that analyses co-occurrences of bioentities, biointeractions, and other biological terms in MEDLINE abstracts. The method accounts for the position of the co-occurring terms within sentences or abstracts. The system detected abstracts mentioning protein-protein interactions in a standard test (BioCreative II IAS test data) with a precision of 0.82-0.89 and a recall of 0.48-0.70. We illustrate the application of LAITOR to the detection of plant response genes in a dataset of 1000 abstracts relevant to the topic. Conclusions: Text mining tools combining the extraction of interacting bioentities and biological concepts with network displays can be helpful in developing reasonable hypotheses in different scientific backgrounds.
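
As a rough illustration of sentence-level co-occurrence counting, the sketch below counts pairs of terms that appear in the same sentence of an abstract. It is not LAITOR's implementation: the naive sentence splitter, the whole-word matching, and the function name `sentence_cooccurrences` are assumptions, and the term list is supplied by the caller.

```python
import re
from collections import Counter
from itertools import combinations

def sentence_cooccurrences(abstract, terms):
    """Count unordered pairs of terms that occur in the same sentence."""
    # Naive split on ., ! or ? followed by whitespace (an assumption;
    # biomedical text usually needs a more careful sentence splitter).
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        # Case-insensitive whole-word match for each term.
        present = sorted(
            t for t in terms
            if re.search(r"\b" + re.escape(t.lower()) + r"\b", lowered)
        )
        # Each pair is counted once per sentence, in sorted order.
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts
```

Counting per sentence rather than per abstract is the stricter of the two positions the abstract mentions; passing the whole abstract as a single sentence would give abstract-level counts instead.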
Exploring the Unexplored: Identifying Implicit and Indirect Descriptions of Biomedical Terminologies Based on Multifaceted Weighting Combinations
To retrieve relevant scholarly information from biomedical databases, researchers generally use technical terms as queries, such as names of proteins, genes, diseases, and other biomedical descriptors. However, technical terms have limits as query terms because the scientific literature contains many indirect and conceptual expressions that denote them. Combinatorial weighting schemes, which use various indexing and weighting methods and their combinations, are proposed as an initial approach to this problem. In experiments based on the proposed system and a previously constructed evaluation collection, this approach showed promising results in that one could continually locate new relevant expressions by combining the proposed weighting schemes. Furthermore, the best-performing binary combinations of the weighting schemes, which reflect the inherent traits of the individual schemes, were found to be complementary to each other, and it is possible to find hidden relevant documents based on the proposed methods
Ontology-Based Clinical Information Extraction Using SNOMED CT
Extracting and encoding clinical information captured in unstructured clinical documents with standard medical terminologies is vital to enable secondary use of clinical data from practice. SNOMED CT is the most comprehensive medical ontology with broad types of concepts and detailed relationships and it has been widely used for many clinical applications. However, few studies have investigated the use of SNOMED CT in clinical information extraction.
In this dissertation research, we developed a fine-grained information model based on SNOMED CT and built novel information extraction systems to recognize clinical entities and identify their relations, as well as to encode them to SNOMED CT concepts. Our evaluation shows that such ontology-based information extraction systems using SNOMED CT could achieve state-of-the-art performance, indicating their potential in clinical natural language processing
Automatic extraction of gene and protein synonyms from MEDLINE and journal articles.
Genes and proteins are often associated with multiple names, and more names are added as new functional or structural information is discovered. Because authors often alternate between these synonyms, information retrieval and extraction benefit from identifying these synonymous names. We have developed a method to automatically extract synonymous gene and protein names from MEDLINE and journal articles. We first identified patterns authors use to list synonymous gene and protein names. We developed SGPE (for synonym extraction of gene and protein names), a software program that recognizes the patterns and extracts candidate synonymous terms from MEDLINE abstracts and full-text journal articles. SGPE then applies a sequence of filters that automatically screen out those terms that are not gene and protein names. In our evaluation, the method achieved an overall precision of 71% on both MEDLINE and journal articles, and 90% precision on the more suitable full-text articles alone
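
The two-stage idea in this abstract (match listing patterns, then filter candidates) might be sketched as follows. The cue phrases, the toy filter, and all names here are hypothetical; the abstract does not state which patterns or filters SGPE actually uses.

```python
import re

# Hypothetical cue phrases that introduce a synonym after a name.
SYNONYM_PATTERN = re.compile(
    r"(?P<name>[A-Za-z0-9-]+)\s*[,(]\s*"
    r"(?:also known as|also called|aka)\s+"
    r"(?P<synonym>[A-Za-z0-9-]+)",
    re.IGNORECASE,
)

def looks_like_gene_name(token):
    # Toy filter (an assumption): short tokens with an uppercase letter or digit.
    return len(token) <= 15 and any(c.isupper() or c.isdigit() for c in token)

def extract_synonym_pairs(text):
    """Return (name, synonym) pairs that match a cue pattern and pass the filter."""
    pairs = []
    for match in SYNONYM_PATTERN.finditer(text):
        name, synonym = match.group("name"), match.group("synonym")
        if looks_like_gene_name(name) and looks_like_gene_name(synonym):
            pairs.append((name, synonym))
    return pairs
```

Keeping the filter separate from the pattern makes it easy to chain several filters in sequence, which is the structure the abstract describes.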