
14 research outputs found

Automatic document classification of biological literature

Author: Chen David
Muller Hans-Michael
Sternberg Paul W.
Publication date: 01/08/2006

Background: Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature. Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test set (Reuters 21578) compared to previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept. Conclusions: We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Caltech Authors

Automatic document classification of biological literature

Publication venue: BioMed Central

Springer - Publisher Connector

WormBase: a comprehensive resource for nematode research

Author: Antoshechkin Igor
Bieri Tamberlyn
Blasiar Darin
Chan Juancarlos
Chen Wen J.
Davis Paul
De La Cruz Norie
Duesbury Margaret
Durbin Richard
Fang Ruihua
Fernandes Jolene
Han Michael
Harris Todd W.
Kishore Ranjana
Lee Raymond
Müller Hans-Michael
Nakamura Cecilia
Ozersky Philip
Petcherski Andrei
Rangarajan Arun
Rogers Anthony
Schindelman Gary
Schwarz Erich M.
Spieth John
Stein Lincoln D.
Sternberg Paul W.
Tuli Mary Ann
Van Auken Kimberly
Wang Daniel
Wang Xiaodong
Williams Gary
Yook Karen
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2010

WormBase (http://www.wormbase.org) is a central data repository for nematode biology. Initially created as a service to the Caenorhabditis elegans research field, WormBase has evolved into a powerful research tool in its own right. In the past 2 years, we expanded WormBase to include the complete genomic sequence, gene predictions and orthology assignments from a range of related nematodes. These comparative data enrich the C. elegans data with improved gene predictions and a better understanding of gene function. In turn, they bring the wealth of experimental knowledge of C. elegans to other systems of medical and agricultural importance. Here, we describe new species and data types now available at WormBase. In addition, we detail enhancements to our curatorial pipeline and website infrastructure to accommodate new genomes and an extensive user base.

PubMed Central

Digital Commons@Becker

Caltech Authors

Textpresso for Neuroscience: Searching the Full Text of Thousands of Neuroscience Research Papers

Author: Mueller Hans-Michael
Rangarajan Arun
Sternberg Paul W.
Teal Tracy K.
Publication venue: Humana Press Inc.
Publication date: 01/01/2008

Textpresso is a text-mining system for scientific literature. Its two major features are access to the full text of research papers and the development and use of categories of biological concepts as well as categories that describe or relate objects. A search engine enables the user to search for one or a combination of these categories and/or keywords within an entire literature. Here we describe Textpresso for Neuroscience, part of the core Neuroscience Information Framework (NIF). The Textpresso site currently consists of 67,500 full text papers and 131,300 abstracts. We show that using categories in literature can make a pure keyword query more refined and meaningful. We also show how semantic queries can be formulated with categories only. We explain the build and content of the database and describe the main features of the web pages and the advanced search options. We also give detailed illustrations of the web service developed to provide programmatic access to Textpresso. This web service is used by the NIF interface to access Textpresso. The standalone website of Textpresso for Neuroscience can be accessed at http://www.textpresso.org/neuroscience
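The combination of keywords and categories described in this abstract can be illustrated with a minimal sketch. It is not Textpresso's implementation or web service: the category names, the tiny lexicon, and the functions `tokenize`, `matches` and `search` are all hypothetical, and the sketch assumes a sentence matches when it contains every requested keyword and at least one term from every requested category.

```python
import re

# Hypothetical category lexicon for illustration only; not taken from Textpresso.
CATEGORIES = {
    "brain_region": {"hippocampus", "cortex", "amygdala"},
    "regulation": {"activates", "inhibits", "regulates"},
}


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def matches(sentence, keywords=(), categories=()):
    """True if the sentence has every keyword and a term from every category."""
    tokens = set(tokenize(sentence))
    has_keywords = all(k.lower() in tokens for k in keywords)
    has_categories = all(tokens & CATEGORIES[c] for c in categories)
    return has_keywords and has_categories


def search(sentences, keywords=(), categories=()):
    """Return the sentences that satisfy the combined query, in input order."""
    return [s for s in sentences if matches(s, keywords, categories)]


sentences = [
    "Dopamine inhibits firing in the hippocampus.",
    "Dopamine levels were measured in plasma.",
    "Serotonin regulates activity in the cortex.",
]

# A pure keyword query, the same query refined by categories,
# and a query that uses categories only.
print(search(sentences, keywords=["dopamine"]))
print(search(sentences, keywords=["dopamine"],
             categories=["brain_region", "regulation"]))
print(search(sentences, categories=["brain_region", "regulation"]))
```

In this toy corpus the keyword query returns both dopamine sentences, adding the two categories narrows it to the first sentence, and the categories-only query returns the first and third sentences.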

Springer - Publisher Connector

Caltech Authors

WormBase 2012: more genomes, more data, new website

Author: Chan Juancarlos
Chen Wen J.
Fang Ruihua
Ganesan Uma
Grove Christian
Kadam Snehalata
Kishore Ranjana
Lee Raymond
Li Yuling
Muller Hans-Michael
Nakamura Cecilia
Raciti Daniela
Rangarajan Arun
Schindelman Gary
Schwarz Erich M.
Sternberg Paul W.
Van Auken Kimberly
Wang Daniel
Wang Xiaodong
Yook Karen
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2012

Since its release in 2000, WormBase (http://www.wormbase.org) has grown from a small resource focusing on a single species and serving a dedicated research community, to one now spanning 15 species essential to the broader biomedical and agricultural research fields. To enhance the rate of curation, we have automated the identification of key data in the scientific literature and use similar methodology for data extraction. To ease access to the data, we are collaborating with journals to link entities in research publications to their report pages at WormBase. To facilitate discovery, we have added new views of the data, integrated large-scale datasets and expanded descriptions of models for human disease. Finally, we have introduced a dramatic overhaul of the WormBase website for public beta testing. Designed to balance complexity and usability, the new site is species-agnostic, highly customizable, and interactive. Casual users and developers alike will be able to leverage the public RESTful application programming interface (API) to generate custom data mining solutions and extensions to the site. We report on the growth of our database and on our work in keeping pace with the growing demand for data, efforts to anticipate the requirements of users and new collaborations with the larger science community.

Caltech Authors

Chi-square-based scoring function for categorization of MEDLINE citations

Author: Hristovski Dimitar
Kastrin Andrej
Peterlin Borut
Publication venue: Georg Thieme Verlag KG
Publication date: 01/01/2010

Objectives: Text categorization has been used in biomedical informatics for identifying documents containing relevant topics of interest. We developed a simple method that uses a chi-square-based scoring function to determine the likelihood of MEDLINE citations containing genetically relevant topics. Methods: Our procedure requires construction of a genetic and a nongenetic domain document corpus. We used MeSH descriptors assigned to MEDLINE citations for this categorization task. We compared frequencies of MeSH descriptors between the two corpora by applying the chi-square test. A MeSH descriptor was considered to be a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the nongenetic domain corpus. The output of the proposed method is a list of scores for all the citations, with the highest score given to those citations containing MeSH descriptors typical for the genetic domain. Results: Validation was done on a set of 734 manually annotated MEDLINE citations. It achieved predictive accuracy of 0.87 with 0.69 recall and 0.64 precision. We evaluated the method by comparing it to three machine learning algorithms (support vector machines, decision trees, naïve Bayes). Although the differences were not statistically significant, results showed that our chi-square scoring performs as well as the compared machine learning algorithms. Conclusions: We suggest that the chi-square scoring is an effective solution to help categorize MEDLINE citations. The algorithm is implemented in the BITOLA literature-based discovery support system as a preprocessor for the gene symbol disambiguation process. Comment: 34 pages, 2 figures
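The scoring idea in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' published procedure: it assumes a Pearson chi-square statistic on a 2x2 table per descriptor without continuity correction, a sign taken from the relative-frequency comparison described above, and a citation score formed by summing the signed weights. The function names and the toy corpora are hypothetical.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (no correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: descriptor in all or none of the documents
    return n * (a * d - b * c) ** 2 / denom


def descriptor_weights(genetic_docs, nongenetic_docs):
    """Signed chi-square weight per descriptor (positive = typical of genetic corpus)."""
    n_gen, n_non = len(genetic_docs), len(nongenetic_docs)
    if n_gen == 0 or n_non == 0:
        raise ValueError("both corpora must contain at least one document")
    # Count each descriptor once per document.
    gen_counts = Counter(d for doc in genetic_docs for d in set(doc))
    non_counts = Counter(d for doc in nongenetic_docs for d in set(doc))
    weights = {}
    for desc in set(gen_counts) | set(non_counts):
        a = gen_counts[desc]   # genetic documents with the descriptor
        b = n_gen - a          # genetic documents without it
        c = non_counts[desc]   # nongenetic documents with the descriptor
        d = n_non - c          # nongenetic documents without it
        sign = 1.0 if a / n_gen > c / n_non else -1.0
        weights[desc] = sign * chi_square_2x2(a, b, c, d)
    return weights


def score_citation(descriptors, weights):
    """Assumed aggregation: sum of signed weights; unseen descriptors add 0."""
    return sum(weights.get(d, 0.0) for d in set(descriptors))


# Toy corpora of MeSH-like descriptor lists (hypothetical data).
genetic = [["Mutation", "Humans"], ["Mutation", "Genotype"], ["Genotype", "Humans"]]
nongenetic = [["Surgery", "Humans"], ["Surgery", "Nursing"], ["Nursing", "Humans"]]

weights = descriptor_weights(genetic, nongenetic)
print(score_citation(["Mutation", "Genotype"], weights))  # 6.0
print(score_citation(["Surgery", "Humans"], weights))     # -3.0
```

Descriptors that occur equally often in both corpora, such as "Humans" here, get a weight of zero and do not move the score in either direction.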

arXiv.org e-Print Archive

Crossref

Linguistic Processing and Classification of Semi Structured Bibliographic Data on Complementary Medicine

Author: Büssing Arndt
Matthiessen Peter F.
Ostermann Thomas
Raak Christa K.
Zillmann Hartmut
Publication venue: Libertas Academica
Publication date: 01/01/2009

Complementary and alternative therapies and medicines (CAM) such as acupuncture or mistletoe treatment are in high demand among cancer patients. With a growing interest in such therapies, physicians need a simple tool with which to get an overview of the scientific publications on CAM, particularly those that are not listed in common bibliographic databases like MEDLINE. CAMbase is an XML-based bibliographical database on CAM which serves to address this need. A custom front end search engine performs semantic analysis of textual input, enabling users to quickly find information relevant to the search queries. This article describes the technical background and the architecture behind CAMbase, a free online database on CAM (www.cambase.de). We give examples of its use, describe the underlying algorithms and present recent statistics for search terms related to complementary therapies in oncology.

Directory of Open Access Journals

PubMed Central

Word add-in for ontology recognition: semantic enrichment of scientific literature

Background: In the current era of scientific research, efficient communication of information is paramount. As such, the nature of scholarly and scientific communication is changing; cyberinfrastructure is now absolutely necessary and new media are allowing information and knowledge to be more interactive and immediate. One approach to making knowledge more accessible is the addition of machine-readable semantic data to scholarly articles. Results: The Word add-in presented here will assist authors in this effort by automatically recognizing and highlighting words or phrases that are likely information-rich, allowing authors to associate semantic data with those words or phrases, and to embed that data in the document as XML. The add-in and source code are publicly available at http://www.codeplex.com/UCSDBioLit. Conclusions: The Word add-in for ontology term recognition makes it possible for an author to add semantic data to a document as it is being written and it encodes these data using XML tags that are effectively a standard in life sciences literature. Allowing authors to mark up their own work will help increase the amount and quality of machine-readable literature metadata.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

An effective biomedical document classification scheme in support of biocuration: addressing class imbalance.

Author: Arighi Cecilia
Blake Judith A
Jiang Xiangying
Ringwald Martin
Shatkay Hagit
Zhang Gongbo
Publication venue: The Mouseion at the JAXlibrary
Publication date: 01/01/2019

Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory's Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.
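Cluster-based under-sampling, one ingredient named in this abstract, can be sketched in a few lines. The abstract does not specify how the clustering or sampling is done, so everything here is an assumption for illustration: a plain Lloyd's k-means over numeric feature vectors, and a per-cluster quota proportional to cluster size. The functions `kmeans` and `cluster_undersample` and the toy data are hypothetical.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns a cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda j: sum((x - y) ** 2 for x, y in zip(p, centroids[j])),
            )
        # Move each centroid to the mean of its members; empty clusters stay put.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def cluster_undersample(majority, n_keep, k=3, seed=0):
    """Keep about n_keep majority examples, drawn from each cluster by size."""
    assignment = kmeans(majority, k, seed=seed)
    rng = random.Random(seed)
    kept = []
    for j in range(k):
        members = [p for p, a in zip(majority, assignment) if a == j]
        # Quota proportional to cluster size, so every region stays represented.
        quota = round(n_keep * len(members) / len(majority))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


# Toy majority class: 90 two-dimensional points in three separated groups.
majority = [
    [cx + 0.1 * i, cy - 0.1 * i]
    for cx, cy in [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
    for i in range(30)
]
minority_size = 30  # hypothetical size of the relevant (minority) class

balanced_majority = cluster_undersample(majority, n_keep=minority_size)
print(len(majority), "->", len(balanced_majority))
```

Because quotas are rounded per cluster, the number of kept examples can differ from `n_keep` by a small amount; drawing from every cluster rather than from the pool as a whole is what keeps sparse regions of the majority class from disappearing.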

The Jackson Laboratory: The Mouseion at the JAXlibrary

Semi-automated screening of biomedical citations for systematic reviews

Author: Byron C Wallace
Carla Brodley
Christopher H Schmid
Joseph Lau
Thomas A Trikalinos
Publication venue: BioMed Central
Publication date: 01/01/2010

Crossref

Springer - Publisher Connector

PubMed Central