A Linear Classifier Based on Entity Recognition Tools and a Statistical Approach to Method Extraction in the Protein-Protein Interaction Literature
We participated in the Article Classification and the Interaction Method
subtasks (ACT and IMT, respectively) of the Protein-Protein Interaction task of
the BioCreative III Challenge. For the ACT, we pursued extensive testing of
available Named Entity Recognition and dictionary tools, and used the most
promising ones to extend our Variable Trigonometric Threshold linear
classifier. For the IMT, we experimented with a primarily statistical approach,
as opposed to employing a deeper natural language processing strategy. Finally,
we also studied the benefits of integrating the method extraction approach that
we have used for the IMT into the ACT pipeline. For the ACT, our linear article
classifier leads to a ranking and classification performance significantly
higher than all the reported submissions. For the IMT, our results are
comparable to those of other systems, which took very different approaches. For
the ACT, we show that the use of named entity recognition tools leads to a
substantial improvement in the ranking and classification of articles relevant
to protein-protein interaction. Thus, we show that our substantially expanded
linear classifier is very competitive in this domain. Moreover, this classifier
produces interpretable surfaces that can be read as "rules" aiding human
understanding of the classification. In terms of the IMT
task, in contrast to other participants, our approach focused on identifying
sentences that are likely to bear evidence for the application of a PPI
detection method, rather than on classifying a document as relevant to a
method. As BioCreative III did not perform an evaluation of the evidence
provided by the system, we have conducted a separate assessment; the evaluators
agree that our tool is indeed effective in detecting relevant evidence for PPI
detection methods.

Comment: BMC Bioinformatics. In press.
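As a rough illustration of the general idea behind this kind of linear article scoring over word and NER-derived features (a minimal sketch only; the feature names, weights, and threshold below are hypothetical and are not the VTT classifier itself):

    # Sketch: score an article with a linear combination of word features and
    # NER-derived features. All feature names and weights are hypothetical.
    from typing import Dict

    def score_article(features: Dict[str, float], weights: Dict[str, float], bias: float = 0.0) -> float:
        """Linear score: bias plus the dot product of feature values and weights."""
        return bias + sum(weights.get(name, 0.0) * value for name, value in features.items())

    # Hypothetical features: token counts plus the number of protein mentions found by an NER tool.
    article = {"interaction": 3.0, "binds": 2.0, "ner_protein_mentions": 7.0}
    weights = {"interaction": 0.8, "binds": 0.5, "ner_protein_mentions": 0.3}

    score = score_article(article, weights, bias=-2.0)
    print("PPI-relevant" if score > 0 else "not PPI-relevant", round(score, 2))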
Discovering semantic features in the literature: a foundation for building functional associations
BACKGROUND: Experimental techniques such as DNA microarrays, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.

RESULTS: We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, to support gene/protein classification, or can even be combined with experimental measurements. Semantic features can be used by researchers to facilitate the understanding of the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis that is capable of identifying local patterns characterizing a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes.

CONCLUSION: The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.
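A minimal sketch of the NMF step described above, assuming a toy corpus and an arbitrary choice of two semantic features (scikit-learn is used here for illustration; it is not necessarily the authors' implementation):

    # Sketch: derive semantic features from a small document collection with NMF and
    # express each document as an additive combination of those features.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = [
        "kinase phosphorylation signaling cascade",
        "dna repair damage response checkpoint",
        "kinase inhibitor signaling pathway",
    ]
    X = TfidfVectorizer().fit_transform(docs)      # documents x terms matrix
    model = NMF(n_components=2, init="nndsvd", random_state=0)
    W = model.fit_transform(X)                     # documents as additive mixtures of semantic features
    H = model.components_                          # semantic features as weighted term lists
    print(W.round(2))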
Testing extensive use of NER tools in article classification and a statistical approach for method interaction extraction in the protein-protein interaction literature
We participated (as Team 81) in the Article Classification (ACT) and Interaction Method (IMT) subtasks
of the Protein-Protein Interaction task of the BioCreative III Challenge. For the ACT we pursued extensive
testing of available Named Entity Recognition (NER) tools, and used the most promising ones to extend
the Variable Trigonometric Threshold (VTT) linear classifier we successfully used in BioCreative II and II.5. Our
main goal was to exploit the power of available NER tools to aid in the classification of documents
relevant to Protein-Protein Interaction. We also used a Support Vector Machine classifier on NER features for
comparison purposes. For the IMT, we experimented with a primarily statistical approach, as opposed to a deeper
natural language processing strategy; in a nutshell, we exploited classifiers, simple pattern matching, and ranking
of candidate matches using statistical considerations. We will also report on our efforts to integrate our IMT
method sentence classifier into our ACT pipeline.
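As a hedged sketch of what a statistics-oriented IMT step might look like (the method patterns, sentences, and ranking rule are illustrative, not the submitted system):

    # Sketch: flag sentences that may carry evidence for a PPI detection method via
    # simple pattern matching, then rank candidate methods by match frequency.
    import re
    from collections import Counter

    METHOD_PATTERNS = {
        "yeast two-hybrid": r"yeast two-?hybrid",
        "co-immunoprecipitation": r"co-?immunoprecipitation",
        "pull-down assay": r"pull-?down",
    }

    def rank_method_evidence(sentences):
        counts, evidence = Counter(), []
        for sentence in sentences:
            for method, pattern in METHOD_PATTERNS.items():
                if re.search(pattern, sentence, flags=re.IGNORECASE):
                    counts[method] += 1
                    evidence.append((method, sentence))
        return counts.most_common(), evidence

    sentences = ["The interaction was confirmed by co-immunoprecipitation.",
                 "A yeast two-hybrid screen identified the binding partner."]
    print(rank_method_evidence(sentences)[0])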
Automating the extraction of essential genes from literature
The construction of repositories with curated information about gene essentiality for organisms of interest in Biotechnology is a very relevant task, mainly in the design of cell factories for the enhanced production of added-value products. However, it requires the retrieval and extraction of relevant information from the literature, leading to high manual curation costs. Text mining tools implementing methods that address tasks such as information retrieval, named entity recognition and event extraction have been developed to automate and reduce the time required to obtain relevant information from the literature in many biomedical fields. However, current tools are not designed or optimized for the purpose of identifying mentions of essential genes in scientific texts.

This work is co-funded by the North Portugal Regional Operational Programme, under "Portugal 2020", through the European Regional Development Fund (ERDF), within project SISBI - Refª NORTE-01-0247-FEDER-003381. The Centre of Biological Engineering (CEB), University of Minho, sponsored all computational hardware and software required for this work.
How to Get the Most out of Your Curation Effort
Large-scale annotation efforts typically involve several experts who may disagree with each other. We propose an approach for modeling disagreements among experts that allows each annotation to be assigned a confidence value (i.e., the posterior probability that it is correct). Our approach allows computing a certainty level for each individual annotation, given annotator-specific parameters estimated from data. We developed two probabilistic models for performing this analysis, compared these models using computer simulation, and tested each model's actual performance on a large data set generated by human annotators specifically for this study. We show that even in the worst-case scenario, when all annotators disagree, our approach allows us to significantly increase the probability of choosing the correct annotation. Along with this publication we make publicly available a corpus of 10,000 sentences annotated according to several cardinal dimensions that we have introduced in earlier work. The 10,000 sentences were all 3-fold annotated by a group of eight experts, while a 1,000-sentence subset was further 5-fold annotated by five new experts. While the presented data represent a specialized curation task, our modeling approach is general; most data annotation studies could benefit from our methodology.
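The abstract does not specify the two probabilistic models; a minimal Bayesian sketch under the simplifying assumption of independent annotators with known accuracies could look like this (all accuracies, labels, and votes are invented):

    # Sketch: posterior probability of each candidate label given independent annotators
    # with assumed per-annotator accuracies. Not the paper's models.
    from math import prod

    def label_posteriors(votes, accuracies, labels, prior=None):
        """votes[i] is annotator i's label; accuracies[i] is the assumed P(annotator i is correct)."""
        prior = prior or {label: 1.0 / len(labels) for label in labels}
        unnormalized = {}
        for true_label in labels:
            likelihood = prod(acc if vote == true_label else (1.0 - acc) / (len(labels) - 1)
                              for vote, acc in zip(votes, accuracies))
            unnormalized[true_label] = prior[true_label] * likelihood
        total = sum(unnormalized.values())
        return {label: value / total for label, value in unnormalized.items()}

    print(label_posteriors(votes=["relevant", "relevant", "irrelevant"],
                           accuracies=[0.9, 0.8, 0.7],
                           labels=["relevant", "irrelevant"]))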
SENT: semantic features in text
We present SENT (semantic features in text), a functional interpretation tool based on literature analysis. SENT uses Non-negative Matrix Factorization to identify topics in the scientific articles related to a collection of genes or their products, and uses them to group and summarize these genes. In addition, the application allows users to rank and explore the articles that best relate to the topics found, helping put the analysis results into context. This approach is useful as an exploratory step in the workflow of interpreting and understanding experimental data, shedding light on the complex underlying biological mechanisms. The tool provides a user-friendly interface via a web site and programmatic access via a SOAP web server. SENT is freely accessible at http://sent.dacya.ucm.es
Measurement of 222Rn dissolved in water at the Sudbury Neutrino Observatory
The technique used at the Sudbury Neutrino Observatory (SNO) to measure the
concentration of 222Rn in water is described. Water from the SNO detector is
passed through a vacuum degasser (in the light water system) or a membrane
contact degasser (in the heavy water system) where dissolved gases, including
radon, are liberated. The degasser is connected to a vacuum system which
collects the radon on a cold trap and removes most other gases, such as water
vapor and nitrogen. After roughly 0.5 tonnes of H2O or 6 tonnes of D2O have
been sampled, the accumulated radon is transferred to a Lucas cell. The cell is
mounted on a photomultiplier tube which detects the alpha particles from the
decay of 222Rn and its daughters. The overall degassing and concentration
efficiency is about 38% and the single-alpha counting efficiency is
approximately 75%. The sensitivity of the radon assay system for D2O is
equivalent to ~3 x 10^(-15) g U/g water. The radon concentration in both the H2O
and D2O is sufficiently low that the rate of background events from U-chain
elements is a small fraction of the interaction rate of solar neutrinos by the
neutral current reaction.

Comment: 14 pages, 6 figures; v2 has very minor changes.
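A back-of-the-envelope sketch of how the quoted efficiencies enter a concentration estimate, ignoring decay corrections and daughter alphas (this is an illustration, not the SNO analysis; the counts, times, and masses are invented):

    # Sketch: convert detected alpha counts into an approximate radon activity per tonne
    # of sampled water using the quoted efficiencies. Simplified illustration only.
    def radon_activity_per_tonne(counts, live_time_s, sample_mass_t,
                                 extraction_eff=0.38, counting_eff=0.75):
        detected_rate = counts / live_time_s        # alphas per second at the counter
        decay_rate = detected_rate / counting_eff   # correct for single-alpha counting efficiency
        sampled_rate = decay_rate / extraction_eff  # correct for degassing/concentration losses
        return sampled_rate / sample_mass_t         # decays per second per tonne of water

    print(radon_activity_per_tonne(counts=120, live_time_s=86400, sample_mass_t=6.0))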
A radium assay technique using hydrous titanium oxide adsorbent for the Sudbury Neutrino Observatory
As photodisintegration of deuterons mimics the disintegration of deuterons by
neutrinos, the accurate measurement of the radioactivity from thorium and
uranium decay chains in the heavy water in the Sudbury Neutrino Observatory
(SNO) is essential for the determination of the total solar neutrino flux. A
radium assay technique of the required sensitivity is described that uses
hydrous titanium oxide adsorbent on a filtration membrane together with a
beta-alpha delayed coincidence counting system. For a 200 tonne assay the
detection limit for 232Th is a concentration of 3 x 10^(-16) g Th/g water and
for 238U of 3 x 10^(-16) g U/g water. Results of assays of both the heavy and
light water carried out during the first two years of data collection of SNO
are presented.

Comment: 12 pages, 4 figures.
Stormwater best management practices: Experimental evaluation of chemical cocktails mobilized by freshwater salinization syndrome
Freshwater Salinization Syndrome (FSS) refers to the suite of physical, biological, and chemical impacts of salt ions on the degradation of natural, engineered, and social systems. Impacts of FSS on the mobilization of chemical cocktails have been documented in streams and groundwater, but little research has focused on the effects of FSS on stormwater best management practices (BMPs) such as constructed wetlands, bioswales, ponds, and bioretention. However, emerging research suggests that stormwater BMPs may be both sources and sinks of contaminants, shifting seasonally with road salt applications. We conducted lab experiments to investigate this premise; replicate water and soil samples were collected from four distinct stormwater feature types (bioretention, bioswale, constructed wetlands and retention ponds) and were used in salt incubation experiments conducted under six different salinities with three different salts (NaCl, CaCl2, and MgCl2). Increased salt concentrations had profound effects on major and trace element mobilization, with all three salts showing significant positive relationships across nearly all elements analyzed. Across all sites, mean salt retention was 34%, 28%, and 26% for Na+, Mg2+, and Ca2+, respectively, and there were significant differences among stormwater BMPs. Salt type showed preferential mobilization of certain elements. NaCl mobilized Cu, a potent toxicant to aquatic biota, at rates over an order of magnitude greater than both CaCl2 and MgCl2. Stormwater BMP type also had a significant effect on elemental mobilization, with ponds mobilizing significantly more Mn than other sites. However, salt concentration and salt type consistently had significant effects on the mean concentrations of elements mobilized across all stormwater BMPs (p < 0.05), suggesting that processes such as ion exchange mobilize metals and salt ions regardless of BMP type. Our results suggest that decisions regarding the amounts and types of salts used as deicers can have significant effects on reducing contaminant mobilization to freshwater ecosystems.
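A minimal sketch of the retention arithmetic implied above (the masses are invented; only the resulting percentages mirror the values quoted in the abstract):

    # Sketch: percent of an added salt ion retained by a BMP, from mass added and
    # mass recovered in the outflow. Example numbers are illustrative, not study data.
    def percent_retention(mass_in_mg, mass_out_mg):
        return 100.0 * (mass_in_mg - mass_out_mg) / mass_in_mg

    for ion, mass_in, mass_out in [("Na+", 100.0, 66.0), ("Mg2+", 50.0, 36.0), ("Ca2+", 80.0, 59.2)]:
        print(ion, round(percent_retention(mass_in, mass_out), 1), "% retained")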
A bioinformatics knowledge discovery in text application for grid computing
BACKGROUND: A fundamental activity in biomedical research is Knowledge Discovery, which involves searching through large amounts of biomedical information such as documents and data. High-performance computational infrastructures, such as Grid technologies, are emerging as a possible infrastructure to tackle the intensive use of information and communication resources in the life sciences. The goal of this work was to develop a software middleware solution that allows knowledge discovery applications to run on scalable and distributed computing systems, making intensive use of ICT resources.

METHODS: The development of a grid application for Knowledge Discovery in Text (KDT) using a middleware-based methodology is presented. The system must be able to model a user application and to process jobs by creating many parallel jobs to distribute over the computational nodes. Finally, the system must be aware of the available computational resources and their status, and must be able to monitor the execution of the parallel jobs. These operational requirements led to the design of a middleware that is specialized through user application modules. It includes a graphical user interface giving access to a node search system, a load-balancing system and a transfer optimizer that reduces communication costs.

RESULTS: A prototype of the middleware solution and its performance evaluation in terms of the speed-up factor are presented. It was written in Java on Globus Toolkit 4 to build the grid infrastructure based on GNU/Linux computer grid nodes. A test was carried out and the results are shown for the named entity recognition search of symptoms and pathologies. The search was applied to a collection of 5,000 scientific documents taken from PubMed.

CONCLUSION: In this paper we discuss the development of a grid application based on a middleware solution. It has been tested on a knowledge discovery in text process to extract new and useful information about symptoms and pathologies from a large collection of unstructured scientific documents. As an example, a Knowledge Discovery in Databases computation was applied to the output produced by the KDT user module to extract new knowledge about symptom and pathology bio-entities.
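For reference, the speed-up factor used in such evaluations is typically the ratio of single-node runtime to grid runtime; a minimal sketch with placeholder timings (not the reported measurements):

    # Sketch: speed-up factor of the grid run relative to a single-node run.
    # The timings are placeholders, not the measurements reported in the paper.
    def speedup(single_node_seconds: float, grid_seconds: float) -> float:
        return single_node_seconds / grid_seconds

    print(speedup(single_node_seconds=3600.0, grid_seconds=450.0))  # 8.0 with these placeholder timings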