A Linear Classifier Based on Entity Recognition Tools and a Statistical Approach to Method Extraction in the Protein-Protein Interaction Literature
We participated in the Article Classification and the Interaction Method
subtasks (ACT and IMT, respectively) of the Protein-Protein Interaction task of
the BioCreative III Challenge. For the ACT, we pursued extensive testing of
available Named Entity Recognition and dictionary tools, and used the most
promising ones to extend our Variable Trigonometric Threshold linear
classifier. For the IMT, we experimented with a primarily statistical approach,
as opposed to employing a deeper natural language processing strategy. Finally,
we also studied the benefits of integrating the method extraction approach that
we have used for the IMT into the ACT pipeline. For the ACT, our linear article
classifier leads to a ranking and classification performance significantly
higher than all the reported submissions. For the IMT, our results are
comparable to those of other systems, which took very different approaches. For
the ACT, we show that the use of named entity recognition tools leads to a
substantial improvement in the ranking and classification of articles relevant
to protein-protein interaction. Thus, we show that our substantially expanded
linear classifier is very competitive in this domain. Moreover, this classifier
produces interpretable surfaces that can be read as "rules" aiding human
understanding of the classification. In terms of the IMT
task, in contrast to other participants, our approach focused on identifying
sentences that are likely to bear evidence for the application of a PPI
detection method, rather than on classifying a document as relevant to a
method. As BioCreative III did not perform an evaluation of the evidence
provided by the system, we have conducted a separate assessment; the evaluators
agree that our tool is indeed effective in detecting relevant evidence for PPI
detection methods.

Comment: BMC Bioinformatics. In press.
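As a rough illustration of the general idea behind this kind of linear article scoring over word and NER-derived features (a minimal sketch only; the feature names, weights, and threshold below are hypothetical and are not the VTT classifier itself):

    # Sketch: score an article with a linear combination of word features and
    # NER-derived features. All feature names and weights are hypothetical.
    from typing import Dict

    def score_article(features: Dict[str, float], weights: Dict[str, float], bias: float = 0.0) -> float:
        """Linear score: bias plus the dot product of feature values and weights."""
        return bias + sum(weights.get(name, 0.0) * value for name, value in features.items())

    # Hypothetical features: token counts plus the number of protein mentions found by an NER tool.
    article = {"interaction": 3.0, "binds": 2.0, "ner_protein_mentions": 7.0}
    weights = {"interaction": 0.8, "binds": 0.5, "ner_protein_mentions": 0.3}

    score = score_article(article, weights, bias=-2.0)
    print("PPI-relevant" if score > 0 else "not PPI-relevant", round(score, 2))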
Discovering semantic features in the literature: a foundation for building functional associations
BACKGROUND: Experimental techniques such as DNA microarrays, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.

RESULTS: We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, to support gene/protein classification, or can even be combined with experimental measurements. Semantic features can be used by researchers to facilitate the understanding of the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis that is capable of identifying local patterns characterizing a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes.

CONCLUSION: The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.
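A minimal sketch of the NMF step described above, assuming a toy corpus and an arbitrary choice of two semantic features (scikit-learn is used here for illustration; it is not necessarily the authors' implementation):

    # Sketch: derive semantic features from a small document collection with NMF and
    # express each document as an additive combination of those features.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = [
        "kinase phosphorylation signaling cascade",
        "dna repair damage response checkpoint",
        "kinase inhibitor signaling pathway",
    ]
    X = TfidfVectorizer().fit_transform(docs)      # documents x terms matrix
    model = NMF(n_components=2, init="nndsvd", random_state=0)
    W = model.fit_transform(X)                     # documents as additive mixtures of semantic features
    H = model.components_                          # semantic features as weighted term lists
    print(W.round(2))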
Testing extensive use of NER tools in article classification and a statistical approach for method interaction extraction in the protein-protein interaction literature
We participated (as Team 81) in the Article Classification (ACT) and Interaction Method (IMT) subtasks
of the Protein-Protein Interaction task of the BioCreative III Challenge. For the ACT we pursued extensive
testing of available Named Entity Recognition (NER) tools, and used the most promising ones to extend
the Variable Trigonometric Threshold (VTT) linear classifier we successfully used in BioCreative II and II.5. Our
main goal was to exploit the power of available NER tools to aid in the classification of documents
relevant to Protein-Protein Interaction. We also used a Support Vector Machine classifier on NER features for
comparison purposes. For the IMT, we experimented with a primarily statistical approach, as opposed to a deeper
natural language processing strategy; in a nutshell, we exploited classifiers, simple pattern matching, and ranking
of candidate matches using statistical considerations. We will also report on our efforts to integrate our IMT
method sentence classifier into our ACT pipeline.
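As a hedged sketch of what a statistics-oriented IMT step might look like (the method patterns, sentences, and ranking rule are illustrative, not the submitted system):

    # Sketch: flag sentences that may carry evidence for a PPI detection method via
    # simple pattern matching, then rank candidate methods by match frequency.
    import re
    from collections import Counter

    METHOD_PATTERNS = {
        "yeast two-hybrid": r"yeast two-?hybrid",
        "co-immunoprecipitation": r"co-?immunoprecipitation",
        "pull-down assay": r"pull-?down",
    }

    def rank_method_evidence(sentences):
        counts, evidence = Counter(), []
        for sentence in sentences:
            for method, pattern in METHOD_PATTERNS.items():
                if re.search(pattern, sentence, flags=re.IGNORECASE):
                    counts[method] += 1
                    evidence.append((method, sentence))
        return counts.most_common(), evidence

    sentences = ["The interaction was confirmed by co-immunoprecipitation.",
                 "A yeast two-hybrid screen identified the binding partner."]
    print(rank_method_evidence(sentences)[0])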
Automating the extraction of essential genes from literature
The construction of repositories with curated information about gene essentiality for organisms of interest in Biotechnology is a very relevant task, mainly in the design of cell factories for the enhanced production of added-value products. However, it requires the retrieval and extraction of relevant information from the literature, leading to high manual curation costs. Text mining tools implementing methods that address tasks such as information retrieval, named entity recognition and event extraction have been developed to automate and reduce the time required to obtain relevant information from the literature in many biomedical fields. However, current tools are not designed or optimized for the purpose of identifying mentions of essential genes in scientific texts.

This work is co-funded by the North Portugal Regional Operational Programme, under "Portugal 2020", through the European Regional Development Fund (ERDF), within project SISBI - Refª NORTE-01-0247-FEDER-003381. The Centre of Biological Engineering (CEB), University of Minho, sponsored all computational hardware and software required for this work.
How to Get the Most out of Your Curation Effort
Large-scale annotation efforts typically involve several experts who may disagree with each other. We propose an approach for modeling disagreements among experts that allows each annotation to be assigned a confidence value (i.e., the posterior probability that it is correct). Our approach allows computing a certainty level for each individual annotation, given annotator-specific parameters estimated from data. We developed two probabilistic models for performing this analysis, compared these models using computer simulation, and tested each model's actual performance on a large data set generated by human annotators specifically for this study. We show that even in the worst-case scenario, when all annotators disagree, our approach allows us to significantly increase the probability of choosing the correct annotation. Along with this publication we make publicly available a corpus of 10,000 sentences annotated according to several cardinal dimensions that we have introduced in earlier work. The 10,000 sentences were all 3-fold annotated by a group of eight experts, while a 1,000-sentence subset was further 5-fold annotated by five new experts. While the presented data represent a specialized curation task, our modeling approach is general; most data annotation studies could benefit from our methodology.
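The abstract does not specify the two probabilistic models; a minimal Bayesian sketch under the simplifying assumption of independent annotators with known accuracies could look like this (all accuracies, labels, and votes are invented):

    # Sketch: posterior probability of each candidate label given independent annotators
    # with assumed per-annotator accuracies. Not the paper's models.
    from math import prod

    def label_posteriors(votes, accuracies, labels, prior=None):
        """votes[i] is annotator i's label; accuracies[i] is the assumed P(annotator i is correct)."""
        prior = prior or {label: 1.0 / len(labels) for label in labels}
        unnormalized = {}
        for true_label in labels:
            likelihood = prod(acc if vote == true_label else (1.0 - acc) / (len(labels) - 1)
                              for vote, acc in zip(votes, accuracies))
            unnormalized[true_label] = prior[true_label] * likelihood
        total = sum(unnormalized.values())
        return {label: value / total for label, value in unnormalized.items()}

    print(label_posteriors(votes=["relevant", "relevant", "irrelevant"],
                           accuracies=[0.9, 0.8, 0.7],
                           labels=["relevant", "irrelevant"]))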
SENT: semantic features in text
We present SENT (semantic features in text), a functional interpretation tool based on literature analysis. SENT uses Non-negative Matrix Factorization to identify topics in the scientific articles related to a collection of genes or their products, and uses them to group and summarize these genes. In addition, the application allows users to rank and explore the articles that best relate to the topics found, helping put the analysis results into context. This approach is useful as an exploratory step in the workflow of interpreting and understanding experimental data, shedding light on the complex underlying biological mechanisms. The tool provides a user-friendly interface via a web site and programmatic access via a SOAP web server. SENT is freely accessible at http://sent.dacya.ucm.es
Measurement of 222Rn dissolved in water at the Sudbury Neutrino Observatory
The technique used at the Sudbury Neutrino Observatory (SNO) to measure the
concentration of 222Rn in water is described. Water from the SNO detector is
passed through a vacuum degasser (in the light water system) or a membrane
contact degasser (in the heavy water system) where dissolved gases, including
radon, are liberated. The degasser is connected to a vacuum system which
collects the radon on a cold trap and removes most other gases, such as water
vapor and nitrogen. After roughly 0.5 tonnes of H2O or 6 tonnes of D2O have
been sampled, the accumulated radon is transferred to a Lucas cell. The cell is
mounted on a photomultiplier tube which detects the alpha particles from the
decay of 222Rn and its daughters. The overall degassing and concentration
efficiency is about 38% and the single-alpha counting efficiency is
approximately 75%. The sensitivity of the radon assay system for D2O is
equivalent to ~3 x 10^(-15) g U/g water. The radon concentration in both the H2O
and D2O is sufficiently low that the rate of background events from U-chain
elements is a small fraction of the interaction rate of solar neutrinos by the
neutral current reaction.

Comment: 14 pages, 6 figures; v2 has very minor changes.
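A back-of-the-envelope sketch of how the quoted efficiencies enter a concentration estimate, ignoring decay corrections and daughter alphas (this is an illustration, not the SNO analysis; the counts, times, and masses are invented):

    # Sketch: convert detected alpha counts into an approximate radon activity per tonne
    # of sampled water using the quoted efficiencies. Simplified illustration only.
    def radon_activity_per_tonne(counts, live_time_s, sample_mass_t,
                                 extraction_eff=0.38, counting_eff=0.75):
        detected_rate = counts / live_time_s        # alphas per second at the counter
        decay_rate = detected_rate / counting_eff   # correct for single-alpha counting efficiency
        sampled_rate = decay_rate / extraction_eff  # correct for degassing/concentration losses
        return sampled_rate / sample_mass_t         # decays per second per tonne of water

    print(radon_activity_per_tonne(counts=120, live_time_s=86400, sample_mass_t=6.0))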
A radium assay technique using hydrous titanium oxide adsorbent for the Sudbury Neutrino Observatory
As photodisintegration of deuterons mimics the disintegration of deuterons by
neutrinos, the accurate measurement of the radioactivity from thorium and
uranium decay chains in the heavy water in the Sudbury Neutrino Observatory
(SNO) is essential for the determination of the total solar neutrino flux. A
radium assay technique of the required sensitivity is described that uses
hydrous titanium oxide adsorbent on a filtration membrane together with a
beta-alpha delayed coincidence counting system. For a 200 tonne assay the
detection limit for 232Th is a concentration of 3 x 10^(-16) g Th/g water and
for 238U of 3 x 10^(-16) g U/g water. Results of assays of both the heavy and
light water carried out during the first two years of data collection of SNO
are presented.

Comment: 12 pages, 4 figures.
Stormwater best management practices: Experimental evaluation of chemical cocktails mobilized by freshwater salinization syndrome
Freshwater Salinization Syndrome (FSS) refers to the suite of physical, biological, and chemical impacts of salt ions on the degradation of natural, engineered, and social systems. Impacts of FSS on the mobilization of chemical cocktails have been documented in streams and groundwater, but little research has focused on the effects of FSS on stormwater best management practices (BMPs) such as constructed wetlands, bioswales, ponds, and bioretention. However, emerging research suggests that stormwater BMPs may be both sources and sinks of contaminants, shifting seasonally with road salt applications. We conducted lab experiments to investigate this premise; replicate water and soil samples were collected from four distinct stormwater feature types (bioretention, bioswale, constructed wetlands and retention ponds) and were used in salt incubation experiments conducted under six different salinities with three different salts (NaCl, CaCl2, and MgCl2). Increased salt concentrations had profound effects on major and trace element mobilization, with all three salts showing significant positive relationships across nearly all elements analyzed. Across all sites, mean salt retention was 34%, 28%, and 26% for Na+, Mg2+, and Ca2+, respectively, and there were significant differences among stormwater BMPs. Salt type showed preferential mobilization of certain elements. NaCl mobilized Cu, a potent toxicant to aquatic biota, at rates over an order of magnitude greater than both CaCl2 and MgCl2. Stormwater BMP type also had a significant effect on elemental mobilization, with ponds mobilizing significantly more Mn than other sites. However, salt concentration and salt type consistently had significant effects on the mean concentrations of elements mobilized across all stormwater BMPs (p < 0.05), suggesting that processes such as ion exchange mobilize metals and salt ions regardless of BMP type. Our results suggest that decisions regarding the amounts and types of salts used as deicers can have significant effects on reducing contaminant mobilization to freshwater ecosystems.
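A minimal sketch of the retention arithmetic implied above (the masses are invented; only the resulting percentages mirror the values quoted in the abstract):

    # Sketch: percent of an added salt ion retained by a BMP, from mass added and
    # mass recovered in the outflow. Example numbers are illustrative, not study data.
    def percent_retention(mass_in_mg, mass_out_mg):
        return 100.0 * (mass_in_mg - mass_out_mg) / mass_in_mg

    for ion, mass_in, mass_out in [("Na+", 100.0, 66.0), ("Mg2+", 50.0, 36.0), ("Ca2+", 80.0, 59.2)]:
        print(ion, round(percent_retention(mass_in, mass_out), 1), "% retained")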
A bioinformatics knowledge discovery in text application for grid computing
BACKGROUND: A fundamental activity in biomedical research is Knowledge Discovery, which involves searching through large amounts of biomedical information such as documents and data. High-performance computational infrastructures, such as Grid technologies, are emerging as a possible infrastructure to tackle the intensive use of information and communication resources in the life sciences. The goal of this work was to develop a software middleware solution that allows knowledge discovery applications to run on scalable and distributed computing systems, making intensive use of ICT resources.

METHODS: The development of a grid application for Knowledge Discovery in Text (KDT) using a middleware-based methodology is presented. The system must be able to model a user application and to process jobs by creating many parallel jobs to distribute over the computational nodes. Finally, the system must be aware of the available computational resources and their status, and must be able to monitor the execution of the parallel jobs. These operational requirements led to the design of a middleware that is specialized through user application modules. It includes a graphical user interface giving access to a node search system, a load-balancing system and a transfer optimizer that reduces communication costs.

RESULTS: A prototype of the middleware solution and its performance evaluation in terms of the speed-up factor are presented. It was written in Java on Globus Toolkit 4 to build the grid infrastructure based on GNU/Linux computer grid nodes. A test was carried out and the results are shown for the named entity recognition search of symptoms and pathologies. The search was applied to a collection of 5,000 scientific documents taken from PubMed.

CONCLUSION: In this paper we discuss the development of a grid application based on a middleware solution. It has been tested on a knowledge discovery in text process to extract new and useful information about symptoms and pathologies from a large collection of unstructured scientific documents. As an example, a Knowledge Discovery in Databases computation was applied to the output produced by the KDT user module to extract new knowledge about symptom and pathology bio-entities.
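For reference, the speed-up factor used in such evaluations is typically the ratio of single-node runtime to grid runtime; a minimal sketch with placeholder timings (not the reported measurements):

    # Sketch: speed-up factor of the grid run relative to a single-node run.
    # The timings are placeholders, not the measurements reported in the paper.
    def speedup(single_node_seconds: float, grid_seconds: float) -> float:
        return single_node_seconds / grid_seconds

    print(speedup(single_node_seconds=3600.0, grid_seconds=450.0))  # 8.0 with these placeholder timings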