
    Monthly hydrometeorological ensemble prediction of streamflow droughts and corresponding drought indices

    Streamflow droughts, characterized by low runoff as a consequence of a drought event, affect numerous aspects of life. Economic sectors impacted by low streamflow include power production, agriculture, tourism, water quality management and shipping. These sectors could potentially benefit from forecasts of streamflow drought events, even of short events on monthly or shorter time scales. Numerical hydrometeorological models have increasingly been used to forecast low streamflow and have become the focus of recent research. Here, we consider daily ensemble runoff forecasts for the river Thur, which has its source in the Swiss Alps. We focus on the evaluation of low streamflow and of derived indices such as duration, severity and magnitude, which characterize streamflow droughts up to a lead time of one month.

    The ECMWF VarEPS 5-member ensemble reforecast, which covers 18 yr, is used as forcing for the hydrological model PREVAH. A thorough verification reveals that, compared to probabilistic peak-flow forecasts, which show skill up to a lead time of two weeks, forecasts of streamflow droughts are skilful over the entire forecast range of one month. For forecasts at the lower end of the runoff regime, the quality of the initial state appears to be crucial for good forecast quality at longer lead times, and the states used in this study to initialize the forecasts are shown to satisfy this requirement. The forecasts of streamflow drought indices derived from the ensemble could be beneficially included in a decision-making process. This holds for probabilistic forecasts of streamflow drought events falling below a daily varying threshold, based on a quantile derived from a runoff climatology. Although the forecasts tend to overpredict streamflow droughts, the relative economic value of the ensemble forecasts reaches up to 60%, provided a forecast user is able to take preventive action based on the forecast.
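The drought definition used above (runoff falling below a daily varying quantile threshold from a runoff climatology, with duration, severity and magnitude derived from the exceedance series) can be sketched as follows. All variable names and the choice of the 0.2 quantile are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def drought_events(runoff, climatology, q=0.2):
    """Flag streamflow drought days and derive simple drought indices.

    runoff      : 1-D array of daily runoff for the forecast period
    climatology : 2-D array (years x days) of historical daily runoff
    q           : quantile defining the drought threshold (assumption)
    """
    # Daily varying threshold: the q-quantile across climatology years
    threshold = np.quantile(climatology, q, axis=0)[: len(runoff)]
    below = runoff < threshold                 # drought days
    duration = int(below.sum())                # days below threshold
    severity = float(np.sum((threshold - runoff)[below]))  # total deficit
    magnitude = severity / duration if duration else 0.0   # mean daily deficit
    return below, duration, severity, magnitude
```

A forecast verification would compare such indices computed from each ensemble member against the same indices computed from observed runoff.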

    Homogenisation of a gridded snow water equivalent climatology for Alpine terrain: methodology and applications

    Gridded snow water equivalent (SWE) data sets are valuable for estimating snow water resources and for verifying different model systems, e.g. hydrological, land surface or atmospheric models. However, changing data availability represents a considerable challenge when trying to derive consistent time series for SWE products. In an attempt to improve product consistency, we first evaluated the differences between two climatologies of SWE grids that were calculated on the basis of data from 110 and 203 stations, respectively. The "shorter" climatology (2001–2009) was produced using 203 stations (map203) and the "longer" one (1971–2009) using 110 stations (map110). Relative to map203, map110 underestimated SWE, especially at higher elevations and at the end of the winter season. We tested the potential of quantile mapping to compensate for mapping errors in map110 relative to map203. During the 9 yr calibration period from 2001 to 2009, for which both map203 and map110 were available, the method successfully refined the spatial and temporal SWE representation in map110 by making seasonal, regional and altitude-related distinctions. Expanding the calibration to the full 39 yr showed that the general underestimation of map110 with respect to map203 could be removed for the whole winter. The calibrated SWE maps fitted the reference (map203) well when averaged over regions and time periods, with a mean error of approximately zero. However, deviations between the calibrated maps and map203 were observed for single grid cells and years. When we examined three regions in more detail, we found that the calibration had the largest effect in the region with the highest proportion of catchment area above 2000 m a.s.l., and that the general underestimation of map110 compared to map203 could be removed for the entire snow season. The added value of the calibrated SWE climatology is illustrated with practical examples: the verification of a hydrological model, the estimation of snow resource anomalies and the predictability of runoff through SWE.
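Empirical quantile mapping, the calibration technique used above, replaces each value from the biased product with the value at the same quantile of the reference distribution. A minimal sketch, assuming the calibration samples have already been stratified by season, region and elevation band as the abstract describes:

```python
import numpy as np

def quantile_map(biased, reference, values):
    """Empirical quantile mapping: transform `values`, drawn from the
    biased distribution, so their quantiles match the reference.

    biased, reference : calibration samples for one stratum, e.g. map110
                        and map203 SWE (names are illustrative)
    values            : new biased values to correct
    """
    # Locate each value in the biased sample's empirical CDF
    quantiles = np.searchsorted(np.sort(biased), values, side="right") / len(biased)
    quantiles = np.clip(quantiles, 0.0, 1.0)
    # Read the same quantiles off the reference distribution
    return np.quantile(reference, quantiles)
```

With a constant offset between the two samples, the mapping simply removes that bias; with elevation- or season-dependent errors, fitting separate maps per stratum captures the structured differences described in the abstract.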

    A realistic assessment of methods for extracting gene/protein interactions from free text

    Background: The automated extraction of gene and/or protein interactions from the literature is one of the most important targets of biomedical text mining research. In this paper we present a realistic evaluation of gene/protein interaction mining relevant to potential non-specialist users. Hence we have specifically avoided methods that are complex to install or require reimplementation, and we coupled our chosen extraction methods with a state-of-the-art biomedical named entity tagger. Results: Our results show that performance across different evaluation corpora is extremely variable; that the use of automatically tagged (as opposed to gold standard) gene and protein names has a significant impact on performance, with a drop in F-score of over 20 percentage points being commonplace; and that a simple keyword-based benchmark algorithm, when coupled with a named entity tagger, outperforms two of the tools most widely used to extract gene/protein interactions. Conclusion: In terms of availability, ease of use and performance, the potential non-specialist user community interested in automatically extracting gene and/or protein interactions from free text is poorly served by current tools and systems. The public release of extraction tools that are easy to install and use, and that achieve state-of-the-art levels of performance, should be treated as a high priority by the biomedical text mining community.
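A keyword-based benchmark of the kind mentioned above can be very small: pair up tagged entity names whenever the sentence also contains an interaction keyword. This sketch is an illustration of the general idea, not the paper's actual benchmark; the keyword list and function names are assumptions:

```python
import itertools

# Illustrative interaction keywords (an assumption, not the paper's list)
INTERACTION_KEYWORDS = {"interacts", "binds", "activates", "inhibits",
                        "phosphorylates", "regulates"}

def keyword_baseline(sentence, tagged_entities):
    """Report every pair of tagged gene/protein names from a sentence
    that also contains an interaction keyword.

    tagged_entities : names produced by an upstream NER tagger
    """
    tokens = {t.lower().strip(".,;") for t in sentence.split()}
    if not (tokens & INTERACTION_KEYWORDS):
        return []
    present = [e for e in tagged_entities if e.lower() in sentence.lower()]
    # Undirected pairs in a canonical order
    return list(itertools.combinations(sorted(set(present)), 2))
```

Because such a baseline consumes tagger output rather than gold names, its score directly reflects the tagged-versus-gold performance gap the abstract quantifies.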

    "EDML1": a chronology for the EPICA deep ice core from Dronning Maud Land, Antarctica, over the last 150 000 years.

    A chronology called EDML1 has been developed for the EPICA ice core from Dronning Maud Land (EDML). EDML1 is closely interlinked with EDC3, the new chronology for the EPICA ice core from Dome C (EDC), through a stratigraphic match between EDML and EDC that consists of 322 volcanic match points over the last 128 ka. The EDC3 chronology comprises a glaciological model at EDC, which is constrained and later selectively tuned using primary dating information from EDC as well as from EDML, the latter being transferred using the tight stratigraphic link between the two cores. Finally, EDML1 was built by exporting EDC3 to EDML. For ages younger than 41 ka BP the new synchronized time scale EDML1/EDC3 is based on dated volcanic events and on a match to the Greenland ice core chronology GICC05 via ¹⁰Be and methane. The internal consistency between EDML1 and EDC3 is estimated to be typically ~6 years and always less than 450 years over the last 128 ka (always less than 130 years over the last 60 ka), which reflects an unprecedented synchrony of time scales. EDML1 ends at 150 ka BP (2417 m depth) because the match between EDML and EDC becomes ambiguous further down. This hints at a complex ice flow history for the deepest 350 m of the EDML ice core.
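Exporting one core's chronology to another through stratigraphic tie points, as described above, amounts to two interpolations: map each EDML depth onto its equivalent EDC depth via the volcanic match points, then read the age of that EDC depth off the EDC3 depth-age relation. A minimal sketch with purely illustrative numbers (real match-point tables are irregular and dense):

```python
import numpy as np

def transfer_chronology(edml_depths, match_edml, match_edc, edc3_depth, edc3_age):
    """Transfer a chronology across cores through tie points.

    match_edml, match_edc : depths of the same volcanic horizons in each core
    edc3_depth, edc3_age  : the dated core's depth-age relation
    """
    # Map EDML depths onto equivalent EDC depths via the tie points
    edc_equiv = np.interp(edml_depths, match_edml, match_edc)
    # Read the age of those EDC depths off the EDC3 time scale
    return np.interp(edc_equiv, edc3_depth, edc3_age)
```

The quoted internal consistency (~6 years typically) reflects how densely the 322 match points constrain this interpolation.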

    Large Scale Application of Neural Network Based Semantic Role Labeling for Automated Relation Extraction from Biomedical Texts

    To reduce the increasing amount of time spent on literature search in the life sciences, several methods for automated knowledge extraction have been developed. Co-occurrence based approaches can deal with large text corpora like MEDLINE in an acceptable time but are not able to extract any specific type of semantic relation. Semantic relation extraction methods based on syntax trees, on the other hand, are computationally expensive and the interpretation of the generated trees is difficult. Several natural language processing (NLP) approaches for the biomedical domain exist focusing specifically on the detection of a limited set of relation types. For systems biology, generic approaches are needed that detect a multitude of relation types and can also process large text corpora, but the number of systems meeting both requirements is very limited. We introduce the use of SENNA ("Semantic Extraction using a Neural Network Architecture"), a fast and accurate neural network based Semantic Role Labeling (SRL) program, for the large scale extraction of semantic relations from the biomedical literature. A comparison of processing times of SENNA and other SRL systems or syntactic parsers used in the biomedical domain revealed that SENNA is the fastest Proposition Bank (PropBank) conforming SRL program currently available. 89 million biomedical sentences were tagged with SENNA on a 100-node cluster within three days. The accuracy of the presented relation extraction approach was evaluated on two test sets of annotated sentences, resulting in precision/recall values of 0.71/0.43. We show that the accuracy as well as the processing speed of the proposed semantic relation extraction approach is sufficient for its large scale application on biomedical text. The proposed approach is highly generalizable regarding the supported relation types and appears to be especially suited for general-purpose, broad-scale text mining systems. The presented approach bridges the gap between fast, co-occurrence-based approaches lacking semantic relations and highly specialized and computationally demanding NLP approaches.
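The core step of SRL-based relation extraction is turning PropBank-style predicate-argument frames into binary relations: the predicate becomes the relation type, ARG0 the agent and ARG1 the target, keeping only frames whose arguments contain recognised entities. This is a simplified reading of the approach, with illustrative field names:

```python
def frames_to_relations(frames, entities):
    """Map PropBank-style SRL frames to binary relation triples.

    frames   : list of dicts with "predicate", "ARG0", "ARG1" spans
               (field names are an illustrative assumption)
    entities : recognised gene/protein names from an entity tagger
    """
    relations = []
    for frame in frames:
        # Keep a frame only if both core arguments mention an entity
        a0 = next((e for e in entities if e in frame.get("ARG0", "")), None)
        a1 = next((e for e in entities if e in frame.get("ARG1", "")), None)
        if a0 and a1:
            relations.append((a0, frame["predicate"], a1))
    return relations
```

Unlike plain co-occurrence, the predicate carries the relation type, which is what makes the approach generic across relation types.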

    Linguistic feature analysis for protein interaction extraction

    Background: The rapid growth in the amount of publicly available reports on biomedical experimental results has recently caused a boost of text mining approaches for protein interaction extraction. Most approaches rely implicitly or explicitly on linguistic, i.e., lexical and syntactic, data extracted from text. However, only a few attempts have been made to evaluate the contribution of the different feature types. In this work, we contribute to this evaluation by studying the relative importance of deep syntactic features, i.e., grammatical relations; shallow syntactic features (part-of-speech information); and lexical features. For this purpose, we use a recently proposed approach based on support vector machines with structured kernels. Results: Our results reveal that the contribution of the different feature types varies across the data sets on which the experiments were conducted. The smaller the training corpus compared to the test data, the more important the role of grammatical relations becomes. Moreover, classifiers based on deep syntactic information prove to be more robust on heterogeneous texts where no or only limited common vocabulary is shared. Conclusion: Our findings suggest that grammatical relations play an important role in the interaction extraction task. Moreover, the net advantage of adding lexical and shallow syntactic features is small relative to the number of added features. This implies that efficient classifiers can be built by using only a small fraction of the features typically used in recent approaches.
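The three feature families compared above can be made concrete by assembling them from a pre-parsed sentence. This sketch assumes tokens, part-of-speech tags and dependency relations are supplied by an external parser; the feature-string formats are illustrative:

```python
def feature_sets(tokens, pos_tags, dep_relations):
    """Assemble lexical, shallow syntactic and deep syntactic features.

    tokens        : surface word forms
    pos_tags      : one part-of-speech tag per token
    dep_relations : (head, relation, dependent) triples from a parser
    """
    lexical = {f"w={t.lower()}" for t in tokens}                 # lexical features
    shallow = {f"pos={p}" for p in pos_tags}                     # POS features
    deep = {f"rel={h}-{r}->{d}" for h, r, d in dep_relations}    # grammatical relations
    return lexical, shallow, deep
```

The study's finding that the deep set alone yields robust classifiers implies most of the lexical and POS features here add little beyond their count.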

    Comparative analysis of five protein-protein interaction corpora

    Background: Growing interest in the application of natural language processing methods to biomedical text has led to an increasing number of corpora and methods targeting protein-protein interaction (PPI) extraction. However, there is no general consensus regarding PPI annotation, and consequently resources are largely incompatible and methods are difficult to evaluate. Results: We present the first comparative evaluation of the diverse PPI corpora, performing quantitative evaluation using two separate information extraction methods as well as detailed statistical and qualitative analyses of their properties. For the evaluation, we unify the corpus PPI annotations to a shared level of information, consisting of undirected, untyped binary interactions of non-static types, with no identification of the words specifying the interaction, no negations, and no interaction certainty. We find that the F-score performance of a state-of-the-art PPI extraction method varies by 19 percentage units on average, and in some cases by over 30 percentage units, between the evaluated corpora. The differences stemming from the choice of corpus can thus be substantially larger than differences between the performance of PPI extraction methods, which suggests definite limits on the ability to compare methods evaluated on different resources. We analyse a number of potential sources for these differences and identify factors explaining approximately half of the variance. We further suggest ways in which the difficulty of the PPI extraction tasks codified by different corpora can be determined to advance comparability. Our analysis also identifies points of agreement and disagreement in PPI corpus annotation that are rarely explicitly stated by the authors of the corpora. Conclusions: Our comparative analysis uncovers key similarities and differences between the diverse PPI corpora, thus taking an important step towards standardization. In the course of this study we have created a major practical contribution in converting the corpora into a shared format. The conversion software is freely available at http://mars.cs.utu.fi/PPICorpora.
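The normalisation to a shared annotation level described above can be sketched as a small reduction step: order each pair canonically (undirected), drop the interaction type (untyped), and discard negated or uncertain annotations. Record field names here are illustrative, not the actual conversion software's schema:

```python
def unify(corpus_interactions):
    """Reduce corpus-specific PPI annotations to the shared level:
    undirected, untyped binary pairs, no negation, no certainty.

    corpus_interactions : dicts with entity fields "e1"/"e2" and optional
                          "negated"/"uncertain" flags (illustrative schema)
    """
    unified = set()
    for rec in corpus_interactions:
        if rec.get("negated") or rec.get("uncertain"):
            continue  # the shared format carries no negation/certainty
        # Undirected: canonical order; untyped: any type field is dropped
        unified.add(tuple(sorted((rec["e1"], rec["e2"]))))
    return unified
```

Reducing every corpus to the same set representation is what makes the cross-corpus F-score comparison meaningful.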

    eGIFT: Mining Gene Information from the Literature

    Background: With the biomedical literature continually expanding, searching PubMed for information about specific genes becomes increasingly difficult. Not only can thousands of results be returned, but gene name ambiguity leads to many irrelevant hits. As a result, it is difficult for life scientists and gene curators to rapidly get an overall picture of a specific gene from documents that mention its names and synonyms. Results: In this paper, we present eGIFT (http://biotm.cis.udel.edu/eGIFT), a web-based tool that associates informative terms, called iTerms, and sentences containing them, with genes. To associate iTerms with a gene, eGIFT ranks terms about the gene based on a score that compares the frequency of occurrence of a term in the gene's literature to its frequency of occurrence in documents about genes in general. To retrieve a gene's documents (Medline abstracts), eGIFT considers all gene names, aliases, and synonyms. Since many gene names are ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to the gene in question. An additional filtering step retains those abstracts that focus on the gene rather than mention it in passing. eGIFT's information for a gene is pre-computed, and users can search for genes by name or by EntrezGene identifier. iTerms are grouped into different categories to facilitate quick inspection. eGIFT also links an iTerm to sentences mentioning the term, allowing users to see the relation between the iTerm and the gene. We evaluated the precision and recall of eGIFT's iTerms for 40 genes; between 88% and 94% of the iTerms were marked as salient by our evaluators, and 94% of the UniProtKB keywords for these genes were also identified by eGIFT as iTerms. Conclusions: Our evaluations suggest that iTerms capture highly relevant aspects of genes. Furthermore, by showing sentences containing these terms, eGIFT can provide a quick description of a specific gene. eGIFT helps not only life scientists surveying the results of high-throughput experiments, but also annotators looking for articles that describe gene aspects and functions.
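The iTerm ranking above compares a term's frequency in a gene's literature to its frequency in documents about genes in general. eGIFT's exact formula is not given in the abstract; the score below is an illustrative stand-in in the same spirit, weighting a term's in-gene frequency by the log of its enrichment over the background:

```python
import math

def iterm_score(term_count, gene_docs, bg_count, bg_docs):
    """Illustrative iTerm score: in-gene document frequency weighted by
    log enrichment over the background corpus (an assumption, not
    eGIFT's published formula).

    term_count / gene_docs : docs mentioning the term among the gene's docs
    bg_count   / bg_docs   : docs mentioning the term in the background set
    """
    p_gene = term_count / gene_docs
    p_bg = max(bg_count, 1) / bg_docs  # smooth zero background counts
    return p_gene * math.log(p_gene / p_bg)
```

A term common in a gene's abstracts but rare overall scores high, while an equally common but generic term scores near zero, which is the behaviour the ranking needs.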