49 research outputs found

The PPI affix dictionary (PPIAD) and BioMethod Lexicon: importance of affixes and tags for recognition of entity mentions and experimental protein interactions

Author: Alfonso Valencia
Andrew Chatr-aryamontri
Ashish V Tendulkar
Florian Leitner
J Hakenberg
L Smith
M Krallinger
M Narayanaswamy
Martin Krallinger
O Sanchez-Graillet
Publication venue: BioMed Central
Publication date: 01/01/2010

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Iniciativas de evaluación para la indización semántica de literatura médica en español: PLANTL, LILACS, IBECS Y BIOASQ

Author: Bojo Canales Cristina
Intxaurrondo Ander
Krallinger Martin
Nentidis A.
Primo-Peña Elena
Villegas M.
Publication date: 01/04/2019

Presented at the XVI Jornadas Nacionales de Información y Documentación en Ciencias de la Salud, Oviedo, 4-5 April 2019.
The health flagship project of the Plan for the Advancement of Language Technology (PlanTL) aims to promote the development of natural language processing (NLP), text mining and machine translation resources for Spanish and the co-official languages. There is growing demand for better exploitation of datasets generated by clinicians, especially electronic health records, and for the integration and management of such data in personalized medicine platforms that also integrate information extracted from the literature. In this context, the PlanTL collaborates in organizing evaluation campaigns for clinical NLP and text mining systems, a key mechanism both for evaluating the quality of the results obtained by such automated systems and for promoting the development of language technology tools and resources. Given the importance of the literature for medical decision-making and the growing volume of Spanish medical publications, the PlanTL, in collaboration with the BSC, the CNIO, the Biblioteca Nacional de Ciencias de la Salud (BNCS) and the BioASQ team, has launched a shared task on the automatic indexing of Spanish abstracts with DeCS terms. The aim of this track is to generate semantic annotation resources that can assist manual indexing. The Spanish biomedical semantic indexing track of BioASQ (bioasq.org) will rely on abstracts of journals contained in the LILACS (Literatura Latinoamericana en Ciencias de la Salud) and IBECS (Índice Bibliográfico Español en Ciencias de la Salud) databases as a manually labeled Gold Standard benchmark set for developing automatic indexing algorithms, particularly those based on artificial intelligence language models.
Participating systems are evaluated through the BioASQ platform in a continuous evaluation process: participants are asked to assign DeCS terms automatically to records newly added to the databases, as they are made public and before the manual indexing results are released. Indexing performance is calculated by comparing automatic indexing against manual annotations. Thanks to the results of previous editions of BioASQ for indexing PubMed, the MeSH indexing process of that resource was considerably improved. This new effort on medical indexing in Spanish will serve to generate comparable resources for semantically indexing not only LILACS and IBECS but also other health databases and repositories in Spanish.

REPISALUD

CHEMDNER: The drugs and chemical names extraction challenge

Author: Krallinger M. (Martin)
Leitner F. (Florian)
Oyarzabal J. (Julen)
Rabal O. (Obdulia)
Valencia A. (Alfonso)
Vazquez M. (Miguel)
Publication venue: Chemistry Central
Publication date: 01/01/2015

Natural language processing (NLP) and text mining technologies for the chemical domain (ChemNLP or chemical text mining) are key to improving the access to and integration of information from unstructured data such as patents or the scientific literature. Therefore, the BioCreative organizers posed the CHEMDNER (chemical compound and drug name recognition) community challenge, which promoted the development of novel, competitive and accessible chemical text mining systems. This task allowed a comparative assessment of the performance of various methodologies using a carefully prepared collection of text manually labeled by specially trained chemists as Gold Standard data. We evaluated two important aspects: one covered the indexing of documents with chemicals (chemical document indexing, CDI task), and the other was concerned with finding the exact mentions of chemicals in text (chemical entity mention recognition, CEM task). In total, 27 teams (23 academic and 4 commercial, 87 researchers altogether) returned results for the CHEMDNER tasks: 26 teams for the CEM task and 23 for the CDI task. Top scoring teams obtained an F-score of 87.39% for the CEM task and 88.20% for the CDI task, a very promising result when compared to the agreement between human annotators (91%). The strategies used to detect chemicals included machine learning methods (e.g. conditional random fields) using a variety of features, chemistry and drug lexica, and domain-specific rules. We expect that the tools and resources resulting from this effort will have an impact on future developments of chemical text mining applications and will form the basis for finding related chemical information for the detected entities, such as toxicological or pharmacogenomic properties.
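As an illustration of how an F-score for exact mention recognition can be computed, here is a minimal sketch. It assumes mentions are represented as (document id, start offset, end offset) tuples and that only exact matches count as correct; the function name `precision_recall_f1` and the toy data are hypothetical and are not taken from the CHEMDNER evaluation software.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and balanced F-score over two collections."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (made up): each mention is (document id, start offset, end offset).
gold_mentions = {("doc1", 0, 7), ("doc1", 20, 29), ("doc2", 5, 12)}
predicted_mentions = {
    ("doc1", 0, 7),    # exact match
    ("doc1", 21, 29),  # boundary error, counts as a miss under exact matching
    ("doc2", 5, 12),   # exact match
    ("doc2", 30, 34),  # spurious mention
}

p, r, f = precision_recall_f1(gold_mentions, predicted_mentions)
print(f"precision={p:.3f} recall={r:.3f} F-score={f:.3f}")
```

The same function applies to a document-indexing setting if each item is a (document id, chemical name) pair instead of a character span.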

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Springer - Publisher Connector

Universidad de Navarra

PubMed Central

Dadun, University of Navarra

Archivo Digital UPM

The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text

Author: A Abi-Haidar
A Ceol
A Chatr-aryamontri
A Cohen
A Kolchinsky
A Lourenco
A McCallum
A Ng
A Yeh
Alfonso Valencia
AM Cohen
Andrew Chatr-aryamontri
Andrew Winter
Ashish V Tendulkar
B Aranda
B Settles
BP Suomela
C Blaschke
C Elkan
C Stark
Charles Elkan
D Bauer
D Salgado
David Salgado
E Marcotte
F Ehrler
F Leitner
F Rinaldi
Fabio Rinaldi
Feifan Liu
Florian Leitner
G Andrew
Gerold Schneider
Gianni Cesareni
GL Poulter
Graciela Gonzalez
H Daumé III
H Hermjakob
H Shatkay
H Wang
Hagit Shatkay
HK Rekapalli
I Donaldson
J Lin
Jean-Fred Fontaine
JR Curran
Keith Noto
KG Dowell
L Tanabe
Leonardo Briganti
Livia Perfetto
Luana Licata
Luis Rocha
Luisa Castagnoli
M Hall
M Harris
M Hollander
M Krallinger
M Oberoi
Marta Iannuccelli
Martin Krallinger
Miguel A Andrade-Navarro
Miguel Vazquez
Mike Tyers
P Wang
R Chowdhary
R Hoffmann
Rafal Rak
Rezarta Islamaj Dogan
Robert Leaman
S Kim
S Matos
S Orchard
Sergio Matos
Shashank Agarwal
Sun Kim
T Kappeler
T Ono
T Zhang
W Baumgartner
W Hersh
W John Wilbur
W Wilbur
Xinglong Wang
Y Niu
Y Sasaki
Z Cao
Zhiyong Lu
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: Determining the usefulness of biomedical text mining systems requires realistic task definitions and data selection criteria without artificial constraints, and measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. For this purpose, the BCIII-ACT corpus was provided, which includes a training, development and test set of over 12,000 PPI relevant and non-relevant PubMed abstracts labeled manually by domain experts, together with the recorded human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and the interaction detection method ontology concepts that had been applied to detect the PPIs reported in them.
RESULTS: A total of 11 teams participated in at least one of the two PPI tasks (10 in the ACT and 8 in the IMT) and a total of 62 persons were involved either as participants or in preparing data sets and evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. Of the 52 runs submitted for the ACT, the highest Matthews Correlation Coefficient (MCC) score measured was 0.55 at an accuracy of 89%, and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods; some also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, and some integrated NER approaches. For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations done by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, and the best MCC score was 0.55. For competitive systems with an acceptable recall (above 35%), the macro-averaged precision ranged between 50% and 80%, with a maximum F-score of 55%.
CONCLUSIONS: The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time required with unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.
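To make the relationship between accuracy and the Matthews Correlation Coefficient on unbalanced collections concrete, the following minimal sketch computes both from confusion-matrix counts. The counts are invented for illustration and are not the BioCreative III data; the function names `mcc` and `accuracy` are likewise hypothetical.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention: report 0 when any marginal total is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


def accuracy(tp, fp, tn, fn):
    """Fraction of all items that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Made-up collection: 150 relevant and 850 non-relevant abstracts.
always_negative = dict(tp=0, fp=0, tn=850, fn=150)   # labels everything non-relevant
modest_classifier = dict(tp=60, fp=40, tn=810, fn=90)

for name, counts in [("always negative", always_negative),
                     ("modest classifier", modest_classifier)]:
    print(f"{name}: accuracy={accuracy(**counts):.2f} MCC={mcc(**counts):.2f}")
```

In this toy setting the trivial classifier reaches high accuracy while its MCC stays at zero, which is one reason a correlation-based score is informative when the classes are unbalanced.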

Crossref

Springer - Publisher Connector

Monash University Research Portal

Annotating genes and genomes with DNA sequences extracted from biomedical articles

Author: Aerts
Anderson
Benson
Casey M. Bergman
Cock
Colosimo
Dowell
Fulp
Garcia-Remesal
Gerner
Gibson
Gray
Hakenberg
Holley
Hubbard
Karolchik
Kent
Kersey
Krallinger
Maglott
Martin Gerner
Maximilian Haeussler
Morgan
Rhead
Roberts
Semon
Shtatland
The FlyBase Consortium
Vandesompele
Visel
Weiss
Wren
Yoshida
Publication venue: Oxford University Press
Publication date: 01/01/2011

Motivation: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study

CiteSeerX

Crossref

PubMed Central

The University of Manchester - Institutional Repository

pubmed2ensembl: A Resource for Mining the Biological Literature on Genes

Author: A Doms
AA Morgan
AM Jenkinson
B Giardine
BA Eckman
C Plake
Casey M. Bergman
D Hull
D Maglott
D Smedley
E Ryder
EM Zdobnov
G Zhou
Goran Nenadic
H Miller
H Parkinson
J Hakenberg
J Hirschman
J Tamames
JM Fernandez
Joachim Baran
L Chen
L Hirschman
M Ashburner
M Gerner
M Haeussler
M Huang
M Krallinger
Martin Gerner
Maximilian Haeussler
P Flicek
P Kersey
PA Fujita
R Drysdale
R Hoffmann
R Leinonen
R Lyne
RC Gentleman
S Matos
SM Gallo
SP Shah
SS Dwight
Stein Aerts
T Imanishi
TJ Lee
U Mudunuri
W Xuan
Y Makita
Y Yoshida
Z Lu
Publication venue: Public Library of Science
Publication date: 29/09/2011

The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms dedicated teams manually curate publications about genes; however, for species with no such dedicated staff many thousands of articles are never mapped to genes or genomic regions.
To overcome the lack of integration between genomic data and biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We use several sources of curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text-mining on MEDLINE records) gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we also implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features. Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data.
By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed genome informatics inspired solution to accessing the ever-increasing biomedical literature.

Crossref

Directory of Open Access Journals

PubMed Central

The University of Manchester - Institutional Repository

Text mining for biology - the way forward: opinions from leading scientists

Author: Altman Russ B
Bergman Casey M
Blake Judith
Blaschke Christian
Cohen Aaron
Gannon Frank
Grivell Les
Hahn Udo
Hersh William
Hirschman Lynette
Jensen Lars Juhl
Krallinger Martin
Mons Barend
O'Donoghue Seán I
Peitsch Manuel C
Rebholz-Schuhmann Dietrich
Shatkay Hagit
Valencia Alfonso
Publication venue: BioMed Central
Publication date: 01/01/2008

This article collects opinions from leading scientists about how text mining can provide better access to the biological literature, how the scientific community can help with this process, what the next steps are, and what role future BioCreative evaluations can play. The responses identify several broad themes, including the possibility of fusing literature and biological databases through text mining; the need for user interfaces tailored to different classes of users and supporting community-based annotation; the importance of scaling text mining technology and inserting it into larger workflows; and suggestions for additional challenge evaluations, new applications, and additional resources needed to make progress

Springer - Publisher Connector

PubMed Central

Copenhagen University Research Information System

EUR Research Repository

UNSWorks

The University of Manchester - Institutional Repository

Jointly creating digital abstracts: dealing with synonymy and polysemy

Author: A Ceol
AR Pico
B Mons
C Blaschke
CN Arighi
F Leitner
H Pearson
M Krallinger
M Seringhaus
Martin Kuiper
NE Fuchs
P Jaiswal
RG Côté
S Vercruysse
Steven Vercruysse
T Kelder
T Kuhn
TA Eyre
WA Baumgartner Jr
Publication venue: 'Springer Science and Business Media LLC'

Crossref

BioCreative III interactive task: an overview

The BioCreative challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. The biocurator community, as an active user of biomedical literature, provides a diverse and engaged end user group for text mining tools. Earlier BioCreative challenges involved many text mining teams in developing basic capabilities relevant to biological curation, but they did not address the issues of system usage, insertion into the workflow and adoption by curators. Thus in BioCreative III (BC-III), the InterActive Task (IAT) was introduced to address the utility and usability of text mining tools for real-life biocuration tasks. To support the aims of the IAT in BC-III, involvement of both developers and end users was solicited, and the development of a user interface to address the tasks interactively was requested

Crossref

The Jackson Laboratory: The Mouseion at the JAXlibrary

Springer

Springer - Publisher Connector

PubMed Central

ZORA

ART

NORA - Norwegian Open Research Archives

Linguistic measures of chemical diversity and the "keywords" of molecular collections

Author: A Cadeddu
A Kilgarriff
A Roy
B Kowalczyk
B Zhang
C Bian
C Lipinski
D Conte
D Hoover
EJ Martin
F Font-Clos
F Tweedie
FW Goldberg
G Skoraczyński
GM Maggiora
GM Rishton
JW Raymond
K Kettunen
M Krallinger
M Kubát
M Suggitt
MA Covington
ME Welsch
MM Cone
NG Olinghouse
S Soh
WP Walters
Y Cao
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/05/2018

Computerized linguistic analyses have proven of immense value in comparing and searching through large text collections ("corpora"), including those deposited on the Internet. Indeed, it would nowadays be hard to imagine browsing the Web without, for instance, search algorithms extracting the most appropriate keywords from documents. This paper describes how such corpus-linguistic concepts can be extended to chemistry based on characteristic "chemical words" that span more than traditional functional groups and, instead, look at common structural fragments that molecules share. Using these words, it is possible to quantify the diversity of chemical collections/databases in new ways and to define molecular "keywords" by which such collections are best characterized and annotated.

Crossref

ScholarWorks@UNIST