127 research outputs found

    Next generation community assessment of biomedical entity recognition web servers: metrics, performance, interoperability aspects of BeCalm

    Background: Shared tasks and community challenges are key instruments to promote research and collaboration and to determine the state of the art of biomedical and chemical text mining technologies. Traditionally, such tasks relied on the comparison of automatically generated results against a so-called Gold Standard dataset of manually labelled textual data, regardless of the efficiency and robustness of the underlying implementations. Due to the rapid growth of unstructured data collections, including patent databases and particularly the scientific literature, there is a pressing need to generate, assess and expose robust big data text mining solutions that semantically enrich documents in real time. To address this need, a novel track called “Technical interoperability and performance of annotation servers” was launched under the umbrella of the BioCreative text mining evaluation effort. The aim of this track was to enable the continuous assessment of technical aspects of text annotation web servers, specifically of online biomedical named entity recognition systems of interest for medicinal chemistry applications.
    Results: A total of 15 out of 26 registered teams successfully implemented online annotation servers. They returned predictions in predefined formats during a two-month period and were evaluated through the BeCalm evaluation platform, specifically developed for this track. The track encompassed three levels of evaluation, i.e. data format considerations, technical metrics and functional specifications. Participating annotation servers were implemented in seven different programming languages and covered 12 general entity types. The continuous evaluation of server responses accounted for testing periods of low activity and moderate to high activity, encompassing overall 4,092,502 requests from three different document provider settings. The median response time was below 3.74 s, with a median of 10 annotations/document. Most of the servers showed great reliability and stability, being able to process over 100,000 requests in a 5-day period.
    Conclusions: The presented track was a novel experimental task that systematically evaluated the technical performance aspects of online entity recognition systems, and it raised the interest of a significant number of participants. Future editions of the competition will address the ability to process documents in bulk as well as to annotate full-text documents.
    Funding: Portuguese Foundation for Science and Technology | Ref. UID/BIO/04469/2013; Portuguese Foundation for Science and Technology | Ref. COMPETE 2020 (POCI-01-0145-FEDER-006684); Xunta de Galicia | Ref. ED431C2018/55-GRC; European Commission | Ref. H2020, n. 65402
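    The annotation servers evaluated in this track are, in essence, HTTP endpoints that accept a document and return entity mentions with character offsets. As a rough illustration of that interaction, the sketch below implements a toy annotation server with Python's standard library; the request and response field names (document_id, text, init, end, annotated_text, type, score) are assumptions chosen for illustration, not the official BeCalm API, and the dictionary lookup stands in for a real named entity recognizer.

```python
# Minimal sketch of an online annotation server (assumed JSON schema, toy NER).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy entity dictionary standing in for a real named entity recognition model.
LEXICON = {"aspirin": "CHEMICAL", "ibuprofen": "CHEMICAL", "BRCA1": "GENE"}

class AnnotationHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Parse the JSON request body (assumed fields: document_id, text).
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        text = request.get("text", "")

        # Naive dictionary lookup over the document text.
        annotations = []
        for term, ent_type in LEXICON.items():
            start = text.find(term)
            while start != -1:
                annotations.append({
                    "document_id": request.get("document_id"),
                    "init": start,               # start character offset
                    "end": start + len(term),    # end character offset
                    "annotated_text": term,
                    "type": ent_type,
                    "score": 1.0,
                })
                start = text.find(term, start + 1)

        body = json.dumps(annotations).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Serve until interrupted; an evaluation platform would POST documents here.
    HTTPServer(("0.0.0.0", 8080), AnnotationHandler).serve_forever()
```

    A client would POST JSON such as {"document_id": "...", "text": "..."} and receive a JSON array of annotations; response time and annotations/document are then measured on the platform side.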

    The biomedical abbreviation recognition and resolution (BARR) track: Benchmarking, evaluation and importance of abbreviation recognition systems applied to Spanish biomedical abstracts

    Healthcare professionals are generating a substantial volume of clinical data in narrative form. As healthcare providers are confronted with serious time constraints, they frequently use telegraphic phrases, domain-specific abbreviations and shorthand notes. Efficient clinical text processing tools need to cope with the recognition and resolution of abbreviations, a task that has been extensively studied for English documents. Despite the outstanding number of clinical documents written worldwide in Spanish, only a marginal number of studies have been published on this topic. In clinical texts, as opposed to the medical literature, abbreviations are generally used without their definitions or expanded forms. The aim of the first Biomedical Abbreviation Recognition and Resolution (BARR) track, organized as part of the IberEval 2017 evaluation campaign, was to assess and promote the development of systems for generating a sense inventory of medical abbreviations. The BARR track required the detection of mentions of abbreviations or short forms and their corresponding long forms or definitions from Spanish medical abstracts. For this track, the organizers provided the BARR medical document collection, the BARR corpus of manually annotated abstracts labelled by domain experts and the BARR-Markyt evaluation platform. A total of 7 teams submitted 25 runs for the two BARR subtasks: (a) the identification of mentions of abbreviations and their definitions and (b) the correct detection of short form-long form pairs. Here we describe the BARR track setting, the obtained results and the methodologies used by participating systems. The BARR task summary, corpus, resources and evaluation tool for testing systems beyond this campaign are available at: http://temu.inab.org. We acknowledge the Encomienda MINETAD-CNIO/OTG Sanidad Plan TL and the Open-Minted (654021) H2020 project for funding.
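    Subtask (b) asks systems to pair a short form with the long form that defines it in the same passage. A common baseline for this kind of pairing, in the spirit of the Schwartz-Hearst heuristic, scans “long form (SF)” patterns and checks that the short form's characters appear, in order, within the preceding words. The sketch below illustrates the idea with deliberately simplified rules; it is not one of the systems evaluated in the BARR track.

```python
# Simplified short-form/long-form pairing sketch (Schwartz-Hearst-style heuristic).
import re

def candidate_pairs(sentence):
    """Yield (short_form, long_form) candidates from 'long form (SF)' patterns."""
    for match in re.finditer(r"\(([^()]{1,15})\)", sentence):
        short_form = match.group(1).strip()
        if not short_form or not any(c.isalpha() for c in short_form):
            continue
        # Consider up to len(short_form) * 2 preceding words as the long-form window.
        preceding = sentence[:match.start()].rstrip().split()
        window = " ".join(preceding[-(len(short_form) * 2):])
        long_form = _match_long_form(short_form, window)
        if long_form:
            yield short_form, long_form

def _match_long_form(short_form, window):
    """Scan right-to-left: each alphanumeric short-form character must appear,
    in order, in the window; the first character must start a word."""
    s, w = short_form.lower(), window.lower()
    si, wi = len(s) - 1, len(w) - 1
    while si >= 0:
        c = s[si]
        if not c.isalnum():
            si -= 1
            continue
        while wi >= 0 and (w[wi] != c or
                           (si == 0 and wi > 0 and w[wi - 1].isalnum())):
            wi -= 1
        if wi < 0:
            return None
        si -= 1
        wi -= 1
    return window[wi + 1:].strip()

print(list(candidate_pairs(
    "Paciente con enfermedad pulmonar obstructiva cronica (EPOC) en tratamiento.")))
# -> [('EPOC', 'enfermedad pulmonar obstructiva cronica')]
```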

    BC4GO: a full-text corpus for the BioCreative IV GO task

    Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered an accuracy that is comparable with humans. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F1-measures. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ∼10% of the relevant evidence sentences and 30% of the distinct GO terms, while the Results/Experiment section has nearly 60% of the relevant sentences and >70% of the GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. This result demonstrates the need for using full-text articles for text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community.
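    The strict sentence-selection agreement figures quoted above are F1-measures computed between pairs of curators, treating one curator's selections as the reference and the other's as the predictions (F1 is symmetric under this swap). A minimal sketch of that computation, with sentence identifiers invented purely for illustration rather than taken from the BC4GO protocol, could look as follows.

```python
# F1-based inter-annotator agreement over selected evidence sentences (sketch).
def f1_agreement(annotator_a, annotator_b):
    """Treat annotator A as reference and B as prediction; F1 is symmetric here."""
    a, b = set(annotator_a), set(annotator_b)
    if not a and not b:
        return 1.0
    overlap = len(a & b)
    precision = overlap / len(b) if b else 0.0
    recall = overlap / len(a) if a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical sentence IDs selected as GO evidence by two curators for one article.
curator_1 = {"PMC123:s4", "PMC123:s9", "PMC123:s17"}
curator_2 = {"PMC123:s4", "PMC123:s17", "PMC123:s21", "PMC123:s30"}
print(round(f1_agreement(curator_1, curator_2), 3))  # 0.571
```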

    Cross-Platform Text Mining and Natural Language Processing Interoperability - Proceedings of the LREC2016 conference

    No abstract available

    Benchmarking biomedical text mining web servers at BioCreative V.5: the technical Interoperability and Performance of annotation Servers - TIPS track

    The TIPS track consisted of a novel experimental task under the umbrella of the BioCreative text mining challenges, with the aim of carrying out, for the first time, a text mining challenge focused on the continuous assessment of technical aspects of text annotation web servers, specifically of biomedical online named entity recognition systems. A total of 13 teams registered annotation servers, implemented in various programming languages and supporting up to 12 different general annotation types. The continuous evaluation period took place from February to March 2017. The systematic and continuous evaluation of server responses accounted for testing periods of low activity and moderate to high activity. Moreover, three document provider settings were covered, including NCBI PubMed. For a total of 4,092,502 requests, the median response time for most servers was below 3.74 s, with a median of 10 annotations/document. Most of the servers showed great reliability and stability, being able to process 100,000 requests in 5 days.
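    The reported figures (median response time below 3.74 s, a median of 10 annotations/document) are order statistics over the per-request records collected during the evaluation period. The short sketch below shows how such medians could be derived from a request log; the log format and document identifiers are assumptions for illustration, not the actual schema of the TIPS evaluation platform.

```python
# Deriving median response time and median annotations/document from a toy log.
import statistics

# Each record: (document_id, response_seconds, number_of_annotations_returned).
request_log = [
    ("CA2073855C", 1.92, 7),
    ("24060724", 3.10, 12),
    ("BC1403855A", 0.84, 10),
    ("25417781", 5.45, 23),
    ("WO2015189236", 2.70, 4),
]

median_response = statistics.median(t for _, t, _ in request_log)
median_annotations = statistics.median(n for _, _, n in request_log)
print(f"median response time: {median_response:.2f} s")      # 2.70 s
print(f"median annotations/document: {median_annotations}")  # 10
```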