Search CORE

530 research outputs found

Proceedings of the LREC workshop on partial parsing : between chunk parsing and deep parsing

Author: Kübler Sandra
Piskorski Jakub
Przepiorkowski Adam
Publication venue
Publication date: 03/11/2008
Field of study

Hochschulschriftenserver - Universität Frankfurt am Main

A Survey of Paraphrasing and Textual Entailment Methods

Author: Androutsopoulos Ion
Malakasiotis Prodromos
Publication venue: 'AI Access Foundation'
Publication date: 30/05/2010
Field of study

Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions, such that a human who reads (and trusts) the first element of a pair would most likely infer that the other element is also true. Paraphrasing can be seen as bidirectional textual entailment and methods from the two areas are often similar. Both kinds of methods are useful, at least in principle, in a wide range of natural language processing applications, including question answering, summarization, text generation, and machine translation. We summarize key ideas from the two areas by considering in turn recognition, generation, and extraction methods, also pointing to prominent articles and resources.Comment: Technical Report, Natural Language Processing Group, Department of Informatics, Athens University of Economics and Business, Greece, 201

arXiv.org e-Print Archive

Crossref

Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

Author: Berkecz Péter
Farkas Richárd
Orosz György
Szabó Gergő
Szántó Zsolt
Publication venue
Publication date: 24/08/2023
Field of study

This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit with several improvements to its architecture. Compared to existing NLP tools for Hungarian, all of our pipelines feature all basic text processing steps including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing and named entity recognition with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under a permissive license.Comment: Submitted to TSD 2023 Conferenc

arXiv.org e-Print Archive

Investigating heterogeneous protein annotations toward cross-corpora utilization

Author: A Arnold
A Yeh
AM Cohen
B Alex
B Efron
C Nédellec
CJ Kuo
EFTK Sang
EW Noreen
F Rinaldi
F Sha
G Zhou
H Daumé III
H Shatkay
HL Johnson
J Wilbur
JD Kim
JD Kim
Jin-Dong Kim
Jun'ichi Tsujii
K Franzén
K Yoshida
KB Cohen
L Gillick
L Tanabe
MA Mandel
R Bunescu
R Bunescu
R Kabiljo
RTH Tsai
Rune Sætre
S Pyysalo
Sampo Pyysalo
T Ohta
V Hatzivassiloglou
X Sun
Y Song
Y Wang
Yue Wang
Publication venue: BioMed Central
Publication date: 01/12/2009
Field of study

Abstract Background The number of corpora, collections of structured texts, has been increasing, as a result of the growing interest in the application of natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, in the biomedical community, there is yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on resources that were divergently annotated. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Thus, it becomes a task of interest to integrate the heterogeneous annotations in these resources. Results We explore the potential sources of incompatibility among gene and protein annotations that were made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score performance declines tremendously when training with integrated data, instead of training with pure data; in some cases, the performance drops nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies, by performing a series of well-designed experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering the four aspects aforementioned. Conclusion Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

HuSpaCy : an industrial-strength Hungarian natural language processing toolkit

Author: Berkecz Péter
Farkas Richárd
Orosz György
Szabó Gergő
Szántó Zsolt
Publication venue
Publication date: 01/01/2022
Field of study

Although there are a couple of open-source language processing pipelines available for Hungarian, none of them satisfies the requirements of today’s NLP applications. A language processing pipeline should consist of close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition and word embeddings. Industrial text processing applications have to satisfy non-functional software quality requirements, what is more, frameworks supporting multiple languages are more and more favored. This paper introduces HuSpaCy, an industryready Hungarian language processing toolkit. The presented tool provides components for the most important basic linguistic analysis tasks. It is open-source and is available under a permissive license. Our system is built upon spaCy’s NLP components resulting in an easily usable, fast yet accurate application. Experiments confirm that HuSpaCy has high accuracy while maintaining resource-efficient prediction capabilities

University of Szeged