Spanish named entity recognition in the biomedical domain

Author: Cotik Viviana
Rodríguez Hontoria Horacio
Vivaldi Palatresi Jorge
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

Named Entity Recognition in the clinical domain and in languages other than English is hampered by the absence of complete dictionaries, the informality of texts, the polysemy of terms, disagreement about entity boundaries, and the scarcity of corpora and other resources. We present a Named Entity Recognition method for poorly resourced languages. The method was tested with Spanish radiology reports and compared with a conditional random fields system. Peer Reviewed. Postprint (author's final draft)

UPCommons. Portal del coneixement obert de la UPC

Political Text Scaling Meets Computational Semantics

Author: Glavas Goran
Nanni Federico
Ponzetto Simone Paolo
Rehbein Ines
Stuckenschmidt Heiner
Publication date: 01/01/2021

During the last fifteen years, automatic text scaling has become one of the key tools of the Text as Data community in political science. Prominent text scaling algorithms, however, rely on the assumption that latent positions can be captured just by leveraging the information about word frequencies in documents under study. We challenge this traditional view and present a new, semantically aware text scaling algorithm, SemScale, which combines recent developments in the area of computational linguistics with unsupervised graph-based clustering. We conduct an extensive quantitative analysis over a collection of speeches from the European Parliament in five different languages and from two different legislative terms, and show that a scaling approach relying on semantic document representations is often better at capturing known underlying political dimensions than the established frequency-based (i.e., symbolic) scaling method. We further validate our findings through a series of experiments focused on text preprocessing and feature selection, document representation, scaling of party manifestos, and a supervised extension of our algorithm. To catalyze further research on this new branch of text scaling methods, we release a Python implementation of SemScale with all included data sets and evaluation procedures. Comment: Updated version - accepted for Transactions on Data Science (TDS)

arXiv.org e-Print Archive

MAnnheim DOCument Server

NERD: Evaluating Named Entity Recognition Tools in the Web of Data

Author: Rizzo G.
Troncy R.
Publication date: 01/01/2011

EURECOM Repository

PORTO Publications Open Repository TOrino

Multilingual Language Processing From Bytes

Author: Brunk Cliff
Gillick Dan
Subramanya Amarnag
Vinyals Oriol
Publication date: 01/01/2016

We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as bytes and outputs span annotations of the form [start, length, label] where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on Unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model. Due to the small vocabulary size, these multilingual models are very compact, but produce results similar to or better than the state-of-the-art in Part-of-Speech tagging and Named Entity Recognition that use only the provided training datasets (no external data sources). Our models are learning "from scratch" in that they do not rely on any elements of the standard pipeline in Natural Language Processing (including tokenization), and thus can run in standalone fashion on raw text.
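
The input and output format described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' model and contains no LSTM; it only shows what it means to read text as UTF-8 bytes and to express annotations as [start, length, label] triples measured in bytes. The names `Span`, `text_to_bytes`, and `decode_spans`, the example sentence, and the labels `PER` and `LOC` are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """Hypothetical container for one [start, length, label] annotation."""
    start: int   # offset of the first byte of the span
    length: int  # number of bytes covered by the span
    label: str   # entity or tag label


def text_to_bytes(text: str) -> List[int]:
    """Encode text as UTF-8; every input symbol is an integer in 0..255."""
    return list(text.encode("utf-8"))


def decode_spans(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Map byte-level spans back to the surface strings they cover."""
    raw = text.encode("utf-8")
    return [
        (raw[s.start:s.start + s.length].decode("utf-8"), s.label)
        for s in spans
    ]


# Hand-written example annotations (not model output).
text = "Óscar lives in Málaga"
spans = [Span(0, 6, "PER"), Span(16, 7, "LOC")]

byte_ids = text_to_bytes(text)
entities = decode_spans(text, spans)
```

Because accented characters such as "Ó" and "á" occupy two bytes in UTF-8, the byte sequence is longer than the character sequence, and span offsets count bytes, not characters. The input vocabulary never exceeds 256 symbols regardless of language, which is consistent with the abstract's remark about small vocabulary size.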

arXiv.org e-Print Archive

Crossref