Spanish named entity recognition in the biomedical domain
Named Entity Recognition in the clinical domain and in languages other than English is hampered by incomplete dictionaries, informal text, polysemous terms, disagreement over entity boundaries, and the scarcity of corpora and other resources. We present a Named Entity Recognition method for poorly resourced languages. The method was tested on Spanish radiology reports and compared with a conditional random fields system. Peer Reviewed. Postprint (author's final draft).
Political Text Scaling Meets Computational Semantics
During the last fifteen years, automatic text scaling has become one of the
key tools of the Text as Data community in political science. Prominent text
scaling algorithms, however, rely on the assumption that latent positions can
be captured just by leveraging the information about word frequencies in
documents under study. We challenge this traditional view and present a new,
semantically aware text scaling algorithm, SemScale, which combines recent
developments in the area of computational linguistics with unsupervised
graph-based clustering. We conduct an extensive quantitative analysis over a
collection of speeches from the European Parliament in five different languages
and from two different legislative terms, and show that a scaling approach
relying on semantic document representations is often better at capturing known
underlying political dimensions than the established frequency-based (i.e.,
symbolic) scaling method. We further validate our findings through a series of
experiments focused on text preprocessing and feature selection, document
representation, scaling of party manifestos, and a supervised extension of our
algorithm. To catalyze further research on this new branch of text scaling
methods, we release a Python implementation of SemScale with all included data
sets and evaluation procedures. Comment: Updated version - accepted for Transactions on Data Science (TDS)
Multilingual Language Processing From Bytes
We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads
text as bytes and outputs span annotations of the form [start, length, label]
where start positions, lengths, and labels are separate entries in our
vocabulary. Because we operate directly on unicode bytes rather than
language-specific words or characters, we can analyze text in many languages
with a single model. Due to the small vocabulary size, these multilingual
models are very compact, but produce results similar to or better than the
state-of-the-art in Part-of-Speech tagging and Named Entity Recognition that
use only the provided training datasets (no external data sources). Our models
are learning "from scratch" in that they do not rely on any elements of the
standard pipeline in Natural Language Processing (including tokenization), and
thus can run in standalone fashion on raw text.
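
To make the [start, length, label] output format concrete, the following is a minimal sketch of how a span over raw bytes might be represented. It is not the authors' implementation: the UTF-8 encoding, the `Span` type, the helper names, and the `S14`/`L5` token spellings are all assumptions chosen for illustration.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical container for one [start, length, label] annotation."""
    start: int   # offset into the byte sequence, not the character sequence
    length: int  # span length in bytes
    label: str   # e.g. an entity type such as "LOC"


def to_bytes(text: str) -> List[int]:
    """Encode text as a list of byte values (assumes UTF-8)."""
    return list(text.encode("utf-8"))


def span_to_tokens(span: Span) -> List[str]:
    """Render a span as three separate output symbols (illustrative naming)."""
    return [f"S{span.start}", f"L{span.length}", span.label]


def extract(text: str, span: Span) -> str:
    """Recover the surface string that a byte-level span covers."""
    raw = text.encode("utf-8")
    return raw[span.start:span.start + span.length].decode("utf-8")


text = "Zoë lives in Köln"
byte_ids = to_bytes(text)

# Compute the byte offset of the entity; it differs from the character
# offset because "ë" occupies two bytes in UTF-8.
entity = "Köln"
start = len(text[:text.index(entity)].encode("utf-8"))
span = Span(start=start, length=len(entity.encode("utf-8")), label="LOC")

print(len(text), len(byte_ids))
print(span_to_tokens(span))
print(extract(text, span))
```

Because offsets and lengths count bytes, the same 256-value input alphabet covers every language, which is consistent with the small vocabulary the abstract mentions.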
- …