Arabic Fine-Grained Entity Recognition
Traditional NER systems are typically trained to recognize coarse-grained
entities, and less attention is given to classifying entities into a hierarchy
of fine-grained lower-level subtypes. This article aims to advance Arabic NER
with fine-grained entities. We chose to extend Wojood (an open-source Nested
Arabic Named Entity Corpus) with subtypes. In particular, four main entity
types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG),
and facility (FAC), are extended with 31 subtypes. To do this, we first revised
Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's
ACE guidelines, which yielded 5,614 changes. Second, all mentions of GPE, LOC,
ORG, and FAC (~44K) in Wojood were manually annotated with the LDC's ACE
subtypes. We refer to this extended version of Wojood as WojoodFine. To
evaluate our annotations, we measured the inter-annotator agreement (IAA) using
both Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively.
To compute the baselines of WojoodFine, we fine-tuned three pre-trained Arabic
BERT encoders in three settings (flat NER, nested NER, and nested NER with
subtypes), achieving F1 scores of 0.920, 0.866, and 0.885, respectively. Our
corpus and models are open-source and available at
https://sina.birzeit.edu/wojood/
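The agreement figures above rest on Cohen's Kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The following is a minimal sketch of that statistic, not the authors' evaluation script: the function name `cohen_kappa` and the label strings are hypothetical, and the abstract does not say whether agreement was measured per token or per mention.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical subtype labels for illustration; not taken from WojoodFine.
ann_1 = ["GPE_NATION", "ORG_GOV", "FAC_AIRPORT", "GPE_NATION"]
ann_2 = ["GPE_NATION", "ORG_GOV", "FAC_AIRPORT", "ORG_GOV"]
kappa = cohen_kappa(ann_1, ann_2)
print(round(kappa, 4))  # 0.6364
```

In this toy case the annotators agree on three of four items (0.75), but chance agreement is 0.3125, so kappa drops to about 0.64. A value near 0.99, as reported above, means almost no disagreement remains after the chance correction.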
BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages
We present BPEmb, a collection of pre-trained subword unit embeddings in 275
languages, based on Byte-Pair Encoding (BPE). In an evaluation using
fine-grained entity typing as testbed, BPEmb performs competitively, and for
some languages better than alternative subword approaches, while requiring
vastly fewer resources and no tokenization. BPEmb is available at
https://github.com/bheinzerling/bpem
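To illustrate what Byte-Pair Encoding does, here is a toy sketch of merge learning and segmentation in plain Python. It is not BPEmb's training code and does not use the BPEmb package. The function names (`apply_merge`, `learn_bpe`, `segment`) and the three-word corpus are hypothetical, chosen so the result is easy to check by hand.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` with its concatenation."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts out as a sequence of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Split an unseen word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

toy_corpus = ["abab", "abab", "abc"]  # hypothetical toy data
merges = learn_bpe(toy_corpus, num_merges=2)
pieces = segment("ababc", merges)
print(merges)  # [('a', 'b'), ('ab', 'ab')]
print(pieces)  # ['abab', 'c']
```

Because segmentation only replays merges over characters, it needs no language-specific tokenizer. An unseen word always decomposes into known subword units or, at worst, single characters.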
New Embedded Representations and Evaluation Protocols for Inferring Transitive Relations
Beyond word embeddings, continuous representations of knowledge graph (KG)
components, such as entities, types and relations, are widely used for entity
mention disambiguation, relation inference and deep question answering. Great
strides have been made in modeling general, asymmetric or antisymmetric KG
relations using Gaussian, holographic, and complex embeddings. None of these
directly enforce transitivity inherent in the is-instance-of and is-subtype-of
relations. A recent proposal, called order embedding (OE), demands that the
vector representing a subtype elementwise dominates the vector representing a
supertype. However, the manner in which such constraints are asserted and
evaluated has some limitations. In this short research note, we make three
contributions specific to representing and inferring transitive relations.
First, we propose and justify a significant improvement to the OE loss
objective. Second, we propose a new representation of types as
hyper-rectangular regions that generalize and improve on OE. Third, we show
that some current protocols to evaluate transitive relation inference can be
misleading, and offer a sound alternative. Rather than use black-box deep
learning modules off-the-shelf, we develop our training networks using
elementary geometric considerations.
Comment: Accepted at SIGIR 201
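The elementwise-dominance constraint described above can be sketched as a simple penalty, alongside a containment test for hyper-rectangles. This is only an illustration under stated assumptions: the squared-hinge form is a common way to score order violations and is not the improved loss the note proposes. Treating "is-subtype-of" as box containment is also an assumption, since the abstract does not spell out the box semantics. All names and vectors are hypothetical.

```python
def oe_violation(sub_vec, super_vec):
    """Squared-hinge penalty for the order-embedding constraint.

    Zero exactly when sub_vec is >= super_vec in every coordinate,
    i.e. when the subtype vector elementwise dominates the supertype.
    """
    return sum(max(0.0, p - c) ** 2 for c, p in zip(sub_vec, super_vec))

def box_contains(outer, inner):
    """True if box `inner` lies inside box `outer`.

    Each box is a (lows, highs) pair of equal-length coordinate lists.
    """
    (o_lo, o_hi), (i_lo, i_hi) = outer, inner
    return all(ol <= il and ih <= oh
               for ol, il, ih, oh in zip(o_lo, i_lo, i_hi, o_hi))

# Hypothetical 2-d embeddings for a chain dog < mammal < animal.
animal, mammal, dog = [1.0, 0.5], [1.5, 1.0], [2.0, 1.5]
print(oe_violation(dog, mammal))  # 0.0 (constraint satisfied)
print(oe_violation(dog, animal))  # 0.0 (follows by transitivity)
print(oe_violation(animal, dog))  # 2.0 (reversed pair is penalised)

# Hypothetical boxes: the subtype's box sits inside the supertype's box.
animal_box = ([0.0, 0.0], [10.0, 10.0])
dog_box = ([2.0, 3.0], [4.0, 5.0])
print(box_contains(animal_box, dog_box))  # True
print(box_contains(dog_box, animal_box))  # False
```

Both elementwise dominance and box containment are transitive by construction. This is why geometric representations of this kind suit is-instance-of and is-subtype-of hierarchies.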
Discovering Power Laws in Entity Length
This paper presents a discovery that the length of the entities in various
datasets follows a family of scale-free power law distributions. The concept of
entity here broadly includes the named entity, entity mention, time expression,
aspect term, and domain-specific entity that are well investigated in natural
language processing and related areas. The entity length denotes the number of
words in an entity. The power law distributions in entity length possess the
scale-free property and have well-defined means and finite variances. We
explain the phenomenon of power laws in entity length by the principle of least
effort in communication and the preferential mechanism
- …
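The claim about well-defined means and finite variances can be read against a standard fact: for a pure power law p(x) proportional to x^(-alpha), the mean is finite only when alpha > 2 and the variance only when alpha > 3. The sketch below estimates alpha from a list of entity lengths using the approximate maximum-likelihood estimator for discrete data from the power-law fitting literature, which is only a rough approximation when x_min is small. It is not the paper's fitting procedure, and the function names and sample lengths are hypothetical.

```python
import math

def estimate_alpha(lengths, x_min=1):
    """Approximate MLE of the power-law exponent for integer data.

    Uses alpha ~= 1 + n / sum(ln(x / (x_min - 0.5))) over all x >= x_min.
    Assumes x_min >= 1 so that the shifted lower bound stays positive.
    """
    if x_min < 1:
        raise ValueError("x_min must be at least 1")
    xs = [x for x in lengths if x >= x_min]
    if not xs:
        raise ValueError("no lengths at or above x_min")
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in xs)
    return 1.0 + len(xs) / log_sum

def moments_exist(alpha):
    """Return (mean_is_finite, variance_is_finite) for exponent alpha."""
    return alpha > 2.0, alpha > 3.0

# Hypothetical entity lengths in words; not data from the paper.
sample_lengths = [1, 1, 2]
alpha = estimate_alpha(sample_lengths)
print(round(alpha, 3))       # 2.082
print(moments_exist(alpha))  # (True, False)
```

With only three samples the estimate is meaningless as a measurement. The example only shows how the estimator and the moment thresholds fit together.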