Multilingual Language Processing From Bytes
We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads
text as bytes and outputs span annotations of the form [start, length, label]
where start positions, lengths, and labels are separate entries in our
vocabulary. Because we operate directly on Unicode bytes rather than
language-specific words or characters, we can analyze text in many languages
with a single model. Due to the small vocabulary size, these multilingual
models are very compact, but produce results similar to or better than the
state-of-the-art in Part-of-Speech tagging and Named Entity Recognition that
use only the provided training datasets (no external data sources). Our models
are learning "from scratch" in that they do not rely on any elements of the
standard pipeline in Natural Language Processing (including tokenization), and
thus can run in standalone fashion on raw text.
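The [start, length, label] span format described above can be illustrated with a minimal sketch. It shows only how a span is expressed in UTF-8 byte offsets, not the LSTM model itself; the helper names `char_span_to_byte_span` and `byte_span_text` and the example sentence are hypothetical and do not come from the paper.

```python
from typing import Tuple

# A span annotation: (start byte offset, length in bytes, label).
Span = Tuple[int, int, str]


def char_span_to_byte_span(text: str, char_start: int, char_end: int, label: str) -> Span:
    """Convert a character-offset span into a UTF-8 byte-offset span."""
    start = len(text[:char_start].encode("utf-8"))
    length = len(text[char_start:char_end].encode("utf-8"))
    return (start, length, label)


def byte_span_text(text: str, span: Span) -> str:
    """Recover the surface string covered by a byte-offset span."""
    data = text.encode("utf-8")
    start, length, _label = span
    return data[start:start + length].decode("utf-8")


# Hypothetical example sentence: "Óscar lives in München"
# (written with escapes so the byte counts do not depend on file encoding).
text = "\u00d3scar lives in M\u00fcnchen"

person = char_span_to_byte_span(text, 0, 5, "PER")     # 5 characters, 6 bytes
location = char_span_to_byte_span(text, 15, 22, "LOC")  # 7 characters, 8 bytes
```

Because non-ASCII characters occupy more than one byte, byte offsets and character offsets diverge; a byte-level annotation scheme has to be consistent about which one it uses.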
Named Entity Recognition Only from Word Embeddings
Deep neural network models have helped named entity (NE) recognition achieve
amazing performance without handcrafting features. However, existing systems
require large amounts of human annotated training data. Efforts have been made
to replace human annotations with external knowledge (e.g., NE dictionary,
part-of-speech tags), while it is another challenge to obtain such effective
resources. In this work, we propose a fully unsupervised NE recognition model
which only needs to take informative clues from pre-trained word embeddings. We
first apply Gaussian Hidden Markov Model and Deep Autoencoding Gaussian Mixture
Model on word embeddings for entity span detection and type prediction, and
then further design an instance selector based on reinforcement learning to
distinguish positive sentences from noisy sentences and refine these
coarse-grained annotations through neural networks. Extensive experiments on
CoNLL benchmark datasets demonstrate that our proposed light NE recognition
model achieves remarkable performance without using any annotated lexicon or
corpus. Comment: Accepted by EMNLP202
BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision
We study the open-domain named entity recognition (NER) problem under distant
supervision. Distant supervision, though it does not require large amounts of
manual annotation, yields highly incomplete and noisy distant labels via
external knowledge bases. To address this challenge, we propose a new
computational framework -- BOND, which leverages the power of pre-trained
language models (e.g., BERT and RoBERTa) to improve the prediction performance
of NER models. Specifically, we propose a two-stage training algorithm: in the
first stage, we adapt the pre-trained language model to the NER tasks using the
distant labels, which can significantly improve the recall and precision; in
the second stage, we drop the distant labels, and propose a self-training
approach to further improve the model performance. Thorough experiments on 5
benchmark datasets demonstrate the superiority of BOND over existing distantly
supervised NER methods. The code and distantly labeled data have been released
at https://github.com/cliang1453/BOND. Comment: Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery
and Data Mining (KDD '20)
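To make concrete why distant labels from a knowledge base tend to be incomplete, here is a minimal sketch of generic dictionary-based distant labeling. It is not BOND's actual matching procedure, which the abstract does not describe; the function `distant_labels`, the tiny gazetteer, and the example sentence are all hypothetical.

```python
from typing import Dict, List, Tuple


def distant_labels(tokens: List[str], gazetteer: Dict[Tuple[str, ...], str]) -> List[str]:
    """Assign BIO labels by greedy longest-match lookup in a gazetteer.

    Tokens not covered by any gazetteer entry stay "O", even if they are
    entities, which is the source of incomplete distant labels.
    """
    labels = ["O"] * len(tokens)
    max_len = max((len(key) for key in gazetteer), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in gazetteer:
                entity_type = gazetteer[key]
                labels[i] = "B-" + entity_type
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + entity_type
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


# Hypothetical knowledge base: it knows two entities but not "Innsbruck".
gazetteer = {("Angela", "Merkel"): "PER", ("New", "York"): "LOC"}
tokens = ["Angela", "Merkel", "visited", "New", "York", "and", "Innsbruck"]
labels = distant_labels(tokens, gazetteer)
```

In this toy case "Innsbruck" is left unlabeled because the gazetteer does not contain it, which is the kind of missing label a second training stage would need to recover.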
Complete Semantics to empower Touristic Service Providers
The tourism industry has a significant impact on the world's economy,
contributing 10.2% of the world's gross domestic product in 2016. It has become a
very competitive industry, where having a strong online presence is an
essential aspect for business success. To achieve this goal, the proper usage
of the latest Web technologies, particularly schema.org annotations, is crucial. In
this paper, we present our effort to improve the online visibility of touristic
service providers in the region of Tyrol, Austria, by creating and deploying a
substantial amount of semantic annotations according to schema.org, a widely
used vocabulary for structured data on the Web. We started our work from
Tourismusverband (TVB) Mayrhofen-Hippach and all touristic service providers in
the Mayrhofen-Hippach region and applied the same approach to other TVBs and
regions, as well as other use cases. The rationale for doing this is
straightforward. Having schema.org annotations enables search engines to
understand the content better and provide better results for end users, and
enables various intelligent applications to utilize them. As a direct
consequence, the region of Tyrol and its touristic services increase their
online visibility and decrease their dependency on intermediaries, i.e., Online
Travel Agencies (OTAs). Comment: 18 pages, 6 figures
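As an illustration of what a schema.org annotation for a touristic service provider might look like, the sketch below builds a JSON-LD object in Python and wraps it in the script tag commonly used to embed such data in a web page. The hotel name and URL are placeholders, the helper names are hypothetical, and the choice of the `Hotel` and `PostalAddress` types is an assumption; the annotations actually deployed in the paper are not shown in the abstract.

```python
import json
from typing import Any, Dict


def hotel_annotation(name: str, locality: str, url: str) -> Dict[str, Any]:
    """Build a minimal schema.org Hotel description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Hotel",
        "name": name,
        "url": url,
        "address": {
            "@type": "PostalAddress",
            "addressLocality": locality,
            "addressRegion": "Tyrol",
            "addressCountry": "AT",
        },
    }


def to_script_tag(annotation: Dict[str, Any]) -> str:
    """Serialize the annotation as a JSON-LD script element for an HTML page."""
    body = json.dumps(annotation, indent=2, ensure_ascii=False)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Placeholder values, for illustration only.
annotation = hotel_annotation("Example Hotel", "Mayrhofen", "https://example.com")
tag = to_script_tag(annotation)
```

A search engine that parses the page can read the type, name, and address directly from this block instead of inferring them from free text.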