19,821 research outputs found
Mimicking Word Embeddings using Subword RNNs
Word embeddings improve generalization over lexical features by placing each
word in a lower-dimensional space, using distributional information obtained
from unlabeled data. However, the effectiveness of word embeddings for
downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which
embeddings do not exist. In this paper, we present MIMICK, an approach to
generating OOV word embeddings compositionally, by learning a function from
spellings to distributional embeddings. Unlike prior work, MIMICK does not
require re-training on the original word embedding corpus; instead, learning is
performed at the type level. Intrinsic and extrinsic evaluations demonstrate
the power of this simple approach. On 23 languages, MIMICK improves performance
over a word-based baseline for tagging part-of-speech and morphosyntactic
attributes. It is competitive with (and complementary to) a supervised
character-based model in low-resource settings. Comment: EMNLP 201
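The core of the MIMICK idea is a character-level model trained at the type level to reproduce pretrained word vectors from spellings. Below is a minimal PyTorch sketch of that idea; the character vocabulary, dimensions, and training pair are illustrative placeholders, not the authors' code or settings.

```python
# Minimal sketch of a MIMICK-style model: a character BiLSTM that maps a
# word's spelling to a vector in the space of pretrained word embeddings.
# Hyper-parameters, the character vocabulary and the training pair are
# illustrative, not the paper's settings.
import torch
import torch.nn as nn

class CharToEmbedding(nn.Module):
    def __init__(self, n_chars, char_dim=20, hidden=64, emb_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.rnn = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, emb_dim)

    def forward(self, char_ids):                  # (batch, word_len)
        h, _ = self.rnn(self.char_emb(char_ids))  # (batch, word_len, 2*hidden)
        # use the last time step as a simple summary of the spelling
        return self.out(h[:, -1])                 # (batch, emb_dim)

# Type-level training: one (spelling, pretrained embedding) pair per word type.
chars = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
model = CharToEmbedding(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

word = "cat"
ids = torch.tensor([[chars[c] for c in word]])
target = torch.randn(1, 100)        # stand-in for the word's pretrained vector

for _ in range(10):
    opt.zero_grad()
    loss = ((model(ids) - target) ** 2).sum()     # squared-distance objective
    loss.backward()
    opt.step()

# At test time, an OOV word's embedding is simply model(ids_for_its_spelling).
```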
Text segmentation with character-level text embeddings
Learning word representations has recently seen much success in computational
linguistics. However, assuming sequences of word tokens as input to linguistic
analysis is often unjustified. For many languages word segmentation is a
non-trivial task and naturally occurring text is sometimes a mixture of natural
language strings and other character data. We propose to learn text
representations directly from raw character sequences by training a simple
recurrent network to predict the next character in text. The network uses its
hidden layer to evolve abstract representations of the character sequences it
sees. To demonstrate the usefulness of the learned text embeddings, we use them
as features in a supervised character level text segmentation and labeling
task: recognizing spans of text containing programming language code. By using
the embeddings as features we are able to substantially improve over a baseline
which uses only surface character n-grams. Comment: Workshop on Deep Learning for Audio, Speech and Language Processing,
ICML 201
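The approach amounts to training a character-level language model and then reusing its hidden states as features. A minimal sketch under those assumptions (toy corpus, small simple recurrent network in PyTorch, illustrative sizes):

```python
# Sketch: train a simple recurrent network as a character-level language
# model, then reuse its hidden states as per-character features for a
# segmentation/labelling classifier. Corpus and sizes are placeholders.
import torch
import torch.nn as nn

text = "def f(x): return x + 1  # code mixed with prose"
vocab = {c: i for i, c in enumerate(sorted(set(text)))}
ids = torch.tensor([vocab[c] for c in text])

class CharSRN(nn.Module):
    def __init__(self, n_chars, dim=32):
        super().__init__()
        self.emb = nn.Embedding(n_chars, dim)
        self.rnn = nn.RNN(dim, dim, batch_first=True)   # Elman-style SRN
        self.out = nn.Linear(dim, n_chars)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h), h        # next-character logits + hidden states

model = CharSRN(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(20):                  # next-character prediction objective
    logits, _ = model(ids[:-1].unsqueeze(0))
    loss = loss_fn(logits.squeeze(0), ids[1:])
    opt.zero_grad(); loss.backward(); opt.step()

_, hidden = model(ids.unsqueeze(0))  # hidden[0, t] is the feature vector for
                                     # character t, usable alongside surface
                                     # n-grams in a segmentation model
```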
Semantic reconstruction of continuous language from MEG signals
Decoding language from neural signals holds considerable theoretical and
practical importance. Previous research has indicated the feasibility of
decoding text or speech from invasive neural signals. However, when using
non-invasive neural signals, significant challenges are encountered due to
their low quality. In this study, we proposed a data-driven approach for
decoding the semantics of language from magnetoencephalography (MEG) signals
recorded while subjects were listening to continuous speech. First, a
multi-subject decoding model was trained using contrastive learning to
reconstruct continuous word embeddings from MEG data. Subsequently, a beam
search algorithm was adopted to generate text sequences based on the
reconstructed word embeddings. Given a candidate sentence in the beam, a
language model was used to predict the subsequent words. The word embeddings of
the subsequent words were correlated with the reconstructed word embedding.
These correlations were then used as a measure of the probability of each candidate
next word. The results showed that the proposed continuous word embedding model can
effectively leverage both subject-specific and subject-shared information.
Additionally, the decoded text exhibited significant similarity to the target
text, with an average BERTScore of 0.816, a score comparable to that reported in the
previous fMRI study.
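The scoring step described above can be sketched as follows: a language model proposes next words, each proposal is scored by correlating its word embedding with the embedding reconstructed from MEG, and the best-scoring extensions are kept in the beam. Everything below (embedding table, proposal function, beam width) is a placeholder, not the study's actual pipeline.

```python
# Schematic beam-extension step: rank candidate next words by the Pearson
# correlation between their embeddings and the MEG-reconstructed embedding.
import numpy as np

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def extend_beam(beam, reconstructed_vec, lm_propose, word_vectors, width=5):
    """beam: list of (tokens, score); returns the `width` best extensions."""
    candidates = []
    for tokens, score in beam:
        for word in lm_propose(tokens):        # next-word proposals from an LM
            if word not in word_vectors:
                continue
            corr = pearson(word_vectors[word], reconstructed_vec)
            candidates.append((tokens + [word], score + corr))
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:width]

# Toy usage with random stand-ins for the embedding table and the LM.
rng = np.random.default_rng(0)
word_vectors = {w: rng.normal(size=300) for w in ["the", "dog", "ran", "sat"]}
lm_propose = lambda tokens: ["dog", "ran", "sat"]   # placeholder proposal fn
beam = [(["the"], 0.0)]
beam = extend_beam(beam, reconstructed_vec=rng.normal(size=300),
                   lm_propose=lm_propose, word_vectors=word_vectors)
```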
Named Entity Recognition in Spanish Biomedical Literature: Short Review and Bert Model
Named Entity Recognition (NER) is the first step for knowledge acquisition when we deal with an unknown corpus of texts. Having extracted these entities, we can form a parameter space and address text-mining problems such as concept normalization, speech recognition, etc. Recent advances in NER are related to word embeddings, which transform text into a form effective for deep learning. In this paper, we show how NER detects pharmacological substances, compounds, and proteins in the dataset obtained from the Spanish Clinical Case Corpus (SPACCC). To achieve this goal, we use contextualized word embeddings based on the BERT language representation, which yield better results than standard word embeddings.
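As a rough illustration of the BERT-based setup, the sketch below runs token classification over a Spanish sentence with the Hugging Face transformers API. The checkpoint name and label set are placeholders, not the paper's choices; in practice a Spanish clinical checkpoint fine-tuned on SPACCC-style annotations would be substituted.

```python
# Schematic BERT token-classification pass for biomedical NER.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "bert-base-multilingual-cased"              # placeholder checkpoint
labels = ["O", "B-CHEM", "I-CHEM", "B-PROT", "I-PROT"]   # illustrative tag set

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels))

sentence = "Se administró paracetamol al paciente."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                      # (1, seq_len, num_labels)

pred_ids = logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, i in zip(tokens, pred_ids):
    print(tok, labels[i])   # per-subword tags; the model would first be
                            # fine-tuned on annotated clinical text
```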
Using Holographically Compressed Embeddings in Question Answering
Word vector representations are central to deep learning natural language
processing models. Many forms of these vectors, known as embeddings, exist,
including word2vec and GloVe. Embeddings are trained on large corpora and learn
the word's usage in context, capturing the semantic relationship between words.
However, the semantics from such training are at the level of distinct words
(known as word types), and can be ambiguous when, for example, a word type can
be either a noun or a verb. In question answering, parts-of-speech and named
entity types are important, but encoding these attributes in neural models
expands the size of the input. This research employs holographic compression of
pre-trained embeddings, to represent a token, its part-of-speech, and named
entity type, in the same dimension as representing only the token. The
implementation, in a modified question answering recurrent deep learning
network, shows that semantic relationships are preserved, and yields strong
performance. Comment: 12 pages, 6 figures, 1 table, 9th International Conference on
Advanced Information Technologies and Applications (ICAITA 2020), July 11~12,
2020, Toronto, Canada, Advanced Natural Language Processing Sub-Conference
(AdNLP 2020)
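Holographic compression of this kind is commonly implemented with holographic reduced representations, where circular convolution binds two vectors into a single vector of the same dimensionality and circular correlation approximately unbinds them. A small NumPy sketch of that operation, using random stand-in vectors rather than pretrained embeddings:

```python
# Sketch of holographic (circular-convolution) binding: a token vector, a POS
# vector and an entity-type vector are combined without growing the dimension.
import numpy as np

def bind(a, b):
    # circular convolution via FFT: the core binding operation
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(c, a):
    # circular correlation approximately inverts binding with `a`
    return np.real(np.fft.ifft(np.fft.fft(c) * np.conj(np.fft.fft(a))))

dim = 300
rng = np.random.default_rng(0)
token = rng.normal(0, 1 / np.sqrt(dim), dim)      # stand-in word embedding
pos_noun = rng.normal(0, 1 / np.sqrt(dim), dim)
ent_person = rng.normal(0, 1 / np.sqrt(dim), dim)
role_pos = rng.normal(0, 1 / np.sqrt(dim), dim)
role_ent = rng.normal(0, 1 / np.sqrt(dim), dim)

# superpose the token with role-bound attributes; result stays dim-dimensional
encoded = token + bind(role_pos, pos_noun) + bind(role_ent, ent_person)

# an approximate, noisy readout of the POS attribute
recovered = unbind(encoded, role_pos)
print(np.corrcoef(recovered, pos_noun)[0, 1])     # noticeably > 0 for large dim
```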
Multilingual Acoustic Word Embedding Models for Processing Zero-Resource Languages
Acoustic word embeddings are fixed-dimensional representations of
variable-length speech segments. In settings where unlabelled speech is the
only available resource, such embeddings can be used in "zero-resource" speech
search, indexing and discovery systems. Here we propose to train a single
supervised embedding model on labelled data from multiple well-resourced
languages and then apply it to unseen zero-resource languages. For this
transfer learning approach, we consider two multilingual recurrent neural
network models: a discriminative classifier trained on the joint vocabularies
of all training languages, and a correspondence autoencoder trained to
reconstruct word pairs. We test these using a word discrimination task on six
target zero-resource languages. When trained on seven well-resourced languages,
both models perform similarly and outperform unsupervised models trained on the
zero-resource languages. With just a single training language, the second model
works better, but performance depends more on the particular training–testing
language pair. Comment: 5 pages, 4 figures, 1 table; accepted to ICASSP 2020. arXiv admin
note: text overlap with arXiv:1811.0040
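The word discrimination evaluation mentioned above is typically run as a same-different task: pairs of segment embeddings are ranked by similarity and scored with average precision against whether the two segments share a word label. A toy sketch with random placeholder embeddings and labels:

```python
# Sketch of same-different word discrimination for acoustic word embeddings.
import numpy as np
from itertools import combinations
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 64))       # 20 speech segments, 64-d vectors
labels = rng.integers(0, 5, size=20)         # stand-in word identities

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

same, scores = [], []
for i, j in combinations(range(len(labels)), 2):
    same.append(int(labels[i] == labels[j]))            # 1 if same word type
    scores.append(cosine(embeddings[i], embeddings[j])) # higher = more similar

print("same-different AP:", average_precision_score(same, scores))
```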