Mimicking Word Embeddings using Subword RNNs
Word embeddings improve generalization over lexical features by placing each
word in a lower-dimensional space, using distributional information obtained
from unlabeled data. However, the effectiveness of word embeddings for
downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which
embeddings do not exist. In this paper, we present MIMICK, an approach to
generating OOV word embeddings compositionally, by learning a function from
spellings to distributional embeddings. Unlike prior work, MIMICK does not
require re-training on the original word embedding corpus; instead, learning is
performed at the type level. Intrinsic and extrinsic evaluations demonstrate
the power of this simple approach. On 23 languages, MIMICK improves performance
over a word-based baseline for tagging part-of-speech and morphosyntactic
attributes. It is competitive with (and complementary to) a supervised
character-based model in low-resource settings. Comment: EMNLP 2017.
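The core idea — learning a type-level function from spellings to pretrained vectors, with no re-training on the embedding corpus — can be sketched without the paper's character BiLSTM. The toy below (vocabulary, dimensions, and the linear character-bigram model are all illustrative stand-ins, not the paper's architecture) fits a map from bigram counts to embedding vectors by least squares, then composes a vector for an OOV spelling:

```python
import numpy as np

def char_bigrams(word):
    """Character bigrams with boundary markers, e.g. 'cat' -> ^c, ca, at, t$."""
    w = f"^{word}$"
    return [w[i:i + 2] for i in range(len(w) - 1)]

# Toy "pretrained" embeddings (in practice: word2vec/GloVe/Polyglot vectors).
# Training is type-level: one (spelling, vector) pair per vocabulary word.
rng = np.random.default_rng(0)
vocab = ["cat", "cats", "dog", "dogs", "run", "runs", "ran"]
dim = 8
pretrained = {w: rng.normal(size=dim) for w in vocab}

# Feature space: all bigrams observed in the vocabulary.
bigram_index = {b: i for i, b in
                enumerate(sorted({b for w in vocab for b in char_bigrams(w)}))}

def featurize(word):
    x = np.zeros(len(bigram_index))
    for b in char_bigrams(word):
        if b in bigram_index:      # bigrams unseen in training are dropped
            x[bigram_index[b]] += 1.0
    return x

X = np.stack([featurize(w) for w in vocab])    # (|V|, n_bigrams)
Y = np.stack([pretrained[w] for w in vocab])   # (|V|, dim)
W, *_ = np.linalg.lstsq(X, Y, rcond=None)      # least-squares "mimicking" map

def mimic(word):
    """Compose an embedding for any spelling, including OOV words."""
    return featurize(word) @ W

oov_vec = mimic("doggy")   # OOV word: shares bigrams with 'dog'/'dogs'
print(oov_vec.shape)       # (8,)
```

The real model replaces the linear map with a character BiLSTM trained to minimize the distance to the pretrained vectors, but the type-level regression setup is the same.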
Assessment of timber extraction distance and skid road network in steep karst terrain
This study aims to define a simple and effective method for calculating skidding distances on steep karst terrain rich in ground obstacles (stoniness and rockiness), to support decision planning of the secondary and primary forest infrastructure network for timber extraction in productive selective-cut forests. Variations between geometrical extraction distances and actual distances were highlighted at the operational planning level (i.e., the compartment level) through GIS-based calculation models, focusing on cable-skidder timber extraction. Automation in defining geometrical and real extraction distances, as well as relative forest openness, was achieved by geo-processing workflows in a GIS environment. Because the extraction correction factor varies at the compartment level from a minimum of 1.19 to a maximum of 5.05 within the same management unit, it can be concluded that planning harvesting operations (timber extraction) at the operational level should not rely on correction factors previously obtained for entire terrain (topographical) categories, sub-categories or even management units.
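The correction factor in question is simply the ratio of the actual skidding distance along the road network to the straight-line geometrical distance. A hypothetical per-compartment calculation (compartment IDs and distances are invented for illustration, chosen to span the reported 1.19–5.05 range) shows why a single blanket factor per terrain category can mislead:

```python
# Hypothetical compartment data: geometrical (straight-line) distance vs.
# actual skidding distance along the skid road network, in metres.
# Values are illustrative only, not taken from the study.
compartments = {
    "12a": (180.0, 214.0),
    "12b": (150.0, 480.0),
    "13a": (220.0, 1111.0),
}

# Correction factor = actual / geometrical, computed per compartment.
factors = {cid: actual / geom for cid, (geom, actual) in compartments.items()}
for cid, k in sorted(factors.items()):
    print(f"compartment {cid}: correction factor {k:.2f}")
```

Applying one average factor to all three compartments would badly over- or under-estimate extraction distance in at least two of them, which is the study's argument for compartment-level GIS calculation.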
Hierarchical Character-Word Models for Language Identification
Social media messages' brevity and unconventional spelling pose a challenge
to language identification. We introduce a hierarchical model that learns
character and contextualized word-level representations for language
identification. Our method performs well against strong baselines, and can
also reveal code-switching.
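Schematically, such a hierarchical model builds word representations from characters and then combines them over the message. The toy below is only a schematic stand-in: hashed character trigrams replace the learned character network, averaging replaces the contextualizing word-level network, and nearest-prototype scoring replaces the trained classifier. All sample data is invented.

```python
import zlib
import numpy as np

DIM = 32

def word_rep(word):
    """Character level: hash character trigrams into a fixed-size vector."""
    w = f"^{word}$"
    v = np.zeros(DIM)
    for i in range(len(w) - 2):
        # Deterministic hashing via crc32 keeps runs reproducible.
        v[zlib.crc32(w[i:i + 3].encode()) % DIM] += 1.0
    return v / max(1.0, np.linalg.norm(v))

def message_rep(text):
    """Word level: combine word representations over the whole message."""
    return np.mean([word_rep(w) for w in text.lower().split()], axis=0)

# Toy per-language samples (illustrative, not the paper's corpora).
samples = {
    "en": ["the quick brown fox", "where are you going"],
    "nl": ["waar ga je naartoe", "de snelle bruine vos"],
}
prototypes = {lang: np.mean([message_rep(s) for s in texts], axis=0)
              for lang, texts in samples.items()}

def identify(text):
    m = message_rep(text)
    return max(prototypes, key=lambda lang: float(m @ prototypes[lang]))

print(identify("where is the fox"))
```

Because word representations are built from characters, unconventional spellings still land near known words, which is the property the hierarchical model exploits for short social media messages.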
Modeling of the Acute Toxicity of Benzene Derivatives by Complementary QSAR Methods
A data set containing acute toxicity values (96-h LC50) of 69 substituted benzenes for
fathead minnow (Pimephales promelas) was investigated with two Quantitative
Structure-Activity Relationship (QSAR) models, one using and one not using
molecular descriptors. Recursive Neural Networks (RNN) derive a QSAR by direct treatment of the
molecular structure, described through an appropriate graphical tool (variable-size labeled
rooted ordered trees) by defining suitable representation rules. The input trees are encoded by
an adaptive process able to learn, by tuning its free parameters, from a given set of
structure-activity training examples. Owing to the use of a flexible encoding approach, the model is
target invariant and does not need a priori definition of molecular descriptors. The results
obtained in this study were analyzed together with those of a model based on molecular
descriptors, i.e. a Multiple Linear Regression (MLR) model using CROatian MultiRegression
selection of descriptors (CROMRsel). The comparison revealed interesting similarities that
could lead to the development of a combined approach, exploiting the complementary
characteristics of the two methods.
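The descriptor-free branch works by folding a network over the labeled rooted ordered tree describing the molecule. The sketch below shows only the shape of that recursive encoding; the weights are fixed random stand-ins for parameters that would be learned from structure-activity examples, and the atom labels and toy tree are invented, not from the study:

```python
import numpy as np

rng = np.random.default_rng(42)
LABELS = ["C", "O", "N", "Cl"]     # toy atom/group labels (illustrative)
LABEL_DIM, HIDDEN = len(LABELS), 6
MAX_CHILDREN = 3                   # ordered trees with bounded arity

# Fixed random parameters stand in for weights tuned on training examples.
W_label = rng.normal(scale=0.5, size=(HIDDEN, LABEL_DIM))
W_child = rng.normal(scale=0.5, size=(MAX_CHILDREN, HIDDEN, HIDDEN))
w_out = rng.normal(size=HIDDEN)

def one_hot(label):
    v = np.zeros(LABEL_DIM)
    v[LABELS.index(label)] = 1.0
    return v

def encode(tree):
    """Recursively encode a (label, children) tree into a hidden vector."""
    label, children = tree
    h = W_label @ one_hot(label)
    for i, child in enumerate(children):   # ordered: each position has own weights
        h = h + W_child[i] @ encode(child)
    return np.tanh(h)

def predict_activity(tree):
    """Map the root encoding to a scalar readout (e.g. a toxicity estimate)."""
    return float(w_out @ encode(tree))

# A toy labeled rooted ordered tree: C with children O and (C with child Cl).
mol = ("C", [("O", []), ("C", [("Cl", [])])])
print(predict_activity(mol))
```

Because the encoding is computed directly from the tree, the same model is target-invariant: no descriptor set has to be chosen in advance, which is the complementarity with the descriptor-based CROMRsel MLR model noted above.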
In search of isoglosses: continuous and discrete language embeddings in Slavic historical phonology
This paper investigates the ability of neural network architectures to
effectively learn diachronic phonological generalizations in a multilingual
setting. We employ models using three different types of language embedding
(dense, sigmoid, and straight-through). We find that the Straight-Through model
outperforms the other two in terms of accuracy, but the Sigmoid model's
language embeddings show the strongest agreement with the traditional
subgrouping of the Slavic languages. The Straight-Through model has learned
coherent, semi-interpretable information about sound change, and we outline
directions for future research.
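The three embedding types differ mainly in how the language vector is treated in the forward and backward passes; the straight-through variant discretizes on the forward pass but lets gradients flow through unchanged. A minimal standalone sketch of that estimator (not the paper's full architecture; the example values are invented):

```python
import numpy as np

def straight_through_binarize(x, threshold=0.5):
    """Forward: hard 0/1 code. Backward: identity (gradient copied through)."""
    hard = (x > threshold).astype(float)
    def backward(upstream_grad):
        # Straight-through estimator: treat the non-differentiable threshold
        # as identity, so gradients reach the continuous parameters unchanged.
        return upstream_grad
    return hard, backward

# Continuous language-embedding parameters (illustrative values).
lang_emb = np.array([0.9, 0.2, 0.6, 0.4])
discrete, backward = straight_through_binarize(lang_emb)
print(discrete)                               # [1. 0. 1. 0.]
grad = backward(np.array([0.1, -0.2, 0.3, 0.0]))
```

The discrete code is what makes the learned embeddings resemble binary isogloss-style features, while the identity backward pass keeps the model trainable by gradient descent.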