181,942 research outputs found
Morphological Priors for Probabilistic Neural Word Embeddings
Word embeddings allow natural language processing systems to share
statistical information across related words. These embeddings are typically
based on distributional statistics, making it difficult for them to generalize
to rare or unseen words. We propose to improve word embeddings by incorporating
morphological information, capturing shared sub-word features. Unlike previous
work that constructs word embeddings directly from morphemes, we combine
morphological and distributional information in a unified probabilistic
framework, in which the word embedding is a latent variable. The morphological
information provides a prior distribution on the latent word embeddings, which
in turn condition a likelihood function over an observed corpus. This approach
yields improvements on intrinsic word similarity evaluations, and also in the
downstream task of part-of-speech tagging.

Comment: Appeared at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016, Austin).
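The abstract above describes the word embedding as a latent variable whose prior comes from morphological information. As a minimal sketch of one way such a prior could look, assuming (hypothetically) a Gaussian prior whose mean is the sum of morpheme embeddings — the paper's actual parameterization may differ:

```python
import numpy as np

def morpheme_prior_logpdf(word_vec, morpheme_vecs, sigma=1.0):
    """Log-density of a word embedding under a Gaussian prior whose
    mean is the sum of its morpheme embeddings (illustrative scheme)."""
    mu = np.sum(morpheme_vecs, axis=0)
    d = word_vec.shape[0]
    diff = word_vec - mu
    return -0.5 * (np.dot(diff, diff) / sigma**2
                   + d * np.log(2 * np.pi * sigma**2))

# The latent embedding of e.g. "unhappiness" is pulled toward the sum of
# its morpheme vectors; distributional evidence from the corpus enters
# through the likelihood term, which is not sketched here.
rng = np.random.default_rng(0)
morphemes = rng.normal(size=(3, 50))        # e.g. "un-", "happy", "-ness"
word = morphemes.sum(axis=0) + 0.1 * rng.normal(size=50)
lp = morpheme_prior_logpdf(word, morphemes)
```

Under such a prior, a rare word's embedding is regularized toward words sharing its morphemes, which is exactly what helps generalization to rare or unseen words.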
Linking norms, ratings, and relations of words and concepts across multiple language varieties
Psychologists and linguists have collected a great diversity of data for word and concept properties. In psychology, many studies accumulate norms and ratings, such as word frequencies or age-of-acquisition, often for a large number of words. Linguistics, on the other hand, provides valuable insights into relations of word meanings. We present a collection of these data sets for norms, ratings, and relations that cover different languages: ‘NoRaRe.’ To enable a comparison between the diverse data types, we established workflows that facilitate the expansion of the database. A web application allows convenient access to the data (https://digling.org/norare/). Furthermore, a software API ensures consistent data curation by providing tests to validate the data sets. The NoRaRe collection is linked to the database curated by the Concepticon project (https://concepticon.clld.org), which offers a reference catalog of unified concept sets. The link between words in the data sets and the Concepticon concept sets makes a cross-linguistic comparison possible. In three case studies, we test the validity of our approach, the accuracy of our workflow, and the applicability of our database. The results indicate that the NoRaRe database can be applied for the study of word properties across multiple languages. The data can be used by psychologists and linguists to benefit from the knowledge rooted in both research disciplines.

Contents: Introduction; Combing Forests of Data; Materials and Methods (Materials; Methods: Manual Concept Mapping, Automated Concept Mapping, Semi-Automated Concept Mapping, Labeling Word and Concept Properties); Validation (Descriptive Statistics of NoRaRe; Data Curation Workflow; Data Applicability: Case Study 1: Replication of Existing Findings, Case Study 2: Comparison of Concept Mappings, Case Study 3: Cross-Linguistic Comparison); Discussion and Conclusion
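The key idea above is that word-level data sets in different languages become comparable once each word is linked to a shared Concepticon concept set. A minimal sketch of such a join, with made-up concept IDs and field values (not NoRaRe's actual API or data):

```python
# Hypothetical sketch: linking two word-property data sets through shared
# Concepticon concept-set IDs. The IDs and values below are illustrative.
english_aoa = {101: ("dog", 2.1), 205: ("tree", 3.4)}    # id -> (word, age of acquisition)
german_freq = {101: ("Hund", 5.2), 205: ("Baum", 4.8)}   # id -> (word, log frequency)

# Joining on the shared concept ID pairs properties across languages.
linked = {
    cid: {"en": english_aoa[cid], "de": german_freq[cid]}
    for cid in english_aoa.keys() & german_freq.keys()
}
# linked[101] pairs English "dog" ratings with German "Hund" ratings
```

The concept set acts as the pivot, so a rating collected for English words and a norm collected for German words can be compared row by row.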
A Unified multilingual semantic representation of concepts
Semantic representation lies at the core of several applications in Natural Language Processing. However, most existing semantic representation techniques cannot be used effectively for the representation of individual word senses. We put forward a novel multilingual concept representation, called MUFFIN, which not only enables accurate representation of word senses in different languages, but also provides multiple advantages over existing approaches. MUFFIN represents a given concept in a unified semantic space irrespective of the language of interest, enabling cross-lingual comparison of different concepts. We evaluate our approach on two evaluation benchmarks, semantic similarity and Word Sense Disambiguation, reporting state-of-the-art performance on several standard datasets.
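Because MUFFIN places all senses in one unified space, cross-lingual comparison reduces to a vector similarity between sense representations. A minimal sketch with made-up vectors (not MUFFIN's actual representations):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two sense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative 3-d vectors: in a unified space, the English river-bank
# sense can be compared directly with Italian senses.
bank_en_river = np.array([0.9, 0.1, 0.2])
riva_it       = np.array([0.8, 0.2, 0.1])   # "riva" (river bank)
banca_it      = np.array([0.1, 0.9, 0.3])   # "banca" (financial bank)
sim = cosine(bank_en_river, riva_it)
```

With real sense vectors, the cross-lingual match ("bank" vs. "riva") should score higher than the false friend ("bank" vs. "banca"), which is what makes the unified space useful.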
A Unified Model for Opinion Target Extraction and Target Sentiment Prediction
Target-based sentiment analysis involves opinion target extraction and target
sentiment classification. However, most of the existing works usually studied
one of these two sub-tasks alone, which hinders their practical use. This paper
aims to solve the complete task of target-based sentiment analysis in an
end-to-end fashion, and presents a novel unified model which applies a unified
tagging scheme. Our framework involves two stacked recurrent neural networks:
The upper one predicts the unified tags to produce the final output results of
the primary target-based sentiment analysis; The lower one performs an
auxiliary target boundary prediction aiming at guiding the upper network to
improve the performance of the primary task. To explore the inter-task
dependency, we propose to explicitly model the constrained transitions from
target boundaries to target sentiment polarities. We also propose to maintain
the sentiment consistency within an opinion target via a gate mechanism which
models the relation between the features for the current word and the previous
word. We conduct extensive experiments on three benchmark datasets and our
framework achieves consistently superior results.

Comment: AAAI 2019.
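The unified tagging scheme described above encodes target boundaries and sentiment polarity in a single tag per token, so one sequence labeler covers both sub-tasks. A minimal sketch, assuming (hypothetically) BIESO-style boundary tags crossed with polarity labels — the tag inventory is illustrative:

```python
# Hypothetical sketch of a unified tagging scheme: each token's tag jointly
# encodes target boundary (B/I/E/S/O) and sentiment polarity (POS/NEG/NEU).
def unified_tags(tokens, targets):
    """targets: list of (start, end, polarity), with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, pol in targets:
        if end - start == 1:
            tags[start] = f"S-{pol}"          # single-token target
        else:
            tags[start] = f"B-{pol}"          # beginning of target
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{pol}"          # inside of target
            tags[end - 1] = f"E-{pol}"        # end of target
    return tags

tokens = ["The", "spicy", "tuna", "roll", "was", "great"]
tags = unified_tags(tokens, [(1, 4, "POS")])
# -> ['O', 'B-POS', 'I-POS', 'E-POS', 'O', 'O']
```

Predicting these composite tags is what lets the model extract the target span and classify its sentiment in one end-to-end pass; the gate mechanism mentioned above would then encourage `I-POS`/`E-POS` to agree in polarity with the preceding `B-POS`.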
Huge automatically extracted training sets for multilingual Word Sense Disambiguation
We release to the community six large-scale sense-annotated datasets in multiple languages to pave the way for supervised multilingual Word Sense Disambiguation. Our datasets cover all the nouns in the English WordNet and their translations in other languages, for a total of millions of sense-tagged sentences. Experiments prove that these corpora can be effectively used as training sets for supervised WSD systems, surpassing the state of the art for low-resourced languages and providing competitive results for English, where manually annotated training sets are accessible. The data is available at trainomatic.org.
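One simple use of such large sense-annotated corpora is estimating a most-frequent-sense baseline per lemma, a standard reference point in WSD evaluation. A minimal sketch with made-up sense labels (the WordNet sense keys below are illustrative, not taken from the released data):

```python
from collections import Counter, defaultdict

# Illustrative (lemma, sense) pairs as they might come from a
# sense-tagged corpus; the sense labels here are invented.
tagged = [
    ("bank", "bank#finance"),
    ("bank", "bank#finance"),
    ("bank", "bank#river"),
    ("plant", "plant#flora"),
]

counts = defaultdict(Counter)
for lemma, sense in tagged:
    counts[lemma][sense] += 1

# Most frequent sense per lemma, usable as a WSD baseline.
mfs = {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}
# mfs["bank"] -> "bank#finance"
```

With millions of automatically tagged sentences, even this trivial statistic becomes reliable, and the same data can train full supervised WSD classifiers as the abstract reports.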