6,182 research outputs found
Multi-lingual Common Semantic Space Construction via Cluster-consistent Word Embedding
We construct a multilingual common semantic space based on distributional
semantics, where words from multiple languages are projected into a shared
space to enable knowledge and resource transfer across languages. Beyond word
alignment, we introduce multiple cluster-level alignments and enforce the word
clusters to be consistently distributed across multiple languages. We exploit
three signals for clustering: (1) neighbor words in the monolingual word
embedding space; (2) character-level information; and (3) linguistic properties
(e.g., apposition, locative suffix) derived from linguistic structure knowledge
bases available for thousands of languages. We introduce a new
cluster-consistent correlational neural network to construct the common
semantic space by aligning words as well as clusters. Intrinsic evaluation on
monolingual and multilingual QVEC tasks shows our approach achieves
significantly higher correlation with linguistic features than state-of-the-art
multi-lingual embedding learning methods do. Using low-resource language name
tagging as a case study for extrinsic evaluation, our approach achieves up to
24.5\% absolute F-score gain over the state of the art.Comment: 10 page
Introduction to the special issue on cross-language algorithms and applications
With the increasingly global nature of our everyday interactions, the need for multilingual technologies to support efficient and efective information access and communication cannot be overemphasized. Computational modeling of language has been the focus of
Natural Language Processing, a subdiscipline of Artificial Intelligence. One of the current challenges for this discipline is to design methodologies and algorithms that are cross-language in order to create multilingual technologies rapidly. The goal of this JAIR special
issue on Cross-Language Algorithms and Applications (CLAA) is to present leading research in this area, with emphasis on developing unifying themes that could lead to the development of the science of multi- and cross-lingualism. In this introduction, we provide the reader with the motivation for this special issue and summarize the contributions of the papers that have been included. The selected papers cover a broad range of cross-lingual technologies including machine translation, domain and language adaptation for sentiment
analysis, cross-language lexical resources, dependency parsing, information retrieval and knowledge representation. We anticipate that this special issue will serve as an invaluable resource for researchers interested in topics of cross-lingual natural language processing.Postprint (published version
Language classification from bilingual word embedding graphs
We study the role of the second language in bilingual word embeddings in
monolingual semantic evaluation tasks. We find strongly and weakly positive
correlations between down-stream task performance and second language
similarity to the target language. Additionally, we show how bilingual word
embeddings can be employed for the task of semantic language classification and
that joint semantic spaces vary in meaningful ways across second languages. Our
results support the hypothesis that semantic language similarity is influenced
by both structural similarity as well as geography/contact.Comment: To be published at Coling 201
Applying digital content management to support localisation
The retrieval and presentation of digital content such as that on the World Wide Web (WWW) is a substantial area of research. While recent years have seen huge expansion in the size of web-based archives that can be searched efficiently by commercial search engines, the presentation of potentially relevant content is still limited to ranked document lists represented by simple text snippets or image keyframe surrogates. There is expanding interest in techniques to personalise the presentation of content to improve the richness and effectiveness of the user experience. One of the most significant challenges to achieving this is the increasingly multilingual nature of this data, and the need to provide suitably localised responses to users based on this content. The Digital Content Management (DCM) track of the Centre for Next Generation Localisation (CNGL) is seeking to develop technologies to support advanced personalised access and presentation of information by combining elements from the existing research areas of Adaptive Hypermedia and Information Retrieval. The combination of these technologies is intended to produce significant improvements in the way users access information. We review key features of these technologies and introduce early ideas for how these technologies can support localisation and localised content before concluding with some impressions of future directions in DCM
- âŠ