Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification
Text classification must sometimes be applied in a low-resource language with
no labeled training data. However, training data may be available in a related
language. We investigate whether character-level knowledge transfer from a
related language helps text classification. We present a cross-lingual document
classification framework (CACO) that exploits cross-lingual subword similarity
by jointly training a character-based embedder and a word-based classifier. The
embedder derives vector representations for input words from their written
forms, and the classifier makes predictions based on the word vectors. We use a
joint character representation for both the source language and the target
language, which allows the embedder to generalize knowledge about source
language words to target language words with similar forms. We propose a
multi-task objective that can further improve the model if additional
cross-lingual or monolingual resources are available. Experiments confirm that
character-level knowledge transfer is more data-efficient than word-level
transfer between related languages.
Comment: AAAI 202
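A character-based embedder of this kind can be sketched in miniature. The sketch below is not the paper's CACO implementation: the trained character embedding table is replaced by a deterministic hashed stand-in, and the word vector is a simple mean of character vectors. It only illustrates the core idea that a shared character representation maps similarly written words from related languages to nearby vectors.

```python
import hashlib
import math

DIM = 16

def char_vec(ch):
    # Deterministic stand-in for a trained character embedding
    # (assumption: hashed bytes replace learned parameters).
    h = hashlib.sha256(ch.encode("utf-8")).digest()
    return [b / 255.0 - 0.5 for b in h[:DIM]]

def embed(word):
    # Word vector = mean of its character vectors, so words with
    # similar written forms land near each other regardless of
    # which language they come from.
    vecs = [char_vec(c) for c in word]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

Because the source and target languages share one character inventory, a cognate in the target language gets a vector close to its source-language counterpart even if the exact word form never appeared in training; a word-level transfer method has no such fallback for unseen forms.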
Revisiting the Context Window for Cross-lingual Word Embeddings
Existing approaches to mapping-based cross-lingual word embeddings are based
on the assumption that the source and target embedding spaces are structurally
similar. The structures of embedding spaces largely depend on the co-occurrence
statistics of each word, which the choice of context window determines. Despite
this obvious connection between the context window and mapping-based
cross-lingual embeddings, their relationship has been underexplored in prior
work. In this work, we provide a thorough evaluation, in various languages,
domains, and tasks, of bilingual embeddings trained with different context
windows. The highlight of our findings is that increasing both the source and
target window sizes improves the performance of bilingual lexicon induction,
especially on frequent nouns.
Comment: ACL202
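The link between the context window and co-occurrence statistics can be made concrete with a small sketch. This is a plain co-occurrence counter, not the mapping-based embedding method evaluated in the paper; the function name and toy corpus are illustrative:

```python
from collections import Counter

def cooccurrence(tokens, window):
    """Count symmetric (word, context) pairs within the given window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts
```

Widening the window strictly adds (word, context) pairs, so it directly changes the co-occurrence statistics that each embedding space is built from, and hence how structurally similar two spaces end up being.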
Analysis and Applications of Cross-Lingual Models in Natural Language Processing
Human languages vary in both typology and data availability. A typical machine learning-based approach to natural language processing (NLP) requires training data in the language of interest. However, because machine learning-based approaches rely heavily on the amount of data available in each language, models trained for languages without large amounts of data perform poorly. One way to overcome the lack of data in a given language is cross-lingual transfer learning from resource-rich languages to resource-scarce languages. Cross-lingual word embeddings and multilingual contextualized embeddings are commonly used to conduct cross-lingual transfer learning. However, the lack of resources still makes it challenging to either evaluate or improve such models. This dissertation first proposes a graph-based method that overcomes the lack of evaluation data in low-resource languages by focusing on the structure of cross-lingual word embeddings, then discusses approaches to improving cross-lingual transfer learning through retrofitting methods and by focusing on a specific task. Finally, it provides an analysis of the effect of adding different languages when pretraining multilingual models.
Safeguarding Privacy Through Deep Learning Techniques
Over the last few years, there has been a growing need to meet minimum security and privacy requirements. Both public and private companies have had to comply with increasingly stringent standards, such as the ISO 27000 family, and with the various laws governing the management of personal data. The sheer volume of data to be managed has demanded a huge effort from employees who, in the absence of automatic techniques, have had to work tirelessly to achieve certification objectives. Unfortunately, because of the sensitive information contained in the documentation surrounding these problems, it is difficult if not impossible to obtain material for research and study purposes on which to test new ideas and techniques for automating these processes, drawing on active work in the scientific community on ontologies and artificial intelligence for data management. To bypass this problem, we chose to work with data from the medical domain, which, largely for important reasons related to individual health, has gradually become more freely accessible over time. This choice does not affect the generality of the proposed methods, which can be reapplied to any field that must manage privacy-sensitive information.
From Word to Sense Embeddings: A Survey on Vector Representations of Meaning
Over the past years, distributed semantic representations have proved to be
effective and flexible keepers of prior knowledge to be integrated into
downstream applications. This survey focuses on the representation of meaning.
We start from the theoretical background behind word vector space models and
highlight one of their major limitations: the meaning conflation deficiency,
which arises from representing a word with all its possible meanings as a
single vector. Then, we explain how this deficiency can be addressed through a
transition from the word level to the more fine-grained level of word senses
(in its broader acceptation) as a method for modelling unambiguous lexical
meaning. We present a comprehensive overview of the wide range of techniques in
the two main branches of sense representation, i.e., unsupervised and
knowledge-based. Finally, this survey covers the main evaluation procedures and
applications for this type of representation, and provides an analysis of four
of its important aspects: interpretability, sense granularity, adaptability to
different domains, and compositionality.
Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence Research
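The meaning conflation deficiency described above can be illustrated with a toy calculation. If a word vector behaves roughly like an average of its sense vectors (an idealization, using made-up two-dimensional vectors), the ambiguous word ends up close to neither sense:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Made-up sense vectors for the two senses of "bank".
bank_finance = [1.0, 0.0]
bank_river = [0.0, 1.0]

# A single word vector conflates both senses into one point,
# equidistant from each and a faithful representation of neither.
bank_word = [(a + b) / 2 for a, b in zip(bank_finance, bank_river)]
```

Sense representations, whether unsupervised or knowledge-based, keep `bank_finance` and `bank_river` as separate vectors, avoiding this averaging effect.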