Exploiting Cross-Lingual Representations For Natural Language Processing
Traditional approaches to supervised learning require a generous amount of labeled data for good generalization. While such annotation-heavy approaches have proven useful for some Natural Language Processing (NLP) tasks in high-resource languages (like English), they are unlikely to scale to languages where collecting labeled data is difficult and time-consuming. Translating supervision available in English is also not a viable solution, because developing a good machine translation system requires expensive-to-annotate resources which are not available for most languages.
In this thesis, I argue that cross-lingual representations are an effective means of extending NLP tools to languages beyond English without resorting to generous amounts of annotated data or expensive machine translation. These representations can be learned in an inexpensive manner, often from signals completely unrelated to the task of interest. I begin with a review of different ways of inducing such representations using a variety of cross-lingual signals, and study algorithmic approaches to using them in a diverse set of downstream tasks. Examples of such tasks covered in this thesis include transferring a trained model across languages for document classification, assisting monolingual lexical-semantics tasks like word sense induction, identifying asymmetric lexical relationships like hypernymy between words in different languages, and combining supervision across languages through a shared feature space for cross-lingual entity linking. In all these applications, the representations make information expressed in other languages available in English, while requiring minimal additional supervision in the language of interest.
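One common way to induce the kind of cross-lingual representations described above is to align two separately trained monolingual embedding spaces with an orthogonal (Procrustes) mapping learned from a small seed dictionary. The sketch below illustrates the idea on synthetic vectors; the toy data and dimensions are invented for illustration, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "English" embeddings for a seed dictionary of 5 word pairs.
X = rng.normal(size=(5, 4))

# The "foreign" embeddings are an exact rotation of X, so a perfect
# orthogonal map exists (real embedding spaces are only approximately
# rotations of each other).
Q_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = X @ Q_true

# Orthogonal Procrustes: W = argmin ||X W - Y||_F over orthogonal W,
# solved in closed form from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# After mapping, source vectors land on their translations.
print(np.allclose(X @ W, Y, atol=1e-8))  # True
```

With real data, `X` and `Y` would hold embeddings for the two sides of a bilingual seed lexicon, and the learned `W` maps the whole source vocabulary into the target space for nearest-neighbor transfer.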
Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches
Text classification of unseen classes is a challenging Natural Language Processing task and is mainly attempted using two different types of approaches. Similarity-based approaches attempt to classify instances based on similarities between text document representations and class description representations. Zero-shot text classification approaches aim to generalize knowledge gained from a training task by assigning appropriate labels of unknown classes to text documents. Although existing studies have already investigated individual approaches in these categories, the experiments in the literature do not provide a consistent comparison. This paper addresses this gap by conducting a systematic evaluation of different similarity-based and zero-shot approaches for text classification of unseen classes. Different state-of-the-art approaches are benchmarked on four text classification datasets, including a new dataset from the medical domain. Additionally, novel SimCSE- and SBERT-based baselines are proposed, as other baselines used in existing work yield weak classification results and are easily outperformed. Finally, the novel similarity-based Lbl2TransformerVec approach is presented, which outperforms previous state-of-the-art approaches in unsupervised text classification. Our experiments show that similarity-based approaches significantly outperform zero-shot approaches in most cases. Additionally, using SimCSE or SBERT embeddings instead of simpler text representations increases similarity-based classification results even further.
Comment: Accepted to the 6th International Conference on Natural Language Processing and Information Retrieval (NLPIR '22).
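The similarity-based scheme the paper evaluates can be sketched as follows: embed each document and each class description, then assign the class whose description is most similar. The paper uses SimCSE/SBERT sentence embeddings; in this minimal sketch, crude bag-of-words vectors stand in so the example stays dependency-free, and all texts and class names are invented.

```python
import numpy as np

def bow_vector(text, vocab):
    # Count occurrences of in-vocabulary tokens (a stand-in for a
    # sentence-embedding model such as SBERT or SimCSE).
    v = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    return v

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b / denom)

# Hypothetical class descriptions and documents, for illustration only.
class_descriptions = {
    "sports": "football match team goal score",
    "medicine": "patient doctor treatment hospital diagnosis",
}

vocab = {w: i for i, w in enumerate(
    sorted({w for d in class_descriptions.values() for w in d.split()}))}

def classify(doc):
    # Pick the class whose description embedding is closest to the document.
    dv = bow_vector(doc, vocab)
    return max(class_descriptions,
               key=lambda c: cosine(dv, bow_vector(class_descriptions[c], vocab)))

print(classify("the team scored a late goal"))       # sports
print(classify("the doctor discussed a treatment"))  # medicine
```

Swapping `bow_vector` for a real sentence encoder turns this into the similarity-based setup the paper benchmarks; no labeled examples of the target classes are needed, only the class descriptions.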
PersoNER: Persian named-entity recognition
Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by a population of over a hundred million people worldwide. We first present and provide ArmanPersoNERCorpus, the first manually annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages word embeddings and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving promising MUC7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.
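The pipeline described above pairs word representations with a max-margin sequence classifier. A minimal stand-in for that idea is a per-token linear tagger trained with perceptron updates (a simpler relative of max-margin training); the toy sentences, tags, and indicator features below are invented and merely illustrate the shape of such a tagger, not the actual PersoNER system.

```python
from collections import defaultdict

# Tiny invented training set: token sequences with BIO-style tags.
train = [
    (["Ali", "lives", "in", "Tehran"], ["B-PER", "O", "O", "B-LOC"]),
    (["Sara", "visited", "Shiraz"], ["B-PER", "O", "B-LOC"]),
]
tags = ["O", "B-PER", "B-LOC"]

def feats(words, i):
    # Indicator features standing in for the word embeddings the
    # real pipeline uses.
    return [f"w={words[i]}", f"cap={words[i][0].isupper()}", f"first={i == 0}"]

w = defaultdict(float)  # weights keyed by (tag, feature)

def score(tag, fs):
    return sum(w[(tag, f)] for f in fs)

# Perceptron updates: promote the gold tag's features, demote the
# wrongly predicted tag's features.
for _ in range(5):
    for words, gold in train:
        for i, g in enumerate(gold):
            fs = feats(words, i)
            pred = max(tags, key=lambda t: score(t, fs))
            if pred != g:
                for f in fs:
                    w[(g, f)] += 1.0
                    w[(pred, f)] -= 1.0

test_words = ["Ali", "visited", "Tehran"]
print([max(tags, key=lambda t: score(t, feats(test_words, i)))
       for i in range(len(test_words))])  # ['B-PER', 'O', 'B-LOC']
```

A true max-margin trainer would additionally enforce a margin between the gold and runner-up scores, and a sequence model would score whole tag sequences rather than tokens independently, but the feature-weight machinery is the same.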