PersoNER: Persian named-entity recognition
© 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent difficulty of training an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by a population of over a hundred million people worldwide. We first present and provide ArmanPersoNERCorpus, the first manually annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.
Unsupervised Learning of Discourse Structures using a Tree Autoencoder
Discourse information, as postulated by popular discourse theories, such as
RST and PDTB, has been shown to improve an increasing number of downstream NLP
tasks, showing positive effects and synergies of discourse with important
real-world applications. While methods for incorporating discourse become more
and more sophisticated, the growing need for robust and general discourse
structures has not been sufficiently met by current discourse parsers, usually
trained on small scale datasets in a strictly limited number of domains. This
makes the prediction for arbitrary tasks noisy and unreliable. The overall
resulting lack of high-quality, high-quantity discourse trees poses a severe
limitation to further progress. In order to alleviate this shortcoming, we
propose a new strategy to generate tree structures in a task-agnostic,
unsupervised fashion by extending a latent tree induction framework with an
auto-encoding objective. The proposed approach can be applied to any
tree-structured objective, such as syntactic parsing, discourse parsing and
others. However, due to the especially difficult annotation process to generate
discourse trees, we initially develop a method to generate larger and more
diverse discourse treebanks. In this paper, we infer general tree
structures of natural text in multiple domains, showing promising results on a
diverse set of tasks.
Comment: Accepted to AAAI 2021, 7 pages
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.
Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201