Search CORE

1,330 research outputs found

PersoNER: Persian named-entity recognition

Author: Abdous M
Borzeshi EZ
Piccardi M
Poostchi H
Publication date: 01/01/2016

© 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the resulting difficulty of training an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by over a hundred million people worldwide. We first present and provide ArmanPersoNERCorpus, the first manually annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.
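
As an illustration of the CoNLL-style score mentioned in the abstract, the following is a minimal sketch of exact-match, entity-level precision, recall and F1 over BIO tag sequences. It is not the authors' evaluation code: the function names, the tag labels and the treatment of stray `I-` tags (ignored here) are assumptions made for the example, and MUC-style scoring, which gives partial credit, is not covered.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans, start, kind = set(), None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and kind == tag[2:]
        if continues:
            continue
        if start is not None:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
        # Simplification (assumption): an I- tag with no matching open
        # entity is ignored rather than treated as the start of an entity.
    return spans


def entity_f1(gold, pred):
    """Exact-match entity-level precision, recall and F1."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical four-token sentence: the person span matches,
    # the last entity has the right boundary but the wrong type.
    gold = ["B-PER", "I-PER", "O", "B-LOC"]
    pred = ["B-PER", "I-PER", "O", "B-ORG"]
    print(entity_f1(gold, pred))
```

Under exact matching, an entity counts as correct only when both its boundaries and its type agree with the gold annotation, which is why the mistyped last entity in the example earns no credit.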

OPUS - University of Technology Sydney

Trends in Usage-Based and Pragmatic Language Processing and Learning: A Bibliometric Analysis on Psycholinguistics and Second-Language Acquisition Studies

Author: Jiang Xiaoming
Publication venue: 'IntechOpen'
Publication date: 26/04/2020

This chapter provides bibliometric analyses of novel trends in research on the pragmatic aspects of language processing and learning in psycholinguistics and second-language acquisition studies. Growing interest in the relevant themes is shown through analysis of keyword co-occurrence across the literature and of bibliographic coupling between publications. The emergence of novel experimental methodologies, including the application of neuroimaging and machine learning to psycholinguistic research, provides new opportunities to examine the pragmatic aspects of language acquisition and invites new empirical research to validate theories and extend the boundaries of second-language acquisition research in real-world settings.

IntechOpen

Crossref

Graph Neural Networks for Natural Language Processing: A Survey

Author: Chen Yu
Gao Hanning
Guo Xiaojie
Li Shucheng
Long Bo
Pei Jian
Shen Kai
Wu Lingfei
Publication date: 10/06/2021

Deep learning has become the dominant approach in coping with various tasks in Natural Language Processing (NLP). Although text inputs are typically represented as a sequence of tokens, there is a rich variety of NLP problems that can be best expressed with a graph structure. As a result, there is a surge of interests in developing new deep learning techniques on graphs for a large number of NLP tasks. In this survey, we present a comprehensive overview on Graph Neural Networks (GNNs) for Natural Language Processing. We propose a new taxonomy of GNNs for NLP, which systematically organizes existing research of GNNs for NLP along three axes: graph construction, graph representation learning, and graph based encoder-decoder models. We further introduce a large number of NLP applications that are exploiting the power of GNNs and summarize the corresponding benchmark datasets, evaluation metrics, and open-source codes. Finally, we discuss various outstanding challenges for making the full use of GNNs for NLP as well as future research directions. To the best of our knowledge, this is the first comprehensive overview of Graph Neural Networks for Natural Language Processing. Comment: 127 pages.

arXiv.org e-Print Archive

Representation Learning for Natural Language Processing

Author: Lin Yankai
Liu Zhiyuan
Sun Maosong
Publication venue: 'Springer Science and Business Media LLC'

This open access book provides an overview of the recent advances in representation learning theory, algorithms and applications for natural language processing (NLP). It is divided into three parts. Part I presents the representation learning techniques for multiple language entries, including words, phrases, sentences and documents. Part II then introduces the representation techniques for those objects that are closely related to NLP, including entity-based world knowledge, sememe-based linguistic knowledge, networks, and cross-modal entries. Lastly, Part III provides open resource tools for representation learning techniques, and discusses the remaining challenges and future research directions. The theories and algorithms of representation learning presented can also benefit other related domains such as machine learning, social network analysis, semantic Web, information retrieval, data mining and computational biology. This book is intended for advanced undergraduate and graduate students, post-doctoral fellows, researchers, lecturers, and industrial engineers, as well as anyone interested in representation learning and natural language processing.

OAPEN Library

A ciência da leitura e a produção acadêmica: caminhos trilhados [The science of reading and academic production: paths taken]

Author: Giraldello Ademir Paulo
Publication venue: UFFS
Publication date: 01/01/2017

Linguistics addresses the many different phenomena of language. Among its macrolinguistic domains is psycholinguistics, a subfield that investigates how verbally coded messages are encoded and decoded. One of its most influential areas of activity is reading, which is among the most complex information-processing tasks: it begins with grapheme decoding and ends with text comprehension. Reading is assessed through several large-scale exams and tests, such as Pisa, Saeb (Aneb and Anresc/Prova Brasil), and ENEM. The indicators from these instruments yield alarming statistics: Brazilians show low levels of reading comprehension and a marked rate of functional illiteracy. This study therefore aimed to investigate what scientific communication has shared in terms of knowledge about reading. Specifically, the objective was to synthesize, from the psycholinguistic approach to reading research, the studies on the most recurrent theme in the reading field as evidenced in electronic communication, in order to investigate the dimensions and limitations of knowledge on this subject. To this end, the WebQualis system was used to select Qualis A1 and A2 scientific journals published in electronic format, with focuses or scopes related to reading, from the areas of (1) Language Arts/Linguistics, (2) Psychology, and (3) Education. Through the Capes Journals Portal, all volumes and issues of the selected journals from 2011 to 2015 were analyzed in order to map scientific articles on reading. The abstracts of the mapped articles revealed the most recurrent themes in scientific production on reading. Finally, the results of the full articles on the most recurrent theme were integrated, synthesized, and reflected upon. A critical-reflexive assessment of the data yielded relevant findings.
First, reading was found to have gained a stable and growing space in electronic communication, and contributions from Psychology were shown to have a strong influence on reading and comprehension research. Second, comprehension proved to be the most frequent theme in electronic publications. Finally, the synthesis showed that facets of comprehension are increasingly investigated empirically and directly in relation to the neurobiological bases of reading. In addition, several studies propose methods for teaching reading and strategies for improving comprehension, including the use of information and communication technologies (ICTs). It was also found that many research results are limited, because comprehension involves several components (cognitive processes and skills), while studies often focus on only one component or another and adopt specific methodological designs that vary considerably. Regarding the assessment of reading, many of the tasks in the studies' methodological apparatus evaluate only the product of comprehension and not its process; that is, they evaluate the mental representations constructed rather than how the text was encoded in the reader's mind. In short, both the advancement of research in the field of comprehension and several limitations were evident.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidade Federal da Fronteira Sul

A Survey on Semantic Processing Techniques

Author: Cambria Erik
Chen Guanyi
He Kai
Mao Rui
Ni Jinjie
Yang Zonglin
Zhang Xulang
Publication date: 22/10/2023

Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be greatly improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missing in the published version due to the publication policies. Please contact Prof. Erik Cambria for details.

arXiv.org e-Print Archive

GLM-130B: An Open Bilingual Pre-trained Model

Author: Chen Wenguang
Ding Ming
Dong Yuxiao
Du Zhengxiao
Lai Hanyu
Liu Xiao
Ma Zixuan
Tam Weng Lam
Tang Jie
Wang Zihan
Xia Xiao
Xu Yifan
Xue Yufei
Yang Zhuoyi
Zeng Aohan
Zhai Jidong
Zhang Peng
Zheng Wendi
Publication date: 25/10/2023

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/. Comment: Accepted to ICLR 202
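
The hardware claim in this abstract can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not taken from the GLM-130B code base: it assumes weights dominate memory, counts only the parameters at a given bit width, uses decimal gigabytes, and ignores activations, caches and framework overhead. The function and dictionary names are hypothetical.

```python
PARAMS = 130_000_000_000  # parameter count stated in the abstract


def weight_gb(bits_per_param: int) -> float:
    """Size of the weights alone, in decimal gigabytes (assumption: no overhead)."""
    return PARAMS * bits_per_param / 8 / 1e9


# GPU setups named in the abstract: (number of cards, memory per card in GB).
SETUPS = {"4x RTX 3090": (4, 24), "8x RTX 2080 Ti": (8, 11)}


def fits(bits_per_param: int, setup: str) -> bool:
    """True if the weights alone fit in the pooled memory of the setup."""
    count, per_card = SETUPS[setup]
    return weight_gb(bits_per_param) <= count * per_card


if __name__ == "__main__":
    for bits in (16, 8, 4):
        verdicts = {name: fits(bits, name) for name in SETUPS}
        print(f"{bits:>2}-bit weights: {weight_gb(bits):.0f} GB, fits: {verdicts}")
```

Under these assumptions the 4-bit weights take about 65 GB, which is below the 96 GB and 88 GB pooled by the two setups, whereas 8-bit (130 GB) and 16-bit (260 GB) weights do not fit on either.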

arXiv.org e-Print Archive