PersoNER: Persian named-entity recognition
© 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent difficulty of training an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by a population of over a hundred million people worldwide. We first present and provide ArmanPersoNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.
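For readers unfamiliar with how CoNLL-style NER scores are computed, the following is a minimal sketch of exact-match, entity-level F1 over BIO tag sequences. It is a generic illustration, not the authors' scorer: the function names and the toy tag sequences are assumptions, and MUC7 scoring (which credits entity type and text span separately) is not covered.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans found in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        # "B-" always opens a span; a stray "I-" is treated leniently as an opener.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1: a prediction counts only if span and type match."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the predictor finds the person but misses the location.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
score = entity_f1(gold, pred)
```

Here precision is 1 and recall is 1/2, so the F1 score is 2/3.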
Trends in Usage-Based and Pragmatic Language Processing and Learning: A Bibliometric Analysis on Psycholinguistics and Second-Language Acquisition Studies
This chapter provides bibliometric analyses of novel trends in research on pragmatic aspects of language processing and learning in psycholinguistics and second-language acquisition studies. Growing interest in the relevant themes is shown through analysis of the co-occurrence of keywords in a common literature and the bibliographic coupling between literatures. The emergence of novel experimental methodologies, including the application of neuroimaging and machine learning approaches to psycholinguistic research, provides new opportunities for looking into the pragmatic aspects of language acquisition and invites new empirical research to validate the theories and extend the boundaries of second-language acquisition research in real-world settings.
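The two bibliometric measures named in this abstract can be sketched in a few lines. The snippet below is an illustrative assumption rather than the chapter's actual tooling: keyword co-occurrence counts how many papers list a given pair of keywords together, and bibliographic coupling strength is taken as the number of references two papers share. All keywords and reference identifiers are made up.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(keyword_lists):
    """Count, for every keyword pair, the number of papers listing both keywords."""
    counts = Counter()
    for keywords in keyword_lists:
        # Sort so each pair has one canonical order; set() ignores repeats in a paper.
        for first, second in combinations(sorted(set(keywords)), 2):
            counts[(first, second)] += 1
    return counts


def coupling_strength(refs_a, refs_b):
    """Bibliographic coupling strength: number of references cited by both papers."""
    return len(set(refs_a) & set(refs_b))


# Hypothetical keyword lists for three papers.
papers = [
    ["pragmatics", "l2 acquisition", "fmri"],
    ["pragmatics", "l2 acquisition"],
    ["fmri", "machine learning"],
]
pairs = keyword_cooccurrence(papers)
shared = coupling_strength(["r1", "r2", "r3"], ["r2", "r3", "r4"])
```

In this toy data the pair ("l2 acquisition", "pragmatics") co-occurs in two papers, and the two reference lists are coupled through two shared references.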
Graph Neural Networks for Natural Language Processing: A Survey
Deep learning has become the dominant approach in coping with various tasks
in Natural Language Processing (NLP). Although text inputs are typically
represented as a sequence of tokens, there is a rich variety of NLP problems
that can be best expressed with a graph structure. As a result, there is a surge
of interests in developing new deep learning techniques on graphs for a large
number of NLP tasks. In this survey, we present a comprehensive overview on Graph
Neural Networks (GNNs) for Natural Language Processing. We propose a new
taxonomy of GNNs for NLP, which systematically organizes existing research of
GNNs for NLP along three axes: graph construction, graph representation
learning, and graph based encoder-decoder models. We further introduce a large
number of NLP applications that are exploiting the power of GNNs and summarize
the corresponding benchmark datasets, evaluation metrics, and open-source codes.
Finally, we discuss various outstanding challenges for making the full use of
GNNs for NLP as well as future research directions. To the best of our
knowledge, this is the first comprehensive overview of Graph Neural Networks for
Natural Language Processing.
Comment: 127 pages
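As a rough illustration of the graph representation learning axis, the sketch below performs one round of mean-aggregation message passing over a small token graph. It is a deliberately simplified assumption: real GNN layers add learned weight matrices and nonlinearities, and the three-token graph and its feature vectors here are hypothetical.

```python
def mean_message_passing(features, edges):
    """One round of message passing: each node's new vector is the mean of its
    own vector and the vectors of its neighbours (undirected edges, self-loops)."""
    n_nodes = len(features)
    neighbours = {i: {i} for i in range(n_nodes)}  # every node also sees itself
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    dim = len(features[0])
    updated = []
    for i in range(n_nodes):
        group = neighbours[i]
        updated.append(
            [sum(features[j][d] for j in group) / len(group) for d in range(dim)]
        )
    return updated


# Hypothetical three-token chain graph: token 0 - token 1 - token 2.
token_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_edges = [(0, 1), (1, 2)]
hidden = mean_message_passing(token_features, token_edges)
```

After one round, the middle token mixes information from both ends of the chain, while each end token only sees itself and the middle token; stacking further rounds would propagate information across longer paths.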
Representation Learning for Natural Language Processing
This open access book provides an overview of the recent advances in representation learning theory, algorithms and applications for natural language processing (NLP). It is divided into three parts. Part I presents the representation learning techniques for multiple language entries, including words, phrases, sentences and documents. Part II then introduces the representation techniques for those objects that are closely related to NLP, including entity-based world knowledge, sememe-based linguistic knowledge, networks, and cross-modal entries. Lastly, Part III provides open resource tools for representation learning techniques, and discusses the remaining challenges and future research directions. The theories and algorithms of representation learning presented can also benefit other related domains such as machine learning, social network analysis, semantic Web, information retrieval, data mining and computational biology. This book is intended for advanced undergraduate and graduate students, post-doctoral fellows, researchers, lecturers, and industrial engineers, as well as anyone interested in representation learning and natural language processing.
A ciência da leitura e a produção acadêmica: caminhos trilhados
Linguistics studies the many different phenomena of language. Among its macrolinguistic
areas is Psycholinguistics, a subfield that investigates how messages in verbal codes are
encoded and decoded. One of its influential fields of activity is therefore reading. Reading
is one of the most complex information-processing tasks: it begins with grapheme decoding
and ends with text comprehension. Reading is assessed through several exams
and large-scale tests, such as Pisa, Saeb (Aneb and Anresc/Prova Brasil), and ENEM. The
indicators from these evaluative instruments bring alarming statistics: among
Brazilians, reading comprehension levels are low and the functional illiteracy rate is marked.
This study therefore investigated what scientific communication has shared in terms of
knowledge about reading. Specifically, the objective was to synthesize, from the
psycholinguistic approach to reading research, the studies addressing the most recurrent
theme in the reading field as evidenced in electronic communication, in order to
investigate the dimensions and limitations of knowledge about this subject. To this end,
Qualis A1 and A2 scientific journals in electronic format and with
focuses/scopes related to reading were selected through the WebQualis system from the
areas of (1) Language Arts/Linguistics, (2) Psychology and (3) Education. All volumes and
issues of the selected journals from 2011 to 2015 were analyzed through the Capes
Journals Portal in order to map scientific articles related to reading. The abstracts of
the mapped articles revealed the most recurrent themes in the scientific production on
reading. Finally, the results of the full articles on the most recurrent theme were
integrated, synthesized, and reflected upon. A critical-reflexive assessment of the data
yielded relevant findings. First, on the one hand, reading has achieved a stable and
growing space in electronic communication; on the other,
contributions from Psychology strongly influence reading and comprehension research.
Second, the most frequent theme in electronic publications is
comprehension. Finally, the synthesis showed that facets of comprehension are
increasingly investigated empirically and directly in relation to the neurobiological
bases of reading. In addition, several studies propose methods for teaching reading as
well as strategies for improving comprehension, including the use of ICTs. Moreover,
many research results were found to be limited, because comprehension involves several
components (cognitive processes and skills), and studies often focus on only one or
another of them while adopting specific methodological designs that
vary considerably. Regarding the assessment of reading, many of the tasks in the
methodological apparatus evaluate only the product of comprehension and not its process.
In other words, the mental representations that were built are evaluated, not how the
text was encoded in the reader's mind. Therefore, in short, both advances in comprehension research and
several limitations were observed.
A Survey on Semantic Processing Techniques
Semantic processing is a fundamental research domain in computational
linguistics. In the era of powerful pre-trained language models and large
language models, the advancement of research in this domain appears to be
decelerating. However, the study of semantics is multi-dimensional in
linguistics. The research depth and breadth of computational semantic
processing can be largely improved with new technologies. In this survey, we
analyze five semantic processing tasks: word sense disambiguation,
anaphora resolution, named entity recognition, concept extraction, and
subjectivity detection. We study relevant theoretical research in these fields,
advanced methods, and downstream applications. We connect the surveyed tasks
with downstream applications because this may inspire future scholars to fuse
these low-level semantic processing tasks with high-level natural language
processing tasks. The review of theoretical research may also inspire new tasks
and technologies in the semantic processing domain. Finally, we compare the
different semantic processing techniques and summarize their technical trends,
application trends, and future directions.
Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN
1566-2535. The equal contribution mark is missing from the published version due
to the publication policies. Please contact Prof. Erik Cambria for details.
GLM-130B: An Open Bilingual Pre-trained Model
We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language
model with 130 billion parameters. It is an attempt to open-source a 100B-scale
model at least as good as GPT-3 (davinci) and unveil how models of such a scale
can be successfully pre-trained. Over the course of this effort, we face
numerous unexpected technical and engineering challenges, particularly on loss
spikes and divergence. In this paper, we introduce the training process of
GLM-130B including its design choices, training strategies for both efficiency
and stability, and engineering efforts. The resultant GLM-130B model
significantly outperforms GPT-3 175B (davinci) on a wide range of popular
English benchmarks, while the performance advantage is not observed in OPT-175B
and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN
3.0 260B -- the largest Chinese language model -- across related benchmarks.
Finally, we leverage a unique scaling property of GLM-130B to reach INT4
quantization without post training, with almost no performance loss, making it
the first among 100B-scale models and more importantly, allowing its effective
inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the
most affordable GPUs required for using 100B-scale models. The GLM-130B model
weights are publicly accessible and its code, training logs, related toolkit,
and lessons learned are open-sourced at
https://github.com/THUDM/GLM-130B/.
Comment: Accepted to ICLR 202
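To make the INT4 claim concrete, here is a back-of-the-envelope sketch under stated assumptions. The abstract does not describe the quantization scheme, so the code uses generic symmetric absmax quantization as a stand-in, not GLM-130B's actual method; the helper names and toy weights are hypothetical, and the memory estimate counts weights only (no activations, caches, or framework overhead).

```python
def quantize_int4(weights):
    """Symmetric absmax quantization of floats to integers in [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid a zero scale
    quantized = [max(-7, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the 4-bit integers back to approximate float weights."""
    return [q * scale for q in quantized]


def weight_memory_gb(n_params, bits_per_param):
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Toy weights (hypothetical) and their round trip through 4-bit integers.
weights = [0.7, -0.28, 0.0, 0.14]
quantized, scale = quantize_int4(weights)
restored = dequantize(quantized, scale)

# Weight-only footprint of a 130-billion-parameter model.
int4_gb = weight_memory_gb(130e9, 4)
fp16_gb = weight_memory_gb(130e9, 16)
```

Under these assumptions the INT4 weights take about 65 GB, compared with about 260 GB at 16 bits, which is consistent with the stated GPU budgets of 4 × 24 GB = 96 GB or 8 × 11 GB = 88 GB.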