7 research outputs found
Homograph Disambiguation Through Selective Diacritic Restoration
Lexical ambiguity, a challenging phenomenon in all natural languages, is
particularly prevalent for languages with diacritics that tend to be omitted in
writing, such as Arabic. Omitting diacritics leads to an increase in the number
of homographs: different words with the same spelling. Diacritic restoration
could theoretically help disambiguate these words, but in practice, the
increase in overall sparsity leads to performance degradation in NLP
applications. In this paper, we propose approaches for automatically marking a
subset of words for diacritic restoration, which leads to selective homograph
disambiguation. Compared to full or no diacritic restoration, these approaches
yield selectively-diacritized datasets that balance sparsity and lexical
disambiguation. We evaluate the various selection strategies extrinsically on
several downstream applications: neural machine translation, part-of-speech
tagging, and semantic textual similarity. Our experiments on Arabic show
promising results, where our selective diacritization strategies lead to more
balanced and consistent performance across downstream applications.
Comment: accepted in WANLP 201
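The core idea of selective diacritization can be illustrated with a toy sketch. This is a hypothetical, ambiguity-based selection strategy of my own construction, not necessarily one of the paper's devised strategies: mark a word for restoration only when its undiacritized form maps to more than one diacritized form in the vocabulary. The function names and the toy `strip_diacritics` callback are illustrative assumptions.

```python
from collections import defaultdict

def find_homographs(diacritized_vocab, strip_diacritics):
    # Group diacritized forms by their bare (undiacritized) spelling.
    variants = defaultdict(set)
    for word in diacritized_vocab:
        variants[strip_diacritics(word)].add(word)
    # A bare form with more than one diacritized variant is a homograph.
    return {bare for bare, forms in variants.items() if len(forms) > 1}

def selectively_diacritize(tokens, homographs, strip_diacritics):
    # Keep diacritics only where the bare form is ambiguous; strip elsewhere.
    return [t if strip_diacritics(t) in homographs else strip_diacritics(t)
            for t in tokens]
```

As a toy usage, with `strip = lambda w: w.replace("'", "")` standing in for real Arabic diacritic stripping, the vocabulary `["ki'tab", "kita'b", "qalam"]` yields the single homograph `"kitab"`, so `"ki'tab"` keeps its mark while `"qalam"` is stripped. This balances sparsity (most tokens undiacritized) against lexical disambiguation (ambiguous tokens keep their diacritics).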
Efficient Convolutional Neural Networks for Diacritic Restoration
Diacritic restoration has gained importance with the growing need for
machines to understand written texts. The task is typically modeled as a
sequence labeling problem and currently Bidirectional Long Short Term Memory
(BiLSTM) models provide state-of-the-art results. Recently, Bai et al. (2018)
show the advantages of Temporal Convolutional Neural Networks (TCN) over
Recurrent Neural Networks (RNN) for sequence modeling in terms of performance
and computational resources. As diacritic restoration benefits from both
previous as well as subsequent timesteps, we further apply and evaluate a
variant of TCN, Acausal TCN (A-TCN), which incorporates context from both
directions (previous and future) rather than strictly incorporating previous
context as in the case of TCN. A-TCN yields significant improvement over TCN
for diacritization in three different languages: Arabic, Yoruba, and
Vietnamese. Furthermore, A-TCN and BiLSTM achieve comparable performance, making
A-TCN an efficient alternative to BiLSTM, since convolutions can be trained in
parallel. A-TCN is also significantly faster than BiLSTM at inference time
(a 270%-334% improvement in the amount of text diacritized per minute).
Comment: accepted in EMNLP 201
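The causal/acausal distinction above can be shown with a minimal NumPy sketch: a causal convolution pads only on the left, so each output depends on the current and previous timesteps, while an acausal one pads on both sides, so each output also sees future timesteps. This is a single-channel, single-filter illustration with zero padding, not the paper's actual A-TCN implementation (which stacks dilated convolutional layers).

```python
import numpy as np

def causal_conv1d(x, w):
    # Causal: pad only on the left, so output[t] sees x[t-k+1 .. t].
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])
    return np.array([xp[t:t + k] @ w for t in range(len(x))])

def acausal_conv1d(x, w):
    # Acausal: pad on both sides, so output[t] also sees future timesteps,
    # giving the bidirectional context that sequence labeling tasks such as
    # diacritic restoration benefit from.
    k = len(w)
    left = (k - 1) // 2
    xp = np.concatenate([np.zeros(left), x, np.zeros(k - 1 - left)])
    return np.array([xp[t:t + k] @ w for t in range(len(x))])
```

Unlike a BiLSTM, where each timestep's hidden state depends on the previous one, every output position here is an independent dot product, which is why convolutions parallelize across the sequence at both training and inference time.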
Neural Arabic Text Diacritization: State of the Art Results and a Novel Approach for Machine Translation
In this work, we present several deep learning models for the automatic
diacritization of Arabic text. Our models are built using two main approaches,
viz. Feed-Forward Neural Network (FFNN) and Recurrent Neural Network (RNN),
with several enhancements such as 100-hot encoding, embeddings, Conditional
Random Field (CRF) and Block-Normalized Gradient (BNG). The models are tested
on the only freely available benchmark dataset, and the results show that our
models match or outperform other models which, unlike ours, require
language-dependent post-processing steps. Moreover, we show that
diacritics in Arabic can be used to enhance the models of NLP tasks such as
Machine Translation (MT) by proposing the Translation over Diacritization (ToD)
approach.
Comment: 18 pages, 17 figures, 14 tables
Ethno-Religious Conflict in Northern Nigeria: The Latency of Episodic Genocide
This dissertation explores the ethnic and religious dimensions of the northern Nigeria conflict, in which gruesome killings have intermittently occurred, to determine whether there are genocidal inclinations to the episodic killings. The literature review provides the contextual framework for examining the conflict parties and causation factors to address the research questions: Are there genocidal inclinations to the ethno-religious conflict in northern Nigeria? To what extent does the interplay between ethnicity and religion help to foment and escalate the conflict in northern Nigeria? The study employs a mixed content analysis and grounded theory methodology based on the Strauss and Corbin (1990) approach. Data were sourced from 197 newspaper articles on the conflict over the study period, spanning from the 1966 northern Nigeria massacres of thousands of Ibos to the present, ongoing killings between Muslims and Christians or non-Muslims in the region. Available texts of the conflict cases over the research period were content-analyzed using NVivo qualitative data analysis software, involving categorizing, coding, and evaluating the textual themes. The study constructs a theoretical model for determining proclivity to genocide, and finds that there are genocidal inclinations to the northern Nigeria conflict, involving the specific intent to ‘cleanse’ the north through an exclusionary ideology of imposing Sharia law, via enforced assimilation or the extermination of Christians and other non-Muslims who do not assimilate or adopt the Muslim ideology. The study also suggests that recognition of these genocidal manifestations is delayed due to their episodic and intermittent nature. The study provides further understanding of the factors underlying and sustaining the violent conflict between Muslims and Christians in northern Nigeria.
It contributes new perspectives and a theoretical model for determining genocidal proclivity to the field of conflict analysis and resolution, and proffers alternative strategies for relationship building and peaceful coexistence among different religious groups. The findings will guide recommendations on policy formulation for eliminating religious intolerance in northern Nigeria. The study also raises awareness of the need for global intervention in the region's sporadic killings, to avert a full-blown, Rwandan-type genocide in Nigeria.
Dicionário de Biblioteconomia e Arquivologia
The aim of this dictionary is to define, clearly, succinctly, and simply, the terms used by librarians, archivists, and other professionals in the broad and multifaceted field of information science, helping them expand their knowledge. The basic criterion for including a term was its potential use in the professional practice of these specialists. Many entries include illustrative quotations drawn from the technical-scientific literature and from general and specialized lexicons. The systematic compilation of terminology is vital to the development of any technical-scientific field, since clarity and precision cannot be achieved without uniformity in the language used by the field's practitioners.
Broad in scope, with more than four thousand entries, the dictionary covers not only the terminology of the various specializations within librarianship, archival science, documentation, and information studies, but also the principal terms of copyright, publishing, the book trade, graphic arts, book history, bibliography, scholarly communication, telecommunications, and computing. It will therefore serve librarians, archivists, publishers, booksellers, students, researchers, and other professionals who work in the collection, storage, processing, retrieval, and dissemination of information, whether in traditional print or electronic form.
It will also help meet the needs of scholars who require the corresponding technical terminology in English.