A Deterministic Finite-State Morphological Analyzer for Urdu Nominal System
A morphological analyzer is a computational tool that combines lemmas with other linguistic features to produce new lexical word forms. This paper investigates the processing of the nominal system of the Urdu language. It focuses on the inflection of noun forms and studies number, gender, person, and case representations, using a finite-state machine (FSM) to analyze and generate all possible forms of the standardized registers. The analysis with this tool produces and displays all possible structures and their declensions. This study attaches all the necessary features and values to the lexically concatenated nouns according to their patterns. The accuracy score of the output is 92.7, where the actual output depends on the detailed design of the FSM and the specific morphological processes supplied to the finite-state tools.
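The finite-state approach can be sketched with the marked masculine paradigm of Urdu nouns (romanized here for readability). The stem `lark-` ('boy') and the suffix table below are illustrative simplifications, not the paper's actual machine:

```python
# A minimal finite-state sketch of Urdu-style nominal inflection.
# Illustrative paradigm (marked masculine nouns, romanized):
#   stem + a  -> singular nominative          (larka)
#   stem + e  -> singular oblique / pl. nom.  (larke)
#   stem + on -> plural oblique               (larkon)

SUFFIXES = {
    ("sg", "nom"): "a",
    ("sg", "obl"): "e",
    ("pl", "nom"): "e",
    ("pl", "obl"): "on",
}

def generate(stem):
    """Run the FSM forward: produce every (number, case) -> surface form."""
    return {(num, case): stem + suf for (num, case), suf in SUFFIXES.items()}

def analyze(form, stems):
    """Invert the FSM: map a surface form to (stem, number, case) analyses.
    A single form may receive several analyses (e.g. 'larke' is both
    singular oblique and plural nominative)."""
    out = []
    for stem in stems:
        for (num, case), suf in SUFFIXES.items():
            if form == stem + suf:
                out.append((stem, num, case))
    return out
```

Because generation and analysis are inverses of the same transition table, the analyzer's coverage is exactly the set of forms the FSM can generate, which is the property the paper exploits.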
Hard Non-Monotonic Attention for Character-Level Transduction
Character-level string-to-string transduction is an important component of
various NLP tasks. The goal is to map an input string to an output string,
where the strings may be of different lengths and have characters taken from
different alphabets. Recent approaches have used sequence-to-sequence models
with an attention mechanism to learn which parts of the input string the model
should focus on during the generation of the output string. Both soft attention
and hard monotonic attention have been used, but hard non-monotonic attention
has only been used in other sequence modeling tasks such as image captioning
and has required a stochastic approximation to compute the gradient. In this
work, we introduce an exact, polynomial-time algorithm for marginalizing over
the exponential number of non-monotonic alignments between two strings, showing
that hard attention models can be viewed as neural reparameterizations of the
classical IBM Model 1. We compare soft and hard non-monotonic attention
experimentally and find that the exact algorithm significantly improves
performance over the stochastic approximation and outperforms soft attention.
Comment: Published in EMNLP 2018.
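The polynomial-time marginalization rests on a Model-1-style independence assumption: each target character aligns to one source character, uniformly and independently of the other positions, so the sum over exponentially many non-monotonic alignments factorizes per target position. A minimal sketch, with `log_emit` standing in for an assumed (e.g. neural) character-emission scorer, plus a brute-force enumeration to check it:

```python
import itertools
import math

def model1_log_marginal(src, tgt, log_emit):
    """Exact log p(tgt | src) under IBM-Model-1-style independence.
    The sum over |src|**|tgt| alignments factorizes into a product
    over target positions, giving polynomial time."""
    total = 0.0
    for t in tgt:
        # Marginalize this position's alignment over every source character.
        total += math.log(sum(math.exp(log_emit(t, s)) for s in src) / len(src))
    return total

def brute_force_log_marginal(src, tgt, log_emit):
    """The same quantity by explicit enumeration of all alignments;
    feasible only for tiny strings, used here as a sanity check on
    the factorized computation."""
    total = 0.0
    for align in itertools.product(range(len(src)), repeat=len(tgt)):
        p = 1.0 / len(src) ** len(tgt)
        for j, i in enumerate(align):
            p *= math.exp(log_emit(tgt[j], src[i]))
        total += p
    return math.log(total)
```

The hard-attention models of the paper replace the fixed emission table of classical Model 1 with learned neural scores, but the marginalization itself has this same factorized form.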
Graphemic Normalization of the Perso-Arabic Script
Since its original appearance in 1991, the Perso-Arabic script representation
in Unicode has grown from 169 to over 440 atomic isolated characters spread
over several code pages representing standard letters, various diacritics and
punctuation for the original Arabic and numerous other regional orthographic
traditions. This paper documents the challenges that Perso-Arabic presents
beyond the best-documented languages, such as Arabic and Persian, building on
earlier work by the expert community. We particularly focus on the situation in
natural language processing (NLP), which is affected by multiple, often
neglected, issues such as the use of visually ambiguous yet canonically
nonequivalent letters and the mixing of letters from different orthographies.
Among the contributing conflating factors are the lack of input methods, the
instability of modern orthographies, insufficient literacy, and loss or lack of
orthographic tradition. We evaluate the effects of script normalization on
eight languages from diverse language families in the Perso-Arabic script
diaspora on machine translation and statistical language modeling tasks. Our
results indicate statistically significant improvements in performance in most
conditions for all the languages considered when normalization is applied. We
argue that better understanding and representation of Perso-Arabic script
variation within regional orthographic traditions, where those are present, is
crucial for further progress of modern computational NLP techniques especially
for languages with a paucity of resources.
Comment: Pre-print to appear in the Proceedings of Grapholinguistics in the 21st Century (G21C), 2022. Telecom Paris, Palaiseau, France, June 8-10, 2022. 41 pages, 38 tables, 3 figures.
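A core instance of the problem is that some Perso-Arabic letters are visually near-identical but canonically distinct code points, e.g. ARABIC LETTER YEH (U+064A) vs. FARSI YEH (U+06CC) and ARABIC LETTER KAF (U+0643) vs. KEHEH (U+06A9). A minimal normalization sketch, mapping toward the Persian/Urdu forms as one illustrative convention (the paper's actual mappings are per-language and more extensive):

```python
# Map visually ambiguous but canonically nonequivalent code points
# to a single preferred form. Direction of the mapping (toward the
# Persian/Urdu letters) is an illustrative choice, not a standard.
CONFUSABLES = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

TABLE = str.maketrans(CONFUSABLES)

def normalize(text: str) -> str:
    """Apply the code-point mapping to every character in the text."""
    return text.translate(TABLE)
```

Applying such a mapping before training collapses spuriously distinct token types, which is the mechanism behind the reported gains in translation and language modeling.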
Identifying Urdu Complex Predication via Bigram Extraction
A problem that crops up repeatedly in shallow and deep syntactic parsing approaches to South Asian languages like Urdu/Hindi is the proper treatment of complex predications. Problems for the NLP of complex predications are posed by their productivity and the ill-understood nature of the range of their combinatorial possibilities. This paper presents an investigation into whether fine-grained information about the distributional properties of nouns in N+V CPs can be identified by the comparatively simple process of extracting bigrams from a large "raw" corpus of Urdu. In gathering the relevant properties, we were aided by visual analytics: we coupled our computational data analysis with interactive visual components in the analysis of the large data sets. The visualization component proved to be an essential part of our data analysis, particularly for the easy visual identification of outliers and false positives. Another essential component turned out to be our language-particular knowledge and access to existing language-particular resources. Overall, we were indeed able to identify high-frequency N+V complex predications as well as pick out combinations we had not been aware of before. However, a manual inspection of our results also pointed to a problem of data sparsity, despite the use of a large corpus.
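The bigram-extraction step can be sketched as counting adjacent token pairs and ranking them by an association measure such as pointwise mutual information (PMI), so that collocations like N+V complex-predicate candidates rise above chance co-occurrences. The toy tokens below (e.g. `yaad kar` 'remember') are illustrative; the measure and thresholds are assumptions, not the paper's exact pipeline:

```python
import math
from collections import Counter

def extract_bigrams(tokens, min_count=1):
    """Count adjacent bigrams in a token stream and score each by PMI.
    Returns {(a, b): (count, pmi)} for bigrams seen >= min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)       # unigram sample size
    n_bi = len(tokens) - 1    # bigram sample size
    scored = {}
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue
        # PMI: log of the observed bigram probability over the product
        # of the two unigram probabilities (positive = attraction).
        pmi = math.log((c / n_bi) /
                       ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        scored[(a, b)] = (c, pmi)
    return scored
```

On a real corpus, the resulting (count, PMI) pairs are exactly the kind of distributional data the paper feeds into its interactive visual-analytics component for manual inspection.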
A probabilistic model of Ancient Egyptian writing
This article offers a formalization of how signs form words in Ancient Egyptian writing, for either hieroglyphic or hieratic texts. The formalization is in terms of a sequence of sign functions, which concurrently produce a sequence of signs and a sequence of phonemes. By involving a class of probabilistic automata, we can define the most likely sequence of sign functions that relates a given sequence of signs to a given sequence of phonemes. Experiments with two texts are discussed.
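Finding the most likely sequence of sign functions is a Viterbi-style dynamic program over pairs of positions in the sign and phoneme sequences: each function consumes one sign and emits zero or more phonemes (as phonograms do, while determinatives emit none). The sketch below assumes a toy inventory of weighted sign functions; the sign names, emissions, and probabilities are invented for illustration, not the paper's model:

```python
import math

def best_parse(signs, phonemes, functions):
    """Most probable sequence of sign functions jointly producing the
    given sign sequence and phoneme sequence.
    functions: dict sign -> list of (emitted_phonemes, log_prob).
    Returns (function_sequence, total_log_prob) or None if no parse."""
    NEG = float("-inf")
    n, m = len(signs), len(phonemes)
    # best[i][j] = best log prob covering signs[:i] and phonemes[:j]
    best = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n):
        for j in range(m + 1):
            if best[i][j] == NEG:
                continue
            for ph, lp in functions.get(signs[i], []):
                k = j + len(ph)
                if k <= m and tuple(phonemes[j:k]) == tuple(ph):
                    if best[i][j] + lp > best[i + 1][k]:
                        best[i + 1][k] = best[i][j] + lp
                        back[i + 1][k] = (j, (signs[i], ph))
    if best[n][m] == NEG:
        return None
    # Walk the backpointers to recover the function sequence.
    seq, i, j = [], n, m
    while i > 0:
        prev_j, step = back[i][j]
        seq.append(step)
        j = prev_j
        i -= 1
    return list(reversed(seq)), best[n][m]
```

Ambiguity is the point of the model: the same sign can act as a phonogram in one word and a determinative in the next, and the automaton's probabilities decide which reading wins.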