Part of Speech Tagging of Marathi Text Using Trigram Method
In this paper we present a part-of-speech (POS) tagger for Marathi, a morphologically rich language spoken by the native people of Maharashtra. The tagger is developed with a statistical approach based on the trigram method. The main idea of the trigram method is to find the most likely POS tag for a token given the previous two tags, calculating probabilities to determine the best tag sequence. In this paper we describe the development of the tagger. Moreover, we also present the evaluation done
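The trigram idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy English corpus, the `<s>` padding tag, the add-one smoothing, and the greedy left-to-right decoding are all assumptions made for brevity (a full tagger would typically search for the best tag sequence, for example with Viterbi decoding).

```python
from collections import Counter

START = "<s>"  # hypothetical padding tag for sentence starts

def train(tagged_sentences):
    """Count tag trigrams, their two-tag contexts, and (tag, word) pairs."""
    trigrams, contexts, emissions, tag_counts = Counter(), Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev2, prev1 = START, START
        for word, tag in sentence:
            trigrams[(prev2, prev1, tag)] += 1
            contexts[(prev2, prev1)] += 1
            emissions[(tag, word)] += 1
            tag_counts[tag] += 1
            prev2, prev1 = prev1, tag
    return trigrams, contexts, emissions, tag_counts

def tag(words, model):
    """Greedy decoding: for each word pick the tag that maximises
    P(tag | previous two tags) * P(word | tag), with add-one smoothing."""
    trigrams, contexts, emissions, tag_counts = model
    tags = list(tag_counts)
    vocab_size = len({w for (_, w) in emissions}) + 1  # +1 for unseen words
    prev2, prev1 = START, START
    result = []
    for word in words:
        def score(t):
            transition = (trigrams[(prev2, prev1, t)] + 1) / (contexts[(prev2, prev1)] + len(tags))
            emission = (emissions[(t, word)] + 1) / (tag_counts[t] + vocab_size)
            return transition * emission
        best = max(tags, key=score)
        result.append(best)
        prev2, prev1 = prev1, best
    return result

# Toy corpus (an assumption, not data from the paper).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
model = train(corpus)
print(tag(["the", "dog", "sleeps"], model))
```

Greedy decoding commits to one tag per token as it goes, so it can miss the globally best sequence that the abstract refers to; it is used here only to keep the example short.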
Implementation of Rule Based Algorithm for Sandhi-Vicheda Of Compound Hindi Words
Sandhi means joining two or more words to coin a new word. Sandhi literally means 'putting together' or combining (of sounds). It denotes all combinatory sound changes effected (spontaneously) for ease of pronunciation. Sandhi-vicheda describes [5] the process by which one letter (whether single or conjoined) is broken to form two words: part of the broken letter remains as the last letter of the first word, and part forms the first letter of the next word. Sandhi-vicheda is an easy and interesting technique that can add a new dimension to the traditional approach to Hindi teaching. In this paper, using a rule-based algorithm, we report an accuracy of 60-80%, depending on the number of rules implemented
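As a rough illustration of a single splitting rule of this kind, the sketch below handles only the long-vowel case in which a medial "ा" results from joining a word ending in a/ā with a word beginning with a/ā (for example, हिमालय = हिम + आलय). The function name `split_dirgha` and the lexicon check are hypothetical; the paper's actual rule set is not given in the abstract.

```python
AA_MATRA = "\u093E"          # dependent vowel sign aa (ा)
A, AA = "\u0905", "\u0906"   # independent vowels a (अ) and aa (आ)

def split_dirgha(word, lexicon):
    """Return (first, second) pairs obtained by breaking a medial 'ा'.

    Part of the broken vowel stays at the end of the first word (inherent a
    or 'ा'); the rest becomes the initial vowel of the second word (अ or आ).
    A candidate is kept only if both halves appear in the given lexicon.
    """
    results = []
    for i, ch in enumerate(word):
        if ch != AA_MATRA:
            continue
        stem, rest = word[:i], word[i + 1:]
        for left in (stem, stem + AA_MATRA):
            for right in (A + rest, AA + rest):
                if left in lexicon and right in lexicon:
                    results.append((left, right))
    return results

# Hypothetical mini-lexicon for the two examples.
lexicon = {"हिम", "आलय", "विद्या"}
print(split_dirgha("हिमालय", lexicon))
print(split_dirgha("विद्यालय", lexicon))
```

Checking candidates against a lexicon is one simple way to discard impossible splits; a system with more rules would need some comparable filter, which may be one reason the reported accuracy varies with the number of rules.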
Ergative case, aspect and person splits: Two case studies
Ergativity splits between perfect and imperfective/progressive predicates are observed in languages with a specialized ergative case (Punjabi) and without it (Kurdish). Perfect predicates correspond to a VP projection; external arguments are introduced by means of an oblique case, namely an elementary part–whole predicate saying that the event is ‘included by’, ‘located at’ the argument. A more complex organization is found with imperfective/progressive predicates, where a head Asp projects a functional layer and introduces the external argument. Our proposal further yields the 1/2P vs. 3P Person split as a result of the intrinsic ability of 1/2P to serve as ‘location-of-event’
Beyond Arabic: Software for Perso-Arabic Script Manipulation
This paper presents an open-source software library that provides a set of
finite-state transducer (FST) components and corresponding utilities for
manipulating the writing systems of languages that use the Perso-Arabic script.
The operations include various levels of script normalization, including visual
invariance-preserving operations that subsume and go beyond the standard
Unicode normalization forms, as well as transformations that modify the visual
appearance of characters in accordance with the regional orthographies for
eleven contemporary languages from diverse language families. The library also
provides simple FST-based romanization and transliteration. We additionally
attempt to formalize the typology of Perso-Arabic characters by providing
one-to-many mappings from Unicode code points to the languages that use them.
While our work focuses on the Arabic script diaspora rather than Arabic itself,
this approach could be adopted for any language that uses the Arabic script,
thus providing a unified framework for treating a script family used by close
to a billion people.
Comment: Preprint to appear in the Proceedings of the 7th Arabic Natural
Language Processing Workshop (WANLP 2022) at EMNLP, Abu Dhabi, United Arab
Emirates, December 7-11, 2022. 7 page
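To make the two kinds of normalization mentioned in this abstract concrete, here is a small sketch using only Python's standard `unicodedata` module and a plain character mapping. It does not use the library the paper describes and is not FST-based; the two-entry Persian mapping and the function name `normalize_persian` are illustrative assumptions.

```python
import unicodedata

# Illustrative mapping only: Persian orthography conventionally uses
# KEHEH (U+06A9) and FARSI YEH (U+06CC) where Arabic uses
# KAF (U+0643) and YEH (U+064A).
PERSIAN_MAP = str.maketrans({"\u0643": "\u06A9", "\u064A": "\u06CC"})

def normalize_persian(text):
    """Apply standard NFC first, then the language-specific letter mapping."""
    return unicodedata.normalize("NFC", text).translate(PERSIAN_MAP)

# Standard Unicode normalization: ALEF (U+0627) + MADDAH ABOVE (U+0653)
# composes to ALEF WITH MADDA ABOVE (U+0622) under NFC.
decomposed = "\u0627\u0653"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))

# Language-specific step: Arabic KAF and YEH become their Persian counterparts.
print(normalize_persian("\u0643\u064A") == "\u06A9\u06CC")
```

NFC alone leaves KAF and YEH untouched, which is why a per-language step of this sort sits on top of the standard normalization forms.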
Language acquisition
This project investigates acquisition of a new language by example. Syntax induction has
been studied widely and the more complex syntax associated with Natural Language is
difficult to induce without restrictions. Chomsky conjectured that natural languages are
restricted by a Universal Grammar. English could be used as a Universal Grammar and
Punjabi derived from it in a similar way as the acquisition of a first language. However, if
English has already been acquired then Punjabi would be induced from English as a
second language. [Continues.]
Community languages in higher education : towards realising the potential
This study, Community Languages in Higher Education: Towards Realising the Potential, forms part of the Routes into Languages initiative funded by the Higher Education Funding Council in England (HEFCE) and the Department for Children, Schools and Families (DCSF). It sets out to map provision for community languages, defined as 'all languages in use in a society, other than the dominant, official or national language'. In England, where the dominant language is English, some 300 community languages are in use, the most widespread being Urdu, Cantonese, Punjabi, Bengali, Arabic, Turkish, Russian, Spanish, Portuguese, Gujerati, Hindi and Polish. The research was jointly conducted by the Scottish Centre for Information on Language Teaching and Research (Scottish CILT) at the University of Stirling, and the SOAS-UCL Centre for Excellence for Teaching and Learning 'Languages of the Wider World' (LWW CETL), between February 2007 and January 2008. The overall aim of this study was to map provision for community languages in higher education in England and to consider how it can be developed to meet emerging demand for more extensive provision