84 research outputs found
Clitic Climbing, Finiteness and the Raising-Control Distinction: A Corpus-based Study
In the paper, we discuss the phenomenon of clitic climbing out of finite da2-complements in contemporary Serbian. Scholars' opinions on the acceptability and occurrence of this construction, based on a handful of self-made examples, vary considerably. Expanding on the assumption that the correctness of the phenomenon has often been denied due to its rarity, we employ large corpora to examine the problem. We focus on possible constraints arising from the syntactic properties of clause-embedding predicates. Peer reviewed
Otvoreni resursi i tehnologije za obradu srpskog jezika [Open Resources and Technologies for Serbian Language Processing]
Open language resources and tools are very important for increasing the quality and speeding up the development of technologies for natural language processing. This paper presents a set of open resources available for processing the Serbian language. We describe several manually annotated corpora, as well as a range of computational models, including a web service designed to facilitate their use.
Corpus-Based Approaches to Igbo Diacritic Restoration
With natural language processing (NLP), researchers aim to get computers to identify and understand the patterns in human languages. This is often difficult because a language embeds many dynamic and varied properties in its syntax, pragmatics and phonology, which need to be captured and processed. The capacity of computers to process natural languages is increasing because NLP researchers are pushing its boundaries. But this research focuses more on well-resourced languages such as English, Japanese, German, French, Russian and Mandarin Chinese. Over 95% of the world's 7000 languages are low-resourced for NLP, i.e. they have little or no data, tools, and techniques for NLP work.
In this thesis, we present an overview of diacritic ambiguity and a review of previous diacritic disambiguation approaches on other languages. Focusing on the Igbo language, we report the steps taken to develop a flexible framework for generating datasets for diacritic restoration. Three main approaches were proposed: the standard n-gram models, the classification models and the embedding models. The standard n-gram models use a sequence of words preceding the target stripped word as key predictors of the correct variants. For the classification models, a window of words on both sides of the target stripped word was used. The embedding models compare the similarity scores of the combined context word embeddings and the embeddings of each of the candidate variant vectors.
The processes and techniques involved in projecting embeddings from a model trained with English texts to an Igbo embedding space and the creation of intrinsic evaluation tasks to validate the models are also discussed. A comparative analysis of the results indicates that all the approaches significantly improved on the baseline performance, which uses the unigram model. The details of the processes involved in building the models, as well as possible directions for future work, are discussed in this work.
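The unigram baseline and the n-gram idea summarised in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`strip_diacritics`, `train`, `restore`), the use of a single previous word as context, and the toy training tokens are all assumptions made for illustration, and the toy lines are not meant to be grammatical Igbo.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Remove combining marks (tone marks, dot-below) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(sentences):
    """Count diacritic variants per stripped word, with and without context."""
    unigrams = defaultdict(Counter)  # stripped word -> variant counts
    bigrams = defaultdict(Counter)   # (previous stripped word, stripped word) -> variant counts
    for sentence in sentences:
        prev = "<s>"  # hypothetical sentence-start marker
        for word in sentence.split():
            key = strip_diacritics(word)
            unigrams[key][word] += 1
            bigrams[(prev, key)][word] += 1
            prev = key
    return unigrams, bigrams


def restore(text, unigrams, bigrams):
    """Pick the most frequent variant given the previous word, else the unigram choice."""
    restored = []
    prev = "<s>"
    for key in text.split():
        candidates = bigrams.get((prev, key)) or unigrams.get(key)
        # Unseen words are left as they are.
        restored.append(candidates.most_common(1)[0][0] if candidates else key)
        prev = key
    return " ".join(restored)


# Toy training data (illustrative tokens only, not taken from the thesis).
toy_corpus = ["nwa ákwá", "nwa ákwá", "elu àkwà"]
unigrams, bigrams = train(toy_corpus)
```

With this toy data, `restore("elu akwa", unigrams, bigrams)` uses the previous word to choose the variant, while an unseen context such as `"ihe akwa"` falls back to the most frequent variant overall, which is the behaviour of the unigram baseline.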
Using automatic speech recognition to evaluate Arabic to English transliteration
Increased travel and international communication have led to an increased need for transliteration of Arabic proper names for people, places, technical terms and organisations. There are a variety of available Arabic to English transliteration systems such as Unicode, the Buckwalter Arabic transliteration, and ArabTeX. The transliteration tables have been developed and used by researchers for many years, but there have been only limited attempts to evaluate and compare different transliteration systems. This thesis investigates whether or not speech recognition technology could be used to evaluate different Arabic-English transliteration systems. In order to do so there were five main objectives: firstly, to investigate the possibility of using English speech recognition engines to recognize Arabic words; secondly, to establish the possibility of automatic transliteration of diacritised Arabic words for the purpose of creating a vocabulary for the speech recognition engine; thirdly, to explore the possibility of automatically generating transliterations of non-diacritised Arabic words; fourthly, to construct a general method to compare and evaluate different transliterations; and finally, to test the system and use it to experiment with new transliteration ideas.
Orthographies in Early Modern Europe
This volume provides, for the first time, a pan-European view of the development of written languages at a key time in their history: that of the 16th century. The major cultural and intellectual upheavals that affected Europe at the time - Humanism, the Reformation and the emergence of modern nation-states - were not isolated phenomena, and the evolution of the orthographical systems of European languages shows a large number of convergences, due to the mobility of scholars, ideas and technological innovations throughout the period
JANES v0.4: Korpus slovenskih spletnih uporabniških vsebin [JANES v0.4: A Corpus of Slovene User-Generated Web Content]
This paper presents the latest version of Janes, a corpus of Internet Slovene, which contains tweets, web forums, news articles and user comments on them, blog posts and comments on them, and user and talk pages from Wikipedia. We first describe the text collection procedure for each of the included sources and give a quantitative analysis of the resulting corpus. We then present the automatic and manual procedures used to enrich the corpus with useful metadata, such as the author's type, gender and region, and the sentiment and degree of technical and linguistic standardness of each text. We conclude by describing the workflow for the linguistic annotation of the corpus, which comprises tokenisation, sentence segmentation, rediacritisation, normalisation, morphosyntactic tagging and lemmatisation.