60 research outputs found

    Building Web Corpora for Minority Languages

    Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter. Web pages in such minority languages often contain text and links to pages in the dominant language of the country. When building corpora in specific languages, one has to decide how and at which stage to make sure the texts gathered are in the desired language. In the "Finno-Ugric Languages and the Internet" (Suki) project, we created web corpora for Uralic minority languages using web crawling combined with a language identification system in order to identify the language while crawling. In addition, we used language set identification and crowdsourcing before making sentence corpora out of the downloaded texts. In this article, we describe a strategy for collecting textual material from the Internet for minority languages. The strategy is based on the experiences we gained during the Suki project. Peer reviewed.
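
    The idea of identifying the language while crawling can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the Suki project's system: the in-memory PAGES "web", the toy identify_language function based on a few function words (Finnish and Russian serve only as stand-ins for a target and a dominant language), and the policy of neither keeping nor expanding pages that are not in the target language.

```python
from collections import deque

# Hypothetical marker words; a real identifier would use trained models,
# not a handful of function words.
MARKERS = {
    "fin": {"ja", "on", "ei", "että", "oli"},
    "rus": {"и", "в", "не", "на", "что"},
}

def identify_language(text):
    """Return the language whose marker words occur most often, or None."""
    tokens = text.lower().split()
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in MARKERS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def crawl(pages, seeds, target):
    """Breadth-first crawl over an in-memory 'web'.

    pages maps url -> (text, outgoing links). Assumption: pages not
    identified as the target language are neither kept nor expanded.
    """
    queue = deque(seeds)
    seen = set(seeds)
    kept = []
    while queue:
        url = queue.popleft()
        if url not in pages:
            continue
        text, links = pages[url]
        if identify_language(text) != target:
            continue  # off-target page: drop it and do not follow its links
        kept.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept

# Toy data standing in for fetched pages (hypothetical).
PAGES = {
    "a": ("tämä on sivu ja se ei ole pitkä", ["b", "c"]),
    "b": ("это не страница на финском", ["d"]),
    "c": ("se oli hyvä ja uusi", []),
    "d": ("ja on ei", []),
}

print(crawl(PAGES, ["a"], "fin"))  # ['a', 'c']
```

    Page "d" is never visited in this toy run because it is linked only from the off-target page "b". That is a consequence of the assumed expansion policy, and a different policy would trade coverage against crawl cost differently.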

    Hungarian Gyerekestül versus Gyerekkel (‘with [the] kid’)

    The paper analyzes the various uses of the Hungarian sociative (associative) suffix -stUl (‘together with’, ‘along with’), referred to below simply as the “sociative”, as in the example gyerekestül. Unlike the comitative-instrumental suffix -vAl (‘with’), the -stUl suffix cannot express instrumentality. The paper aims to demonstrate the difference in use between comitative-instrumental -vAl and -stUl in contemporary Hungarian, and to illuminate the historical emergence of the suffix as well as its grammatical status. It is argued on the basis of Antal (1960) and Kiefer (2003) that -stUl cannot be analyzed as an inflectional case suffix (such as the -vAl suffix, or -ed, -ing, or the plural in English), but should rather be categorized as a derivational suffix (such as English dis-, re-, in-, -ance, -able, -ish, -like, etc.). The paper also tries to shed light on the hypothetical cognitive psychological distinction between the comitative and the sociative. It is suggested that the sociative is based on the amalgam image schema, which is derived from the LINK schema of the comitative. The ironical reading of the sociative is an implicature in the sense of Grice (1989) and Sperber and Wilson (1987). Psycholinguistic experimentation is proposed to follow up on the mental representation of the sociative.

    Ludlings and Phonology in Tagalog

    This paper presents an analysis of the Tagalog “G-word” ludling and addresses its implications for Tagalog phonology. It is shown that the G-word ludling is best analyzed as an iterative infixal ludling in which the sequence -Vg- is inserted after every onset, rather than as infixation of -gV-. Crucially, the G-word ludling reveals constraints on Tagalog phonology that would otherwise be difficult to observe: *C1VC1V, hiatus avoidance, and iambic stress. Furthermore, our analysis of the G-words raises an important issue in Tagalog phonology: the possible emergence of the disyllabic “perfect prosodic word” in the G-words. Taken together, this paper offers another case study supporting the important role that ludlings play in phonological theory.
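
    The iterative -Vg- infixation described in the abstract can be sketched on orthographic strings. The sketch rests on assumptions the abstract does not state: that V copies the vowel of the syllable it is inserted into, that the five letters a, e, i, o, u are the only nuclei, and that inserting directly before each vowel letter is equivalent to inserting after the onset. It ignores the *C1VC1V, hiatus and stress constraints the paper discusses, so its outputs are what the bare rule produces, not attested G-word forms.

```python
VOWELS = set("aeiou")

def g_word(word):
    """Insert a copy of each vowel plus 'g' directly before that vowel.

    Assumption: the onset immediately precedes the vowel letter, so
    'after every onset' and 'before every nucleus' coincide here.
    """
    out = []
    for ch in word.lower():
        if ch in VOWELS:
            out.append(ch + "g")  # the -Vg- sequence, V copied from the nucleus
        out.append(ch)
    return "".join(out)

print(g_word("hindi"))    # higindigi
print(g_word("salamat"))  # sagalagamagat
```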

    Semi-automatic Endogenous Enrichment of Collaboratively Constructed Lexical Resources: Piggybacking onto Wiktionary

    The lack of large-scale, freely available and durable lexical resources, and the consequences for NLP, is widely acknowledged, but the attempts to cope with the usual bottlenecks preventing their development often result in dead-ends. This article introduces a language-independent, semi-automatic and endogenous method for enriching lexical resources, based on collaborative editing and random walks through existing lexical relationships, and shows how this approach enables us to overcome recurrent impediments. It compares the impact of using different data sources and similarity measures on the task of improving synonymy networks. Finally, it defines an architecture for applying the presented method to Wiktionary and explains how it has been implemented.
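
    How random walks over existing lexical relationships can surface candidate synonyms may be illustrated with a generic sketch. It is not the article's method or any of its similarity measures: the toy GRAPH, the function names, and the choice to compute exact short-walk probabilities (rather than sampling walks) and to rank words that are not yet linked to the start word are all assumptions made for illustration.

```python
def walk_distribution(graph, start, steps):
    """Exact probability of being at each node after `steps` uniform random steps."""
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for node, p in dist.items():
            neighbours = graph[node]
            for n in neighbours:
                nxt[n] = nxt.get(n, 0.0) + p / len(neighbours)
        dist = nxt
    return dist

def suggest(graph, word, steps=2):
    """Rank words reachable by a short walk that are not yet linked to `word`."""
    dist = walk_distribution(graph, word, steps)
    candidates = {w: p for w, p in dist.items()
                  if w != word and w not in graph[word]}
    return sorted(candidates, key=lambda w: (-candidates[w], w))

# Hypothetical, symmetric synonymy network.
GRAPH = {
    "big": ["large", "huge"],
    "large": ["big", "huge", "vast"],
    "huge": ["big", "large", "vast"],
    "vast": ["large", "huge", "wide"],
    "wide": ["vast"],
}

print(suggest(GRAPH, "big"))  # ['vast']
```

    In a semi-automatic setting, a ranked list like this would be shown to contributors for validation instead of being added to the resource directly.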

    Variable Hiatus in Persian is Affected by Suffix Length

    We conduct two experiments to examine variable hiatus in Spoken Persian. The production experiment reveals that variation is restricted. For instance, elision of the first vowel, which is cross-linguistically common (Casali 1997), is never attested. Moreover, elision of the second vowel is rare with monosegmental (-V) suffixes. The perception experiment confirms that elision of the second vowel is preferred with polysegmental suffixes, but rare with monosegmental suffixes, where hiatus is favoured instead. The preference of hiatus over epenthesis remains constant regardless of suffix length. This study contributes to the discussion of variable phonological processes and constitutes the first experimentally confirmed case of variable hiatus to date.

    Looking for French deverbal nouns in an evolving Web (a short history of WAC)

    This paper describes an 8-year-long research effort to automatically collect new French deverbal nouns on the Web. The goal has remained the same: building an extensive and cumulative list of noun-verb pairs where the noun denotes the action expressed by the verb (e.g. production - produce). This list is used both for linguistic research and for NLP applications. The initial method took advantage of the former Altavista search engine, which allowed direct access to unknown word forms. The second technique led us to develop a specific crawler, which raised a number of technical difficulties. In the third experiment, we use a collection of web pages made available to us by a commercial search engine. Through all these stages, the general method has remained the same, and the results are similar and cumulative, although the technical environment has greatly evolved.

    WebBANC: Building Semantically-Rich Annotated Corpora from Web User Annotations of Minority Languages

    Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009. Editors: Kristiina Jokinen and Eckhard Bick. NEALT Proceedings Series, Vol. 4 (2009), 48-56. © 2009 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/9206

    Entropy measures and predictive recognition as mirrored in gating and lexical decision over multimorphemic Hungarian noun forms

    Our paper attempts to show the relevance of information-theoretical accounts for understanding word recognition and morphological processing in Hungarian, along with other studies using more traditional predictors such as linear position and morphological composition. The first two experiments were gating studies. The effect of the decision points was only evident in frequent words. The correct recognition means for the recognition points differ from the means for one-before-recognition points, indicating that the recognition point follows a sudden drop of the entropy value. This shows how entropy measures can be used to predict word recognition in actual language performance. The next two experiments examined the word reconstruction effect. A clear bathtub effect (Aitchison, 1987) was obtained: reconstruction was highest in the cases where both the beginning and the end were correct. The last, lexical-decision-based study used four basic morphological types of markers (plural, second and first possessive) and three types of case (-nak, -ban, -ra: ‘DAT, INSIDE, ONTO’). The main effects of frequency and error type were significant. Frequent words were judged faster but less accurately, suggesting a trade-off. The later the mistake occurred, the faster and easier its rejection was.
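
    The link between a recognition point and a sudden drop in entropy can be made concrete with a small sketch. It is not the paper's measure. It assumes that each gate adds one written letter, that the competitors at a gate are the lexicon entries sharing the prefix heard so far, and that entropy is Shannon entropy over their relative frequencies. The LEXICON counts are hypothetical.

```python
import math

def cohort_entropy(lexicon, prefix):
    """Shannon entropy (bits) over the words compatible with `prefix`."""
    cohort = [f for w, f in lexicon.items() if w.startswith(prefix)]
    total = sum(cohort)
    if total == 0:
        return 0.0
    # sum of p * log2(1/p), written this way so a single candidate gives 0.0
    return sum((f / total) * math.log2(total / f) for f in cohort)

def entropy_profile(lexicon, word):
    """Entropy after each successive letter (gate) of `word`."""
    return [cohort_entropy(lexicon, word[:i]) for i in range(1, len(word) + 1)]

# Hypothetical frequency counts for a few Hungarian noun forms.
LEXICON = {"házak": 1, "házban": 1, "háznak": 1, "halak": 1}

profile = entropy_profile(LEXICON, "házban")
print([round(h, 3) for h in profile])  # [2.0, 1.585, 1.585, 0.0, 0.0, 0.0]
```

    In this toy lexicon the entropy falls to zero at the fourth letter, the first point at which only one candidate remains. That is the kind of drop the abstract associates with the recognition point.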