11 research outputs found

    The Effect of Iconicity Flash Blindness—An Empirical Study

    In our experiment, the Saussurean postulate of arbitrariness was tested empirically to see whether it applies to all words to the same extent. Three hundred participants were asked to match Czech words with their Hindi translations. One set of words was randomly chosen from a Hindi corpus (set A); the second set consisted of both randomly chosen words and words categorized as ideophones (set B). The participants were successful in matching both sets (the lower bound of the confidence interval is about 7% above random guessing), and their performance showed unexpected patterns. First, not only iconic properties (the sound qualities) but also iconicity itself is an important distinctive feature, and recipients are able to exploit it. Moreover, even words considered to be non-iconic (set A) apparently contain a degree of iconicity, which participants are able to draw upon. However, participants appear to lose this ability when non-iconic words are presented in the context of words with evident and abundant iconicity (set B). The effect resembles the accommodation process known from other senses; therefore, we call the effect “Iconicity flash blindness”.
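The comparison of a confidence interval's lower bound against random guessing can be illustrated with a small sketch. The abstract does not say which interval method was used, how many answer options each trial offered, or what the raw counts were, so the Wilson score interval, the function names, and every number in the usage note are assumptions chosen purely for illustration.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval (an assumption; the
    abstract does not state the confidence level).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


def beats_chance(successes: int, trials: int, chance: float) -> bool:
    """True if even the lower bound of the interval lies above chance level."""
    return wilson_lower_bound(successes, trials) > chance
```

With hypothetical counts, `beats_chance(60, 100, 0.25)` asks whether 60 correct matches out of 100 stay above a 25% guessing rate even at the lower end of the interval.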

    Ideofony v hindštině (Ideophones in Hindi)

    This thesis explores ideophones in Hindi. Ideophones are "marked words that depict sensory imagery" (Dingemanse 2011:25). The thesis has four main chapters. (A) The first defines ideophones and offers a new perspective on this linguistic phenomenon; it also discusses their relationship to interjections. (B) The second lists common features of Hindi ideophones that set them apart from the rest of the vocabulary. (C) The third presents the results of field research into whether speakers of Hindi actively use these words. (D) The last focuses on the topic most essential to ideophones, their semantics. The analysis uses the tools of Frame Semantics, and a new frame, the Vivid sensation frame, is proposed to capture ideophonic meanings. The thesis is accompanied by a list of the collected ideophones, the first of its kind. Institute of Linguistics, Faculty of Arts.

    HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation

    We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi, in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both corpora are freely available for non-commercial research, and their preliminary release has been used by numerous participants of the WMT 2014 shared translation task.

    Repetitive formations in Hindi

    The main goal of this bachelor thesis is to systematize the classification of the morphological operation known as repetition. Even elaborate studies of repetition either deal with only a small part of this large topic or are fragmentary. After distinguishing repetition from reduplication, we formulate a definition for recognizing each case of the phenomenon under examination. We then argue for introducing prototype theory into the treatment of repetition, which helps us solve some problems in classifying this morphological operation. The greatest part of the thesis consists of the classification of repetition itself. In the introductory part of each chapter, we briefly analyze how other linguists treat the type of repetition in question. The two most important parts of each chapter are the structural aspect and the semantic aspect. By the former we mean the formal changes a word undergoes when reiterated; by the latter we mean semantic shifts. Both aspects are treated in detail. We exemplify each type of repetition with words taken from grammars and from our corpus. Another important topic is onomatopoeic reduplication. We try to demonstrate why it is not a type of reduplication and at the same time propose how to account for this phenomenon. These words are best described when we..

    Hindi Ideophones

    This thesis explores ideophones in Hindi. Ideophones are "marked words that depict sensory imagery" (Dingemanse 2011:25). The thesis focuses on four main topics, each covered in its own section. (A) It defines the ideophone and offers a new perspective on this linguistic phenomenon. (B) It lists some common features of ideophones in Hindi which set them apart from the rest of the vocabulary. (C) It describes the first field research on ideophones, whose main goal was to find out whether speakers of Hindi actively use them. (D) The last part focuses on the most interesting topic connected to ideophones, their semantics. It is analyzed from the point of view of Frame Semantics, and a new Vivid sensation frame is suggested to capture ideophonic meanings. An important part of this thesis is the ideophone list, which is the first of its kind.

    Kategoriální normy češtiny pro 12 kategorií / Categorial Norms for 12 Czech Categories

    In this paper, we offer an overview of available category norms and of the methodology of their creation. In the second part of the paper, category norms for 12 categories in Czech are presented (an alcoholic beverage, a colour, a crime, a four-legged animal, a fruit, a metal, a part of the human body, a relative, a sport, a type of vehicle, a toy, a weapon). These norms are then analysed in relation to linguistic frequency and token length. The problems of correlating linguistic frequency, which is based on corpus data, with associative frequency, which is based on category norms, are discussed. Preliminarily, it seems that the members of more constrained categories are more closely related to each other and activate each other more strongly than members of more open categories. This can be explained by the principles of the Spreading Activation Theory of Semantic Processing (Collins and Loftus, 1975).

    HindEnCorp 0.5

    HindEnCorp parallel texts (sentence-aligned) come from the following sources:
    Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).
    Commentaries by Daniel Pipes contain 322 articles in English written by the journalist Daniel Pipes and translated into Hindi.
    EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.
    Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and an Agriculture domain parallel corpus.
    For the current release, we are extending the parallel corpus using these sources:
    Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp’s core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominantly short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, the English translation is available for only three of them; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.
    TED talks, held in various languages, primarily English, are equipped with transcripts, and these are translated into 102 languages. There are 179 talks for which a Hindi translation is available.
    The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, from typesetting and punctuation through capitalization, spelling and word choice to sentence structure. Some control could in principle be obtained from the fact that every input sentence was translated 4 times. We used the 2012 release of the corpus.
    Launchpad.net is a software collaboration platform that hosts many open-source projects and also facilitates collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.
    Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entity that appear on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.

    HindMonoCorp 0.5

    Hindi monolingual corpus. It is based primarily on web crawls performed using various tools and at various times. Since the web is a living data source, we treat these crawls as completely separate sources, even though they may overlap. To estimate the magnitude of this overlap, we compared the total number of segments obtained by concatenating the individual sources (each source being deduplicated on its own) with the number of segments obtained by deduplicating all sources together. The difference is only around 1%, confirming that the various web crawls (or their subsequent processing) differ significantly. HindMonoCorp contains data from:
    Hindi web texts, a monolingual corpus containing mainly Hindi news articles, has already been collected and released by Bojar et al. (2008). We use the HTML files as crawled for this corpus in 2010, add a small crawl performed in 2013, and re-process them with the current pipeline. These sources are denoted HWT 2010 and HWT 2013 in the following.
    Hindi corpora in W2C have been collected by Martin Majliš during his project to automatically collect corpora in many languages (Majliš and Žabokrtský, 2012). There are in fact two corpora of Hindi available: one from web harvest (W2C Web) and one from Wikipedia (W2C Wiki).
    SpiderLing is a web crawl carried out during November and December 2013 using SpiderLing (Suchomel and Pomikálek, 2012). The pipeline includes extraction of plain texts and deduplication at the level of documents, see below.
    CommonCrawl is a non-profit organization that regularly crawls the web and provides anyone with the data. We are grateful to Christian Buck for extracting plain text Hindi segments from the 2012 and 2013-fall crawls for us.
    Intercorp: 7 books with their translations, scanned and manually aligned per paragraph.
    RSS feeds from Webdunia.com and the Hindi version of BBC International, followed by our custom crawler from September 2013 till January 2014.
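The overlap estimate described in this entry (sum of per-source deduplicated segment counts versus the count after deduplicating all sources together) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the exact-string notion of a duplicate, and the toy data in the usage note are all assumptions.

```python
from typing import Dict, Iterable


def overlap_fraction(sources: Dict[str, Iterable[str]]) -> float:
    """Fraction of segments lost when deduplicating across sources.

    Each source is first deduplicated on its own; the sizes are summed and
    compared with the size of the union of all sources. Duplicates are
    detected by exact string equality (an assumption for this sketch).
    """
    per_source = [set(segments) for segments in sources.values()]
    concatenated_total = sum(len(s) for s in per_source)
    if concatenated_total == 0:
        return 0.0
    joint_total = len(set().union(*per_source))
    return (concatenated_total - joint_total) / concatenated_total
```

For two hypothetical crawls `{"crawl_a": ["x", "y", "x"], "crawl_b": ["y", "z"]}`, the per-source counts are 2 and 2, the joint count is 3, and the function returns 0.25. A value near 0.01 would correspond to the roughly 1% difference reported in the entry.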