197 research outputs found

    Using the Swadesh list for creating a simple common taxonomy

    PACLIC 20 / Wuhan, China / 1-3 November, 200

    Adapting International Standard for Asian Language Technologies

    Corpus-based and statistical approaches have been the mainstream of natural language processing research for the past two decades. Language resources play a key role in such approaches, but many Asian languages still lack sufficient resources. In this situation, standardisation of language resources would be of great help in developing resources for new languages. This paper presents the latest development efforts of our project, which aims to create a common standard for Asian language resources that is compatible with an international standard. In particular, the paper focuses on i) the lexical specification and data categories relevant for building multilingual lexical resources for Asian languages; ii) a core upper-layer ontology needed to ensure multilingual interoperability; and iii) the evaluation platform used to test the entire architectural framework.

    Bayesian phylolinguistics infers the internal structure and the time-depth of the Turkic language family

    Despite more than 200 years of research, the internal structure of the Turkic language family remains subject to debate. Classifications of Turkic so far are based on both classical historical–comparative linguistic and distance-based quantitative approaches. Although these studies yield an internal structure of the Turkic family, they cannot give us an understanding of the statistical robustness of the proposed branches, nor are they capable of reliably inferring absolute divergence dates, without assuming constant rates of change. Here we use computational Bayesian phylogenetic methods to build a phylogeny of the Turkic languages, express the reliability of the proposed branches in terms of probability, and estimate the time-depth of the family within credibility intervals. To this end, we collect a new dataset of 254 basic vocabulary items for thirty-two Turkic language varieties based on the recently introduced Leipzig–Jakarta list. Our application of Bayesian phylogenetic inference on lexical data of the Turkic languages is unprecedented. The resulting phylogenetic tree supports a binary structure for Turkic and replicates most of the conventional sub-branches in the Common Turkic branch. We calculate the robustness of the inferences for subgroups and individual languages whose position in the tree seems to be debatable. We infer the time-depth of the Turkic family at around 2100 years before present, thus providing a reliable quantitative basis for previous estimates based on classical historical linguistics and lexicostatistics

    THE MEANING OF COLOR TERM IN CHINESE AND INDONESIAN IDIOMS: NATURAL SEMANTIC METALANGUAGE APPROACH

    Human beings share the same color vision. Idioms, as an embodiment of human experience, can therefore serve as a medium for expressing meanings that are universal across cultures, including the meanings of colors. Through the meanings of color terms in idioms, the universality of the human mind can be observed. Natural Semantic Metalanguage is an approach that seeks to capture the universality of language. Using the six color terms in the Morris Swadesh word list, this study aims to determine how universal the meanings of color terms are in Indonesian and Mandarin. The study shows that the meaning of the color term 'black' in Indonesian and Mandarin idioms is universal. Moreover, the meanings of the colors 'black', 'green' and 'red' are universal in the way colors are associated with objects and conditions.

    Tracking Linguistic Primitives: The Phonosemantic Realization of Fundamental Oppositional Pairs

    This thesis investigates how cross-linguistic phoneme distributions of 56 fundamental oppositional concepts can reveal semantic relationships by looking into the linguistic forms of 75 genetically and areally distributed languages. Based on proposals of semantic primes (Goddard 2002), reduced Swadesh lists (Holman et al. 2008), presumed ultraconservative words (Pagel et al. 2013), attested basic antonyms (Paradis, Willners & Jones 2009) and sense perception words, a number of semantic oppositional pairs were selected. Five different types of sound groupings were used, dividing phonemes according to: the frequency of vowels' second formant and consonants' energy accumulation (Frequency); sonority (Sonority); a combination of the aforementioned two (Combination); general phonetic traits, e.g. voicing (General); and lastly all traits of the four preceding groupings (All). These were analyzed by means of cluster analyses, producing biplots that illustrate the phonological relatedness between the investigated concepts. In addition, the over- and underrepresentation of the phoneme distributions relative to the average was calculated, identifying which sounds were overrepresented and which were lacking for each concept. Significant semantic groupings and relations based solely on phonological contrasts were found for most investigated concepts, including the semantic domains Small, Intense Vision-Touch, Large, Organic, Horizontal-Vertical Distance, Deictic, Containment, Gender, Parent and Diurnal, and the sole concept OLD. The most notable relations found were MOTHER/I vs. FATHER; a three-way deictic distinction between I, indicatory deictic concepts and THERE; and a dimensional tripartite oppositional relationship between Small (possibly with Intense Vision-Touch), Large-Organic and Horizontal-Vertical Distance. Embodiment, the benefits of oppositional thinking and evidence that more general concepts precede complex concepts were proposed as explanations for the results.

    The Lexical Grid: Lexical Resources in Language Infrastructures

    Language Resources (LRs) are recognized as central and strategic for the development of any Human Language Technology system and application product. They play a critical role as a horizontal technology and have been recognized on many occasions as a priority by national and supra-national funding bodies, which have supported a number of initiatives (such as EAGLES, ISLE, ELRA) to establish some coordination of LR activities, as well as a number of large LR creation projects in both the written and the speech areas.

    Quantifying loanwords: A study of borrowability in the Finnish lexicon

    The current study set out to investigate patterns of loanwords in a sample of 1,460 lexical meanings in the Finnish lexicon by means of quantitative methods. The methodology used was borrowed from the Loanword Typology project (Haspelmath & Tadmor 2009a), and consisted of a template including various fields, where information about each lexical item was coded. The fields included measures such as Borrowed status, Age and Donor language, and the data was collected from etymological dictionaries. The values coded for the lexical meanings were analysed to answer the research questions, which concerned, for example, loanword patterns in relation to semantic domains, immediate donor languages and loanword age. The loanword patterns found in Finnish were also compared to the cross-linguistic averages found by the Loanword Typology project. It was found that, in general, Finnish is a fairly typical language from a loanword typological point of view. It was also corroborated that the overwhelming majority of loanwords in Finnish come from Indo-European, especially from Germanic languages. Support was also found for correlations between loanword age and donor language branch, in that the loanwords from different language branches layered themselves timewise. Although the findings of this study are largely in line with the previous research on loanwords in Finnish, the most important contribution of this thesis is the restructuring of the previous research into a format which makes it comparable to corresponding data in a relatively large sample of languages cross-linguistically.
    The aim of the present study is to investigate loanwords in the Finnish lexicon. The study was carried out by quantifying the etymology of 1,460 lexical meanings, borrowing the method used in the Loanword Typology project (Haspelmath & Tadmor 2009a), in which, following the applied template, etymological information on each lexeme was collected from etymological dictionaries and quantified. The analysis focuses on answering questions about, for example, the borrowed status and age of the lexemes and their donor language and donor language family. By analysing the collected data, the study seeks to answer research questions concerning, among other things, how loanwords relate to semantic categories, donor languages and loanword age. The results are also compared with the corresponding cross-linguistic results found in the earlier Loanword Typology project. The results show that Finnish loanwords have, on the whole, followed fairly typical tendencies from a typological perspective. The results also confirm that the great majority of Finnish loanwords are of Indo-European origin, and that most of these in turn are Germanic loans. The findings further support a correlation between donor language groups, in that loanwords from the different Indo-European branches cluster into clearly distinct age layers. Although the results are largely as expected in light of earlier research, the study's most important contribution is the restructuring of earlier etymological research into a form that allows the results to be compared easily with those of similar studies on other languages.

    Going to the Root

    This paper presents an attempt to reconstruct the most basic features of the language of Homo sapiens, following the principle of monogenesis, namely the view that since humans share a common biological ancestry, they also share a common linguistic one. To this end, the basic methods of comparative linguistics are briefly presented first, along with the methodological approach used here, named Qualitative Inquiry. The results of the reconstruction are then presented, classified into phonological, morphological, lexical, grammatical and syntactic aspects. Although it lies at the edge of the paper's scope, a brief comparison with previous studies reveals both convergence and discrepancy concerning the features of the language.

    A theoretical approach to automatic loanword detection

    For several years, computational methods have been finding their way into the humanities. In computational linguistics in particular, various analyses and methods are studied. It is not surprising that computational analyses have aroused interest in the field of historical linguistics. With such methods, language evolution can be studied from another point of view. Biological and linguistic evolution show certain parallels. In particular, the parallels between phylogenetics and linguistics have aroused interest in combining both fields. Phylogenetics provides a great number of mathematical and computational methods for different tasks. Based on the parallels, the methods can be adapted to historical linguistics. In historical linguistics, borrowing is a well-known evolutionary process in which words are borrowed from one language and adapted into another. Borrowing has a corresponding parallel within phylogenetics, namely horizontal gene transfer. Horizontal gene transfer is the process of transferring genes from one organism to another. What borrowing and horizontal gene transfer have in common is that genes or words are transferred even though the organisms or languages need not be related. Phylogenetics provides several computational methods and analyses to detect horizontal gene transfer. The methods might be adapted to linguistics to detect borrowing. This paper introduces the background of borrowing and phylogenetics as well as the combination of both fields. The new tree-based approach should indicate whether the methods provided by phylogenetics can be adapted to linguistics for the detection of borrowing.
    Some years ago, automatic methods and computational analyses found their way into the humanities. Computational linguistics in particular investigates and develops new methods. It is therefore not surprising that interest in various computational analyses has grown in the field of historical linguistics. New approaches have changed the view of research methods in the study of language evolution. Biological evolution and language evolution share various commonalities. The similarities between phylogenetics and linguistics have led to a combination of these fields. Phylogenetics offers a large number of mathematical and also implemented methods for analysing different processes. Because of the commonalities between the fields, some of these methods can be carried over into historical linguistics. In historical linguistics, borrowing is a well-known evolutionary process in which words of one language are borrowed into another. The process of borrowing closely resembles horizontal gene transfer, a process known from phylogenetics. Horizontal gene transfer describes the transfer of genes from one organism to another. What borrowing and horizontal gene transfer have in common is the transfer of genes or words, where the organisms or languages need not be related. Phylogenetics offers several mathematical methods and analyses for detecting horizontal gene transfer. These could be carried over into linguistics. This thesis explains the background of borrowing and the foundations of phylogenetics. It also explains the combination of the two fields. The new tree-based approach is intended to show whether the methods from phylogenetics can be adopted in linguistics and whether they can detect borrowings.

    The building blocks of sound symbolism

    Languages contain thousands of words each and are made up of a seemingly endless collection of sound combinations. Yet a subsection of these show clear signs of corresponding word shapes for the same meanings, which is generally known as vocal iconicity and sound symbolism. This dissertation explores the boundaries of sound symbolism in the lexicon from typological, functional and evolutionary perspectives in an attempt to provide a deeper understanding of the role sound symbolism plays in human language. In order to achieve this, the subject in question was triangulated by investigating different methodologies which included lexical data from a large number of language families, experiment participants and robust statistical tests. Study I investigates basic vocabulary items in a large number of language families in order to establish the extent of sound symbolic items in the core of the lexicon, as well as how the sound-meaning associations are mapped and interconnected. This study shows that by expanding the lexical dataset compared to previous studies and completely controlling for genetic bias, a larger number of sound-meaning associations can be established. In addition, by placing focus on the phonetic and semantic features of sounds and meanings, two new types of sound symbolism could be established, along with 20 semantically and phonetically superordinate concepts which could be linked to the semantic development of the lexicon. Study II explores how sound symbolic associations emerge in arbitrary words through sequential transmission over language users. This study demonstrates that transmission of signals is sufficient for iconic effects to emerge and does not require interactional communication. Furthermore, it also shows that more semantically marked meanings produce stronger effects and that iconicity in the size and shape domains seems to be dictated by similarities between the internal semantic relationships of each oppositional word pair and its respective associated sounds. Studies III and IV use color words to investigate differences and similarities between low-level cross-modal associations and sound symbolism in lexemes. Study III explores the driving factors of cross-modal associations between colors and sounds by experimentally testing implicit preferences between several different acoustic and visual parameters. The most crucial finding was that neither specific hues nor specific vowels produced any notable effects and it is therefore possible that previously reported associations between vowels and colors are actually dependent on underlying visual and acoustic parameters. Study IV investigates sound symbolic associations in words for colors in a large number of language families by correlating acoustically described segments with luminance and saturation values obtained from cross-linguistic color-naming data. In accordance with Study III, this study showed that luminance produced the strongest results and was primarily associated with vowels, while saturation was primarily associated with consonants. This could then be linked to cross-linguistic lexicalization order of color words. To summarize, this dissertation shows the importance of studying the underlying parameters of sound symbolism semantically and phonetically in both language users and cross-linguistic language data. In addition, it also shows the applicability of non-arbitrary sound-meaning associations for gaining a deeper understanding of how linguistic categories have developed evolutionarily and historically.