47 research outputs found

    Every method counts : combining corpus-based and experimental evidence in the study of synonymy

    In this study we explore the concurrent, combined use of three research methods: statistical corpus analysis and two psycholinguistic experiments (a forced-choice task and an acceptability rating task), using verbal synonymy in Finnish as a case in point. In addition to supporting conclusions from earlier studies concerning the relationships between corpus-based and experimental data (e.g., Featherston 2005), we show that each method adds to our understanding of the studied phenomenon in a way which could not be achieved through any single method by itself. Most importantly, whereas relative rareness in a corpus is associated with dispreference in selection, such infrequency does not categorically entail substantially lower acceptability. Furthermore, we show that forced-choice and acceptability rating tasks pertain to distinct linguistic processes, with category-wise incommensurable scales of measurement, and should therefore be merged with caution, if at all. Peer reviewed

    Machine meets man. Evaluating the psychological reality of corpus-based probabilistic models.

    Linguistic convention allows speakers various options. Evidence is accumulating that the various options are preferred in different contexts, yet the criteria governing the selection of the appropriate form are often far from obvious. Most researchers who attempt to discover the factors determining a preference rely on the linguistic analysis and statistical modeling of data extracted from large corpora. In this paper, we address the question of how to evaluate such models and explicitly compare the performance of a statistical model derived from a corpus with that of native speakers in selecting one of six Russian TRY verbs. Building on earlier work by Divjak (2003, 2004, 2010) and Divjak & Arppe (2013), we trained a polytomous logistic regression model to predict verb choice given the context. We compare the predictions the model makes for 60 unseen sentences to the choices adult native speakers make in those same sentences. We then look in more detail at the interplay of the contextual properties and model computationally how individual differences in assessing the importance of contextual properties may impact the linguistic knowledge of native speakers. Finally, we compare the probability the model assigns to encountering each of the 6 verbs in the 60 test sentences to the acceptability ratings the adult native speakers give to those sentences. We discuss the implications of our findings for both usage-based theory and empirical linguistic methodology.

    Nordic co-operation in building the language resource infrastructures

    Proceedings of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources. Editors: Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard, Eiríkur Rögnvaldsson and Koenraad de Smedt. NEALT Proceedings Series, Vol. 5 (2009), 12-15. © 2009 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/9207

    Monta tapaa ajatella: Tilastollisten menetelmien hyödyntäminen aineistolähtöisessä sanastontutkimuksessa [Many ways to think: Using statistical methods in corpus-based lexical research]

    Opening lecture of the doctoral defence at the University of Helsinki, 19 December 200

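    Katsaus: Ei yhtä ainoaa polkua - suomalaisia kokemuksia matkalla kieliteknologisesta tutkimuksesta liiketoimintaan [Review: Not a single path - Finnish experiences on the road from language technology research to business]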

    This article provides an overview of how research in linguistics and language technology has been commercialized in Finland. It is observed that academic research has played an important role in the birth of the language technology business in Finland, though companies have also come to exist in the field with other types of background. The article goes through, in chronological order, the research results and projects of Finnish language technology researchers and research groups which have led to commercialized products or the founding of language technology companies. After this, the strategies which these companies with academic backgrounds have chosen are presented, followed by an overview of the subsequent development and growth of these companies and an assessment of how they have succeeded. A general observation is that the development of these companies has varied substantially, as have the strategies that the companies have pursued. Despite successes on the domestic level, a major international breakthrough for a Finnish language technology company is still awaited. In retrospect, it seems that those that have managed to see themselves first and foremost as commercial IT companies have succeeded best in the longer run. Finally, the present situation of these companies is evaluated, as well as their future possibilities and challenges.

    Lärdomar från utveckling av inflekterande synonymordböcker [Lessons from the development of inflecting thesauri]

    During 1996-1998, product development projects were carried out at Lingsoft in order to create so-called inflecting thesauri, in which the electronic form of the contents of synonym dictionaries for Swedish, Danish, and Norwegian Bokmål was integrated with computerized morphological models, created according to the two-level model, for the respective languages. The resultant computer programs provided practical insights into the interaction between structures of semantic relations, in this case representing synonymity, and inflectional morphology. As a result, it became evident that the principle of lexical generality cannot be trusted blindly in generating across the board the inflected forms of the base-form components of a synonym dictionary. Furthermore, it would seem that synonymity between words cannot be categorically expected to extend throughout the entire inflectional paradigm of these words. This would suggest that inflected forms of words should also be considered when constructing structures of semantic relations.

    Low hanging fruit and the Boasian trilogy in digital lexicography of morphologically rich languages: Lessons from a survey of Indigenous language resources in Canada

    Online lexicographical resources for the morphologically rich Indigenous languages in Canada use a wide range of strategies for conveying their language’s morphological system, i.e. how words are inflected and derived, which this paper illustrates in a survey of seventeen bilingual online resources. The strategies these resources employ boil down to two basic approaches to the underlying structure of the resource: 1) a lexical database, or 2) a computational model. Most resources we surveyed are constructed around lexical databases. These assume the word(form) as the basic unit, an assumption that makes it difficult to incorporate the language’s sub-word, morphological structure in full detail. However, one resource uses a computational morphological model to bring the language’s morphology into the core of the lexicon – this proved to be a “low-hanging fruit” in the application of language technology that had been accomplished within a reasonable time-frame, as has been advocated by Trond Trosterud. We discuss the value created and questions raised by this approach and argue that it successfully overcomes the traditional Boasian three-way partition of dictionary, grammar, and text, creating integrated language resources that meet the modern needs of low-resource endangered languages and their communities

    Some theoretical and experimental observations on naive discriminative learning

    Natural language use is full of choices among multiple possible alternatives, whether phones, words, or constructions, which are influenced by a large number of contextual factors, and which exhibit asymptotic, imperfect tendencies favoring one or more of the alternatives rather than single, categorical, perfect choices. This contrasts with item-by-item learning in simple controlled experiments, which has typically been modelled by the Rescorla-Wagner equations. We see the former "messy" types of problems as a key area of interest in modeling and understanding language use, and consequently consider the application of the Rescorla-Wagner equations, in the form of a Naive Discriminative Learning classifier, to such complex phenomena to be of considerable utility in linguistic research. There is an updated version of this paper ("http://nbn-resolving.de/urn:nbn:de:bsz:21-dspace-677573"). Please use the updated version for further reference.
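As a rough illustration of the idea in the abstract above, the following is a minimal sketch of a classifier driven by the Rescorla-Wagner update rule. It is not the authors' implementation: the function names, the toy cues and outcomes, and the choice to iterate the update rule event by event (rather than solving for equilibrium weights) are all assumptions made for this example.

```python
from collections import defaultdict


def train_rescorla_wagner(events, outcomes, rate=0.1, lam=1.0):
    """Learn cue-outcome association weights from (cues, outcome) events.

    `rate` stands in for the product of the two learning-rate parameters
    of the Rescorla-Wagner rule; `lam` is the maximum association strength.
    Both values are illustrative assumptions.
    """
    weights = defaultdict(float)  # (cue, outcome) -> association strength
    for cues, observed in events:
        for outcome in outcomes:
            # Summed support for this outcome from all cues present.
            predicted = sum(weights[(cue, outcome)] for cue in cues)
            # Target is lam if the outcome occurred, otherwise 0.
            target = lam if outcome == observed else 0.0
            delta = rate * (target - predicted)
            # Every cue present in the event receives the same adjustment.
            for cue in cues:
                weights[(cue, outcome)] += delta
    return weights


def classify(weights, cues, outcomes):
    """Pick the outcome with the highest summed association strength."""
    return max(outcomes, key=lambda o: sum(weights.get((c, o), 0.0) for c in cues))


# Hypothetical toy data: two near-synonymous verbs chosen in two contexts.
OUTCOMES = ["verb_a", "verb_b"]
EVENTS = [
    (frozenset({"human_agent", "formal"}), "verb_a"),
    (frozenset({"human_agent", "informal"}), "verb_b"),
] * 200

weights = train_rescorla_wagner(EVENTS, OUTCOMES)
print(classify(weights, {"human_agent", "formal"}, OUTCOMES))
```

In this toy setting the cue shared by both contexts ends up less strongly associated with either verb than the cue that distinguishes them, which is the kind of graded, non-categorical tendency the abstract describes.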