390 research outputs found

    Identification of Fertile Translations in Medical Comparable Corpora: a Morpho-Compositional Approach

    Get PDF
    This paper defines a method for lexicon in the biomedical domain from comparable corpora. The method is based on compositional translation and exploits morpheme-level translation equivalences. It can generate translations for a large variety of morphologically constructed words and can also generate 'fertile' translations. We show that fertile translations increase the overall quality of the extracted lexicon for English to French translation

    Are Automatic Methods for Cognate Detection Good Enough for Phylogenetic Reconstruction in Historical Linguistics?

    Get PDF
    We evaluate the performance of state-of-the-art algorithms for automatic cognate detection by comparing how useful automatically inferred cognates are for the task of phylogenetic inference compared to classical manually annotated cognate sets. Our findings suggest that phylogenies inferred from automated cog- nate sets come close to phylogenies inferred from expert-annotated ones, although on average, the latter are still superior. We con- clude that future work on phylogenetic reconstruction can profit much from automatic cognate detection. Especially where scholars are merely interested in exploring the bigger picture of a language family’s phylogeny, algorithms for automatic cognate detection are a useful complement for current research on language phylogenies

    Foundation, Implementation and Evaluation of the MorphoSaurus System: Subword Indexing, Lexical Learning and Word Sense Disambiguation for Medical Cross-Language Information Retrieval

    Get PDF
    Im medizinischen Alltag, zu welchem viel Dokumentations- und Recherchearbeit gehört, ist mittlerweile der überwiegende Teil textuell kodierter Information elektronisch verfügbar. Hiermit kommt der Entwicklung leistungsfähiger Methoden zur effizienten Recherche eine vorrangige Bedeutung zu. Bewertet man die Nützlichkeit gängiger Textretrievalsysteme aus dem Blickwinkel der medizinischen Fachsprache, dann mangelt es ihnen an morphologischer Funktionalität (Flexion, Derivation und Komposition), lexikalisch-semantischer Funktionalität und der Fähigkeit zu einer sprachübergreifenden Analyse großer Dokumentenbestände. In der vorliegenden Promotionsschrift werden die theoretischen Grundlagen des MorphoSaurus-Systems (ein Akronym für Morphem-Thesaurus) behandelt. Dessen methodischer Kern stellt ein um Morpheme der medizinischen Fach- und Laiensprache gruppierter Thesaurus dar, dessen Einträge mittels semantischer Relationen sprachübergreifend verknüpft sind. Darauf aufbauend wird ein Verfahren vorgestellt, welches (komplexe) Wörter in Morpheme segmentiert, die durch sprachunabhängige, konzeptklassenartige Symbole ersetzt werden. Die resultierende Repräsentation ist die Basis für das sprachübergreifende, morphemorientierte Textretrieval. Neben der Kerntechnologie wird eine Methode zur automatischen Akquise von Lexikoneinträgen vorgestellt, wodurch bestehende Morphemlexika um weitere Sprachen ergänzt werden. Die Berücksichtigung sprachübergreifender Phänomene führt im Anschluss zu einem neuartigen Verfahren zur Auflösung von semantischen Ambiguitäten. Die Leistungsfähigkeit des morphemorientierten Textretrievals wird im Rahmen umfangreicher, standardisierter Evaluationen empirisch getestet und gängigen Herangehensweisen gegenübergestellt

    Computational approaches to semantic change (Volume 6)

    Get PDF
    Semantic change — how the meanings of words change over time — has preoccupied scholars since well before modern linguistics emerged in the late 19th and early 20th century, ushering in a new methodological turn in the study of language change. Compared to changes in sound and grammar, semantic change is the least understood. Ever since, the study of semantic change has progressed steadily, accumulating a vast store of knowledge for over a century, encompassing many languages and language families. Historical linguists also early on realized the potential of computers as research tools, with papers at the very first international conferences in computational linguistics in the 1960s. Such computational studies still tended to be small-scale, method-oriented, and qualitative. However, recent years have witnessed a sea-change in this regard. Big-data empirical quantitative investigations are now coming to the forefront, enabled by enormous advances in storage capability and processing power. Diachronic corpora have grown beyond imagination, defying exploration by traditional manual qualitative methods, and language technology has become increasingly data-driven and semantics-oriented. These developments present a golden opportunity for the empirical study of semantic change over both long and short time spans

    Computational Approaches to Historical Language Comparison

    Get PDF
    The chapter discusses recently developed computational techniques providing concrete help in addressing various tasks in historical language comparison, focusing specifically on those tasks which are typically subsumed under the framework of the comparative method. These include the proof of relationship, cognate and correspondence detection, phonological reconstruction and sound law induction, and the reconstruction of evolutionary scenarios

    The potential of automatic word comparison for historical linguistics

    Get PDF
    The amount of data from languages spoken all over the world is rapidly increasing. Traditional manual methods in historical linguistics need to face the challenges brought by this influx of data. Automatic approaches to word comparison could provide invaluable help to pre-analyze data which can be later enhanced by experts. In this way, computational approaches can take care of the repetitive and schematic tasks leaving experts to concentrate on answering interesting questions. Here we test the potential of automatic methods to detect etymologically related words (cognates) in cross-linguistic data. Using a newly compiled database of expert cognate judgments across five different language families, we compare how well different automatic approaches distinguish related from unrelated words. Our results show that automatic methods can identify cognates with a very high degree of accuracy, reaching 89% for the best-performing method Infomap. We identify the specific strengths and weaknesses of these different methods and point to major challenges for future approaches. Current automatic approaches for cognate detection-although not perfect -could become an important component of future research in historical linguistics.As part of the GlottoBank Project, this work was supported by the Max Planck Institute for the Science of Human History and the Royal Society of New Zealand Marsden Fund grant 13¬UOA-121. This paper was further supported by the DFG research fellowship grant 261553824 “Vertical and lateral aspects of Chinese dialect history”(JML), and the Australian Research Council’s Discovery Projects funding scheme (project number DE120101954, SJG)
    corecore