152 research outputs found

    Acronym-Meaning Extraction from Corpora Using Multi-Tape Weighted Finite-State Machines

    The automatic extraction of acronyms and their meanings from corpora is an important sub-task of text mining. It can be seen as a special case of string alignment, in which a text chunk is aligned with an acronym. Alternative alignments have different costs, and ideally the least costly one should give the correct meaning of the acronym. We show how this approach can be implemented by means of a 3-tape weighted finite-state machine (3-WFSM) which reads a text chunk on tape 1 and an acronym on tape 2, and generates all alternative alignments on tape 3. The 3-WFSM can be automatically generated from a simple regular expression. No additional algorithms are required at any stage. Our 3-WFSM has a size of 27 states and 64 transitions, and finds the best analysis of an acronym in a few milliseconds. Comment: 6 pages, LaTeX
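    The alignment view described above can be illustrated with a small dynamic-programming sketch. The code below is not the authors' 3-WFSM; it is a minimal cost-based aligner that matches acronym letters to word-initial letters in a preceding text chunk, with penalties for skipped and non-matching words. Function names and cost values are illustrative assumptions.

        from functools import lru_cache

        def best_expansion(acronym, words, skip_cost=1.0, mismatch_cost=5.0):
            # Toy stand-in for the paper's weighted finite-state alignment:
            # each acronym letter should account for the first letter of some
            # word; skipping a word costs skip_cost, and using a word whose
            # first letter does not match costs mismatch_cost.
            acronym = acronym.lower()

            @lru_cache(maxsize=None)
            def align(i, j):
                # i = acronym letters consumed, j = words consumed
                if i == len(acronym):
                    return 0.0, ()                  # trailing words are free to ignore
                if j == len(words):
                    return float("inf"), ()         # unmatched acronym letters: dead end
                word = words[j]
                step = 0.0 if word.lower().startswith(acronym[i]) else mismatch_cost
                cost_take, rest_take = align(i + 1, j + 1)
                take = (step + cost_take, (word,) + rest_take)
                cost_skip, rest_skip = align(i, j + 1)
                skip = (skip_cost + cost_skip, rest_skip)
                return min(take, skip, key=lambda t: t[0])

            cost, expansion = align(0, 0)
            return cost, " ".join(expansion)

        # Example: the least costly alignment recovers the intended expansion.
        print(best_expansion("WFSM", "weighted finite state machines for text mining".split()))
        # -> (0.0, 'weighted finite state machines')

    In the authors' formulation the same cost minimisation is compiled into a single weighted finite-state machine rather than computed by recursion, but the underlying idea is the same.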

    Foundation, Implementation and Evaluation of the MorphoSaurus System: Subword Indexing, Lexical Learning and Word Sense Disambiguation for Medical Cross-Language Information Retrieval

    In everyday medical practice, which involves a great deal of documentation and search work, the majority of textually encoded information is now available electronically. The development of powerful methods for efficient retrieval is therefore of primary importance. Judged from the perspective of medical terminology, common text retrieval systems lack morphological functionality (inflection, derivation and composition), lexical-semantic functionality, and the ability to analyse large document collections across languages. This dissertation presents the theoretical foundations of the MorphoSaurus system (an acronym for Morpheme Thesaurus). Its methodological core is a thesaurus organised around morphemes of medical expert and lay language, whose entries are linked across languages by semantic relations. Building on this, a procedure is presented which segments (complex) words into morphemes that are replaced by language-independent, concept-class-like symbols. The resulting representation is the basis for cross-lingual, morpheme-oriented text retrieval. In addition to this core technology, a method for the automatic acquisition of lexicon entries is presented, by which existing morpheme lexicons are extended to further languages. Taking cross-lingual phenomena into account then leads to a novel procedure for resolving semantic ambiguities. The performance of morpheme-oriented text retrieval is tested empirically in extensive, standardised evaluations and compared with common approaches
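    A minimal sketch of the subword-indexing idea, assuming an invented morpheme lexicon and invented concept-class symbols: words are segmented by greedy longest match and each morpheme is replaced by a language-independent symbol, so that German and English terms for the same concept receive overlapping representations. The real MorphoSaurus lexicon and its class inventory are far larger; nothing below reproduces them.

        # Invented toy lexicon: surface morphemes -> language-independent class
        # symbols.  None marks a semantically empty affix that is dropped.
        LEXICON = {
            "herz": "#HEART", "entzünd": "#INFLAMM", "ung": None,    # German
            "myo": "#MUSCLE", "card": "#HEART", "itis": "#INFLAMM",  # English / Greco-Latin
        }

        def segment(word, lexicon=LEXICON):
            # Greedy longest-match segmentation into concept-class symbols.
            word = word.lower()
            symbols, i = [], 0
            while i < len(word):
                for j in range(len(word), i, -1):        # try the longest candidate first
                    cls = lexicon.get(word[i:j], "MISS")
                    if cls != "MISS":
                        if cls is not None:
                            symbols.append(cls)
                        i = j
                        break
                else:
                    i += 1                               # skip an unanalysable character
            return symbols

        # Terms from different languages for the same concept map to the same
        # symbols, which is what makes morpheme-oriented cross-language retrieval work.
        print(segment("Herzentzündung"))   # ['#HEART', '#INFLAMM']
        print(segment("Myocarditis"))      # ['#MUSCLE', '#HEART', '#INFLAMM']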

    Applied and Computational Linguistics

    The current state of applied and computational linguistics is reviewed, and linguistic theories of the 20th and early 21st centuries are analysed from the standpoint of distinguishing different aspects of language for the purpose of formalised description in electronic linguistic resources. A critical overview is offered of such topical problems of applied (computational) linguistics as the compilation of computational lexicons and electronic text corpora, automatic natural language processing, automatic speech synthesis and recognition, machine translation, and the creation of intelligent robots capable of perceiving information in natural language. Intended for students and postgraduates in the humanities and for academic staff of higher education institutions of Ukraine

    Formal Linguistic Models and Knowledge Processing. A Structuralist Approach to Rule-Based Ontology Learning and Population

    2013 - 2014. The main aim of this research is to propose a structuralist approach to knowledge processing by means of ontology learning and population, starting from unstructured and structured texts. The suggested method combines distributional semantic approaches and NL formalization theories in order to develop a framework that relies upon deep linguistic analysis... [edited by author]

    Towards a corpus of Indian South African English (ISAE) : an investigation of lexical and syntactic features in a spoken corpus of contemporary ISAE

    There is consensus among scholars that there is not just one English language but a family of “World Englishes”. The umbrella term “World Englishes” provides a conceptual framework to accommodate the different varieties of English that have evolved as a result of the linguistic cross-fertilization attendant upon colonization, migration, trade and transplantation of the original “strain” or variety. Various theoretical models have emerged in an attempt to understand and classify the extant and emerging varieties of this global language. The hierarchically based model of English, which classifies world English as “First Language”, “Second Language” and “Foreign Language”, has been challenged by more equitably conceived models which refer to the emerging varieties as New Englishes. The situation in a country such as multi-lingual South Africa is a complex one: there are 11 official languages, one of which is English. However, the English used in South Africa (or “South African English”) is not a homogeneous variety, since its speakers include those for whom it is a first language, those for whom it is an additional language and those for whom it is a replacement language. The Indian population in South Africa is amongst the latter group, as theirs is a case where English has ousted the traditional Indian languages and become a de facto first language, which has retained strong community resonances. This study was undertaken using the methodology of corpus linguistics to initiate the creation of a repository of linguistic evidence (or corpus) of Indian South African English, a sub-variety of South African English (Mesthrie 1992b, 1996, 2002). Although small (approximately 60 000 words), and representing a narrow age band of young adults, the resulting corpus of spoken data confirmed the existence of robust features identified in prior research into the sub-variety. These features include the use of ‘y’all’ as a second person plural pronoun, the use of ‘but’ in a sentence-final position, and ‘lakker’ /'lVk@/ as a pronunciation variant of ‘lekker’ (meaning ‘good’, ‘nice’ or ‘great’). An examination of lexical frequency lists revealed examples of general South African English such as the colloquially pervasive ‘ja’, ‘bladdy’ (for ‘bloody’) and ‘jol(ling)’ (for partying or enjoying oneself), together with neologisms such as ‘eish’, the latter previously associated with speakers of Black South African English. The frequency lists facilitated cross-corpora comparisons with data from the British National Corpus and the Corpus of London Teenage Language, and similarities and differences were noted and discussed. The study also used discourse analysis frameworks to investigate the role of high-frequency lexical items such as ‘like’ in the data. In recent times ‘like’ has emerged globally as a lexicalized discourse marker, and its appearance in the corpus of Indian South African English confirms this trend. The corpus built as part of this study is intended as the first building block towards a full corpus of Indian South African English which could serve as a standard for referencing research into the sub-variety. Ultimately, it is argued that the establishment of similar corpora of other known sub-varieties of South African English could contribute towards the creation of a truly representative large corpus of South African English and a more nuanced understanding and definition of this important variety of World English
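    The frequency-list methodology mentioned in the abstract is straightforward to reproduce. The sketch below builds case-folded frequency lists and compares relative frequencies (per 10,000 tokens) of a few candidate markers across two corpora; the file names, the normalisation constant and the marker list are assumptions made for illustration, not details taken from the study.

        from collections import Counter
        import re

        def freq_list(text):
            # Case-folded word frequency list; the simple tokeniser keeps
            # apostrophes so that forms like y'all survive as single tokens.
            tokens = re.findall(r"[a-z']+", text.lower())
            return Counter(tokens), len(tokens)

        def relative_freq(word, text, per=10_000):
            # Frequency of `word` normalised per 10,000 tokens.
            counts, n = freq_list(text)
            return counts[word] / n * per

        # Placeholder file names; any two plain-text corpora will do.
        isae = open("isae_spoken.txt", encoding="utf-8").read()
        reference = open("bnc_sample.txt", encoding="utf-8").read()
        for marker in ["y'all", "ja", "lakker", "like"]:
            print(marker, relative_freq(marker, isae), relative_freq(marker, reference))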

    Semantic Domains in Akkadian Text

    The article examines the possibilities offered by language technology for analyzing semantic fields in Akkadian. The corpus of data for our research group is an existing electronic corpus, the Open Richly Annotated Cuneiform Corpus (ORACC). In addition to more traditional Assyriological methods, the article explores two language-technological methods: pointwise mutual information (PMI) and Word2vec
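    Both language-technological methods named here are standard. The sketch below estimates PMI for word pairs from windowed co-occurrence counts and notes in a trailing comment how Word2vec embeddings are typically trained with an off-the-shelf library; the toy lemmatised sentences, window size and minimum count are illustrative assumptions rather than the research group's actual pipeline.

        import math
        from collections import Counter

        def pmi_table(sentences, window=5, min_count=2):
            # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), estimated from
            # windowed co-occurrence counts over tokenised sentences.
            word_counts, pair_counts, total = Counter(), Counter(), 0
            for sent in sentences:
                total += len(sent)
                word_counts.update(sent)
                for i, w in enumerate(sent):
                    for v in sent[i + 1:i + 1 + window]:
                        pair_counts[tuple(sorted((w, v)))] += 1
            pairs_total = sum(pair_counts.values())
            table = {}
            for (x, y), c in pair_counts.items():
                if word_counts[x] < min_count or word_counts[y] < min_count:
                    continue
                p_xy = c / pairs_total
                p_x, p_y = word_counts[x] / total, word_counts[y] / total
                table[(x, y)] = math.log(p_xy / (p_x * p_y))
            return table

        # Toy lemmatised "sentences"; a real run would use lemma lines from ORACC.
        sents = [["sharru", "dannu", "matu"],
                 ["sharru", "rabu", "matu"],
                 ["ilu", "rabu", "shamu"]]
        for pair, score in sorted(pmi_table(sents, min_count=1).items(),
                                  key=lambda kv: -kv[1])[:3]:
            print(pair, round(score, 2))

        # Word2vec embeddings are usually trained with an off-the-shelf library,
        # e.g. (assuming gensim 4.x):
        #   from gensim.models import Word2Vec
        #   model = Word2Vec(sentences=sents, vector_size=50, window=5, min_count=1)
        #   model.wv.most_similar("sharru")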

    CyberResearch on the Ancient Near East and Eastern Mediterranean

    CyberResearch on the Ancient Near East and Neighboring Regions provides case studies on archaeology, objects, cuneiform texts, online publishing, digital archiving, and preservation. Eleven chapters present a rich array of material, spanning the fifth through the first millennium BCE, from Anatolia, the Levant, Mesopotamia, and Iran. Customized cyber- and general glossaries support readers who lack either a technical background or familiarity with the ancient cultures. Edited by Vanessa Bigot Juloux, Amy Rebecca Gansell, and Alessandro Di Ludovico, this volume is dedicated to broadening the understanding and accessibility of digital humanities tools, methodologies, and results within Ancient Near Eastern Studies. Ultimately, this book provides a model for introducing cyber-studies to the mainstream of humanities research.