78 research outputs found

    The Lexicon Graph Model : a generic model for multimodal lexicon development

    Get PDF
    Trippel T. The Lexicon Graph Model : a generic model for multimodal lexicon development. Bielefeld (Germany): Bielefeld University; 2006.Das Lexicon Graph Model stellt ein Modell für Lexika dar, die korpusbasiert sein können und multimodale Informationen enthalten. Hierbei wird die Perspektive der Lexikontheorie eingenommen, wobei die zugrundeliegenden Datenstrukturen sowohl vom Lexikon als auch von Annotationen betrachtet werden. Letztere fallen dadurch in das Blickfeld, weil sie als Grundlage für die Erstellung von Lexika gesehen werden. Der Begriff des Lexikons bezieht sich hier sowohl auf den Bereich des Wörterbuchs als auch der in elektronischen Applikationen integrierten Lexikondatenbanken. Die existierenden Formalismen und Ansätze der Lexikonentwicklung zeigen verschiedene Probleme im Zusammenhang mit Lexika auf, etwa die Zusammenfassung von existierenden Lexika zu einem, die Disambiguierung von Mehrdeutigkeiten im Lexikon auf verschiedenen lexikalischen Ebenen, die Repräsentation von anderen Modalitäten im Lexikon, die Selektion des lexikalischen Schlüsselbegriffs für Lexikonartikel, etc. Der vorliegende Ansatz geht davon aus, dass sich Lexika zwar in ihrem Inhalt, nicht aber in einer grundlegenden Struktur unterscheiden, so dass verschiedenartige Lexika im Rahmen eines Unifikationsprozesses dublettenfrei miteinander verbunden werden können. Hieraus resultieren deklarative Lexika. Für Lexika können diese Graphen mit dem Lexikongraph-Modell wie hier dargestellt modelliert werden. Dabei sind Lexikongraphen analog den von Bird und Libermann beschriebenen Annotationsgraphen gesehen und können daher auch ähnlich verarbeitet werden. Die Untersuchung des Lexikonformalismus beruht auf vier Schritten. Zunächst werden existierende Lexika analysiert und beschrieben. Danach wird mit dem Lexikongraph-Modell eine generische Darstellung von Lexika vorgestellt, die auch implementiert und getestet wird. Basierend auf diesem Formalismus wird die Beziehung zu Annotationsgraphen hergestellt, wobei auch beschrieben wird, welche Maßstäbe an angemessene Annotationen für die Verwendung zur Lexikonentwicklung angelegt werden müssen.The Lexicon Graph Model provides a model and framework for lexicons that can be corpus based and contain multimodal information. The focus is more from the lexicon theory perspective, looking at the underlying data structures that are part of existing lexicons and corpora. The term lexicon in linguistics and artificial intelligence is used in different ways, including traditional print dictionaries in book form, CD-ROM editions, Web based versions of the same, but also computerized resources of similar structures to be used by applications. These applications cover systems for human-machine communication as well as spell checkers. The term lexicon in this work is used as the most generic term covering all lexical applications. Existing formalisms in lexicon development show different problems with lexicons, for example combining different kinds of lexical resources, disambiguation on different lexical levels, the representation of different modalities in a lexicon. The Lexicon Graph Model presupposes that lexicons can have different structures but have fundamentally a similar structure, making it possible to combine lexicons in a unification process, resulting in a declarative lexicon. The underlying model is a graph, the Lexicon Graph, which is modeled similar to Annotation Graphs as described by Bird and Libermann. The investigation of the lexicon formalism contains four steps, that is the analysis of existing lexicons, the introduction of the Lexicon Graph Model as a generic representation for lexicons, the implementation of the formalism in different contexts and an evaluation of the formalism. It is shown that Annotation Graphs and Lexicon Graphs are indeed related not only in their formalism and it is shown, what standards have to be applied to annotations to be usable for lexicon development

    Information retrieval and text mining technologies for chemistry

    Get PDF
    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo Garciá -Yoldi for useful feedback and discussions during the preparation of the manuscript.info:eu-repo/semantics/publishedVersio

    Open-source resources and standards for Arabic word structure analysis: Fine grained morphological analysis of Arabic text corpora

    Get PDF
    Morphological analyzers are preprocessors for text analysis. Many Text Analytics applications need them to perform their tasks. The aim of this thesis is to develop standards, tools and resources that widen the scope of Arabic word structure analysis - particularly morphological analysis, to process Arabic text corpora of different domains, formats and genres, of both vowelized and non-vowelized text. We want to morphologically tag our Arabic Corpus, but evaluation of existing morphological analyzers has highlighted shortcomings and shown that more research is required. Tag-assignment is significantly more complex for Arabic than for many languages. The morphological analyzer should add the appropriate linguistic information to each part or morpheme of the word (proclitic, prefix, stem, suffix and enclitic); in effect, instead of a tag for a word, we need a subtag for each part. Very fine-grained distinctions may cause problems for automatic morphosyntactic analysis – particularly probabilistic taggers which require training data, if some words can change grammatical tag depending on function and context; on the other hand, finegrained distinctions may actually help to disambiguate other words in the local context. The SALMA – Tagger is a fine grained morphological analyzer which is mainly depends on linguistic information extracted from traditional Arabic grammar books and prior knowledge broad-coverage lexical resources; the SALMA – ABCLexicon. More fine-grained tag sets may be more appropriate for some tasks. The SALMA –Tag Set is a theory standard for encoding, which captures long-established traditional fine-grained morphological features of Arabic, in a notation format intended to be compact yet transparent. The SALMA – Tagger has been used to lemmatize the 176-million words Arabic Internet Corpus. It has been proposed as a language-engineering toolkit for Arabic lexicography and for phonetically annotating the Qur’an by syllable and primary stress information, as well as, fine-grained morphological tagging

    A practical approach to the standardisation and elaboration of Zulu as a technical language

    Get PDF
    The lack of terminology in Zulu can be overcome if it is developed to meet international scientific and technical demands. This lack of terminology can be traced back to the absence of proper language policy implementation with regard to the African languages. Even though Zulu possesses the basic elements that are necessary for its development, such as orthographical standards, dictionaries, grammars and published literature, a number of problems exist within the technical elaboration and standardisation processes: * Inconsistencies in the application of standard rules, in relation to both orthography and terminology. * The lack of standardisation of the (technical) word-formation patterns in Zulu. (Generally the role of culture in elaboration has largely been overlooked). * The avoidance of exploiting written technical text corpora as a resource for terminology. (Text encoding by means of corpus query tools in term extraction has just begun in Zulu and needs to be properly exemplified). * The avoidance of introducing oral technical corpora as a resource for improving the acceptability of technical terminology by, for instance, designing a type of reusable corpus annotation. This study contributes towards solving these problems by offering a practical approach within the context of the real written, standard and oral Zulu language, mainly within the medical terminological domain. This approach offers a reusable methodological foundation with proper language exemplification that can guide terminologists in terminological research, or to some extent even train them, to achieve effective technical elaboration and eventual standardisation. This thesis aims at attaining consistent standardisation on the orthographical level in order to ease the elaboration task of the terminologist. It also aims at standardising the methods of word- (term-) formation linking them to cultural factors, such as taboo. However, this thesis also emphasises the significance of using written and oral technical corpora as terminology resource. This, for instance, is made possible through the application of corpus linguistics, in semi-automatic term extraction from a written technical corpus to aid lemmatisation (listing entries) and in corpus annotation to improve the acceptability of terminology, based on the comparison of standard terms with oral terms.Linguistics and Modern LanguagesD. Litt et Phil. (Linguistics

    Applied and Computational Linguistics

    Get PDF
    Розглядається сучасний стан прикладної та комп’ютерної лінгвістики, проаналізовано лінгвістичні теорії 20-го – початку 21-го століть під кутом розмежування різних аспектів мови з метою формалізованого опису у електронних лінгвістичних ресурсах. Запропоновано критичний огляд таких актуальних проблем прикладної (комп’ютерної) лінгвістики як укладання комп’ютерних лексиконів та електронних текстових корпусів, автоматична обробка природної мови, автоматичний синтез та розпізнавання мовлення, машинний переклад, створення інтелектуальних роботів, здатних сприймати інформацію природною мовою. Для студентів та аспірантів гуманітарного профілю, науково-педагогічних працівників вищих навчальних закладів України

    Philosophy: digital humanities

    Get PDF
    The textbook supplements the lecture material with topical issues of philosophy in the field of social semiotics and digital linguistics. The technological trends in the evolution of convergent structures of digital ecosystems are described. The evolution of social semiotics in conjunction with digital linguistics is analyzed

    Short message service normalization for communication with a health information system

    Get PDF
    Philosophiae Doctor - PhDShort Message Service (SMS) is one of the most popularly used services for communication between mobile phone users. In recent times it has also been proposed as a means for information access. However, there are several challenges to be overcome in order to process an SMS, especially when it is used as a query in an information retrieval system.SMS users often tend deliberately to use compacted and grammatically incorrect writing that makes the message difficult to process with conventional information retrieval systems. To overcome this, a pre-processing step known as normalization is required. In this thesis an investigation of SMS normalization algorithms is carried out. To this end,studies have been conducted into the design of algorithms for translating and normalizing SMS text. Character-based, unsupervised and rule-based techniques are presented. An investigation was also undertaken into the design and development of a system for information access via SMS. A specific system was designed to access information related to a Frequently Asked Questions (FAQ) database in healthcare, using a case study. This study secures SMS communication, especially for healthcare information systems. The proposed technique is to encipher the messages using the secure shell (SSH) protocol
    corecore