
    Insight into the Automatic Extraction of Metaphorical Collocations

    Collocations have been the subject of much scientific research over the years. The focus of this research is on a subset of collocations, namely metaphorical collocations, in which a semantic shift has taken place in one of the components, i.e., one of the components takes on a transferred meaning. The main goal of this paper is to review the existing literature and provide a systematic overview of existing research on collocation extraction, as well as an overview of existing methods, measures, and resources. The existing research is classified according to approach (statistical, hybrid, and based on distributional semantics) and presented in three separate sections. Different association measures and existing ways of evaluating the results of automatic collocation extraction are also described. The insights gained from existing research serve as a first step in exploring the possibility of developing a method for the automatic extraction of metaphorical collocations. The methods, tools, and resources that may prove useful for future work are highlighted.
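
    To make the statistical approach concrete, the sketch below ranks adjacent bigrams by pointwise mutual information (PMI), one of the classic association measures discussed in this line of work. It is a minimal illustration, not the paper's method; the toy corpus and the frequency cutoff are invented for the example.

```python
import math
from collections import Counter

def pmi_collocations(tokens, min_count=2):
    """Rank adjacent bigrams by pointwise mutual information."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = []
    for (w1, w2), c12 in bigrams.items():
        if c12 < min_count:
            continue  # frequency cutoff: PMI is unstable for rare pairs
        p12 = c12 / (n - 1)                           # bigram probability
        p1, p2 = unigrams[w1] / n, unigrams[w2] / n   # unigram probabilities
        scored.append(((w1, w2), math.log2(p12 / (p1 * p2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

tokens = "heavy rain fell while heavy rain kept falling on heavy rain".split()
print(pmi_collocations(tokens))  # "heavy rain" scores highest
```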

    Design of a Controlled Language for Critical Infrastructures Protection

    We describe a project for the construction of a controlled language for critical infrastructures protection (CIP). The project originates from the need to coordinate and categorize communications on CIP at the European level. These communications can be physically represented by official documents, reports on incidents, informal communications and plain e-mail. We explore the application of traditional library science tools for the construction of controlled languages in order to achieve our goal. Our starting point is an analogous work done during the sixties in the field of nuclear science, known as the Euratom Thesaurus.
    JRC.G.6 - Security technology assessment
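
    As background for the library-science tooling mentioned above, the following sketch models the standard thesaurus relations (BT = broader term, NT = narrower term, RT = related term, UF = used for) that controlled vocabularies of the Euratom Thesaurus type are built from. The entries and names are hypothetical, invented for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    term: str                                      # preferred term
    broader: list = field(default_factory=list)    # BT relations
    narrower: list = field(default_factory=list)   # NT relations
    related: list = field(default_factory=list)    # RT relations
    used_for: list = field(default_factory=list)   # UF: non-preferred synonyms

# hypothetical entries, not taken from the Euratom Thesaurus
thesaurus = {
    "critical infrastructure": ThesaurusEntry(
        term="critical infrastructure",
        narrower=["power grid", "water supply network"],
        related=["incident report"],
        used_for=["vital infrastructure"],
    ),
}

def resolve(term):
    """Map a non-preferred synonym to its preferred term."""
    for entry in thesaurus.values():
        if term == entry.term or term in entry.used_for:
            return entry.term
    return None

print(resolve("vital infrastructure"))  # -> "critical infrastructure"
```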

    Multidisciplinary analysis of the phenomenon of phraseological variation in translation and interpreting

    Special Issue 6 (2020). Multidisciplinary analysis of the phenomenon of phraseological variation in translation and interpreting. Pedro Mogorrón Huerta (Ed.)

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on its parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
    JRC.G.2 - Global security and crisis management
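
    The following is a hedged sketch of the expansion idea as we read it: take an existing phrase pair, substitute morphological variants of one word on each side, and keep candidates whose variants remain string-similar to the originals. The toy lexicon, the similarity function (a plain character-based ratio rather than the paper's morphosyntactically informed score), and the threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher

# toy morphological resource: lemma -> inflected forms
MORPH = {
    "eat": ["eat", "eats", "eaten", "eating"],
    "manger": ["mange", "manges", "mangé", "mangeant"],
}

def sim(a, b):
    """Character-based similarity in [0, 1]; stand-in for the paper's score."""
    return SequenceMatcher(None, a, b).ratio()

def expand_pair(src, tgt, src_form, src_lemma, tgt_form, tgt_lemma, threshold=0.6):
    """Yield new phrase pairs built from morphological variants of one word."""
    for sv in MORPH.get(src_lemma, []):
        for tv in MORPH.get(tgt_lemma, []):
            new_src = src.replace(src_form, sv)
            new_tgt = tgt.replace(tgt_form, tv)
            if (new_src, new_tgt) == (src, tgt):
                continue  # skip the original association
            score = min(sim(src, new_src), sim(tgt, new_tgt))
            if score >= threshold:
                yield new_src, new_tgt, round(score, 2)

for pair in expand_pair("he eats bread", "il mange du pain",
                        "eats", "eat", "mange", "manger"):
    print(pair)
```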

    Biomedical Term Extraction: NLP Techniques in Computational Medicine

    Artificial Intelligence (AI), and its branch Natural Language Processing (NLP) in particular, are main contributors to recent advances in classifying documentation and extracting information from assorted fields. Medicine has gathered much of this attention due to the amount of information generated in public professional journals and other means of communication within the medical profession. The typical information extraction task from technical texts is performed via an automatic term recognition extractor: Automatic Term Recognition (ATR) from technical texts is applied for the identification of key concepts for information retrieval and, secondarily, for machine translation. Term recognition depends on the subject domain and the lexical patterns of a given language; in our case, Spanish, Arabic and Japanese. In this article, we present the methods and techniques for creating a biomedical corpus of validated terms, along with several tools for the optimal exploitation of the information contained in that corpus. This paper also shows how these techniques and tools have been used in a prototype.
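
    A minimal sketch of the candidate-generation step common to ATR systems may help: scan POS-tagged text for ADJ* NOUN+ spans and rank the resulting multi-word candidates by frequency. Real systems, including the one described above, add domain filtering and statistical termhood measures (e.g. C-value); the tagged sample below is invented.

```python
from collections import Counter

tagged = [("chronic", "ADJ"), ("renal", "ADJ"), ("failure", "NOUN"),
          ("is", "VERB"), ("a", "DET"), ("renal", "ADJ"),
          ("failure", "NOUN"), ("syndrome", "NOUN")]

def candidate_terms(tagged_tokens):
    """Collect ADJ* NOUN+ spans as multi-word term candidates."""
    counts = Counter()
    i = 0
    while i < len(tagged_tokens):
        j = i
        while j < len(tagged_tokens) and tagged_tokens[j][1] == "ADJ":
            j += 1                       # consume optional adjectives
        k = j
        while k < len(tagged_tokens) and tagged_tokens[k][1] == "NOUN":
            k += 1                       # consume one or more nouns
        if k > j and k - i > 1:          # at least one noun, multi-word span
            counts[" ".join(w for w, _ in tagged_tokens[i:k])] += 1
            i = k
        else:
            i += 1
    return counts.most_common()

print(candidate_terms(tagged))
# [('chronic renal failure', 1), ('renal failure syndrome', 1)]
```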

    First International Workshop on Lexical Resources

    Lexical resources are one of the main sources of linguistic information for research and applications in Natural Language Processing and related fields. In recent years advances have been achieved in both symbolic aspects of lexical resource development (lexical formalisms, rule-based tools) and statistical techniques for the acquisition and enrichment of lexical resources, both monolingual and multilingual. The latter have allowed for the faster development of large-scale morphological, syntactic and/or semantic resources, for widely used as well as resource-scarce languages. Moreover, the notion of a dynamic lexicon is increasingly used to take into account the fact that the lexicon undergoes permanent evolution. This workshop aims at sketching a broad picture of the state of the art in the domain of lexical resource modelling and development. It is also dedicated to research on the application of lexical resources for improving corpus-based studies and language processing tools, both in NLP and in other language-related fields such as linguistics, translation studies, and didactics.

    Models to represent linguistic linked data

    As the interest of the Semantic Web and computational linguistics communities in linguistic linked data (LLD) keeps increasing and the number of contributions on LLD rapidly grows, scholars (and linguists in particular) interested in the development of LLD resources sometimes find it difficult to determine which mechanism is suitable for their needs and which challenges have already been addressed. This review presents the state of the art in the models, ontologies and extensions used to represent language resources as LLD, focusing on the nature of the linguistic content they aim to encode. Four basic groups of models are distinguished: models representing the main elements of lexical resources (group 1); vocabularies developed as extensions to models in group 1, together with ontologies that provide more granularity on specific levels of linguistic analysis (group 2); catalogues of linguistic data categories (group 3); and other models, such as corpora models or service-oriented ones (group 4). Contributions in these four groups are described, highlighting their reuse by the community and the modelling challenges still to be faced.
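
    As an illustration of what a group 1 model looks like in practice, the sketch below builds a minimal lexical entry with the OntoLex-Lemon vocabulary (a widely reused model of this group) using rdflib. The entry and ontology URIs are hypothetical; only the ontolex namespace is real.

```python
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")

g = Graph()
g.bind("ontolex", ONTOLEX)

# hypothetical URIs for a lexical entry and its canonical form
entry = URIRef("http://example.org/lexicon/cat")
form = URIRef("http://example.org/lexicon/cat#canonicalForm")

g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, ONTOLEX.canonicalForm, form))
g.add((form, RDF.type, ONTOLEX.Form))
g.add((form, ONTOLEX.writtenRep, Literal("cat", lang="en")))
# link the entry to an ontology concept (lexical sense omitted for brevity)
g.add((entry, ONTOLEX.denotes, URIRef("http://example.org/ontology/Cat")))

print(g.serialize(format="turtle"))
```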

    An Online Syntactic and Semantic Framework for Extracting Lexical Relations with a Deterministic Natural Language Model

    Given the extraordinary growth in online documents, methods for the automated extraction of semantic relations became popular and, shortly after, necessary. This thesis proposes a new deterministic language model, with an associated artifact, which acts as an online Syntactic and Semantic Framework (SSF) for the extraction of morphosyntactic and semantic relations, while also providing a range of other linguistic functions. The model covers all fundamental linguistic fields: morphology (formation, composition, and word paradigms), lexicography (storing words and their features in network lexicons), syntax (the composition of words into meaningful parts: phrases, sentences, and pragmatics), and semantics (determining the meaning of phrases). To achieve this, a new tagging system with more complex structures was developed. Instead of the commonly used vector-based systems, this new tagging system uses tree-like T-structures with hierarchical grammatical Word of Speech (WOS) and Semantic of Word (SOW) tags. For relation extraction, it was necessary to develop a syntactic (sub)model of the language, which ultimately serves as the foundation for semantic analysis. This was achieved by introducing a new 'O-structure', which represents the union of WOS/SOW features from the T-structures of individual words and enables the creation of syntagmatic patterns. Such patterns are a powerful mechanism for the extraction of conceptual structures (e.g., metonymies, similes, or metaphors), for breaking sentences into main and subordinate clauses, and for detecting a sentence's main construction parts (subject, predicate, and object). Since all program modules are developed as general and generative entities, the SSF can be used for any of the Indo-European languages, although validation and network lexicons have so far been developed only for Croatian. The SSF has three types of lexicons (morphs/syllables, words, and multi-word expressions), and the main word lexicon is included in the global Linguistic Linked Open Data (LLOD) cloud, allowing interoperability with all other world languages. The SSF model and its artifact represent a complete natural-language model which can be used to extract lexical relations from single sentences, paragraphs, and large collections of documents.
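
    The following is a speculative sketch, in ordinary Python terms, of the structures named in the abstract: tokens carrying hierarchical WOS/SOW tag paths (T-structures) and a syntagmatic pattern matched against those features. The tag inventories and the pattern are invented for illustration; the thesis defines its own formalism.

```python
from dataclasses import dataclass

@dataclass
class TStructure:
    word: str
    wos: tuple   # hierarchical grammatical path, e.g. ("noun", "common")
    sow: tuple   # hierarchical semantic path, e.g. ("entity", "abstract")

def matches(pattern, tokens):
    """Match a sequence of (wos_prefix, sow_prefix) constraints left to right."""
    if len(pattern) != len(tokens):
        return False
    return all(t.wos[:len(w)] == w and t.sow[:len(s)] == s
               for (w, s), t in zip(pattern, tokens))

# "vrijeme leti" = "time flies", a Croatian metaphor
sentence = [
    TStructure("vrijeme", ("noun", "common"), ("entity", "abstract")),
    TStructure("leti", ("verb", "main"), ("event", "motion")),
]
# invented syntagmatic pattern: abstract noun + motion verb
pattern = [(("noun",), ("entity", "abstract")), (("verb",), ("event", "motion"))]
print(matches(pattern, sentence))  # True -> conceptual-structure candidate
```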

    Computational Syntax of Hungarian: From Chunk Parsing to Verbal Subcategorization

    We present the creation of two resources for Hungarian NLP applications: a rule-based shallow parser and a database of verbal subcategorization frames. Hungarian, as a non-configurational language with a rich morphology, presents specific challenges for NLP at the level of morphological and syntactic processing. While efficient and precise morphological analyzers are already available, Hungarian is under-resourced with respect to syntactic analysis, and our work aims at overcoming this problem by providing resources for syntactic processing. Because grammatical functions in Hungarian are encoded non-configurationally, syntactic processing has to rely on morphological features rather than on constituent order. The broader interest of our undertaking is to propose representations and methods that are adapted to these specific characteristics while remaining in line with state-of-the-art research methodologies. More concretely, we attempt to adapt current results in argument realization and lexical semantics to the task of labeling sentence constituents according to their syntactic function and semantic role in Hungarian. Syntax and semantics are not completely independent modules in linguistic analysis and language processing: it has been known for decades that semantic properties of words affect their syntactic distribution. Within the syntax-semantics interface, the field of argument realization deals with the (partial or complete) prediction of verbal subcategorization from semantic properties. Research on verbal lexical semantics and semantically motivated mapping has concentrated on predicting the syntactic realization of arguments, taking for granted (either explicitly or implicitly) that the distinction between arguments and adjuncts is known, and that the syntactic realization of adjuncts is governed by productive syntactic rules rather than lexical properties. However, apart from the correlation between verbal aspect or Aktionsart and time adverbs (e.g. Vendler, 1967, or Kiefer, 1992 for Hungarian), the distribution of adjuncts among verbs or verb classes has not received significant attention, especially within the lexical semantics framework. We claim that, contrary to this widely shared presumption, adjuncts are often not fully productive. We therefore propose a gradual notion of productivity, defined in relation to Levin-type lexical semantic verb classes (Levin, 1993; Levin and Rappaport-Hovav, 2005). The definition we propose for the argument-adjunct dichotomy is based on evidence from Hungarian and exploits the idea that lexical semantics not only influences complement structure but is the key to the argument-adjunct distinction and to the realization of adjuncts.

    The work described in this thesis aims to fill this gap by creating resources for the syntactic processing of Hungarian: a chunk parser and a lexical database of verbal subcategorization frames. The first part of the research concentrates on building a shallow (chunk) parser for Hungarian, whose output is designed to serve as input for further processing that annotates the dependency relations between the predicate and its arguments and adjuncts; the parser is implemented in NooJ (Silberztein, 2004) as a cascade of grammars. The second research objective was to propose a lexical representation for argument structure in Hungarian, one able to handle the wide range of phenomena that escape the traditional argument-adjunct dichotomy (e.g. partially productive structures, or mismatches between syntactic and semantic predictability). Drawing on recent results in argument realization research, we chose a framework that meets our criteria and is adaptable to a non-configurational language, taking Levin's (1993) semantic classification as a model: we adapted its core notions, namely semantic components and syntactic alternations, as well as its methodology for exploring and describing predicate behaviour, to the task of building a lexical representation of verbs in a non-configurational language. The first step was to define coding rules and build a large lexical database of verbs and their complements. We then undertook two experiments to enrich this lexicon with lexical semantic information in order to formalize relevant syntactic and semantic generalizations over the underlying predicate classes. The first approach we tested was a manual classification of verbs according to their complement structure, with semantic roles assigned to those complements, seeking answers to the following questions: which semantic components are relevant for defining a semantic classification of Hungarian predicates? What syntactic implications are specific to these classes? And, more generally, what is the nature of the alternations specific to verb classes in Hungarian? In the final phase of the research, we studied the potential of automatic acquisition for extracting verb classes from corpora: we performed an unsupervised classification based on distributional data to obtain a relevant semantic classification of Hungarian verbs, and also tested the unsupervised method on French data.
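
    The final experiment lends itself to a short sketch: represent each verb by the relative frequencies of its complement frames and cluster the vectors. The verbs, frame inventory, counts, and the choice of k-means are illustrative assumptions, not the thesis setup.

```python
from sklearn.cluster import KMeans
import numpy as np

verbs = ["ad", "mond", "fut", "szalad"]         # give, say, run, dash
frames = ["NOM_ACC_DAT", "NOM_ACC", "NOM_DIR"]  # invented case-frame features
# rows: verbs; columns: relative frequency of each complement frame
X = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.8, 0.0],
              [0.0, 0.1, 0.9],
              [0.0, 0.0, 1.0]])

# unsupervised classification over distributional data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for verb, label in zip(verbs, labels):
    print(verb, "-> class", label)  # transfer verbs vs. motion verbs
```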

    The Future of Information Sciences: INFuture2009: Digital Resources and Knowledge Sharing
