
    Reflexive pronouns in Spanish Universal Dependencies

    In this paper, we argue that the annotation of Spanish reflexives in current Universal Dependencies treebanks is an unsolved problem that clearly affects the accuracy and consistency of current parsers. We evaluate different proposals for fine-tuning the various categories and discuss the remaining open issues. We believe that the solution could lie in a multi-layered annotation scheme that combines the dependency relation with the so-called token features, rather than in expanding the number of categories on a single layer. We apply this proposal to the v2.5 Spanish UD AnCora treebank and provide a categorized conversion table that can be run with a Python script.
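    A minimal sketch of how such a table-driven conversion could be applied to CoNLL-U data; the lemma/deprel pairs, target labels and file names below are illustrative placeholders, not the categories actually proposed in the paper.

```python
# Hypothetical sketch: apply a (lemma, deprel) -> (new deprel, feature)
# conversion table to reflexive "se" tokens in a CoNLL-U file.
# The table entries are placeholders, not the paper's actual categories.
CONVERSION = {
    ("se", "obj"):  ("expl:pv",   "Reflex=Yes"),
    ("se", "iobj"): ("expl:pass", "Reflex=Yes"),
}

def convert_line(line: str) -> str:
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:  # skip comments and blank lines between sentences
        return line
    lemma, feats, deprel = cols[2], cols[5], cols[7]
    new = CONVERSION.get((lemma.lower(), deprel))
    if new:
        cols[7] = new[0]
        # real UD validation would also require alphabetically ordered
        # features; omitted here for brevity
        cols[5] = new[1] if feats == "_" else feats + "|" + new[1]
    return "\t".join(cols) + "\n"

with open("es_ancora-ud.conllu") as src, open("converted.conllu", "w") as dst:
    dst.writelines(convert_line(l) for l in src)
```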

    Polarity of opinions about a public person in Ecuador

    The present investigation studies opinion mining techniques, focused on obtaining information about a public figure in Ecuador and determining the polarity of opinions about their management as positive, negative or neutral, a result that allows the public figure to make decisions about their actions based on an image of service to the community. The extraction of opinions from social networks, combined with techniques based on Human Language Technologies, enabled the interpretation of polarized data, specifying parameters of relevance for the resulting opinion focused on decision making, a process that adapts to the new communication formats and achieves the interpretation and assessment of opinion. Social networks were the platform for the capture of texts by means of an API; after natural language processing, the system obtained indicators of the popularity of the public figure.
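    As a rough illustration of the polarity step, a minimal lexicon-based sketch, assuming the posts have already been captured through the platform's API; the tiny word lists and example posts are invented, and a real system would use a full Spanish sentiment lexicon and proper tokenization.

```python
from collections import Counter

# Toy lexicon-based polarity classifier (illustrative word lists only).
POSITIVE = {"excelente", "bueno", "apoyo", "gracias"}
NEGATIVE = {"malo", "corrupto", "rechazo", "pesimo"}

def polarity(text: str) -> str:
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# In the real pipeline, posts would come from the social network's API.
posts = ["Excelente gestión, gracias", "Un político corrupto y malo"]
print(Counter(polarity(p) for p in posts))
# Counter({'positive': 1, 'negative': 1})
```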

    Spanish word segmentation through neural language models

    On microblogging platforms, special tokens such as hashtags and mentions abound, in which several words are written together without spacing between them, e.g. #leapyear or @ryanreynoldsnet. Due to the way these texts are written, this word-assembly phenomenon can appear together with its opposite, word segmentation, affecting any token of the text and making analysis more difficult. In this work we present an algorithmic approach that uses a language model as its basis, in our particular case a neural one, to solve the problem of word segmentation and assembly, in which we try to recover the standard spacing of the words that have undergone one of these transformations by adding or deleting spaces where appropriate. The results obtained are promising and indicate that, after further refinement of the language model, it will be possible to surpass the state of the art. This work has been partially funded by the Spanish Ministry of Economy and Competitiveness through projects FFI2014-51978-C2-1-R and FFI2014-51978-C2-2-R, and by the Xunta de Galicia through the Oportunius programme.
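    A minimal sketch of the segmentation side of the task: dynamic programming over candidate splits, with a toy unigram table standing in for the neural language model described above (the vocabulary and probabilities are invented).

```python
import math

# Toy stand-in for the neural language model: log-probabilities of words.
LOGP = {"leap": math.log(1e-4), "year": math.log(1e-4),
        "leapyear": math.log(1e-9)}
UNK = math.log(1e-12)  # heavy penalty for out-of-vocabulary chunks

def segment(s: str) -> list[str]:
    # best[i] = (best log-probability, best word list) for the prefix s[:i]
    best = [(0.0, [])]
    for i in range(1, len(s) + 1):
        best.append(max(
            (best[j][0] + LOGP.get(s[j:i], UNK), best[j][1] + [s[j:i]])
            for j in range(i)))
    return best[-1][1]

print(segment("leapyear"))  # ['leap', 'year']
```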

    Evaluating machine translation in a low-resource language combination: Spanish-Galician

    This paper reports the results of a study designed to assess the perceived adequacy of three different types of machine translation systems within the context of a minoritized language combination (Spanish-Galician). To perform this evaluation, a mixed design with three different metrics (BLEU, a survey and error analysis) is used to extract quantitative and qualitative data about two marketing letters from the energy industry translated with a rule-based system (RBMT), a phrase-based system (PBMT) and a neural system (NMT). Results show that, in the case of low-resource languages, rule-based and phrase-based machine translation systems still play an important role.
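    For the BLEU component of such a mixed design, a sketch using the sacrebleu library; the Galician segments below are invented placeholders, not the study's marketing letters.

```python
import sacrebleu  # pip install sacrebleu

# Corpus-level BLEU for each system's output against one reference.
# All segments here are invented examples, not the study's data.
reference = ["Benvida á súa nova tarifa eléctrica."]
outputs = {
    "RBMT": ["Benvida a a súa nova tarifa eléctrica."],
    "PBMT": ["Benvida á súa nova tarifa da luz."],
    "NMT":  ["Benvida á túa nova tarifa eléctrica."],
}
for system, hypothesis in outputs.items():
    bleu = sacrebleu.corpus_bleu(hypothesis, [reference])
    print(f"{system}: BLEU = {bleu.score:.1f}")
```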

    A Supervised Approach for Sentiment Lexicon Generation using Word Skipgrams

    This Ph.D. thesis proposes the design, development and evaluation of a supervised approach for sentiment lexicon generation. It is based on the hypothesis that an efficient use of skipgram modelling can improve sentiment analysis tasks and reduce the resources needed while maintaining an acceptable level of quality. In summary, the novelty of this approach lies in the use of skipgrams as information units and in the way they are efficiently generated, weighted and filtered, taking advantage of the useful information they provide about the sequentiality of language. This research work has been supported by TRIVIAL (PID2021-122263OB-C22), funded by MCIN/AEI/10.13039/501100011033, by the "European Union Regional Development Fund (ERDF) A way of making Europe", and by the "European Union NextGenerationEU/PRTR".
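    A generic sketch of the information units involved: k-skip-n-grams, i.e. n-grams that may skip up to k tokens and therefore capture non-contiguous sequential patterns. This is the standard construction, not necessarily the thesis's efficient generation, weighting and filtering scheme.

```python
from itertools import combinations

# k-skip-n-grams: n-grams anchored at each token that may skip up to k
# tokens in total, so non-contiguous patterns like "not ... good" appear.
def skipgrams(tokens, n, k):
    grams = set()
    for i in range(len(tokens) - n + 1):
        window = tokens[i : i + n + k]
        for combo in combinations(range(1, len(window)), n - 1):
            grams.add((window[0],) + tuple(window[j] for j in combo))
    return grams

print(sorted(skipgrams("not very good at all".split(), n=2, k=1)))
# includes ('not', 'very') and the 1-skip pair ('not', 'good')
```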

    A classification of Spanish psychological verbs

    The present paper is set within the context of the research currently being carried out in the field of Computational Lexicography at the University of Barcelona Linguistics Department, in collaboration with the University of Maryland Computer Science Department, under the provisional name PIRAPIDES. The research deals with the study of verbal diathesis, subcategorization frames, S-grids and the definition of a typology of S-roles apt for the description of argumental structure.

    Procesador automático de informes médicos

    Accessing and exchanging information is vital in medical settings, both in research and in healthcare management. Most of this information is contained in clinical reports written as natural language free text and therefore cannot easily be processed by automatic systems. This document describes our final degree project, "Procesador automático de informes médicos", whose objective is the creation of a medical concept extraction system that maps texts to SNOMED CT, a standard reference terminology. Moreover, to prepare the text for concept detection, several other tasks are performed first: spelling correction, acronym detection and disambiguation, and negation detection. In order to build the different parts of the application, we have applied natural language processing techniques to clinical reports in Spanish. This poses a challenge, given that most of the work in this field deals with texts in English and the available resources for Spanish are rather limited. The previously described tasks are integrated in a tool that automatically processes medical reports, generates a conceptual representation of their contents, and serves as an example of a useful application for managing clinical reports in healthcare and research settings. Furthermore, we have built two auxiliary systems to measure the effectiveness of our tool, which allow users to manually tag reports to build an annotated corpus and to use that corpus to evaluate the results of the automatic processing.
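    A much-simplified, NegEx-style sketch of the negation-detection step that precedes concept mapping: a concept is marked as negated when a negation trigger appears within the two preceding tokens. The trigger list is illustrative, and the term-to-code table is a stand-in for a real SNOMED CT lookup.

```python
import re

# Minimal NegEx-style negation check before SNOMED CT concept mapping.
# Triggers and the term-to-code table are illustrative stand-ins.
TRIGGERS = {"no", "sin", "niega"}
CONCEPTS = {"fiebre": "386661006", "tos": "49727002"}

def extract(report: str):
    text = report.lower()
    found = []
    for term, code in CONCEPTS.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            window = text[: m.start()].split()[-2:]  # two preceding tokens
            found.append((term, code, any(t in TRIGGERS for t in window)))
    return found

print(extract("Paciente sin fiebre, presenta tos persistente."))
# [('fiebre', '386661006', True), ('tos', '49727002', False)]
```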

    Generación automática de resúmenes abstractivos mono documento utilizando análisis semántico y del discurso

    The web is a giant resource of data and information about security, health, education and other matters of great utility to people, but producing a synthesis or abstract of one or many documents is expensive labor that may be impossible to carry out manually given the huge amount of data. Abstract generation is a challenging task because it involves the analysis and comprehension of text written in unstructured, context-dependent natural language, and the result must describe a synthesis of events or knowledge in a simple form that reads naturally. There are diverse approaches to summarization, categorized as extractive or abstractive. In the extractive approach, summaries are generated by selecting salient sentences from the source text. Abstractive summaries, in contrast, are created by regenerating the content extracted from the source text: phrases are reformulated through fusion, compression or suppression of terms, yielding paraphrased sentences or even sentences that were not in the original text. This type of summary is more likely to achieve the coherence and fluency of a summary generated by a human being. The present work implements a method that integrates syntactic, semantic (AMR annotation) and discursive (RST) information into a conceptual graph, which is then summarized using a new measure of concept similarity over WordNet. To find the most relevant concepts we use PageRank, considering the discursive information obtained by applying the O'Donnell method. With the most important concepts and the semantic role information obtained from PropBank, a natural language generation step was implemented with the SimpleNLG tool. We report the results of applying this method to the corpus of the Document Understanding Conference 2002, evaluated with the ROUGE metric, which is widely used in automatic summarization. Our method reaches an F1 of 24% on the ROUGE-1 metric for the mono-document abstract generation task, showing that these techniques are workable and suggesting profitable configurations and useful tools for the task.
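    Two of the building blocks above can be sketched with standard libraries: PageRank over a concept graph via networkx, and a WordNet similarity via NLTK. The graph below is a toy example rather than the AMR/RST-derived graph, and Wu-Palmer similarity stands in for the paper's new similarity measure.

```python
import networkx as nx
from nltk.corpus import wordnet as wn  # needs nltk.download("wordnet")

# Toy concept graph; the real one is built from AMR and RST annotations.
g = nx.DiGraph()
g.add_edges_from([("summary", "document"), ("document", "sentence"),
                  ("sentence", "summary"), ("sentence", "word")])
ranks = nx.pagerank(g)
print(max(ranks, key=ranks.get))  # most central concept of the toy graph

# Wu-Palmer similarity as one concrete WordNet similarity measure,
# standing in for the paper's new concept-similarity measure.
dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
print(dog.wup_similarity(cat))  # ~0.86
```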

    Natural Language Generation: Revision of the State of the Art

    Language is one of the highest cognitive skills developed by human beings and, therefore, one of the most complex tasks to be faced from the computational perspective. Human-computer communication implies two different degrees of difficulty depending on the nature of that communication. If the language used is oriented towards the domain of the machine, there is no room for ambiguity, since it is restricted by rules. However, when the communication is in natural language, its flexibility and ambiguity become unavoidable. Computational Linguistics techniques are mandatory for machines to process human language. Among them, the area of Natural Language Generation (NLG) aims at the automatic development of techniques to produce human utterances, text and speech. This paper presents a deep survey of this research area, taking into account different points of view about theories, methodologies, architectures, techniques and evaluation approaches, thus providing a review of the current situation and possible future research in the field. It describes the phases into which NLG systems are usually decomposed together with the techniques applied at each one, and analyzes in detail the current state of the area and its open problems, as well as the most relevant resources and the techniques being used to evaluate the quality of the systems. This research has been funded by the Generalitat Valenciana through the project DIIM2.0: Desarrollo de técnicas Inteligentes e Interactivas de Minería y generación de información sobre la web 2.0 (PROMETEOII/2014/001). It has also been partially funded by the European Commission through the SAM project (FP7-611312); by the Spanish Ministry of Economy and Competitiveness through the projects "Análisis de Tendencias Mediante Técnicas de Opinión Semántica" (TIN2012-38536-C03-03) and "Técnicas de Deconstrucción en la Tecnología del Lenguaje Humano" (TIN2012-31224); and by the University of Alicante through the project "Explotación y tratamiento de la información disponible en Internet para la anotación y generación de textos adaptados al usuario" (GRE13-15).
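    As an illustration of the phase decomposition discussed in the survey, a toy three-stage pipeline (document planning, microplanning, surface realization) with deliberately trivial placeholder implementations.

```python
# Toy three-stage NLG pipeline; each stage is a trivial placeholder.
def document_planning(data: dict) -> list[tuple]:
    # decide which messages to convey and in which order
    return [("temperature", data["city"], data["temp"])]

def microplanning(messages: list[tuple]) -> list[dict]:
    # choose words and sentence structure for each message
    return [{"subj": f"The temperature in {city}", "verb": "is",
             "obj": f"{temp} degrees"} for _, city, temp in messages]

def realization(specs: list[dict]) -> str:
    # produce the final surface string
    return " ".join(f"{s['subj']} {s['verb']} {s['obj']}." for s in specs)

plan = document_planning({"city": "Alicante", "temp": 21})
print(realization(microplanning(plan)))
# The temperature in Alicante is 21 degrees.
```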

    Towards Syntactic Iberian Polarity Classification

    Lexicon-based methods using syntactic rules for polarity classification rely on parsers that are dependent on the language and on treebank guidelines. Thus, the rules are dependent as well and require adaptation, especially in multilingual scenarios. We tackle this challenge in the context of the Iberian Peninsula, releasing the first symbolic syntax-based Iberian system with rules shared across five official languages: Basque, Catalan, Galician, Portuguese and Spanish. The model is made available. Comment: 7 pages, 5 tables. Contribution to the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA-2017) at EMNLP 2017.
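    A toy sketch of how one syntactic rule can be shared across languages while only the lexical cues change: flip a lexicon word's polarity when it has a negation dependent. The hard-coded (head, deprel, dependent) triples stand in for real parser output.

```python
# One shared rule, per-language lexical cues; parses are hard-coded
# (head, deprel, dependent) triples instead of real parser output.
LEXICON = {"bo": 1, "bueno": 1, "bom": 1}   # "good" in gl/es/pt
NEGATORS = {"non", "no", "não", "ez"}       # gl / es-ca / pt / eu negation cues
PARSES = {
    "gl": [("bo", "advmod", "non")],        # "non é bo"
    "es": [("bueno", "advmod", "no")],      # "no es bueno"
    "pt": [("bom", "advmod", "não")],       # "não é bom"
}

def score(parse) -> int:
    total = 0
    for head, rel, dep in parse:
        if head in LEXICON:
            negated = rel == "advmod" and dep in NEGATORS
            total += -LEXICON[head] if negated else LEXICON[head]
    return total

for lang, parse in PARSES.items():
    print(lang, score(parse))  # -1 for each language: one rule, shared
```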