75 research outputs found

    Towards a Distant Reading of the Golden Ages Hendecasyllable: Metrical Patterns, Frequencies and Historical Development

    This paper presents an analysis of the hendecasyllable meter in Spanish Golden Age sonnets. A macroanalysis, or computer-based “distant reading”, approach is applied to a corpus of more than 70,000 hendecasyllables. Based on a formal definition of metrical pattern, I analyze the most frequent metrical patterns and their historical development. The results are not yet conclusive, but they show the main metrical preferences of the different authors and how these vary over the 16th and 17th centuries.
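The frequency analysis described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the string encoding of a metrical pattern ("+" for a stressed syllable, "-" for an unstressed one), the function name `pattern_frequencies` and the toy input are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical encoding (an assumption, not the paper's format):
# one character per metrical syllable, "+" = stressed, "-" = unstressed.
def pattern_frequencies(patterns):
    """Return (pattern, relative frequency) pairs, most frequent first."""
    counts = Counter(patterns)
    total = sum(counts.values())
    return [(pattern, n / total) for pattern, n in counts.most_common()]

# Toy input for illustration only; these are not data from the corpus.
sample = [
    "-+---+---+-",
    "-+---+---+-",
    "---+---+-+-",
    "-+-+-+-+-+-",
]
ranked = pattern_frequencies(sample)
```

Grouping the input lines by author or by period before calling the function would give the kind of per-author or per-century comparison the abstract mentions.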

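    Ordenación de eventos multidocumento usando inferencia de relaciones temporales y modelos semánticos distribucionales [Cross-document event ordering using temporal relation inference and distributional semantic models]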

    This paper examines the contribution of temporal relation inference and distributional semantic models to the event ordering task. Our system automatically builds ordered timelines of events from different written texts in English by performing temporal clustering first and semantic clustering second. To determine temporal compatibility, inference is applied over the temporal relations between events, which are extracted automatically by a Temporal Information Processing system. For semantic compatibility between events, we analyze two distributional semantic models: LDA topic modeling and Word2Vec word embeddings. Both semantic models, together with the temporal inference, have been evaluated within the framework of SemEval 2015 Task 4 Track B. Experiments show that using both models improves on the current state of the art, a significant advance in the cross-document event ordering task. This paper has been partially supported by the Spanish government, project TIN2015-65100-R, project TIN2015-65136-C2-2-R and PROMETEOII/2014/001.

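    Hacia un análisis distante del endecasílabo áureo: patrones métricos, frecuencias y evolución histórica [Towards a distant reading of the Golden Age hendecasyllable: metrical patterns, frequencies and historical development]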

    This paper presents an analysis of the hendecasyllable meter in Spanish Golden Age sonnets. A macroanalysis, or computer-based “distant reading”, approach is applied to a corpus of more than 70,000 hendecasyllables. Based on a formal definition of metrical pattern, I analyze the most frequent metrical patterns and their historical development. The results are not yet conclusive, but they show the main metrical preferences of the different authors and how these vary over the 16th and 17th centuries.

    Enriched Digital Edition: a Multilevel Annotation Model for Golden-Age Spanish Poetry

    This chapter presents a general model for the multilevel annotation of corpora of literary text. Multilevel refers to the combination, in a single corpus, of information from different levels of linguistic or literary description, from data related to words or syllables to thematic, textual or pragmatic questions. The ultimate aim of such a corpus is to record a possible literary analysis, which is why it is regarded as an enriched digital edition. Four characteristics that a corpus of literary text should have are defended: interoperability, perspectivism, unity and clarity/simplicity. The main formalization problems of a multilevel corpus of this kind are discussed: the combination of different representation formalisms and, in the case of XML, improper nesting. Finally, a model for a corpus of Spanish Golden Age poetry is proposed. Work partially funded by the Ministry of Science and Innovation through the project “CORTEX: Conscious Text Generation” (PID2021-123956OB-I00): MCIN/AEI/10.13039/501100011033/ and “ERDF A way of making Europe”; and by the Generalitat Valenciana (Conselleria d’Educació, Investigació, Cultura i Esport) through the project NL4DISMIS: Natural Language Technologies for dealing with dis- and misinformation (CIPROM/2021/021).

    On Poetic Topic Modeling: Extracting Themes and Motifs From a Corpus of Spanish Poetry

    This paper analyzes the application of LDA topic modeling to a corpus of poetry. First, it explains how the most coherent LDA topics were established by running several tests and automatically evaluating the coherence of the resulting topics. The results show, on the one hand, that lemmatization is not advisable for a corpus of poetry because several poetic features are lost in the process and, on the other hand, that a standard LDA algorithm performs better than a version of LDA designed for short texts (LF-LDA). The resulting LDA topics were then manually analyzed in order to define the relation between word topics and poems. The analysis shows two main kinds of semantic relation: an LDA topic can represent the subject or theme of the poem, but it can also represent a poetic motif. All these analyses were carried out on a large corpus of Golden Age Spanish sonnets. Finally, the paper presents the most relevant themes and motifs in this corpus, such as “love,” “religion,” “heroics,” “moral,” or “mockery” on the one hand, and “rhyme,” “marine,” “music,” or “painting” on the other. This work was supported by the BBVA Foundation: grants for research groups 2016, project Distant Reading Approach to Golden Age Spanish Sonnets (Ayudas fundación BBVA a equipos de investigación científica, proyecto Análisis distante de base computacional del soneto castellano del Siglo de Oro): http://adso.gplsi.es. It was also partially conducted in the context of the COST Action Distant Reading for European Literary History (CA16204 - Distant-Reading): www.distant-reading.net

    The Simplification of the Language of Public Administration: The Case of Ombudsman Institutions

    The language produced by public administrations has crucial implications for citizens’ lives. However, its syntactic complexity and its use of legal jargon, among other factors, make it difficult for laypeople and certain target audiences to understand. The NLP task of Automatic Text Simplification (ATS) can support the necessary simplification of this technical language. For that purpose, specialized parallel datasets of complex-simple pairs need to be developed to train ATS systems. This position paper presents an ongoing project whose main objectives are (a) to analyze extensively the syntactic, lexical, and discursive features of the language of English-speaking ombudsmen, as samples of public administrative language, with special attention to the characteristics that threaten comprehension, and (b) to develop the OmbudsCorpus, a parallel corpus of complex-simple supra-sentential fragments from ombudsmen’s case reports that have been manually simplified by professionals and annotated with standardized simplification operations. This research aims to provide a deeper understanding of the simplification process and to improve the training of ATS systems specialized in administrative texts. This paper has been partially funded by the Spanish Government through the R&D projects “CORTEX: Conscious Text Generation” (PID2021-123956OB-I00, funded by MCIN/AEI/10.13039/501100011033/ and by “ERDF A way of making Europe”) and “CLEAR.TEXT: Enhancing the modernization public sector organizations by deploying Natural Language Processing to make their digital content CLEARER to those with cognitive disabilities” (TED2021-130707B-I00), and by the Generalitat Valenciana through the project “NL4DISMIS: Natural Language Technologies for dealing with dis- and misinformation with grant reference (CIPROM/2021/21)”

    An approach to the recommendation of scientific articles according to their degree of specificity

    This article presents a method for recommending scientific articles according to their degree of generality or specificity. The approach rests on the idea that people with less expertise in a topic prefer to read more general articles as an introduction to it, while people with more expertise prefer more specific articles. Unlike other recommendation techniques, which focus on the analysis of user profiles, our proposal is based purely on content analysis. We present two methods for recommending articles, both based on Topic Modelling. The first is based on the divergence of the topics found in the documents, while the second uses the similarity between these topics. With both measures it was possible to determine how general or specific an article is for recommendation purposes, and in both cases the results outperformed those of a traditional information retrieval system. This work has been partially funded by the following projects: ATTOS (TIN2012-38536-C03-03), LEGOLANG-UAGE (TIN2012-31224), FIRST (FP7-287607), DIIM2.0 (PROMETEOII/2014/001) and by the Programa Nacional de Movilidad de Recursos Humanos del Plan Nacional de I+D+i (CAS12/00113).

    Metrical Annotation of a Large Corpus of Spanish Sonnets: Representation, Scansion and Evaluation

    In order to analyze metrical and semantic aspects of poetry in Spanish with computational techniques, we have developed a large corpus annotated with metrical information. In this paper we present and discuss the development of this corpus: the formal representation of metrical patterns, the semi-automatic annotation process based on a new automatic scansion system, the main annotation problems, and the evaluation, in which an inter-annotator agreement of 96% was obtained. The corpus is open and available.
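As a rough illustration of what an agreement figure of this kind can mean, the sketch below computes simple observed agreement between two annotators. The abstract does not say which agreement measure was used, so treating it as the percentage of lines given the same pattern is an assumption, as are the function name `percent_agreement` and the pattern encoding ("+" stressed, "-" unstressed).

```python
def percent_agreement(annotator_a, annotator_b):
    """Percentage of items to which both annotators gave the same label.

    This is plain observed agreement (an assumed measure); it does not
    correct for chance agreement the way Cohen's kappa does.
    """
    if len(annotator_a) != len(annotator_b):
        raise ValueError("both annotators must label the same items")
    if not annotator_a:
        raise ValueError("at least one annotated item is required")
    matches = sum(a == b for a, b in zip(annotator_a, annotator_b))
    return 100.0 * matches / len(annotator_a)

# Toy annotations for illustration only; not data from the corpus.
first = ["-+---+---+-", "---+---+-+-", "-+-+-+-+-+-", "-+---+---+-"]
second = ["-+---+---+-", "---+---+-+-", "-+-+---+-+-", "-+---+---+-"]
agreement = percent_agreement(first, second)
```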

    Cross-document event ordering through temporal, lexical and distributional knowledge

    In this paper we present a system that automatically builds ordered timelines of events from different written texts in English. The system deals with problems such as automatic event extraction, cross-document temporal relation extraction and cross-document event coreference resolution. Its main characteristic is the application of three different types of knowledge (temporal, lexical-semantic and distributional-semantic) in order to anchor and order the events in the timeline. It has been evaluated within the framework of SemEval 2015. The proposed system improves on the current state-of-the-art systems in all measures (by up to eight points of F1-score over other systems) and represents a significant advance in the cross-document event ordering task. This paper has been partially supported by the Spanish government, project TIN2015-65100-R and project TIN2015-65136-C2-2-R.