6 research outputs found

    Resumen multidocumento utilizando teorías semántico-discursivas

    Automatic multi-document summarization aims to reduce the size of texts while preserving the most important content. In this paper, we propose some methods for automatic summarization based on two semantic discourse models: Rhetorical Structure Theory (RST) and Cross-document Structure Theory (CST). These models were chosen in order to properly address the relevance of information, multi-document phenomena, and subtopical distribution in the source texts. The results show that using semantic discourse knowledge for content selection improves the informativeness of automatic summaries.

    Generación de Texto a partir de AMR en Contexto de Bajos Recursos: Un Estudio para el Portugués Brasileño

    This work studies how helpful various strategies for low-resource AMR-to-text generation are across three approaches in Brazilian Portuguese. Specifically, we explore the helpfulness of an additional translated corpus, different granularity levels in the input representation, and three preprocessing steps. Results show that translation is useful, but it must be used differently in each approach. In addition, finer-grained representations such as characters and subwords improve performance and reduce the bias on the development set, and preprocessing steps are helpful in different contexts, with delexicalisation and preordering being the most important ones.
    The authors are grateful to CAPES and the Center for Artificial Intelligence (C4AI - http://c4ai.inova.usp.br/) of the University of São Paulo, sponsored by IBM and FAPESP (grant #2019/07665-4). This research was also carried out using the computational resources of the Center for Mathematical Sciences Applied to Industry (CeMEAI), funded by FAPESP (grant 2013/07375-0).

    Exploración de Métodos basados en Conocimiento Clásicos y Lingüísticamente Enriquecidos para Desambiguación del Sentido de los Verbos en Textos de Noticias del Portugués Brasileño

    Word Sense Disambiguation (WSD) aims to determine the appropriate sense of a word in a given context. This task is challenging and highly relevant for the Natural Language Processing community. However, there are few works on Portuguese word sense disambiguation, and some of these are domain oriented. In this paper, we report a study on general-purpose WSD methods for verbs in Brazilian Portuguese. This study is divided into three steps: (1) the sense annotation of a corpus, (2) the exploration of classical WSD methods, and (3) the incorporation of linguistic knowledge into some of these classical methods. Among the contributions, we emphasize the free availability of the sense-annotated corpus and the use of a verb-focused repository to support classical methods in a new way.
    Funding: CAPES, FAPESP, and Samsung Electrônica da Amazônia Ltda.

    NEOTROPICAL XENARTHRANS: a data set of occurrence of xenarthran species in the Neotropics

    Xenarthrans (anteaters, sloths, and armadillos) have essential functions for ecosystem maintenance, such as insect control and nutrient cycling, playing key roles as ecosystem engineers. Because of habitat loss and fragmentation, hunting pressure, and conflicts with domestic dogs, these species have been threatened locally, regionally, or even across their full distribution ranges. The Neotropics harbor 21 species of armadillos, 10 anteaters, and 6 sloths. Our data set includes the families Chlamyphoridae (13), Dasypodidae (7), Myrmecophagidae (3), Bradypodidae (4), and Megalonychidae (2). We have no occurrence data on Dasypus pilosus (Dasypodidae). Regarding Cyclopedidae, until recently, only one species was recognized, but new genetic studies have revealed that the group is represented by seven species. In this data paper, we compiled a total of 42,528 records of 31 species, represented by occurrence and quantitative data, totaling 24,847 unique georeferenced records. The geographic range is from the southern United States, Mexico, and Caribbean countries at the northern portion of the Neotropics, to the austral distribution in Argentina, Paraguay, Chile, and Uruguay. Regarding anteaters, Myrmecophaga tridactyla has the most records (n = 5,941), and Cyclopes sp. have the fewest (n = 240). The armadillo species with the most data is Dasypus novemcinctus (n = 11,588), and the fewest data are recorded for Calyptophractus retusus (n = 33). With regard to sloth species, Bradypus variegatus has the most records (n = 962), and Bradypus pygmaeus has the fewest (n = 12). Our main objective with Neotropical Xenarthrans is to make occurrence and quantitative data available to facilitate more ecological research, particularly if we integrate the xenarthran data with other data sets of Neotropical Series that will become available very soon (i.e., Neotropical Carnivores, Neotropical Invasive Mammals, and Neotropical Hunters and Dogs). Therefore, studies on trophic cascades, hunting pressure, habitat loss, fragmentation effects, species invasion, and climate change effects will be possible with the Neotropical Xenarthrans data set. Please cite this data paper when using its data in publications. We also request that researchers and teachers inform us of how they are using these data.

    A global metagenomic map of urban microbiomes and antimicrobial resistance

    We present a global atlas of 4,728 metagenomic samples from mass-transit systems in 60 cities over 3 years, representing the first systematic, worldwide catalog of the urban microbial ecosystem. This atlas provides an annotated, geospatial profile of microbial strains, functional characteristics, antimicrobial resistance (AMR) markers, and genetic elements, including 10,928 viruses, 1,302 bacteria, 2 archaea, and 838,532 CRISPR arrays not found in reference databases. We identified 4,246 known species of urban microorganisms and a consistent set of 31 species found in 97% of samples that were distinct from human commensal organisms. Profiles of AMR genes varied widely in type and density across cities. Cities showed distinct microbial taxonomic signatures that were driven by climate and geographic differences. These results constitute a high-resolution global metagenomic atlas that enables discovery of organisms and genes, highlights potential public health and forensic applications, and provides a culture-independent view of AMR burden in cities.
    Funding: the Tri-I Program in Computational Biology and Medicine (CBM) funded by NIH grant 1T32GM083937; GitHub; Philip Blood and the Extreme Science and Engineering Discovery Environment (XSEDE), supported by NSF grant number ACI-1548562 and NSF award number ACI-1445606; NASA (NNX14AH50G, NNX17AB26G), the NIH (R01AI151059, R25EB020393, R21AI129851, R35GM138152, U01DA053941); STARR Foundation (I13-0052); LLS (MCL7001-18, LLS 9238-16, LLS-MCL7001-18); the NSF (1840275); the Bill and Melinda Gates Foundation (OPP1151054); the Alfred P. Sloan Foundation (G-2015-13964); Swiss National Science Foundation grant number 407540_167331; NIH award number UL1TR000457; the US Department of Energy Joint Genome Institute under contract number DE-AC02-05CH11231; the National Energy Research Scientific Computing Center, supported by the Office of Science of the US Department of Energy; Stockholm Health Authority grant SLL 20160933; the Institut Pasteur Korea; an NRF Korea grant (NRF-2014K1A4A7A01074645, 2017M3A9G6068246); the CONICYT Fondecyt Iniciación grants 11140666 and 11160905; Keio University Funds for Individual Research; funds from the Yamagata prefectural government and the city of Tsuruoka; JSPS KAKENHI grant number 20K10436; the bilateral AT-UA collaboration fund (WTZ:UA 02/2019; Ministry of Education and Science of Ukraine, UA:M/84-2019, M/126-2020); Kyiv Academic University; Ministry of Education and Science of Ukraine project numbers 0118U100290 and 0120U101734; Centro de Excelencia Severo Ochoa 2013–2017; the CERCA Programme / Generalitat de Catalunya; the CRG-Novartis-Africa mobility program 2016; research funds from National Cheng Kung University and the Ministry of Science and Technology, Taiwan (MOST grant number 106-2321-B-006-016); we thank all the volunteers who made sampling NYC possible, Minciencias (project no. 639677758300), CNPq (EDN - 309973/2015-5), the Open Research Fund of Key Laboratory of Advanced Theory and Application in Statistics and Data Science – MOE, ECNU, the Research Grants Council of Hong Kong through project 11215017, National Key RD Project of China (2018YFE0201603), and Shanghai Municipal Science and Technology Major Project (2017SHZDZX01) (L.S.