275 research outputs found
ExTaSem! Extending, Taxonomizing and Semantifying Domain Terminologies
We introduce EXTASEM!, a novel approach for the automatic learning of lexical taxonomies from domain terminologies. First, we exploit a very large semantic network to collect thousands of in-domain textual definitions. Second, we extract (hyponym, hypernym) pairs from each definition with a CRF-based algorithm trained on manually validated data. Finally, we introduce a graph induction procedure which constructs a full-fledged taxonomy where each edge is weighted according to its domain pertinence. EXTASEM! achieves state-of-the-art results in the following taxonomy evaluation experiments: (1) hypernym discovery, (2) reconstructing gold-standard taxonomies, and (3) taxonomy quality according to structural measures. We release weighted taxonomies for six domains for the use and scrutiny of the community.
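As a rough illustration of the graph-induction step described above, the sketch below builds a weighted directed taxonomy graph from (hyponym, hypernym) pairs. The pair list, the pertinence scores and all names are illustrative assumptions, not the EXTASEM! implementation.

```python
# Minimal sketch: given (hyponym, hypernym) pairs and a domain-pertinence
# score per pair, induce a weighted directed taxonomy graph.
import networkx as nx

def induce_taxonomy(pairs, pertinence):
    """pairs: iterable of (hyponym, hypernym) strings.
    pertinence: dict mapping (hyponym, hypernym) -> float in [0, 1] (assumed given)."""
    g = nx.DiGraph()
    for hypo, hyper in pairs:
        w = pertinence.get((hypo, hyper), 0.0)
        # Edges point from the more specific term to the more general one.
        g.add_edge(hypo, hyper, weight=w)
    return g

pairs = [("espresso", "coffee"), ("coffee", "beverage")]
scores = {("espresso", "coffee"): 0.9, ("coffee", "beverage"): 0.8}
taxonomy = induce_taxonomy(pairs, scores)
print(sorted(taxonomy.edges(data="weight")))
```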
Automatic summary generation through selective analysis (Génération automatique de résumés par analyse sélective)
Thesis digitized by the Direction des bibliothèques de l'Université de Montréal.
Description and Evaluation of a Definition Extraction System for Catalan
Automatic Definition Extraction (DE) consists of identifying definitions in naturally occurring text. This paper presents a method for the identification of definitions in Catalan in the encyclopedic domain. The training and test corpora come from the Catalan Wikipedia (Viquipèdia), and the test set has been manually validated. We approach the task as a supervised classification problem, using the Conditional Random Fields algorithm. In addition to the usual linguistic features, we introduce features that exploit the frequency of a word in general and specific domains, in definitional and non-definitional sentences, and in definiendum (term to be defined) and definiens (cluster of words that defines the definiendum) position. We obtain promising results that suggest that combining linguistic and statistical features can prove useful for developing DE systems for under-resourced languages. This work was partially funded by project TIN2012-38584-C06-03 of the Ministerio de Economía y Competitividad, Secretaría de Estado de Investigación, Desarrollo e Innovación, Spain.
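To make the classification setup concrete, here is a minimal sketch of CRF-based token labelling for definition extraction, assuming sklearn-crfsuite and a precomputed domain-frequency lookup; the feature set, label scheme and toy data are illustrative, not the system described in the paper.

```python
# Minimal sketch: each token gets linguistic features plus a (hypothetical)
# domain-frequency feature, and is tagged as definiendum, definiens, or outside.
import sklearn_crfsuite

def token_features(sent, i, domain_freq):
    word = sent[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "suffix3": word[-3:],
        "domain_freq": domain_freq.get(word.lower(), 0.0),  # assumed lookup table
    }

def sent_features(sent, domain_freq):
    return [token_features(sent, i, domain_freq) for i in range(len(sent))]

# Toy training data: B-DEF marks the definiendum, B/I-DEFS the definiens.
sents = [["A", "lexeme", "is", "a", "unit", "of", "lexical", "meaning"]]
labels = [["O", "B-DEF", "O", "B-DEFS", "I-DEFS", "I-DEFS", "I-DEFS", "I-DEFS"]]
freq = {"lexeme": 0.02, "unit": 0.4}

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([sent_features(s, freq) for s in sents], labels)
print(crf.predict([sent_features(sents[0], freq)]))
```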
Summarization and information extraction on your tablet (Resumen y extracción de información en tu Tablet)
In this article we present a Web-based demonstration of on-line text summarization and information extraction technology. News summarization in Spanish has been implemented in a system that monitors a news provider and summarizes the latest published news. Summaries can also be generated from user-provided text in English and Spanish. The demonstrator additionally features event extraction functionality, identifying the relevant concepts that characterize several types of events by mining English textual content. The application is available in a Web browser and on an Android tablet. We acknowledge support from the Spanish research project SKATER-UPF-TALN TIN2012-38584-C06-03, the EU project Dr. Inventor FP7-ICT-2013.8.1 611383, and UPF projects PlaQUID 65 2013-2014 and PlaQUID 47 2011-2012.
BODEGA: Benchmark for Adversarial Example Generation in Credibility Assessment
Text classification methods have been widely investigated as a way to detect content of low credibility: fake news, social media bots, propaganda, etc. Quite accurate models (likely based on deep neural networks) help in moderating public electronic platforms and often cause content creators to face rejection of their submissions or removal of already published texts. Having the incentive to evade further detection, content creators try to come up with a slightly modified version of the text (known as an attack with an adversarial example) that exploits the weaknesses of classifiers and results in a different output. Here we introduce BODEGA: a benchmark for testing both victim models and attack methods on four misinformation detection tasks in an evaluation framework designed to simulate real use cases of content moderation. We also systematically test the robustness of popular text classifiers against available attacking techniques and discover that, indeed, in some cases barely significant changes in the input text can mislead the models. We openly share the BODEGA code and data in the hope of enhancing the comparability and replicability of further research in this area.
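To illustrate the kind of robustness test such a benchmark supports, the sketch below probes a toy victim classifier with single character swaps and counts how often the prediction flips; the model, data and attack are illustrative stand-ins, not the BODEGA victim models or attack methods.

```python
# Minimal sketch: train a toy credibility classifier, then measure how many
# single adjacent-character swaps change its prediction on one input.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["miracle cure doctors hate", "city council approves budget",
               "shocking secret they hide", "quarterly earnings reported"]
train_labels = [1, 0, 1, 0]  # 1 = low credibility, 0 = credible

victim = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                       LogisticRegression())
victim.fit(train_texts, train_labels)

def swap_adjacent(text, i):
    """Adversarial-style perturbation: swap two adjacent characters."""
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

original = "shocking miracle cure they hide"
flips = 0
for i in range(len(original) - 1):
    perturbed = swap_adjacent(original, i)
    if victim.predict([perturbed])[0] != victim.predict([original])[0]:
        flips += 1
print(f"{flips} single-swap perturbations changed the prediction")
```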
Is this tweet satirical? An automatic method for identifying satirical language in Spanish (¿Es satírico este tweet? Un método automático para la identificación del lenguaje satírico en español)
Computational approaches to analyzing figurative language are attracting growing interest in Computational Linguistics. In this paper, we study the characterization of Twitter messages in Spanish that advertise satirical news. We present and evaluate a system able to classify tweets as satirical or not. To this purpose, we concentrate on the tweets published by several satirical and non-satirical Twitter accounts. We model the text of each tweet with a set of linguistically motivated features that aim at capturing the style rather than the content of the message. Our experiments demonstrate that our model outperforms a word-based baseline. We also demonstrate that our system models global features of satirical language by showing that it is able to detect whether or not a tweet contains satirical content, independently of the account that generated it. The research described in this paper is partially funded by the SKATER-UPF-TALN project (TIN2012-38584-C06-03).
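As a rough sketch of the style-over-content idea, the example below represents each tweet with a handful of shallow stylistic features and trains a standard classifier; the feature set and data are guesses for illustration, not the features used in the paper.

```python
# Minimal sketch: shallow stylistic features (not word identity) fed to a
# standard linear classifier for satirical vs non-satirical tweets.
import numpy as np
from sklearn.linear_model import LogisticRegression

def style_features(tweet):
    tokens = tweet.split()
    return [
        len(tokens),                                            # length in tokens
        sum(t.startswith("#") for t in tokens),                 # hashtags
        sum(t.startswith("http") for t in tokens),              # links
        sum(c.isupper() for c in tweet) / max(len(tweet), 1),   # uppercase ratio
        tweet.count("!") + tweet.count("?"),                    # expressive punctuation
    ]

tweets = ["Government declares Mondays illegal!! #news",
          "Parliament passes the 2015 budget http://example.org"]
labels = [1, 0]  # 1 = satirical account, 0 = non-satirical account

clf = LogisticRegression().fit(np.array([style_features(t) for t in tweets]), labels)
print(clf.predict(np.array([style_features("Mayor bans gravity in city centre!!")])))
```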
An on-line text simplification system for English (Un sistema de simplificación de textos on-line para el inglés)
Text simplification is the task of reducing the lexical and syntactic complexity of documents in order to improve their readability and understandability. This paper presents a web-based demonstration of a text simplification system that performs state-of-the-art lexical and syntactic simplification of English texts. The core simplification technology used for this demonstration is modular and highly customizable, making it suitable for different types of users. This work was funded by the ABLE-TO-INCLUDE project (European Commission Competitiveness and Innovation Framework Programme under Grant Agreement No. 621055) and project SKATER-UPF-TALN (TIN2012-38584-C06-03) from the Ministerio de Economía y Competitividad, Secretaría de Estado de Investigación, Desarrollo e Innovación, Spain.
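To give a flavour of the lexical side of such a system, the sketch below replaces words with more frequent synonyms drawn from a tiny hand-made table; the demonstrated system's resources and ranking are assumed to be far richer.

```python
# Minimal sketch of frequency-based lexical simplification with an
# illustrative synonym table and frequency list (not the system's resources).
SYNONYMS = {"utilise": ["use"], "commence": ["start", "begin"], "purchase": ["buy"]}
FREQUENCY = {"use": 9.5, "start": 8.9, "begin": 8.7, "buy": 8.8,
             "utilise": 4.1, "commence": 4.5, "purchase": 6.2}  # e.g. log counts

def simplify_word(word):
    candidates = [word] + SYNONYMS.get(word, [])
    # Keep the candidate the reader is most likely to know (highest frequency).
    return max(candidates, key=lambda w: FREQUENCY.get(w, 0.0))

def simplify(sentence):
    return " ".join(simplify_word(w) for w in sentence.split())

print(simplify("please commence to utilise the new form"))
# -> "please start to use the new form"
```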
DysWebxia: making texts more accessible for people with dyslexia
About 10% of the world population has dyslexia, a learning disability affecting reading and writing. Although dyslexia is neurological in origin, certain text modifications can make texts more accessible for people with dyslexia. We present DysWebxia, a public model that integrates text design recommendations and natural language processing techniques. The model is grounded in the findings of our research conducted with people with dyslexia using human-computer interaction evaluation methodologies such as eye-tracking, and it alters both the content and the presentation of a text to make it more readable. We also present the current integrations of DysWebxia in different reading software applications. This research was partially funded by an FI predoctoral grant from the Generalitat de Catalunya.
Deliverable 6.1 Infrastructure for Extractive Summarization
SKATER Internal Report: infrastructure software for extractive summarization (work carried out until December 2013). Preprint.
Colouring summaries BLEU
In this paper we attempt to apply the IBM algorithm, BLEU, to the output of four different summarizers in order to perform an intrinsic evaluation of their output. The objective of this experiment is to explore whether a metric originally developed for the evaluation of machine translation output can be used to reliably assess another type of output. By changing the type of text evaluated by BLEU to automatically generated extracts, and by setting the conditions and parameters of the evaluation experiment according to the idiosyncrasies of the task, we put to the test the feasibility of porting BLEU to other Natural Language Processing research areas. Furthermore, some important conclusions about the resources needed for evaluating summaries emerged as a side effect of running the experiment.
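As a concrete illustration of the measurement at the heart of this experiment, the sketch below scores a toy system extract against reference summaries with BLEU, using nltk's implementation as a stand-in for the original IBM code; the texts are invented examples.

```python
# Minimal sketch: BLEU between an automatically produced extract and
# reference summaries (toy data, nltk implementation).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "the committee approved the new budget on friday".split(),
    "on friday the committee approved the budget".split(),
]
system_extract = "the committee approved the budget".split()

# Default BLEU-4 weights; smoothing avoids zero scores on very short summaries.
score = sentence_bleu(references, system_extract,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```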