141 research outputs found
Automatic text summarization with Maximal Frequent Sequences
In the last two decades, an exponential increase in the available electronic information
has created a pressing need to understand large volumes of information quickly.
This makes it increasingly important to develop automatic methods that
detect the most relevant content of a document in order to produce a shorter
text. Automatic Text Summarization (ATS) is an active research area dedicated to
generating abstractive and extractive summaries, not only for a single document but
also for a collection of documents. A further need is to find ATS methods that are
independent of language and domain.
In this book we consider the extractive text summarization task for a single
document, where the summary is a selection of the most important sentences of
the text. We have identified that a typical extractive summarization method consists
of four steps. The first step is term selection, where one decides what units
will count as individual terms. The process of estimating the usefulness of the
individual terms is called the term weighting step. The next step is sentence
weighting, where every sentence receives a numerical measure according to
the usefulness of its terms. Finally, the process of selecting the most relevant sentences
is called sentence selection. Different extractive summarization methods can
be characterized by how they perform these steps.
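The four steps can be sketched as a small pipeline. This is only an illustration under simple assumptions, not the method of the book: terms are single lowercased words, term weight is raw frequency, sentence weight is the sum of the weights of a sentence's distinct terms, and selection keeps the top `k` sentences in document order. All function names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter (an assumption; any splitter would do)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def select_terms(sentence):
    """Step 1, term selection: here every lowercased word counts as a term."""
    return re.findall(r"\w+", sentence.lower())


def weight_terms(sentences):
    """Step 2, term weighting: here raw frequency over the whole document."""
    return Counter(t for s in sentences for t in select_terms(s))


def weight_sentence(sentence, term_weights):
    """Step 3, sentence weighting: sum of the weights of its distinct terms."""
    return sum(term_weights[t] for t in set(select_terms(sentence)))


def select_sentences(sentences, scores, k):
    """Step 4, sentence selection: top-k by score, kept in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


def summarize(text, k=1):
    """Run the four steps in sequence and return the selected sentences."""
    sentences = split_sentences(text)
    weights = weight_terms(sentences)
    scores = [weight_sentence(s, weights) for s in sentences]
    return select_sentences(sentences, scores, k)


if __name__ == "__main__":
    doc = "Cats chase mice. Dogs chase cats and mice. Birds sing."
    print(summarize(doc, k=1))
```

Swapping any one function (for example, multiword terms in `select_terms`) changes a single step while leaving the others untouched, which is the sense in which methods can be characterized by how they perform each step.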
In the term selection step, we describe how to detect multiword
descriptions using Maximal Frequent Sequences (MFSs), which bear important
meaning, whereas non-maximal frequent sequences (FSs), those that are
parts of another FS, should not be considered. An additional motivation was
cost versus benefit: there are too many non-maximal FSs, while their
probability of bearing important meaning is lower. In any case, MFSs represent all FSs
in a compact way: all FSs can be obtained from all MFSs by bursting each MFS into
the set of all its subsequences.
New methods based on graph algorithms, genetic algorithms, and clustering
algorithms that facilitate the text summarization task are presented. We
have tested different combinations of term selection, term weighting, sentence
weighting, and sentence selection options for language- and domain-independent
extractive single-document text summarization on a collection of news reports. We
analyzed several options based on MFSs, considering them with graph, genetic,
and clustering algorithms. We obtained results superior to the existing
state-of-the-art methods.
This book is addressed to students and scientists in the area of Computational
Linguistics, and also to anyone who wants to learn about recent developments in the
automatic generation of text summaries.
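The abstract's remark that maximal frequent sequences represent all frequent sequences compactly can be checked on a toy example. The sketch below assumes that a sequence is a contiguous run of words and that "frequent" means occurring at least `threshold` times; both are assumptions made for illustration, and the function names are hypothetical rather than taken from the book.

```python
from collections import Counter


def frequent_sequences(sentences, threshold):
    """All contiguous word sequences occurring at least `threshold` times."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                counts[tuple(words[i:j])] += 1
    return {seq for seq, c in counts.items() if c >= threshold}


def is_part_of(short, long):
    """True if `short` occurs contiguously inside the strictly longer `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def maximal_sequences(fss):
    """Keep only the frequent sequences that are not part of another one."""
    return {s for s in fss if not any(is_part_of(s, other) for other in fss)}


def burst(mfs):
    """Expand one sequence into the set of all its contiguous subsequences."""
    return {mfs[i:j] for i in range(len(mfs)) for j in range(i + 1, len(mfs) + 1)}


if __name__ == "__main__":
    corpus = [
        "automatic text summarization works".split(),
        "we study automatic text summarization".split(),
        "text summarization is hard".split(),
    ]
    fss = frequent_sequences(corpus, threshold=2)
    mfss = maximal_sequences(fss)
    recovered = set().union(*(burst(m) for m in mfss))
    print(len(fss), len(mfss), recovered == fss)
```

On this toy corpus a single maximal sequence stands in for six frequent ones, which illustrates the cost versus benefit argument: the maximal sequences are far fewer, and nothing is lost because bursting restores the rest.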
Corpora for Computational Linguistics
Since the mid-1990s, corpora have become very important for computational linguistics. This paper offers a survey of how they are currently used in different fields of the discipline, with particular emphasis on anaphora and coreference resolution, automatic summarisation and term extraction.
Their influence on other fields is also briefly discussed.
Verso la costruzione di una biblioteca digitale.
A database of the "Antonio Zampolli Fund" has been created and the corresponding catalogue has been published. The work of analysing and selecting texts for cataloguing helped in creating this bibliography, which is in large part built on references extracted from books and journals. Very old bibliographical references have also been retrieved from curricula prepared by Professor Zampolli for various projects and commissions.
Why We Need New Evaluation Metrics for NLG
The majority of NLG evaluation relies on automatic metrics, such as BLEU. In
this paper, we motivate the need for novel, system- and data-independent
automatic evaluation methods: We investigate a wide range of metrics, including
state-of-the-art word-based and novel grammar-based ones, and demonstrate that
they only weakly reflect human judgements of system outputs as generated by
data-driven, end-to-end NLG. We also show that metric performance is data- and
system-specific. Nevertheless, our results also suggest that automatic metrics
perform reliably at system-level and can support system development by finding
cases where a system performs poorly.
Comment: accepted to EMNLP 201
Finding Important Terms for Patients in Their Electronic Health Records: A Learning-to-Rank Approach Using Expert Annotations
BACKGROUND: Many health organizations allow patients to access their own electronic health record (EHR) notes through online patient portals as a way to enhance patient-centered care. However, EHR notes are typically long and contain abundant medical jargon that can be difficult for patients to understand. In addition, many medical terms in patients' notes are not directly related to their health care needs. One way to help patients better comprehend their own notes is to reduce information overload and help them focus on medical terms that matter most to them. Interventions can then be developed by giving them targeted education to improve their EHR comprehension and the quality of care.
OBJECTIVE: We aimed to develop a supervised natural language processing (NLP) system called Finding impOrtant medical Concepts most Useful to patientS (FOCUS) that automatically identifies and ranks medical terms in EHR notes based on their importance to the patients.
METHODS: First, we built an expert-annotated corpus. For each EHR note, 2 physicians independently identified medical terms important to the patient. Using the physicians' agreement as the gold standard, we developed and evaluated FOCUS. FOCUS first identifies candidate terms from each EHR note using MetaMap and then ranks the terms using a support vector machine-based learn-to-rank algorithm. We explored rich learning features, including distributed word representation, Unified Medical Language System semantic type, topic features, and features derived from consumer health vocabulary. We compared FOCUS with 2 strong baseline NLP systems.
RESULTS: Physicians annotated 90 EHR notes and identified a mean of 9 (SD 5) important terms per note. The Cohen's kappa annotation agreement was .51. The 10-fold cross-validation results show that FOCUS achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.940 for ranking candidate terms from EHR notes to identify important terms. When including term identification, the performance of FOCUS for identifying important terms from EHR notes was 0.866 AUC-ROC. Both performance scores significantly exceeded the corresponding baseline system scores (P < .001). Rich learning features contributed to FOCUS's performance substantially.
CONCLUSIONS: FOCUS can automatically rank terms from EHR notes based on their importance to patients. It may help develop future interventions that improve quality of care.
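For readers unfamiliar with the agreement statistic reported above, Cohen's kappa compares the observed agreement of two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a generic illustration; the labels are invented for the example and are not the study's annotations, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention assumed for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical judgements: 1 = term important to the patient, 0 = not.
    physician_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
    physician_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
    print(round(cohens_kappa(physician_1, physician_2), 3))
```

In this made-up example the annotators agree on 8 of 10 items, yet kappa is noticeably lower than 0.8 because a good share of that agreement would be expected by chance alone.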
Grafitos: un editor didactico para grafos conceptuales
This document describes the main features of Grafitos, a didactic editor for conceptual graphs, which is part of the degree project INTRODUCCIÓN A LOS GRAFOS CONCEPTUALES: UN EDITOR DIDÁCTICO in the Systems and Computing Engineering program (Ingeniería de Sistemas y Computación) at the Universidad Tecnológica de Pereira. The editor is intended to offer the university community and the general public a didactic tool that provides a clear and precise introduction to conceptual graphs and their main characteristics.