56 research outputs found
A Neural Attention Model for Abstractive Sentence Summarization
Summarization based on text extraction is inherently limited, but
generation-style abstractive methods have proven challenging to build. In this
work, we propose a fully data-driven approach to abstractive sentence
summarization. Our method utilizes a local attention-based model that generates
each word of the summary conditioned on the input sentence. While the model is
structurally simple, it can easily be trained end-to-end and scales to a large
amount of training data. The model shows significant performance gains on the
DUC-2004 shared task compared with several strong baselines. (Comment: Proceedings of EMNLP 2015)
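The mechanism this abstract describes, generating each summary word conditioned on an attention-weighted view of the input sentence, can be pictured roughly as follows. This is a minimal sketch with invented shapes and parameter names, not the paper's exact parameterization.

```python
# Minimal sketch of attention-conditioned word scoring: each candidate next
# word is scored from the decoding context plus an attention-weighted
# summary of the input sentence. Shapes and names are illustrative.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_logits(src_embs, ctx_emb, P, V, W):
    """src_embs: (n_src, d) input-word embeddings; ctx_emb: (d,) embedding
    of the last few generated words; P: (d, d) attention bilinear map;
    V, W: (vocab, d) output projections."""
    attn = softmax(src_embs @ P @ ctx_emb)  # (n_src,) local attention weights
    attended = attn @ src_embs              # (d,) attention-weighted input
    return V @ ctx_emb + W @ attended       # (vocab,) unnormalized scores
```

Greedy or beam-search decoding then emits high-scoring words one at a time, which is what lets such a model be trained end-to-end on large sentence-summary corpora.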
Citation Handling for Improved Summarization of Scientific Documents
In this paper we present the first steps toward improving summarization
of scientific documents through citation analysis and parsing. Prior
work (Mohammad et al., 2009) argues that citation texts (sentences that
cite other papers) play a crucial role in automatic summarization of a
topical area, but did not take into account the noise introduced by the
citations themselves. We demonstrate that it is possible to improve
summarization output through careful handling of these citations. We
base our experiments on the application of an improved trimming approach
to summarization of citation texts extracted from Question-Answering and
Dependency-Parsing documents. We demonstrate that confidence scores from
the Stanford NLP Parser (Klein and Manning, 2003) are significantly
improved, and that Trimmer (Zajic et al., 2007), a sentence-compression
tool, is able to generate higher-quality candidates. Our summarization
output is currently used as part of a larger system, Action Science
Explorer (ASE) (Gove, 2011).
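One concrete instance of the citation noise discussed above is the parenthetical author-year citation embedded in a sentence. A hypothetical pre-processing step, illustrative only and not the paper's Trimmer pipeline, might strip such citations before parsing or compression:

```python
# Hypothetical illustration of careful citation handling: strip parenthetical
# author-year citations so they do not confuse a downstream parser or
# sentence compressor. The regex is a rough sketch, not the paper's method.
import re

CITATION = re.compile(
    r'\s*\([A-Z][A-Za-z-]+(?: et al\.)?'      # first author, optional et al.
    r'(?:,? (?:and |& )?[A-Z][A-Za-z-]+)*'    # further authors
    r',? \d{4}[a-z]?\)')                      # year, e.g. 2003 or 2003a

def strip_citations(sentence: str) -> str:
    """Remove citations such as '(Klein and Manning, 2003)'."""
    return CITATION.sub('', sentence)

print(strip_citations(
    "Confidence scores from the Stanford NLP Parser (Klein and Manning, 2003) improved."))
# -> "Confidence scores from the Stanford NLP Parser improved."
```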
Analyzing collaborative learning processes automatically
In this article we describe the emerging area of text classification research focused on the problem of collaborative learning process analysis, both from a broad perspective and more specifically in terms of a publicly available tool set called TagHelper tools. Analyzing the variety of pedagogically valuable facets of learners’ interactions is a time-consuming and effortful process. Improving automated analyses of such highly valued processes of collaborative learning by adapting and applying recent text classification technologies would make it a less arduous task to obtain insights from corpus data. This endeavor also holds the potential to substantially improve online instruction, both by providing teachers and facilitators with reports about the groups they are moderating and by triggering context-sensitive collaborative learning support on an as-needed basis. In this article, we report on an interdisciplinary research project that has been investigating the effectiveness of applying text classification technology to a large CSCL corpus that has been analyzed by human coders using a theory-based multidimensional coding scheme. We report promising results and include an in-depth discussion of important issues such as reliability, validity, and efficiency that should be considered when deciding on the appropriateness of adopting a new technology such as TagHelper tools. One major technical contribution of this work is a demonstration that a key step in making text classification technology effective for this purpose is designing and building linguistic pattern detectors, also known as features, that can be extracted reliably from texts and that have high predictive power for the categories of discourse actions the CSCL community is interested in.
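The feature-engineering point made at the end of this abstract can be pictured with a toy classifier: surface word features plus a hand-built linguistic pattern detector feeding a standard learner. The utterances, labels, and detector below are invented for illustration and are not TagHelper's actual coding scheme.

```python
# Toy sketch of feature-based coding of collaborative-learning discourse:
# bag-of-words features plus one hand-designed pattern detector. The data,
# labels, and detector are invented; TagHelper's real features differ.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

utterances = ["I disagree because the data show otherwise",
              "Can you explain your reasoning?",
              "Let's move on to the next problem"]
labels = ["argumentation", "elicitation", "coordination"]

X_words = CountVectorizer().fit_transform(utterances).toarray()

# A toy linguistic pattern detector: does the utterance challenge a claim?
challenges = np.array([[int("disagree" in u)] for u in utterances])

X = np.hstack([X_words, challenges])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X[:1]))  # prediction for the first training utterance
```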
Graph-based Patterns for Local Coherence Modeling
Coherence is an essential property of well-written texts. It distinguishes a multi-sentence text from a sequence of randomly strung-together sentences. The task of local coherence modeling concerns the way sentences in a text link up with one another. Solving this task is beneficial for assessing the quality of texts. Moreover, a coherence model can be integrated into text generation systems, such as text summarizers, to produce coherent texts.
In this dissertation, we present a graph-based approach to local coherence modeling that accounts for the connectivity structure among sentences in a text. Graphs give our model the capability to take into account relations between non-adjacent sentences as well as those between adjacent sentences. Moreover, the connectivity structure among nodes in a graph reflects the relationships among sentences in a text.
We first employ the entity graph approach, proposed by Guinaudeau and Strube (2013), to represent a text via a graph. In the entity graph representation of a text, nodes encode sentences and an edge indicates that two sentences contain coreferent mentions. We then devise graph-based features to capture the connectivity structure of nodes in a graph, and accordingly the connectivity structure of sentences in the corresponding text. We extract all subgraphs of entity graphs as features that encode the connectivity structure of graphs. Frequencies of subgraphs correlate with the perceived coherence of their corresponding texts. Therefore, we refer to these subgraphs as coherence patterns.
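A small sketch of this entity-graph construction, with coreference simplified to shared mention strings (the real approach resolves coreferent mentions), looks as follows:

```python
# Minimal sketch of an entity graph in the spirit of Guinaudeau and Strube
# (2013): nodes are sentences; an edge links two sentences that contain
# coreferent mentions. Shared mention strings stand in for coreference here.
import itertools
import networkx as nx

mentions = {0: {"court", "microsoft"},   # entity mentions per sentence
            1: {"microsoft", "evidence"},
            2: {"court", "ruling"}}

G = nx.Graph()
G.add_nodes_from(mentions)
for i, j in itertools.combinations(mentions, 2):
    if mentions[i] & mentions[j]:        # a shared entity links the sentences
        G.add_edge(i, j)

print(sorted(G.edges()))  # -> [(0, 1), (0, 2)]
# Subgraphs of G (chains, stars, triangles, ...) are the coherence patterns
# whose frequencies serve as features.
```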
To complete our approach to coherence modeling, we propose a new graph representation of texts that goes beyond the entity graph. Our approach employs lexico-semantic relations among words in sentences, instead of only entity coreference relations, to model relationships between sentences via a graph. This new lexical graph representation of texts, together with our method for mining coherence patterns, constitutes our coherence model.
We evaluate our approach on the readability assessment task, because a primary factor of readability is coherence. Coherent texts are easy to read and consequently demand less effort from their readers. Our extensive experiments on two separate readability assessment datasets show that frequencies of coherence patterns in texts correlate with the readability ratings assigned by human judges. Trained with a machine learning method on our coherence patterns, our model outperforms its counterparts at ranking texts with respect to their readability. As one of the ultimate goals of coherence models is to be used in text generation systems, we show how our coherence patterns can be integrated into a graph-based text summarizer to produce informative and coherent summaries. Our coherence patterns improve the performance of the summarization system according to both standard summarization metrics and human evaluations. An implementation of the approaches discussed in this dissertation is publicly available.
SPEC5G: A Dataset for 5G Cellular Network Protocol Analysis
5G is the 5th generation cellular network protocol. It is the
state-of-the-art global wireless standard that enables an advanced kind of
network designed to connect virtually everyone and everything with increased
speed and reduced latency. Therefore, its development, analysis, and security
are critical. However, all approaches to 5G protocol development and
security analysis, e.g., property extraction, protocol summarization, and
semantic analysis of the protocol specifications and implementations, are
completely manual. To reduce such manual effort, in this paper we curate
SPEC5G, the first-ever public 5G dataset for NLP research. The dataset contains
3,547,586 sentences with 134M words, drawn from 13,094 cellular network specifications
and 13 online websites. By leveraging large-scale pre-trained language models
that have achieved state-of-the-art results on NLP tasks, we use this dataset
for security-related text classification and summarization. Security-related
text classification can be used to extract relevant security-related properties
for protocol testing. Summarization, in turn, can help developers and
practitioners gain a high-level understanding of the protocol, which is itself a
daunting task. Our results show the value of our 5G-centric dataset in 5G
protocol analysis automation. We believe that SPEC5G will enable a new research
direction into automatic analyses for the 5G cellular network protocol and
numerous related downstream tasks. Our data and code are publicly available.
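As a rough picture of the security-related text classification the abstract mentions, a generic zero-shot classifier can stand in; the model below is a general-purpose NLI model, not the paper's fine-tuned system, and the example sentence is invented in the style of 3GPP specification text.

```python
# Hedged sketch of security-relevance classification over specification
# sentences. "facebook/bart-large-mnli" is a generic zero-shot model used
# as a stand-in; the paper instead fine-tunes pre-trained models on SPEC5G.
from transformers import pipeline

clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
sentence = ("The UE shall abort the procedure if the integrity check "
            "of the NAS message fails.")
print(clf(sentence,
          candidate_labels=["security-relevant", "not security-relevant"]))
```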
Table-to-Text: Generating Descriptive Text for Scientific Tables from Randomized Controlled Trials
Unprecedented amounts of data have been generated in the biomedical domain, and the bottleneck for biomedical research has shifted from data generation to data management, interpretation, and communication. Therefore, it is highly desirable to develop systems that assist in generating text from biomedical data, which would greatly improve the dissemination of scientific findings. However, very few studies have investigated data-to-text generation in the biomedical domain. Here I present a systematic study of generating descriptive text from tables in randomized controlled trial (RCT) articles, which includes: (1) an information model for representing RCT tables; (2) annotated corpora containing pairs of RCT tables and descriptive text, along with labeled structural and semantic information of RCT tables; (3) methods for recognizing the structural and semantic information of RCT tables; and (4) methods for generating text from RCT tables, evaluated by a user study on three aspects: relevance, grammatical quality, and matching. The proposed hybrid text generation method achieved a low bilingual evaluation understudy (BLEU) score of 5.69, but human reviewers gave its output scores of 9.3, 9.9, and 9.3 for relevance, grammatical quality, and matching, respectively, comparable to their ratings of the original human-written text. To the best of our knowledge, this is the first study to generate text from scientific tables in the biomedical domain. The proposed information model, labeled corpora, and developed methods for recognizing tables and generating descriptive text could also facilitate other biomedical and informatics research and applications.
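The BLEU figure quoted above can be reproduced in form (not in value) with a standard implementation; the reference and generated descriptions below are invented examples, not from the RCT corpora.

```python
# Sketch of sentence-level BLEU between a generated table description and a
# human-written reference, using NLTK. Texts here are invented examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the intervention group improved significantly more than placebo".split()]
generated = "patients receiving the intervention improved more than placebo".split()

smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
print(sentence_bleu(reference, generated, smoothing_function=smooth))
```

Low BLEU alongside high human ratings, as reported here, is a common finding in data-to-text work: n-gram overlap penalizes valid paraphrases that human judges accept.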
Feasibility of using citations as document summaries
The purpose of this research is to establish whether it is feasible to use citations as document summaries. People are good at creating and selecting summaries and are generally the standard for evaluating computer-generated summaries. Citations can be characterized as concept symbols or short summaries of the document they cite. Similarity metrics have been used in retrieval and text summarization to determine how alike two documents are, but they have never been compared to what human subjects judge to be similar between two documents. If similarity metrics reflect human judgment, then we can mechanize the selection of citations that act as short summaries of the document they cite. The research approach was to gather rater data comparing document abstracts to citations about the same document and then to statistically compare those results to several document metrics: frequency count, similarity metric, citation location, and type of citation. There were two groups of raters, subject experts and non-experts. Both groups were asked to evaluate seven parameters between abstract and citations: purpose, subject matter, methods, conclusions, findings, implications, readability, and understandability. Raters identified how strongly the citation represented the content of the abstract on a five-point Likert scale. Document metrics were collected for frequency count and cosine similarity between abstracts and associated citations. In addition, data were collected on the location and type of each citation. Location was identified and dummy-coded for introduction, method, discussion, review of the literature, and conclusion. Citations were categorized and dummy-coded according to whether they refuted, noted, supported, reviewed, or applied information about the cited document. The results show there is a relationship between some similarity metrics and human judgment of similarity. Ph.D., Information Studies -- Drexel University, 200
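The cosine similarity metric compared against rater judgments can be sketched as follows; TF-IDF weighting is one common choice and not necessarily the study's exact formulation, and the texts are invented.

```python
# Sketch of the cosine similarity metric between an abstract and a citation
# sentence. TF-IDF weighting is an assumption; the texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstract = "We evaluate similarity metrics for selecting citations as summaries."
citation = "Their study uses similarity metrics to pick citations as document summaries."

X = TfidfVectorizer().fit_transform([abstract, citation])
print(cosine_similarity(X[0], X[1])[0, 0])  # 0.0 = disjoint, 1.0 = identical
```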
Multilingual opinion mining
Every day, a large amount of text is generated in different online media. Much of that text contains opinions about a multitude of entities, products, services, etc. Given the growing need for automated means to analyze, process, and exploit that information, sentiment analysis techniques have received a great deal of attention from industry and the scientific community over the last decade and a half. However, many of the techniques employed require supervised training with manually annotated examples, or other linguistic resources tied to a specific language or application domain. This limits the applicability of such techniques, since those resources and annotated examples are not easy to obtain. This thesis explores a series of methods for performing various automatic text analyses within the framework of sentiment analysis, including the automatic acquisition of domain terms, opinion-bearing words, and the sentiment polarity of those words (positive or negative). Finally, a method is proposed and evaluated that combines continuous word embeddings and topic modeling inspired by Latent Dirichlet Allocation (LDA) to build an aspect-based sentiment analysis (ABSA) system that needs only a few seed words to process texts in a given language or domain. In this way, adaptation to another language or domain reduces to translating the corresponding seed words.
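The seed-word idea at the end of this abstract can be pictured with a toy embedding step: a few aspect seeds are expanded to related terms by vector similarity. The corpus and seeds below are invented, and the sketch omits the LDA-style topic model the thesis combines with the embeddings.

```python
# Toy sketch of seed-word expansion with word embeddings for aspect-based
# sentiment analysis. Corpus and seeds are invented; the thesis additionally
# couples embeddings with an LDA-inspired topic model, omitted here.
from gensim.models import Word2Vec

corpus = [["the", "battery", "drains", "fast"],
          ["great", "battery", "life"],
          ["the", "screen", "is", "bright"],
          ["screen", "resolution", "looks", "sharp"]]

model = Word2Vec(corpus, vector_size=16, window=2, min_count=1, seed=0)
print(model.wv.most_similar("battery", topn=3))  # candidate aspect terms
```

Porting the system to a new language or domain then amounts to translating the seed words, as the abstract notes.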