
    Use of Genetic Algorithm for Cohesive Summary Extraction to Assist Reading Difficulties

    Learners with reading difficulties typically face significant challenges in understanding text-based learning materials. In this regard, there is a need for an assistive summary to help such learners approach learning documents with minimal difficulty. An important issue in extractive summarization is extracting a cohesive summary from the text; existing summarization approaches focus mostly on informative sentences rather than cohesive ones. We considered several existing features, including sentence location, cardinality, title similarity, and keywords, to extract important sentences. Moreover, learner-dependent readability-related features such as average sentence length, percentage of trigger words, percentage of polysyllabic words, and percentage of noun entity occurrences are considered for summarization. The objective of this work is to extract the optimal combination of sentences that increases readability through sentence cohesion using a genetic algorithm. The results show that summary extraction using our proposed approach performs better in F-measure, readability, and cohesion than the baseline (lead) approach and the corpus-based approach. The task-based evaluation shows the effect of assistive summary reading in enhancing readability for learners with reading difficulties.
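
    The following is a minimal sketch of the kind of genetic-algorithm sentence selection this abstract describes, assuming each sentence carries a precomputed informativeness score and a pairwise sentence-similarity matrix stands in for cohesion; the fitness definition, weights, and repair/mutation scheme are illustrative assumptions, not the paper's actual configuration.

```python
import random

def fitness(mask, sentences, sim, weights=(0.6, 0.4)):
    """Score a candidate summary given as a 0/1 mask over sentences.

    Combines an informativeness term (a precomputed per-sentence score, e.g.
    from location/title/keyword features) with a cohesion term (average
    similarity between consecutive selected sentences).
    """
    chosen = [i for i, keep in enumerate(mask) if keep]
    if len(chosen) < 2:
        return 0.0
    info = sum(sentences[i]["score"] for i in chosen) / len(chosen)
    cohesion = sum(sim[a][b] for a, b in zip(chosen, chosen[1:])) / (len(chosen) - 1)
    return weights[0] * info + weights[1] * cohesion

def genetic_summary(sentences, sim, k=3, pop_size=30, generations=100):
    """Search for the k-sentence subset with the highest fitness (assumes k < n)."""
    n = len(sentences)

    def random_mask():
        mask = [0] * n
        for i in random.sample(range(n), k):
            mask[i] = 1
        return mask

    def repair(mask):
        ones = [i for i, keep in enumerate(mask) if keep]
        while len(ones) > k:                      # drop surplus sentences
            mask[ones.pop(random.randrange(len(ones)))] = 0
        while len(ones) < k:                      # add missing sentences
            i = random.choice([j for j in range(n) if not mask[j]])
            mask[i] = 1
            ones.append(i)
        return mask

    population = [random_mask() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda m: fitness(m, sentences, sim), reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n)          # one-point crossover
            child = repair(a[:cut] + b[cut:])
            if random.random() < 0.2:             # mutation: swap one sentence in/out
                on = random.choice([j for j in range(n) if child[j]])
                off = random.choice([j for j in range(n) if not child[j]])
                child[on], child[off] = 0, 1
            children.append(child)
        population = survivors + children
    best = max(population, key=lambda m: fitness(m, sentences, sim))
    return [i for i, keep in enumerate(best) if keep]
```

    Here sentences would be a list of dicts carrying at least a "score" field, sim a pairwise similarity matrix, and the returned indices are the selected summary sentences in document order.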

    PersoNER: Persian named-entity recognition

    Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by a population of over a hundred million people worldwide. We first present and provide ArmanPersoNERCorpus, the first manually annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.
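
    The abstract does not spell out the pipeline, so the sketch below is only a much-simplified stand-in for the word-embedding-plus-max-margin idea: a per-token linear SVM over windowed embedding features rather than the paper's sequential max-margin classifier. The data layout, window size, and regularization constant are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

def window_features(embeddings, i, window=1):
    """Concatenate the embeddings of the tokens in a small window around position i."""
    dim = embeddings.shape[1]
    parts = []
    for offset in range(-window, window + 1):
        j = i + offset
        parts.append(embeddings[j] if 0 <= j < len(embeddings) else np.zeros(dim))
    return np.concatenate(parts)

def train_token_classifier(sentence_embeddings, sentence_tags):
    """sentence_embeddings: list of (n_tokens, dim) arrays (e.g. word2vec vectors);
    sentence_tags: list of per-token BIO tag sequences of matching lengths."""
    X, y = [], []
    for emb, tags in zip(sentence_embeddings, sentence_tags):
        for i, tag in enumerate(tags):
            X.append(window_features(emb, i))
            y.append(tag)
    clf = LinearSVC(C=1.0)  # max-margin objective, but per token rather than per sequence
    clf.fit(np.vstack(X), y)
    return clf
```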

    A Comparative Study of Text Summarization on E-mail Data Using Unsupervised Learning Approaches

    Over the last few years, email has become enormously popular. People send and receive many messages every day, connect with colleagues and friends, and share files and information. Unfortunately, email overload has developed into a personal burden for users as well as a financial concern for businesses. Accessing an ever-increasing number of lengthy emails has become a major concern for many users. Email text summarization is a promising approach to resolve this challenge. Email messages are general-domain text, unstructured and not always syntactically well formed; such characteristics introduce challenges for text processing, especially for the task of summarization. This research employs quantitative and inductive methodologies to implement unsupervised learning models that address the summarization task, to generate more precise summaries efficiently, and to determine which approach to unsupervised clustering performs best. The precision score from the ROUGE-N metric is used as the evaluation measure in this research. The performance of four different text summarization approaches is evaluated in terms of precision, using combinations of feature embedding techniques (Word2Vec or BERT) with hybrid or conventional clustering algorithms. The results reveal that both Word2Vec and BERT feature embeddings combined with the hybrid PHA-ClusteringGain k-means algorithm achieve higher precision than the conventional k-means clustering model. Among the hybrid approaches, the one using Word2Vec as the feature embedding method attained the maximum precision of 55.73%.
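
    For reference, a minimal sketch of the conventional Word2Vec-plus-k-means baseline that the study compares against is given below: sentences are embedded by averaging word vectors, clustered, and the sentence closest to each centroid is extracted. The vector size, cluster count, and selection rule are illustrative assumptions; the hybrid PHA-ClusteringGain variant is not shown.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

def summarize(sentences, n_clusters=3, vector_size=100):
    """sentences: list of token lists from one email or thread."""
    w2v = Word2Vec(sentences, vector_size=vector_size, min_count=1, epochs=50)

    def embed(tokens):
        # Sentence embedding = mean of its word vectors.
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(vector_size)

    X = np.vstack([embed(s) for s in sentences])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

    # Pick one representative sentence per cluster: the one closest to its centroid.
    chosen = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        chosen.append(int(members[np.argmin(dists)]))
    return [" ".join(sentences[i]) for i in sorted(chosen)]
```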

    Monolingual Sentence Rewriting as Machine Translation: Generation and Evaluation

    In this thesis, we investigate approaches to paraphrasing entire sentences within the constraints of a given task, which we call monolingual sentence rewriting. We introduce a unified framework for monolingual sentence rewriting, and apply it to three representative tasks: sentence compression, text simplification, and grammatical error correction. We also perform a detailed analysis of the evaluation methodologies for each task, identify bias in common evaluation techniques, and propose more reliable practices. Monolingual rewriting can be thought of as translating between two types of English (such as from complex to simple), and therefore our approach is inspired by statistical machine translation. In machine translation, a large quantity of parallel data is necessary to model the transformations from input to output text. Parallel bilingual data naturally occurs between common language pairs (such as English and French), but for monolingual sentence rewriting, there is little existing parallel data and annotation is costly. We modify the statistical machine translation pipeline to harness monolingual resources and insights into task constraints in order to drastically diminish the amount of annotated data necessary to train a robust system. Our method generates more meaning-preserving and grammatical sentences than earlier approaches and requires less task-specific data. Once candidate sentences are generated, it is crucial to have reliable evaluation methods. Sentential paraphrases must fulfill a variety of requirements: preserve the meaning of the original sentence, be grammatical, and meet any stylistic or task-specific constraints. We analyze common evaluation practices and propose better methods that more accurately measure the quality of output. Often overlooked, robust automatic evaluation methodology is necessary for improving systems, and this work presents new metrics and outlines important considerations for reliably measuring the quality of the generated text

    Supervised extractive summarisation of news events

    This thesis investigates whether the summarisation of news-worthy events can be improved by using evidence about entities (i.e. people, places, and organisations) involved in the events. More effective event summaries, which better assist people with their news-based information access requirements, can help to reduce information overload in today's 24-hour news culture. Summaries are based on sentences extracted verbatim from news articles about the events. Within a supervised machine learning framework, we propose a series of entity-focused event summarisation features. Computed over multiple news articles discussing a given event, such entity-focused evidence estimates: the importance of entities within events; the significance of interactions between entities within events; and the topical relevance of entities to events. The statement of this research is that, by augmenting supervised summarisation models trained on discriminative multi-document newswire summarisation features with evidence about the named entities involved in the events, through the integration of entity-focused event summarisation features, we will obtain more effective summaries of news-worthy events. The proposed entity-focused event summarisation features are thoroughly evaluated over two multi-document newswire summarisation scenarios. The first scenario is used to evaluate the retrospective event summarisation task, where the goal is to summarise an event to date, based on a static set of news articles discussing the event. The second scenario is used to evaluate the temporal event summarisation task, where the goal is to summarise the changes in an ongoing event, based on a time-stamped stream of news articles discussing the event. The contributions of this thesis are two-fold. First, this thesis investigates the utility of entity-focused event evidence for identifying important and salient event summary sentences, and as a means to perform anti-redundancy filtering to control the volume of content emitted as a summary of an evolving event. Second, this thesis also investigates the validity of automatic summarisation evaluation metrics, the effectiveness of standard summarisation baselines, and the effective training of supervised machine learned summarisation models.
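
    To make the first kind of evidence concrete, the sketch below computes one hypothetical entity-focused feature: how prominent a candidate sentence's entities are across all articles covering the event. The (token, label) data layout and both feature definitions are illustrative assumptions rather than the thesis's actual feature set.

```python
from collections import Counter

def entity_importance(documents):
    """documents: a list of articles about one event; each article is a list of
    sentences, and each sentence a list of (token, entity_label) pairs with "O"
    marking non-entity tokens. Returns how often each entity token is mentioned."""
    counts = Counter()
    for article in documents:
        for sentence in article:
            counts.update(tok for tok, label in sentence if label != "O")
    return counts

def entity_features(sentence, importance):
    """Two crude features for a candidate summary sentence: the average corpus-wide
    mention count of its entities, and the fraction of its tokens that are entities."""
    entities = [tok for tok, label in sentence if label != "O"]
    if not entities or not sentence:
        return [0.0, 0.0]
    avg_importance = sum(importance[e] for e in entities) / len(entities)
    coverage = len(entities) / len(sentence)
    return [avg_importance, coverage]
```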

    Automatic Scaling of Text for Training Second Language Reading Comprehension

    For children learning their first language, reading is one of the most effective ways to acquire new vocabulary. Studies link students who read more with larger and more complex vocabularies. For second language learners, there is a substantial barrier to reading. Even the books written for early first language readers assume a base vocabulary of nearly 7000 word families and a nuanced understanding of grammar. This project will look at ways that technology can help second language learners overcome this high barrier to entry, and the effectiveness of learning through reading for adults acquiring a foreign language. Through the implementation of Dokusha, an automatic graded reader generator for Japanese, this project will explore how advancements in natural language processing can be used to automatically simplify text for extensive reading in Japanese as a foreign language

    Graph-based Patterns for Local Coherence Modeling

    Coherence is an essential property of well-written texts. It distinguishes a multi-sentence text from a sequence of randomly strung sentences. The task of local coherence modeling is about the way that sentences in a text link up with one another. Solving this task is beneficial for assessing the quality of texts. Moreover, a coherence model can be integrated into text generation systems such as text summarizers to produce coherent texts. In this dissertation, we present a graph-based approach to local coherence modeling that accounts for the connectivity structure among sentences in a text. Graphs give our model the capability to take into account relations between non-adjacent sentences as well as those between adjacent sentences. In addition, the connectivity style among nodes in graphs reflects the relationships among sentences in a text. We first employ the entity graph approach, proposed by Guinaudeau and Strube (2013), to represent a text via a graph. In the entity graph representation of a text, nodes encode sentences and edges depict the existence of a pair of coreferent mentions in sentences. We then devise graph-based features to capture the connectivity structure of nodes in a graph, and accordingly the connectivity structure of sentences in the corresponding text. We extract all subgraphs of entity graphs as features which encode the connectivity structure of graphs. Frequencies of subgraphs correlate with the perceived coherence of their corresponding texts. Therefore, we refer to these subgraphs as coherence patterns. In order to complete our approach to coherence modeling, we propose a new graph representation of texts, rather than the entity graph. Our approach employs lexico-semantic relations among words in sentences, instead of only entity coreference relations, to model relationships between sentences via a graph. This new lexical graph representation of text, together with our method for mining coherence patterns, constitutes our coherence model. We evaluate our approach on the readability assessment task because a primary factor of readability is coherence. Coherent texts are easy to read and consequently demand less effort from their readers. Our extensive experiments on two separate readability assessment datasets show that frequencies of coherence patterns in texts correlate with the readability ratings assigned by human judges. By training a machine learning method on our coherence patterns, our model outperforms its counterparts on ranking texts with respect to their readability. As one of the ultimate goals of coherence models is to be used in text generation systems, we show how our coherence patterns can be integrated into a graph-based text summarizer to produce informative and coherent summaries. Our coherence patterns improve the performance of the summarization system based on both standard summarization metrics and human evaluations. An implementation of the approaches discussed in this dissertation is publicly available.
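
    The sketch below illustrates the graph construction and pattern counting described above in a much-reduced form: sentences are given as sets of entity mentions (shared mentions standing in for coreference), and the connectivity of every three-sentence subgraph is counted as a coherence feature. The graph definition and the restriction to three-node patterns are simplifying assumptions, not the dissertation's exact method.

```python
from itertools import combinations
import networkx as nx

def entity_graph(sentences):
    """sentences: list of sets of entity mentions, one set per sentence.
    Nodes are sentence indices; an edge links two sentences sharing a mention."""
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i, j in combinations(range(len(sentences)), 2):
        if sentences[i] & sentences[j]:
            g.add_edge(i, j)
    return g

def three_sentence_pattern_counts(g):
    """Count, over all triples of sentences, how many pairs among them are linked
    (0 to 3 edges). Frequencies of these connectivity patterns serve as features."""
    counts = {0: 0, 1: 0, 2: 0, 3: 0}
    for triple in combinations(g.nodes, 3):
        edges = sum(1 for a, b in combinations(triple, 2) if g.has_edge(a, b))
        counts[edges] += 1
    return counts
```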

    ALens: An Adaptive Domain-Oriented Abstract Writing Training Tool for Novice Researchers

    The significance of novice researchers acquiring proficiency in writing abstracts has been extensively documented in the field of higher education, where they often encounter challenges in this process. Traditionally, students have been advised to enroll in writing training courses as a means to develop their abstract writing skills. Nevertheless, this approach frequently falls short in providing students with personalized and adaptable feedback on their abstract writing. To address this gap, we initially conducted a formative study to ascertain the user requirements for an abstract writing training tool. Subsequently, we proposed a domain-specific abstract writing training tool called ALens, which employs rhetorical structure parsing to identify key concepts, evaluates abstract drafts based on linguistic features, and uses visualization techniques to analyze the writing patterns of exemplary abstracts. A comparative user study involving an alternative abstract writing training tool has been conducted to demonstrate the efficacy of our approach.
    Comment: Accepted by HHME/CHCI 202

    Principled Approaches to Automatic Text Summarization

    Automatic text summarization is a particularly challenging Natural Language Processing (NLP) task involving natural language understanding, content selection and natural language generation. In this thesis, we concentrate on the content selection aspect, the inherent problem of summarization, which is controlled by the notion of information Importance. We present a simple and intuitive formulation of the summarization task as two components: a summary scoring function θ measuring how good a text is as a summary of the given sources, and an optimization technique O extracting a summary with a high score according to θ. This perspective offers interesting insights over previous summarization efforts and allows us to pinpoint promising research directions. In particular, we realize that previous works heavily constrained the summary scoring function in order to solve convenient optimization problems (e.g., Integer Linear Programming). We question this assumption and demonstrate that General Purpose Optimization (GPO) techniques like genetic algorithms are practical. These GPOs do not require mathematical properties from the objective function and, thus, the summary scoring function can be relieved from its previously imposed constraints. Additionally, the summary scoring function can be evaluated on its own based on its ability to correlate with humans. This offers a principled way of examining the inner workings of summarization systems and complements the traditional evaluations of the extracted summaries. In fact, evaluation metrics are also summary scoring functions which should correlate well with humans. Thus, the two main challenges of summarization, the evaluation and the development of summarizers, are unified within the same setup: discovering strong summary scoring functions. Hence, we investigated ways of uncovering such functions. First, we conducted an empirical study of learning the summary scoring function from data. The results show that an unconstrained summary scoring function is better able to correlate with humans. Furthermore, an unconstrained summary scoring function optimized approximately with GPO extracts better summaries than a constrained summary scoring function optimized exactly with, e.g., ILP. Along the way, we proposed techniques to leverage the small and biased human judgment datasets. Additionally, we released a new evaluation metric explicitly trained to maximize its correlation with humans. Second, we developed a theoretical formulation of the notion of Importance. In a framework rooted in information theory, we defined the quantities: Redundancy, Relevance and Informativeness. Importance arises as the notion unifying these concepts. More generally, Importance is the measure that guides which choices to make when information must be discarded. Finally, evaluation remains an open problem with a massive impact on summarization progress. Thus, we conducted experiments on available human judgment datasets commonly used to compare evaluation metrics. We discovered that these datasets do not cover the high-quality range in which summarization systems and evaluation metrics operate. This motivates efforts to collect human judgments for high-scoring summaries as this would be necessary to settle the debate over which metric to use. This would also be greatly beneficial for improving summarization systems and metrics alike.
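
    One practical consequence of this framing is that a candidate summary scoring function can be validated on its own, before any summarizer is built around it, by checking how well it correlates with human judgments. The sketch below assumes a scoring function theta, a set of candidate summaries, and matching human scores; the choice of correlation statistics is a common convention, not the thesis's exact protocol.

```python
from scipy.stats import kendalltau, pearsonr

def score_correlation(theta, summaries, human_scores):
    """Correlate a summary scoring function with human judgments over a set of
    candidate summaries of the same source documents."""
    system_scores = [theta(s) for s in summaries]
    tau, _ = kendalltau(system_scores, human_scores)
    r, _ = pearsonr(system_scores, human_scores)
    return {"kendall_tau": tau, "pearson_r": r}
```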