678 research outputs found

    Sentence Simplification for Text Processing

    Get PDF
    A thesis submitted in partial fulfilment of the requirement of the University of Wolverhampton for the degree of Doctor of Philosophy.Propositional density and syntactic complexity are two features of sentences which affect the ability of humans and machines to process them effectively. In this thesis, I present a new approach to automatic sentence simplification which processes sentences containing compound clauses and complex noun phrases (NPs) and converts them into sequences of simple sentences which contain fewer of these constituents and have reduced per sentence propositional density and syntactic complexity. My overall approach is iterative and relies on both machine learning and handcrafted rules. It implements a small set of sentence transformation schemes, each of which takes one sentence containing compound clauses or complex NPs and converts it one or two simplified sentences containing fewer of these constituents (Chapter 5). The iterative algorithm applies the schemes repeatedly and is able to simplify sentences which contain arbitrary numbers of compound clauses and complex NPs. The transformation schemes rely on automatic detection of these constituents, which may take a variety of forms in input sentences. In the thesis, I present two new shallow syntactic analysis methods which facilitate the detection process. The first of these identifies various explicit signs of syntactic complexity in input sentences and classifies them according to their specific syntactic linking and bounding functions. I present the annotated resources used to train and evaluate this sign tagger (Chapter 2) and the machine learning method used to implement it (Chapter 3). The second syntactic analysis method exploits the sign tagger and identifies the spans of compound clauses and complex NPs in input sentences. In Chapter 4 of the thesis, I describe the development and evaluation of a machine learning approach performing this task. This chapter also presents a new annotated dataset supporting this activity. In the thesis, I present two implementations of my approach to sentence simplification. One of these exploits handcrafted rule activation patterns to detect different parts of input sentences which are relevant to the simplification process. The other implementation uses my machine learning method to identify compound clauses and complex NPs for this purpose. Intrinsic evaluation of the two implementations is presented in Chapter 6 together with a comparison of their performance with several baseline systems. The evaluation includes comparisons of system output with human-produced simplifications, automated estimations of the readability of system output, and surveys of human opinions on the grammaticality, accessibility, and meaning of automatically produced simplifications. Chapter 7 presents extrinsic evaluation of the sentence simplification method exploiting handcrafted rule activation patterns. The extrinsic evaluation involves three NLP tasks: multidocument summarisation, semantic role labelling, and information extraction. Finally, in Chapter 8, conclusions are drawn and directions for future research considered

    An evaluation of syntactic simplification rules for people with autism

    Get PDF
    Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR) at the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014)Syntactically complex sentences constitute an obstacle for some people with Autistic Spectrum Disorders. This paper evaluates a set of simplification rules specifically designed for tackling complex and compound sentences. In total, 127 different rules were developed for the rewriting of complex sentences and 56 for the rewriting of compound sentences. The evaluation assessed the accuracy of these rules individually and revealed that fully automatic conversion of these sentences into a more accessible form is not very reliable.EC FP7-ICT-2011-

    Identifying Signs of Syntactic Complexity for Rule-Based Sentence Simplification

    Get PDF
    This article presents a new method to automatically simplify English sentences. The approach is designed to reduce the number of compound clauses and nominally bound relative clauses in input sentences. The article provides an overview of a corpus annotated with information about various explicit signs of syntactic complexity and describes the two major components of a sentence simplification method that works by exploiting information on the signs occurring in the sentences of a text. The first component is a sign tagger which automatically classifies signs in accordance with the annotation scheme used to annotate the corpus. The second component is an iterative rule-based sentence transformation tool. Exploiting the sign tagger in conjunction with other NLP components, the sentence transformation tool automatically rewrites long sentences containing compound clauses and nominally bound relative clauses as sequences of shorter single-clause sentences. Evaluation of the different components reveals acceptable performance in rewriting sentences containing compound clauses but less accuracy when rewriting sentences containing nominally bound relative clauses. A detailed error analysis revealed that the major sources of error include inaccurate sign tagging, the relatively limited coverage of the rules used to rewrite sentences, and an inability to discriminate between various subtypes of clause coordination. Despite this, the system performed well in comparison with two baselines. This finding was reinforced by automatic estimations of the readability of system output and by surveys of readers’ opinions about the accuracy, accessibility, and meaning of this output

    Type-driven semantic interpretation and feature dependencies in R-LFG

    Full text link
    Once one has enriched LFG's formal machinery with the linear logic mechanisms needed for semantic interpretation as proposed by Dalrymple et. al., it is natural to ask whether these make any existing components of LFG redundant. As Dalrymple and her colleagues note, LFG's f-structure completeness and coherence constraints fall out as a by-product of the linear logic machinery they propose for semantic interpretation, thus making those f-structure mechanisms redundant. Given that linear logic machinery or something like it is independently needed for semantic interpretation, it seems reasonable to explore the extent to which it is capable of handling feature structure constraints as well. R-LFG represents the extreme position that all linguistically required feature structure dependencies can be captured by the resource-accounting machinery of a linear or similiar logic independently needed for semantic interpretation, making LFG's unification machinery redundant. The goal is to show that LFG linguistic analyses can be expressed as clearly and perspicuously using the smaller set of mechanisms of R-LFG as they can using the much larger set of unification-based mechanisms in LFG: if this is the case then we will have shown that positing these extra f-structure mechanisms is not linguistically warranted.Comment: 30 pages, to appear in the the ``Glue Language'' volume edited by Dalrymple, uses tree-dvips, ipa, epic, eepic, fullnam

    Argumentation Mining in User-Generated Web Discourse

    Full text link
    The goal of argumentation mining, an evolving research field in computational linguistics, is to design methods capable of analyzing people's argumentation. In this article, we go beyond the state of the art in several ways. (i) We deal with actual Web data and take up the challenges given by the variety of registers, multiple domains, and unrestricted noisy user-generated Web discourse. (ii) We bridge the gap between normative argumentation theories and argumentation phenomena encountered in actual data by adapting an argumentation model tested in an extensive annotation study. (iii) We create a new gold standard corpus (90k tokens in 340 documents) and experiment with several machine learning methods to identify argument components. We offer the data, source codes, and annotation guidelines to the community under free licenses. Our findings show that argumentation mining in user-generated Web discourse is a feasible but challenging task.Comment: Cite as: Habernal, I. & Gurevych, I. (2017). Argumentation Mining in User-Generated Web Discourse. Computational Linguistics 43(1), pp. 125-17

    Automatic Scaling of Text for Training Second Language Reading Comprehension

    Get PDF
    For children learning their first language, reading is one of the most effective ways to acquire new vocabulary. Studies link students who read more with larger and more complex vocabularies. For second language learners, there is a substantial barrier to reading. Even the books written for early first language readers assume a base vocabulary of nearly 7000 word families and a nuanced understanding of grammar. This project will look at ways that technology can help second language learners overcome this high barrier to entry, and the effectiveness of learning through reading for adults acquiring a foreign language. Through the implementation of Dokusha, an automatic graded reader generator for Japanese, this project will explore how advancements in natural language processing can be used to automatically simplify text for extensive reading in Japanese as a foreign language

    Proceedings of the 3rd Swiss conference on barrier-free communication (BfC 2020)

    Get PDF

    SemClinBr -- a multi institutional and multi specialty semantically annotated corpus for Portuguese clinical NLP tasks

    Full text link
    The high volume of research focusing on extracting patient's information from electronic health records (EHR) has led to an increase in the demand for annotated corpora, which are a very valuable resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multi-purpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field. In this study, we developed a semantically annotated corpus using clinical texts from multiple medical specialties, document types, and institutions. We present the following: (1) a survey listing common aspects and lessons learned from previous research, (2) a fine-grained annotation schema which could be replicated and guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations. The result of this work is the SemClinBr, a corpus that has 1,000 clinical notes, labeled with 65,117 entities and 11,263 relations, and can support a variety of clinical NLP tasks and boost the EHR's secondary use for the Portuguese language

    Bridging Cross-Modal Alignment for OCR-Free Content Retrieval in Scanned Historical Documents

    Get PDF
    In this work, we address the limitations of current approaches to document retrieval by incorporating vision-based topic extraction. While previous methods have primarily focused on visual elements or relied on optical character recognition (OCR) for text extraction, we propose a paradigm shift by directly incorporating vision into the topic space. We demonstrate that recognizing all visual elements within a document is unnecessary for identifying its underlying topic. Visual cues such as icons, writing style, and font can serve as sufficient indicators. By leveraging ranking loss functions and convolutional neural networks (CNNs), we learn complex topological representations that mimic the behavior of text representations. Our approach aims to eliminate the need for OCR and its associated challenges, including efficiency, performance, data-hunger, and expensive annotation. Furthermore, we highlight the significance of incorporating vision in historical documentation, where visually antiquated documents contain valuable cues. Our research contributes to the understanding of topic extraction from a vision perspective and offers insights into annotation-cheap document retrieval system
    • …
    corecore