15 research outputs found

    Discourse analysis of Arabic documents and application to automatic summarization

    Within a discourse, texts and conversations are not just a juxtaposition of words and sentences. They are rather organized in a structure in which discourse units are related to each other so as to ensure both discourse coherence and cohesion. Discourse structure has been shown to be useful in many NLP applications, including machine translation, natural language generation and language technology in general. The usefulness of discourse in NLP applications depends mainly on the availability of powerful discourse parsers. To build such parsers and improve their performance, several resources have been manually annotated with discourse information within different theoretical frameworks. Most available resources are in English. Recently, several efforts have been undertaken to develop manually annotated discourse resources for other languages such as Chinese, German, Turkish, Spanish and Hindi. Surprisingly, discourse processing in Modern Standard Arabic (MSA) has received less attention, despite the fact that MSA is a language with more than 422 million speakers in 22 countries.
    Computational processing of the Arabic language has received great attention in the literature for over twenty years. Several resources and tools have been built to deal with Arabic non-concatenative morphology and Arabic syntax, going from shallow to deep parsing. However, the field is still largely vacant at the layer of discourse. As far as we know, the sole effort towards Arabic discourse processing was done in the Leeds Arabic Discourse Treebank, which extends the Penn Discourse TreeBank model to MSA. In this thesis, we propose to go beyond the annotation of explicit relations that link adjacent units by completely specifying the semantic scope of each discourse relation, making transparent an interpretation of the text that takes into account the semantic effects of discourse relations. In particular, we propose the first effort towards a semantically driven analysis of Arabic texts following Segmented Discourse Representation Theory (SDRT), a framework that has previously been studied for English, French and German but never for Arabic. Our main contributions are:
    - A study of the feasibility of building recursive and complete discourse structures for Arabic texts. In particular, we propose: an annotation scheme for the full discourse coverage of Arabic texts, in which each constituent is linked to other constituents, so that a document is represented by a directed acyclic graph capturing explicit and implicit relations as well as complex discourse phenomena such as long-distance attachments, long-distance discourse pop-ups and crossed dependencies; a novel discourse relation hierarchy, in which we study rhetorical relations from a semantic point of view, focusing on their effect on meaning rather than on how they are lexically triggered by discourse connectives, which are often ambiguous, especially in Arabic; and a thorough quantitative analysis (in terms of discourse connectives, relation frequencies, proportion of implicit relations, etc.) and qualitative analysis (inter-annotator agreement and error analysis) of the annotation campaign.
    - An automatic discourse parser, in which we investigate both automatic segmentation of Arabic texts into elementary discourse units and automatic identification of explicit and implicit Arabic discourse relations.
    - An application of our discourse parser to Arabic text summarization, where we compare tree-based vs. graph-based discourse representations for producing indicative summaries and show that the full discourse coverage of a document is definitely a plus.
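    The contrast between tree- and graph-based discourse representations can be made concrete with a small sketch. The toy code below illustrates the general idea only, not the parser or summarizer built in the thesis: the EDU texts, relation labels and the centrality heuristic are all invented for the example. It represents a document as a directed acyclic graph of elementary discourse units (EDUs) and selects the most central units as an indicative summary.

```python
# Toy illustration: a document as a DAG of elementary discourse units (EDUs).
# All texts, relation labels and the scoring heuristic are invented; the
# thesis's actual parser and summarizer are not reproduced here.
from collections import defaultdict

edus = {
    1: "The ministry announced new measures.",
    2: "Prices had risen sharply,",
    3: "so subsidies were extended,",
    4: "which analysts welcomed.",
}

# (head, relation, dependent) edges; SDRT-style labels used loosely.
relations = [
    (1, "Elaboration", 2),
    (2, "Result", 3),
    (3, "Commentary", 4),
]

def salience_scores(edus, relations):
    """Score each EDU by how many units it dominates, directly or
    transitively: a crude stand-in for centrality in the discourse graph."""
    children = defaultdict(list)
    for head, _, dep in relations:
        children[head].append(dep)

    def coverage(unit, seen=None):
        seen = set() if seen is None else seen
        seen.add(unit)
        for dep in children[unit]:
            if dep not in seen:
                coverage(dep, seen)
        return len(seen)

    return {unit: coverage(unit) for unit in edus}

def indicative_summary(edus, relations, k=2):
    scores = salience_scores(edus, relations)
    top = sorted(edus, key=lambda u: (-scores[u], u))[:k]
    return " ".join(edus[u] for u in sorted(top))

print(indicative_summary(edus, relations))
# -> "The ministry announced new measures. Prices had risen sharply,"
```

    In a tree, each unit has a single parent, so multi-parent attachments and crossed dependencies of the kind the annotation scheme captures cannot be expressed; a DAG keeps them, which is what the tree-vs-graph comparison above turns on.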

    An information-based approach to punctuation

    Ankara: Department of Computer Engineering and Information Science and the Institute of Engineering and Science of Bilkent University, 1998. Thesis (Ph.D.) -- Bilkent University, 1998. Includes bibliographical references (leaves 83-93). Say, Bilge. Ph.D.

    The Automatic Acquisition of Knowledge about Discourse Connectives

    Institute for Communicating and Collaborative Systems. This thesis considers the automatic acquisition of knowledge about discourse connectives. It focuses in particular on their semantic properties, and on the relationships that hold between them. There is a considerable body of theoretical and empirical work on discourse connectives. For example, Knott (1996) motivates a taxonomy of discourse connectives based on relationships between them, such as HYPONYMY and EXCLUSIVE, which are defined in terms of substitution tests. Such work requires either great theoretical insight or manual analysis of large quantities of data. As a result, to date no manual classification of English discourse connectives has achieved complete coverage. For example, Knott gives relationships between only about 18% of pairs obtained from a list of 350 discourse connectives. This thesis explores the possibility of classifying discourse connectives automatically, based on their distributions in texts. This thesis demonstrates that state-of-the-art techniques in lexical acquisition can successfully be applied to acquiring information about discourse connectives. Central to this thesis is the hypothesis that distributional similarity correlates positively with semantic similarity. Support for this hypothesis has previously been found for word classes such as nouns and verbs (Miller and Charles, 1991; Resnik and Diab, 2000, for example), but there has been little exploration of the degree to which it also holds for discourse connectives. We investigate the hypothesis through a number of machine learning experiments. These experiments all use unsupervised learning techniques, in the sense that they do not require any manually annotated data, although they do make use of an automatic parser. First, we show that a range of semantic properties of discourse connectives, such as polarity and veridicality (whether or not the semantics of a connective involves some underlying negation, and whether the connective implies the truth of its arguments, respectively), can be acquired automatically with a high degree of accuracy. Second, we consider the tasks of predicting the similarity and substitutability of pairs of discourse connectives. To assist in this, we introduce a novel information theoretic function based on variance that, in combination with distributional similarity, is useful for learning such relationships. Third, we attempt to automatically construct taxonomies of discourse connectives capturing substitutability relationships. We introduce a probability model of taxonomies, and show that this can improve accuracy on learning substitutability relationships. Finally, we develop an algorithm for automatically constructing or extending such taxonomies which uses beam search to help find the optimal taxonomy.
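    The central hypothesis, that distributional similarity tracks semantic similarity for connectives, can be illustrated with a toy sketch. The corpus, context window and bag-of-neighbouring-words features below are invented for the example; the thesis itself derives its features from an automatic parser.

```python
# Toy sketch of the distributional-similarity hypothesis for connectives.
# Corpus and features are invented; the thesis uses parser-derived features.
import math
from collections import Counter

corpus = [
    "he was tired but he kept working",
    "she was late although she left early",
    "he stayed home because it was raining",
    "he was ill but he attended although he was ill",
]

CONNECTIVES = {"but", "although", "because"}

def context_vectors(sentences, window=2):
    """Count the words occurring within `window` tokens of each connective."""
    vectors = {c: Counter() for c in CONNECTIVES}
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t in CONNECTIVES:
                lo, hi = max(0, i - window), min(len(toks), i + window + 1)
                vectors[t].update(w for j, w in enumerate(toks[lo:hi], lo)
                                  if j != i)
    return vectors

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = context_vectors(corpus)
print(cosine(vecs["but"], vecs["although"]))  # ~0.67: contrastive pair
print(cosine(vecs["but"], vecs["because"]))   # ~0.29: contrastive vs causal
```

    On this toy corpus the two contrastive connectives already end up closer to each other than to the causal one; in the thesis, such similarity scores feed the substitutability and taxonomy-learning experiments.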

    A study of the summarizing strategies used by ESL first year science students at the University of Botswana

    One of the major problems faced by speakers of English as a second language (ESL) or non-native speakers of English (NNS) is that when they go to college or university, they find themselves without sufficient academic literacy skills to enable them to navigate their learning successfully, such as the ability to summarize textual material. This thesis examines the summarizing strategies used by ESL first year science students at the University of Botswana. Using multiple data collection methods, otherwise known as triangulation or pluralistic research, which is a combination of quantitative and qualitative methods, one hundred and twenty randomly sampled students completed questionnaires and summarized a scientific text. In order to observe the students more closely, nine students (3 high-, 3 average- and 3 low-proficiency) were purposively selected from the sample and wrote a further summary. The nine students were later interviewed in order to find out from them the kinds of strategies they had used in summarizing the texts. To obtain systematic data, the summaries and the taped interviews were coded and analyzed using a hybrid scoring classification previously used by other researchers. The results from the Likert-type questionnaire suggest that the ESL first year science students are 'aware' of the appropriate reading, production and self-assessment strategies to use when summarizing. However, when the data from the questionnaire were cross-checked against the strategies they had used in the actual summarization of the text, most of their claims, especially those of the low-proficiency students, were not sustained. As a whole, the results show that high-proficiency students produce more accurate idea units and are more capable of generalizing ideas than low-proficiency students, who prefer to "cut and paste" ideas. There are also significant differences between high- and low-proficiency students in the manner in which they decode the text: low-proficiency students produce more distortions in their summaries than high-proficiency students, who generally give accurate information. Similarly, high-proficiency students are able to sort out global ideas from a labyrinth of localized ideas, unlike average- and low-proficiency students, who include trivial information. The same trend is observed with paraphrasing and sentence combination: high-proficiency students are generally able to recast and coordinate their ideas, unlike low-proficiency students, who produce run-on ideas. In terms of the discrete cognitive and meta-cognitive skills preferred by students, low-proficiency students are noticeably unable to exploit pre-summarizing cognitive strategies such as discriminating, selecting, note-making, grouping, inferring meanings of new words and using synonyms to convey the intended meanings. There are also greater differences between high- and low-proficiency students when it comes to the use of meta-cognitive strategies. Unlike high-proficiency students, who use their reservoir of meta-cognitive skills such as self-judgment, low-proficiency students ostensibly find it difficult to direct their summaries to the demands of the task and are unable to check the accuracy of their summaries. The findings also show that some of the high-proficiency students and many average- and low-proficiency students distort idea units, find it difficult to use their own words and cannot distinguish between main and supporting details.
This resulted in the production of circuitous summaries that often failed to capture the gist of the argument. The way the students processed the main ideas also reveals an inherent weakness: most students, across proficiency levels, were unable to combine ideas from different paragraphs to produce a coherent text. Not surprisingly, then, many of the summaries produced by both high- and low-proficiency students were too long. To tackle some of the problems related to summarization, pre-reading strategies can be taught which activate relevant prior knowledge, so that the learning of new knowledge can be facilitated. During the reading process students can become more meta-cognitively aware by monitoring their level of understanding of the text, using, for example, the strategy suggested by Schraw (1998) of "stop, read and think". Text analysis can also be used to help students identify the main themes or macro-propositions in a text, and hence gain a more global perspective of the content, which is important for selecting the main ideas in a text. A particularly useful approach to fostering a deeper understanding of content is to use a form of reciprocal or peer-mediated teaching, in which students in pairs can articulate to each other their understanding of the main ideas expressed in the text. As part of the solution to the problems faced by students when processing information, we need to take Sewlall's (2000: 170) advice that there should be "a paradigm shift in the learning philosophy from content-based to an emphasis on the acquisition of skills". In this regard, both content and ESL teachers need to train their students in the explicit use of summarizing strategies, and to plan interwoven lessons and learning activities that develop the learners' intellectual ways of dealing with different learning problems so that they can make learning quicker, easier, more effective and more exciting.

    Great expectations: unsupervised inference of suspense, surprise and salience in storytelling

    Stories interest us not because they are a sequence of mundane and predictable events but because they have drama and tension. Crucial to creating dramatic and exciting stories are surprise and suspense. Likewise, certain events are key to the plot and more important than others; this importance is referred to as salience. Inferring suspense, surprise and salience is highly challenging for computational systems, because all these elements require a strong comprehension of the characters and their motivations, places, changes over time, and the cause and effect of complex interactions. Recent advances in machine learning (often called deep learning) have substantially improved performance on many language-related tasks, including story comprehension and story writing. Most of these systems rely on supervision; that is, huge numbers of people need to tag large quantities of data to tell these systems what to learn. An example would be tagging which events are suspenseful. This is highly inflexible and costly. Instead, this thesis trains a series of deep learning models solely by reading stories, a self-supervised (or unsupervised) approach. Narrative theory methods (rules and procedures) are applied to the knowledge built into the deep learning models to directly infer suspense, surprise and salience in stories. Extensions add memory and external knowledge from story plots and from Wikipedia to infer salience on novels such as Great Expectations and plays such as Macbeth. Other work adapts the models as a planning system for generating new stories. The thesis finds that applying narrative theory to deep learning models can produce inferences that align with those of a typical reader. In follow-up work, these insights could help improve computer models for tasks such as automatic story writing, assistance for writing, and summarising or editing stories. Moreover, the approach of applying narrative theory to the knowledge inherent in a system that teaches itself (self-supervised) by reading books, watching videos, or listening to audio is much cheaper and more adaptable to other domains and tasks. Progress in improving self-supervised systems is swift. As such, the thesis's relevance is that applying domain expertise to these systems may be a more productive approach in many areas where machine learning is applied.
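    To make these quantities concrete, the sketch below computes surprise and suspense over story-state embeddings, in the spirit of the narrative-theory definitions such work applies: surprise as the distance between consecutive states, suspense as the expected distance to plausible next states. The embeddings and continuation probabilities are random placeholders, and this is not the thesis's actual model.

```python
# Schematic sketch, not the thesis's model: surprise and suspense over
# story-state embeddings. Vectors and probabilities are random placeholders.
import numpy as np

def surprise(prev_state: np.ndarray, state: np.ndarray) -> float:
    # How far the story "jumped" from one sentence to the next.
    return float(np.linalg.norm(state - prev_state))

def suspense(state: np.ndarray, continuations, probs) -> float:
    # Expected jump to possible next states, weighted by their probability.
    return float(sum(p * np.linalg.norm(c - state)
                     for c, p in zip(continuations, probs)))

rng = np.random.default_rng(0)
s_prev, s_now = rng.normal(size=8), rng.normal(size=8)
futures = [rng.normal(size=8) for _ in range(3)]
print(surprise(s_prev, s_now))
print(suspense(s_now, futures, [0.5, 0.3, 0.2]))
```

    In a self-supervised setting, the state vectors would come from a model trained only to read stories, so no suspense or surprise labels are ever needed.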

    A genre analysis of medical research articles

    Hospitals and other health institutions around the world have begun to tie staff promotion and careers to publication; accordingly, an increasing number of medical journal articles are being written by non-native English speakers and novice writers. This work aims to analyse medical journal articles as a genre, and follows Swales' (1990) framework for doing so, by interviewing a sample of the discourse community and identifying the Rhetorical Moves that make up the genre, with additional investigation of stance, via selected reporting verbs, and cohesion, through selected discourse markers. I compiled one of the larger corpora of medical research articles (250), as well as one of the most recent (2001-2011); previous studies reviewed 50 articles at most, drawn from earlier periods of time. As part of the examination of the genre, this study includes discussions with a sample of the discourse community, the users of the genre, through interviews with ten doctors and five editors from around the world who have a wide range of experience in writing, publishing and editing articles. In addition, I identified 17 Rhetorical Moves, four of them considered optional, with the aim of identifying a sequence that writers and educators can use to see how a medical article may be written. I also examined 13 reporting verbs to determine whether it is possible to identify authorial stance towards the information being reported; the verbs were coded as factive (the authors agreed with the information), non-factive (the authors conveyed no judgement on the information) or counter-factive (the authors disagreed with the information being reported). Finally, the study looked at how cohesion is maintained, through examples of the five types of discourse markers. This study presents the most comprehensive examination of the genre to date, which, through the utilization of corpus analysis techniques, allows a more in-depth analysis than previous studies.
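    The stance coding can be pictured as a simple lookup over a tokenised sentence. The sketch below is only illustrative: the verb-to-stance assignments are invented examples, not the study's actual coding of its 13 reporting verbs.

```python
# Toy illustration of factive / non-factive / counter-factive stance coding.
# Verb assignments are invented, not the study's actual coding.
STANCE = {
    "demonstrate": "factive",        # author agrees with the reported claim
    "establish": "factive",
    "report": "non-factive",         # author conveys no judgement
    "suggest": "non-factive",
    "overstate": "counter-factive",  # author disagrees with the claim
}

def code_stance(tokens):
    """Label each known reporting verb in a tokenised sentence."""
    return [(t, STANCE[t]) for t in tokens if t in STANCE]

print(code_stance("we report that the results suggest a larger effect".split()))
# -> [('report', 'non-factive'), ('suggest', 'non-factive')]
```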

    DEAL 2022


    Text complexity and text simplification in the crisis management domain

    Because emergency situations can lead to substantial losses, both financial and in terms of human lives, it is essential that texts used in a crisis situation be clearly understandable. This thesis is concerned with the study of the complexity of the crisis management sub-language and with methods to produce new, clear texts and to rewrite pre-existing crisis management documents which are too complex to be understood. In doing so, this interdisciplinary study makes several contributions to the crisis management field. First, it contributes to the knowledge of the complexity of the texts used in the domain by analysing, in a novel corpus of crisis management documents, the presence of a set of written-language complexity issues derived from the psycholinguistic literature. Second, since the text complexity analysis shows that crisis management documents indeed exhibit high numbers of text complexity issues, the thesis adapts controlled-language writing guidelines to English which, when applied to the crisis management language, reduce its complexity and ambiguity, leading to clear text documents. Third, since low quality of communication can have fatal consequences in emergency situations, the proposed controlled-language guidelines and a set of texts which were rewritten according to them are evaluated from multiple points of view. To achieve that, the thesis both applies existing evaluation approaches and develops new methods which are more appropriate for the task. These are used in two evaluation experiments – evaluation on extrinsic tasks and evaluation of users' acceptability. The evaluations on extrinsic tasks (evaluating the impact of the controlled language on text complexity, reading comprehension under stress, manual translation, and machine translation tasks) show a positive impact of the controlled language on simplified documents and thus ensure the quality of the resource. The evaluation of users' acceptability contributes additional findings about manual simplification and helps to determine directions for future implementation. The thesis also gives insight into reading comprehension, machine translation, and cross-language adaptability, and provides original contributions to machine translation, controlled languages, and natural language generation evaluation techniques, which makes it valuable for several scientific fields, including Linguistics, Psycholinguistics, and a number of different sub-fields of NLP. EThOS - Electronic Theses Online Service. GB. United Kingdom.
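    As a rough illustration of the kind of guideline checking involved, the sketch below flags two common controlled-language violations, overlong sentences and passive constructions, in a draft crisis message. The rules and the word-count threshold are invented stand-ins, not the guidelines the thesis actually proposes.

```python
# Illustrative only: a toy controlled-language checker. The rules and the
# MAX_WORDS threshold are invented, not the thesis's actual guidelines.
import re

MAX_WORDS = 20  # assumed sentence-length limit for illustration

def check_sentence(sentence: str):
    issues = []
    words = sentence.split()
    if len(words) > MAX_WORDS:
        issues.append(f"too long ({len(words)} words > {MAX_WORDS})")
    # Crude passive-voice cue: a form of "be" followed by a past participle.
    if re.search(r"\b(is|are|was|were|been|being)\s+\w+(ed|en)\b", sentence):
        issues.append("possible passive voice; prefer active")
    return issues

draft = "Residents are advised that the area should be evacuated immediately."
for issue in check_sentence(draft):
    print(issue)
# -> possible passive voice; prefer active
```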

    Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018: 10-12 December 2018, Torino

    On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall "Cavallerizza Reale". The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.
