10 research outputs found

    Anaphora in Czech: Large Data and Experiments with Automatic Anaphora Resolution

    Pattern Based Information Extraction System in Business News Articles

    Business news journals provide a rich resource of business events, which enable domain experts to further understand the spatio-temporal changes that occur among a set of firms and people. However, extracting structured data from journal sources that are text-based and unstructured is a non-trivial challenge. This project designs and implements a Business Information Extraction System, which combines advanced natural language processing (NLP) tools and knowledge-based extraction patterns to automatically process and extract information on target business events from news journals. The performance evaluation of the proposed system suggests that IE techniques work well for business event extraction and that it is promising to apply the technique to extract more types of business events. Master of Science in Information Scienc

    Resolving pronominal anaphora using commonsense knowledge

    Coreference resolution is the task of resolving all expressions in a text that refer to the same entity. Such expressions are often used in writing and speech as shortcuts to avoid repetition. The most frequent form of coreference is the anaphor. Resolving anaphora requires not only grammatical and syntactic strategies but also semantic approaches. This dissertation presents a framework for automatically resolving pronominal anaphora by integrating recent findings from the field of linguistics with new semantic features. Commonsense knowledge is the routine knowledge people have of the everyday world. Because such knowledge is widely shared, it is frequently omitted from social communications such as texts. Without this knowledge, computers have difficulty making sense of textual information. In this dissertation, a new set of computational and linguistic features is used in a supervised learning approach to resolve the pronominal anaphora in a document. Commonsense knowledge sources such as ConceptNet and WordNet are used, and similarity measures are extracted to uncover the elaborative information embedded in the words that can help in the process of anaphora resolution. The anaphoric system is tested on 350 Wall Street Journal articles from the BBN corpus. When compared with other available systems such as BART (Versley et al. 2008) and Charniak and Elsner (2009), our system performed better and also resolved a much wider range of anaphora. We were able to achieve a 92% F-measure on the BBN corpus and an average of 85% F-measure when tested on other genres of documents, such as children's stories and short stories selected from the web.

    'Healthy' Coreference: Applying Coreference Resolution to the Health Education Domain

    This thesis investigates coreference and its resolution within the domain of health education. Coreference is the relationship between two linguistic expressions that refer to the same real-world entity, and resolution involves identifying this relationship among sets of referring expressions. The coreference resolution task is considered among the most difficult of problems in Artificial Intelligence; in some cases, resolution is impossible even for humans. For example, "she" in the sentence "Lynn called Jennifer while she was on vacation" is genuinely ambiguous: the vacationer could be either Lynn or Jennifer. There are three primary motivations for this thesis. The first is that health education has never before been studied in this context. So far, the vast majority of coreference research has focused on news. Secondly, achieving domain-independent resolution is unlikely without understanding the extent to which coreference varies across different genres. Finally, coreference pervades language and is an essential part of coherent discourse. Its effective use is a key component of easy-to-understand health education materials, where readability is paramount. No suitable corpus of health education materials existed, so our first step was to create one. The comprehensive analysis of this corpus, which required manual annotation of coreference, confirmed our hypothesis that the coreference used in health education differs substantially from that in previously studied domains. This analysis was then used to shape the design of a knowledge-lean algorithm for resolving coreference. This algorithm performed surprisingly well on this corpus, e.g., successfully resolving over 85% of all pronouns when evaluated on unseen data. Despite the importance of coreferentially annotated corpora, only a handful are known to exist, likely because of the difficulty and cost of reliably annotating coreference. 
The paucity of genres represented in these existing annotated corpora creates an implicit bias in domain-independent coreference resolution. In an effort to address these issues, we plan to make our health education corpus available to the wider research community, hopefully encouraging a broader focus in the future.

    Extracting information from fiction

    Information Extraction (IE) based techniques have great potential to enable companies to leverage valuable information embedded in unstructured textual data. Such data could be exploited to help drive sales and to enhance the customer's experience when searching or browsing for products. Extensive research has been performed in the field of IE; however, to date no work has been directly applied to the domain of fiction. The aim of this study is to explore the ability of IE techniques to extract the central characters and their relationships from the full textual content of works of fiction. To begin our investigation, we present a collection of hypotheses outlining our expectations in approaching and resolving these problems. We then outline our data collection process, which resulted in the creation of a Gold Standard containing ordered lists of characters and their relationships for eight classic book texts. For the task of character extraction, we test two rule-based co-reference resolution models and two ordering techniques. Our best model achieves an average of 100% coverage on the three most important characters and 78.4% across all central characters, compared to a baseline of 73.3% and 57.4% respectively. For the task of relation extraction, we implement rule-based systems to detect the presence and types of relationships between characters. We achieved 73.3% coverage in detecting the top three pairs of characters involved in relationships. The results for inferring relationship types are preliminary. We provide an analysis of relationship mentions in works of fiction and propose a number of approaches for future work.

    Résolution d'anaphores et identification des chaînes de coréférence selon le type de texte [Anaphora resolution and identification of coreference chains by text type]

    Thesis digitized by the Direction des bibliothèques de l'Université de Montréal

    Dating Victorians: an experimental approach to stylochronometry

    A thesis submitted for the degree of Doctor of Philosophy of the University of Luton. The writing style of a number of authors writing in English was empirically investigated for the purpose of detecting stylistic patterns in relation to advancing age. The aim was to identify the type of stylistic markers among lexical, syntactical, phonemic, entropic, character-based, and content ones that would be most able to discriminate between early, middle, and late works of the selected authors, and the classification or prediction algorithm best suited for this task. Two pilot studies were initially conducted. The first one concentrated on Christina Georgina Rossetti and Edgar Allan Poe, from whom personal letters and poetry were selected as the genres of study, along with a limited selection of variables. Results suggested that authors and genre vary inconsistently. The second pilot study was based on Shakespeare's plays, using a wider selection of variables to assess their discriminating power in relation to a past study. It was observed that the selected variables were of satisfactory predictive power, hence judged suitable for the task. Subsequently, four experiments were conducted using the variables tested in the second pilot study and personal correspondence and poetry from two additional authors, Edna St Vincent Millay and William Butler Yeats. Stepwise multiple linear regression and regression trees were selected to deal with the first two prediction experiments, and ordinal logistic regression and artificial neural networks for two classification experiments. The first experiment revealed inconsistency in accuracy of prediction and total number of variables in the final models affected by differences in authorship and genre. The second experiment revealed inconsistencies for the same factors in terms of accuracy only. 
The third experiment showed the total number of variables in the model and the error in the final model to be affected to various degrees by authorship, genre, different variable types, and the order in which the variables had been calculated. The last experiment had all measurements affected by the four factors. Examination of whether differences in method within each task play an important part revealed significant influences of method, authorship, and genre for the prediction problems, whereas all factors including method and various interactions dominated in the classification problems. Given the current data and methods used, as well as the results obtained, generalizable conclusions for the wider author population have been avoided.

    Advances in automatic terminology processing: methodology and applications in focus

    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy. The information and knowledge era, in which we are living, creates challenges in many fields, and terminology is not an exception. The challenges include an exponential growth in the number of specialised documents that are available, in which terms are presented, and the number of newly introduced concepts and terms, which are already beyond our (manual) capacity. A promising solution to this ‘information overload’ would be to employ automatic or semi-automatic procedures to enable individuals and/or small groups to efficiently build high quality terminologies from their own resources which closely reflect their individual objectives and viewpoints. Automatic terminology processing (ATP) techniques have already proved to be quite reliable, and can save human time in terminology processing. However, they are not without weaknesses, one of which is that these techniques often consider terms to be independent lexical units satisfying some criteria, when terms are, in fact, integral parts of a coherent system (a terminology). This observation is supported by the discussion of the notion of terms and terminology and the review of existing approaches in ATP presented in this thesis. In order to overcome the aforementioned weakness, we propose a novel methodology in ATP which is able to extract a terminology as a whole. The proposed methodology is based on knowledge patterns automatically extracted from glossaries, which we considered to be valuable, but overlooked resources. These automatically identified knowledge patterns are used to extract terms, their relations and descriptions from corpora. The extracted information can facilitate the construction of a terminology as a coherent system. 
The study also aims to discuss applications of ATP, and describes an experiment in which ATP is integrated into a new NLP application: multiple-choice test item generation. The successful integration of the system shows that ATP is a viable technology, and should be exploited more by other NLP applications.

    Evaluation tool for rule-based anaphora resolution methods
