8 research outputs found

    A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation

    Automatic extraction and analysis of meaning-related information from natural language data has been an important issue in a number of research areas, such as natural language processing (NLP), text mining, corpus linguistics, and data science. An important aspect of such information extraction and analysis is the semantic annotation of language data using a semantic tagger. In practice, various semantic annotation tools have been designed to carry out different levels of semantic annotation, such as document topics, semantic role labeling, named entities, or events. Currently, the majority of existing semantic annotation tools identify and tag partial core semantic information in language data, but they tend to be applicable only to modern language corpora. While such semantic analyzers have proven useful for various purposes, a semantic annotation tool that is capable of annotating deep semantic senses of all lexical units, or all-words tagging, is still desirable for a deep, comprehensive semantic analysis of language data. With large-scale digitization efforts underway delivering historical corpora with texts dating from the last 400 years, a particularly challenging aspect is the need to adapt the annotation in the face of significant word meaning change over time. In this paper, we report on the development of a new semantic tagger (the Historical Thesaurus Semantic Tagger), and discuss challenging issues we faced in this work. This new semantic tagger is built on existing NLP tools and incorporates a large-scale historical English thesaurus linked to the Oxford English Dictionary. Employing contextual disambiguation algorithms, this tool is capable of annotating lexical units with a historically valid, highly fine-grained semantic categorization scheme that contains about 225,000 semantic concepts and 4,033 thematic semantic categories. In terms of novelty, it is adapted for processing historical English data, with rich information about historical usage of words and a spelling variant normalizer for historical forms of English. Furthermore, it is able to make use of knowledge about the publication date of a text to adapt its output. In our evaluation, the system achieved encouraging accuracies ranging from 77.12% to 91.08% on individual test texts. Applying time-sensitive methods improved results by as much as 3.54% and by 1.72% on average.
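
    As a minimal sketch of the time-sensitive idea described above, and assuming a hypothetical sense inventory (the `Sense` fields and `filter_senses_by_date` helper below are illustrative, not the tagger's actual API), candidate thesaurus senses carrying attestation date ranges can be filtered against a text's publication year before contextual disambiguation picks among the survivors:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sense:
    """Hypothetical candidate sense drawn from a historical thesaurus entry."""
    concept_id: str           # e.g. a thesaurus category code
    first_use: int            # earliest attested year for this sense
    last_use: Optional[int]   # latest attested year, or None if still current

def filter_senses_by_date(senses: list[Sense], pub_year: int) -> list[Sense]:
    """Drop senses not attested at the text's publication date."""
    valid = [s for s in senses
             if s.first_use <= pub_year
             and (s.last_use is None or pub_year <= s.last_use)]
    # If dating removes every candidate, fall back to the full list rather
    # than leave the word untagged.
    return valid or senses
```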

    Novel Event Detection and Classification for Historical Texts

    Event processing is an active area of research in the Natural Language Processing community, but the resources and automatic systems developed so far have mainly addressed contemporary texts. However, the recognition and elaboration of events is a crucial step when dealing with historical texts, particularly in the current era of massive digitization of historical sources: research in this domain can lead to the development of methodologies and tools that can assist historians in enhancing their work, while also having an impact on the field of Natural Language Processing. Our work aims to shed light on the complex concept of events when dealing with historical texts. More specifically, we introduce new annotation guidelines for event mentions and types, categorised into 22 classes. Then, we annotate a historical corpus accordingly, and compare two approaches for automatic event detection and classification following this novel scheme. We believe that this work can foster research in a field of inquiry so far underestimated in the area of Temporal Information Processing. To this end, we release new annotation guidelines, a corpus and new models for automatic annotation.
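
    As a rough illustration of the annotation scheme's shape (the field names and comments below are placeholders, not the released schema), an annotated event mention can be represented as a typed span plus one of the 22 classes, which in turn makes detection expressible as BIO-style token classification:

```python
from dataclasses import dataclass

@dataclass
class EventMention:
    """Illustrative record for one annotated event mention."""
    doc_id: str       # source document identifier
    start: int        # character offset where the mention begins
    end: int          # character offset where the mention ends
    text: str         # surface form of the mention
    event_class: str  # one of the 22 classes defined by the guidelines

# Detection and classification can then be cast as token-level BIO tagging,
# e.g. labels "B-<class>", "I-<class>", "O", scored with span-level F1.
```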

    “The fire of love and joy of chivalry” : a lexical frequency and semantic category analysis of The Faerie Queene

    While The Faerie Queene has been the subject of copious literary analysis, little research has been done using quantitative corpus analysis techniques. This research aims to determine the core vocabulary and semantic categories of the first book of The Faerie Queene with the purpose of uncovering Spenser’s key interests and motivations. There are two main stages to the project: lexical frequency and concordance analyses performed using WMatrix and AntConc, followed by a semantic category analysis using the Historical Thesaurus of English. These analyses found that the core vocabulary and semantic categories are mostly interlinked, focusing on words relating to chivalry and the human body. Examining the broader semantic categories, however, highlights the importance of emotion and social class, in addition to religion and morality. This indicates that, despite the allegory of the book, Spenser adheres to the traditional themes of chivalric literature such as love and social status. He also uses emotion and body-part words to encode more abstract concepts, and uses suffering and morality to accentuate the interpretation of Protestantism found within the text. These findings link to what is known about Spenser, with particular focus on aspects of literary style and what it meant to be a good Protestant in the sixteenth century.
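
    The two quantitative stages can be sketched as follows, with a toy word-to-category lexicon standing in for the Historical Thesaurus lookup (the `HT_LEXICON` entries are invented for illustration):

```python
from collections import Counter

def frequency_profile(tokens: list[str]) -> Counter:
    """Stage 1: raw lexical frequencies, case-folded."""
    return Counter(t.lower() for t in tokens)

def category_profile(tokens: list[str], lexicon: dict[str, str]) -> Counter:
    """Stage 2: aggregate token counts by semantic category."""
    return Counter(lexicon[t.lower()] for t in tokens if t.lower() in lexicon)

# Toy lexicon; the actual analysis used Historical Thesaurus categories.
HT_LEXICON = {"knight": "Chivalry", "sword": "Chivalry",
              "heart": "Body", "love": "Emotion"}
print(category_profile("The knight drew his sword for love".split(), HT_LEXICON))
# Counter({'Chivalry': 2, 'Emotion': 1})
```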

    Semantic Tagging for the Urdu Language:Annotated Corpus and Multi-Target Classification Methods

    Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as natural language processing, corpus linguistics, information retrieval, and data science. An important aspect of such automatic information extraction and analysis is the annotation of language data using semantic tagging tools. Different semantic tagging tools have been designed to carry out various levels of semantic analysis, for instance, named entity recognition and disambiguation, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. Common to all of these tasks, in the supervised setting, is the requirement for a manually semantically annotated corpus, which acts as a knowledge base from which to train and test potential word- and phrase-level sense annotations. Many benchmark corpora have been developed for various semantic tagging tasks, but most are for English and other European languages. There is a dearth of semantically annotated corpora for the Urdu language, which is widely spoken and used around the world. To fill this gap, this study presents a large benchmark corpus and methods for the semantic tagging task for the Urdu language. The proposed corpus contains 8,000 tokens in the following domains or genres: news, social media, Wikipedia, and historical text (2,000 tokens per domain). The corpus has been manually annotated with 21 major semantic fields and 232 sub-fields of the USAS (UCREL Semantic Analysis System) semantic taxonomy, which provides a comprehensive set of semantic fields for coarse-grained annotation. Each word in the proposed corpus has been annotated with at least one and up to nine semantic field tags to provide a detailed semantic analysis of the language data, which allowed us to treat semantic tagging as a supervised multi-target classification task. To demonstrate how the proposed corpus can be used for the development and evaluation of Urdu semantic tagging methods, we extracted local, topical and semantic features from it and applied seven different supervised multi-target classifiers to them. Results show an accuracy of 94% on the proposed corpus, which is free and publicly available to download.
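
    A minimal sketch of this multi-target setup, assuming scikit-learn and toy data (the contexts, tags, and bag-of-words feature choice below are invented; the paper's own features were local, topical and semantic): each token's context is vectorised, and its set of USAS field tags becomes a row in a binary label matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data: one context window per token, plus its USAS field tags (1-9 each).
contexts = ["hukumat ne elan kiya", "khel ka maidan",
            "kitab parh rahi thi", "bazar mein qeematein"]
tags = [["G1"], ["K5", "H2"], ["Q4"], ["I1", "M7"]]

X = CountVectorizer().fit_transform(contexts)   # bag-of-words context features
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)                     # binary matrix, one column per tag

# A classifier chain links the targets so each tag can condition on the others.
clf = ClassifierChain(LogisticRegression(max_iter=1000))
clf.fit(X, Y)
print(mlb.inverse_transform(clf.predict(X)))
```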

    An Urdu semantic tagger - lexicons, corpora, methods and tools

    Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as Natural Language Processing (NLP), corpus linguistics, and data science. An important aspect of such automatic information extraction and analysis is the semantic annotation of language data using a semantic annotation tool (a.k.a. a semantic tagger). Generally, different semantic annotation tools have been designed to carry out various levels of semantic annotation, for instance, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. These semantic annotation tools identify or tag partial core semantic information in language data; moreover, they tend to be applicable only to English and other European languages. A semantic annotation tool that can annotate the semantic senses of all lexical units (words), based on the USAS (UCREL Semantic Analysis System) semantic taxonomy, is still desirable for the Urdu language in order to provide comprehensive semantic analysis of Urdu text. This thesis reports on the development of an Urdu semantic tagging tool and discusses challenging issues faced during this Ph.D. research. Since standard NLP pipeline tools are not widely available for Urdu, a suite of newly developed tools has been created alongside the Urdu semantic tagger: a sentence tokenizer, a word tokenizer and a part-of-speech (POS) tagger. Results for these tools are as follows: the word tokenizer reports an F1 of 94.01% and an accuracy of 97.21%; the sentence tokenizer an F1 of 92.59% and an accuracy of 93.15%; and the POS tagger an accuracy of 95.14%. The Urdu semantic tagger incorporates semantic resources (a lexicon and corpora) as well as semantic field disambiguation methods. In terms of novelty, the NLP pre-processing tools are developed using rule-based, statistical, or hybrid techniques. Furthermore, all semantic lexicons have been developed using a novel combination of automatic and semi-automatic approaches: mapping, crowdsourcing, statistical machine translation, GIZA++, word embeddings, and named entities. A large multi-target annotated corpus has also been constructed using a semi-automatic approach to test the accuracy of the Urdu semantic tagger; this corpus is also used to train and test supervised multi-target machine learning classifiers. The results show that the Random k-labELsets Disjoint, Pruned Sets and Classifier Chain multi-target classifiers outperform all other classifiers on the proposed corpus, with a Hamming loss of 0.06 and an accuracy of 0.94. The best lexical coverage obtained on several test corpora is 88.59%, 99.63%, 96.71% and 89.63%. The developed Urdu semantic tagger shows an encouraging precision of 79.47% on the proposed test corpus.
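
    For reference, the headline multi-target metrics are computed per label decision; below is a small sketch (not the thesis's own evaluation code) of Hamming loss over predicted tag sets, with invented toy data:

```python
def hamming_loss(true_sets, pred_sets, n_labels):
    """Fraction of per-label decisions that are wrong, averaged over instances."""
    wrong = sum(len(t.symmetric_difference(p))  # tags wrongly added or missed
                for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * n_labels)

# Toy tag sets over a 3-label space: 2 wrong decisions out of 6 -> 0.33
true = [{"A1", "B2"}, {"C3"}]
pred = [{"A1"}, {"C3", "B2"}]
print(hamming_loss(true, pred, n_labels=3))
```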

    Evaluation in late modern English history writing

    This work is concerned with the identification and classification of evaluative lexis in the works of British historians of the Late Modern period. The focus is on the linguistic resources used by historians over the course of 200 years (c. 1700-1914) to evaluate historical events and actors, and on the functions realised through these resources. Central to the study is the consideration of evaluation as a linguistic means of signalling historical significance in Late Modern English historiography. The work draws on, among others, the Appraisal Framework (Martin & White 2005) and extends it in order to apply it to a large corpus of primary historical works from the 18th-19th centuries, using both quantitative corpus-linguistic and qualitative methods. Situating itself in both historical/diachronic discourse research and corpus-assisted discourse analysis (Partington et al. 2013), it provides the first linguistic account of the historiographical register in the pivotal period of its gradual scientification and institutionalisation. By adopting an interdisciplinary perspective, this work mediates between historiographical theorising and linguistic theory and methodology.

    Meaning in poetry: semantic annotation of verse with the Historical Thesaurus of English

    This thesis addresses the current gap in semantic annotation of poetry by presenting the first semantic tagging system specifically designed to disambiguate senses in a diachronic corpus of poetry. The ‘ambiguity tagger’ developed for this purpose utilises the hierarchical taxonomy of the Historical Thesaurus of English (HTE; Kay 2011: 42) to assign conceptual ‘tags’ to lexical items that denote the meaning of the word in context, with multiple meanings assigned to ambiguous words. The ambiguity tagger encompasses a configurable pipeline for semantic annotation, thus presenting a more flexible alternative to existing applications (Piao et al. 2005a; Rayson 2009a; Piao et al. 2017). To train the tagger, a corpus was curated from the Oxford Book of English Verse (OBEV; Quiller-Couch 1919/1999), containing poetry from the early 16th to the late 19th century. As the ambiguity tagger allows multiple meanings to be assigned to individual words in the corpus, without restricting the maximum number of senses, the semantic metadata produced by the tagger is unique in its breadth. Correspondingly, the analysis sections of the thesis look at different techniques for interpreting the data, using case studies from the OBEV corpus. Both macro- and micro-level approaches to analysing the data are explored, highlighting the benefits of the ambiguity tagger at different levels of critical analysis. To further explore the capabilities of semantic annotation with HTE data, this research extends the interpretative analysis of the semantic metadata gained through the ambiguity tagger by presenting a systematic approach for analysing the significant co-occurrence of concepts in a text. This process borrows the framework for identifying significantly co-occurring words (collocates) and extends it into a measure of ‘semantic collocation’, thus significantly expanding on existing research in this field (Alexander et al. 2015a; Archer & Malory 2015; 2017). By shifting the focus from lexical collocation to the significant co-occurrence of ‘meaning’ in texts, this approach reveals previously inaccessible patterns of textual data for analysis and marks a further methodological contribution of this research.
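
    The general shape of such a ‘semantic collocation’ measure can be sketched as follows, assuming a text already rendered as one concept tag per token and using pointwise mutual information as the association score (the thesis builds on the standard collocation framework; its exact statistic may differ):

```python
import math
from collections import Counter

def semantic_collocations(tagged: list[str], window: int = 5) -> dict:
    """Score co-occurrence of concept tags within a window using PMI."""
    unigrams = Counter(tagged)
    pairs = Counter()
    for i, a in enumerate(tagged):
        for b in tagged[i + 1 : i + window]:
            pairs[tuple(sorted((a, b)))] += 1   # order-insensitive pair counts
    n, total_pairs = len(tagged), sum(pairs.values())
    return {
        (a, b): math.log2((c / total_pairs)
                          / ((unigrams[a] / n) * (unigrams[b] / n)))
        for (a, b), c in pairs.items()
    }
```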

    AIUCD2018 - Book of Abstracts

    This volume collects the abstracts of the papers presented at the Seventh Annual AIUCD Conference, AIUCD 2018 (Bari, 31 January – 2 February 2018), entitled "Patrimoni culturali nell’era digitale. Memorie, culture umanistiche e tecnologia" (Cultural Heritage in the Digital Age. Memory, Humanities and Technologies). The abstracts published in this volume received favourable assessments from subject experts through an anonymous double-blind peer review process under the responsibility of the AIUCD Scientific Committee. The AIUCD 2018 conference programme is available online at http://www.aiucd2018.uniba.it/
