    Workshop Proceedings of the 12th edition of the KONVENS conference

    The 2014 edition of KONVENS is, even more than previous editions, a forum for exchange: its main topic is the interaction between Computational Linguistics and Information Science, and the synergies that such interaction, cooperation, and integrated views can produce. This topic, at the crossroads of research traditions that deal with natural language as a container of knowledge and with methods to extract and manage linguistically represented knowledge, is close to the heart of many researchers at the Institut fĂŒr Informationswissenschaft und Sprachtechnologie of UniversitĂ€t Hildesheim: it has long been one of the institute's research topics, and it has received even more attention over the last few years.

    A machine learning approach for Urdu text sentiment analysis

    Product evaluations, ratings, and other sorts of online expressions have risen in popularity as a result of the emergence of social networking sites and blogs. Sentiment analysis has emerged as a new area of study for computational linguists as a result of this rapidly expanding body of data. For around a decade, this has been an active topic of research for English; however, the scientific community has largely neglected other important languages, such as Urdu. Morphologically, Urdu is one of the most complex languages in the world. A variety of distinctive characteristics, such as the language's unusual morphology and unrestricted word order, make Urdu language processing a difficult challenge. This research provides a new framework for the categorization of Urdu language sentiments. The main contributions of the research are to show the importance of this multidimensional research problem as well as its technical parts, such as the parsing algorithm, corpus, and lexicon. A new approach for Urdu text sentiment analysis, comprising data gathering, pre-processing, feature extraction, feature vector formation, and finally, sentiment classification, has been designed to deal with Urdu language sentiments. The results and discussion section provides a comprehensive comparison of the proposed work with a standard baseline method in terms of precision, recall, f-measure, and accuracy on three different types of datasets. In the overall comparison of the models, the proposed work shows encouraging achievements in accuracy and the other metrics. Last but not least, this section also outlines trends and possible directions for future work.
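    The pipeline stages named above (pre-processing, feature extraction, feature vector formation, classification) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual system: the lexicon entries, feature set, and decision rule are all invented for the example.

    ```python
    def preprocess(text):
        """Pre-processing: whitespace tokenization (Urdu has no case to fold)."""
        return text.split()

    # Hypothetical sentiment lexicon: token -> polarity score (invented entries)
    LEXICON = {"acha": 1, "behtareen": 2, "bura": -1, "kharab": -2}

    def extract_features(tokens):
        """Feature extraction and vector formation: lexicon polarity sum plus length."""
        polarity = sum(LEXICON.get(t, 0) for t in tokens)
        return {"polarity": polarity, "length": len(tokens)}

    def classify(features):
        """Final sentiment classification from the feature vector."""
        if features["polarity"] > 0:
            return "positive"
        if features["polarity"] < 0:
            return "negative"
        return "neutral"

    review = "yeh mobile behtareen hai"
    print(classify(extract_features(preprocess(review))))  # positive
    ```

    A real system would replace the toy lexicon rule with a trained machine learning classifier, as the paper does, but the staged shape of the pipeline is the same.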

    The text classification pipeline: Starting shallow, going deeper

    Text Classification (TC) is an increasingly relevant and crucial subfield of Natural Language Processing (NLP), tackled in this PhD thesis from a computer science and engineering perspective. In this field too, the exceptional success of deep learning has sparked a boom over the past ten years. Text retrieval and categorization, information extraction, and summarization all rely heavily on TC. The literature has presented numerous datasets, models, and evaluation criteria. Although languages such as Arabic, Chinese, and Hindi are employed in several works, the most used and referenced language in the TC literature, from a computer science perspective, is English; it is also the language mainly referenced in the rest of this thesis. Even though numerous machine learning techniques have shown outstanding results, a classifier's effectiveness depends on its capability to comprehend intricate relations and non-linear correlations in texts. To achieve this level of understanding, it is necessary to pay attention not only to the architecture of a model but also to the other stages of the TC pipeline. Within the NLP framework, a range of text representation techniques and model designs have emerged, including large language models, which can turn massive amounts of text into useful vector representations that effectively capture semantically significant information. Of crucial interest is the fact that this field has been investigated by numerous communities, including data mining, linguistics, and information retrieval; these communities frequently overlap, but are mostly separate and conduct their research on their own. Bringing researchers from these groups together to improve the multidisciplinary comprehension of the field is one of the objectives of this dissertation, which additionally examines text mining from both a traditional and a modern perspective.
    This thesis covers the whole TC pipeline in detail, but its main contribution is to investigate the impact of every element of the pipeline on the final performance of a TC model. The pipeline is discussed end to end, covering both traditional and recent deep-learning-based models: State-Of-The-Art (SOTA) benchmark datasets, text preprocessing, text representation, machine learning models for TC, evaluation metrics, and current SOTA results. Each chapter of the dissertation covers one of these steps, presenting both the technical advancements and my most significant recent findings from experiments and novel models. The advantages and disadvantages of the various options are listed, along with a thorough comparison of the approaches. Each chapter closes with my contributions: experimental evaluations and discussions of the results obtained during my three-year PhD course. These experiments and analyses, one per element of the TC pipeline, are the main contributions of this work, extending the basic knowledge of a regular survey on TC.
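    The evaluation-metrics stage of the TC pipeline mentioned above can be made concrete with a small sketch. This is an illustrative, standard computation of precision, recall, F1, and accuracy for a binary setting, not code from the thesis; the labels are invented.

    ```python
    def evaluate(y_true, y_pred, positive="pos"):
        """Standard TC evaluation metrics for a binary classification setting."""
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
        return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

    y_true = ["pos", "pos", "neg", "neg"]
    y_pred = ["pos", "neg", "neg", "pos"]
    print(evaluate(y_true, y_pred))
    # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'accuracy': 0.5}
    ```

    Whatever model the pipeline ends with, shallow or deep, this final scoring step is shared, which is what makes pipeline-stage comparisons like those in the thesis possible.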

    Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods

    Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as natural language processing, corpus linguistics, information retrieval, and data science. An important aspect of such automatic information extraction and analysis is the annotation of language data using semantic tagging tools. Different semantic tagging tools have been designed to carry out various levels of semantic analysis, for instance, named entity recognition and disambiguation, sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. Common to all of these tasks, in the supervised setting, is the requirement for a manually semantically annotated corpus, which acts as a knowledge base from which to train and test potential word- and phrase-level sense annotations. Many benchmark corpora have been developed for various semantic tagging tasks, but most are for English and other European languages. There is a dearth of semantically annotated corpora for the Urdu language, which is widely spoken and used around the world. To fill this gap, this study presents a large benchmark corpus and methods for the semantic tagging task for the Urdu language. The proposed corpus contains 8,000 tokens in the following domains or genres: news, social media, Wikipedia, and historical text (2,000 tokens per domain). The corpus has been manually annotated with 21 major semantic fields and 232 sub-fields of the USAS (UCREL Semantic Analysis System) semantic taxonomy, which provides a comprehensive set of semantic fields for coarse-grained annotation. Each word in our proposed corpus has been annotated with at least one and up to nine semantic field tags to provide a detailed semantic analysis of the language data, which allowed us to treat semantic tagging as a supervised multi-target classification task.
    To demonstrate how our proposed corpus can be used for the development and evaluation of Urdu semantic tagging methods, we extracted local, topical, and semantic features from the corpus and applied seven different supervised multi-target classifiers to them. Results show an accuracy of 94% on our proposed corpus, which is free and publicly available to download.
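    The multi-target framing described above, where each word may carry several semantic field tags at once, amounts to predicting a binary indicator vector over the tagset. A minimal sketch follows; the tagset, tokens, and tag assignments are hypothetical examples, not drawn from the actual corpus or the USAS inventory.

    ```python
    # Illustrative coarse tagset (USAS-style single-letter fields, invented here)
    TAGSET = ["A", "E", "S"]

    # Toy "trained" model: token -> set of applicable semantic fields (invented)
    MODEL = {"khushi": {"E"}, "hukumat": {"S"}, "waqt": {"A", "S"}}

    def tag(token):
        """Multi-target output: a binary indicator vector over the tagset,
        so a token can receive several semantic field tags simultaneously."""
        fields = MODEL.get(token, set())
        return [1 if t in fields else 0 for t in TAGSET]

    print(tag("waqt"))  # [1, 0, 1]  -> two fields apply at once
    ```

    Casting the task this way is what lets off-the-shelf multi-target classifiers, like the seven compared in the study, be trained directly on the annotated corpus.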

    Artificial intelligence for understanding the Hadith

    My research aims to utilize Artificial Intelligence to model the meanings of Classical Arabic Hadith, which are the reports of the life and teachings of the Prophet Muhammad. The goal is to find similarities and relatedness between Hadith and other religious texts, specifically the Quran. These findings can facilitate downstream tasks, such as Islamic question-answering systems, and enhance understanding of these texts to shed light on new interpretations. To achieve this goal, a well-structured Hadith corpus should be created, with the Matn (Hadith teaching) and Isnad (chain of narrators) segmented. Hence, a preliminary task is conducted to build a segmentation tool using machine learning models that automatically deconstructs a Hadith into Isnad and Matn with 92.5% accuracy. This tool is then used to create a well-structured corpus of the canonical Hadith books. After building the Hadith corpus, Matns are extracted to investigate different methods of representing their meanings. Two main methods are tested: a knowledge-based approach and a deep-learning-based approach. To apply the former, existing Islamic ontologies are enumerated, most of which are intended for the Quran. Since the Quran and the Hadith are in the same domain, the extent to which these ontologies cover the Hadith is examined using a corpus-based evaluation. Results show that the most comprehensive Quran ontology covers only 26.8% of Hadith concepts, and extending it is expensive. Therefore, the second approach is investigated by building and evaluating various deep-learning models for a binary classification task of detecting relatedness between the Hadith and the Quran. Results show that current models remain somewhat short of a human-level understanding of such texts.
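    The Isnad/Matn segmentation step can be illustrated with a simple rule-based sketch: treat everything up to the last narration-chain cue word as Isnad, the rest as Matn. This is purely illustrative; the cue words are hypothetical transliterations, and the thesis uses learned models rather than this rule.

    ```python
    # Hypothetical transliterated narration-chain markers (illustrative only)
    CUES = {"qala", "haddathana", "akhbarana", "an"}

    def segment(tokens):
        """Split a Hadith into Isnad (chain of narrators) and Matn (teaching):
        everything up to and including the last cue token counts as Isnad."""
        last = max((i for i, t in enumerate(tokens) if t in CUES), default=-1)
        return tokens[: last + 1], tokens[last + 1 :]

    tokens = ["haddathana", "X", "an", "Y", "qala",
              "innama", "al-amalu", "bi-n-niyyat"]
    isnad, matn = segment(tokens)
    print(matn)  # ['innama', 'al-amalu', 'bi-n-niyyat']
    ```

    A learned segmenter, like the 92.5%-accuracy tool described above, replaces the brittle cue list with features induced from annotated data, but the input/output shape of the task is the same.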

    A distributional investigation of German verbs

    This dissertation provides an empirical investigation of German verbs conducted on the basis of statistical descriptions acquired from a large corpus of German text. In a brief overview of the linguistic theory pertaining to the lexical semantics of verbs, I outline the idea that verb meaning is composed of argument structure (the number and types of arguments that co-occur with a verb) and aspectual structure (properties describing the temporal progression of the event referenced by the verb).
    I then produce statistical descriptions of verbs according to these two distinct facets of meaning: in particular, I examine verbal subcategorisation, selectional preferences, and aspectual type. All three of these modelling strategies are evaluated on a common task, automatic verb classification. I demonstrate that automatically acquired features capturing verbal lexical aspect are beneficial for an application that concerns argument structure, namely semantic role labelling. Furthermore, I demonstrate that features capturing verbal argument structure perform well on the task of classifying a verb for its aspectual type. These findings suggest that these two facets of verb meaning are related in an underlying way.
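    The verb classification setup described above can be sketched in miniature: represent each verb as a vector of distributional features, here subcategorisation-frame proportions, and assign an unseen verb the class of its most similar known neighbour. The verbs, feature values, and classes below are invented for illustration and are not the dissertation's data or model.

    ```python
    import math

    # Toy distributional profiles: verb -> subcategorisation-frame proportions
    VERBS = {
        "essen":    {"np-acc": 0.7,  "intrans": 0.3},
        "trinken":  {"np-acc": 0.8,  "intrans": 0.2},
        "schlafen": {"np-acc": 0.05, "intrans": 0.95},
    }
    CLASSES = {"essen": "consumption", "trinken": "consumption", "schlafen": "state"}

    def cosine(a, b):
        """Cosine similarity between two sparse feature dictionaries."""
        keys = set(a) | set(b)
        dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb)

    def classify(features):
        """1-nearest-neighbour verb classification over the known profiles."""
        best = max(VERBS, key=lambda v: cosine(features, VERBS[v]))
        return CLASSES[best]

    print(classify({"np-acc": 0.75, "intrans": 0.25}))  # consumption
    ```

    Swapping the feature dictionaries for selectional-preference or aspectual features, while keeping the classifier fixed, mirrors how the dissertation evaluates its different modelling strategies on one common task.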

    24th Nordic Conference on Computational Linguistics (NoDaLiDa)


    Named Entity Recognition in Speech-to-Text Transcripts

    Traditionally, named entity recognition (NER) research uses properly capitalized data for training and testing, giving little insight into how these models may perform in scenarios where proper capitalization is not in place. In this thesis, I explore the capabilities of five fine-tuned BERT-based models for NER on all-lowercase text. Furthermore, I measure performance both for classifying named entity types correctly and for simply detecting that a named entity is present, so that capitalization errors may be corrected. Performance is assessed using all-lowercase data from the NorNE dataset and the Norwegian Parliamentary Speech Corpus. Findings suggest that the fine-tuned BERT models are highly capable of detecting non-capitalized named entities, but do not perform as well as traditional NER models that are trained and tested on properly capitalized text.
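    The distinction drawn above, scoring exact entity types versus merely detecting that an entity is present, can be sketched as two scoring modes over token-level labels. The tokens and tags below are invented examples, not NorNE data.

    ```python
    # Invented lowercase tokens with gold and predicted entity tags ("O" = no entity)
    gold = [("oslo", "LOC"), ("kari", "PER"), ("regjeringen", "O")]
    pred = [("oslo", "ORG"), ("kari", "PER"), ("regjeringen", "O")]

    def accuracy(gold, pred, detection_only=False):
        """Strict mode requires the exact entity type; detection mode counts a
        hit whenever both sides agree an entity is present, whatever its type."""
        hits = 0
        for (_, g), (_, p) in zip(gold, pred):
            if detection_only:
                hits += (g != "O") == (p != "O")
            else:
                hits += g == p
        return hits / len(gold)

    print(accuracy(gold, pred))                       # strict: 'oslo' is mistyped
    print(accuracy(gold, pred, detection_only=True))  # detection: still found
    ```

    The detection-only score is the relevant one for the capitalization-repair use case: a mistyped but detected entity can still be re-capitalized.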

    A corpus-based contrastive analysis of modal adverbs of certainty in English and Urdu

    This study uses the corpus-based contrastive approach to explore the syntactic patterns and semantic and pragmatic meanings of modal adverbs of certainty (MACs) in English and Urdu. MACs are a descriptive category of epistemic modal adverb that semantically express a degree of certainty. Due to the paucity of research to date on Urdu MACs, the study draws on existing literature on English MACs for cross-linguistic description of characteristics of English and Urdu MACs. A framework is constructed based on Boye’s (2012) description of syntactic characteristics of MACs, in terms of clause type and position within the clause; and on Simon-Vandenbergen and Aijmer’s (2007) description of their functional characteristics including both semantic (e.g. certainty, possibility) and pragmatic (e.g. authority, politeness) functions. Following Boye’s (2012) model, MACs may be grouped according to meaning: high certainty support – HCS (e.g. certainly); probability support – PS (e.g. perhaps); probability support for negative content – PSNC (e.g. perhaps not); and high certainty support for negative content – HCSNC (e.g. certainly not). Methodologically, the framework identified as suitable is one that primarily follows earlier studies that relied on corpus-based methods and parallel and comparable corpora for cross-linguistic comparative or contrastive analysis of some linguistic element or pattern. An approach to grammatical description based on such works as Quirk et al. (1985) and Biber et al. (1999) is likewise identified as suitable for this study. An existing parallel corpus (EMILLE) and newly created comparable monolingual corpora of English and Urdu are utilised. The novel comparable corpora are web-based, comprised of news and chat forum texts; the data is POS-tagged. Using the parallel corpus, Urdu MACs equivalent to the English MACs preidentified from the existing literature are identified. 
    Then, the comparable corpora are used to extract data on the relative frequencies of MACs and their distribution across various text types. This quantitative analysis demonstrates that in both languages all four semantic categories of MAC are found in all text types, but the distribution across text types is not uniform. HCS MACs, although diverse, are considerably lower in frequency than PS MACs in both English and Urdu. HCSNC and PSNC MACs are notably rarer than HCS and PS MACs in both languages. The analysis demonstrates striking similarities in the syntactic positioning of MACs in English and Urdu, with minor differences. Except for Urdu PSNC MACs, all categories most frequently occur in clause-medial position, in both independent and dependent clauses, in both languages; the exception arises because hƍ nahÄ«áč saktā ‘possibly not’ is most frequent in clause-final position. MACs in both languages most often have scope over the whole clause in which they occur; semantically, the core function of MACs is to express the speaker’s certainty and high confidence (for HCS and HCSNC) or low certainty and low confidence (for PS and PSNC) in the truth of a proposition. These groups thus primarily function as certainty markers and probability markers, respectively. In both languages, speakers also use MACs as short responses to questions, and in responses to their own rhetorical questions. HCS and PS MACs in clause-final position may in addition function as tags which prompt a response from the interlocutor. When they co-occur with modal verbs, MACs emphasise or downtone, but do not entirely change, the modal verb’s epistemic or deontic meaning. In both languages, all MACs preferentially occur in the then-clause of a conditional sentence. Pragmatically, MACs are used for emphasis, expectation, counter-expectation, and politeness. Additionally, HCS and HCSNC MACs are used to express solidarity and authority, and PS and PSNC MACs are used as hedges.
    Readings of expectation, hedging, politeness, and solidarity may be relevant simultaneously. Interestingly, reduplication for emphasis, common in Urdu, is observed for only one Urdu MAC, ĆŒarĆ«r ‘definitely’, whereas all English MACs reduplicate for emphasis in at least some cases. Another difference is that, in Urdu, the sequence ƛāyad nahÄ«áč yaqÄ«nān ‘not perhaps, certainly’ expresses speaker authority within a response to a previous speaker, but no English MAC exhibits this behaviour. Despite overall similarity, minor dissimilarities in the use of English and Urdu MACs are observable in their use as replies to questions and within interrogative clauses. This analysis supports the contention that, cross-linguistically, despite linguistic variation, the conceptual structures and functional-communicative considerations that shape natural languages are largely universal. This study makes two main contributions. First, conducting a descriptive analysis of English and Urdu MACs using a corpus-based contrastive method not only illuminates this specific question in modality but also sets a precedent for future corpus-based descriptive studies of Urdu. Second, it unifies the previously distinct categories of modal adverbs of certainty and possibility into a single category of modal adverbs used to express a degree of certainty, i.e. MACs. From a practical standpoint, an additional contribution of this study is the creation and open release of a large Urdu corpus designed for comparable corpus research, the Lancaster Urdu Web Corpus, fulfilling a need for such a corpus in the field.
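    The quantitative step described above, relative frequencies of MACs per text type, boils down to a normalized count over tokenized corpus samples. A minimal sketch, with an invented English-side MAC list and invented token data:

    ```python
    from collections import Counter

    # Illustrative subset of English MAC forms (not the study's full inventory)
    MACS = {"certainly", "perhaps", "definitely"}

    def rel_freq_per_1000(tokens):
        """Relative frequency of each MAC per 1,000 tokens of a text-type sample."""
        counts = Counter(t for t in tokens if t in MACS)
        return {m: 1000 * c / len(tokens) for m, c in counts.items()}

    news = ["perhaps", "the", "minister", "will", "certainly", "resign",
            "perhaps", "not", "today", "though"]
    print(rel_freq_per_1000(news))  # {'perhaps': 200.0, 'certainly': 100.0}
    ```

    Normalizing per 1,000 tokens is what makes frequencies comparable across text types and across the English and Urdu corpora, which differ in size.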