30 research outputs found

    Experiments to investigate the utility of nearest neighbour metrics based on linguistically informed features for detecting textual plagiarism

    Plagiarism detection is a challenge for linguistic models: most currently implemented models use simple occurrence statistics for linguistic items. In this paper we report two experiments related to plagiarism detection in which we use a model of distributional semantics and of sentence stylistics to compare, sentence by sentence, the likelihood of a text being partly plagiarised. The results of the comparison are displayed for visual inspection by a plagiarism assessor
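
The sentence-by-sentence comparison described in this abstract can be illustrated with a minimal sketch. The abstract does not specify the features used, so plain token counts stand in here for the distributional and stylistic features; the function names `sentence_vector`, `cosine` and `nearest_neighbours` are hypothetical, and the source sentence list is assumed to be non-empty.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Bag-of-words counts; a placeholder for richer linguistic features."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbours(suspect_sentences, source_sentences):
    """For each suspect sentence, return (index of closest source sentence, similarity).

    Assumes source_sentences is non-empty.
    """
    source_vectors = [sentence_vector(s) for s in source_sentences]
    results = []
    for sentence in suspect_sentences:
        vec = sentence_vector(sentence)
        scores = [cosine(vec, sv) for sv in source_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((best, scores[best]))
    return results
```

A high similarity for a given sentence only flags it for inspection; as in the abstract, the final judgement would rest with a human assessor.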

    Contents

    Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011. Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa. NEALT Proceedings Series, Vol. 11 (2011), iii-vii. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/16955

    Finding the online cry for help : automatic text classification for suicide prevention

    Successful prevention of suicide, a serious public health concern worldwide, hinges on the adequate detection of suicide risk. While online platforms are increasingly used for expressing suicidal thoughts, manually monitoring for such signals of distress is practically infeasible, given the information overload suicide prevention workers are confronted with. In this thesis, the automatic detection of suicide-related messages is studied. It presents the first classification-based approach to online suicidality detection, and focuses on Dutch user-generated content. In order to evaluate the viability of such a machine learning approach, we developed a gold standard corpus, consisting of message board and blog posts. These were manually labeled according to a newly developed annotation scheme, grounded in suicide prevention practice. The scheme provides for the annotation of a post's relevance to suicide, and the subject and severity of a suicide threat, if any. This allowed us to derive two tasks: the detection of suicide-related posts, and of severe, high-risk content. In a series of experiments, we sought to determine how well these tasks can be carried out automatically, and which information sources and techniques contribute to classification performance. The experimental results show that both types of messages can be detected with high precision. Therefore, the amount of noise generated by the system is minimal, even on very large datasets, making it usable in a real-world prevention setting. Recall is high for the relevance task, but at around 60%, it is considerably lower for severity. This is mainly attributable to implicit references to suicide, which often go undetected. We found a variety of information sources to be informative for both tasks, including token and character ngram bags-of-words, features based on LSA topic models, polarity lexicons and named entity recognition, and suicide-related terms extracted from a background corpus. 
To improve classification performance, the models were optimized using feature selection, hyperparameter optimization, or a combination of both. A distributed genetic algorithm approach proved successful in finding good solutions for this complex search problem, and resulted in more robust models. Experiments with cascaded classification of the severity task did not reveal performance benefits over direct classification (in terms of F1-score), but its structure allows the use of slower, memory-based learning algorithms that considerably improved recall. At the end of this thesis, we address a problem typical of user-generated content: noise in the form of misspellings, phonetic transcriptions and other deviations from the linguistic norm. We developed an automatic text normalization system, using a cascaded statistical machine translation approach, and applied it to normalize the data for the suicidality detection tasks. Subsequent experiments revealed that, compared to the original data, normalized data resulted in fewer and more informative features, and improved classification performance. This extrinsic evaluation demonstrates the utility of automatic normalization for suicidality detection, and more generally, text classification on user-generated content


    A Corpus Driven Computational Intelligence Framework for Deception Detection in Financial Text

    Financial fraud rampages onwards, seemingly uncontained. The cost of fraud in the UK is estimated to be as high as £193bn a year [1]. Taking a data science perspective that has hitherto been less explored, this thesis demonstrates how the use of linguistic features to drive data mining algorithms can aid in unravelling fraud. To this end, the spotlight is turned on Financial Statement Fraud (FSF), known to be the costliest type of fraud [2]. A new corpus of 6.3 million words is composed of 102 annual reports/10-K (narrative sections) from firms formally indicted for FSF, juxtaposed with 306 non-fraud firms of similar size and industrial grouping. Unlike other similar studies, this thesis takes a wide-angled view and extracts a range of features of different categories from the corpus. These linguistic correlates of deception are uncovered using a variety of techniques and tools. Corpus linguistics methodology is applied to extract keywords and to examine linguistic structure. N-grams are extracted to draw out collocations. Readability measurement in financial text is advanced through the extraction of new indices that probe the text at a deeper level. Cognitive and perceptual processes are also picked out. Tone, intention and liquidity are gauged using customised word lists. Linguistic ratios are derived from grammatical constructs and word categories. An attempt is also made to determine ‘what’ was said as opposed to ‘how’. Further, a new module is developed to condense synonyms into concepts. Lastly, frequency counts from keywords unearthed from a previous content analysis study on financial narrative are also used. These features are then used to drive machine learning based classification and clustering algorithms to determine if they aid in discriminating a fraud from a non-fraud firm. The results derived from the battery of models built typically exceed a classification accuracy of 70%. The above process is amalgamated into a framework.
The process outlined, driven by empirical data, demonstrates in a practical way how linguistic analysis could aid in fraud detection and also constitutes a unique contribution to deception detection studies

    The Automatic Acquisition of Knowledge about Discourse Connectives

    Institute for Communicating and Collaborative Systems. This thesis considers the automatic acquisition of knowledge about discourse connectives. It focuses in particular on their semantic properties, and on the relationships that hold between them. There is a considerable body of theoretical and empirical work on discourse connectives. For example, Knott (1996) motivates a taxonomy of discourse connectives based on relationships between them, such as HYPONYMY and EXCLUSIVE, which are defined in terms of substitution tests. Such work requires either great theoretical insight or manual analysis of large quantities of data. As a result, to date no manual classification of English discourse connectives has achieved complete coverage. For example, Knott gives relationships between only about 18% of pairs obtained from a list of 350 discourse connectives. This thesis explores the possibility of classifying discourse connectives automatically, based on their distributions in texts. This thesis demonstrates that state-of-the-art techniques in lexical acquisition can successfully be applied to acquiring information about discourse connectives. Central to this thesis is the hypothesis that distributional similarity correlates positively with semantic similarity. Support for this hypothesis has previously been found for word classes such as nouns and verbs (Miller and Charles, 1991; Resnik and Diab, 2000, for example), but there has been little exploration of the degree to which it also holds for discourse connectives. We investigate the hypothesis through a number of machine learning experiments. These experiments all use unsupervised learning techniques, in the sense that they do not require any manually annotated data, although they do make use of an automatic parser.
First, we show that a range of semantic properties of discourse connectives, such as polarity and veridicality (whether or not the semantics of a connective involves some underlying negation, and whether the connective implies the truth of its arguments, respectively), can be acquired automatically with a high degree of accuracy. Second, we consider the tasks of predicting the similarity and substitutability of pairs of discourse connectives. To assist in this, we introduce a novel information theoretic function based on variance that, in combination with distributional similarity, is useful for learning such relationships. Third, we attempt to automatically construct taxonomies of discourse connectives capturing substitutability relationships. We introduce a probability model of taxonomies, and show that this can improve accuracy on learning substitutability relationships. Finally, we develop an algorithm for automatically constructing or extending such taxonomies which uses beam search to help find the optimal taxonomy

    24th Nordic Conference on Computational Linguistics (NoDaLiDa)


    Tune your brown clustering, please

    Brown clustering, an unsupervised hierarchical clustering technique based on ngram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyperparameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has an impact on any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal
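
As background to the "ngram mutual information" mentioned in this abstract, the sketch below computes the average mutual information between adjacent word classes, the quantity that Brown clustering tries to preserve when merging classes. It is an illustration of that standard objective only, not the paper's theoretical model of clustering utility; the function name and the toy class assignments are assumptions.

```python
import math
from collections import Counter


def average_mutual_information(tokens, word_to_class):
    """Sum over adjacent class pairs of p(c1,c2) * log2(p(c1,c2) / (p(c1) * p(c2)))."""
    classes = [word_to_class[t] for t in tokens]
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())
    left = Counter()   # how often each class appears as the first element of a pair
    right = Counter()  # how often each class appears as the second element of a pair
    for (c1, c2), n in bigrams.items():
        left[c1] += n
        right[c2] += n
    ami = 0.0
    for (c1, c2), n in bigrams.items():
        p_joint = n / total
        p_left = left[c1] / total
        p_right = right[c2] / total
        ami += p_joint * math.log2(p_joint / (p_left * p_right))
    return ami


tokens = "the cat the dog the cat the dog".split()
two_classes = {"the": 0, "cat": 1, "dog": 1}  # hypothetical assignment
one_class = {"the": 0, "cat": 0, "dog": 0}    # everything merged into one class
ami_two = average_mutual_information(tokens, two_classes)
ami_one = average_mutual_information(tokens, one_class)
```

Collapsing everything into a single class drives the score to zero, which gives some intuition for why the chosen number of classes matters as a tuning parameter.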

    Transforming Research Methods in the Social Sciences

    Social science researchers in the global South, and in South Africa particularly, utilise research methods in innovative ways in order to respond to contexts characterised by diversity, racial and political tensions, socioeconomic disparities and gender inequalities. These methods often remain undocumented – a gap that this book starts to address. Written by experts from various methodological fields, Transforming Research Methods in the Social Sciences is a comprehensive collation of original essays and cutting-edge research that demonstrates the variety of novel techniques and research methods available to researchers responding to these context-bound issues. It is particularly relevant for study and research in the fields of applied psychology, sociology, ethnography, biography and anthropology. In addition to their unique combination of conceptual and application issues, the chapters also include discussions on ethical considerations relevant to the method in similar global South contexts. Transforming Research Methods in the Social Sciences has much to offer to researchers, professionals and others involved in social science research both locally and internationally

    Linguistic Threat Assessment: Understanding Targeted Violence through Computational Linguistics

    Language alluding to possible violence is widespread online, and security professionals are increasingly faced with the issue of understanding and mitigating this phenomenon. The volume of extremist and violent online data presents a workload that is unmanageable for traditional, manual threat assessment. Computational linguistics may be of particular relevance to understanding threats of grievance-fuelled targeted violence on a large scale. This thesis seeks to advance knowledge on the possibilities and pitfalls of threat assessment through automated linguistic analysis. Based on in-depth interviews with expert threat assessment practitioners, three areas of language are identified which can be leveraged for automation of threat assessment, namely, linguistic content, style, and trajectories. Implementations of each area are demonstrated in three subsequent quantitative chapters. First, linguistic content is utilised to develop the Grievance Dictionary, a psycholinguistic dictionary aimed at measuring concepts related to grievance-fuelled violence in text. Thereafter, linguistic content is supplemented with measures of linguistic style in order to examine the feasibility of author profiling (determining gender, age, and personality) in abusive texts. Lastly, linguistic trajectories are measured over time in order to assess the effect of an external event on an extremist movement. Collectively, the chapters in this thesis demonstrate that linguistic automation of threat assessment is indeed possible. The concluding chapter describes the limitations of the proposed approaches and illustrates where future potential lies to improve automated linguistic threat assessment. Ideally, developers of computational implementations for threat assessment strive for explainability and transparency. 
Furthermore, it is argued that computational linguistics holds particular promise for large-scale measurement of grievance-fuelled language, but is perhaps less suited to prediction of actual violent behaviour. Lastly, researchers and practitioners involved in threat assessment are urged to collaboratively and critically evaluate novel computational tools which may emerge in the future