114 research outputs found

    A new approach to the study of Translationese: Machinelearning the difference between original and translated text.

    Get PDF
    Abstract In this article we describe an approach to the identification of 'translationese' based on monolingual comparable corpora and machine learning techniques for text categorization. The article reports on experiments in which support vector machines (SVMs) are employed to recognize translated text in a corpus of Italian articles from the geopolitical domain. An ensemble of SVMs reaches 86.7% accuracy with 89.3% precision and 83.3% recall on this task. A preliminary analysis of the features used by the SVMs suggests that the distribution of function words and morphosyntactic categories in general, and personal pronouns and adverbs in particular, are among the cues used by the SVMs to perform the discrimination task. A follow-up experiment shows that the performance attained by SVMs is well above the average performance of ten human subjects, including five professional translators, on the same task. Our results offer solid evidence supporting the translationese hypothesis, and our method seems to have promising applications in translation studies and in quantitative style analysis in general. Implications for the machine learning/text categorization community are equally important, both because this is a novel application and especially because we provide explicit evidence that a relatively knowledge-poor machine learning algorithm can outperform human beings in a text classification task

    Versification and Authorship Attribution

    Get PDF
    The technique known as contemporary stylometry uses different methods, including machine learning, to discover a poem’s author based on features like the frequencies of words and character n-grams. However, there is one potential textual fingerprint stylometry tends to ignore: versification, or the very making of language into verse. Using poetic texts in three different languages (Czech, German, and Spanish), Petr Plecháč asks whether versification features like rhythm patterns and types of rhyme can help determine authorship. He then tests its findings on two unsolved literary mysteries. In the first, Plecháč distinguishes the parts of the Elizabethan verse play The Two Noble Kinsmen written by William Shakespeare from those written by his coauthor, John Fletcher. In the second, he seeks to solve a case of suspected forgery: how authentic was a group of poems first published as the work of the nineteenth-century Russian author Gavriil Stepanovich Batenkov? This book of poetic investigation should appeal to literary sleuths the world over.illustrato

    Data Science and Knowledge Discovery

    Get PDF
    Data Science (DS) is gaining significant importance in the decision process due to a mix of various areas, including Computer Science, Machine Learning, Math and Statistics, domain/business knowledge, software development, and traditional research. In the business field, DS's application allows using scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data to support the decision process. After collecting the data, it is crucial to discover the knowledge. In this step, Knowledge Discovery (KD) tasks are used to create knowledge from structured and unstructured sources (e.g., text, data, and images). The output needs to be in a readable and interpretable format. It must represent knowledge in a manner that facilitates inferencing. KD is applied in several areas, such as education, health, accounting, energy, and public administration. This book includes fourteen excellent articles which discuss this trending topic and present innovative solutions to show the importance of Data Science and Knowledge Discovery to researchers, managers, industry, society, and other communities. The chapters address several topics like Data mining, Deep Learning, Data Visualization and Analytics, Semantic data, Geospatial and Spatio-Temporal Data, Data Augmentation and Text Mining

    A study on plagiarism detection and plagiarism direction identification using natural language processing techniques

    Get PDF
    Ever since we entered the digital communication era, the ease of information sharing through the internet has encouraged online literature searching. With this comes the potential risk of a rise in academic misconduct and intellectual property theft. As concerns over plagiarism grow, more attention has been directed towards automatic plagiarism detection. This is a computational approach which assists humans in judging whether pieces of texts are plagiarised. However, most existing plagiarism detection approaches are limited to super cial, brute-force stringmatching techniques. If the text has undergone substantial semantic and syntactic changes, string-matching approaches do not perform well. In order to identify such changes, linguistic techniques which are able to perform a deeper analysis of the text are needed. To date, very limited research has been conducted on the topic of utilising linguistic techniques in plagiarism detection. This thesis provides novel perspectives on plagiarism detection and plagiarism direction identi cation tasks. The hypothesis is that original texts and rewritten texts exhibit signi cant but measurable di erences, and that these di erences can be captured through statistical and linguistic indicators. To investigate this hypothesis, four main research objectives are de ned. First, a novel framework for plagiarism detection is proposed. It involves the use of Natural Language Processing techniques, rather than only relying on the vii traditional string-matching approaches. The objective is to investigate and evaluate the in uence of text pre-processing, and statistical, shallow and deep linguistic techniques using a corpus-based approach. This is achieved by evaluating the techniques in two main experimental settings. Second, the role of machine learning in this novel framework is investigated. The objective is to determine whether the application of machine learning in the plagiarism detection task is helpful. This is achieved by comparing a thresholdsetting approach against a supervised machine learning classi er. Third, the prospect of applying the proposed framework in a large-scale scenario is explored. The objective is to investigate the scalability of the proposed framework and algorithms. This is achieved by experimenting with a large-scale corpus in three stages. The rst two stages are based on longer text lengths and the nal stage is based on segments of texts. Finally, the plagiarism direction identi cation problem is explored as supervised machine learning classi cation and ranking tasks. Statistical and linguistic features are investigated individually or in various combinations. The objective is to introduce a new perspective on the traditional brute-force pair-wise comparison of texts. Instead of comparing original texts against rewritten texts, features are drawn based on traits of texts to build a pattern for original and rewritten texts. Thus, the classi cation or ranking task is to t a piece of text into a pattern. The framework is tested by empirical experiments, and the results from initial experiments show that deep linguistic analysis contributes to solving the problems we address in this thesis. Further experiments show that combining shallow and viii deep techniques helps improve the classi cation of plagiarised texts by reducing the number of false negatives. In addition, the experiment on plagiarism direction detection shows that rewritten texts can be identi ed by statistical and linguistic traits. The conclusions of this study o er ideas for further research directions and potential applications to tackle the challenges that lie ahead in detecting text reuse.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Computational Stylistics in Poetry, Prose, and Drama

    Get PDF
    The contributions in this edited volume approach poetry, narrative, and drama from the perspective of Computational Stylistics. They exemplify methods of computational textual analysis and explore the possibility of computational generation of literary texts. The volume presents a range of computational and Natural Language Processing applications to literary studies, such as motif detection, network analysis, machine learning, and deep learning

    Computational Stylistics in Poetry, Prose, and Drama

    Get PDF
    The contributions in this edited volume approach poetry, narrative, and drama from the perspective of Computational Stylistics. They exemplify methods of computational textual analysis and explore the possibility of computational generation of literary texts. The volume presents a range of computational and Natural Language Processing applications to literary studies, such as motif detection, network analysis, machine learning, and deep learning

    Syntax inside the grammar

    Get PDF
    This volume collects novel contributions to comparative generative linguistics that “rethink” existing approaches to an extensive range of phenomena, domains, and architectural questions in linguistic theory. At the heart of the contributions is the tension between descriptive and explanatory adequacy which has long animated generative linguistics and which continues to grow thanks to the increasing amount and diversity of data available to us. The chapters address research questions on the relation of syntax to other aspects of grammar and linguistics more generally, including studies on language acquisition, variation and change, and syntactic interfaces. Many of these contributions show the influence of research by Ian Roberts and collaborators and give the reader a sense of the lively nature of current discussion of topics in synchronic and diachronic comparative syntax ranging from the core verbal domain to higher, propositional domains

    Syntactic architecture and its consequences I

    Get PDF
    This volume collects novel contributions to comparative generative linguistics that “rethink” existing approaches to an extensive range of phenomena, domains, and architectural questions in linguistic theory. At the heart of the contributions is the tension between descriptive and explanatory adequacy which has long animated generative linguistics and which continues to grow thanks to the increasing amount and diversity of data available to us. The chapters address research questions on the relation of syntax to other aspects of grammar and linguistics more generally, including studies on language acquisition, variation and change, and syntactic interfaces. Many of these contributions show the influence of research by Ian Roberts and collaborators and give the reader a sense of the lively nature of current discussion of topics in synchronic and diachronic comparative syntax ranging from the core verbal domain to higher, propositional domains. This book is complemented by volume II available at https://langsci-press.org/catalog/book/276 and volume III available at https://langsci-press.org/catalog/book/277

    Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020

    Get PDF
    On behalf of the Program Committee, a very warm welcome to the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020). This edition of the conference is held in Bologna and organised by the University of Bologna. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after six years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges

    Morphosyntactic Variation in Medieval Celtic Languages

    Get PDF
    This book showcases the state of the art in corpus-based linguistic analysis of Celtic languages (specifically, Old/Middle Irish, Middle Welsh, and Cornish). It explores corpus approaches to morphosyntactic variation in the medieval Celtic languages and, for the first time, situates them in the broader field of computational and corpus linguistics by providing descriptions of tools for processing the data to create electronic corpora
    • …
    corecore