280 research outputs found

    Towards Document Plagiarism Detection Based on the Relevance and Fragmentation of the Reused Text

    Full text link

    Composing Measures for Computing Text Similarity

    Get PDF
    We present a comprehensive study of computing similarity between texts. We start from the observation that while the concept of similarity is well grounded in psychology, text similarity is much less well-defined in the natural language processing community. We thus define the notion of text similarity and distinguish it from related tasks such as textual entailment and near-duplicate detection. We then identify multiple text dimensions, i.e. characteristics inherent to texts that can be used to judge text similarity, for which we provide empirical evidence. We discuss state-of-the-art text similarity measures previously proposed in the literature, before continuing with a thorough discussion of common evaluation metrics and datasets. Based on the analysis, we devise an architecture which combines text similarity measures in a unified classification framework. We apply our system in two evaluation settings, for which it consistently outperforms prior work and competing systems: (a) an intrinsic evaluation in the context of the Semantic Textual Similarity Task as part of the Semantic Evaluation (SemEval) exercises, and (b) an extrinsic evaluation for the detection of text reuse. As a basis for future work, we introduce DKPro Similarity, an open-source software package which streamlines the development of text similarity measures and complete experimental setups.
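    The combination idea described above lends itself to a compact illustration. The following is a minimal sketch, not the authors' DKPro Similarity implementation: several similarity measures, each targeting a different text dimension, are computed per text pair and fed as features to a supervised classifier. The specific measures, the logistic-regression classifier, and the toy data are illustrative assumptions.

```python
# Minimal sketch (not the authors' DKPro Similarity code): combine several
# similarity measures, each capturing a different text dimension, as features
# for a supervised classifier. Measures, classifier and data are assumptions.
from difflib import SequenceMatcher

from sklearn.linear_model import LogisticRegression


def char_ngrams(text, n=3):
    """Set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity over character trigrams (character dimension)."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0


def word_overlap(a, b):
    """Jaccard similarity over word sets (lexical dimension)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)


def features(a, b):
    # Each measure judges one dimension; the classifier learns how to
    # weight them against each other.
    return [
        jaccard(a, b),                        # character-level similarity
        word_overlap(a, b),                   # word-level similarity
        SequenceMatcher(None, a, b).ratio(),  # sequence/structural similarity
    ]


# Toy training pairs labelled 1 (similar) or 0 (dissimilar).
pairs = [
    ("the cat sat on the mat", "a cat was sitting on the mat", 1),
    ("the cat sat on the mat", "stock prices fell sharply today", 0),
]
X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

clf = LogisticRegression().fit(X, y)
print(clf.predict([features("the cat sat on a mat", "the cat sat on the mat")]))
```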

    Collaboration in Designing a Pedagogical Approach in Information Literacy

    Get PDF
    This Open Access book combines expertise in information literacy with expertise in education and teaching to share tips and tricks for the development of good information literacy teaching and training in universities and libraries. It draws on research, knowledge and pedagogical practice from academia to teach students how to sift through information and distinguish the important and correct from the unusable. It discusses basic concepts and models of information literacy, as well as strategies for accessing, locating and retrieving information and methods suitable for the assessment and management of information. The book explains many concepts connected to information literacy and discusses pedagogical issues with a view to supporting the practitioner. Each chapter examines one aspect of information literacy, discusses the pedagogical challenges involved and provides suggestions for best practice.

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Full text link
    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012

    Final FLaReNet deliverable: Language Resources for the Future - The Future of Language Resources

    Get PDF
    Language Technologies (LT), together with their backbone, Language Resources (LR), provide essential support to the challenge of Multilingualism and ICT of the future. The main task of language technologies is to bridge language barriers and to help create a new environment where information flows smoothly across frontiers and languages, regardless of the country and language of origin. To achieve this goal, all players involved need to act as a community able to join forces on a set of shared priorities. However, the field of Language Resources and Technology has long suffered from an excess of individuality and fragmentation, with a lack of coherence concerning the priorities for the field and the direction in which to move, not to mention a common timeframe. The context encountered by the FLaReNet project was thus an active field in need of a coherence that can only come from sharing common priorities and endeavours. FLaReNet has contributed to the creation of this coherence by gathering a wide community of experts and making them participate in the definition of an exhaustive set of recommendations.

    A study on plagiarism detection and plagiarism direction identification using natural language processing techniques

    Get PDF
    Ever since we entered the digital communication era, the ease of information sharing through the internet has encouraged online literature searching. With this comes the potential risk of a rise in academic misconduct and intellectual property theft. As concerns over plagiarism grow, more attention has been directed towards automatic plagiarism detection. This is a computational approach which assists humans in judging whether pieces of texts are plagiarised. However, most existing plagiarism detection approaches are limited to superficial, brute-force string-matching techniques. If the text has undergone substantial semantic and syntactic changes, string-matching approaches do not perform well. In order to identify such changes, linguistic techniques which are able to perform a deeper analysis of the text are needed. To date, very limited research has been conducted on the topic of utilising linguistic techniques in plagiarism detection. This thesis provides novel perspectives on plagiarism detection and plagiarism direction identification tasks. The hypothesis is that original texts and rewritten texts exhibit significant but measurable differences, and that these differences can be captured through statistical and linguistic indicators. To investigate this hypothesis, four main research objectives are defined. First, a novel framework for plagiarism detection is proposed. It involves the use of Natural Language Processing techniques, rather than only relying on the traditional string-matching approaches. The objective is to investigate and evaluate the influence of text pre-processing, and statistical, shallow and deep linguistic techniques using a corpus-based approach. This is achieved by evaluating the techniques in two main experimental settings. Second, the role of machine learning in this novel framework is investigated. The objective is to determine whether the application of machine learning in the plagiarism detection task is helpful. This is achieved by comparing a threshold-setting approach against a supervised machine learning classifier. Third, the prospect of applying the proposed framework in a large-scale scenario is explored. The objective is to investigate the scalability of the proposed framework and algorithms. This is achieved by experimenting with a large-scale corpus in three stages. The first two stages are based on longer text lengths and the final stage is based on segments of texts. Finally, the plagiarism direction identification problem is explored as supervised machine learning classification and ranking tasks. Statistical and linguistic features are investigated individually or in various combinations. The objective is to introduce a new perspective on the traditional brute-force pair-wise comparison of texts. Instead of comparing original texts against rewritten texts, features are drawn based on traits of texts to build a pattern for original and rewritten texts. Thus, the classification or ranking task is to fit a piece of text into a pattern. The framework is tested by empirical experiments, and the results from initial experiments show that deep linguistic analysis contributes to solving the problems we address in this thesis. Further experiments show that combining shallow and deep techniques helps improve the classification of plagiarised texts by reducing the number of false negatives. In addition, the experiment on plagiarism direction detection shows that rewritten texts can be identified by statistical and linguistic traits. The conclusions of this study offer ideas for further research directions and potential applications to tackle the challenges that lie ahead in detecting text reuse.
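    A hedged sketch of the two decision strategies compared in the second objective, a threshold-setting approach versus a supervised classifier over statistical and shallow linguistic features, may help make the framework concrete. The feature set (word n-gram containment plus a length difference), the 0.3 cut-off and the decision-tree classifier are illustrative assumptions, not the thesis's actual configuration.

```python
# Illustrative comparison of the two decision strategies described above:
# a fixed similarity threshold versus a supervised classifier trained on
# the same features. Features, threshold and classifier are assumptions.
from sklearn.tree import DecisionTreeClassifier


def word_ngram_overlap(src, susp, n=3):
    """Fraction of the suspicious text's word n-grams found in the source."""
    grams = lambda t: {tuple(t.split()[i:i + n])
                       for i in range(len(t.split()) - n + 1)}
    g_src, g_susp = grams(src), grams(susp)
    return len(g_src & g_susp) / max(len(g_susp), 1)


def features(src, susp):
    return [
        word_ngram_overlap(src, susp),               # statistical: n-gram containment
        abs(len(src.split()) - len(susp.split())),   # shallow: length difference
    ]


source = "plagiarism detection assists humans in judging reused text"
suspicious = "detection of plagiarism assists humans in judging reused text"

# Strategy 1: threshold-setting (flag the pair if containment exceeds a cut-off).
flagged = word_ngram_overlap(source, suspicious) > 0.3

# Strategy 2: supervised classification over the same features (toy data).
X = [features("a b c d e", "a b c d e"), features("a b c d e", "v w x y z")]
y = [1, 0]   # 1 = plagiarised, 0 = independent
clf = DecisionTreeClassifier().fit(X, y)

print(flagged, clf.predict([features(source, suspicious)]))
```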

    The dawn of the human-machine era: a forecast of new and emerging language technologies

    Get PDF
    New language technologies are coming, thanks to the huge and competing private investment fuelling rapid progress; we can either understand and foresee their effects, or be taken by surprise and spend our time trying to catch up. This report sketches out some transformative new technologies that are likely to fundamentally change our use of language. Some of these may feel unrealistically futuristic or far-fetched, but a central purpose of this report - and the wider LITHME network - is to illustrate that these are mostly just the logical development and maturation of technologies currently in prototype. But will everyone benefit from all these shiny new gadgets? Throughout this report we emphasise a range of groups who will be disadvantaged and issues of inequality. Important issues of security and privacy will accompany new language technologies. A further caution is to re-emphasise the current limitations of AI. Looking ahead, we see many intriguing opportunities and new capabilities, but a range of other uncertainties and inequalities. New devices will enable new ways to talk, to translate, to remember, and to learn. But advances in technology will reproduce existing inequalities among those who cannot afford these devices, among the world's smaller languages, and especially for sign language. Debates over privacy and security will flare and crackle with every new immersive gadget. We will move together into this curious new world with a mix of excitement and apprehension - reacting, debating, sharing and disagreeing as we always do. Plug in, as the human-machine era dawns.

    Identification of microservices from monolithic applications through topic modelling

    Get PDF
    Master's dissertation in Informatics Engineering. Microservices emerged as one of the most popular architectural patterns in recent years, given the increased need to scale, grow and add flexibility to software projects, accompanied by the growth of cloud computing and DevOps. Many software applications are being migrated from a monolithic architecture to a more modular, scalable and flexible architecture of microservices. This process is slow and, depending on the project's complexity, may take months or even years to complete. This dissertation proposes a new approach to microservice identification that uses topic modelling to identify services according to domain terms. This approach, in combination with clustering techniques, produces a set of services based on the original software. The proposed methodology is implemented as an open-source tool for the exploration of monolithic architectures and the identification of microservices. An extensive quantitative analysis, using state-of-the-art metrics on independence of functionality and modularity of services, was conducted on 200 open-source projects collected from GitHub. Cohesion at message and domain level showed medians of roughly 0.6. Interfaces per service exhibited a median of 1.5 with a compact interquartile range. Structural and conceptual modularity revealed medians of 0.2 and 0.4 respectively. Further analysis of whether the methodology works better for smaller or larger projects revealed overall stability and similar performance across metrics. Our first results are positive, demonstrating beneficial identification of services, with metric values that are globally positive and promising.
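    The pipeline described above, topic modelling over domain terms followed by clustering into candidate services, can be sketched as follows. This is an illustrative sketch, not the dissertation's tool: the toy corpus of class terms, the LDA topic count and the cluster count are all assumptions for the example.

```python
# Illustrative sketch (not the dissertation's tool): treat each monolith
# class as a document of its domain terms, fit a topic model, then cluster
# the per-document topic distributions so that each cluster becomes a
# candidate microservice. Corpus, topic count and cluster count are
# assumptions made for this example.
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Each "document" is the bag of identifiers/terms extracted from one class.
class_terms = {
    "OrderController":   "order checkout cart payment invoice total",
    "PaymentService":    "payment invoice charge refund gateway",
    "UserRepository":    "user account login password profile",
    "ProfileController": "user profile avatar account settings",
}

vec = CountVectorizer()
X = vec.fit_transform(class_terms.values())

# Topic model over domain terms: each class gets a distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

# Cluster classes by topic distribution; each cluster is a candidate service.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_topics)
for name, service in zip(class_terms, labels):
    print(f"{name} -> service {service}")
```

    Classes whose identifiers share a topic profile end up in the same cluster, mirroring the intuition that a service should own a coherent slice of the domain vocabulary.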