7,364 research outputs found

    Viewing functions as token sequences to highlight similarities in source code

    The detection of similarities in source code has applications not only in software re-engineering (to eliminate redundancies) but also in software plagiarism detection. The latter can be a challenging problem, since more or less extensive edits may have been performed on the original copy: insertion or removal of useless chunks of code, rewriting of expressions, transposition of code, inlining and outlining of functions, etc. In this paper, we propose a new similarity detection technique based not only on token sequence matching but also on the factorization of the function call graphs. The factorization process merges shared chunks (factors) of code to cope, in particular, with inlining and outlining. The resulting call graph offers a view of the similarities with their nesting relations. It is useful for inferring metrics that quantify similarity at the function level.

    A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges

    Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation and the detection of duplicate code, plagiarism, malware, and code smells. This paper presents a systematic literature review and meta-analysis of code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10,000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate code, of which only eight were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and focus on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to maintenance.
    Comment: 49 pages, 10 figures, 6 tables

    Koodikloonien hyödyntäminen asiakaskohtaisten erojen havaitsemiseksi tuotteistusprosessissa (Utilizing code clones to detect customer-specific differences in the productization process)

    The topic for this thesis was inspired by two case studies. The case studies are applications that are conceptually, but not technically, products. Their code bases contain customer-specific branches. The development strategy in the case studies has been to fork an existing branch and customize it to the needs of the new client. Code reuse and forking can be an efficient or even a necessary development strategy under time pressure. However, code duplication may make the code base harder to maintain, which in turn increases maintenance costs. Finding similar code fragments is researched in the field of code clone detection. Code clones are code fragments that are either the same or similar. The similarity can be categorized into four types. Type I clones are exact matches that differ only in layout, whitespace or comments. In addition to type I changes, type II clones can differ in identifier names and types or literal values. Furthermore, type III clones can have statements added, deleted or modified within the code fragments under comparison. Type IV clones are functionally similar clones. There are different kinds of techniques and tools for both detecting and visualizing clones. Different techniques find different sets of clone types. Code clone visualizations present both an overview of the cloning situation and the details at the source code level. The branches of the same product in the case studies can be considered clones of each other. They are expected to resemble type III clones. They essentially originate from the same code base, but each one has added, deleted and modified statements within the corresponding files relative to the other branches. Identifying these changes facilitates forming an overall picture of how much the branches truly differ. The transformation process from development of customer-specific software to product software is called productization. In order to productize, the differences between the branches must be determined. Each customization needs to be considered in the productization process to avoid reducing the value of the product. We defined a process for using code clone visualizations to explore differences between customer-specific branches. The conclusion of this thesis is that using code clones clearly expedites the productization process. The visualizations help locate the differences much faster than manual comparison. Code clone detection is applied to fade out the uninteresting differences between the branches. Hence, the method helps navigate to the truly interesting customizations that require manual inspection. The method also provides a general view of the cloning situation, which eases the task of estimating the workload. The process is applicable in situations where the diverged code bases are expected to resemble each other structurally, yet contain so many changes that a manual comparison of the branches with file comparison tools would be too time-consuming.

    Knowledge in Physics through Mathematics, Image and Language

    This thesis explores the nature of knowledge in physics and the discourse that organises it. In particular, it focuses on the affordances of mathematics, image and language for construing the highly technical meanings that constitute this knowledge. It shows that each of these resources plays a crucial role in physics’ ability to generate generalised theory whilst maintaining relevance to the empirical physical world. First, to understand how mathematics contributes to knowledge-building, the thesis presents a detailed descriptive model from the perspective of Systemic Functional Semiotics that considers mathematics on its own terms. The description builds on O’Halloran’s (2005) grammar in order to understand mathematics’ intrinsic functionality and theoretical architecture. In doing so, it takes an axial perspective (Martin 2013) that considers the paradigmatic and syntagmatic axes in Systemic Functional theory as the theoretical primitives from which metafunction, strata, rank and all other theoretical categories can be derived. It shows that, when not transposing categories from English but rather deriving them from axial principles, mathematics’ theoretical architecture is considerably different to that of any resource previously seen. Looking metafunctionally, mathematics displays a highly elaborated logical component within the ideational metafunction, but shows no evidence for a discrete interpersonal metafunction. Looking at the levels within the grammar, it displays two interacting hierarchies: a rank scale based on constituency and a nesting scale based on iterative layering. Finally, it shows distinct and predictable text patterns in its interaction with language. From this, the description is able to use genre as a unifying semiotic that strongly predicts the grammatical patterns that occur throughout physics discourse. By developing these models, the thesis offers an understanding of mathematics’ unique functionality and the reasons it is consistently used in physics. Second, the thesis interprets the images of physics from the perspective of the Systemic Functional dimension of field. It shows that much of the power of images comes from the large number of distinct meanings that can be encapsulated in a single snapshot. In one image, large taxonomies, long sequences of activity, extensive arrays of data and various levels of specificity can all be presented. This allows various components of physics’ knowledge to be related and coordinated, and aids physics in building a coherent and integrated knowledge structure. Following the descriptive component of the thesis, the specific functionalities of mathematics, image and language are interpreted through the Legitimation Code Theory dimension of Semantics. This provides an understanding of the organisation of physics’ knowledge structure as a whole. It shows how the interaction of mathematics, language and image underpins physics’ ability to progressively build ever more elaborated technical meanings, to make empirical predictions from theoretical models and to abstract theoretical generalisations from empirical data. By interpreting the mathematics, image and language used in physics from the complementary perspectives of Systemic Functional Semiotics and Legitimation Code Theory, the thesis offers a detailed model of how physics manages to make sense of and predict the vast physical world.

    Interpersonal Metadiscourse in Compositions Written by Iranian ESP Students

    The aim of this study was to investigate two types of Hyland's interpersonal metadiscourse (MD) used in compositions written by male and female students. Twelve students (5 males and 7 females, aged 26 to 33) studying chemical engineering at Islamic Azad University, Shahreza Branch, were selected. Without any instruction, they were given a topic and asked to write an eighty-word composition in ten minutes. The compositions were collected and analyzed quantitatively and qualitatively. The data were analyzed quantitatively in the results section and discussed qualitatively in the discussion and conclusion sections. Findings showed that students employed all types of metadiscourse except for two subcategories of interactive MD, namely endophoric markers and evidentials. Self-mentions were the most frequently used, and hedges and boosters the least, in both males and females. Differences between genders in using MD, with different degrees of occurrence, are present in the overall interpersonal metadiscourse.
