    Learning Transfers over Several Programming Languages

    Large language models (LLMs) have recently become remarkably good at improving developer productivity for high-resource programming languages. These models rely on two kinds of data: large amounts of unlabeled code samples for pretraining, and relatively smaller amounts of labeled code samples for fine-tuning or in-context learning. Unfortunately, many programming languages are low-resource, lacking labeled samples for most tasks and often even lacking unlabeled samples. Users of low-resource languages (e.g., legacy or new languages) therefore miss out on the benefits of LLMs. Cross-lingual transfer learning uses data from a source language to improve model performance on a target language. It has been well studied for natural languages, but has received little attention for programming languages. This paper reports extensive experiments on four tasks, using a transformer-based LLM and 11 to 41 programming languages, to explore the following questions: First, how well does cross-lingual transfer work for a given task across different language pairs? Second, given a task and target language, how should one best choose a source language? Third, which characteristics of a language pair are predictive of transfer performance? Fourth, how does that depend on the given task?
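
    As a rough illustration of the source-language-selection question, the sketch below ranks candidate source languages for a target language by a single corpus-level feature, identifier-subtoken overlap. The feature, the toy corpora, and the premise that such overlap predicts transfer are all assumptions made for illustration; the paper's actual predictors and experimental setup may differ.

    # Hypothetical sketch: rank candidate source languages for cross-lingual
    # transfer by one simple corpus feature (identifier-subtoken overlap).
    # Corpora, language names, and the overlap-predicts-transfer premise are
    # illustrative assumptions, not the paper's method.
    import re
    from collections import Counter

    def subtokens(corpus: str) -> Counter:
        """Split identifiers on camelCase/snake_case and count subtokens."""
        parts = []
        for w in re.findall(r"[A-Za-z]+", corpus):
            parts += re.sub(r"([a-z])([A-Z])", r"\1 \2", w).lower().split()
        return Counter(parts)

    def overlap(a: Counter, b: Counter) -> float:
        """Weighted Jaccard overlap between two subtoken distributions."""
        keys = set(a) | set(b)
        inter = sum(min(a[k], b[k]) for k in keys)
        union = sum(max(a[k], b[k]) for k in keys)
        return inter / union if union else 0.0

    # Toy corpora standing in for unlabeled code samples per language.
    corpora = {
        "java":   "public int getCount() { return count; }",
        "csharp": "public int GetCount() { return count; }",
        "racket": "(define (get-count) count)",
    }
    target = "csharp"
    ranked = sorted(
        (lang for lang in corpora if lang != target),
        key=lambda lang: overlap(subtokens(corpora[lang]),
                                 subtokens(corpora[target])),
        reverse=True,
    )
    print("candidate source languages, best first:", ranked)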

    Reverse-Engineering and Analysis of Access Control Models in Web Applications

    Nowadays, Web applications are ubiquitous and deal with increasingly large amounts of confidential data. In order to protect these data from malicious users, security mechanisms must be put in place. Securing software, however, is an extremely difficult task, since a single breach is often sufficient to compromise the security of an entire system. It is therefore not surprising that day after day we hear about cyberattacks and confidential data leaks in the news. To give the reader an idea of the scale of the problem, various reports suggest that between 85% and 98% of websites contain at least one serious vulnerability. In this thesis, we focus on one particular aspect of software security, namely access control models. Access control models are critical security components that define the actions a user can and cannot perform in a system. Year after year, several security organizations report access control flaws among the most prevalent and critical flaws in Web applications. However, contrary to other types of security flaws such as SQL injection (SQLi) and cross-site scripting (XSS), access control flaws have comparatively received little attention from the research community. This research work attempts to reverse this trend. While application security and access control models are the main underlying themes of this thesis, our research work is also strongly anchored in software engineering. Our work is always based on real-world Web applications, and the approaches we developed are always built so as to minimize the amount of work required from developers. In other words, this thesis is about practical software security. In the context of this thesis, we tackle the highly challenging problem of investigating unspecified and often undocumented access control models in open source Web applications. Access control flaws occur when a user is able to perform operations he should not be able to do, or to access data he should be denied access to. In the absence of security specifications, determining who should have the authorization to perform specific operations or access specific data is not straightforward. In order to overcome this challenge, we first developed a novel approach, called Security Pattern Traversal (SPT) analysis, to reverse-engineer access control models from the source code of applications in a fast, precise, and scalable manner. Results from SPT analysis give a portrait of the access control model as implemented in an application and serve as a baseline for further analyses. For example, real-world Web applications often define several hundred privileges that protect hundreds of different functions and modules. As a consequence, access control models, as reverse-engineered by SPT analysis, can be difficult to interpret from a developer's point of view, due to their size.
    In order to provide better support to developers, we explored how Formal Concept Analysis (FCA) could facilitate comprehension by providing visual support as well as automated reasoning about the extracted access control models. Results revealed that FCA can highlight properties of implemented access control models that are buried deep in the source code of applications, that are invisible to administrators and developers, and that can cause misunderstandings and vulnerabilities. Through investigation and observation of several Web applications, we also identified recurring, cross-application, error-prone patterns in access control models. The second half of this thesis presents the approaches we developed to leverage SPT results to automatically capture these patterns, which lead to access control flaws such as forced browsing vulnerabilities, semantic errors, and errors based on security-discordant clones. Each of these approaches interprets SPT analysis results from a different angle to identify a different kind of access control flaw in Web applications. Forced browsing vulnerabilities occur when security-sensitive resources are not protected against direct access to their URL. Using results from SPT, we showed how we can detect such vulnerabilities precisely and very quickly (up to 890 times faster than the state of the art). Semantic errors occur when security-sensitive resources are protected by semantically wrong privileges. For example, in the context of a Web application, protecting access to administrative resources with a privilege designed to restrict file uploads is a semantic error. To our knowledge, we were the first to tackle this problem and to successfully detect semantic errors in access control models. We achieved these results by interpreting SPT results in the light of a natural language processing technique called Latent Dirichlet Allocation. Finally, by investigating SPT results in the light of software clones, we were able to detect yet other novel access control flaws. Simply put, we explored the intuition that code clones, which are syntactically similar blocks of code, are expected to perform similar operations in a system and, consequently, to be protected by similar privileges. By investigating clones that are protected in different ways, called security-discordant clones, we were able to report several novel access control flaws in the investigated systems. Despite the significant advancements made through this thesis, research on access control models and access control flaws, especially from a practical, application-centric point of view, is still in its early stages. From a software engineering perspective, a lot of work remains to be done on the extraction, modelling, understanding, and testing of access control models. Throughout this thesis we discuss how the presented work can support these activities and suggest further lines of research.
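
    To make the security-discordant-clone intuition concrete, here is a minimal sketch, with hypothetical inputs, of the consistency check it implies: clone classes whose members are not guarded by the same privileges are flagged for review. In the thesis, the clone classes would come from a clone detector and the privilege map from SPT analysis; both are hard-coded toy data here.

    # Illustrative sketch of the security-discordant-clone check: members of
    # one clone class are expected to be guarded identically, so any class
    # whose members disagree is flagged. All inputs below are made up.

    # clone class -> code block identifiers that are syntactically similar
    clone_classes = {
        "cc1": ["admin.php:12", "tools.php:40", "legacy.php:8"],
        "cc2": ["upload.php:30", "media.php:22"],
    }

    # code block -> privileges checked on the paths reaching it
    privileges = {
        "admin.php:12":  {"manage_options"},
        "tools.php:40":  {"manage_options"},
        "legacy.php:8":  set(),            # unprotected clone: a red flag
        "upload.php:30": {"upload_files"},
        "media.php:22":  {"upload_files"},
    }

    def discordant(classes, privs):
        """Return clone classes whose members are not protected identically."""
        report = {}
        for cc, blocks in classes.items():
            protections = {b: frozenset(privs.get(b, set())) for b in blocks}
            if len(set(protections.values())) > 1:
                report[cc] = protections
        return report

    for cc, blocks in discordant(clone_classes, privileges).items():
        print(f"{cc}: inconsistent protection -> {blocks}")

    On this toy data, only cc1 is reported, because one of its three clones performs the same operation without any privilege check.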

    A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges

    Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code detection, plagiarism detection, malware detection, and smell detection. This paper proposes a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while many programming languages have no support at all. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate code, of which only eight are publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and support for multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to the maintenance phase.
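
    As a minimal illustration of the token-based family of techniques such reviews cover, the sketch below scores a pair of code fragments by Jaccard overlap of token k-gram shingles. The crude tokenizer and the choice of k are simplifying assumptions; production clone detectors typically normalize identifiers and literals before shingling.

    # Minimal token-based similarity sketch: Jaccard overlap of k-gram
    # shingles over a crude token stream. Tokenizer and k are simplifications.
    import re

    def shingles(code: str, k: int = 3) -> set:
        """Return the set of k-grams over the fragment's token stream."""
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)
        return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

    def jaccard(a: str, b: str, k: int = 3) -> float:
        sa, sb = shingles(a, k), shingles(b, k)
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

    f1 = "int sum = 0; for (int i = 0; i < n; i++) sum += a[i];"
    f2 = "int total = 0; for (int j = 0; j < n; j++) total += a[j];"
    print(f"similarity: {jaccard(f1, f2):.2f}")  # high score: likely clones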

    Revisiting the challenges and surveys in text similarity matching and detection methods

    The massive amount of information on the internet has revolutionized the field of natural language processing. One of its challenges is estimating the similarity between texts, which has remained an open research problem even though various studies have proposed new methods over the years. This paper surveys and traces the primary studies in the field of text similarity. The aim is to give a broad overview of existing issues, applications, and methods of text similarity research. This paper identifies four issues and several applications of text similarity matching. It classifies current studies based on intrinsic, extrinsic, and hybrid approaches. We then identify the methods and classify them into lexical similarity, syntactic similarity, semantic similarity, structural similarity, and hybrid methods. Furthermore, this study also analyzes and discusses method improvements, current limitations, and open challenges on this topic for future research directions.
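
    To ground the lexical-similarity category, here is a minimal sketch of cosine similarity over term-frequency vectors. It deliberately ignores word order and meaning, which is precisely the limitation that motivates the syntactic, semantic, structural, and hybrid methods such surveys classify.

    # Minimal lexical-similarity sketch: cosine over term-frequency vectors.
    # Word order and meaning are ignored by design.
    import math
    from collections import Counter

    def tf(text: str) -> Counter:
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    s1 = "the cat sat on the mat"
    s2 = "a cat was sitting on a mat"
    print(f"lexical similarity: {cosine(tf(s1), tf(s2)):.2f}")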

    Model-Driven Engineering in the Large: Refactoring Techniques for Models and Model Transformation Systems

    Model-Driven Engineering (MDE) is a software engineering paradigm that aims to increase the productivity of developers by raising the abstraction level of software development. It envisions the use of models as key artifacts during design, implementation, and deployment. From the recent arrival of MDE in large-scale industrial software development – a trend we refer to as MDE in the large – a set of challenges emerges: First, models are now developed at distributed locations, by teams of teams. In such highly collaborative settings, the presence of large monolithic models gives rise to certain issues, such as their proneness to editing conflicts. Second, in large-scale system development, models are created using various domain-specific modeling languages. Combining these models in a disciplined manner calls for adequate modularization mechanisms. Third, the development of models is handled systematically by expressing the involved operations using model transformation rules. Such rules are often created by cloning, a practice associated with performance and maintainability issues. In this thesis, we contribute three refactoring techniques, each aiming to tackle one of these challenges. First, we propose a technique to split a large monolithic model into a set of sub-models. The aim of this technique is to enable a separation of concerns within models, promoting a concern-based collaboration style: collaborators operate on the sub-models relevant to their task at hand. Second, we suggest a technique to encapsulate model components by introducing modular interfaces in a set of related models. The goal of this technique is to establish modularity in these models. Third, we introduce a refactoring to merge a set of model transformation rules exhibiting a high degree of similarity. The aim of this technique is to improve maintainability and performance by eliminating the drawbacks associated with cloning. The refactoring creates variability-based rules, a novel type of rule that captures variability by using annotations. The refactoring techniques contributed in this work reduce the manual effort of refactoring models and transformation rules to a large extent. As indicated in a series of realistic case studies, the output produced by the techniques is comparable or, in the case of transformation rules, partly even preferable to the result of manual refactoring, yielding a promising outlook on their applicability in real-world settings.
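
    A toy sketch of the idea behind the merge refactoring follows, under the simplifying assumption that a rule can be modelled as a set of named elements (real transformation rules are graphs): merging similar rules yields one variability-based rule whose elements are annotated with the variants they belong to.

    # Toy model of variability-based rule merging: each element of the merged
    # rule is annotated with the set of rule variants it occurs in. Rule names
    # and elements are hypothetical; real rules are graph patterns.

    def merge(rules: dict) -> dict:
        """Fold rule variants into one element -> annotating-variants map."""
        merged = {}
        for name, elements in rules.items():
            for el in elements:
                merged.setdefault(el, set()).add(name)
        return merged

    rules = {
        "moveAttribute": {"match class", "match attribute", "relink attribute"},
        "moveMethod":    {"match class", "match method", "relink method"},
    }
    variability_rule = merge(rules)
    all_variants = set(rules)
    for element, variants in sorted(variability_rule.items()):
        tag = ("common core" if variants == all_variants
               else f"only in {sorted(variants)}")
        print(f"{element}: {tag}")

    The common core ("match class" here) is stored once instead of being cloned into every rule, which is the source of the maintainability and performance gains the thesis reports.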

    Grand Challenges of Traceability: The Next Ten Years

    In 2007, the software and systems traceability community met at the first Natural Bridge symposium on the Grand Challenges of Traceability to establish and address research goals for achieving effective, trustworthy, and ubiquitous traceability. Ten years later, in 2017, the community came together to evaluate a decade of progress towards achieving these goals. These proceedings document some of that progress. They include a series of short position papers, representing current work in the community, organized across four process axes of traceability practice. The sessions covered Trace Strategizing, Trace Link Creation and Evolution, Trace Link Usage, real-world applications of Traceability, and Traceability Datasets and benchmarks. Two breakout groups focused on the importance of creating and sharing traceability datasets within the research community, and discussed challenges related to the adoption of tracing techniques in industrial practice. Members of the research community are engaged in many active, ongoing, and impactful research projects. Our hope is that ten years from now we will be able to look back at a productive decade of research and claim that we have achieved the overarching Grand Challenge of Traceability, which seeks for traceability to be always present, built into the engineering process, and to have "effectively disappeared without a trace". We hope that others will see the potential that traceability has for empowering software and systems engineers to develop higher-quality products at increasing levels of complexity and scale, and that they will join the active community of Software and Systems traceability researchers as we move forward into the next decade of research.

    Toward an Effective Automated Tracing Process

    Traceability is defined as the ability to establish, record, and maintain dependency relations among various software artifacts in a software system, in both the forward and backward directions, throughout the multiple phases of the project's life cycle. The availability of traceability information has proven vital to several software engineering activities such as program comprehension, impact analysis, feature location, software reuse, and verification and validation (V&V). Research on automated software traceability has noticeably advanced in the past few years. Various methodologies and tools have been proposed in the literature to provide automatic support for establishing and maintaining traceability information in software systems. This movement is motivated by the increasing attention traceability has been receiving as a critical element of any rigorous software development process. Despite these major advances, however, traceability implementation and use is still not pervasive in industry. In particular, traceability tools are still far from achieving performance levels that are adequate for practical applications. Such low levels of accuracy require software engineers working with traceability tools to spend a considerable amount of their time verifying the generated traceability information, a process that is often described as tedious, exhausting, and error-prone. Motivated by these observations, and building upon a growing body of work in this area, in this dissertation we explore several research directions related to enhancing the performance of automated tracing tools and techniques. In particular, our work addresses several issues related to the various aspects of the IR-based automated tracing process, including trace link retrieval, performance enhancement, and the role of the human in the process. Our main objective is to achieve performance levels, in terms of accuracy, efficiency, and usability, that are adequate for practical applications, and ultimately to accomplish a successful technology transfer from research to industry.
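
    As a compact sketch of the IR-based trace link retrieval step such work builds on, the snippet below ranks candidate links by TF-IDF cosine similarity between a requirement and code artifacts, using scikit-learn. The artifacts are invented for illustration; a real pipeline would parse project files, and the ranked candidates would go to an analyst for vetting, the human-in-the-loop step whose cost the dissertation aims to reduce.

    # IR-based trace link retrieval sketch: rank candidate links by TF-IDF
    # cosine similarity. The requirement and code artifacts are made up.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    requirement = "the system shall encrypt user passwords before storage"
    code_artifacts = [
        "def hash_password(password): return bcrypt(password)",
        "def render_login_page(request): ...",
        "def store_encrypted(user, password): db.save(encrypt(password))",
    ]

    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([requirement] + code_artifacts)
    scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

    # Highest-scoring artifacts are the candidate trace links shown to the
    # analyst for verification.
    for artifact, score in sorted(zip(code_artifacts, scores),
                                  key=lambda p: -p[1]):
        print(f"{score:.2f}  {artifact}")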