1,704 research outputs found

    Untangling Fine-Grained Code Changes

    Get PDF
    After working for some time, developers commit their code changes to a version control system. When doing so, they often bundle unrelated changes (e.g., bug fix and refactoring) in a single commit, thus creating a so-called tangled commit. Sharing tangled commits is problematic because it makes review, reversion, and integration of these commits harder and historical analyses of the project less reliable. Researchers have worked at untangling existing commits, i.e., finding which part of a commit relates to which task. In this paper, we contribute to this line of work in two ways: (1) A publicly available dataset of untangled code changes, created with the help of two developers who accurately split their code changes into self contained tasks over a period of four months; (2) a novel approach, EpiceaUntangler, to help developers share untangled commits (aka. atomic commits) by using fine-grained code change information. EpiceaUntangler is based and tested on the publicly available dataset, and further evaluated by deploying it to 7 developers, who used it for 2 weeks. We recorded a median success rate of 91% and average one of 75%, in automatically creating clusters of untangled fine-grained code changes

    ChangeBeadsThreader: An Interactive Environment for Tailoring Automatically Untangled Changes

    Full text link
    To improve the usability of a revision history, change untangling, which reconstructs the history to ensure that changes in each commit belong to one intentional task, is important. Although there are several untangling approaches based on the clustering of fine-grained editing operations of source code, they often produce unsuitable result for a developer, and manual tailoring of the result is necessary. In this paper, we propose ChangeBeadsThreader (CBT), an interactive environment for splitting and merging change clusters to support the manual tailoring of untangled changes. CBT provides two features: 1) a two-dimensional space where fine-grained change history is visualized to help users find the clusters to be merged and 2) an augmented diff view that enables users to confirm the consistency of the changes in a specific cluster for finding those to be split. These features allow users to easily tailor automatically untangled changes.Comment: 5 pages, SANER 202

    Flexeme: Untangling Commits Using Lexical Flows

    Get PDF

    A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits

    Get PDF
    Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.Comment: Status: Accepted at Empirical Software Engineerin

    Aide à l'Intégration de Branches Grâce à la Réification des Changements

    Get PDF
    Developers typically change codebases in parallel from each other, which results in diverging codebases. Such diverging codebases must be integrated when finished. Integrating diverging codebases involves difficult activities. For example, two changes that are correct independently can introduce subtle bugs when integrated together. Integration can be difficult with existing tools, which, instead of dealing with the evolution of the actual program entities being changed, handle code changes as lines of text in files. Tools are important: software development tools have greatly improved from generic text editors to IDEs by providing high-level code manipulation such as automatic refactorings and code completion. This improvement was possible by the reification of program entities. Nevertheless, integration tools did not benefit from a similarreification of change entities to improve productivity in integration.In this work we first conducted a study to learn which integration activities are important and have little tool support. We discovered that one of such activities is the detection of tangled commits (that contain unrelated tasks such as a bug fix and a refactoring). Then we proposed Epicea, a reified change model and associated IDE tools, and EpiceaUntangler, an approach to help developers share untangled commits based on Epicea. The results of our evaluations with real-world studies show the usefulness of our approaches.Les développeurs changent le code source en parallèle les uns des autres, ce qui fait diverger les bases de code. Ces divergences se doivent d'être réintégrées.L'intégration de bases de code divergentes est une activité complexe. Par exemple, réunir deux bases de code indépendamment correctes peut générer des problèmes. L'intégration peut être difficile avec les outils existants, qui, au lieu de gérer l'évolution des entités réelles du programme modifié, gère les changements de code au niveau des lignes de texte dans les fichiers sources.Les outils sont importants: les outils de développement de logiciels se sont grandement améliorés en passant par exemple d'éditeurs de textegénériques à des IDEs qui fournissent de la manipulation de code de haut niveau tels que la refactorisation automatique et la complétion de code. Cette amélioration a été possible grâce à la réification des entités de programme. Néanmoins, les outils d'intégration n'ont pas profité d'une réification similaire des entités de changement pour améliorer l'intégration.Dans cette thèse nous avons d'abord conduit une étude auprès de développeurs pourcomprendre quelles sont les activités menées durant une intégration quisont peu supportées par les outils. L'une d'elle est la détection de commits mêlés (qui contiennent des tâches non liées telles qu'une correction de bug et une refactorisation).Ensuite, nous proposons Epicea, un modèle de changement réifié et des outils d'IDE associés, et EpiceaUntangler, une approche pour aider les développeurs à démêler les commits en se basant sur Epicea.Les résultats de nos évaluations avec des études de cas issues du monde réel montrent l’utilité de nos approches

    RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews from Unsupervised Viewpoints

    Full text link
    We propose a Convolutional Neural Network (CNN)-based model "RotationNet," which takes multi-view images of an object as input and jointly estimates its pose and object category. Unlike previous approaches that use known viewpoint labels for training, our method treats the viewpoint labels as latent variables, which are learned in an unsupervised manner during the training using an unaligned object dataset. RotationNet is designed to use only a partial set of multi-view images for inference, and this property makes it useful in practical scenarios where only partial views are available. Moreover, our pose alignment strategy enables one to obtain view-specific feature representations shared across classes, which is important to maintain high accuracy in both object categorization and pose estimation. Effectiveness of RotationNet is demonstrated by its superior performance to the state-of-the-art methods of 3D object classification on 10- and 40-class ModelNet datasets. We also show that RotationNet, even trained without known poses, achieves the state-of-the-art performance on an object pose estimation dataset. The code is available on https://github.com/kanezaki/rotationnetComment: 24 pages, 23 figures. Accepted to CVPR 201

    ProtoMD: A Prototyping Toolkit for Multiscale Molecular Dynamics

    Full text link
    ProtoMD is a toolkit that facilitates the development of algorithms for multiscale molecular dynamics (MD) simulations. It is designed for multiscale methods which capture the dynamic transfer of information across multiple spatial scales, such as the atomic to the mesoscopic scale, via coevolving microscopic and coarse-grained (CG) variables. ProtoMD can be also be used to calibrate parameters needed in traditional CG-MD methods. The toolkit integrates `GROMACS wrapper' to initiate MD simulations, and `MDAnalysis' to analyze and manipulate trajectory files. It facilitates experimentation with a spectrum of coarse-grained variables, prototyping rare events (such as chemical reactions), or simulating nanocharacterization experiments such as terahertz spectroscopy, AFM, nanopore, and time-of-flight mass spectroscopy. ProtoMD is written in python and is freely available under the GNU General Public License from github.com/CTCNano/proto_md

    Toward Provenance-Based Security for Configuration Languages

    Get PDF
    Large system installations are increasingly configured using high-level, mostly-declarative languages. Often, different users contribute data that is compiled centrally and distributed to individual systems. Although the systems themselves have been developed with reliability and availability in mind, the configuration compilation process can lead to unforeseen vulnerabilities because of the lack of access control on the different components combined to build the final configuration. Even if simple change-based access controls are applied to validate changes to the final version, changes can be lost or incorrectly attributed. Based on the growing literature on provenance for database queries and other models of computation, we identify a potential application area for provenance to securing configuration languages.

    Improving Software Project Health Using Machine Learning

    Get PDF
    In recent years, systems that would previously live on different platforms have been integrated under a single umbrella. The increased use of GitHub, which offers pull-requests, issue trackingand version history, and its integration with other solutions such as Gerrit, or Travis, as well as theresponse from competitors, created development environments that favour agile methodologiesby increasingly automating non-coding tasks: automated build systems, automated issue triagingetc. In essence, source-code hosting platforms shifted to continuous integration/continuousdelivery (CI/CD) as a service. This facilitated a shift in development paradigms, adherents ofagile methodology can now adopt a CI/CD infrastructure more easily. This has also created large,publicly accessible sources of source-code together with related project artefacts: GHTorrent andsimilar datasets now offer programmatic access to the whole of GitHub. Project health encompasses traceability, documentation, adherence to coding conventions,tasks that reduce maintenance costs and increase accountability, but may not directly impactfeatures. Overfocus on health can slow velocity (new feature delivery) so the Agile Manifestosuggests developers should travel light — forgo tasks focused on a project health in favourof higher feature velocity. Obviously, injudiciously following this suggestion can undermine aproject’s chances for success. Simultaneously, this shift to CI/CD has allowed the proliferation of Natural Language orNatural Language and Formal Language textual artefacts that are programmatically accessible:GitHub and their competitors allow API access to their infrastructure to enable the creation ofCI/CD bots. This suggests that approaches from Natural Language Processing and MachineLearning are now feasible and indeed desirable. This thesis aims to (semi-)automate tasks forthis new paradigm and its attendant infrastructure by bringing to the foreground the relevant NLPand ML techniques. Under this umbrella, I focus on three synergistic tasks from this domain: (1) improving theissue-pull-request traceability, which can aid existing systems to automatically curate the issuebacklog as pull-requests are merged; (2) untangling commits in a version history, which canaid the beforementioned traceability task as well as improve the usability of determining a faultintroducing commit, or cherry-picking via tools such as git bisect; (3) mixed-text parsing, to allowbetter API mining and open new avenues for project-specific code-recommendation tools
    • …
    corecore