Untangling Fine-Grained Code Changes
After working for some time, developers commit their code changes to a
version control system. When doing so, they often bundle unrelated changes
(e.g., bug fix and refactoring) in a single commit, thus creating a so-called
tangled commit. Sharing tangled commits is problematic because it makes review,
reversion, and integration of these commits harder and historical analyses of
the project less reliable. Researchers have worked at untangling existing
commits, i.e., finding which part of a commit relates to which task. In this
paper, we contribute to this line of work in two ways: (1) A publicly available
dataset of untangled code changes, created with the help of two developers who
accurately split their code changes into self-contained tasks over a period of
four months; (2) a novel approach, EpiceaUntangler, to help developers share
untangled commits (a.k.a. atomic commits) by using fine-grained code change
information. EpiceaUntangler is based on and tested with the publicly available
dataset, and was further evaluated by deploying it to 7 developers, who used it
for 2 weeks. We recorded a median success rate of 91% and an average of 75% in
automatically creating clusters of untangled fine-grained code changes.
ChangeBeadsThreader: An Interactive Environment for Tailoring Automatically Untangled Changes
To improve the usability of a revision history, change untangling, which
reconstructs the history to ensure that changes in each commit belong to one
intentional task, is important. Although there are several untangling
approaches based on the clustering of fine-grained editing operations of source
code, they often produce results unsuitable for a developer, and manual
tailoring of the result is necessary. In this paper, we propose
ChangeBeadsThreader (CBT), an interactive environment for splitting and merging
change clusters to support the manual tailoring of untangled changes. CBT
provides two features: 1) a two-dimensional space where fine-grained change
history is visualized to help users find the clusters to be merged and 2) an
augmented diff view that enables users to confirm the consistency of the
changes in a specific cluster for finding those to be split. These features
allow users to easily tailor automatically untangled changes.
Comment: 5 pages, SANER 202
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits
Context: Tangled commits are changes to software that address multiple
concerns at once. For researchers interested in bugs, tangled commits mean that
they actually study not only bugs, but also other concerns irrelevant for the
study of bugs.
Objective: We want to improve our understanding of the prevalence of tangling
and the types of changes that are tangled within bug fixing commits.
Methods: We use a crowdsourcing approach for manual labeling to validate
which changes contribute to bug fixes for each line in bug fixing commits. Each
line is labeled by four participants. If at least three participants agree on
the same label, we have consensus.
Results: We estimate that between 17% and 32% of all changes in bug fixing
commits modify the source code to fix the underlying problem. However, when we
only consider changes to the production code files this ratio increases to 66%
to 87%. We find that about 11% of lines are hard to label, leading to active
disagreements between participants. Due to confirmed tangling and the
uncertainty in our data, we estimate that 3% to 47% of data is noisy without
manual untangling, depending on the use case.
Conclusion: Tangled commits have a high prevalence in bug fixes and can lead
to a large amount of noise in the data. Prior research indicates that this
noise may alter results. As researchers, we should be skeptics and assume that
unvalidated data is likely very noisy, until proven otherwise.
Comment: Status: Accepted at Empirical Software Engineerin
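The consensus rule from the Methods paragraph above (four participants per line, consensus when at least three agree) can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `consensus_label` and the label strings are hypothetical, and only the four-rater, three-agree threshold comes from the abstract.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(labels: Sequence[str], min_agreement: int = 3) -> Optional[str]:
    """Return the label given by at least `min_agreement` raters, else None.

    Hypothetical helper: with four raters and a threshold of three, at most
    one label can reach the threshold, so the result is unambiguous.
    """
    if not labels:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None


# Hypothetical label names, used only for illustration.
agreed = consensus_label(["bugfix", "bugfix", "bugfix", "test"])
disputed = consensus_label(["bugfix", "bugfix", "test", "test"])
```

Under this sketch, a line on which the raters split two against two returns `None`, which corresponds to the lines the abstract describes as lacking consensus.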
Aide à l'Intégration de Branches Grâce à la Réification des Changements (Supporting Branch Integration Through the Reification of Changes)
Developers typically change codebases in parallel with each other, which results in diverging codebases. Such diverging codebases must be integrated when finished. Integrating diverging codebases involves difficult activities. For example, two changes that are correct independently can introduce subtle bugs when integrated together. Integration can be difficult with existing tools, which, instead of dealing with the evolution of the actual program entities being changed, handle code changes as lines of text in files. Tools are important: software development tools have greatly improved from generic text editors to IDEs by providing high-level code manipulation such as automatic refactorings and code completion. This improvement was made possible by the reification of program entities. Nevertheless, integration tools did not benefit from a similar reification of change entities to improve productivity in integration. In this thesis, we first conducted a study with developers to learn which integration activities are important and have little tool support. We discovered that one such activity is the detection of tangled commits (commits that contain unrelated tasks such as a bug fix and a refactoring). Then we proposed Epicea, a reified change model and associated IDE tools, and EpiceaUntangler, an approach to help developers share untangled commits based on Epicea. The results of our evaluations with real-world studies show the usefulness of our approaches.
RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews from Unsupervised Viewpoints
We propose a Convolutional Neural Network (CNN)-based model "RotationNet,"
which takes multi-view images of an object as input and jointly estimates its
pose and object category. Unlike previous approaches that use known viewpoint
labels for training, our method treats the viewpoint labels as latent
variables, which are learned in an unsupervised manner during the training
using an unaligned object dataset. RotationNet is designed to use only a
partial set of multi-view images for inference, and this property makes it
useful in practical scenarios where only partial views are available. Moreover,
our pose alignment strategy enables one to obtain view-specific feature
representations shared across classes, which is important to maintain high
accuracy in both object categorization and pose estimation. Effectiveness of
RotationNet is demonstrated by its superior performance to the state-of-the-art
methods of 3D object classification on 10- and 40-class ModelNet datasets. We
also show that RotationNet, even trained without known poses, achieves the
state-of-the-art performance on an object pose estimation dataset. The code is
available at https://github.com/kanezaki/rotationnet
Comment: 24 pages, 23 figures. Accepted to CVPR 201
ProtoMD: A Prototyping Toolkit for Multiscale Molecular Dynamics
ProtoMD is a toolkit that facilitates the development of algorithms for
multiscale molecular dynamics (MD) simulations. It is designed for multiscale
methods which capture the dynamic transfer of information across multiple
spatial scales, such as the atomic to the mesoscopic scale, via coevolving
microscopic and coarse-grained (CG) variables. ProtoMD can also be used to
calibrate parameters needed in traditional CG-MD methods. The toolkit
integrates 'GROMACS wrapper' to initiate MD simulations, and 'MDAnalysis' to
analyze and manipulate trajectory files. It facilitates experimentation with a
spectrum of coarse-grained variables, prototyping rare events (such as chemical
reactions), or simulating nanocharacterization experiments such as terahertz
spectroscopy, AFM, nanopore, and time-of-flight mass spectroscopy. ProtoMD is
written in Python and is freely available under the GNU General Public License
from github.com/CTCNano/proto_md
Toward Provenance-Based Security for Configuration Languages
Large system installations are increasingly configured using high-level, mostly-declarative languages. Often, different users contribute data that is compiled centrally and distributed to individual systems. Although the systems themselves have been developed with reliability and availability in mind, the configuration compilation process can lead to unforeseen vulnerabilities because of the lack of access control on the different components combined to build the final configuration. Even if simple change-based access controls are applied to validate changes to the final version, changes can be lost or incorrectly attributed. Based on the growing literature on provenance for database queries and other models of computation, we identify a potential application area for provenance in securing configuration languages.
Improving Software Project Health Using Machine Learning
In recent years, systems that would previously live on different platforms have been integrated under a single umbrella. The increased use of GitHub, which offers pull-requests, issue tracking and version history, and its integration with other solutions such as Gerrit or Travis, as well as the response from competitors, created development environments that favour agile methodologies by increasingly automating non-coding tasks: automated build systems, automated issue triaging, etc. In essence, source-code hosting platforms shifted to continuous integration/continuous delivery (CI/CD) as a service. This facilitated a shift in development paradigms: adherents of agile methodology can now adopt a CI/CD infrastructure more easily. This has also created large, publicly accessible sources of source code together with related project artefacts: GHTorrent and similar datasets now offer programmatic access to the whole of GitHub. Project health encompasses traceability, documentation, adherence to coding conventions, tasks that reduce maintenance costs and increase accountability, but may not directly impact features. Overfocus on health can slow velocity (new feature delivery), so the Agile Manifesto suggests developers should travel light: forgo tasks focused on project health in favour of higher feature velocity. Obviously, injudiciously following this suggestion can undermine a project's chances for success. Simultaneously, this shift to CI/CD has allowed the proliferation of Natural Language or Natural Language and Formal Language textual artefacts that are programmatically accessible: GitHub and its competitors allow API access to their infrastructure to enable the creation of CI/CD bots. This suggests that approaches from Natural Language Processing and Machine Learning are now feasible and indeed desirable. This thesis aims to (semi-)automate tasks for this new paradigm and its attendant infrastructure by bringing to the foreground the relevant NLP and ML techniques.
Under this umbrella, I focus on three synergistic tasks from this domain: (1) improving the issue-pull-request traceability, which can aid existing systems to automatically curate the issue backlog as pull-requests are merged; (2) untangling commits in a version history, which can aid the aforementioned traceability task as well as improve the usability of determining a fault-introducing commit, or cherry-picking via tools such as git bisect; (3) mixed-text parsing, to allow better API mining and open new avenues for project-specific code-recommendation tools.