Characteristics of Useful Code Reviews: An Empirical Study at Microsoft
Abstract: Over the past decade, both open source and commercial software projects have adopted contemporary peer code review practices as a quality control mechanism. Prior research has shown that developers spend a large amount of time and effort performing code reviews. Therefore, identifying the factors that lead to useful code reviews can benefit projects by increasing code review effectiveness and quality. In a three-stage mixed-methods study, we qualitatively investigated what aspects of code reviews make them useful to developers, used our findings to build and verify a classification model that can distinguish between useful and not useful code review feedback, and finally used this classifier to classify review comments, enabling us to empirically investigate factors that lead to more effective code review feedback. In total, we analyzed 1.5 million review comments from five Microsoft projects and uncovered many factors that affect the usefulness of review feedback. For example, we found that the proportion of useful comments made by a reviewer increases dramatically during his or her first year at Microsoft but tends to plateau afterwards. In contrast, we found that the more files there are in a change, the lower the proportion of comments in the code review that will be of value to the author of the change. Based on our findings, we provide recommendations for practitioners to improve the effectiveness of code reviews.
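To make the classification step concrete, the sketch below trains a toy useful vs. not-useful comment classifier with scikit-learn and scores it with cross-validation. The inline comments, labels, and the TF-IDF plus logistic regression setup are illustrative assumptions, not the features or model used in the Microsoft study.

```python
# Minimal sketch (not the study's actual model): a baseline text classifier
# that labels review comments as useful (1) or not useful (0).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical examples; the study derived its labels from its qualitative investigation.
comments = [
    "Consider extracting this block into a helper to avoid duplication.",
    "Looks good to me.",
    "This null check is missing; the service can return None here.",
    "Thanks!",
]
labels = [1, 0, 1, 0]  # 1 = useful, 0 = not useful

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, comments, labels, cv=2)  # k-fold evaluation
print("mean accuracy:", scores.mean())
```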
ToxiSpanSE: An Explainable Toxicity Detection in Code Review Comments
Background: The existence of toxic conversations in open-source platforms can
degrade relationships among software developers and may negatively impact
software product quality. To help mitigate this, some initial work has been
done to detect toxic comments in the Software Engineering (SE) domain. Aims:
Since automatically classifying an entire text as toxic or non-toxic does not
help human moderators to understand the specific reason(s) for toxicity, we
worked to develop an explainable toxicity detector for the SE domain. Method:
Our explainable toxicity detector can detect specific spans of toxic content
from SE texts, which can help human moderators by automatically highlighting
those spans. This toxic span detection model, ToxiSpanSE, is trained on
19,651 code review (CR) comments with labeled toxic spans. Our annotators
labeled the toxic spans within 3,757 toxic CR samples. We explored several
types of models, including one lexicon-based approach and five different
transformer-based encoders. Results: After an extensive evaluation of all
models, we found that our fine-tuned RoBERTa model achieved the best score, with
an F1 of 0.88, 0.87 precision, and 0.93 recall for toxic class tokens, providing an
explainable toxicity classifier for the SE domain. Conclusion: Since ToxiSpanSE
is the first tool to detect toxic spans in the SE domain, it will pave the
way toward combating toxicity in the SE community.
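As a rough illustration of span-level highlighting, the sketch below runs a Hugging Face token-classification pipeline over a review comment and prints the character offsets of each predicted toxic span. The checkpoint path is hypothetical, and this is not ToxiSpanSE's released inference code.

```python
# Minimal sketch of toxic-span highlighting with a token-classification model.
from transformers import pipeline

# Hypothetical fine-tuned checkpoint; substitute an available toxic-span model.
tagger = pipeline(
    "token-classification",
    model="path/to/toxispanse-roberta",
    aggregation_strategy="simple",  # merge sub-word tokens into contiguous spans
)

comment = "This patch is garbage, did you even test it?"
for span in tagger(comment):
    # Each span carries character offsets into the original comment, which is
    # what lets a moderation UI highlight the offending text for a reviewer.
    print(span["entity_group"],
          repr(comment[span["start"]:span["end"]]),
          round(span["score"], 2))
```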
Automated Identification of Sexual Orientation and Gender Identity Discriminatory Texts from Issue Comments
In an industry dominated by straight men, many developers representing other
gender identities and sexual orientations often encounter hateful or
discriminatory messages. Such communications pose barriers to participation for
women and LGBTQ+ persons. Due to sheer volume, manual inspection of all
communications for discriminatory content is infeasible for a large-scale
Free/Libre Open-Source Software (FLOSS) community. To address this challenge, this
study aims to develop an automated mechanism to identify Sexual orientation and
Gender identity Discriminatory (SGID) texts from software developers'
communications. Toward this goal, we trained and evaluated SGID4SE (Sexual
orientation and Gender Identity Discriminatory text identification for
Software Engineering texts) as a supervised learning-based SGID detection tool.
SGID4SE incorporates six preprocessing steps and ten state-of-the-art
algorithms. SGID4SE implements six different strategies to improve the
performance of the minority class. We empirically evaluated each strategy and
identified an optimum configuration for each algorithm. In our ten-fold
cross-validation-based evaluations, a BERT-based model achieves the best
performance with 85.9% precision, 80.0% recall, and an 82.9% F1-score for the SGID
class. This model achieves 95.7% accuracy and 80.4% Matthews Correlation
Coefficient. Our dataset and tool establish a foundation for further research
in this direction.
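For reference, the snippet below shows how the metrics quoted above (precision, recall, F1, accuracy, and the Matthews Correlation Coefficient) can be computed with scikit-learn on a small set of hypothetical predictions; it is not SGID4SE's evaluation harness.

```python
# Evaluation metrics for a binary SGID / non-SGID classifier on made-up labels.
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

y_true = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]  # 1 = SGID text (hypothetical)
y_pred = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("MCC:      ", matthews_corrcoef(y_true, y_pred))
```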
Towards Automated Classification of Code Review Feedback to Support Analytics
Background: As improving code review (CR) effectiveness is a priority for
many software development organizations, projects have deployed CR analytics
platforms to identify potential improvement areas. The number of issues
identified, which is a crucial metric to measure CR effectiveness, can be
misleading if all issues are placed in the same bin. Therefore, a finer-grained
classification of issues identified during CRs can provide actionable insights
to improve CR effectiveness. Although a recent work by Fregnan et al. proposed
automated models to classify CR-induced changes, we have noticed two potential
improvement areas -- i) classifying comments that do not induce changes and ii)
using deep neural networks (DNN) in conjunction with code context to improve
performances. Aims: This study aims to develop an automated CR comment
classifier that leverages DNN models to achieve a more reliable performance
than Fregnan et al.'s approach. Method: Using a manually labeled dataset of 1,828 CR
comments, we trained and evaluated supervised learning-based DNN models
leveraging code context, comment text, and a set of code metrics to classify CR
comments into one of the five high-level categories proposed by Turzo and Bosu.
Results: Based on our 10-fold cross-validation-based evaluations of multiple
combinations of tokenization approaches, we found that a model using CodeBERT
achieved the best accuracy of 59.3%. Our approach outperforms Fregnan et al.'s
approach by achieving 18.7% higher accuracy. Conclusion: Besides facilitating
improved CR analytics, our proposed model can be useful for developers in
prioritizing code review feedback and selecting reviewers.
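One plausible way to set up such a classifier is sketched below: CodeBERT loaded with a five-way classification head and fed a review comment together with its code context as a sentence pair. The example inputs, the missing fine-tuning loop, and the generic label indices are simplifications, not the authors' pipeline.

```python
# Minimal sketch: CodeBERT with a 5-label classification head for CR comments.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=5,  # the five high-level categories of Turzo and Bosu
)

comment = "Please rename this variable; 'tmp' does not convey its purpose."
code_context = "def process(tmp):\n    return tmp * 2"

# Encode the comment and its code context as a sentence pair.
inputs = tokenizer(comment, code_context, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print("predicted category id:", logits.argmax(dim=-1).item())
```

Without fine-tuning on labeled CR comments, the prediction above is essentially random; the point is only to show the comment-plus-context input pairing and the shape of the model.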
A Comparison of Nano-Patterns vs. Software Metrics in Vulnerability Prediction
Context: Software security is an imperative aspect of software quality. Early detection of vulnerable code during development can better ensure the security of the codebase and minimize testing efforts. Although traditional software metrics are used for early detection of vulnerabilities, they are not granular enough to precisely pinpoint vulnerable code. The goal of this study is to employ method-level traceable patterns (nano-patterns) in vulnerability prediction and empirically compare their performance with traditional software metrics. The concept of nano-patterns is similar to design patterns, but these constructs can be automatically recognized and extracted from source code. If nano-patterns can better predict vulnerable methods compared to software metrics, they can be used in developing vulnerability prediction models with better accuracy. Aims: This study explores the performance of method-level patterns in vulnerability prediction. We also compare them with method-level software metrics. Method: We studied vulnerabilities reported for two major releases of Apache Tomcat (6 and 7), Apache CXF, and two stand-alone Java web applications. We used three machine learning techniques to predict vulnerabilities using nano-patterns as features. We applied the same techniques using method-level software metrics as features and compared their performance with nano-patterns. Results: We found that nano-patterns show lower false negative rates for classifying vulnerable methods (for Tomcat 6, 21% vs. 34.7%) and therefore have higher recall in predicting vulnerable code than the software metrics used. On the other hand, software metrics show higher precision than nano-patterns (79.4% vs. 76.6%). Conclusion: In summary, we suggest that developers use nano-patterns as features for vulnerability prediction to augment existing approaches, as these code constructs outperform standard metrics in terms of prediction recall.
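A small arithmetic note on the recall claim above: the false negative rate (FNR) and recall are complements, so the reported Tomcat 6 false negative rates translate directly into recall for each feature set.

```python
# recall = TP / (TP + FN) = 1 - FNR
def recall_from_fnr(fnr: float) -> float:
    return 1.0 - fnr

print("nano-patterns recall (Tomcat 6):   ", recall_from_fnr(0.21))   # 0.79
print("software metrics recall (Tomcat 6):", recall_from_fnr(0.347))  # ~0.653
```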
When Are OSS Developers More Likely to Introduce Vulnerable Code Changes? A Case Study
We analyzed peer code review data of the Android Open Source Project (AOSP) to understand whether code changes that introduce security vulnerabilities, referred to as vulnerable code changes (VCCs), occur at certain intervals. Using a systematic manual analysis process, we identified 60 VCCs. Our results suggest that AOSP developers were more likely to write VCCs prior to AOSP releases, while during the post-release period they wrote fewer VCCs.