9 research outputs found

    Going Big: A Large-Scale Study on What Big Data Developers Ask

    Software developers are increasingly required to write big data code. However, they find big data software development challenging. To help these developers, it is necessary to understand the big data topics they are interested in and how difficult it is to find answers to questions on these topics. In this work, we conduct a large-scale study on Stack Overflow to understand the interests and difficulties of big data developers. To conduct the study, we develop a set of big data tags to extract big data posts from Stack Overflow; use topic modeling to group these posts into big data topics; group similar topics into categories to construct a topic hierarchy; analyze the popularity and difficulty of topics and their correlations; and discuss the implications of our findings for the practice, research, and education of big data software development, investigating how they coincide with the findings of previous work.
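    Below is a minimal sketch of the kind of topic-modeling pipeline this abstract describes, using scikit-learn's LDA over a few hypothetical Stack Overflow post bodies. The paper's actual tag set, preprocessing, and hyperparameters are not reproduced here; everything in this snippet is illustrative.

```python
# Illustrative LDA pipeline (assumptions: toy posts, arbitrary hyperparameters).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [
    "spark dataframe join runs out of memory",      # hypothetical post bodies
    "hadoop mapreduce job stuck at reduce phase",
    "kafka consumer group rebalancing too often",
]

# Build a document-term matrix, then fit LDA to group posts into topics.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Inspect the top words of each discovered topic.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {top}")
```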

    Insights into Software Development Approaches: Mining Q&A Repositories

    Context: Software practitioners adopt approaches such as DevOps, Scrum, and Waterfall for high-quality software development. However, limited research has explored software development approaches through the lens of practitioners' discussions on Q&A forums. Objective: We conducted an empirical study analyzing developers' discussions on Q&A forums to gain insights into software development approaches in practice. Method: We analyzed 13,903 developers' posts across the Stack Overflow (SO), Software Engineering Stack Exchange (SESE), and Project Management Stack Exchange (PMSE) forums. A mixed-methods approach, consisting of topic modeling (i.e., Latent Dirichlet Allocation (LDA)) and qualitative analysis, is used to identify frequently discussed software development approach topics, trends (popular and difficult topics), and the challenges practitioners face in adopting different software development approaches. Findings: We identified 15 frequently mentioned software development approach topics on Q&A sites and observed an increasing trend for the top-3 most difficult topics, which require more attention. Finally, our study identified 49 challenges faced by practitioners while deploying various software development approaches, and we created a thematic map to represent these findings. Conclusions: The study findings serve as a useful resource for practitioners to overcome challenges, stay informed about current trends, and ultimately improve the quality of the software products they develop.
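    The abstract ranks topics by popularity and difficulty but gives no formulas. A common operationalization in similar Q&A mining studies, shown below as an assumption rather than this paper's exact method, derives popularity from view counts and difficulty from the share of questions lacking an accepted answer plus the delay until acceptance.

```python
# One common operationalization (an assumption; the abstract gives no formulas).
from statistics import mean, median

# Hypothetical per-question records for a single topic.
questions = [
    {"views": 1200, "score": 5, "accepted": True,  "hours_to_accept": 2.0},
    {"views": 300,  "score": 1, "accepted": False, "hours_to_accept": None},
    {"views": 800,  "score": 3, "accepted": True,  "hours_to_accept": 30.0},
]

popularity = mean(q["views"] for q in questions)
share_unaccepted = sum(not q["accepted"] for q in questions) / len(questions)
median_hours = median(q["hours_to_accept"] for q in questions if q["accepted"])

print(f"popularity: {popularity:.0f} avg views")
print(f"difficulty: {share_unaccepted:.0%} without accepted answer, "
      f"median {median_hours:.1f}h to acceptance")
```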

    Safe Automated Refactoring for Intelligent Parallelization of Java 8 Streams

    Streaming APIs are becoming more pervasive in mainstream object-oriented programming languages and platforms. For example, the Stream API introduced in Java 8 allows for functional-like, MapReduce-style operations in processing both finite (e.g., collections) and infinite data structures. However, using this API efficiently involves subtle considerations, such as determining when stream operations are best run in parallel, when running operations in parallel can be less efficient, and when it is safe to run them in parallel given possible lambda expression side effects. Also, streams may not run all operations in parallel, depending on the particular collectors used in reductions. In this paper, we present an automated refactoring approach that assists developers in writing efficient stream code in a semantics-preserving fashion. The approach, based on a novel data ordering and typestate analysis, consists of preconditions and transformations for automatically determining when it is safe and possibly advantageous to convert sequential streams to parallel, unorder or de-parallelize already parallel streams, and optimize streams involving complex reductions. The approach was implemented as a plug-in to the popular Eclipse IDE, uses the WALA and SAFE analysis frameworks, and was evaluated on 11 Java projects consisting of ∼642K lines of code. We found that 57 of 157 candidate streams (36.31%) were refactorable, and an average speedup of 3.49 was observed on performance tests. The results indicate that the approach is useful in optimizing stream code to its full potential.
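    The parallelization-safety question the abstract raises can be made concrete with a small experiment. The sketch below is not the paper's typestate analysis and is in Python rather than Java; it only illustrates one underlying precondition for the "complex reductions" case: a reduction can be split across parallel workers without changing its result only if the combining operation is associative.

```python
# Simplified illustration (not the paper's analysis): splitting a reduction
# into chunks, as parallel workers would, is only semantics-preserving when
# the combining operation is associative.
from functools import reduce

data = list(range(1, 9))

def chunked_reduce(op, chunks):
    # Reduce each chunk independently, then combine the partial results,
    # mimicking how a parallel reduction would evaluate.
    partials = [reduce(op, c) for c in chunks]
    return reduce(op, partials)

chunks = [data[:4], data[4:]]

add = lambda a, b: a + b
sub = lambda a, b: a - b

# Addition is associative: sequential and chunked evaluation agree.
assert reduce(add, data) == chunked_reduce(add, chunks)

# Subtraction is not: chunked ("parallel") evaluation changes the result.
print(reduce(sub, data))            # 1-2-3-4-5-6-7-8 = -34
print(chunked_reduce(sub, chunks))  # (1-2-3-4) - (5-6-7-8) = 8
```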

    Challenges and Barriers of Using Low Code Software for Machine Learning

    As big data grows ubiquitous across many domains, more and more stakeholders seek to develop Machine Learning (ML) applications on their data. The success of an ML application usually depends on close collaboration between ML experts and domain experts. However, the shortage of ML engineers remains a fundamental problem. Low-code machine learning tools/platforms (aka AutoML) aim to democratize ML development to domain experts by automating many repetitive tasks in the ML pipeline. This research presents an empirical study of around 14k posts (questions + accepted answers) from Stack Overflow (SO) that contain AutoML-related discussions. We examine how these topics spread across the various Machine Learning Life Cycle (MLLC) phases, as well as their popularity and difficulty. This study offers several interesting findings. First, we find 13 AutoML topics, which we group into four categories. The MLOps topic category (43% of questions) is the largest, followed by Model (28%), Data (27%), and Documentation (2%). Second, most questions are asked during the Model training (29%) (i.e., implementation) and Data preparation (25%) MLLC phases. Third, AutoML practitioners find the MLOps topic category the most challenging, especially topics related to model deployment & monitoring and the automated ML pipeline. These findings have implications for all three AutoML stakeholders: AutoML researchers, AutoML service vendors, and AutoML developers. Academia and industry collaboration can improve different aspects of AutoML, such as better DevOps/deployment support and tutorial-based documentation.
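    To make "low-code ML" concrete, the snippet below runs TPOT, one open-source AutoML library, end to end on a toy dataset. TPOT is our choice of example here and is not necessarily among the tools discussed in the paper's posts.

```python
# Minimal AutoML run with TPOT (an illustrative tool choice, not the paper's).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# TPOT searches over whole pipelines (preprocessing + model + hyperparameters),
# automating the repetitive tuning work the abstract describes.
automl = TPOTClassifier(generations=3, population_size=20, random_state=0)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))

# Export the winning pipeline as plain scikit-learn code.
automl.export("best_pipeline.py")
```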

    Towards the detection and analysis of performance regression introducing code changes

    In contemporary software development, developers commonly conduct regression testing to ensure that code changes do not affect software quality. Performance regression testing is an emerging research area within the regression testing domain of software engineering. Performance regression testing aims to maintain the system's performance. Conducting performance regression testing is known to be expensive. It is also complex, considering the growth of committed code and the number of development team members working simultaneously. Many automated regression testing techniques have been proposed in prior research. However, challenges in locating and resolving performance regressions in practice still exist. Directing regression testing to the commit level offers a way to locate root causes, yet it hinders the development process. This thesis outlines motivations and solutions for locating the root causes of performance regressions. First, we challenge a deterministic state-of-the-art approach by expanding the testing data to find areas for improvement. The deterministic approach was found to be limited in searching for the best regression-locating rule. Thus, we present two stochastic approaches to develop models that can learn from historical commits. The first stochastic approach views the research problem as a search-based optimization problem seeking the highest detection rate; we apply different multi-objective evolutionary algorithms and compare them. This thesis also investigates whether simplifying the search space by combining objectives achieves comparable results. The second stochastic approach addresses the severe class imbalance any system may exhibit, since code changes introducing regressions are rare but costly. We formulate the identification of problematic commits that introduce performance regressions as a binary classification problem that handles class imbalance. Further, the thesis provides an exploratory study on the challenges developers face in resolving performance regressions. The study is based on questions posted on a technical forum dedicated to performance regression. We collected around 2k questions discussing regressions in software execution time, all of which were manually analyzed. The study resulted in a categorization of the challenges. We also discuss the difficulty level of performance regression issues within the development community. This study provides insights that help developers avoid regression causes during software design and implementation.
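    The second stochastic approach, framing regression-introducing commits as an imbalanced binary classification problem, can be sketched as follows. The features, their distribution, and the classifier are assumptions made for illustration; the thesis's actual models and data differ.

```python
# Sketch of imbalanced binary classification of regression-introducing commits.
# Features and labels are synthetic; class_weight="balanced" counters the
# rarity of the positive (regression) class.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 2000
# Hypothetical commit features: e.g., lines added, lines deleted, files touched.
X = rng.normal(size=(n, 3))
# Synthetic labels weakly tied to the first feature; ~10% positives (imbalanced).
y = ((X[:, 0] + rng.normal(scale=2, size=n)) > 3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# "balanced" reweights classes inversely to their frequency during training.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```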

    Actor Concurrency Bugs: A Comprehensive Study on Symptoms, Root Causes, API Usages, and Differences

    Actor concurrency is becoming increasingly important in the development of real-world software systems. Although actor concurrency may be less susceptible to some multithreaded concurrency bugs, such as low-level data races and deadlocks, it comes with its own bugs, which may be different. However, the fundamental characteristics of actor concurrency bugs, including their symptoms, root causes, API usages, examples, and the differences between bugs from different sources, are still largely unknown. Actor software development can significantly benefit from a comprehensive qualitative and quantitative understanding of these characteristics, which is the focus of this work, to foster better API documentation, development practices, testing, debugging, repairing, and verification frameworks. To conduct this study, we take the following major steps. First, we construct a set of 184 real-world Stack Overflow and GitHub actor bugs by manually analyzing 3,924 Stack Overflow questions, answers, and comments and 3,315 GitHub commits, messages, original and modified code snippets, issues, and pull requests. Second, we manually study these actor bugs and their fixes to understand and classify their symptoms, root causes, and API usages. Third, we study the differences between the commonalities and distributions of symptoms, root causes, and API usages of Stack Overflow and GitHub actor bugs. Fourth, we discuss real-world examples of bugs with these root causes and symptoms. Finally, we investigate the relation of our findings to those of previous work and discuss the implications of our findings using the anecdotal evidence of our actor bug examples. A few findings of our study are: (1) symptoms of actor bugs can be classified into 5 categories, with Error and Incorrect Exceptions being the most and least common symptoms; (2) root causes of actor bugs can be classified into 10 categories, with Logic and Untyped Communication being the most and least common root causes; (3) a small number of API packages are responsible for most API usages by actor bugs; (4) Stack Overflow and GitHub actor bugs can differ significantly in the commonality and distribution of symptoms, root causes, and API usages; (5) actor developers may need help not only with complex, unknown, or semantic bugs in development code, but also with simple, well-known, well-documented, or syntactic bugs in test code. While some of our findings agree with those of previous work, others stand in sharp contrast.
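    The third step, comparing distributions of symptoms and root causes between Stack Overflow and GitHub, is the kind of question a chi-squared test of independence can answer. The abstract does not state which statistical procedure the authors used, and the counts below are invented, so treat this purely as an illustration.

```python
# Illustrative comparison of root-cause distributions across two bug sources
# using a chi-squared test of independence (the paper's actual procedure is
# not stated in the abstract; all counts are hypothetical).
from scipy.stats import chi2_contingency

# Rows: bug source; columns: hypothetical root-cause categories.
#         Logic  Communication  Lifecycle
table = [
    [40,   12,            8],   # Stack Overflow
    [25,   30,           15],   # GitHub
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
# A small p-value would suggest the root-cause distribution differs by source.
```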

    Towards an Improved Understanding of Software Vulnerability Assessment Using Data-Driven Approaches

    Software Vulnerabilities (SVs) can expose software systems to cyber-attacks, potentially causing enormous financial and reputational damage to organizations. There have been significant research efforts to detect these SVs so that developers can promptly fix them. However, fixing SVs is complex and time-consuming in practice, and thus developers usually do not have sufficient time and resources to fix all SVs at once. As a result, developers often need SV information, such as exploitability, impact, and overall severity, to prioritize fixing the more critical SVs. Such information, required for fix planning and prioritization, is typically provided in the SV assessment step of the SV lifecycle. Recently, data-driven methods have been increasingly proposed to automate SV assessment tasks. However, there are still numerous shortcomings in the existing studies on data-driven SV assessment that hinder their application in practice. This PhD thesis aims to contribute to the growing literature on data-driven SV assessment by investigating and addressing the constant changes in SV data, as well as the lack of consideration of source code and developers’ needs for SV assessment, both of which impede the practical applicability of the field. In particular, we make the following five contributions in this thesis. (1) We systematize the knowledge of data-driven SV assessment to reveal the best practices of the field and the main challenges affecting its application in practice. Subsequently, we propose various solutions to tackle these challenges to better support the real-world applications of data-driven SV assessment. (2) We first demonstrate the existence of the concept drift (changing data) issue in the descriptions of SV reports, which current studies have mostly used for predicting the Common Vulnerability Scoring System (CVSS) metrics. We augment report-level SV assessment models with subwords of terms extracted from SV descriptions to help the models more effectively capture the semantics of ever-increasing SVs. (3) We also identify that SV reports are usually released after SV fixing. Thus, we propose using vulnerable code to enable earlier SV assessment without waiting for SV reports. We are the first to use Machine Learning techniques to predict CVSS metrics at the function level, leveraging the vulnerable statements directly causing SVs and their context in code functions. The performance of our function-level SV assessment models is promising, opening up research opportunities in this new direction. (4) To support today's continuous integration of software code, we present a novel deep multi-task learning model, DeepCVA, to simultaneously and efficiently predict multiple CVSS assessment metrics at the commit level, specifically using vulnerability-contributing commits. DeepCVA is the first work that enables practitioners to perform SV assessment as soon as vulnerable changes are added to a codebase, supporting just-in-time prioritization of SV fixing. (5) Besides code artifacts produced from a software project of interest, SV assessment tasks can also benefit from SV crowdsourcing information on developer Question and Answer (Q&A) websites. We automatically retrieve large-scale security/SV-related posts from these Q&A websites. We then apply a topic modeling technique to these posts to distill developers’ real-world SV concerns that can be used for data-driven SV assessment.
    Overall, we believe that this thesis has provided evidence-based knowledge and useful guidelines for researchers and practitioners to automate SV assessment using data-driven approaches.
    Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
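    Contribution (4), DeepCVA, is described as a deep multi-task model predicting several CVSS metrics at once from vulnerability-contributing commits. The sketch below shows the general shape of such a model in PyTorch: a shared encoder feeding one classification head per CVSS metric. The feature size, head set, class counts, and architecture are all assumptions; the real DeepCVA differs.

```python
# Minimal multi-task sketch in the spirit of DeepCVA (all details assumed):
# a shared encoder over commit features with one head per CVSS metric.
import torch
import torch.nn as nn

N_FEATURES = 512  # e.g., bag-of-tokens over a commit diff (hypothetical)
CVSS_HEADS = {"confidentiality": 3, "integrity": 3, "availability": 3}  # assumed class counts

class MultiTaskCVSS(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(N_FEATURES, 128), nn.ReLU())
        self.heads = nn.ModuleDict(
            {name: nn.Linear(128, k) for name, k in CVSS_HEADS.items()}
        )

    def forward(self, x):
        shared = self.encoder(x)           # one representation shared by all tasks
        return {name: head(shared) for name, head in self.heads.items()}

model = MultiTaskCVSS()
x = torch.randn(4, N_FEATURES)                                   # batch of 4 commits
targets = {name: torch.randint(0, 3, (4,)) for name in CVSS_HEADS}  # dummy labels

# Multi-task training signal: sum of per-head cross-entropy losses.
loss = sum(nn.functional.cross_entropy(logits, targets[name])
           for name, logits in model(x).items())
loss.backward()
print(f"combined loss: {loss.item():.3f}")
```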