59 research outputs found

    Influence, information and team outcomes in large scale software development

    Get PDF

    Predicting Good Configurations for GitHub and Stack Overflow Topic Models

    Full text link
    Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories.Comment: to appear as full paper at MSR 2019, the 16th International Conference on Mining Software Repositorie

    On negative results when using sentiment analysis tools for software engineering research

    Get PDF
    Recent years have seen an increasing attention to social aspects of software engineering, including studies of emotions and sentiments experienced and expressed by the software developers. Most of these studies reuse existing sentiment analysis tools such as SentiStrength and NLTK. However, these tools have been trained on product reviews and movie reviews and, therefore, their results might not be applicable in the software engineering domain. In this paper we study whether the sentiment analysis tools agree with the sentiment recognized by human evaluators (as reported in an earlier study) as well as with each other. Furthermore, we evaluate the impact of the choice of a sentiment analysis tool on software engineering studies by conducting a simple study of differences in issue resolution times for positive, negative and neutral texts. We repeat the study for seven datasets (issue trackers and Stack Overflow questions) and different sentiment analysis tools and observe that the disagreement between the tools can lead to diverging conclusions. Finally, we perform two replications of previously published studies and observe that the results of those studies cannot be confirmed when a different sentiment analysis tool is used

    Open source software ecosystems : a systematic mapping

    Get PDF
    Context: Open source software (OSS) and software ecosystems (SECOs) are two consolidated research areas in software engineering. OSS influences the way organizations develop, acquire, use and commercialize software. SECOs have emerged as a paradigm to understand dynamics and heterogeneity in collaborative software development. For this reason, SECOs appear as a valid instrument to analyze OSS systems. However, there are few studies that blend both topics together. Objective: The purpose of this study is to evaluate the current state of the art in OSS ecosystems (OSSECOs) research, specifically: (a) what the most relevant definitions related to OSSECOs are; (b) what the particularities of this type of SECO are; and (c) how the knowledge about OSSECO is represented. Method: We conducted a systematic mapping following recommended practices. We applied automatic and manual searches on different sources and used a rigorous method to elicit the keywords from the research questions and selection criteria to retrieve the final papers. As a result, 82 papers were selected and evaluated. Threats to validity were identified and mitigated whenever possible. Results: The analysis allowed us to answer the research questions. Most notably, we did the following: (a) identified 64 terms related to the OSSECO and arranged them into a taxonomy; (b) built a genealogical tree to understand the genesis of the OSSECO term from related definitions; (c) analyzed the available definitions of SECO in the context of OSS; and (d) classified the existing modelling and analysis techniques of OSSECOs. Conclusion: As a summary of the systematic mapping, we conclude that existing research on several topics related to OSSECOs is still scarce (e.g., modelling and analysis techniques, quality models, standard definitions, etc.). This situation calls for further investigation efforts on how organizations and OSS communities actually understand OSSECOs.Peer ReviewedPostprint (author's final draft

    Social aspects of collaboration in online software communities

    Get PDF

    Assessing the Quality of the Steps to Reproduce in Bug Reports

    Full text link
    A major problem with user-written bug reports, indicated by developers and documented by researchers, is the (lack of high) quality of the reported steps to reproduce the bugs. Low-quality steps to reproduce lead to excessive manual effort spent on bug triage and resolution. This paper proposes Euler, an approach that automatically identifies and assesses the quality of the steps to reproduce in a bug report, providing feedback to the reporters, which they can use to improve the bug report. The feedback provided by Euler was assessed by external evaluators and the results indicate that Euler correctly identified 98% of the existing steps to reproduce and 58% of the missing ones, while 73% of its quality annotations are correct.Comment: In Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '19), August 26-30, 2019, Tallinn, Estoni

    On The Relationship Between The Vocabulary Of Bug Reports And Source Code

    Get PDF
    The use of text retrieval techniques on concept location and bug localization yields remarkable benefits. The artifacts found in source code and bug reports contain important information related to the bug localization process. When locating the bugs, it is a programmer\u27s task to formulate effective queries such that most of the predicted terms in the query appear in the relevant defect code, but not in most of the non-relevant source files. These queries are built based on the textual content found in the bug reports, especially the bug title and the description. A large body of research uses bug descriptions to evaluate bug localization techniques using text retrieval. All these studies are conducted under the implicit assumption that the bug description and the relevant source code files share important terms. This paper presents an empirical study that explores this conjecture. We found that bug reports share more terms with the patched classes than with the other classes in the software system. Moreover, the study revealed that the class names are more likely to share terms with the bug descriptions than other code locations. We also found that more verbose parts of the source code, such as, comments share more words. Furthermore, we discovered that the shared terms may be better predictors for bug localization than some other text retrieval techniques, such as, LSI

    Identifying reusable knowledge in developer instant messaging communication.

    Get PDF
    Context and background: Software engineering is a complex and knowledge-intensive activity. Required knowledge (e.g., about technologies, frameworks, and design decisions) changes fast and the knowledge needs of those who design, code, test and maintain software constantly evolve. On the other hand, software developers use a wide range of processes, practices and tools where developers explicitly and implicitly “produce” and capture different types of knowledge. Problem: Software developers use instant messaging tools (e.g., Slack, Microsoft Teams and Gitter) to discuss development-related problems, share experiences and to collaborate in projects. This communication takes place in chat rooms that accumulate potentially relevant knowledge to be reused by other developers. Therefore, in this research we analyze whether there is reusable knowledge in developer instant messaging communication by exploring (a) which instant messaging platforms can be a source of reusable knowledge, and (b) software engineering themes that represent the main discussions of developers in instant messaging communication. We also analyze how this reusable knowledge can be identified with the use of topic modeling (a natural language processing technique to discover abstract topics in text) by (c) surveying the literature on how topic modeling has been applied in software engineering research, and (d) evaluating how topic models perform with developer instant messages. Method: First, we conducted a Field Study through an exploratory case study and a reflexive thematic analysis to check whether there is reusable knowledge in developer instant messaging communication, and if so, what this knowledge (main themes discussed) is. Then, we conducted a Sample Study to explore how reusable knowledge in developer instant messaging communication can we identified. In this study, we applied a literature survey and software repository mining (i.e. short text topic modeling). Findings and contributions: We (a) developed a comparison framework for instant messaging tools, (b) identified a map of the main themes discussed in chat rooms of an instant messaging tool (Gitter, a platform used by software developers), (c) provided a comprehensive literature review that offers insights and references on the use of topic modeling in software engineering, and (d) provided an evaluation of the performance of topic models applied to developer instant messages based on topic coherence metrics and human judgment for topic quality

    Measuring and tracking quality factors in Free and Open Source Software projects

    Get PDF
    Free and Open Source Software (FOSS) has gained increased interest in the computer software industry, but assessing its quality remains a challenge. FOSS development is frequently carried out by globally distributed development teams, and all stages of development are publicly visible. Several product and process-level quality factors can be measured using the public data. This thesis presents a theoretical background for software quality and metrics and their application in a FOSS environment. Information available from FOSS projects in three information spaces are presented, and a quality model suitable for use in a FOSS context is constructed. The model includes both process and product quality metrics, and takes into account the tools and working methods commonly used in FOSS projects. A subset of the constructed quality model is applied to three FOSS projects, highlighting both theoretical and practical concerns in implementing automatic metric collection and analysis. The experiment shows that useful quality information can be extracted from the vast amount of data available. In particular, projects vary in their growth rate, complexity, modularity and team structure.Intresset för fri och öppen programvara (Free and Open Source Software, FOSS) har ökat inom programvaruindustrin, men dess kvalitetsmätning är alltjämt en utmaning. FOSS-utveckling utförs ofta av globalt utspridda utvecklingsteam, och alla skeden av utvecklingen är synliga för allmänheten. Flera kvalitetsfaktorer på produkt- och processnivå kan mätas med hjälp av allmänt tillgängliga data. Denna avhandling presenterar en teoretisk bakgrund för programvarukvalitet och -metrik och deras användning i en FOSS-miljö. Avhandlingen presenterar information som kan samlas från tre informationsområden i FOSS-projekt, och konstruerar en kvalitetsmodell som är lämplig för användning i FOSS-sammanhang. Modellen innehåller mätare för både process- och produktkvalitetsfaktorer, och beaktar de verktyg och arbetsmetoder som ofta används i FOSS-projekt. En delmängd av den konstruerade kvalitetsmodellen appliceras på tre FOSS-projekt, vilket belyser både teoretiska och praktiska problemområden för implementering av automatisk metrikinsamling och -analys. Experimentet visar att det är möjligt att utvinna användbar information ur den väldiga datamängd som finns tillgänglig. Särskilt kan nämnas att projekten varierar i tillväxthastighet, komplexitet, modularitet och gruppstruktur
    corecore