5 research outputs found

    Python Coding Style Compliance on Stack Overflow

    Software developers all over the world use Stack Overflow (SO) to interact and exchange code snippets. Researchers also use SO to harvest code snippets for use with recommendation systems. However, previous work has shown that code on SO may have quality issues, such as security or license problems. We analyse Python code on SO to determine its coding style compliance. From 1,962,535 code snippets tagged with 'python', we extracted 407,097 snippets of at least 6 statements of Python code. Surprisingly, 93.87% of the extracted snippets contain style violations, with an average of 0.7 violations per statement and many snippets with a considerably higher ratio. Researchers and developers should, therefore, be aware that code snippets on SO may not be representative of good coding style. Furthermore, while user reputation seems to be unrelated to coding style compliance, for posts with vote scores between -10 and 20 we found a strong correlation (r = -0.87, p < 10^-7) between the vote score a post received and the average number of violations per statement for snippets in such posts.
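The "violations per statement" ratio in the abstract above can be illustrated with a minimal sketch. It assumes two simple line-level checks (line length and trailing whitespace) as a stand-in for a real style checker; the abstract does not say which tool or rule set the authors used, and all function names here are hypothetical.

```python
import ast

MAX_LINE_LENGTH = 79  # PEP 8 default; an assumption, not the study's setting


def count_statements(source: str) -> int:
    """Count statement nodes in the parsed source."""
    tree = ast.parse(source)
    return sum(isinstance(node, ast.stmt) for node in ast.walk(tree))


def count_violations(source: str) -> int:
    """Count two simple line-level problems (stand-in for a real checker)."""
    violations = 0
    for line in source.splitlines():
        if len(line) > MAX_LINE_LENGTH:
            violations += 1  # line too long
        if line != line.rstrip():
            violations += 1  # trailing whitespace
    return violations


def violations_per_statement(source: str) -> float:
    """Return violations divided by statements, or 0.0 for empty code."""
    statements = count_statements(source)
    if statements == 0:
        return 0.0
    return count_violations(source) / statements


# Hypothetical snippet: the first line has trailing whitespace.
snippet = "x = 1 \ny = 2\nprint(x + y)\n"
ratio = violations_per_statement(snippet)
```

A real analysis would swap `count_violations` for the output of a full style checker; the ratio itself is computed the same way.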

    A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects

    Background: Meeting the growing industry demand for Data Science requires cross-disciplinary teams that can translate machine learning research into production-ready code. Software engineering teams value adherence to coding standards as an indication of code readability, maintainability, and developer expertise. However, there are no large-scale empirical studies of coding standards focused specifically on Data Science projects. Aims: This study investigates the extent to which Data Science projects follow coding standards. In particular, which standards are followed, which are ignored, and how does this differ from traditional software projects? Method: We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity. Results: Data Science projects suffer from a significantly higher rate of functions that use an excessive number of parameters and local variables. Data Science projects also follow different variable naming conventions from non-Data Science projects. Conclusions: The differences indicate that Data Science codebases are distinct from traditional software codebases and do not follow traditional software engineering conventions. Our conjecture is that this may be because traditional software engineering conventions are inappropriate in the context of Data Science projects. Comment: 11 pages, 7 figures. To appear in ESEM 2020. Updated based on peer review.
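The kind of check behind "an excessive number of parameters and local variables" can be sketched with Python's standard `ast` module. This is only an illustration: the threshold values and the function names are assumptions, and the abstract does not state which linter or limits the study applied.

```python
import ast

MAX_PARAMETERS = 5  # hypothetical limit
MAX_LOCALS = 15     # hypothetical limit


def parameter_count(func) -> int:
    """Count all declared parameters, including *args and **kwargs."""
    args = func.args
    count = len(args.posonlyargs) + len(args.args) + len(args.kwonlyargs)
    if args.vararg is not None:
        count += 1
    if args.kwarg is not None:
        count += 1
    return count


def local_variable_count(func) -> int:
    """Count distinct names assigned in the function body.

    Simplification: names assigned in nested functions are counted too.
    """
    names = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.add(node.id)
    return len(names)


def report(source: str) -> dict:
    """Map each function name to (parameter count, local variable count)."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            result[node.name] = (parameter_count(node), local_variable_count(node))
    return result


# Hypothetical function in the style of a training script.
example = '''
def train(data, labels, lr, epochs, batch_size, seed):
    model = None
    loss = 0.0
    for epoch in range(epochs):
        loss = loss + lr
    return model
'''
counts = report(example)
too_many_params = [name for name, (p, _) in counts.items() if p > MAX_PARAMETERS]
```

Here `train` would be flagged for its parameter count but not for its local variables under the assumed limits.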

    Analysis of the Impact of Tags on Stack Overflow Questions

    User queries on Stack Overflow commonly suffer from either inadequate length or inadequate clarity with regard to the languages and/or tools they are meant for. Although the site makes use of a tagging system for classifying questions, tags are used minimally (if at all). To investigate the impact of tags on the quality of results returned by the queries, in this research we propose a new query expansion solution. Our technique assigns tags to queries based on how well they match the queries' topics. We evaluated our technique on eight sets of queries categorized by overall length and programming language. We examined the retrieval results by adding varying numbers of tags to the queries, and monitored the recall and precision rates. Our results indicate that queries yield considerably higher recall and precision rates with extra tags than without. We further conclude that tags are a particularly effective means of enhancement when the original queries do not return sufficient results to begin with.
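A rough sketch of tag-based query expansion and of the precision and recall measures mentioned above follows. The keyword-overlap scoring, the tag vocabulary, and the document identifiers are all hypothetical; the abstract does not describe how the authors match tags to query topics.

```python
def assign_tags(query: str, tag_vocabulary: dict, max_tags: int) -> list:
    """Rank tags by keyword overlap with the query (assumed scoring rule)."""
    words = set(query.lower().split())
    scored = []
    for tag, keywords in tag_vocabulary.items():
        overlap = len(words & keywords)
        if overlap > 0:
            scored.append((overlap, tag))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [tag for _, tag in scored[:max_tags]]


def precision_recall(retrieved: list, relevant: list) -> tuple:
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical vocabulary mapping each tag to indicative keywords.
vocabulary = {
    "python": {"python", "pandas", "list"},
    "java": {"java", "jvm"},
    "sorting": {"sort", "order"},
}
tags = assign_tags("sort a list in python", vocabulary, max_tags=2)

# Hypothetical retrieval result compared against known relevant documents.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Comparing `precision_recall` for the same query with and without the assigned tags mirrors the evaluation the abstract describes.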

    Automated recommendation, reuse, and generation of unit tests for software systems

    This thesis presents a body of work relating to the automated discovery, reuse, and generation of unit tests for software systems with the goal of improving the efficiency of the software engineering process and the quality of the produced software. We start with a novel approach to test-to-code traceability link establishment, called TCTracer, which utilises multilevel information and an ensemble of static and dynamic techniques to achieve state-of-the-art accuracy when establishing links both between tests and tested functions and between test classes and tested classes. This approach is utilised to provide test-to-code traceability links which facilitate multiple other parts of the work. We then move on to test reuse, where we first define an abstract framework, called Rashid, for using connections between artefacts to identify new artefacts for reuse, and utilise this framework in Relatest, an approach for producing test recommendations for new functions. Relatest instantiates Rashid by using TCTracer to establish connections between tests and functions, and code similarity measures to establish connections between similar functions. This information is used to create lists of recommendations for new functions. We then present an investigation into the automated transplantation of tests, which attempts to remove the manual effort required to transform Relatest recommendations and insert them into another project. Finally, we move on to test generation, where we utilise neural networks to generate unit test code by learning from existing function-to-test pairs. The first approach, TestNMT, investigates using recurrent neural networks to generate whole JUnit tests, and the second approach, ReAssert, utilises a transformer-based architecture to generate JUnit asserts.
    In total, this thesis addresses the problem by developing approaches for the discovery, reuse, and utilisation of existing functions and tests, including the establishment of relationships between these artefacts, developing mechanisms to aid automated test reuse, and learning from existing tests to generate new tests.
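The recommendation idea described for Relatest (link tests to functions, then reuse the tests of similar functions) can be sketched as below. This is not the thesis's implementation: `difflib.SequenceMatcher` is swapped in as a simple textual similarity measure, and the function sources, test names, and traceability links are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Textual similarity in [0, 1]; a stand-in for a code similarity measure."""
    return SequenceMatcher(None, a, b).ratio()


def recommend_tests(new_function: str, known_functions: dict,
                    traceability_links: dict, top_n: int = 3) -> list:
    """Recommend tests linked to the functions most similar to new_function."""
    ranked = sorted(
        known_functions.items(),
        key=lambda item: similarity(new_function, item[1]),
        reverse=True,
    )
    recommendations = []
    for name, _ in ranked:
        for test in traceability_links.get(name, []):
            if test not in recommendations:
                recommendations.append(test)
    return recommendations[:top_n]


# Hypothetical corpus: function sources and their linked tests.
known = {
    "add": "def add(a, b):\n    return a + b",
    "read_file": (
        "def read_file(path):\n"
        "    with open(path) as handle:\n"
        "        return handle.read()"
    ),
}
links = {
    "add": ["test_add_positive", "test_add_negative"],
    "read_file": ["test_read_file"],
}
new_source = "def add_numbers(x, y):\n    return x + y"
suggested = recommend_tests(new_source, known, links, top_n=2)
```

In this toy setup the tests linked to `add` rank first because its source is textually closest to the new function.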