
    Ten Simple Rules for Taking Advantage of Git and GitHub.

    Bioinformatics is a broad discipline in which one common denominator is the need to produce and/or use software that can be applied to biological data in different contexts. To enable and ensure the replicability and traceability of scientific claims, it is essential that the scientific publication, the corresponding datasets, and the data analysis are made publicly available [1,2]. All software used for the analysis should be either carefully documented (e.g., for commercial software) or, better yet, openly shared and directly accessible to others [3,4]. The rise of openly available software and source code alongside concomitant collaborative development is facilitated by the existence of several code repository services such as SourceForge, Bitbucket, GitLab, and GitHub, among others. These resources are also essential for collaborative software projects because they enable the organization and sharing of programming tasks between different remote contributors. Here, we introduce the main features of GitHub, a popular web-based platform that offers a free and integrated environment for hosting the source code, documentation, and project-related web content for open-source projects. GitHub also offers paid plans for private repositories (see Box 1) for individuals and businesses as well as free plans including private repositories for research and educational use. Funder: Biotechnology and Biological Sciences Research Council. This is the final version of the article; it first appeared from Public Library of Science via https://doi.org/10.1371/journal.pcbi.1004947.

    Ten Simple Rules for Reproducible Research in Jupyter Notebooks

    Reproducibility of computational studies is a hallmark of scientific methodology. It enables researchers to build with confidence on the methods and findings of others, reuse and extend computational pipelines, and thereby drive scientific progress. Since many experimental studies rely on computational analyses, biologists need guidance on how to set up and document reproducible data analyses or simulations. In this paper, we address several questions about reproducibility. For example, what are the technical and non-technical barriers to reproducible computational studies? What opportunities and challenges do computational notebooks offer to overcome some of these barriers? What tools are available and how can they be used effectively? We have developed a set of rules to serve as a guide to scientists with a specific focus on computational notebook systems, such as Jupyter Notebooks, which have become a tool of choice for many applications. Notebooks combine detailed workflows with narrative text and visualization of results. Combined with software repositories and open source licensing, notebooks are powerful tools for transparent, collaborative, reproducible, and reusable data analyses.

    RefDiff: Detecting Refactorings in Version Histories

    Refactoring is a well-known technique that is widely adopted by software engineers to improve the design and enable the evolution of a system. Knowing which refactoring operations were applied in a code change is valuable information for understanding software evolution, adapting software components, merging code changes, and other applications. In this paper, we present RefDiff, an automated approach that identifies refactorings performed between two code revisions in a git repository. RefDiff employs a combination of heuristics based on static analysis and code similarity to detect 13 well-known refactoring types. In an evaluation using an oracle of 448 known refactoring operations, distributed across seven Java projects, our approach achieved a precision of 100% and a recall of 88%. Moreover, our evaluation suggests that RefDiff achieves higher precision and recall than existing state-of-the-art approaches. Comment: Paper accepted at the 14th International Conference on Mining Software Repositories (MSR), pages 1-11, 201
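    To make the similarity-based idea concrete, the Python sketch below shows, under simplified assumptions, how a candidate "Rename Method" refactoring could be flagged by comparing method bodies across two revisions. It illustrates the general approach only, not RefDiff's actual implementation: the function names, the dictionary-based inputs, the difflib-based similarity measure, and the 0.9 threshold are all invented for the example.

# Illustrative sketch only, not RefDiff's implementation: flag a candidate
# "Rename Method" refactoring by comparing method bodies from two revisions
# with a token-based similarity score (difflib stands in for RefDiff's own
# static-analysis and similarity heuristics; the 0.9 threshold is invented).
from difflib import SequenceMatcher

def similarity(body_a: str, body_b: str) -> float:
    """Return a 0..1 similarity score between two method bodies."""
    return SequenceMatcher(None, body_a.split(), body_b.split()).ratio()

def detect_renames(before: dict, after: dict, threshold: float = 0.9):
    """before/after map method names to source bodies in the two revisions.
    A method that disappears and reappears under a new name with a highly
    similar body is reported as a candidate rename."""
    removed = {n: b for n, b in before.items() if n not in after}
    added = {n: b for n, b in after.items() if n not in before}
    return [(old, new)
            for old, old_body in removed.items()
            for new, new_body in added.items()
            if similarity(old_body, new_body) >= threshold]

# Example: 'total' is renamed to 'sum_values' with an unchanged body.
v1 = {"total": "s = 0\nfor x in xs: s += x\nreturn s"}
v2 = {"sum_values": "s = 0\nfor x in xs: s += x\nreturn s"}
print(detect_renames(v1, v2))  # [('total', 'sum_values')]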

    GIT Profiling

    This dissertation was written with the objective of creating and objectively defining software developer profiles. To support the proposed profiles, data was extracted and transformed from GIT repositories by an automated process. This automation was achieved with an application that runs a combination of commands from GIT and Git Quick Stats and exposes the transformed output of these commands via an Application Programming Interface (API). A client application was also developed to aid in the validation of the profiles and to provide a dashboard-like User Interface (UI). The client application queries a Representational State Transfer (REST) service endpoint to get the available information from the server and runs a keyword-matching algorithm that counts the hits for a given profile. The keywords serve as a dictionary of terms that can be found in commit messages, file names, or comments. The final dashboard represents the repository and profile information while also providing a way to compare multiple repositories. The profiles are also presented as trends, since a developer has more than one type of contribution. To increase confidence in the results of this automated process, manual checks were made to ensure that the right conclusions were reached regarding the profile definitions. The design and architecture of the applications developed follow a traditional client-and-server approach, which allowed for the separation of responsibilities described above. To validate that the application was behaving correctly, metrics on execution times and memory consumption were collected. Limitations of the developed work are also described, since there are external systems that are usually used in conjunction with GIT repositories and that contain information which could be used to increase the accuracy of the profiles. On a more technical level, some improvements to the overall architecture are also suggested that could enhance the final experience. Finally, future work is outlined, including relating profiles to developer seniority or expanding the profiles by integrating external systems that contain more information.

    This dissertation was written with the objective of creating and objectively defining software developer profiles. To support the claims about the proposed profiles, data was collected from GIT repositories. For this purpose, an algorithm that extracts and categorizes information automatically was developed. This automation was achieved by creating an application that executes a combination of GIT commands and an external library called Git Quick Stats; this library allows human-readable information to be obtained quickly and efficiently. The application runs over a number of repositories of different sizes and handles all the heavy operations involved in extracting and transforming the data. The transformed data is made available to the client through an API. For visualizing and categorizing the data extracted from a given repository, a client application was developed to help validate the profiles and display them on a simple dashboard. This client application makes requests to a REST service in order to obtain the transformed data. Once this data is obtained, the application runs a keyword-matching algorithm and records the number of times a word from a given profile is found. The keywords serve as a dictionary of terms that can be found in commit messages, file names, and comments and that are expected to occur when a given profile is being assessed. The number of profiles detected and their accuracy depend on the volume and quality of this dictionary of terms; consequently, the dictionary was revised throughout the writing of this dissertation. The final dashboard developed for this dissertation supports readable visualization of repository data, representation of profiles, and comparison of several repositories. Information from GIT metrics, such as the number of commits, number of contributors, and number of files, can all be visualized. Alongside this information, the distribution of all profiles in the repository can be inspected in a radar-style chart. These profiles can also be viewed per developer, showing the profiles assigned to each one. A developer may have more than one profile, since their contributions can be of several different types; consequently, a division into primary and secondary profiles was created. The primary profile is identified by comparing which of the assigned profiles has the highest score, and the remaining ones are considered secondary. To increase confidence in the results obtained by the automated process, manual checks were performed to ensure that the conclusions drawn about the profiles are valid. The final proposal for these profiles also took into account the most common roles within software projects and teams. The design and architecture of the applications follow a traditional client-and-server model that allows a clear separation of the responsibilities described above. To validate the behaviour of these applications, data and metrics on execution times and memory consumption were collected; this was done because the objectives of the dissertation were the definition of the profiles and the code related to obtaining them. The limitations of the developed work are described, given that there are external systems that are usually used in conjunction with GIT repositories; these systems contain additional information that could be used to increase the accuracy of the detected profiles or even help create new ones. Changes and improvements to the developed applications are also suggested. Finally, future work on this topic is discussed: the concepts in this dissertation could be extended to relate profiles to a developer's level of seniority, or to integrate external systems to increase the amount of available data.
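    As a rough illustration of the keyword-matching step described above, the Python sketch below extracts commit messages from a local Git repository with `git log` and counts hits against a small profile dictionary. The profile names, keyword lists, and repository path are hypothetical examples; the dissertation's actual pipeline and dictionary are not reproduced here.

# Illustrative sketch only, not the dissertation's code: count keyword hits
# per profile in the commit messages of a local Git repository. The profile
# names, keyword lists, and repository path are hypothetical examples.
import subprocess
from collections import Counter

PROFILE_KEYWORDS = {
    "tester": ["test", "coverage", "assert"],
    "front-end": ["css", "layout", "component"],
    "devops": ["pipeline", "docker", "deploy"],
}

def commit_messages(repo_path: str) -> list[str]:
    """Return the subject and body of every commit in the repository."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%s %b"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.lower().splitlines()

def profile_hits(repo_path: str) -> Counter:
    """Count how often each profile's keywords occur in commit messages."""
    hits = Counter()
    for message in commit_messages(repo_path):
        for profile, keywords in PROFILE_KEYWORDS.items():
            hits[profile] += sum(message.count(word) for word in keywords)
    return hits

if __name__ == "__main__":
    print(profile_hits("."))  # e.g. Counter({'tester': 42, 'devops': 7})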

    Ten simple rules for teaching sustainable software engineering

    Computational methods and associated software implementations are central to every field of scientific investigation. Modern biological research, particularly within systems biology, has relied heavily on the development of software tools to process and organize increasingly large datasets, simulate complex mechanistic models, provide tools for the analysis and management of data, and visualize and organize outputs. However, developing high-quality research software requires scientists to develop a host of software development skills, and teaching these skills to students is challenging. Growing importance has been placed on ensuring reproducibility and good development practices in computational research. However, less attention has been devoted to the specific teaching strategies that are effective at nurturing in researchers the complex skillset required to produce high-quality software that, increasingly, is required to underpin both academic and industrial biomedical research. Recent articles in the Ten Simple Rules collection have discussed the teaching of foundational computer science and coding techniques to biology students. We advance this discussion by describing the specific steps for effectively teaching the skills scientists need to develop sustainable software packages which are fit for (re-)use in academic research or more widely. Although our advice is likely to be applicable to all students and researchers hoping to improve their software development skills, our guidelines are directed towards an audience of students that have some programming literacy but little formal training in software development or engineering, typical of early doctoral students. These practices are also applicable outside of doctoral training environments, and we believe they should form a key part of postgraduate training schemes more generally in the life sciences. Comment: Prepared for submission to PLOS Computational Biology's 10 Simple Rules collection

    ATLaS: Assistant Software for Life Scientists to Use in Calculations of Buffer Solutions

    Many solutions, such as percentage, molar, and buffer solutions, are used in virtually all experiments conducted in life science laboratories. Although preparing these solutions is not difficult, miscalculations made during intensive laboratory work can negatively affect experimental results; for experiments to work correctly, the solutions must be prepared exactly as specified. In this project, a software tool, ATLaS (Assistant Toolkit for Laboratory Solutions), was developed to eliminate solution errors arising from calculations. ATLaS was written in the Python programming language using the Tkinter and Pandas libraries. It contains five main modules: (1) Percent Solutions, (2) Molar Solutions, (3) Acid-Base Solutions, (4) Buffer Solutions, and (5) Unit Converter. Each main module contains its own sub-functions. With PyInstaller, the software was converted into a stand-alone executable file. The source code of ATLaS is available at https://github.com/cugur1978/ATLaS
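    As an illustration of the kind of calculation such a module automates (this sketch is not taken from the ATLaS source; the function name and example values are invented), the Python snippet below computes the mass of solute needed for a molar solution from mass = molarity (mol/L) x volume (L) x molar mass (g/mol).

# Illustrative sketch, not taken from the ATLaS source: grams of solute
# needed for a molar solution, using mass = molarity (mol/L) * volume (L)
# * molar mass (g/mol).
def grams_for_molar_solution(molarity_mol_per_L: float, volume_mL: float,
                             molar_mass_g_per_mol: float) -> float:
    """Return the mass of solute (g) required for the requested solution."""
    volume_L = volume_mL / 1000.0
    return molarity_mol_per_L * volume_L * molar_mass_g_per_mol

# Example: 500 mL of 0.15 M NaCl (molar mass ~58.44 g/mol) needs ~4.38 g.
print(round(grams_for_molar_solution(0.15, 500, 58.44), 2))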

    The importance of good coding practices for data scientists

    Many data science students and practitioners are reluctant to adopt good coding practices as long as the code "works". However, code standards are an important part of modern data science practice, and they play an essential role in the development of "data acumen". Good coding practices lead to more reliable code and often save more time than they cost, making them important even for beginners. We believe that principled coding practices are vital for statistics and data science. To instill these practices within academic programs, it is important for instructors and programs to begin establishing them early, to reinforce them often, and to hold themselves to a higher standard while guiding students. We describe key aspects of coding practices (both good and bad), focusing primarily on the R language, though similar standards are applicable to other software environments. The lessons are organized into a top ten list.
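    As a small illustration of the style contrast the article discusses (its own examples focus on R; this hedged sketch uses Python, and the variable and function names are invented), the "bad" version below relies on a vague name and an unexplained constant, while the "good" version uses a named constant and a documented, reusable function.

# Hedged illustration in Python (the article's own examples focus on R):
# contrast a quick-and-dirty script line with a small, documented function.

# Bad: unclear intent, magic number, nothing reusable or testable.
x = [3.1, 2.8, 3.5]
y = sum(x) / len(x) * 2.2

# Good: named constant, descriptive function name, docstring.
KG_TO_LB = 2.2

def mean_weight_lb(weights_kg: list[float]) -> float:
    """Return the mean of the given weights, converted from kg to pounds."""
    return sum(weights_kg) / len(weights_kg) * KG_TO_LB

print(mean_weight_lb([3.1, 2.8, 3.5]))  # same result as the "bad" version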

    Report on the Second Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE2)

    This technical report records and discusses the Second Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE2). The report includes a description of the alternative, experimental submission and review process, two workshop keynote presentations, a series of lightning talks, a discussion on sustainability, and five discussions from the topic areas of exploring sustainability; software development experiences; credit & incentives; reproducibility & reuse & sharing; and code testing & code review. For each topic, the report includes a list of tangible actions that were proposed and that would lead to potential change. The workshop recognized that reliance on scientific software is pervasive in all areas of world-leading research today. The workshop participants then proceeded to explore different perspectives on the concept of sustainability. Key enablers of and barriers to sustainable scientific software were identified from their experiences. In addition, recommendations with new requirements, such as software credit files and software prize frameworks, were outlined for improving practices in sustainable software engineering. There was also broad consensus that formal training in software development or engineering was rare among the practitioners. Significant strides need to be made in building a sense of community via training in software and technical practices, in increasing the size and scope of such training, and in better integrating it directly into graduate education programs. Finally, journals can define and publish policies to improve reproducibility, and reviewers can insist that authors provide sufficient information and access to data and software to allow them to reproduce the results in the paper. Hence, a list of criteria is compiled for journals to provide to reviewers so as to make it easier to review software submitted for publication as a “Software Paper.”

    Connecting the astronomical testbed community -- the CAOTIC project: Optimized teaching methods for software version control concepts

    Laboratory testbeds are an integral part of conducting research and developing technology for high-contrast imaging and extreme adaptive optics. There are a number of laboratory groups around the world that use and develop resources that are imminently required for their operations, such as software and hardware controls. The CAOTIC (Community of Adaptive OpTics and hIgh Contrast testbeds) project is aimed to be a platform for this community to connect, share information, and exchange resources in order to conduct more efficient research in astronomical instrumentation, while also encouraging best practices and strengthening cross-team connections. In these proceedings, we present the goals of the CAOTIC project and our new website, and we focus in particular on a new approach to teaching version control to scientists, which is a cornerstone of successful collaborations in astronomical instrumentation. Comment: 15 pages, 6 figures, 2 tables; SPIE proceedings Astronomical Telescopes + Instrumentation 2022, 12185-11