Ten Simple Rules for Taking Advantage of Git and GitHub.
Bioinformatics is a broad discipline in which one common denominator is the need to produce and/or use software that can be applied to biological data in different contexts. To enable and ensure the replicability and traceability of scientific claims, it is essential that the scientific publication, the corresponding datasets, and the data analysis are made publicly available [1,2]. All software used for the analysis should be either carefully documented (e.g., for commercial software) or, better yet, openly shared and directly accessible to others [3,4]. The rise of openly available software and source code alongside concomitant collaborative development is facilitated by the existence of several code repository services such as SourceForge, Bitbucket, GitLab, and GitHub, among others. These resources are also essential for collaborative software projects because they enable the organization and sharing of programming tasks between different remote contributors. Here, we introduce the main features of GitHub, a popular web-based platform that offers a free and integrated environment for hosting the source code, documentation, and project-related web content for open-source projects. GitHub also offers paid plans for private repositories (see Box 1) for individuals and businesses as well as free plans including private repositories for research and educational use.
Funder: Biotechnology and Biological Sciences Research Council.
This is the final version of the article. It first appeared from Public Library of Science via https://doi.org/10.1371/journal.pcbi.1004947
Ten Simple Rules for Reproducible Research in Jupyter Notebooks
Reproducibility of computational studies is a hallmark of scientific
methodology. It enables researchers to build with confidence on the methods and
findings of others, reuse and extend computational pipelines, and thereby drive
scientific progress. Since many experimental studies rely on computational
analyses, biologists need guidance on how to set up and document reproducible
data analyses or simulations.
In this paper, we address several questions about reproducibility. For
example, what are the technical and non-technical barriers to reproducible
computational studies? What opportunities and challenges do computational
notebooks offer to overcome some of these barriers? What tools are available
and how can they be used effectively?
We have developed a set of rules to serve as a guide to scientists with a
specific focus on computational notebook systems, such as Jupyter Notebooks,
which have become a tool of choice for many applications. Notebooks combine
detailed workflows with narrative text and visualization of results. Combined
with software repositories and open source licensing, notebooks are powerful
tools for transparent, collaborative, reproducible, and reusable data analyses.
RefDiff: Detecting Refactorings in Version Histories
Refactoring is a well-known technique that is widely adopted by software
engineers to improve the design and enable the evolution of a system. Knowing
which refactoring operations were applied in a code change is valuable
information for understanding software evolution, adapting software components,
merging code changes, and other applications. In this paper, we present RefDiff, an
automated approach that identifies refactorings performed between two code
revisions in a git repository. RefDiff employs a combination of heuristics
based on static analysis and code similarity to detect 13 well-known
refactoring types. In an evaluation using an oracle of 448 known refactoring
operations, distributed across seven Java projects, our approach achieved
precision of 100% and recall of 88%. Moreover, our evaluation suggests that
RefDiff achieves higher precision and recall than existing state-of-the-art
approaches.
Comment: Paper accepted at the 14th International Conference on Mining Software Repositories (MSR), pages 1-11, 201
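A code-similarity heuristic of the kind RefDiff combines with static analysis can be sketched roughly as follows. This is a hypothetical, simplified illustration (token-set Jaccard similarity flagging a probable Rename Method), not RefDiff's actual implementation:

```python
# Simplified sketch of a similarity heuristic in the spirit of RefDiff
# (NOT the actual implementation): method bodies from two revisions are
# compared by token-set similarity; a high score combined with a changed
# name suggests a Rename Method refactoring.
import re

def tokenize(source: str) -> set:
    """Split a code snippet into a set of identifier and operator tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|[^\s\w]", source))

def similarity(before: str, after: str) -> float:
    """Jaccard similarity between the token sets of two code bodies."""
    a, b = tokenize(before), tokenize(after)
    return len(a & b) / len(a | b) if a | b else 1.0

def looks_like_rename(name_before, body_before, name_after, body_after,
                      threshold=0.8):
    """Flag a probable Rename Method: near-identical body, different name."""
    return (name_before != name_after
            and similarity(body_before, body_after) >= threshold)

old = "total = 0\nfor x in items:\n    total += x\nreturn total"
new = "total = 0\nfor x in items:\n    total += x\nreturn total"
print(looks_like_rename("sum_items", old, "compute_total", new))  # True
```

A real detector must also handle moved files, reordered statements, and partially edited bodies, which is where the static-analysis side of the combination comes in.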
GIT Profiling
This dissertation was written with the objective of creating and objectively defining software developer profiles. To support the proposed profiles, data was extracted and transformed from GIT repositories by an automated process. This automation was achieved by an application that runs a combination of commands from GIT and Git Quick Stats and allows the consumption of the transformed output of these commands via an Application Programming Interface (API). A client application was also developed to aid in the validation of the profiles and to provide a dashboard-like User Interface (UI). The client application queries a Representational State Transfer (REST) service endpoint to get the available information from the server and runs a keyword-matching algorithm that counts the hits on a certain profile. The keywords serve as a dictionary of terms that can be found in commit messages, file names, or comments. The final dashboard represents the repository and profile information while also providing a way to compare multiple repositories. The profiles are also presented as trends, since a developer typically has more than one type of contribution. To increase confidence in the results of this automated process, manual checks were made to ensure that the right conclusions were reached regarding the profile definitions. The design and architecture of the developed applications follow a traditional client-server approach, which allowed for the separation of responsibilities as described above. To validate that the application behaved correctly, metrics on execution times and memory consumption were collected. Limitations of the developed work are also described, since there are external systems usually used in conjunction with GIT repositories that contain information which could be used to increase the accuracy of the profiles.
On a more technical level, some improvements to the overall architecture were also suggested that could enhance the final experience. Finally, future work was also theorised, including inferring seniority or expanding the profiles by integrating external systems that contain more information.
This dissertation was written with the objective of creating and objectively defining software developer profiles. To support the claims about the proposed profiles, data was collected from GIT repositories. For this purpose, an algorithm that extracts and categorises information in an automated fashion was developed. This automation was achieved by creating an application that can run a combination of GIT commands and an external library named Git Quick Stats, which provides human-readable information quickly and efficiently. The application runs over a number of repositories of different sizes and handles all the heavy operations involved in extracting and transforming the data. The transformed data is made available to the client through an API. For the visualisation and categorisation of the data extracted from a given repository, a client application was developed to help validate the profiles and display them in a simple dashboard. This client application can make requests to a REST-style service to obtain the transformed data. After obtaining these data, the application runs a keyword-matching algorithm and records the number of times a word from a given profile is found. The keywords serve as a dictionary of terms that can be found in commit messages, file names, and comments, and that are expected to exist when a profile is being evaluated.
The number of profiles detected and their precision are related to the volume and quality of this dictionary of terms; consequently, the dictionary was refined over the course of writing this dissertation. The final dashboard, developed for this dissertation, supports a readable visualisation of repository data, the representation of profiles, and the comparison of multiple repositories. The information present in GIT metrics, such as the number of commits, the number of contributors, and the number of files, can all be visualised. Alongside this information, the distribution of all profiles in the repository is shown in a radar-style chart. These profiles can also be viewed per developer, showing the profiles assigned to each one. A developer can have more than one profile, since their contributions may be of several different types; consequently, it was necessary to create a division into main and secondary profiles. The main profile is identified by comparing which of the assigned profiles has the highest score, and the remaining ones are considered secondary. To increase confidence in the results obtained by the automated process, manual checks were made to ensure that the conclusions reached about the profiles are valid. The final proposal for these profiles was also created taking into account the most common roles within software projects and teams. The design and architecture of the applications follow a traditional client-server model, which allows a clear separation of the responsibilities described above. To validate the behaviour of these applications, data and metrics on execution times and memory consumption were collected.
The limitations of the developed work are described, given that there are external systems typically used in conjunction with GIT repositories. These systems contain additional information that could be used to increase the precision of the detected profiles or even help create new ones. Changes and improvements to the developed applications were also suggested. Finally, future work on this topic was also discussed: the concepts in this dissertation could be extended to relate developers' levels of experience or to integrate external systems to increase the amount of available data.
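The keyword-matching step described in this abstract can be sketched roughly as follows. The profile names and keywords here are invented for illustration and are not taken from the dissertation:

```python
# Hypothetical sketch of the keyword-matching step: each profile is a
# dictionary of terms, and hits are counted across commit messages.
# Profile names and keywords are illustrative assumptions.
from collections import Counter
import re

PROFILE_KEYWORDS = {
    "tester":   {"test", "assert", "coverage", "mock"},
    "frontend": {"ui", "css", "layout", "component"},
    "devops":   {"deploy", "pipeline", "docker", "ci"},
}

def score_profiles(commit_messages):
    """Count keyword hits per profile over all commit messages."""
    scores = Counter()
    for message in commit_messages:
        words = set(re.findall(r"\w+", message.lower()))
        for profile, keywords in PROFILE_KEYWORDS.items():
            scores[profile] += len(words & keywords)
    return scores

def main_profile(commit_messages):
    """The highest-scoring profile is the main one; others are secondary."""
    scores = score_profiles(commit_messages)
    return scores.most_common(1)[0][0]

msgs = ["Add unit test and assert edge cases",
        "Increase coverage of auth module",
        "Fix CSS layout bug"]
print(main_profile(msgs))  # prints 'tester'
```

The same counting could be applied to file names and code comments, as the dissertation describes, simply by feeding those strings into `score_profiles` as well.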
Ten simple rules for teaching sustainable software engineering
Computational methods and associated software implementations are central to
every field of scientific investigation. Modern biological research,
particularly within systems biology, has relied heavily on the development of
software tools to process and organize increasingly large datasets, simulate
complex mechanistic models, provide tools for the analysis and management of
data, and visualize and organize outputs. However, developing high-quality
research software requires scientists to develop a host of software development
skills, and teaching these skills to students is challenging. Growing
importance has been placed on ensuring reproducibility and good development
practices in computational research. However, less attention has been devoted
to identifying the specific teaching strategies that are effective at nurturing
in researchers the complex skill set required to produce the high-quality software
that increasingly underpins both academic and industrial
biomedical research. Recent articles in the Ten Simple Rules collection have
discussed the teaching of foundational computer science and coding techniques
to biology students. We advance this discussion by describing the specific
steps for effectively teaching the necessary skills scientists need to develop
sustainable software packages which are fit for (re-)use in academic research
or more widely. Although our advice is likely to be applicable to all students
and researchers hoping to improve their software development skills, our
guidelines are directed towards an audience of students that have some
programming literacy but little formal training in software development or
engineering, typical of early doctoral students. These practices are also
applicable outside of doctoral training environments, and we believe they
should form a key part of postgraduate training schemes more generally in the
life sciences.
Comment: Prepared for submission to PLOS Computational Biology's 10 Simple Rules collection
ATLaS: Assistant Software for Life Scientists to Use in Calculations of Buffer Solutions
Many solutions, such as percentage, molar, and buffer solutions, are used in all experiments conducted in life science laboratories. Although preparing solutions is not difficult, miscalculations made during intensive laboratory work negatively affect experimental results; for experiments to work correctly, the solutions must be prepared exactly right. In this project, a software tool, ATLaS (Assistant Toolkit for Laboratory Solutions), has been developed to eliminate solution errors arising from calculations. ATLaS was developed in the Python programming language, using the Tkinter and Pandas libraries. It contains five main modules: (1) Percent Solutions, (2) Molar Solutions, (3) Acid-Base Solutions, (4) Buffer Solutions, and (5) Unit Converter. Each main module contains its own sub-functions. With PyInstaller, the software was converted into a stand-alone executable file. The source code of ATLaS is available at https://github.com/cugur1978/ATLaS
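The kinds of calculations such a tool automates can be sketched as follows. This is a minimal illustration of two standard formulas (solute mass for a molar solution, and the Henderson-Hasselbalch equation for buffer pH), not ATLaS's actual code:

```python
# Minimal sketch of laboratory-solution calculations (not ATLaS code):
# the mass of solute needed for a molar solution, and the pH of a
# buffer via the Henderson-Hasselbalch equation.
import math

def grams_for_molar_solution(molarity_M, volume_L, molar_mass_g_per_mol):
    """Mass of solute in grams: m = M * V * molar mass."""
    return molarity_M * volume_L * molar_mass_g_per_mol

def buffer_pH(pKa, conc_base_M, conc_acid_M):
    """Henderson-Hasselbalch: pH = pKa + log10([A-]/[HA])."""
    return pKa + math.log10(conc_base_M / conc_acid_M)

# 0.5 L of a 1 M NaCl solution (58.44 g/mol) requires 29.22 g of salt.
print(grams_for_molar_solution(1.0, 0.5, 58.44))  # 29.22
# Acetate buffer (pKa 4.76) with equal acid and base concentrations: pH = pKa.
print(buffer_pH(4.76, 0.1, 0.1))                  # 4.76
```

Automating these formulas removes exactly the transcription and arithmetic slips the abstract describes as a source of failed experiments.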
The importance of good coding practices for data scientists
Many data science students and practitioners are reluctant to adopt good
coding practices as long as the code "works". However, code standards are an
important part of modern data science practice, and they play an essential role
in the development of "data acumen". Good coding practices lead to more
reliable code and often save more time than they cost, making them important
even for beginners. We believe that principled coding practices are vital for
statistics and data science. To instill these practices within academic
programs, it is important for instructors and programs to begin establishing
these practices early, to reinforce them often, and to hold themselves to a
higher standard while guiding students. We describe key aspects of coding
practices (both good and bad), focusing primarily on the R language, though
similar standards are applicable to other software environments. The lessons
are organized into a top ten list.
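One such practice (named constants and descriptive identifiers over magic numbers) can be illustrated with a before/after pair. The paper's examples use R; this sketch uses Python, and the function and values are invented for illustration:

```python
# Illustrative before/after of a good coding practice (hypothetical
# example). Magic numbers and cryptic names make the first version hard
# to verify; the second is self-documenting.

# Bad: cryptic name, magic number, no indication of intent or units.
def f(x):
    return [i * 2.54 for i in x]

# Good: named constant, descriptive names, docstring stating units.
CM_PER_INCH = 2.54

def inches_to_centimeters(lengths_in_inches):
    """Convert a list of lengths from inches to centimeters."""
    return [length * CM_PER_INCH for length in lengths_in_inches]

print(inches_to_centimeters([1]))  # [2.54]
```

Both versions "work", which is precisely the paper's point: the payoff of the second version is reliability and review speed, not correctness on the happy path.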
Report on the Second Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE2)
This technical report records and discusses the Second Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE2). The report includes a description of the alternative, experimental submission and review process, two workshop keynote presentations, a series of lightning talks, a discussion on sustainability, and five discussions from the topic areas of exploring sustainability; software development experiences; credit & incentives; reproducibility & reuse & sharing; and code testing & code review. For each topic, the report includes a list of tangible actions that were proposed and that would lead to potential change. The workshop recognized that reliance on scientific software is pervasive in all areas of world-leading research today. The workshop participants then proceeded to explore different perspectives on the concept of sustainability. Key enablers and barriers of sustainable scientific software were identified from their experiences. In addition, recommendations with new requirements, such as software credit files and software prize frameworks, were outlined for improving practices in sustainable software engineering. There was also broad consensus that formal training in software development or engineering was rare among the practitioners. Significant strides need to be made in building a sense of community via training in software and technical practices, on increasing their size and scope, and on better integrating them directly into graduate education programs. Finally, journals can define and publish policies to improve reproducibility, whereas reviewers can insist that authors provide sufficient information and access to data and software to allow them to reproduce the results in the paper. Hence a list of criteria is compiled for journals to provide to reviewers so as to make it easier to review software submitted for publication as a “Software Paper”.
Connecting the astronomical testbed community -- the CAOTIC project: Optimized teaching methods for software version control concepts
Laboratory testbeds are an integral part of conducting research and
developing technology for high-contrast imaging and extreme adaptive optics.
There are a number of laboratory groups around the world that use and develop
resources that are essential for their operations, such as software
and hardware controls. The CAOTIC (Community of Adaptive OpTics and hIgh
Contrast testbeds) project aims to be a platform for this community to
connect, share information, and exchange resources in order to conduct more
efficient research in astronomical instrumentation, while also encouraging best
practices and strengthening cross-team connections. In these proceedings, we
present the goals of the CAOTIC project, our new website, and we focus in
particular on a new approach to teaching version control to scientists, which
is a cornerstone of successful collaborations in astronomical instrumentation.
Comment: 15 pages, 6 figures, 2 tables; SPIE proceedings Astronomical
Telescopes + Instrumentation 2022, 12185-11