
    HydroShare – A Case Study of the Application of Modern Software Engineering to a Large Distributed Federally-Funded Scientific Software Development Project

    HydroShare is an online collaborative system under development to support the open sharing of hydrologic data, analytical tools, and computer models. With HydroShare, scientists can easily discover, access, and analyze hydrologic data and thereby enhance the production and reproducibility of hydrologic scientific results. HydroShare also takes advantage of emerging social media functionality to enable users to enrich both the information about, and the collaboration around, hydrologic data and models. HydroShare is being developed by an interdisciplinary collaborative team of domain scientists, university software developers, and professional software engineers from ten institutions located across the United States. While the combination of non–co-located, diverse stakeholders presents communication and management challenges, the interdisciplinary nature of the team is integral to the project’s goal of improving scientific software development and capabilities in academia. This chapter describes the challenges faced and lessons learned during the development of HydroShare, as well as the approach to software development that the HydroShare team adopted on the basis of those lessons. The chapter closes with recommendations for the application of modern software engineering techniques to large, collaborative, scientific software development projects similar to the National Science Foundation (NSF)–funded HydroShare, so that other teams can successfully apply the approach described herein to other projects.
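    To make the data-access side of this concrete, the sketch below lists public resources through HydroShare's REST interface (hsapi) using Python's requests library. The endpoint path and JSON field names are assumptions based on the public API, not details taken from the chapter:

        import requests

        # Assumed public endpoint of HydroShare's REST interface (hsapi);
        # field names in the JSON response are best-effort assumptions.
        BASE = "https://www.hydroshare.org/hsapi"

        resp = requests.get(f"{BASE}/resource/", timeout=30)
        resp.raise_for_status()

        # The listing is paginated; print the first page of resources.
        for res in resp.json().get("results", []):
            print(res.get("resource_id"), "-", res.get("resource_title"))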

    Best Practices for Scientific Computing

    Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists' productivity and the reliability of their software.
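    As one concrete illustration of the kind of practices the paper advocates (defensive programming with assertions, and turning bugs into test cases), here is a minimal sketch of my own, not code from the paper:

        def mean(values):
            """Arithmetic mean of a non-empty sequence of numbers."""
            # The assertion documents the function's assumption and fails
            # loudly instead of silently producing a wrong answer.
            assert len(values) > 0, "mean() requires at least one value"
            return sum(values) / len(values)

        # A bug, once found, becomes a permanent regression test.
        def test_mean_basic():
            assert mean([1.0, 2.0, 3.0]) == 2.0

        def test_mean_rejects_empty_input():
            try:
                mean([])
            except AssertionError:
                pass
            else:
                raise AssertionError("expected mean([]) to fail")

        if __name__ == "__main__":
            test_mean_basic()
            test_mean_rejects_empty_input()
            print("all tests passed")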

    Efficient matrix-free implementation and automated verification of hybridizable discontinuous Galerkin finite element methods

    Thesis: S.M., Massachusetts Institute of Technology, Department of Mechanical Engineering, 2019. By Corbin Foucart.
    This work focuses on developing efficient and robust implementation methods for hybridizable discontinuous Galerkin (HDG) schemes for fluid and ocean dynamics. In the first part, we compare choices in weak formulations and their numerical consequences. We address details in making the leap from the mathematical formulation to the implementation, including the different spaces and mappings, discretization of the integral operators, boundary conditions, and assembly of the linear systems. We provide a flexible mapping procedure amenable to both quadrature-free and quadrature-based discretizations, and compare the accuracy of the two on different problem geometries. We verify the quadrature-free approach, demonstrating that optimal orders of convergence can be obtained even on non-affine and curvilinear geometries. The second part of the work investigates the scalability of HDG schemes, identifying memory and time-to-solution bottlenecks. The form of the quadrature-free integral operators is exploited to develop a novel and efficient matrix-free approach to solving the global linear system that arises from HDG discretizations. Additional manipulations to improve numerical robustness are discussed. To mitigate the complexity of the implementation, we provide an automated and computationally efficient verification procedure for the HDG methodologies discussed, using a hierarchical approach to provide diagnostic information and isolate problems. Finally, challenges related to the effective visualization of high-order, discontinuous HDG-FEM data for fluid and ocean applications are illustrated, and strategies are provided to address them.
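    The matrix-free idea can be sketched independently of the thesis code: instead of assembling the global matrix, only its action on a vector is supplied to a Krylov solver. A minimal illustration in Python with SciPy, where the toy operator is a stand-in for the HDG trace-system action (all names here are illustrative, not from the thesis):

        import numpy as np
        from scipy.sparse.linalg import LinearOperator, cg

        n = 200

        def apply_global_operator(x):
            # Stand-in for the HDG global (trace) operator: a 1-D Laplacian
            # stencil applied without ever forming the matrix.
            y = 2.0 * x
            y[1:] -= x[:-1]
            y[:-1] -= x[1:]
            return y

        # Only the operator's action is exposed to the solver, so storage
        # stays O(n) rather than O(nnz) for an assembled sparse matrix.
        A = LinearOperator((n, n), matvec=apply_global_operator, dtype=float)
        b = np.ones(n)

        x, info = cg(A, b)  # any Krylov method needing only matvecs works
        residual = np.linalg.norm(apply_global_operator(x) - b)
        print(f"converged: {info == 0}, residual norm: {residual:.2e}")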

    The development of computational methods for large-scale comparisons and analyses of genome evolution

    The last four decades have seen the development of a number of experimental methods for the deduction of whole-genome sequences for an ever-increasing number of organisms. These sequences have, in the first instance, allowed their investigators the opportunity to examine the molecular primary structure of areas of scientific interest; but with the increased sampling of organisms across the phylogenetic tree and the improved quality and coverage of genome sequences and their associated annotations, the opportunity to undertake detailed comparisons both within and between taxonomic groups has presented itself. The work described in this thesis details the application of comparative bioinformatics analyses to inter- and intra-genomic datasets, to elucidate those genomic changes which may underlie organismal adaptations and contribute to changes in the complexity of genome content and structure over time. The results contained herein demonstrate the power and flexibility of the comparative approach, utilising whole-genome data, to answer some of the most pressing questions in the biological sciences today.

    As the volume of genomic data increases, both as a result of increased sampling of the tree of life and due to improvements in the quality and throughput of sequencing methods, computational analyses of these data have become a necessity. Manual analysis of this volume of data, which can extend beyond petabytes of storage space, is now impossible; automated computational pipelines are therefore required to retrieve, categorise, and analyse these data. Chapter two discusses the development of a computational pipeline named the Genome Comparison and Analysis Toolkit (GCAT). The pipeline was developed in the Perl programming language and is tightly integrated with the Ensembl Perl API, allowing for the retrieval and analysis of Ensembl's rich genomic resources. In the first instance, the pipeline was tested for robustness by retrieving and describing various components of genomic architecture across a number of taxonomic groups. Additionally, the need for programmatically independent means of accessing data, and in particular for Semantic Web-based protocols and tools for the sharing of genomics resources, is highlighted; this serves not just the requirements of researchers, but improved communication and sharing between computational infrastructures. A prototype Ensembl REST web service was developed in collaboration with the European Bioinformatics Institute (EBI) to provide a means of accessing Ensembl's genomic data without having to rely on the Perl API. The runtime and memory usage of the Ensembl Perl API and the prototype REST API were compared relative to baseline raw SQL queries, which highlights the overheads inherent in building wrappers around SQL queries. Differences in the efficiency of the approaches are described, and the importance of investing in the development of Semantic Web technologies as a tool to improve access to data for the wider scientific community is discussed.
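    The production REST service that grew out of this prototype illustrates the point: the same HTTP call works from any language, with no dependency on the Ensembl Perl API. A minimal sketch against the current public endpoint (rest.ensembl.org, not the thesis-era prototype):

        import requests

        SERVER = "https://rest.ensembl.org"

        # Look up a gene by its stable Ensembl ID (ENSG00000157764 is BRAF);
        # the service returns plain JSON over HTTP.
        resp = requests.get(
            f"{SERVER}/lookup/id/ENSG00000157764",
            headers={"Content-Type": "application/json"},
            timeout=30,
        )
        resp.raise_for_status()
        gene = resp.json()
        print(gene["display_name"], gene["seq_region_name"],
              gene["start"], gene["end"])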
    Data highlighted in chapter two led to the identification of relative differences in the intron structure of a number of organisms, including teleost fish. Chapter three encompasses a published, peer-reviewed study. Inter-genomic comparisons were undertaken utilising the five available teleost genome sequences in order to examine and describe their intron content. The number and sizes of introns were compared across these fish, and a frequency distribution of intron size was produced that identified a novel expansion in the zebrafish lineage of introns in the size range of approximately 500-2,000 bp. Further hypothesis-driven analyses of introns across the whole distribution of intron sizes identified that the majority, but not all, of these introns were largely comprised of repetitive elements. It was concluded that the introns in the zebrafish peak were likely the result of an ancient expansion of repetitive elements that have since degraded beyond the ability of computational algorithms to identify them. Additional sampling throughout the teleost lineage will allow more focused, phylogenetically driven analyses to be undertaken in the future.
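    The frequency-distribution step can be sketched as follows; the intron lengths here are synthetic placeholders, and the bin choices are illustrative rather than those used in the study:

        import numpy as np

        # In the study, intron lengths would come from parsed genome
        # annotations; here a synthetic lognormal sample stands in.
        rng = np.random.default_rng(0)
        intron_lengths = rng.lognormal(mean=7.0, sigma=1.0, size=50_000).astype(int)

        # Frequency distribution on log-spaced bins, the kind of view in
        # which a lineage-specific peak (e.g., ~500-2,000 bp) stands out.
        bins = np.logspace(1, 6, num=60)  # 10 bp to 1 Mbp
        counts, edges = np.histogram(intron_lengths, bins=bins)
        peak = np.argmax(counts)
        print(f"modal size class: {edges[peak]:.0f}-{edges[peak + 1]:.0f} bp "
              f"({counts[peak]} introns)")

        # Fraction of introns in the 500-2,000 bp window discussed above.
        share = ((intron_lengths >= 500) & (intron_lengths <= 2000)).mean()
        print(f"fraction in 500-2,000 bp window: {share:.1%}")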
    In chapter four, phylogenetic comparative analyses of gene duplications were undertaken across primate and rodent taxonomic groups with the intention of identifying significantly expanded or contracted gene families, since changes in the size of gene families may indicate adaptive evolution. A larger number of expansions, relative to time since common ancestor, were identified in the branch leading to modern humans than in any other primate species. Due to the unique nature of the human data in terms of quantity and quality of annotation, additional analyses were undertaken to determine whether the expansions were methodological artefacts or real biological changes. Novel approaches were developed to test the validity of the data, including comparisons to other highly annotated genomes. No similar expansion was seen in mouse when comparing with rodent data; however, as assemblies and annotations were updated, the number of significant changes varied, which brings into question the reliability of the underlying assembly and annotation data. This emphasises the importance of understanding that computational predictions, in the absence of supporting evidence, may not represent the actual genomic structure and may instead be an artefact of the software parameter space. In particular, significant shortcomings are highlighted that stem from the assumptions and parameters of the models used by the CAFE gene family analysis software. We must bear in mind that genome assemblies and annotations are themselves hypotheses that need to be questioned and subjected to robust controls to increase the confidence in any conclusions drawn from them.

    In addition, functional genomics analyses were undertaken to identify the roles of significantly changed genes and gene families in primates, testing against the hypothesis that the majority of changes would involve immune, sensory, or reproductive genes. Gene Ontology (GO) annotations were retrieved for these data, enabling both broad GO groupings and more specific functional classifications to be highlighted. The results showed that the majority of gene expansions were in families that may have arisen due to adaptation, or were maintained due to their necessary involvement in developmental and metabolic processes. Comparisons were made to previously published studies to determine whether the Ensembl functional annotations were supported by the de novo analyses undertaken in those studies. The majority were not, with only a small number of previously identified functional annotations being present in the most recent Ensembl releases.

    The impact of gene family evolution on intron evolution was explored in chapter five by analysing gene family data and intron characteristics across the genomes of 61 vertebrate species. General descriptive statistics and visualisations were produced, along with tests for correlation between change in gene family size and the number, size, and density of the associated introns. Change in gene family size was shown to have very little impact on the underlying intron evolution, so other, non-family effects were considered. These analyses showed that introns were restricted to euchromatic regions, with heterochromatic regions such as the centromeres and telomeres being largely devoid of such features. A greater involvement of spatial mechanisms such as recombination, GC-bias across GC-rich isochores, and biased gene conversion was thus proposed, though depending largely on the population genetic and life history traits of the organisms involved. Additional population-level sequencing and comparative analyses across a divergent group of species with available recombination maps and life history data would be a useful future direction in understanding the processes involved.
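    The chapter-five correlation tests can be sketched with synthetic stand-ins for the per-family measurements; a Spearman rank correlation is shown here as a representative choice for such non-normal data, though the thesis may have used different tests:

        import numpy as np
        from scipy.stats import spearmanr

        rng = np.random.default_rng(1)

        # Placeholder per-gene-family measurements; in the thesis these are
        # derived from the genomes and annotations of 61 vertebrate species.
        family_size_change = rng.integers(-5, 6, size=1_000)   # members gained/lost
        intron_density = rng.gamma(shape=2.0, scale=3.0, size=1_000)  # introns per kb

        rho, p_value = spearmanr(family_size_change, intron_density)
        # A rho near zero mirrors the reported result: little relationship
        # between change in family size and intron characteristics.
        print(f"Spearman rho = {rho:+.3f}, p = {p_value:.3g}")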

    Supporting the Quality Assurance of a Scientific Framework

    The quality assurance of scientific software must contend with the special challenges of this type of software, including missing test oracles, the need for high-performance computing, and the high priority of non-functional requirements. A scientific framework consists of common code that provides solutions for several similar mathematical problems. The many possible uses of a scientific framework lead to large variability in the framework, so in addition to the general challenges of scientific software, the quality assurance of a scientific framework needs a way of dealing with this variability. In software product line engineering (SPLE), the idea is to develop a software platform and then use mass customization to create a group of similar applications. In this thesis, we show how SPLE, in particular variability modeling, can be applied to support the quality assurance of scientific frameworks. One of the main contributions of this thesis is a process for the creation of reengineering variability models for a scientific framework based on its mathematical requirements. Reengineering means the adjustment of a software system to improve its quality, mostly without changing its functionality; because our variability models are created for existing software, we call them reengineering variability models. The created variability models are used for the systematic development of system test applications for the framework. Additionally, we developed a model-based method for deriving test cases for the system test applications from the variability models. Furthermore, we contribute a software product line test strategy for scientific frameworks; a test strategy strongly influences the test activities performed. Another main contribution of this thesis is the design of a quality assurance process for scientific frameworks, which combines the test activities of the test strategy with other quality assurance activities. We introduce a list of special characteristics of scientific software, which we use as the rationale for the design of this process. We report on a case study analysing the feasibility and developer acceptance of two parts of the quality assurance process: variability model creation and desk-checking, a kind of lightweight review. Using FeatureIDE, an environment for feature-oriented software development, together with an automated test environment, we prototypically demonstrate the applicability of our approach.
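    The role a variability model plays in deriving system test applications can be sketched as follows: features and constraints define the valid configurations, and each valid configuration is a candidate test application. The feature names and the constraint below are invented for illustration; the thesis builds its models in FeatureIDE from the framework's mathematical requirements:

        from itertools import product

        # Invented feature model for a hypothetical PDE framework: each
        # feature is a binary choice, and constraints encode which
        # combinations are valid.
        features = ["implicit_timestepping", "adaptive_mesh", "gpu_backend"]

        def is_valid(config):
            # Example cross-tree constraint: the (hypothetical) GPU backend
            # does not support adaptive meshes.
            return not (config["gpu_backend"] and config["adaptive_mesh"])

        valid_configs = [
            dict(zip(features, choice))
            for choice in product([False, True], repeat=len(features))
            if is_valid(dict(zip(features, choice)))
        ]

        # Each valid configuration becomes one system test application.
        for i, cfg in enumerate(valid_configs):
            enabled = [f for f, on in cfg.items() if on] or ["(base)"]
            print(f"test application {i}: " + ", ".join(enabled))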