136 research outputs found

    Software Engineering as Instrumentation for the Long Tail of Scientific Software

    The vast majority of the long tail of scientific software, the myriads of tools that implement the many analysis and visualization methods for different scientific fields, is highly specialized, purpose-built for a research project, and has to rely on community uptake and reuse for its continued development and maintenance. Although uptake cannot be controlled or even guaranteed, some of the key factors that influence whether new users or developers decide to adopt an existing tool or start a new one are about how easy or difficult it is to use or enhance a tool for a purpose for which it was not originally designed. The science of software engineering has produced techniques and practices that would reduce or remove a variety of barriers to community uptake of software, but for a variety of reasons employing trained software engineers as part of the development of long tail scientific software has proven to be challenging. As a consequence, community uptake of long tail tools is often far more difficult than it would need to be, even though opportunities for reuse abound. We discuss likely reasons why employing software engineering in the long tail is challenging, and propose that many of those obstacles could be addressed in the form of a cross-cutting non-profit center of excellence that makes software engineering broadly accessible as a shared service, conceptually and in its effect similar to shared instrumentation. Comment: 4 pages

    Persistent BioPerl

    I present BioSQL, a generic and highly extensible relational model for storing biological sequences, sequence clusters, genes, sequence features, sequence and feature annotation, and ontology terms. BioSQL also represents the interoperable persistence API among the Bio* life science programming toolkits (BioPerl, Biojava, Biopython, BioRuby), each of which has a language-binding to the BioSQL schema. I specifically present the Bioperl-db software, which transparently makes BioPerl objects persistent using BioSQL.

    RNeXML: a package for reading and writing richly annotated phylogenetic, character, and trait data in R

    NeXML is a powerful and extensible exchange standard recently proposed to better meet the expanding needs for phylogenetic data and metadata sharing. Here we present the RNeXML package, which provides users of the R programming language with easy-to-use tools for reading and writing NeXML documents, including rich metadata, in a way that interfaces seamlessly with the extensive library of phylogenetic tools already available in the R ecosystem.

    EvoIO: Community-driven standards for sustainable interoperability

    Interoperability is the property that allows systems to work together independent of who created them, or how or for what purpose they were implemented. It is crucial for aggregating data from different online resources and for integrating different kinds of data. Interoperability is based on effective standards that become and remain broadly adopted. We argue that to develop and apply such standards for evolutionary and biodiversity data sustainably, we need a community-driven, open, and participatory approach. With the goal to build such an approach, the EvoIO collaboration emerged in 2009 from several NESCent-sponsored activities. EvoIO aims to be a nucleating center for developing, applying and disseminating interoperability technology that connects and coordinates between stakeholders, developers, and standards bodies.

Members of the EvoIO group have harnessed a variety of collaborative events to successfully build an initial stack of interoperability technologies that is owned by the community and open to participation. The stack addresses syntax, semantics, and programmable services, and at present includes the following components: NeXML (http://nexml.org), a NEXUS-inspired XML format that is validatable yet extensible; CDAO (http://www.evolutionaryontology.org), an ontology of comparative data analysis formalizing the semantics of evolutionary data and metadata; and PhyloWS (http://evoinfo.nescent.org/PhyloWS), a web-services interface standard for querying, retrieving, and referencing phylogenetic data on the web. Beyond demonstration prototypes, reference implementations of EvoIO stack technologies are starting to appear in production use.

Aside from producing such information artefacts, EvoIO devotes much of its energy to applying principles of communication and organization that result in open and inclusive processes of community science. One of the key tools employed by EvoIO is the hackathon event format. Hackathons are highly collaborative, hands-on working meetings that catalyze practical innovation, train researchers, and foster cohesion as well as a sense of shared ownership in the results. In summary, we find that broad community participation, buy-in, and ownership are critical for developing interoperability in a sustainable fashion, and there are approaches and tools that can foster these effectively.

    Phenex: Ontological Annotation of Phenotypic Diversity

    Phenex is a platform-independent desktop application designed to facilitate efficient and consistent annotation of phenotypic variation using Entity-Quality syntax, drawing on terms from community ontologies for anatomical entities, phenotypic qualities, and taxonomic names. Despite the centrality of the phenotype to so much of biology, traditions for communicating information about phenotypes are idiosyncratic to different disciplines. Phenotypes seem to elude standardized descriptions due to the variety of traits that compose them and the difficulty of capturing the complex forms and subtle differences among organisms that we can readily observe. Consequently, phenotypes are refractory to attempts at data integration that would allow computational analyses across studies and study systems. Phenex addresses this problem by allowing scientists to employ standard ontologies and syntax to link computable phenotype annotations to evolutionary character matrices, as well as to link taxa and specimens to ontological identifiers. Ontologies have become a foundational technology for establishing shared semantics, and, more generally, for capturing and computing with biological knowledge.
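
As a rough illustration of the Entity-Quality idea described in this abstract, the following Python sketch models an annotation that ties a free-text character state in a matrix to ontology terms. It is not Phenex's data model or file format: the class names, field names, and the `EX:` identifiers are hypothetical placeholders chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """An ontology term; the ids used below are made-up placeholders."""
    id: str
    label: str


@dataclass(frozen=True)
class EQAnnotation:
    """One Entity-Quality annotation attached to a character-matrix cell."""
    study: str        # hypothetical study label
    taxon: Term       # taxon linked to an ontological identifier
    character: str    # free-text character as written in the source study
    state: str        # free-text state as written in the source study
    entity: Term      # anatomical entity (the "E" in EQ)
    quality: Term     # phenotypic quality (the "Q" in EQ)


def index_by_entity(annotations):
    """Group annotations by entity id so differently worded studies line up."""
    index = defaultdict(list)
    for annotation in annotations:
        index[annotation.entity.id].append(annotation)
    return dict(index)


# Two studies word the same observation differently but share the same terms.
fin = Term("EX:ENTITY_1", "pectoral fin")
absent = Term("EX:QUALITY_1", "absent")
annotations = [
    EQAnnotation("study_1", Term("EX:TAXON_1", "taxon A"),
                 "pectoral fin", "lacking", fin, absent),
    EQAnnotation("study_2", Term("EX:TAXON_2", "taxon B"),
                 "fin, pectoral: presence", "not present", fin, absent),
]
by_entity = index_by_entity(annotations)
print(len(by_entity[fin.id]), "annotations refer to", fin.label)
```

The point of the sketch is only that once both studies reference the same entity and quality identifiers, their idiosyncratic wording no longer prevents a query from finding them together.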

    Toward Synthesizing Our Knowledge of Morphology: Using Ontologies and Machine Reasoning to Extract Presence/Absence Evolutionary Phenotypes across Studies

    The reality of larger and larger molecular databases and the need to integrate data scalably have presented a major challenge for the use of phenotypic data. Morphology is currently primarily described in discrete publications, entrenched in non-computer-readable text, and requires enormous investments of time and resources to integrate across large numbers of taxa and studies. Here we present a new methodology, using ontology-based reasoning systems working with the Phenoscape Knowledgebase (KB; kb.phenoscape.org), to automatically integrate large amounts of evolutionary character state descriptions into a synthetic character matrix of neomorphic (presence/absence) data. Using the KB, which includes more than 55 studies of sarcopterygian taxa, we generated a synthetic supermatrix of 639 variable characters scored for 1051 taxa, resulting in over 145,000 populated cells. Of these characters, over 76% were made variable through the addition of inferred presence/absence states derived by machine reasoning over the formal semantics of the source ontologies. Inferred data reduced the missing data in the variable character-subset from 98.5% to 78.2%. Machine reasoning also enables the isolation of conflicts in the data, that is, cells where both presence and absence are indicated; reports regarding conflicting data provenance can be generated automatically. Further, reasoning enables quantification and new visualizations of the data, here for example, allowing identification of character space that has been undersampled across the fin-to-limb transition. The approach and methods demonstrated here to compute synthetic presence/absence supermatrices are applicable to any taxonomic and phenotypic slice across the tree of life, providing the data are semantically annotated.
Because such data can also be linked to model organism genetics through computational scoring of phenotypic similarity, they open a rich set of future research questions into phenotype-to-genome relationships.
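
The kind of inference, conflict detection, and missing-data accounting this abstract describes can be sketched in a few lines of Python. This is a toy under stated assumptions, not the Phenoscape reasoning pipeline: the three-term `PART_OF` hierarchy, the taxon labels, and the two rules used (a present part implies its containing structures are present; an absent structure implies its parts are absent) are simplifications assumed here for illustration.

```python
from collections import defaultdict

# Hypothetical toy "part of" hierarchy (part -> whole), not a real ontology.
PART_OF = {"digit": "autopod", "autopod": "limb"}


def wholes(entity):
    """Yield every structure that `entity` is transitively part of."""
    while entity in PART_OF:
        entity = PART_OF[entity]
        yield entity


def parts(entity):
    """Yield every structure that is transitively part of `entity`."""
    for part, whole in PART_OF.items():
        if whole == entity:
            yield part
            yield from parts(part)


def infer(asserted):
    """Expand (taxon, entity, state) triples into {(taxon, entity): states}."""
    matrix = defaultdict(set)
    for taxon, entity, state in asserted:
        matrix[(taxon, entity)].add(state)
        if state == "present":
            # Assumed rule: a present part implies its wholes are present.
            for whole in wholes(entity):
                matrix[(taxon, whole)].add("present")
        elif state == "absent":
            # Assumed rule: an absent whole implies its parts are absent.
            for part in parts(entity):
                matrix[(taxon, part)].add("absent")
    return dict(matrix)


def conflicts(matrix):
    """Return cells where both presence and absence are indicated."""
    return sorted(cell for cell, states in matrix.items()
                  if states == {"present", "absent"})


def missing_fraction(matrix, taxa, entities):
    """Fraction of taxon-by-entity cells that have no state at all."""
    populated = sum(1 for t in taxa for e in entities if matrix.get((t, e)))
    return 1 - populated / (len(taxa) * len(entities))


taxa = ["taxon_A", "taxon_B", "taxon_C", "taxon_D"]
entities = ["digit", "autopod", "limb"]
asserted = [
    ("taxon_A", "digit", "present"),
    ("taxon_B", "limb", "absent"),
    ("taxon_C", "digit", "present"),
    ("taxon_C", "autopod", "absent"),  # contradicts the line above
]
asserted_only = {(t, e): {s} for t, e, s in asserted}
matrix = infer(asserted)

print("missing before inference:", missing_fraction(asserted_only, taxa, entities))
print("missing after inference: ", missing_fraction(matrix, taxa, entities))
print("conflicting cells:", conflicts(matrix))
```

As a sanity check on the abstract's own figures, 639 characters by 1051 taxa gives 671,589 cells, so a little over 145,000 populated cells corresponds to roughly 78% missing data, in line with the reported 78.2%.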
