
    Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale

    We study the evolution of the largest known corpus of publicly available source code, i.e., the Software Heritage archive (4B unique source code files, 1B commits capturing their development histories across 50M software projects). On this corpus we quantify the growth rate of original, never-seen-before source code files and commits. We find the growth rates to be exponential over a period of more than 40 years. We then estimate the multiplication factor, i.e., how often the same artifacts (e.g., files or commits) appear in different contexts (e.g., commits or source code distribution places). We observe a combinatorial explosion in the multiplication of identical source code files across different commits. We discuss the implications of these findings for the problem of tracking the provenance of source code artifacts (e.g., where and when a given source code file or commit has been observed in the wild) for the entire body of publicly available source code. To that end we benchmark different data models for capturing software provenance information at this scale and growth rate. We identify a viable solution that is deployable on commodity hardware and appears to be maintainable for the foreseeable future.
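    The exponential growth claim can be made concrete with a small fitting sketch. The yearly counts below are invented placeholders, not Software Heritage figures; the point is only to show how a growth rate and doubling time would be estimated from counts of never-seen-before artifacts.

    ```python
    # Minimal sketch: fit an exponential growth model c(t) = a * exp(r * t)
    # to yearly counts of original (never-seen-before) artifacts.
    # The counts below are illustrative placeholders, not Software Heritage data.
    import numpy as np

    years = np.arange(2005, 2015)                           # observation years (hypothetical)
    counts = np.array([1.2e6, 1.9e6, 3.1e6, 4.8e6, 7.6e6,
                       1.2e7, 1.9e7, 3.0e7, 4.7e7, 7.5e7])  # new artifacts per year

    # Linear regression on log-counts: log c = log a + r * t
    t = years - years[0]
    rate, log_a = np.polyfit(t, np.log(counts), 1)

    doubling_time = np.log(2) / rate  # years for the yearly volume of new artifacts to double
    print(f"estimated rate: {rate:.2f}/year, doubling time: {doubling_time:.2f} years")
    ```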

    Chemical information matters: an e-Research perspective on information and data sharing in the chemical sciences

    Recently, a number of organisations have called for open access to scientific information and especially to the data obtained from publicly funded research, among which the Royal Society report and the European Commission press release are particularly notable. It has long been accepted that building research on the foundations laid by other scientists is both effective and efficient. Regrettably, some disciplines, chemistry being one, have been slow to recognise the value of sharing and have thus been reluctant to curate their data and information in preparation for exchanging them. The very significant increases in both the volume and the complexity of the datasets produced have encouraged the expansion of e-Research, and stimulated the development of methodologies for managing, organising, and analysing "big data". We review the evolution of cheminformatics, the amalgam of chemistry, computer science, and information technology, and assess the wider e-Science and e-Research perspective. Chemical information does matter, as do matters of communicating data and collaborating with data. For chemistry, unique identifiers, structure representations, and property descriptors are essential to the activities of sharing and exchange. Open science entails the sharing of more than mere facts: for example, the publication of negative outcomes can facilitate better understanding of which synthetic routes to choose, an aspiration of the Dial-a-Molecule Grand Challenge. The protagonists of open notebook science go even further and exchange their thoughts and plans. We consider the concepts of preservation, curation, provenance, discovery, and access in the context of the research lifecycle, and then focus on the role of metadata, particularly the ontologies on which the emerging chemical Semantic Web will depend. Among our conclusions, we present our choice of the "grand challenges" for the preservation and sharing of chemical information.
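    As a concrete illustration of why identifiers and structure representations matter for exchange, the following sketch derives a canonical InChI identifier and one property descriptor from a SMILES string. It assumes the open-source RDKit toolkit is installed; the molecule (aspirin) is an arbitrary example, not one discussed above.

    ```python
    # Sketch: deriving a unique identifier (InChI/InChIKey) and a property
    # descriptor from a structure representation (SMILES), using RDKit.
    # RDKit is an assumed dependency; aspirin is an arbitrary example molecule.
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    smiles = "CC(=O)Oc1ccccc1C(=O)O"     # aspirin, as a SMILES structure representation
    mol = Chem.MolFromSmiles(smiles)

    inchi = Chem.MolToInchi(mol)         # canonical identifier suitable for exchange
    inchi_key = Chem.MolToInchiKey(mol)  # fixed-length hash of the InChI
    mol_weight = Descriptors.MolWt(mol)  # one example property descriptor

    print(inchi_key, round(mol_weight, 2))
    ```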

    Towards Automatic Capturing of Manual Data Processing Provenance

    Often data processing is not implemented by a workflow system or an integration application but is performed manually by humans along the lines of a more or less specified procedure. Collecting provenance information during such manual data processing cannot be automated, and manual collection of provenance information is error-prone and time-consuming. Therefore, we propose to infer provenance information based on the read and write access of users. The derived provenance information is complete but has low precision. We therefore further propose introducing organizational guidelines in order to improve the precision of the inferred provenance information.
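    The inference idea can be sketched as follows. Assuming an access log of (user, file, operation, timestamp) events, every file a user has read before writing another file is conservatively treated as a possible input of that write; the result is complete but imprecise, exactly the trade-off described above. The event format and file names here are hypothetical.

    ```python
    # Sketch of the inference idea described above: derive candidate provenance
    # edges from read/write access events. The event format is an assumption.
    from collections import defaultdict

    # (user, file, operation, timestamp) -- hypothetical access log
    events = [
        ("alice", "raw.csv",    "read",  1),
        ("alice", "params.txt", "read",  2),
        ("alice", "clean.csv",  "write", 3),
        ("bob",   "clean.csv",  "read",  4),
        ("bob",   "report.pdf", "write", 5),
    ]

    reads_so_far = defaultdict(set)   # user -> files read so far
    derived_from = defaultdict(set)   # written file -> candidate source files

    for user, path, op, _ts in sorted(events, key=lambda e: e[3]):
        if op == "read":
            reads_so_far[user].add(path)
        elif op == "write":
            # Every earlier read by the same user is a *possible* input:
            # complete, but imprecise without further organizational guidelines.
            derived_from[path] |= reads_so_far[user]

    print(dict(derived_from))
    # e.g. {'clean.csv': {'raw.csv', 'params.txt'}, 'report.pdf': {'clean.csv'}}
    ```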

    BRIL - Capturing Experiments in the Wild

    This presentation describes a project to embed a repository system (based on Fedora) within the complex, experimental processes of a number of researchers in biophysics and structural biology. The project is capturing not just individual datasets but entire experimental workflows as complex objects, incorporating provenance information based on the Open Provenance Model, to support reproduction and validation of published results. The repository is integrated within these experimental processes, so that data capture is as far as possible automatic and invisible to the researcher. A particular challenge is that the researchers’ work takes place in local environments within the department, entirely decoupled from the repository. In meeting this challenge, the project is bridging the gap between the “wild”, ad hoc and independent environment of the researcher’s desktop and the curated, sustainable, institutional environment of the repository, and in the process the project crosses the boundary between several of the pairs of polar opposites identified in the call.
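    As a rough illustration of the kind of provenance such complex objects carry, the sketch below records one experimental step using the Open Provenance Model's node and edge vocabulary (artifacts, processes, agents; used / wasGeneratedBy / wasControlledBy). The specific entities are invented for illustration, not taken from the BRIL project.

    ```python
    # Rough sketch of an Open Provenance Model style record for one experimental
    # step. Node and edge names follow OPM's vocabulary (artifact, process, agent;
    # used / wasGeneratedBy / wasControlledBy); the example entities are invented.
    provenance = {
        "artifacts": ["crystal_image_001.tif", "diffraction_dataset_001"],
        "processes": ["xray_diffraction_run_42"],
        "agents":    ["researcher_jdoe"],
        "edges": [
            ("used",            "xray_diffraction_run_42", "crystal_image_001.tif"),
            ("wasGeneratedBy",  "diffraction_dataset_001", "xray_diffraction_run_42"),
            ("wasControlledBy", "xray_diffraction_run_42", "researcher_jdoe"),
        ],
    }

    # A repository ingesting the whole workflow would store such records alongside
    # the datasets so that published results can be traced back and reproduced.
    for edge, src, dst in provenance["edges"]:
        print(f"{src} --{edge}--> {dst}")
    ```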

    Are current ecological restoration practices capturing natural levels of genetic diversity? A New Zealand case study using AFLP and ISSR data from mahoe (Melicytus ramiflorus)

    Sourcing plant species of local provenance (eco-sourcing) has become standard practice in plant community restoration projects. Along with established ecological restoration practices, knowledge of genetic variation in existing and restored forest fragments is important for ensuring the maintenance of natural levels of genetic variation and connectivity (gene flow) among populations. The application of restoration genetics often employs anonymous ‘fingerprinting’ markers in combination with limited sample sizes due to financial constraints. Here, we used two such marker systems, AFLPs and ISSRs, to estimate population-level genetic variation of a species frequently used in restoration projects in New Zealand, māhoe (Melicytus ramiflorus, Violaceae). We examined two rural and two urban forest fragments, as potential local source populations, to determine whether the māhoe population at the recently (re)constructed ecosystem at Waiwhakareke Natural Heritage Park (WNHP), Hamilton, New Zealand reflects the genetic variation observed in these four potential source populations. Both marker systems produced similar results and indicated, even with small population sizes, that levels of genetic variation at WNHP were comparable to those of the in situ populations. However, AFLPs did provide finer resolution of the population genetic structure than ISSRs. ISSRs, which are less expensive and technically less demanding to generate than AFLPs, may be sufficient for restoration projects where only a broad level of genotypic resolution is required. We recommend the use of AFLPs, due to the greater resolution of this technique, when species with a high conservation status are involved.
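    For readers unfamiliar with dominant markers, the comparison boils down to summarizing binary band-presence matrices per population. The toy sketch below computes one simple summary, the proportion of polymorphic loci, for a hypothetical source and restored population; the data are invented, not the māhoe dataset, and real studies would use richer statistics.

    ```python
    # Toy sketch: compare the proportion of polymorphic loci between a source
    # population and a restored population from binary AFLP/ISSR band matrices.
    # Rows = individuals, columns = loci (1 = band present). Data are invented.
    import numpy as np

    def polymorphic_fraction(band_matrix: np.ndarray) -> float:
        """Fraction of loci that are neither fixed present nor fixed absent."""
        freqs = band_matrix.mean(axis=0)
        return float(np.mean((freqs > 0.0) & (freqs < 1.0)))

    source_pop   = np.array([[1, 0, 1, 1, 0],
                             [1, 1, 1, 0, 0],
                             [0, 1, 1, 1, 0]])
    restored_pop = np.array([[1, 0, 1, 1, 0],
                             [1, 0, 1, 0, 0],
                             [1, 1, 1, 1, 0]])

    print(polymorphic_fraction(source_pop), polymorphic_fraction(restored_pop))
    ```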

    QUAL : A Provenance-Aware Quality Model

    The research described here is supported by the award made by the RCUK Digital Economy program to the dot.rural Digital Economy Hub; award reference: EP/G066051/1. Peer reviewed postprint.

    The Research Object Suite of Ontologies: Sharing and Exchanging Research Data and Methods on the Open Web

    Research in life sciences is increasingly being conducted in a digital and online environment. In particular, life scientists have been pioneers in embracing new computational tools to conduct their investigations. To support the sharing of digital objects produced during such research investigations, we have witnessed in the last few years the emergence of specialized repositories, e.g., DataVerse and FigShare. Such repositories provide users with the means to share and publish datasets that were used or generated in research investigations. While these repositories have proven their usefulness, interpreting and reusing evidence for most research results is a challenging task. Additional contextual descriptions are needed to understand how those results were generated and/or the circumstances under which they were concluded. Because of this, scientists are calling for models that go beyond the publication of datasets to systematically capture the life cycle of scientific investigations and provide a single entry point to access the information about the hypothesis investigated, the datasets used, the experiments carried out, the results of the experiments, the people involved in the research, etc. In this paper we present the Research Object (RO) suite of ontologies, which provide a structured container to encapsulate research data and methods along with essential metadata descriptions. Research Objects are portable units that enable the sharing, preservation, interpretation and reuse of research investigation results. The ontologies we present have been designed in the light of requirements that we gathered from life scientists. They have been built upon existing popular vocabularies to facilitate interoperability. Furthermore, we have developed tools to support the creation and sharing of Research Objects, thereby promoting and facilitating their adoption.
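    Purely as an illustration of the "structured container" idea, a research-object-style bundle can be pictured as a manifest that aggregates data, methods and results together with provenance annotations. The field names below are invented for readability; they are not the RO suite's actual ontology terms, which the paper defines formally.

    ```python
    # Purely illustrative sketch of what a research-object-style container might
    # aggregate; field names are invented for clarity and are NOT the RO suite's
    # actual ontology terms.
    research_object = {
        "id": "ro-example-001",
        "title": "Example investigation bundle",
        "aggregates": [
            {"path": "data/measurements.csv",  "role": "dataset"},
            {"path": "workflows/analysis.cwl", "role": "method"},
            {"path": "results/figure1.png",    "role": "result"},
        ],
        "annotations": [
            {"about": "results/figure1.png",
             "provenance": "generated by workflows/analysis.cwl from data/measurements.csv"},
        ],
        "creators": ["A. Researcher"],
    }

    # The point of such a container is that the dataset, the method that produced
    # the results, and the people involved travel together as one portable unit.
    print(len(research_object["aggregates"]), "aggregated resources")
    ```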

    Seeing the trees as well as the forest: the importance of managing forest genetic resources

    Reliable data on the status and trends of forest genetic resources are essential for their sustainable management. The reviews presented in this special edition of Forest Ecology and Management on forest genetic resources complement the first ever synthesis of the State of the World’s Forest Genetic Resources (SOW-FGR) that has just been published by the Food and Agriculture Organization. In this editorial, we present some of the key findings of the SOW-FGR and introduce the seven reviews presented in this special edition on: (1) tree genetic resources and livelihoods; (2) the benefits and dangers of international germplasm transfers; (3) genetic indicators for monitoring threats to populations and the effectiveness of ameliorative actions; (4) the genetic impacts of timber management practices; (5) genetic considerations in forest ecosystem restoration projects using native trees; (6) genetic-level responses to climate change; and (7) ex situ conservation approaches and their integration with in situ methods. Recommendations for action arising from the SOW-FGR, which are captured in the first Global Plan of Action for the Conservation, Sustainable Use and Development of Forest Genetic Resources, and the above articles are discussed. These include: increasing the awareness of the importance of and threats to forest genetic resources and the mainstreaming of genetic considerations into forest management and restoration; establishing common garden provenance trials to support restoration and climate change initiatives that extend to currently little-researched tree species; streamlining processes for germplasm exchange internationally for research and development; and the intelligent use of modern molecular marker methods as genetic indicators in management and for improvement purposes.

    Towards structured sharing of raw and derived neuroimaging data across existing resources

    Data sharing efforts increasingly contribute to the acceleration of scientific discovery. Neuroimaging data is accumulating in distributed domain-specific databases, yet there is currently no integrated access mechanism nor an accepted format for the critically important meta-data that is necessary for making use of the combined, available neuroimaging data. In this manuscript, we present work from the Derived Data Working Group, an open-access group sponsored by the Biomedical Informatics Research Network (BIRN) and the International Neuroinformatics Coordinating Facility (INCF) focused on practical tools for distributed access to neuroimaging data. The working group develops models and tools facilitating the structured interchange of neuroimaging meta-data and is making progress towards a unified set of tools for such data and meta-data exchange. We report on the key components required for integrated access to raw and derived neuroimaging data as well as associated meta-data and provenance across neuroimaging resources. The components include (1) a structured terminology that provides semantic context to data, (2) a formal data model for neuroimaging with robust tracking of data provenance, (3) a web service-based application programming interface (API) that provides a consistent mechanism to access and query the data model, and (4) a provenance library that can be used for the extraction of provenance data by image analysts and imaging software developers. We believe that the framework and set of tools outlined in this manuscript have great potential for solving many of the issues the neuroimaging community faces when sharing raw and derived neuroimaging data across the various existing database systems, for the purpose of accelerating scientific discovery.
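    A minimal sketch of how component (4), the provenance library, might be used by an image analyst is shown below. The class, field and file names are hypothetical, not the working group's actual API; the sketch only illustrates the kind of record that makes derived data traceable across resources.

    ```python
    # Hypothetical sketch of component (4): a small provenance library that an
    # image-analysis script could use to record how a derived image was produced.
    # Class and field names are invented, not the BIRN/INCF tools' actual API.
    from dataclasses import dataclass, field
    from typing import List


    @dataclass
    class ProvenanceRecord:
        output_path: str                  # derived image that was produced
        inputs: List[str]                 # raw images the step consumed
        tool: str                         # software that performed the step
        tool_version: str
        parameters: dict = field(default_factory=dict)

        def to_metadata(self) -> dict:
            """Serialize into a structure suitable for exchange between databases."""
            return {
                "output": self.output_path,
                "inputs": list(self.inputs),
                "tool": f"{self.tool} {self.tool_version}",
                "parameters": dict(self.parameters),
            }


    record = ProvenanceRecord(
        output_path="sub-01_desc-brain_mask.nii.gz",
        inputs=["sub-01_T1w.nii.gz"],
        tool="brain_extraction", tool_version="1.0",
        parameters={"fractional_intensity": 0.5},
    )
    print(record.to_metadata())
    ```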