8 research outputs found

    DataHub: Collaborative Data Science & Dataset Version Management at Scale

    Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale. Comment: 7 pages

    Predominant intragenic methylation is associated with gene expression characteristics in a bivalve mollusc

    Characterization of DNA methylation patterns in the Pacific oyster, Crassostrea gigas, indicates that this epigenetic mechanism plays an important functional role in gene regulation and may be involved in the regulation of developmental processes and environmental responses. However, previous studies have been limited to in silico analyses or characterization of DNA methylation at the single gene level. Here, we have employed a genome-wide approach to gain insight into how DNA methylation supports the regulation of the genome in C. gigas. Using a combination of methylation enrichment and high-throughput bisulfite sequencing, we have been able to map methylation at over 2.5 million individual CpG loci. This is the first high-resolution methylome generated for a molluscan species. Results indicate that methylation varies spatially across the genome with a majority of the methylated sites mapping to intragenic regions. The bisulfite sequencing data was combined with RNA-seq data to examine genome-wide relationships between gene body methylation and gene expression, where it was shown that methylated genes are associated with high transcript abundance and low variation in expression between tissue types. The combined data suggest DNA methylation plays a complex role in regulating genome activity in bivalves.

    Data Management in the Long Tail: Science, Software, and Service

    Scientists in all fields face challenges in managing and sustaining access to their research data. The larger and longer-term the research project, the more likely scientists are to have resources and dedicated staff to manage their technology and data, leaving those scientists whose work is based on smaller and shorter-term projects at a disadvantage. The volume and variety of data to be managed varies by many factors, only two of which are the number of collaborators and length of the project. As part of an NSF project to conceptualize the Institute for Empowering Long Tail Research, we explored opportunities offered by Software as a Service (SaaS). These cloud-based services are popular in business because they reduce costs and labor for technology management, and are gaining ground in scientific environments for similar reasons. We studied three settings where scientists conduct research in small and medium-sized laboratories. Two were NSF Science and Technology Centers (CENS and C-DEBI) and the third was a workshop of natural reserve scientists and managers. These laboratories have highly diverse data and practices, make minimal use of standards for data or metadata, and lack resources for data management or sustaining access to their data, despite recognizing the need. We found that SaaS could address technical needs for basic document creation, analysis, and storage, but did not support the diverse and rapidly changing needs for sophisticated domain-specific tools and services. These are much more challenging knowledge infrastructure requirements that require long-term investments by multiple stakeholders.

    QUERY FROM EXAMPLES

    Ph.D., Doctor of Philosophy

    Scalable diversification for data exploration platforms

    Querying heterogeneous data in an in-situ unified agile system

    Data integration provides a unified view of data by combining different data sources. In today’s multi-disciplinary and collaborative research environments, data is often produced and consumed by various means: multiple researchers operate on the data in different divisions to satisfy various research requirements, using different query processors and analysis tools. This makes data integration a crucial component of any successful data-intensive research activity. The fundamental difficulty is that data is heterogeneous not only in syntax, structure, and semantics, but also in the way it is accessed and queried. We introduce QUIS (QUery In-Situ), an agile query system equipped with a unified query language and a federated execution engine. It is capable of running queries on heterogeneous data sources in an in-situ manner. Its language provides advanced features such as virtual schemas, heterogeneous joins, and polymorphic result set representation. QUIS utilizes a federation of agents to transform a given input query written in its language to a (set of) computation models that are executable on the designated data sources. Federative query virtualization has the disadvantage that some aspects of a query may not be supported by the designated data sources. QUIS ensures that input queries are always fully satisfied. Therefore, if the target data sources do not fulfill all of the query requirements, QUIS detects the features that are lacking and complements them in a transparent manner. QUIS provides union and join capabilities over an unbound list of heterogeneous data sources; in addition, it offers solutions for heterogeneous query planning and optimization. In brief, QUIS is intended to mitigate data access heterogeneity through query virtualization, on-the-fly transformation, and federated execution. It offers in-situ querying, agile querying, heterogeneous data source querying, unified execution, late-bound virtual schemas, and remote execution.

    Database-as-a-Service for Long-Tail Science
