8 research outputs found

DataHub: Collaborative Data Science & Dataset Version Management at Scale

Author: Bhardwaj Anant
Bhattacherjee Souvik
Chavan Amit
Deshpande Amol
Elmore Aaron J.
Madden Samuel
Parameswaran Aditya G.
Publication date: 02/09/2014

Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale. Comment: 7 pages

arXiv.org e-Print Archive

CiteSeerX

DSpace@MIT

Predominant intragenic methylation is associated with gene expression characteristics in a bivalve mollusc

Author: Mackenzie R. Gavery
Steven B. Roberts
Publication venue: 'PeerJ'
Publication date: 01/11/2013

Characterization of DNA methylation patterns in the Pacific oyster, Crassostrea gigas, indicates that this epigenetic mechanism plays an important functional role in gene regulation and may be involved in the regulation of developmental processes and environmental responses. However, previous studies have been limited to in silico analyses or characterization of DNA methylation at the single gene level. Here, we have employed a genome-wide approach to gain insight into how DNA methylation supports the regulation of the genome in C. gigas. Using a combination of methylation enrichment and high-throughput bisulfite sequencing, we have been able to map methylation at over 2.5 million individual CpG loci. This is the first high-resolution methylome generated for a molluscan species. Results indicate that methylation varies spatially across the genome, with a majority of the methylated sites mapping to intragenic regions. The bisulfite sequencing data were combined with RNA-seq data to examine genome-wide relationships between gene body methylation and gene expression, where it was shown that methylated genes are associated with high transcript abundance and low variation in expression between tissue types. The combined data suggest DNA methylation plays a complex role in regulating genome activity in bivalves.

Directory of Open Access Journals

PubMed Central

Data Management in the Long Tail: Science, Software, and Service

Author: Borgman Christine L.
Cummings Rebekah L.
Darch Peter T.
Golshan Milena S.
Randles Bernadette M.
Sands Ashley E.
Wallis Jillian C.
Publication venue: 'Edinburgh University Library'
Publication date: 26/05/2016

Scientists in all fields face challenges in managing and sustaining access to their research data. The larger and longer term the research project, the more likely that scientists are to have resources and dedicated staff to manage their technology and data, leaving those scientists whose work is based on smaller and shorter term projects at a disadvantage. The volume and variety of data to be managed varies by many factors, only two of which are the number of collaborators and length of the project. As part of an NSF project to conceptualize the Institute for Empowering Long Tail Research, we explored opportunities offered by Software as a Service (SaaS). These cloud-based services are popular in business because they reduce costs and labor for technology management, and are gaining ground in scientific environments for similar reasons. We studied three settings where scientists conduct research in small and medium-sized laboratories. Two were NSF Science and Technology Centers (CENS and C-DEBI) and the third was a workshop of natural reserve scientists and managers. These laboratories have highly diverse data and practices, make minimal use of standards for data or metadata, and lack resources for data management or sustaining access to their data, despite recognizing the need. We found that SaaS could address technical needs for basic document creation, analysis, and storage, but did not support the diverse and rapidly changing needs for sophisticated domain-specific tools and services. These are much more challenging knowledge infrastructure requirements that require long-term investments by multiple stakeholders.

Crossref

eScholarship - University of California

International Journal of Digital Curation

QUERY FROM EXAMPLES

Author: LI HAO
Publication date: 30/06/2016

Ph.D., Doctor of Philosophy

ScholarBank@NUS

Scalable diversification for data exploration platforms

Author: Khan Hina Anwar
Publication venue: 'University of Queensland Library'
Publication date: 18/11/2016

University of Queensland eSpace

Querying heterogeneous data in an in-situ unified agile system

Author: Chamanara Javad
Publication date: 01/01/2018

Data integration provides a unified view of data by combining different data sources. In today’s multi-disciplinary and collaborative research environments, data is often produced and consumed by various means, and multiple researchers operate on the data in different divisions to satisfy various research requirements, using different query processors and analysis tools. This makes data integration a crucial component of any successful data intensive research activity. The fundamental difficulty is that data is heterogeneous not only in syntax, structure, and semantics, but also in the way it is accessed and queried. We introduce QUIS (QUery In-Situ), an agile query system equipped with a unified query language and a federated execution engine. It is capable of running queries on heterogeneous data sources in an in-situ manner. Its language provides advanced features such as virtual schemas, heterogeneous joins, and polymorphic result set representation. QUIS utilizes a federation of agents to transform a given input query written in its language to a (set of) computation models that are executable on the designated data sources. Federative query virtualization has the disadvantage that some aspects of a query may not be supported by the designated data sources. QUIS ensures that input queries are always fully satisfied. Therefore, if the target data sources do not fulfill all of the query requirements, QUIS detects the features that are lacking and complements them in a transparent manner. QUIS provides union and join capabilities over an unbound list of heterogeneous data sources; in addition, it offers solutions for heterogeneous query planning and optimization. In brief, QUIS is intended to mitigate data access heterogeneity through query virtualization, on-the-fly transformation, and federated execution. It offers in-situ querying, agile querying, heterogeneous data source querying, unified execution, late-bound virtual schemas, and remote execution.

Digitale Bibliothek Thüringen

Database-as-a-Service for Long-Tail Science

Author: A. Bouch
B. Boeckmann
D. Gotz
H. Elmeleegy
J. Lin
J. Mackinlay
M. Dörk
P.G. Brown
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

Crossref