
    Decibel: the relational dataset branching system

    As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This results not only in wasted storage, but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs.
    Funding: National Science Foundation (U.S.) (grants 1513972, 1513407, 1513443); Intel Science and Technology Center for Big Data.
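As a rough illustration of the branching semantics described above (only that; the class and method names below are invented for this sketch and are not Decibel's storage engine interfaces), the following Python snippet models branches as named pointers to versions, where each version records only the tuples added or removed relative to its parent:

```python
# Illustrative sketch of dataset branching semantics, not Decibel's actual
# storage engines: each version stores a delta against its parent, and a
# branch is a named pointer to the head version of a lineage.

class Version:
    def __init__(self, parent=None):
        self.parent = parent   # previous version in this lineage (or None)
        self.added = set()     # tuples added in this version
        self.removed = set()   # tuples removed in this version

    def tuples(self):
        """Materialise the full tuple set by walking the parent chain."""
        base = self.parent.tuples() if self.parent else set()
        return (base - self.removed) | self.added


class BranchingDataset:
    def __init__(self):
        self.branches = {"master": Version()}

    def branch(self, src, dst):
        # A new branch simply starts from the current head of the source branch.
        self.branches[dst] = Version(parent=self.branches[src])

    def commit(self, branch, added=(), removed=()):
        head = Version(parent=self.branches[branch])
        head.added.update(added)
        head.removed.update(removed)
        self.branches[branch] = head

    def scan(self, branch):
        return self.branches[branch].tuples()


ds = BranchingDataset()
ds.commit("master", added={("alice", 1), ("bob", 2)})
ds.branch("master", "cleaning")
ds.commit("cleaning", removed={("bob", 2)}, added={("bob", 3)})
print(sorted(ds.scan("cleaning")))  # [('alice', 1), ('bob', 3)]
print(sorted(ds.scan("master")))    # the parent branch is unaffected
```

A real engine must also decide how versions are laid out on disk and indexed, which is the trade-off space the three versioned storage engine designs evaluated in the paper explore.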

    Extending Versioning in Collaborative Research

    In the digital world, software development is the domain in which the management of versions, also called versioning, was first introduced and became common practice. Versioning mechanisms usually run in the backend of web environments for data security reasons. However, what is equally important is the use of versioning as a mechanism for evidence and reference for the users. When users collaboratively create research data on web platforms, a version history enables others to verify and reanalyse the data. Moreover, versioning mechanisms could support scholars in making research more transparent and discursive. Following this reasoning, this paper discusses versioning as performed by version control systems and investigates how this kind of versioning can be extended to serve scholarly practice.

    The Sea of Stuff: a model to manage shared mutable data in a distributed environment

    Managing data is one of the main challenges in distributed systems and computer science in general. Data is created, shared, and managed across heterogeneous distributed systems of users, services, applications, and devices without a clear and comprehensive data model. This technological fragmentation and lack of a common data model result in a poor understanding of what data is, how it evolves over time, how it should be managed in a distributed system, and how it should be protected and shared. From a user perspective, for example, backing up data over multiple devices is a hard and error-prone process, and synchronising data with a cloud storage service can result in conflicts and unpredictable behaviours. This thesis identifies three challenges in data management: (1) how to extend the current data abstractions so that content, for example, is accessible irrespective of its location, versionable, and easy to distribute; (2) how to enable transparent data storage relative to locations, users, applications, and services; and (3) how to allow data owners to protect data against malicious users and automatically control content over a distributed system. These challenges are studied in detail in relation to the current state of the art and addressed throughout the rest of the thesis. The artefact of this work is the Sea of Stuff (SOS), a generic data model of immutable self-describing location-independent entities that allow the construction of a distributed system where data is accessible and organised irrespective of its location, easy to protect, and can be automatically managed according to a set of user-defined rules. The evaluation of this thesis demonstrates the viability of the SOS model for managing data in a distributed system and using user-defined rules to automatically manage data across multiple nodes.
    "This work was supported by Adobe Systems, Inc. and EPSRC [grant number EP/M506631/1]" (from the Acknowledgements page).
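The core idea of immutable, self-describing, location-independent entities can be sketched with content addressing: an entity's identifier is derived from its content, so any node holding the bytes can serve it, and a new version is simply a new entity that points back at its predecessor. The snippet below is a loose illustration of that general idea under those assumptions, not the SOS manifest model itself:

```python
# Rough sketch of immutable, content-addressed, self-describing entities
# (a generic illustration, not the SOS model's actual manifest format).
import hashlib
import json


def store(payload: dict, repository: dict) -> str:
    """Store an immutable entity; its identifier is derived from its content."""
    blob = json.dumps(payload, sort_keys=True).encode()
    guid = hashlib.sha256(blob).hexdigest()
    repository[guid] = blob          # any replica keyed by guid is equivalent
    return guid


node_a, node_b = {}, {}

v1 = store({"type": "atom", "data": "draft text", "previous": None}, node_a)
# A new version is a new immutable entity that records its predecessor.
v2 = store({"type": "atom", "data": "revised text", "previous": v1}, node_a)

# Location independence: replicating the blob to another node does not change
# its identity, so either node can serve the same entity.
node_b[v2] = node_a[v2]
assert hashlib.sha256(node_b[v2]).hexdigest() == v2
```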

    Linked Research on the Decentralised Web

    This thesis is about research communication in the context of the Web. I analyse literature which reveals how researchers are making use of Web technologies for knowledge dissemination, as well as how individuals are disempowered by the centralisation of certain systems, such as academic publishing platforms and social media. I share my findings on the feasibility of a decentralised and interoperable information space where researchers can control their identifiers whilst fulfilling the core functions of scientific communication: registration, awareness, certification, and archiving. The contemporary research communication paradigm operates under a diverse set of sociotechnical constraints, which influence how units of research information and personal data are created and exchanged. Economic forces and non-interoperable system designs mean that researcher identifiers and research contributions are largely shaped and controlled by third-party entities; participation requires the use of proprietary systems. From a technical standpoint, this thesis takes a deep look at the semantic structure of research artifacts, and how they can be stored, linked and shared in a way that is controlled by individual researchers, or delegated to trusted parties. Further, I find that the ecosystem was lacking a technical Web standard able to fulfil the awareness function of research communication. Thus, I contribute a new communication protocol, Linked Data Notifications (published as a W3C Recommendation), which enables decentralised notifications on the Web, and provide implementations pertinent to the academic publishing use case. So far we have seen decentralised notifications applied in research dissemination or collaboration scenarios, as well as for archival activities and scientific experiments. Another core contribution of this work is a Web standards-based implementation of a client-side tool, dokieli, for decentralised article publishing, annotations and social interactions. dokieli can be used to fulfil the scholarly functions of registration, awareness, certification, and archiving, all in a decentralised manner, returning control of research contributions and discourse to individual researchers. The overarching conclusion of the thesis is that Web technologies can be used to create a fully functioning ecosystem for research communication. Using the framework of Web architecture, and loosely coupling the four functions, an accessible and inclusive ecosystem can be realised whereby users are able to use and switch between interoperable applications without interfering with existing data. Technical solutions alone do not suffice, of course, so this thesis also takes into account the need for a change in the traditional mode of thinking amongst scholars, and presents the Linked Research initiative as an ongoing effort toward researcher autonomy in a social system, and universal access to human- and machine-readable information. Outcomes of this outreach work so far include an increase in the number of individuals self-hosting their research artifacts, workshops publishing accessible proceedings on the Web, in-the-wild experiments with open and public peer-review, and semantic graphs of contributions to conference proceedings and journals (the Linked Open Research Cloud). Some of the future challenges include: addressing the social implications of decentralised Web publishing, as well as the design of ethically grounded interoperable mechanisms; cultivating privacy-aware information spaces; personal or community-controlled on-demand archiving services; and further design of decentralised applications that are aware of the core functions of scientific communication.
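Linked Data Notifications is a published W3C Recommendation, so its sender side can be sketched concretely: discover the inbox advertised by a target resource, then POST a JSON-LD notification to it. The target URL and notification payload below are hypothetical, discovery via an ldp:inbox triple in the resource body is omitted, and there is no error handling; this is only a minimal sketch of the flow:

```python
# Minimal sketch of the sender side of Linked Data Notifications (LDN):
# discover the target's inbox via the HTTP Link header, then POST a
# JSON-LD notification to that inbox.
import requests

TARGET = "https://example.org/articles/my-article"   # hypothetical resource
LDP_INBOX = "http://www.w3.org/ns/ldp#inbox"


def discover_inbox(target: str) -> str:
    """Return the inbox URL advertised by the target via a Link header."""
    resp = requests.head(target)
    links = requests.utils.parse_header_links(resp.headers.get("Link", ""))
    for link in links:
        if link.get("rel") == LDP_INBOX:
            return link["url"]
    raise ValueError("no LDN inbox advertised by target")


notification = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Announce",
    "actor": "https://example.org/profile#me",
    "object": "https://example.org/annotations/42",
    "target": TARGET,
}

inbox = discover_inbox(TARGET)
requests.post(inbox, json=notification,
              headers={"Content-Type": "application/ld+json"})
```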

    Versioning Cultural Objects : Digital Approaches

    This volume approaches an understanding of the term versioning in the broadest sense, discussing ideas about how versions differ across forms of media, including text, image, and sound. Versions of cultural objects are identified, defined, articulated, and analysed through diverse mechanisms in different fields of research. The study of versions allows for the investigation of the creative processes behind the conception of works and a closer inspection of their socio-political contexts, and promotes investigation of their provenance and circulation. Chapters in this volume include discussion of what a “version” means in different fields, case studies implementing digital versioning techniques, conceptual models for representing versions digitally, and computational and management issues for digital projects.

    Graph database management systems: storage, management and query processing

    The proliferation of graph data, generated from diverse sources, has given rise to many research efforts concerning graph analysis. Interactions in social networks, publication networks, protein networks, software code dependencies and transportation systems are all examples of graph-structured data originating from a variety of application domains and demonstrating different characteristics. In recent years, graph database management systems (GDBMS) have been introduced for the management and analysis of graph data. Motivated by the growing number of real-life applications making use of graph database systems, this thesis focuses on the effectiveness and efficiency aspects of such systems. Specifically, we study the following topics relevant to graph database systems: (i) modeling large-scale applications in GDBMS; (ii) storage and indexing issues in GDBMS; and (iii) efficient query processing in GDBMS. In this thesis, we adopt two different application scenarios to examine how graph database systems can model complex features and perform relevant queries on each of them. Motivated by the popular application of social network analytics, we selected Twitter, a microblogging platform, to conduct our detailed analysis. Addressing limitations of existing models, we propose a data model for the Twittersphere that proactively captures Twitter-specific interactions. We examine the feasibility of running analytical queries on GDBMS and offer empirical analysis of the performance of the proposed approach. Next, we consider a use case of modeling software code dependencies in a graph database system, and investigate how these systems can support capturing the evolution of a codebase over time. We study a code comprehension tool that extracts software dependencies and stores them in a graph database. On a versioned graph built using a very large codebase, we demonstrate how existing code comprehension queries can be efficiently processed, and also show the benefit of running queries across multiple versions. Another important aspect of this thesis is the study of storage aspects of graph systems. Throughput of many graph queries can be significantly affected by disk I/O performance; therefore graph database systems need to focus on effective graph storage for optimising disk operations. We observe that the locality of edges plays an important role, and we address the edge-labeling problem, which aims to label both incoming and outgoing edges of a graph maximising the ‘edge-consecutiveness’ metric. By achieving a better layout and locality of edges on disk, we show that our proposed algorithms result in significantly improved disk I/O performance, leading to faster execution of neighbourhood queries. Some applications require the integrated processing of queries from the graph and textual domains within a graph database system. Aggregation of these dimensions facilitates gaining key insights in several application scenarios. For example, in a social network setting, one may want to find the closest k users in the network (graph traversal) who talk about a particular topic A (textual search). Motivated by such practical use cases, in this thesis we study the top-k social-textual ranking query, which essentially requires efficient combination of a keyword search query with a graph traversal. We propose algorithms that leverage graph partitioning techniques, based on the premise that socially close users will be placed within the same partition, allowing more localised computations. We show that our proposed approaches are able to achieve significantly better results compared to standard baselines and demonstrate robust behaviour under changing parameters.
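The top-k social-textual ranking query described above combines a graph traversal with a keyword filter. The snippet below is a deliberately naive single-machine baseline of that query shape (toy data, hop-count distance, substring matching), not the partition-aware algorithms the thesis proposes:

```python
# Naive top-k social-textual query: find the k users closest to a query user
# (graph traversal) whose posts mention a keyword (textual search).
import networkx as nx

G = nx.Graph()
G.add_edges_from([("ann", "bob"), ("bob", "cat"), ("ann", "dan"), ("dan", "eve")])
posts = {
    "bob": "thoughts on graph databases",
    "cat": "holiday photos",
    "dan": "benchmarking graph databases",
    "eve": "gardening tips",
}


def topk_social_textual(graph, posts, source, keyword, k):
    # Hop distance from the query user to everyone reachable in the network.
    dist = nx.single_source_shortest_path_length(graph, source)
    # Keep users whose posts match the keyword, ranked by social closeness.
    matches = [(d, u) for u, d in dist.items()
               if u != source and keyword in posts.get(u, "")]
    return [u for _, u in sorted(matches)[:k]]


print(topk_social_textual(G, posts, "ann", "graph databases", 2))
# -> ['bob', 'dan']  (both mention the topic and are one hop away)
```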

    Generation and Applications of Knowledge Graphs in Systems and Networks Biology

    The acceleration in the generation of data in the biomedical domain has necessitated the use of computational approaches to assist in its interpretation. However, these approaches rely on the availability of high-quality, structured, formalized biomedical knowledge. This thesis has two goals: to improve methods for curation and semantic data integration in order to generate high-granularity biological knowledge graphs, and to develop novel methods for using prior biological knowledge to propose new biological hypotheses. The first two publications describe an ecosystem for handling biological knowledge graphs encoded in the Biological Expression Language throughout the stages of curation, visualization, and analysis. Further, the second two publications describe the reproducible acquisition and integration of high-granularity knowledge with low contextual specificity from structured biological data sources on a massive scale, and support the semi-automated curation of new content at high speed and precision. After building the ecosystem and acquiring content, the last three publications in this thesis demonstrate three different applications of biological knowledge graphs in modeling and simulation. The first demonstrates the use of agent-based modeling for simulation of neurodegenerative disease biomarker trajectories using biological knowledge graphs as priors. The second applies network representation learning to prioritize nodes in biological knowledge graphs based on corresponding experimental measurements to identify novel targets. Finally, the third uses biological knowledge graphs and develops algorithmics to deconvolute the mechanism of action of drugs, which could also serve to identify drug repositioning candidates. Ultimately, this thesis lays the groundwork for production-level applications of drug repositioning algorithms and other knowledge-driven approaches to analyzing biomedical experiments.
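To convey the flavour of knowledge-driven node prioritisation (the second application above), the sketch below ranks nodes in a tiny hypothetical knowledge graph with personalised PageRank, using invented experimental scores as the personalisation vector. Note that this substitutes personalised PageRank for the network representation learning used in the thesis; it only illustrates the idea of combining graph priors with measured data:

```python
# Loose illustration of knowledge-driven node prioritisation, not the
# thesis's method: personalised PageRank over a hypothetical knowledge
# graph, biased by hypothetical experimental scores.
import networkx as nx

kg = nx.DiGraph()
kg.add_edges_from([
    ("APP", "BACE1"),    # hypothetical "increases" relations
    ("BACE1", "CASP3"),
    ("MAPT", "CASP3"),
    ("APP", "MAPT"),
])

# Hypothetical differential-expression scores from an experiment.
measurements = {"APP": 0.9, "MAPT": 0.7, "BACE1": 0.1, "CASP3": 0.05}

scores = nx.pagerank(kg, personalization=measurements)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}\t{score:.3f}")
```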