2,523 research outputs found

    Biological Sequence Indexing Using Persistent Java

    This thesis makes three contributions in the area of computing science. Our first contribution is the recognition that new data types produced by large-scale biological research techniques lead to a flood of data which creates new challenges in the areas of data indexing, integration, manipulation and visualisation. The second contribution is a new research methodology which combines orthogonal persistence with an empirical evaluation of disk-resident suffix indexes. This methodology allowed us to develop a practical algorithm for the construction of suffix trees on disk up to any size supported by the available file and addressing space, which has hitherto not been possible. The third contribution is a new experimental methodology for examining the usefulness of suffix indexes, and the use of this methodology in an empirical investigation of the indexing gain achieved by combining an approximate matching algorithm with a large suffix index. These results are presented against the background of the changing technological landscape affecting life sciences and bioinformatics research and the resulting need for new computing solutions.

    Building an Archive with Saada

    Saada transforms a set of heterogeneous FITS files or VOTables of various categories (images, tables, spectra ...) into a database without writing code. Databases created with Saada come with a rich Web interface and an Application Programming Interface (API). They support the four most common VO services. Such databases can mix various categories of data in multiple collections. They allow direct access to the original data while providing a homogeneous view thanks to an internal data model compatible with the characterization axis defined by the VO. The data collections can be bound to each other with persistent links, making relevant browsing paths and allowing data-mining oriented queries. Comment: 18 pages, 5 figures Special VO issu

    Flexible web-based integration of distributed large-scale human protein interaction maps

    Protein-protein interactions constitute the backbone of many molecular processes. This has motivated the recent construction of several large-scale human protein-protein interaction maps [1-10]. Although these maps clearly offer a wealth of information, their use is challenging: complexity, rapid growth, and fragmentation of interaction data hamper their usability. To overcome these hurdles, we have developed a publicly accessible database termed UniHI (Unified Human Interactome) for integration of human protein-protein interaction data. This database is designed to provide biomedical researchers with a common platform for exploring previously disconnected human interaction maps. UniHI offers researchers flexible integrated tools for accessing comprehensive information about the human interactome. Several features included in UniHI allow users to perform various types of network-oriented and functional analysis. At present, UniHI contains over 160,000 distinct interactions between 17,000 unique proteins from ten major interaction maps derived by both computational and experimental approaches [1-10]. Here we describe the details of the implementation and maintenance of UniHI and discuss the challenges that have to be addressed for a successful integration of interaction data.

    A disk-resident suffix tree index and generic framework for managing tunable indexes

    This thesis introduces two related technologies. The first is a disk-resident index for biological sequence data, and the second is a framework and toolkit for the management of operational parameters for applications of which this index is typical. The Top-Compressed Suffix Tree is a novel data structure that can be used to provide a scalable, disk-resident index for large sequences. This data structure is based on the suffix tree, but has been designed to overcome the problems associated with using such structures on secondary memory. Top-Compressed Suffix Trees can be constructed incrementally, allowing indexes to be created that are larger than the amount of available main memory. Correspondingly, querying such an index only requires part of the data structure to be resident in main memory, thus allowing support for on-demand faulting and eviction of index sections during search. Such an index may be of great benefit to scientists requiring efficient access to vast repositories of genomic data. The Generic Index Development and Operation Framework (GIDOF) is a framework and toolkit that supports various tasks relating to the management of operational parameters. The performance of an index's implementation is typically influenced by several operational parameters that must be tuned carefully if optimum performance is to be obtained. Indexes implemented using GIDOF can be structured in such a way that values of selected operational parameters can be adjusted, resulting in an index implementation that can be tuned to suit a given workload or system environment. This thesis presents a detailed description of the design of both the Top-Compressed Suffix Tree and the algorithms that operate over it. Extensive performance measurements are then presented and discussed, covering such aspects of index performance as construction time, average query performance and the size of the completed index. An overview of the GIDOF parameter model and toolkit is then given together with examples of how this framework can be used to manage tunable indexes, such as the Top-Compressed Suffix Tree.
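
    The idea of faulting index sections into memory on demand and evicting them during search can be illustrated with a toy example. The sketch below is not the Top-Compressed Suffix Tree: it assumes a much simpler layout in which suffix positions are grouped by their first `k` characters, a dictionary stands in for the disk, and a small least-recently-used cache holds the resident sections. The class name `PartitionedSuffixIndex` and the parameters `k` and `cache_size` are hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class PartitionedSuffixIndex:
    """Toy index: suffix start positions grouped by their first k characters.

    Each group stands in for one index section on secondary storage.
    At most cache_size sections are resident in memory at any time.
    """

    def __init__(self, text, k=2, cache_size=2):
        self.text = text
        self.k = k
        self.cache_size = cache_size
        self.cache = OrderedDict()  # resident sections, oldest first
        self.loads = 0              # number of simulated section faults
        # Simulated "disk": section key -> suffix positions in suffix order.
        self._disk = {}
        for i in range(len(text)):
            self._disk.setdefault(text[i:i + k], []).append(i)
        for positions in self._disk.values():
            positions.sort(key=lambda i: text[i:])

    def _section(self, key):
        """Return a section, faulting it in and evicting the LRU one if needed."""
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        self.loads += 1
        section = self._disk.get(key, [])
        self.cache[key] = section
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used
        return section

    def find(self, pattern):
        """Return the sorted start positions of pattern in the text."""
        if len(pattern) < self.k:
            raise ValueError("pattern must have at least k characters")
        section = self._section(pattern[:self.k])
        m = len(pattern)
        # Binary search for the first suffix whose m-character prefix >= pattern.
        lo, hi = 0, len(section)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.text[section[mid]:section[mid] + m] < pattern:
                lo = mid + 1
            else:
                hi = mid
        # Matching suffixes are contiguous in suffix order.
        hits = []
        while lo < len(section) and self.text[section[lo]:section[lo] + m] == pattern:
            hits.append(section[lo])
            lo += 1
        return sorted(hits)
```

    In this sketch, `k` and `cache_size` play the role of tunable operational parameters: a larger `k` gives smaller sections, and a larger cache trades memory for fewer section loads. How the thesis actually partitions and tunes its index is not stated in the abstract.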

    Critical evaluation of the JDO API for the persistence and portability requirements of complex biological databases

    BACKGROUND: Complex biological database systems have become key computational tools used daily by scientists and researchers. Many of these systems must be capable of executing on multiple different hardware and software configurations and are also often made available to users via the Internet. We have used the Java Data Object (JDO) persistence technology to develop the database layer of such a system known as the SigPath information management system. SigPath is an example of a complex biological database that needs to store various types of information connected by many relationships. RESULTS: Using this system as an example, we perform a critical evaluation of current JDO technology and discuss the suitability of the JDO standard to achieve portability, scalability and performance. We show that JDO supports portability of the SigPath system from a relational database backend to an object database backend and achieves acceptable scalability. To answer the performance question, we have created the SigPath JDO application benchmark that we distribute under the GNU General Public License. This benchmark can be used as an example of using JDO technology to create a complex biological database and makes it possible for vendors and users of the technology to evaluate the performance of other JDO implementations for similar applications. CONCLUSIONS: The SigPath JDO benchmark and our discussion of JDO technology in the context of biological databases will be useful to bioinformaticians who design new complex biological databases and aim to create systems that can be ported easily to a variety of database backends.

    Extending the 5S Framework of Digital Libraries to support Complex Objects, Superimposed Information, and Content-Based Image Retrieval Services

    Advanced services in digital libraries (DLs) have been developed and widely used to address the required capabilities of an assortment of systems as DLs expand into diverse application domains. These systems may require support for images (e.g., Content-Based Image Retrieval), Complex (information) Objects, and use of content at fine grain (e.g., Superimposed Information). Due to the lack of consensus on precise theoretical definitions for those services, implementation efforts often involve ad hoc development, leading to duplication and interoperability problems. This article presents a methodology to address those problems by extending a precisely specified minimal digital library (in the 5S framework) with formal definitions of the aforementioned services. The theoretical extensions of digital library functionality presented here are reinforced with practical case studies as well as scenarios for the individual and integrative use of services to balance theory and practice. This methodology implies that other advanced services can be integrated into our current extended framework whenever they are identified. The theoretical definitions and case study we present may impact future development efforts and a wide range of digital library researchers, designers, and developers.

    SPARQ2L: Towards support for subgraph extraction queries in RDF databases

    Many applications in analytical domains need to “connect the dots”, i.e., query about the structure of data. In bioinformatics, for example, it is typical to want to query about interactions between proteins. The aim of such queries is to “extract” relationships between entities, i.e., paths from a data graph. Often, such queries will specify certain constraints that qualifying results must satisfy, e.g., paths involving a set of mandatory nodes. Unfortunately, most present-day Semantic Web query languages, including the current draft of the anticipated recommendation SPARQL, lack the ability to express queries about arbitrary path structures in data. In addition, many systems that support some limited form of path queries rely on main-memory graph algorithms, limiting their applicability to very large scale graphs. In this paper, we present an approach for supporting Path Extraction queries. Our proposal comprises (i) a query language SPARQ2L, which extends SPARQL with path variables and path variable constraint expressions, and (ii) a novel query evaluation framework based on efficient algebraic techniques for solving path problems, which allows path queries to be efficiently evaluated on disk-resident RDF graphs. The effectiveness of our proposal is demonstrated by a performance evaluation of our approach on both real-world and synthetic datasets.
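
    To make the notion of a path extraction query with mandatory nodes concrete, the sketch below enumerates simple paths between two resources in a small set of triples and keeps only those that pass through every required node. It is neither SPARQ2L syntax nor the algebraic evaluation framework of the paper: it is a plain in-memory depth-first search, the kind of approach the abstract notes does not scale to very large graphs. The function name `extract_paths` and the sample protein identifiers are hypothetical.

```python
def extract_paths(triples, source, target, mandatory=frozenset()):
    """Return all simple paths from source to target that visit every
    node in mandatory. Each path is a list of (subject, predicate, object)
    triples, followed in subject-to-object direction."""
    adjacency = {}
    for s, p, o in triples:
        adjacency.setdefault(s, []).append((p, o))
    required = set(mandatory)
    results = []

    def walk(node, visited, path):
        if node == target:
            # Keep the path only if all mandatory nodes were visited.
            if required <= visited:
                results.append(list(path))
            return
        for predicate, nxt in adjacency.get(node, []):
            if nxt not in visited:  # simple paths only: no repeated nodes
                visited.add(nxt)
                path.append((node, predicate, nxt))
                walk(nxt, visited, path)
                path.pop()
                visited.remove(nxt)

    walk(source, {source}, [])
    return results


# Hypothetical interaction data, for illustration only.
triples = [
    ("P1", "interactsWith", "P2"),
    ("P2", "interactsWith", "P3"),
    ("P1", "interactsWith", "P3"),
    ("P3", "interactsWith", "P4"),
]
all_paths = extract_paths(triples, "P1", "P4")
via_p2 = extract_paths(triples, "P1", "P4", mandatory={"P2"})
```

    Here `all_paths` contains two paths (through `P2` and `P3`, and directly through `P3`), while `via_p2` keeps only the one that passes through `P2`.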