
    RegPrecise web services interface: programmatic access to the transcriptional regulatory interactions in bacteria reconstructed by comparative genomics.

    A web services application programming interface (API) was developed to provide programmatic access to the regulatory interactions accumulated in the RegPrecise database (http://regprecise.lbl.gov), a core resource on transcriptional regulation for the microbial domain of the Department of Energy (DOE) Systems Biology Knowledgebase. RegPrecise captures and visualizes regulogs, sets of genes controlled by orthologous regulators in several closely related bacterial genomes, reconstructed by comparative genomics. The current release of RegPrecise 2.0 includes >1400 regulogs controlled either by protein transcription factors or by conserved ribonucleic acid regulatory motifs in >250 genomes from 24 taxonomic groups of bacteria. The reference regulons accumulated in RegPrecise can serve as a basis for automatic annotation of regulatory interactions in newly sequenced genomes. The API provides efficient access to the RegPrecise data through a comprehensive set of 14 web service resources. The RegPrecise web services API is freely accessible at http://regprecise.lbl.gov/RegPrecise/services.jsp with no login requirements.
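
    As an illustration of how such an API might be consumed programmatically, the minimal Python sketch below issues HTTP GET requests against a RegPrecise-style web service. Only the services landing page URL comes from the abstract; the REST root, the resource name "regulogs", and the "collectionId" parameter are hypothetical placeholders, since the 14 documented resources are not enumerated here.

        # Minimal sketch of a client for a RegPrecise-style web services API.
        # The base URL and resource/parameter names below are illustrative
        # assumptions; see http://regprecise.lbl.gov/RegPrecise/services.jsp
        # for the actual 14 resources and their parameters.
        import requests

        BASE_URL = "http://regprecise.lbl.gov/Services/rest"  # hypothetical REST root

        def fetch_resource(resource, **params):
            """Issue a GET request against one web service resource and parse JSON."""
            response = requests.get(f"{BASE_URL}/{resource}", params=params, timeout=30)
            response.raise_for_status()
            return response.json()

        if __name__ == "__main__":
            # Hypothetical call: list the regulogs in one collection.
            regulogs = fetch_resource("regulogs", collectionId=1)
            print(f"Retrieved {len(regulogs)} regulog records")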

    Statistical Models for Co-occurrence Data

    Modeling and predicting co-occurrences of events is a fundamental problem of unsupervised learning. In this contribution we develop a statistical framework for analyzing co-occurrence data in a general setting where elementary observations are joint occurrences of pairs of abstract objects from two finite sets. The main challenge for statistical models in this context is to overcome the inherent data sparseness and to estimate the probabilities for pairs which were rarely observed or even unobserved in a given sample set. Moreover, it is often of considerable interest to extract grouping structure or to find a hierarchical data organization. A novel family of mixture models is proposed which explains the observed data by a finite number of shared aspects or clusters. This provides a common framework for statistical inference and structure discovery and also includes several recently proposed models as special cases. Adopting the maximum likelihood principle, EM algorithms are derived to fit the model parameters. We develop improved versions of EM which largely avoid overfitting problems and overcome the inherent locality of EM-based optimization. Among the broad variety of possible applications, e.g., in information retrieval, natural language processing, data mining, and computer vision, we have chosen document retrieval, the statistical analysis of noun/adjective co-occurrence, and the unsupervised segmentation of textured images to test and evaluate the proposed algorithms.
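
    The shared-aspect idea can be made concrete with a short NumPy sketch of EM for a simple aspect-style mixture over a co-occurrence count matrix; it is a generic illustration of the model class, not the authors' exact formulation or their improved (annealed) EM variants.

        # EM for a simple aspect model over a co-occurrence count matrix n[x, y]:
        #     P(x, y) = sum_a P(a) * P(x|a) * P(y|a)
        # Generic sketch of the shared-aspect idea; not the paper's improved EM.
        import numpy as np

        def aspect_model_em(counts, n_aspects, n_iter=100, seed=0):
            rng = np.random.default_rng(seed)
            nx, ny = counts.shape
            p_a = np.full(n_aspects, 1.0 / n_aspects)
            p_x_a = rng.random((nx, n_aspects)); p_x_a /= p_x_a.sum(axis=0)
            p_y_a = rng.random((ny, n_aspects)); p_y_a /= p_y_a.sum(axis=0)
            for _ in range(n_iter):
                # E-step: posterior P(a | x, y) for every observed pair.
                joint = p_a[None, None, :] * p_x_a[:, None, :] * p_y_a[None, :, :]
                post = joint / joint.sum(axis=2, keepdims=True).clip(min=1e-12)
                # M-step: re-estimate parameters from expected sufficient statistics.
                weighted = counts[:, :, None] * post
                p_x_a = weighted.sum(axis=1)
                p_x_a /= p_x_a.sum(axis=0).clip(min=1e-12)
                p_y_a = weighted.sum(axis=0)
                p_y_a /= p_y_a.sum(axis=0).clip(min=1e-12)
                p_a = weighted.sum(axis=(0, 1))
                p_a /= p_a.sum()
            return p_a, p_x_a, p_y_a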

    Efficiently Supporting Hierarchy and Data Updates in DNA Storage

    We propose a novel and flexible DNA-storage architecture that provides the notion of hierarchy among the objects tagged with the same primer pair and enables efficient data updates. In contrast to prior work, in our architecture a pair of PCR primers of length 20 does not define a single object, but an independent storage partition, which is internally managed in an independent way with its own index structure. We make the observation that, while the number of mutually compatible primer pairs is limited, the internal address space available to any pair of primers (i.e., partition) is virtually unlimited. We expose and leverage the flexibility with which this address space can be managed to provide rich and functional storage semantics, such as hierarchical data organization and efficient and flexible implementations of data updates. Furthermore, to leverage the full power of the prefix-based nature of PCR addressing, we define a methodology for transforming an arbitrary indexing scheme into a PCR-compatible equivalent. This allows us to run PCR with primers that can be variably extended to include a desired part of the index, and thus narrow down the scope of the reaction to retrieve a specific object (e.g., file or directory) within the partition with high precision. Our wet-lab evaluation demonstrates the practicality of the proposed ideas and shows a 140x reduction in sequencing cost for the retrieval of smaller objects within the partition.
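
    The prefix-based addressing can be illustrated with a small Python sketch in which each level of the hierarchy is encoded as a fixed-length DNA block appended to the 20-nt partition primer, so that extending the primer with more blocks narrows retrieval from the whole partition to a directory or a single file. The 2-bit-per-base encoding and block length used here are illustrative assumptions, not the encoding defined in the paper.

        # Hypothetical prefix addressing inside a primer-defined partition.
        BASES = "ACGT"

        def encode_level(index, block_len=4):
            """Encode one hierarchy level (a small integer) as a fixed-length DNA block."""
            digits = []
            for _ in range(block_len):
                digits.append(BASES[index % 4])
                index //= 4
            return "".join(reversed(digits))

        def address_for(path_indices, partition_primer):
            """Build the variably extended forward primer for a hierarchical path."""
            return partition_primer + "".join(encode_level(i) for i in path_indices)

        primer = "ACGTACGTACGTACGTACGT"       # placeholder 20-nt partition primer
        print(address_for([2], primer))       # scope: everything under directory 2
        print(address_for([2, 7], primer))    # scope: object 7 inside directory 2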

    High Performance Computing Applications in Remote Sensing Studies for Land Cover Dynamics

    Global and regional land cover studies require the ability to apply complex models on selected subsets of large amounts of multi-sensor and multi-temporal data sets that have been derived from raw instrument measurements using widely accepted pre-processing algorithms. The computational and storage requirements of most such studies far exceed what is possible on a single workstation environment. We have been pursuing a new approach that couples scalable and open distributed heterogeneous hardware with the development of high performance software for processing, indexing, and organizing remotely sensed data. Hierarchical data management tools are used to ingest raw data, create metadata, and organize the archived data so as to automatically achieve computational load balancing among the available nodes and minimize I/O overheads. We illustrate our approach with four specific examples. The first is the development of the first fast operational scheme for the atmospheric correction of Landsat TM scenes, while the second example focuses on image segmentation using a novel hierarchical connected components algorithm. Retrieval of global BRDF (Bidirectional Reflectance Distribution Function) in the red and near-infrared wavelengths using four years (1983 to 1986) of the Pathfinder AVHRR Land (PAL) data set is the focus of our third example. The fourth example is the development of a hierarchical data organization scheme that allows on-demand processing and retrieval of regional and global AVHRR data sets. Our results show that substantial improvements in computational times can be achieved by using high performance computing technology.
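
    As a toy illustration of the load-balancing goal, the sketch below greedily assigns raster tiles with estimated processing costs to compute nodes so that per-node work stays roughly even; the largest-first heuristic and the cost model are assumptions for illustration, not the scheme implemented by the paper's hierarchical data management tools.

        # Greedy largest-first assignment of tiles to the least-loaded node.
        import heapq

        def balance_tiles(tile_costs, n_nodes):
            """tile_costs: dict of tile id -> estimated processing cost."""
            heap = [(0.0, node, []) for node in range(n_nodes)]
            heapq.heapify(heap)
            for tile, cost in sorted(tile_costs.items(), key=lambda kv: -kv[1]):
                load, node, tiles = heapq.heappop(heap)   # least-loaded node so far
                tiles.append(tile)
                heapq.heappush(heap, (load + cost, node, tiles))
            return {node: (load, tiles) for load, node, tiles in heap}

        assignment = balance_tiles(
            {"t1": 5, "t2": 3, "t3": 8, "t4": 2, "t5": 7, "t6": 4}, n_nodes=3)
        for node, (load, tiles) in sorted(assignment.items()):
            print(node, load, tiles)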

    ROOT - A C++ Framework for Petabyte Data Storage, Statistical Analysis and Visualization

    ROOT is an object-oriented C++ framework conceived in the high-energy physics (HEP) community, designed for storing and analyzing petabytes of data in an efficient way. Any instance of a C++ class can be stored into a ROOT file in a machine-independent compressed binary format. In ROOT the TTree object container is optimized for statistical data analysis over very large data sets by using vertical data storage techniques. These containers can span a large number of files on local disks, the web, or a number of different shared file systems. In order to analyze this data, the user can choose from a wide set of mathematical and statistical functions, including linear algebra classes, numerical algorithms such as integration and minimization, and various methods for performing regression analysis (fitting). In particular, ROOT offers packages for complex data modeling and fitting, as well as multivariate classification based on machine learning techniques. A central piece of these analysis tools is the set of histogram classes, which provide binning of one- and multi-dimensional data. Results can be saved in high-quality graphical formats like PostScript and PDF or in bitmap formats like JPG or GIF. The result can also be stored into ROOT macros that allow a full recreation and rework of the graphics. Users typically create their analysis macros step by step, making use of the interactive C++ interpreter CINT, while running over small data samples. Once the development is finished, they can run these macros at full compiled speed over large data sets, using on-the-fly compilation, or by creating a stand-alone batch program. Finally, if processing farms are available, the user can reduce the execution time of intrinsically parallel tasks - e.g. data mining in HEP - by using PROOF, which takes care of optimally distributing the work over the available resources in a transparent way.
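
    ROOT itself is driven from C++ (macros interpreted by CINT or compiled), but to keep the examples in this document in a single language the sketch below uses the PyROOT bindings, assuming a ROOT installation with Python support; it fills a TTree, projects a branch into a histogram, fits it, and writes everything to a compressed ROOT file.

        # Minimal PyROOT sketch: TTree -> histogram -> fit -> ROOT file.
        from array import array
        import ROOT

        f = ROOT.TFile("example.root", "RECREATE")   # machine-independent compressed file
        tree = ROOT.TTree("events", "toy events")
        x = array("d", [0.0])
        tree.Branch("x", x, "x/D")                   # one double branch, stored column-wise

        rng = ROOT.TRandom3(42)
        for _ in range(10000):
            x[0] = rng.Gaus(0.0, 1.0)
            tree.Fill()

        hist = ROOT.TH1F("h_x", "x distribution;x;entries", 100, -5.0, 5.0)
        tree.Draw("x >> h_x", "", "goff")            # project the branch without drawing
        hist.Fit("gaus", "Q")                        # quiet Gaussian fit
        f.Write()
        f.Close()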

    Improving Real-Time Data Dissemination Performance by Multi Path Data Scheduling in Data Grids

    The performance of data grids for data-intensive, real-time applications is highly dependent on the data dissemination algorithm employed in the system. Motivated by this fact, this study first formally defines the real-time splittable data dissemination problem (RTS/DDP), where data transfer requests can be routed over multiple paths to maximize the number of data transfers completed before their deadlines. Since RTS/DDP is proved to be NP-hard, four different heuristic algorithms, namely kSP/ESMP, kSP/BSMP, kDP/ESMP, and kDP/BSMP, are proposed. The performance of these heuristic algorithms is analyzed through an extensive set of data grid system simulation scenarios. The simulation results reveal that a performance increase of up to 8% compared to a very competitive single-path data dissemination algorithm is possible.
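
    The abstract does not spell out the kSP/kDP heuristics, but the general idea of splittable multi-path dissemination can be sketched as follows: a request is split over up to k candidate paths in proportion to their available bandwidth and is accepted only if the parallel transfer can finish before its deadline. The function and parameter names below are illustrative, not the paper's algorithms.

        # Toy splittable multi-path scheduler: accept a request only if the
        # k best candidate paths can jointly finish it before its deadline.
        def schedule_request(size_mb, deadline_s, candidate_paths, k):
            """candidate_paths: list of (path_id, available_bandwidth_MBps)."""
            chosen = sorted(candidate_paths, key=lambda p: -p[1])[:k]
            total_bw = sum(bw for _, bw in chosen)
            if total_bw == 0:
                return None
            finish_time = size_mb / total_bw           # paths transfer in parallel
            if finish_time > deadline_s:
                return None                            # deadline cannot be met
            # Split the data proportionally to each path's available bandwidth.
            return {pid: size_mb * bw / total_bw for pid, bw in chosen}

        plan = schedule_request(size_mb=600, deadline_s=10,
                                candidate_paths=[("p1", 40), ("p2", 35), ("p3", 20)], k=2)
        print(plan)   # {'p1': 320.0, 'p2': 280.0}; None if the deadline cannot be met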

    The CDF Data Handling System

    The Collider Detector at Fermilab (CDF) records proton-antiproton collisions at a center-of-mass energy of 2.0 TeV at the Tevatron collider. A new collider run, Run II, of the Tevatron started in April 2001. Increased luminosity will result in about 1 PB of data recorded on tapes in the next two years. Currently the CDF experiment has about 260 TB of data stored on tapes. This amount includes raw and reconstructed data and their derivatives. The data storage and retrieval are managed by the CDF Data Handling (DH) system. This system has been designed to accommodate the increased demands of the Run II environment and has proven robust in providing a reliable flow of data from the detector to the end user. This paper gives an overview of the CDF Run II Data Handling system, which has evolved significantly over the course of this year. An outline of the future direction of the system is given.
    Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, CA, USA, March 2003; 7 pages, LaTeX, 4 EPS figures, PSN THKT00