
    Computational methods for integrated analysis of omics and pathway data

    A key tenet of bioinformatics is enabling interoperability among heterogeneous data sources and improving the integration of diverse biological data. High-throughput experimental methods continue to improve and become more accessible, allowing researchers to measure not just a specific gene or protein of interest but the entirety of the biological machinery inside the cell. These measurements are referred to as omics, such as genomics, transcriptomics, proteomics, metabolomics, and fluxomics. Omics data are highly interrelated at the systems level, as each type of molecule (DNA, RNA, protein, etc.) can interact with and affect the others. These interactions may be direct, as in the central dogma of biology, in which information flows from DNA to RNA to protein, or indirect, as in the regulation of gene expression or metabolic feedback loops. Regardless, it is becoming apparent that multiple levels of omics data must be analyzed and understood simultaneously if we are to advance our understanding of systems-level biology. Much of our current biological knowledge is stored in public databases, most of which specialize in a particular type of omics or a specific organism. Despite efforts to improve consistency between databases, many challenges impede efforts to meaningfully compare or combine these resources. At a basic level, differences in naming and internal database ID assignments prevent simple mapping between objects in these databases. More fundamental, though, is the lack of a standardized way to define equivalency between two functionally identical biological entities. One benefit of improving database interoperability is that targeted, high-quality data from one database can be used to improve another. For example, comparison between MaizeCyc and CornCyc identified many manually curated GO annotations present in MaizeCyc but not in CornCyc.
CycTools facilitates the transfer of high-quality annotation data from one database to another by automatically mapping equivalent objects in both databases. This Java-based tool has a graphical user interface which guides users through the transfer process. A case study using two independent Zea mays pathway databases, CornCyc and MaizeCyc, illustrates the challenges of comparing the content of even closely related resources. This example highlights the downstream implications that the choice of initial computational enzymatic function assignment pipeline and subsequent manual curation had on the overall scope and quality of each database's content. We compare the prediction accuracy of the protein EC assignments for 177 maize enzymes between these resources and find that while MaizeCyc covers a broader scope of enzyme predictions, CornCyc's predictions are more accurate. The advantage of high-quality, integrated data resources must be realized through analysis methods that can account for multiple data types simultaneously. Due to the difficulty of obtaining systems-wide metabolic flux measurements, researchers have made several efforts to integrate transcriptional regulatory data with metabolic models to improve the accuracy of metabolic flux predictions. Transcriptional regulation involves the binding of transcription factors (i.e., proteins) to binding sites on the DNA to positively or negatively influence expression of the targeted gene. This has an indirect, downstream impact on the organism's metabolism, as metabolic reactions depend on gene-derived enzymes to catalyze the reaction. A novel method is proposed which integrates transcriptional regulation and metabolic reaction data into a single model in order to investigate the interactions between metabolism and regulation.
In contrast to existing methods, which use transcriptional regulation networks to limit the solution space of the constraint-based metabolic model, we seek to define a transcriptional regulatory space which can be associated with the metabolic flux distribution of interest. This allows us to make inferences about how changes in the regulatory network could lead to improved metabolic flux.
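The automatic mapping step described above can be illustrated with a small sketch. This is not CycTools itself (CycTools is a Java GUI tool); it is a minimal, hypothetical Python analogue that pairs records from two pathway databases when they share an external identifier such as an EC number. All record IDs and field names are invented for illustration.

```python
# Hypothetical sketch of cross-database object mapping: match records
# from two pathway databases on shared external identifiers
# (e.g. EC numbers). All names and data below are illustrative.

def build_index(records, key_field):
    """Map each external identifier to the record IDs that carry it."""
    index = {}
    for rec_id, fields in records.items():
        for ident in fields.get(key_field, []):
            index.setdefault(ident, set()).add(rec_id)
    return index

def map_equivalents(db_a, db_b, key_field="ec_numbers"):
    """Pair each record in db_a with records in db_b sharing an identifier."""
    index_b = build_index(db_b, key_field)
    mapping = {}
    for rec_id, fields in db_a.items():
        matches = set()
        for ident in fields.get(key_field, []):
            matches |= index_b.get(ident, set())
        if matches:
            mapping[rec_id] = matches
    return mapping

corncyc = {"CC-1": {"ec_numbers": ["2.7.1.1"]}}
maizecyc = {"MC-9": {"ec_numbers": ["2.7.1.1"]},
            "MC-10": {"ec_numbers": ["1.1.1.1"]}}
print(map_equivalents(corncyc, maizecyc))  # {'CC-1': {'MC-9'}}
```

In practice the match key would be chosen carefully, since — as the abstract notes — naming differences and internal ID schemes make simple one-to-one mapping unreliable.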

    Sequencing and analysis of genes expressed in the cambial tissue of Quercus rubra using a normalized, large-insert cDNA library

    The logistical issues associated with completely sequencing a very large genome greatly limit the number of organisms that can have such a project devoted to them. One method developed to circumvent this impasse is the sequencing of expressed sequence tags (ESTs), that is, partial cDNAs. The technique is often used as an introduction to completely unsequenced genomes as well as for more detailed analysis of previously characterized genomes. In the case of poorly characterized genomes, EST sequencing provides a quick, efficient profile of the nucleotide sequences of messenger RNA. Furthermore, many plant ESTs have been quickly annotated via sequence similarity comparisons with genes of model organisms such as the mustard, Arabidopsis thaliana Heynh., and the hardwood, Populus trichocarpa Torr. & A.Gray. This project focused on rapidly dividing cambial tissue from a Quercus rubra L. individual with a partially characterized ancestry. That individual was recovered from one of the few oak nurseries in the world, namely the Watauga Genetic Research Orchard near Elizabethton, TN. The cambial transcriptome provided 984 cDNA clones resulting in 870 unique sequences. After appropriate filtering, the unique sequences were submitted for homology comparison against the gene databases of Arabidopsis and Populus, as well as the generalized UniProt database. Putative function was assigned to more than 90% of the unique sequences; however, forty sequences had no significant homology to any known protein. The nucleotide sequences produced in this study will be submitted to the GenBank database, where they will become the foundation for a Q. rubra sequence resource. Since the sequences were recovered from cambial tissue of spring wood, they will assist in better understanding wood formation within this species. Such studies should lead to increases in both the quality and quantity of this valuable hardwood found in western North Carolina.
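A minimal sketch of the annotation step described above, assuming homology hits have already been computed (for example by a BLAST search): keep each EST's best-scoring hit below an E-value cutoff, and treat ESTs with no qualifying hit as lacking significant homology. All identifiers and thresholds here are illustrative, not values from the study.

```python
# Illustrative EST annotation by homology: retain the best (lowest
# E-value) hit per EST that clears a significance cutoff. Hit data
# and the cutoff are invented for demonstration.

def annotate(hits, evalue_cutoff=1e-5):
    """hits: iterable of (est_id, subject_id, evalue) tuples.
    Returns {est_id: (best_subject, best_evalue)} for significant hits."""
    best = {}
    for est_id, subject, evalue in hits:
        if evalue > evalue_cutoff:
            continue  # not significant: contributes no annotation
        if est_id not in best or evalue < best[est_id][1]:
            best[est_id] = (subject, evalue)
    return best

hits = [
    ("EST001", "AT1G01010", 1e-40),           # Arabidopsis hit
    ("EST001", "POPTR_0001s00200", 1e-35),    # weaker Populus hit
    ("EST002", "Q9XYZ1", 0.5),                # too weak: no homology call
]
print(annotate(hits))  # {'EST001': ('AT1G01010', 1e-40)}
```

ESTs absent from the result (like EST002 above) correspond to the forty sequences with no significant homology to any known protein.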

    Graphene-based Josephson junction single photon detector

    We propose to use graphene-based Josephson junctions (gJjs) to detect single photons across a wide electromagnetic spectrum, from visible to radio frequencies. Our approach takes advantage of the exceptionally low electronic heat capacity of monolayer graphene and its constricted thermal conductance to its phonon degrees of freedom. Such a system could provide the high-sensitivity photon detection required for research areas including quantum information processing and radio astronomy. As an example, we present device concepts for gJj single photon detectors in both the microwave and infrared regimes. The dark count rate and intrinsic quantum efficiency are computed based on parameters from a measured gJj, demonstrating feasibility within existing technologies.
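To see why graphene's tiny electronic heat capacity matters for single-photon detection, one can estimate the temperature jump from a single absorbed photon as roughly ΔT ≈ E_photon / C_e: the smaller the heat capacity, the larger and more detectable the jump. The sketch below uses placeholder values for the heat capacity and photon frequencies, not the paper's device parameters.

```python
# Back-of-the-envelope estimate: temperature rise of the graphene
# electron system from one absorbed photon, dT = E_photon / C_e.
# C_e and the frequencies are illustrative placeholders only.

h = 6.62607015e-34   # Planck constant, J*s (exact, SI)
C_e = 1e-21          # assumed electronic heat capacity, J/K

for label, freq_hz in [("microwave", 10e9), ("infrared", 200e12)]:
    e_photon = h * freq_hz          # photon energy E = h * f
    dT = e_photon / C_e             # resulting temperature jump
    print(f"{label}: E = {e_photon:.2e} J, dT ~ {dT:.2e} K")
```

Even a microwave photon, carrying orders of magnitude less energy than an infrared one, produces a measurable temperature excursion when the absorber's heat capacity is this small.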

    Utilizing Load Shifting for Optimal Compressor Sequencing in Industrial Refrigeration

    The ubiquity and energy needs of industrial refrigeration have prompted several research studies investigating control opportunities for reducing energy demand. This work focuses on one such opportunity, termed compressor sequencing, which entails intelligently selecting the operational state of the compressors to service the required refrigeration load with the least possible work. We first study the static compressor sequencing problem and observe that deriving the optimal compressor operational state is computationally challenging and can vary dramatically with the refrigeration load. Thus we introduce load shifting in conjunction with compressor sequencing, which entails strategically precooling the facility to allow for more efficient compressor operation. Interestingly, we show that load shifting not only helps in computing the optimal compressor operational state but can also lead to significant energy savings. Our results are based on, and compared against, real-world sensor data from an operating industrial refrigeration site of Butterball LLC located in Huntsville, AR. These data show that without load shifting, even optimal compressor operation often leaves compressors running at intermediate capacity levels, which can lead to inefficiencies. Using the collected data, we demonstrate that a load shifting approach to compressor sequencing has the potential to reduce compressor energy use by up to 20% compared to optimal sequencing without load shifting.
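A toy version of the static compressor sequencing problem makes the combinatorial difficulty concrete: each compressor offers a discrete set of (capacity, power) operating points, and we search exhaustively for the cheapest combination whose total capacity covers the load. The compressor curves below are invented for illustration; a real site has more compressors and finer capacity levels, which is why exhaustive search quickly becomes computationally challenging.

```python
# Toy static compressor sequencing: brute-force the operating point of
# each compressor to meet a load at minimum total power. The
# (capacity_kW, power_kW) curves below are invented for illustration.
import itertools

COMPRESSORS = [
    [(0, 0), (50, 40), (100, 70)],    # compressor 1: off / half / full
    [(0, 0), (80, 55), (160, 100)],   # compressor 2: off / half / full
]

def best_sequence(load_kw):
    """Return (power, chosen operating points) minimizing power
    subject to total capacity >= load_kw, or None if infeasible."""
    best = None
    for choice in itertools.product(*COMPRESSORS):
        cap = sum(c for c, _ in choice)
        power = sum(p for _, p in choice)
        if cap >= load_kw and (best is None or power < best[0]):
            best = (power, choice)
    return best

print(best_sequence(120))  # (95, ((50, 40), (80, 55)))
```

Note that the optimum at a 120 kW load runs both machines at partial capacity; shifting load (precooling) so that compressors can run closer to their efficient operating points is exactly the lever the paper exploits.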

    Detection of Vacuum Birefringence with Intense Laser Pulses

    We propose a novel technique that promises to be the first to directly detect polarization of the quantum electrodynamic (QED) vacuum. The technique is based on the use of ultra-short pulses of light circulating in low-dispersion optical resonators. We show that the technique circumvents the need for large-scale liquid-helium-cooled magnets and, more importantly, avoids the experimental pitfalls that plague existing experiments that make use of such magnets. Likely improvements in the performance of optics and lasers would make it possible to observe vacuum polarization in an experiment of only a few hours' duration.

    What's In My Big Data?

    Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node. We apply WIMBD to ten different corpora used to train popular language models, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE. We open-source WIMBD's code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them: github.com/allenai/wimbd.
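The duplicate-content finding rests on the count capability. One simple way such a count can work, sketched below, is to hash each document and tally repeated digests; the corpus and the exact method are illustrative, not WIMBD's actual implementation.

```python
# Minimal sketch of exact-duplicate counting in a corpus: hash each
# document and report the fraction of documents whose content appears
# more than once. The corpus contents are illustrative.
import hashlib
from collections import Counter

def duplicate_fraction(docs):
    """Fraction of documents whose exact content occurs > 1 time."""
    digests = [hashlib.sha256(d.encode("utf-8")).hexdigest() for d in docs]
    counts = Counter(digests)
    dup = sum(n for n in counts.values() if n > 1)
    return dup / len(docs)

corpus = ["the cat sat", "a dog ran", "the cat sat", "the cat sat"]
print(duplicate_fraction(corpus))  # 0.75: three of four docs share content
```

At web-corpus scale the same idea requires streaming and sharded counting rather than an in-memory list, which is the "at scale" part the abstract emphasizes.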