27 research outputs found

    Applications on emerging paradigms in parallel computing

    Parallelism is increasingly being incorporated into computing at various levels: from the lowest levels of vector processing units following Single Instruction Multiple Data (SIMD) processing, Simultaneous Multi-threading (SMT) architectures, and multi/many-cores with thread-level shared memory and SIMT parallelism, to the higher levels of distributed memory parallelism as in supercomputers and clusters, and on to large distributed systems such as server farms and clouds. Together these form a large hierarchy of parallelism. Developing high-performance parallel algorithms and efficient software tools that make use of the available parallelism is essential to harness the raw computational power these emerging systems have to offer. In the work presented in this thesis, we develop architecture-aware parallel techniques on such emerging paradigms in parallel computing, specifically the parallelism offered by the emerging multi- and many-core architectures, as well as the emerging area of cloud computing, to target large scientific applications. First, we develop efficient parallel algorithms to compute optimal pairwise alignments of genomic sequences on heterogeneous multi-core processors, and demonstrate them on the IBM Cell Broadband Engine. Then, we develop parallel techniques for scheduling all-pairs computations on heterogeneous systems, including clusters of Cell processors and NVIDIA graphics processors. We compare the performance of our strategies on Cell, GPU and Intel Nehalem multi-core processors. Further, we apply our algorithms to specific applications taken from the areas of systems biology, fluid dynamics and materials science: pairwise Mutual Information computations for reconstruction of gene regulatory networks; pairwise Lp-norm distance computations for coherent structures discovery in the design of flapping-wing Micro Air Vehicles; and construction of stochastic models for a set of properties of heterogeneous materials. Lastly, in the area of cloud computing, we propose and develop an abstract framework to enable computations in parallel on large tree structures, to facilitate easy development of a class of scientific applications based on trees. Our framework, in the style of Google's MapReduce paradigm, is based on two generic user-defined functions through which a user writes an application. We implement our framework as a generic programming library for a large cluster of homogeneous multi-core processors, and demonstrate its applicability through two applications: all-k-nearest neighbors computations, and Fast Multipole Method (FMM) based simulations.

    Conserved noncoding sequences highlight shared components of regulatory networks in dicotyledonous plants

    Conserved noncoding sequences (CNSs) in DNA are reliable pointers to regulatory elements controlling gene expression. Using a comparative genomics approach with four dicotyledonous plant species (Arabidopsis thaliana, papaya [Carica papaya], poplar [Populus trichocarpa], and grape [Vitis vinifera]), we detected hundreds of CNSs upstream of Arabidopsis genes. Distinct positioning, length, and enrichment for transcription factor binding sites suggest these CNSs play a functional role in transcriptional regulation. The enrichment of transcription factors within the set of genes associated with CNSs is consistent with the hypothesis that together they form part of a conserved transcriptional network whose function is to regulate other transcription factors and control development. We identified a set of promoters where regulatory mechanisms are likely to be shared between the model organism Arabidopsis and other dicots, providing areas of focus for further research.

    Alteromonas Myovirus V22 Represents a New Genus of Marine Bacteriophages Requiring a Tail Fiber Chaperone for Host Recognition

    Marine phages play a variety of critical roles in regulating the microbial composition of our oceans. Despite constituting the majority of genetic diversity within these environments, there are relatively few isolates with complete genome sequences or in-depth analyses of their host interaction mechanisms, such as characterization of their receptor binding proteins (RBPs). Here, we present the 92,760-bp genome of the Alteromonas-targeting phage V22. Genomic and morphological analyses identify V22 as a myovirus; however, due to a lack of sequence similarity to any other known myoviruses, we propose that V22 be classified as the type phage of a new Myoalterovirus genus within the Myoviridae family. V22 shows gene homology and synteny with two different subfamilies of phages infecting enterobacteria, specifically within the structural region of its genome. To improve our understanding of the V22 adsorption process, we identified putative RBPs (gp23, gp24, and gp26) and tested their ability to decorate the V22 propagation strain, Alteromonas mediterranea PT11, as recombinant green fluorescent protein (GFP)-tagged constructs. Only GFP-gp26 was capable of bacterial recognition, and it was identified as the V22 RBP. Interestingly, production of functional GFP-gp26 required coexpression with the downstream protein gp27. GFP-gp26 could be expressed alone but was incapable of host recognition. By combining size-exclusion chromatography with fluorescence microscopy, we reveal that gp27 is not a component of the final RBP complex but is instead a new type of phage-encoded intermolecular chaperone that is essential for maturation of the gp26 RBP.
    This work was supported by grants ‘VIREVO’ CGL2016‐76273‐P (MCI/AEI/FEDER, EU) (cofunded with FEDER funds) from the Spanish Ministerio de Ciencia e Innovación and ‘HIDRAS3’ PROMETEU/2019/009 from Generalitat Valenciana. R.G.-S. was supported by a predoctoral fellowship from the Valencian Consellería de Educació, Investigació, Cultura i Esport (ACIF/2016/050) and was also a beneficiary of the BEFPI 2019 fellowship for predoctoral stays from Generalitat Valenciana and The European Social Fund. F.R.-V. was a beneficiary of the 5top100 program of the Ministry for Science and Education of Russia.

    Homology sequence analysis using GPU acceleration

    A number of problems in the bioinformatics, systems biology and computational biology fields require abstracting physical entities to mathematical or computational models. In such studies, the computational paradigms often involve algorithms that can be solved by the Central Processing Unit (CPU). Historically, those algorithms benefited from advancements in the serial processing capabilities of individual CPU cores. However, the growth has slowed down over recent years, as scaling out CPUs has been shown to be both cost-prohibitive and insecure. To overcome this problem, parallel computing approaches that employ the Graphics Processing Unit (GPU) have gained attention as complementing or replacing traditional CPU approaches. The premise of this research is to investigate the applicability of various parallel computing platforms to several problems in the detection and analysis of homology in biological sequences. I hypothesize that by exploiting the sheer amount of computation power and sequencing data, it is possible to deduce information from raw sequences without supplying the underlying prior knowledge to come up with an answer. I have developed such tools to perform analysis at scales that are traditionally unattainable with general-purpose CPU platforms. I have developed a method to accelerate sequence alignment on the GPU, and I used the method to investigate whether the Operational Taxonomic Unit (OTU) classification problem can be improved with such a sheer amount of computational power. I have developed a method to accelerate pairwise k-mer comparison on the GPU, and I used the method to further develop PolyHomology, a framework to scaffold shared sequence motifs across large numbers of genomes to illuminate the structure of the regulatory network in yeasts. The results suggest that such an approach to heterogeneous computing could help to answer questions in biology and is a viable path to new discoveries in the present and the future. Includes bibliographical references

    Histone variants in archaea

    Eukaryotic histone variants are involved in a wide range of processes and play a key role in altering nucleosome dynamics to shape the architecture of chromatin. The importance of individual variants has been studied extensively in many eukaryotes. In comparison, we know relatively little about histones in archaea. Despite sequence variation and evidence for potential functional differences between histone paralogs in the same species, whether archaea have histone variants, and therefore the potential for complex histone-based chromatin, has not been comprehensively explored. In this work, I apply structural and sequence-based approaches and present evidence that histone variants exist in archaea. In silico modelling suggests that, similarly to some eukaryotic variants, paralogs in archaea can be identified by unique structural properties. In particular, I describe one such variant, a “capstone”, that can drastically alter histone-based chromatin by limiting oligomerisation. Other paralogs have less extreme structural properties but are shared between species which separated hundreds of millions of years ago, on par with some eukaryotic histone variants. Although there are shared features between the two, histones in archaea appear to have explored a different sequence space to eukaryotic histones, evolving separately and in parallel. Open Access

    Network and multi-scale signal analysis for the integration of large omic datasets: applications in Populus trichocarpa

    Poplar species are promising sources of cellulosic biomass for biofuels because of their fast growth rate, high cellulose content and moderate lignin content. There is an increasing movement toward integrating multiple layers of ’omics data in a systems biology approach to understand gene-phenotype relationships and assist in plant breeding programs. This dissertation involves the use of network and signal processing techniques for the combined analysis of these various data types, with the goals of (1) increasing fundamental knowledge of P. trichocarpa and (2) facilitating the generation of hypotheses about target genes and phenotypes of interest. A data integration “Lines of Evidence” method is presented for the identification and prioritization of target genes involved in functions of interest. A new post-GWAS method, Pleiotropy Decomposition, is presented, which extracts pleiotropic relationships between genes and phenotypes from GWAS results, allowing for identification of genes with signatures favorable to genome editing. Continuous wavelet transform signal processing analysis is applied in the characterization of genome distributions of various features (including variant density, gene density, and methylation profiles) in order to identify chromosome structures such as the centromere. This yielded approximate centromere locations on all P. trichocarpa chromosomes, which had previously not been adequately reported in the scientific literature. Discrete wavelet transform signal processing followed by correlation analysis was applied to genomic features from various data types including transposable element density, methylation density, SNP density, gene density, centromere position and putative ancestral centromere position. Subsequent correlation analysis of the resulting wavelet coefficients identified scale-specific relationships between these genomic features, and provided insights into the evolution of the genome structure of P. trichocarpa. These methods have provided strategies both to increase fundamental knowledge about the P. trichocarpa system and to identify new target genes related to biofuels targets. We intend that these approaches will ultimately be used in the design of better plants for more efficient and sustainable production of bioenergy.

    A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset

    Background: Single-nucleotide polymorphisms (SNPs) are the most widely used form of molecular genetic variation studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The genome analysis toolkit (GATK) is one of the most widely used SNP calling software tools publicly available, but unfortunately, high-performance computing versions of this tool have yet to become widely available and affordable. Results: Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy with comparable results with previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a “subpopulation aware” 16-genome rice reference panel with ~ 3000 resequenced rice accessions. The entire process took ~ 16 weeks and resulted in the identification of an average of 27.3 M SNPs/genome and the discovery of ~ 2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq). Conclusions: This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, which significantly improved the speed at which SNPs can be called. The workflow is widely applicable as demonstrated successfully for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a 25 multi-crop-reference genome data set produced over 1.1 billion SNPs that were publicly released for functional and breeding studies. For rice, many novel SNPs were identified and were found to reside within genes and open chromatin regions that are predicted to have functional consequences. 
Combined, our results demonstrate the usefulness of combining a high-performance SNP calling architecture solution with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment. © 2024, The Author(s). Open access journal.

    Organization And Introgression Mechanics Of Phaseolus Vulgaris (Common Bean)

    Phaseolus vulgaris is a major food crop grown and consumed around the world. A New World vegetable, the common bean underwent two separate domestication events, both pre-Columbus. These events generated two different land races, the Mesoamerican and Andean, named for the area where the domestication took place. Since the initial domestications the land races have been generally evenly cultivated, but despite its popularity the common bean has only very recently been fully sequenced. One of the issues faced by bean growers worldwide is Common Bacterial Blight (CBB). A disease caused by Xanthomonas axonopodis, CBB causes crop losses ranging from 20–40% every year but does not affect all species within Phaseolus evenly; P. acutifolius, for example, shows an innate resistance to CBB. To leverage this advantage, researchers at the University of Guelph, in partnership with the Ontario Agricultural College, developed a cultivar of Mesoamerican P. vulgaris that was introgressed with PI440795, a P. acutifolius accession, and backcrossed repeatedly with several other Mesoamerican P. vulgaris accessions to generate ‘OAC-Rex’, a plant that displays the crop-desired traits of P. vulgaris and the disease resistance traits of P. acutifolius. Genetic introgression is the process of crossing distantly related organisms followed by repeated backcrossing, resulting in viable offspring that display characteristics of each parent. Though rarely occurring, it can be observed in both plants and animals and is often exploited in a crop development context to generate new cultivars. Unfortunately, though regularly observed, introgression has been followed on a predominantly phenotypic level, usually many generations after the event, and as such molecular aspects of this phenomenon are largely unknown. By studying OAC-Rex, PI440795, and G-19833 (an Andean cultivar whose whole genome has been published) introgression was examined directly and a method for the detection of regions within the introgressed genome uniquely donated from either parent

    Evolutionary genomics : statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.

    Evolutionary Genomics

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.