154 research outputs found

    Barely visible but highly unique: the Ostreococcus genome unveils its secrets


    Computational genomics of developmental gene regulation

    The development of multicellular organisms requires the precise execution of complex transcriptional programs. The demands posed by development, coupled with the relatively late evolution of multicellularity, could have led to a separate mode of gene regulation for genes involved in, and regulated throughout, development. I investigated the regulation of genes by enhancers using histone modifications coupled to gene expression, based on the observation that developmental genes are surrounded by dense clusters of conserved enhancers which act in concert. Genes regulated by enhancers are much more likely to be developmentally regulated genes, and many enhancers at each locus co-ordinate to direct transcription across multiple tissues. CAGE-seq is a powerful tool for determining the structure of promoters. I analysed promoters in amphioxus using CAGE-seq to determine if the diverse promoter architectures observed in vertebrates had ancestral origins. Promoters in amphioxus can be divided into developmental and housekeeping promoters, which each have characteristic patterns of dinucleotide enrichment. Housekeeping promoters in amphioxus have a novel promoter architecture and contain a high frequency of bidirectional promoters, which represents the ancestral vertebrate state. This set of genes highlights the malleability of promoter architecture during evolution. I developed an R/Bioconductor package, 'heatmaps', to enable effective visualisation of this and other data. Taken together, these results suggest a second mode of regulation in vertebrates governing the regulation of developmental genes.
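    A minimal sketch of the kind of analysis described above (not the thesis pipeline): computing positional dinucleotide frequencies in TSS-centred promoter windows so that promoter classes such as developmental and housekeeping can be compared. The sequences, window size and the Python implementation are illustrative assumptions; the thesis itself worked from CAGE-seq data and used the R/Bioconductor 'heatmaps' package for visualisation.

    # Positional dinucleotide frequencies around the TSS (illustrative sketch only).
    def dinucleotide_profile(seqs, dinuc="CG"):
        """Fraction of sequences carrying `dinuc` at each position.

        `seqs` are equal-length windows centred on the TSS."""
        width = len(seqs[0]) - 1          # number of dinucleotide start positions
        counts = [0] * width
        for seq in seqs:
            seq = seq.upper()
            for i in range(width):
                if seq[i:i + 2] == dinuc:
                    counts[i] += 1
        return [c / len(seqs) for c in counts]

    # Hypothetical usage: two promoter classes, each a list of TSS-centred windows.
    developmental = ["ACGTACGTGC" * 10, "GCGCGCATAT" * 10]   # placeholder sequences
    housekeeping  = ["CGCGTATACG" * 10, "ATATCGCGGC" * 10]
    dev_cg = dinucleotide_profile(developmental, "CG")
    hk_cg  = dinucleotide_profile(housekeeping, "CG")
    # Plotting the two profiles against position relative to the TSS gives the
    # per-class enrichment pattern that a heatmap visualisation would summarise.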

    Organization and evolution of information within eukaryotic genomes.


    Metals in enzyme catalysis and visualization methods

    Metal ions play essential roles in biological functions including catalysis, protein stability, DNA-protein interactions and cell signaling. It is estimated that 30% of proteins utilize metals in some fashion. Additionally, methods by which metal ions can be visualized have been used to study metal concentrations and localization in relation to disease. Understanding the roles metals play in biological systems has great potential in medicine and technology. Chapter 1 of this dissertation analyzes the structure and function of the Mn-dependent enzyme oxalate decarboxylase (OxDc), and Chapter 2 presents a bioinformatic analysis of the cupin superfamily that provides the structural scaffold of the decarboxylase. The X-ray crystal structure of the W132F variant was determined and used together with EPR data to develop a computational approach to determining the EPR spectra of the enzyme's two metal-binding centers. Furthermore, a variant in which the catalytic Glu162 was deleted revealed the binding mode of oxalate, providing the first substrate-bound structure of OxDc. OxDc is a member of the cupin superfamily, which comprises a wide variety of proteins and enzymes with great sequence and functional diversity. A bioinformatics analysis of the superfamily was performed to analyze how sequence variation determines function and metal utilization. Chapters 3 and 4 discuss the expansion of lanthanide-binding tags (LBTs) to in cellulo studies. Lanthanide-binding tags are short sequences of amino acids that have high affinity and selectivity for lanthanide ions. An EGF-LBT construct was used to quantify EGF receptors on the surface of A431 and HeLa cells. The results from the LBT quantification are consistent with previous studies of EGFR in these cell types, validating the use of this method for future studies. The potential of using LBTs for X-ray fluorescence microscopy (XFM) was also investigated. LBT-labeled constructs were used to investigate whether membrane-bound as well as cytosolic LBT-containing proteins could be visualized and localized to their cell compartments via XFM; both membrane-localized and cytosolic proteins were successfully visualized. With the high resolution (< 150 Å) obtainable with new synchrotron beamline configurations, LBTs could be used to study nanoscale biological structures in their near-native state.

    Methods and Applications for Collection, Contamination Estimation, and Linkage Analysis of Large-scale Human Genotype Data

    In recent decades, statistical genetics has contributed substantially to our knowledge of human health and biology. This research has many facets, from collecting data, to cleaning it, to analyzing it. As the scope of the scientific questions considered and the scale of the data continue to increase, they bring additional challenges to every step of the process. In this dissertation, I describe novel approaches for each of these three steps, focused on the specific problems of participant recruitment and engagement, DNA contamination estimation, and linkage analysis with large data sets. In Chapter 1, we introduce the subject of this dissertation and how it fits with other developments in the generation, analysis and interpretation of human genetic data. In Chapter 2, we describe Genes for Good, a new platform for engaging a large, diverse participant pool in genetics research through social media. We developed a Facebook application where participants can sign up, take surveys related to their health, and easily invite interested friends to join. After completing a required number of these surveys, we send participants a spit kit to collect their DNA. In a statistical analysis of 27,000 individuals from all over the United States genotyped in our study, we replicated health trends and genetic associations, showing the utility of our approach and the accuracy of the self-reported phenotypes we collected. In Chapter 3, we introduce VICES (Verify Intensity Contamination from Estimated Sources), a statistical method for joint estimation of DNA contamination and its sources in genotyping arrays. Genotyping array data are typically highly accurate but sensitive to mixing of DNA samples from multiple individuals before or during genotyping. VICES jointly estimates the total proportion of contaminating DNA and identifies which samples it came from by regressing deviations in probe intensity for the sample being tested on the genotypes of another sample. Through analysis of array intensity and genotype data from HapMap samples and the Michigan Genomics Initiative, we show that our method estimates contamination more accurately than existing methods and implicates problematic steps to guide process improvements. In Chapter 4, we propose Population Linkage, a novel approach to perform linkage analysis on genome-wide genotype data from tens of thousands of arbitrarily related individuals. Our method estimates kinship and identity-by-descent (IBD) segments between all pairs of individuals, fits them as variance components using Haseman-Elston regression, and tests for linkage. This chapter addresses how to iteratively assess evidence of linkage in large numbers of individuals across the genome, reduce repeated calculations, model relationships without pedigrees, and determine segregation of genomic segments between relatives using single-nucleotide polymorphism (SNP) genotypes. After applying our method to 6,602 individuals from the National Institute on Aging (NIA) SardiNIA study and 69,716 individuals from the Trøndelag Health Study (HUNT), we show that most of our signals overlapped with known GWAS loci and that many of them could explain a greater proportion of the trait variance than the top GWAS SNP. In Chapter 5, we discuss the impact and future directions of the work presented in this dissertation. We have proposed novel approaches for gathering useful research data, checking its quality, and detecting associations in the investigation of human genetics. This work also serves as an example of thinking about the process of human genetic discovery from beginning to end as a whole and understanding the role of each part.
    PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162998/1/gzajac_1.pd
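    The contamination model described for Chapter 3 lends itself to a small illustration. The sketch below (Python with simulated data; not the VICES implementation) shows the core regression idea: deviations of a sample's probe intensities from its own genotypes are regressed on another sample's genotypes, and the fitted slope approximates the fraction of DNA contributed by that sample.

    import numpy as np

    rng = np.random.default_rng(0)
    n_snps = 5000
    true_alpha = 0.04                                  # simulated contamination fraction

    maf = rng.uniform(0.05, 0.5, n_snps)               # allele frequencies
    g_target = rng.binomial(2, maf) / 2                # target sample allele dose in [0, 1]
    g_contam = rng.binomial(2, maf) / 2                # contaminating sample allele dose

    # Observed B-allele fraction is a mixture of the two samples plus noise.
    baf = (1 - true_alpha) * g_target + true_alpha * g_contam + rng.normal(0, 0.01, n_snps)
    deviation = baf - g_target                         # residual after the called genotype

    # Regress deviations on the candidate contaminating sample's genotypes.
    x = g_contam - g_target
    alpha_hat = np.linalg.lstsq(x[:, None], deviation, rcond=None)[0][0]
    print(f"estimated contamination: {alpha_hat:.3f} (simulated {true_alpha})")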

    Machine learning and computational methods to identify molecular and clinical markers for complex diseases – case studies in cancer and obesity

    In biomedical research, applied machine learning and bioinformatics are essential disciplines for translating data-driven findings into medical practice. This task is accomplished chiefly by developing computational tools and algorithms that assist in detecting and clarifying the underlying causes of disease. Continuous advancements in high-throughput technologies, coupled with recently promoted data-sharing policies, have produced a massive wealth of data with remarkable potential to improve human health care. In step with this boost in data production, innovative data analysis tools and methods are required to meet the growing demand. The data analyzed by bioinformaticians and computational biology experts can be broadly divided into molecular and conventional clinical data. The aim of this thesis was to develop novel statistical and machine learning tools, and to incorporate existing state-of-the-art methods, to analyze bio-clinical data with medical applications. The findings of the studies demonstrate the impact of computational approaches on clinical decision making by improving patient risk stratification and the prediction of disease outcomes. This thesis comprises five studies covering method development for 1) genomic data, 2) conventional clinical data and 3) the integration of genomic and clinical data. With genomic data, the main focus is the detection of differentially expressed genes, the most common task in transcriptome profiling projects. In addition to reviewing available differential expression tools, a data-adaptive statistical method called Reproducibility Optimized Test Statistic (ROTS) is proposed for detecting differential expression in RNA-sequencing studies. To demonstrate the efficacy of ROTS in real biomedical applications, the method is used to identify prognostic markers in clear cell renal cell carcinoma (ccRCC). In addition to previously known markers, novel genes with a potential prognostic and therapeutic role in ccRCC are detected. For conventional clinical data, ensemble-based predictive models are developed to provide clinical decision support in the treatment of patients with metastatic castration-resistant prostate cancer (mCRPC). The proposed predictive models cover treatment and survival stratification tasks for both trial-based and real-world patient cohorts. Finally, genomic and conventional clinical data are integrated to demonstrate how the inclusion of genomic data improves the predictive ability of clinical models. Again utilizing ensemble-based learners, a novel model is proposed to predict adulthood obesity using both genetic and social-environmental factors. Overall, the ultimate objective of this work is to demonstrate the importance of clinical bioinformatics and machine learning for bio-clinical marker discovery in complex, highly heterogeneous diseases. In the case of cancer, the interpretability of clinical models depends strongly on predictive markers with high reproducibility supported by validation data. The discovery of such markers would increase the chance of early detection and improve prognosis assessment and treatment choice.
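    As a hedged illustration of the ensemble-based integration of genomic and clinical data described above (not the thesis models; the features, simulated data and library choice are assumptions), a gradient-boosting classifier can be trained on concatenated genetic and clinical/environmental features and evaluated by cross-validation.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    n = 500
    X_genetic = rng.binomial(2, 0.3, size=(n, 20))      # e.g. risk-allele counts (hypothetical)
    X_clinical = rng.normal(size=(n, 5))                # e.g. social-environmental scores (hypothetical)
    X = np.hstack([X_genetic, X_clinical])
    y = (X_genetic[:, 0] + X_clinical[:, 0] + rng.normal(size=n) > 1).astype(int)

    model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print("cross-validated AUC:", round(auc.mean(), 3))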

    Optimization and Management of Large-scale Scientific Workflows in Heterogeneous Network Environments: From Theory to Practice

    Next-generation computation-intensive scientific applications feature large-scale computing workflows of various structures, which can be modeled as simple as linear pipelines or as complex as Directed Acyclic Graphs (DAGs). Supporting such computing workflows and optimizing their end-to-end network performance are crucial to the success of scientific collaborations that require fast system response, smooth data flow, and reliable distributed operation. We construct analytical cost models and formulate a class of workflow mapping problems with different mapping objectives and network constraints. The difficulty of these mapping problems essentially arises from the topological matching nature in the spatial domain, which is further compounded by resource-sharing complexity in the temporal dimension. We provide detailed computational complexity analysis and design optimal or heuristic algorithms with rigorous correctness proofs or performance analysis. We decentralize the proposed mapping algorithms and also investigate these optimization problems in unreliable network environments for fault tolerance. To examine and evaluate the performance of the workflow mapping algorithms before actual deployment and implementation, we implement a simulation program that simulates the execution dynamics of distributed computing workflows. We also develop a scientific workflow automation and management platform based on an existing workflow engine for experimentation in real environments. The performance superiority of the proposed mapping solutions is illustrated by extensive simulation-based comparisons with existing algorithms and further verified by large-scale experiments on real-life scientific workflow applications through effective system implementation and deployment in real networks.
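    To make the cost-model idea concrete, the following Python sketch (invented task loads, node speeds, bandwidth and mapping; it ignores the resource-sharing and fault-tolerance aspects the dissertation actually addresses) evaluates one candidate mapping of a small DAG workflow onto two heterogeneous nodes by computing its end-to-end latency in topological order.

    from graphlib import TopologicalSorter   # Python 3.9+

    tasks = {"A": 10.0, "B": 20.0, "C": 15.0, "D": 5.0}                            # computational loads
    edges = {("A", "B"): 4.0, ("A", "C"): 2.0, ("B", "D"): 1.0, ("C", "D"): 3.0}   # data transfer sizes
    deps  = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

    node_speed = {"n1": 2.0, "n2": 1.0}                         # work units per second
    bandwidth  = 0.5                                            # inter-node transfer rate
    mapping    = {"A": "n1", "B": "n1", "C": "n2", "D": "n1"}   # candidate mapping to evaluate

    finish = {}
    for t in TopologicalSorter(deps).static_order():
        ready = 0.0
        for p in deps[t]:
            # Transfers between tasks on the same node are assumed to be free.
            transfer = 0.0 if mapping[p] == mapping[t] else edges[(p, t)] / bandwidth
            ready = max(ready, finish[p] + transfer)
        finish[t] = ready + tasks[t] / node_speed[mapping[t]]

    print("end-to-end latency of this mapping:", finish["D"])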

    Towards Implicit Parallel Programming for Systems

    Multi-core processors require a program to be decomposable into independent parts that can execute in parallel in order to scale performance with the number of cores. But parallel programming is hard, especially when the program requires state, which many system programs use for optimization, for example a cache to reduce disk I/O. Most prevalent parallel programming models do not support a notion of state and require the programmer to synchronize state access manually, i.e., outside the realm of an associated optimizing compiler. This prevents the compiler from introducing parallelism automatically and requires the programmer to optimize the program manually. In this dissertation, we propose a programming language/compiler co-design that provides a new programming model for implicit parallel programming with state and a compiler that can optimize the program for parallel execution. We define the notion of a stateful function along with composition and control structures for such functions. An example implementation of a highly scalable server shows that stateful functions integrate smoothly into existing programming language concepts, such as object-oriented programming and programming with structs. Our programming model is also highly practical and allows existing code bases to be adapted gradually. As a case study, we implemented a new data processing core for the Hadoop Map/Reduce system to overcome existing performance bottlenecks. Our lambda-calculus-based compiler automatically extracts parallelism without changing the program's semantics. We added further domain-specific, semantics-preserving transformations that reduce I/O calls for microservice programs. The runtime format of a program is a dataflow graph that can be executed in parallel, performs concurrent I/O, and allows for non-blocking live updates.
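    A minimal sketch of the stateful-function idea (in Python, not the dissertation's language or compiler): each pipeline stage exclusively owns its stateful function, so the stages can run concurrently without the programmer writing any synchronization; the cache mirrors the I/O-reducing state mentioned above.

    from queue import Queue
    from threading import Thread

    class CachedLoader:
        """Stateful function: remembers previously loaded keys to avoid repeated I/O."""
        def __init__(self):
            self.cache = {}
        def __call__(self, key):
            if key not in self.cache:
                self.cache[key] = f"data-for-{key}"   # stand-in for a disk/network read
            return self.cache[key]

    def stage(fn, inbox, outbox):
        """Run one pipeline stage; each stage owns its stateful function exclusively."""
        for item in iter(inbox.get, None):
            outbox.put(fn(item))
        outbox.put(None)

    q_in, q_mid, q_out = Queue(), Queue(), Queue()
    Thread(target=stage, args=(CachedLoader(), q_in, q_mid)).start()
    Thread(target=stage, args=(str.upper, q_mid, q_out)).start()

    for key in ["a", "b", "a"]:
        q_in.put(key)
    q_in.put(None)
    print([q_out.get() for _ in range(3)])   # the two stages execute concurrently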