117 research outputs found

    Bioinformatics

    This book is divided into different research areas relevant to bioinformatics, such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics, and intelligent data analysis. Each section introduces the basic concepts and then explains their application to problems of great relevance, so both novice and expert readers can benefit from the information and research presented here.

    Statistics and Evolution of Functional Genomic Sequence

    In this thesis, three separate problems of genomics are addressed, utilizing methods related to the field of statistical mechanics. The goal of the project discussed in the first chapter is the elucidation of post-transcriptional gene regulation imposed by microRNAs, a recently discovered class of tiny non-coding RNAs. A probabilistic algorithm for the computational identification of genes regulated by microRNAs is introduced, which was developed based on experimental data and statistical analysis of whole genome data. In particular, the application of this algorithm to multiple alignments of groups of related species allows for the specific and sensitive detection of genes targeted by microRNAs on a genome-wide level. Examination of clade-specific predictions and cross-clade comparison yields deeper insights into microRNA biology and first clues about long-term evolution of microRNA regulation, which are discussed in detail. Modeling evolutionary dynamics of microsatellites, an abundant class of repetitive sequence in eukaryotic genomes, was the objective of the second project and is discussed in chapter two. Inspired by the putative functionality of some of these elements and the difficulty of constructing correct sequence alignments that reflect the evolutionary relationships between microsatellites, a neutral model for microsatellite evolution is developed and tested in the fruit fly Drosophila melanogaster by comparing evolutionary rates predicted by the model to independent measurements of these rates from multiple alignments of three closely related Drosophila species. The model is applied separately to genomic sequence categories of different functional annotations in order to assess the varying influence of selective constraint among these categories.
In the last chapter, a general population genetic model is introduced that allows for the determination of transcription factor binding site stability as a function of selection strength, mutation rate and effective population size at arbitrary values of these parameters. The analytical solution of this model gives the probability that a binding site is functional. The model is used to compute the population fraction of functional binding sites at fixed selection pressure across a variety of different taxa. The results lead to the conclusion that a decreasing effective population size, such as that observed at the evolutionary transition from prokaryotes to eukaryotes, could result in loss of binding site stability. An extension of the model is used to assess the compensatory effect of the emergence of multiple binding sites for the same transcription factor in maintaining the existing regulatory relationship.
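The qualitative dependence on effective population size described above can be illustrated with a much simpler textbook calculation. The sketch below is not the thesis's model: it assumes a two-state site (functional or non-functional), the weak-mutation limit, and Kimura's fixation probability for a haploid population. The function name and all parameter values are hypothetical and chosen only for illustration.

```python
import math


def functional_fraction(s, n_eff, u_gain, u_loss):
    """Stationary probability that a two-state site is functional.

    Assumptions (not taken from the thesis): weak mutation, haploid
    population of size n_eff, functional state has selective advantage s.
    Under Kimura's fixation formula, the ratio of fixation probabilities
    for +s versus -s mutations is exp(2 * (n_eff - 1) * s), which gives
    the closed form used below.
    """
    odds_nonfunctional = (u_loss / u_gain) * math.exp(-2.0 * (n_eff - 1) * s)
    return 1.0 / (1.0 + odds_nonfunctional)


# Hypothetical parameters: sites are lost 100x more easily than gained.
for n_eff in (1e3, 1e4, 1e5, 1e6):
    frac = functional_fraction(s=1e-5, n_eff=n_eff, u_gain=1e-9, u_loss=1e-7)
    print(f"N_e = {n_eff:>9.0f}  functional fraction = {frac:.4f}")
```

In this toy setting the functional fraction falls toward the neutral value u_gain / (u_gain + u_loss) as the effective population size shrinks, which is the direction of the effect the abstract describes.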

    Discovery of a novel fusion gene in glioblastoma using computational methods

    Cancer is a disease characterized by the uncontrolled and invasive growth of cells. All forms of cancer are caused by genomic alterations that alter normal cellular function, leading to a malignant phenotype that is inherited across cell divisions. Fusion genes are a type of genomic alteration where pieces from two genes are fused together, forming a new gene with altered behaviour. Fusion genes are known to play a role in many human cancers. In this work, we used computational analysis and whole transcriptome sequencing to search for fusion genes in a cohort of 40 brain cancer patients. We discovered a novel fusion gene, FGFR3-TACC3, that characterizes a new subtype of glioblastoma, a highly lethal form of brain cancer. In a larger validation cohort, the fusion gene was found in 4 of 48 glioblastoma patients but not in any of 43 low-grade gliomas tested. The fusion gene is caused by tandem duplication and encodes a chimeric protein that promotes glioma progression and cell growth. The fusion gene was mutually exclusive with the amplification of EGFR, PDGFRA and MET, three oncogenes associated with glioblastoma. The availability of small molecule inhibitors for FGFR3 suggests an effective treatment strategy for glioblastoma patients harboring the fusion.
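The validation counts quoted above (4 of 48 glioblastomas versus 0 of 43 low-grade gliomas) form a 2x2 table. As a worked illustration only, the sketch below computes a one-sided Fisher exact test from those counts using the standard library. The abstract does not say which test, if any, the authors applied, so the choice of test and the function name are assumptions.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Returns P(X >= a), where X is hypergeometric: the number of positives
    falling in row 1 when the row and column totals are held fixed.
    """
    row1 = a + b            # size of group 1
    col1 = a + c            # total positives
    n = a + b + c + d       # total samples
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, col1)


# Rows: glioblastoma, low-grade glioma. Columns: fusion found, not found.
p_value = fisher_one_sided(4, 44, 0, 43)
print(f"one-sided p = {p_value:.3f}")
```

With so few positive cases, any such test has limited power, so the result is best read alongside the functional evidence the abstract mentions.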

    Data Science: Measuring Uncertainties

    With the increase in data processing and storage capacity, a large amount of data is available. Data without analysis does not have much value. Thus, the demand for data analysis is increasing daily, and the consequence is the appearance of a large number of jobs and published articles. Data science has emerged as a multidisciplinary field to support data-driven activities, integrating and developing ideas, methods, and processes to extract information from data. This includes methods built from different knowledge areas: Statistics, Computer Science, Mathematics, Physics, Information Science, and Engineering. This mixture of areas has given rise to what we call Data Science. New solutions to new problems are multiplying rapidly and generating large volumes of data. Current and future challenges require greater care in creating new solutions that are sound for each type of problem. Labels such as Big Data, Data Science, Machine Learning, Statistical Learning, and Artificial Intelligence demand more sophistication in their foundations and in how they are applied. This point highlights the importance of building the foundations of Data Science. This book is dedicated to solutions and discussions of measuring uncertainties in data analysis problems.

    Multiscale Models Of Interfacial Mechanics In Low Dimensional Systems

    Crucial thrusts in modern technology from electronic information processing to engineering cellular systems require manipulation and control of materials on smaller and smaller scales to succeed. A simple and successful way to break conventional material property limitations or design multifunctional devices is to interface two different materials together. At small length scales, the surface to bulk ratio of each component material increases, to the point that the interfacial physics can dominate the properties of the engineered system. Simultaneously, the combinatorial space of possible interfaces between materials and/or molecules is far too vast to explore by trial-and-error experimentation alone. Intuitive theoretical models can greatly improve our ability to navigate such large search spaces by providing insight on how two materials are likely to interact. The goal of this thesis is to develop predictive physical models which explain emergent phenomena at material interfaces across multiple length and time scales. A variety of state-of-the-art tools were applied to realize this goal, including analytical mathematics, quantum mechanical simulations, finite element methods, and deep neural networks. At the electron scale, a continuum model parametrized by first-principles simulations was employed to develop design criteria for confined quantum states in lateral heterostructures of two-dimensional materials. At the atomic scale, a chemo-mechanical model incorporating long-range electrostatics was developed to explain synthesizability trends in composite heterostructures of inorganic perovskites and organic molecules. A machine learning graph neural network model was developed and applied to predict the impact of general surface strains on the adsorption energy of small molecule intermediates on catalyst surfaces. 
Finally, at the microscale, a nonlinear kinetic model was developed to explain how cells acquire and retain memory of the mechanical properties of their surroundings across multiple timescales, which can lead to irreversible adaptation and differentiation. The methods and results presented in this thesis can improve our understanding of physical phenomena arising at interfaces and provide a blueprint for future applications of multiscale computational modeling to science and engineering problems.

    Bayesian methods and data science with health informatics data

    Cancer is a complex disease, driven by a range of genetic and environmental factors. Every year millions of people are diagnosed with a type of cancer and the survival prognosis for many of them is poor due to the lack of understanding of the causes of some cancers. Modern large-scale studies offer a great opportunity to study the mechanisms underlying different types of cancer but also bring the challenges of selecting informative features, estimating the number of cancer subtypes, and providing interpretable results. In this thesis, we address these challenges by developing efficient clustering algorithms based on Dirichlet process mixture models which can be applied to different data types (continuous, discrete, mixed) and to multiple data sources (in our case, molecular and clinical data) simultaneously. We show how our methodology addresses the drawbacks of widely used clustering methods such as k-means and iClusterPlus. We also introduce a more efficient version of the clustering methods by using simulated annealing in the inference stage. We apply the data integration methods to data from The Cancer Genome Atlas (TCGA), which include clinical and molecular data about glioblastoma, breast cancer, colorectal cancer, and pancreatic cancer. We find subtypes which are prognostic of overall survival in two aggressive types of cancer, pancreatic cancer and glioblastoma, and which were not identified by the comparison models. We analyse a Hospital Episode Statistics (HES) dataset comprising clinical information about all pancreatic cancer patients in the United Kingdom operated on during the period 2001-2016. We investigate the effect of centralisation on the short- and long-term survival of the patients, and the factors affecting patient survival.
Our analyses show that higher volume surgery centres are associated with lower 90-day mortality rates and that age, index of multiple deprivation and diagnosis type are significant risk factors for short-term survival. Our findings suggest that the analysis of large complex molecular datasets, coupled with methodological advances, can allow us to gain valuable insights into the cancer genome and the associated molecular mechanisms.
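One reason Dirichlet process mixture models suit the problem of estimating the number of subtypes is that the prior over partitions does not fix the cluster count in advance. The sketch below draws a partition from the Chinese restaurant process, the partition prior induced by a Dirichlet process. It shows only the prior, not the thesis's inference algorithm or its simulated annealing variant. The function name, the concentration value and the seed are hypothetical.

```python
import random


def sample_crp(n_items, alpha, seed=0):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    Item i joins existing cluster k with probability proportional to the
    cluster's current size, or opens a new cluster with probability
    proportional to alpha (alpha must be positive).
    """
    rng = random.Random(seed)
    counts = []   # counts[k] = number of items currently in cluster k
    labels = []   # labels[i] = cluster index assigned to item i
    for _ in range(n_items):
        weights = counts + [alpha]          # last slot means "new cluster"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts


labels, counts = sample_crp(n_items=200, alpha=1.0)
print(f"{len(counts)} clusters, sizes: {sorted(counts, reverse=True)}")
```

Larger values of alpha favour more clusters. In a full mixture model, each cluster would also carry likelihood parameters for the molecular and clinical variables.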

    Informative sequence-based models for fragment distributions in ChIP-seq, RNA-seq and ChIP-chip data

    Many high throughput sequencing protocols for RNA and DNA require that the polynucleic acid is fragmented so that the identity of a limited number of nucleic acids of one or both of the ends of the fragments can be determined by sequencing. The nucleic acid sequence allows the fragment to be located within the genome, and the fragment distribution can then be used for a variety of different purposes. In the case of DNA this includes identifying the locations where specific proteins are bound to the genome. In the case of RNA this includes quantifying the expression levels of different gene variants or transcripts. If the locations of the polynucleic acid fragments are partly determined by the underlying nucleic acid sequence this could bias any results derived from the data. Unfortunately, such sequence dependencies have already been observed in the distribution of both RNA and DNA fragments. Previous analyses of such data in order to reduce the bias have examined the role of regional characteristics such as GC bias, or the bias towards a specific sequence at the start of the fragments. This thesis introduces a new method for modelling the bias which considers the degree to which the nucleotide sequence affects the likelihood of a fragment originating at that location. This shows that there is often not a single bias characteristic, but multiple, alternative sequence biases that coexist within a single dataset. This also shows that the nucleotide sequence immediately proximal to the fragment also has a significant effect on the fragment likelihood. This new approach highlights characteristics that were previously hidden and provides a more powerful basis for correcting such bias. 
Multiple alternative sequence biases are observed when both RNA and DNA are fragmented, but the more detailed information provided by the new technique shows how the characteristics differ between RNA and DNA and indicates that very different molecular mechanisms are responsible for the biases in the two processes. This thesis also shows how removing the effect of this bias in ChIP-seq experiments can reveal more subtle features of the distribution of the fragments. This can provide information on the nature of the binding between proteins and the DNA with per-nucleotide precision, revealed through the change in likelihood of the DNA fragmenting at each position in the binding site. It is also shown how the model fitting technique developed to analyse sequence bias can be used to obtain additional information from the results of ChIP-chip experiments. The approach is used to find the nucleotide sequence preference of DNA binding proteins, and also the cooperative effects associated with binding at multiple binding sites in close proximity.
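As a point of reference for the kind of start-of-fragment bias the abstract mentions, the sketch below tabulates nucleotide frequencies at each position in a window around fragment start sites, including positions just upstream of the fragment. This is the simple single-profile view, not the multi-component model the thesis introduces. The function name, window sizes and toy sequence are hypothetical.

```python
from collections import Counter


def positional_frequencies(genome, starts, upstream=2, downstream=3):
    """Per-position nucleotide frequencies around fragment start sites.

    For each start s, the window genome[s - upstream : s + downstream] is
    examined, so the first `upstream` columns describe sequence just
    outside the fragment. Windows running off either end are skipped.
    """
    counts = [Counter() for _ in range(upstream + downstream)]
    used = 0
    for s in starts:
        lo, hi = s - upstream, s + downstream
        if lo < 0 or hi > len(genome):
            continue
        for offset, base in enumerate(genome[lo:hi]):
            counts[offset][base] += 1
        used += 1
    if used == 0:
        return []
    return [{base: c[base] / used for base in "ACGT"} for c in counts]


# Toy example: both fragments start right after a "GT" dinucleotide.
profile = positional_frequencies("ACGTACGTACGT", starts=[4, 8])
for offset, freqs in enumerate(profile, start=-2):
    print(offset, freqs)
```

A flat profile would suggest no positional preference. A single averaged profile like this one can hide the coexisting alternative biases that the abstract reports.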