604 research outputs found

The EM Algorithm and the Rise of Computational Biology

Author: Jun S. Liu
Xiaodan Fan
Yuan Yuan
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2010

In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively-minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma" of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis. Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

arXiv.org e-Print Archive

CiteSeerX

Crossref

RSAT variation-tools: An accessible and flexible framework to predict the impact of regulatory variants on transcription factor binding

Author: Behera
Bernstein
Browning
Camacho
Choi
Chèneby
Coetzee
Contreras-Moreira
Deng
Deplancke
Durinck
Eberle
Fang
Hertz
Huang
International HapMap Consortium
Inukai
Kalita
Kaplan
Kersey
Kumar
Lambert
Lee
Lelli
Lin
MacArthur
Manke
Mascher
Medina-Rivera
Nguyen
O’Leary
Quinlan
Ramirez
Seo
Shi
Shin
Stormo
The International Barley Genome Sequencing Consortium
Thurman
Tian
Turatsinze
Ulirsch
van Helden
Wang
Ward
Zabet
Zhou
Zuo
Publication venue: 'Elsevier BV'
Publication date: 01/01/2019

Gene regulatory regions contain short and degenerate DNA binding sites recognized by transcription factors (TFBS). When TFBS harbor SNPs, the DNA binding site may be affected, thereby altering the transcriptional regulation of the target genes. Such regulatory SNPs have been implicated as causal variants in genome-wide association studies (GWAS). In this study, we describe improved versions of the Variation-tools programs, designed to predict regulatory variants, and present four case studies to illustrate their usage and applications. In brief, Variation-tools facilitate i) obtaining variation information, ii) interconversion of variation file formats, iii) retrieval of sequences surrounding variants, and iv) calculating the change in predicted transcription factor affinity scores between alleles, using motif scanning approaches. Notably, the tools support the analysis of haplotypes. The tools are included within the well-maintained suite Regulatory Sequence Analysis Tools (RSAT, http://rsat.eu), and accessible through a web interface that currently enables analysis of five metazoa and ten plant genomes. Variation-tools can also be used on the command line with any locally installed Ensembl genome. Users can input personal collections of variants and motifs, providing flexibility in the analysis.
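The allele comparison described in point iv of the abstract can be illustrated with a minimal Python sketch. This is not the RSAT implementation: the toy count matrix, the pseudocount, the uniform background, the forward-strand-only scan, and the function names `log_odds_matrix`, `best_window_score`, and `allele_score_change` are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2-odds weights (assumed uniform background)."""
    matrix = []
    for column in counts:
        total = sum(column[base] for base in BASES) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in BASES
        })
    return matrix

def best_window_score(matrix, sequence):
    """Best motif score over all forward-strand windows; the sequence must be at least motif-length."""
    width = len(matrix)
    return max(
        sum(matrix[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )

def allele_score_change(matrix, left_flank, ref, alt, right_flank):
    """Score difference (alt minus ref) for a variant embedded in its flanking sequence."""
    ref_score = best_window_score(matrix, left_flank + ref + right_flank)
    alt_score = best_window_score(matrix, left_flank + alt + right_flank)
    return alt_score - ref_score

# Hypothetical 4-bp motif that prefers "TGAC".
TOY_COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]

pwm = log_odds_matrix(TOY_COUNTS)
# The reference allele "A" completes the site TGAC; the alternate allele "T" disrupts it.
delta = allele_score_change(pwm, "AATG", "A", "T", "CGG")
print(f"score change (alt - ref): {delta:.3f}")
```

A negative change suggests that the alternate allele weakens the predicted binding site. A fuller analysis would also scan the reverse strand and assess the significance of each score.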

Crossref

HAL AMU

Repositorio Universidad de Zaragoza

HAL Descartes

Digital.CSIC

Algorithms in comparative genomics

Author: Chikkagoudar Satish
Publication venue: Digital Commons @ NJIT
Publication date: 31/01/2010

The field of comparative genomics is abundant with problems of interest to computer scientists. In this thesis, the author presents solutions to three contemporary problems: obtaining better alignments for phylogeny reconstruction, identifying related RNA sequences in genomes, and ranking Single Nucleotide Polymorphisms (SNPs) in genome-wide association studies (GWAS). Sequence alignment is a basic and widely used task in bioinformatics. Its applications include identifying protein structure, RNAs and transcription factor binding sites in genomes, and phylogeny reconstruction. Phylogenetic descriptions depend not only on the employed reconstruction technique, but also on the underlying sequence alignment. The author has studied and established a simple prescription for obtaining a better phylogeny by improving the underlying alignments used in phylogeny reconstruction. This was achieved by improving upon Gotoh's iterative heuristic by iterating with maximum parsimony guide-trees. This approach has shown an improvement in accuracy over standard alignment programs. A novel alignment algorithm named Probalign-RNAgenome that can identify non-coding RNAs in genomic sequences was also developed. Non-coding RNAs play critical roles in the cell, such as gene regulation. It is thought that many such RNAs lie undiscovered in the genome. To date, alignment-based approaches have been shown to be more accurate than thermodynamic methods for identifying such non-coding RNAs. Probalign-RNAgenome employs a probabilistic consistency-based approach for aligning a query RNA sequence to its homolog in a genomic sequence. Results show that this approach is more accurate on real data than the widely used BLAST and Smith-Waterman algorithms. Within the realm of comparative genomics are also a large number of recently conducted GWAS. GWAS aim to identify regions in the genome that are associated with a given disease.
The support vector machine (SVM) provides a discriminative alternative to the widely used chi-square statistic in GWAS. A novel hybrid strategy that combines the chi-square statistic with the SVM was developed and implemented. Its performance was studied on simulated data and the Wellcome Trust Case Control Consortium (WTCCC) studies. Results presented in this thesis show that the hybrid strategy ranks causal SNPs in simulated data significantly higher than the chi-square test and SVM alone. The results also show that the hybrid strategy ranks previously replicated SNPs and associated regions (where applicable) of type 1 diabetes, rheumatoid arthritis, and Crohn's disease higher than the chi-square, SVM, and SVM Recursive Feature Elimination (SVM-RFE)
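For orientation, the chi-square baseline mentioned in this abstract can be sketched in a few lines of Python. This shows only a plain allelic chi-square ranking, not the thesis's hybrid SVM strategy; the 2x2 allele-count layout, the function names, and the SNP identifiers are assumptions for illustration.

```python
def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic for a 2x2 table of allele counts (cases vs controls)."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                # Degenerate table (an empty row or column): no evidence either way.
                return 0.0
            statistic += (table[i][j] - expected) ** 2 / expected
    return statistic

def rank_snps(allele_counts):
    """Order SNP identifiers from strongest to weakest association."""
    scores = {snp: allelic_chi_square(*counts) for snp, counts in allele_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical counts: (case_alt, case_ref, control_alt, control_ref).
example_counts = {
    "snp_balanced": (50, 50, 50, 50),
    "snp_skewed": (60, 40, 40, 60),
}
print(rank_snps(example_counts))
```

Ranking by the statistic alone treats each SNP independently, which is the limitation a discriminative model such as an SVM is meant to address.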

Digital Commons @ New Jersey Institute of Technology (NJIT)

Network-based methods for biological data integration in precision medicine

Author: Núñez Carpintero Iker
Publication venue: 'Edicions de la Universitat de Barcelona'
Publication date: 14/11/2023

The vast and continuously increasing volume of available biomedical data produced during the last decades opens new opportunities for large-scale modeling of disease biology, facilitating a more comprehensive and integrative understanding of its processes. Nevertheless, this type of modelling requires highly efficient computational systems capable of dealing with such data volumes. Computational approximations commonly used in machine learning and data analysis, namely dimensionality reduction and network-based approaches, have been developed with the goal of effectively integrating biomedical data. Among these methods, network-based machine learning stands out due to its major advantage in terms of biomedical interpretability. These methodologies provide a highly intuitive framework for the integration and modelling of biological processes. This PhD thesis aims to explore the potential of integrating complementary available biomedical knowledge with patient-specific data to provide novel computational approaches to solve biomedical scenarios characterized by data scarcity. The primary focus is on studying how high-order graph analysis (i.e., community detection in multiplex and multilayer networks) may help elucidate the interplay of different types of data in contexts where statistical power is heavily impacted by small sample sizes, such as rare diseases and precision oncology. The central focus of this thesis is to illustrate how network biology, among the several data integration approaches with the potential to achieve this task, can play a pivotal role in addressing this challenge given its advantages in molecular interpretability.
Through its insights and methodologies, it introduces how network biology, and in particular, models based on multilayer networks, facilitates bringing the vision of precision medicine to these complex scenarios, providing a natural approach for the discovery of new biomedical relationships that overcomes the difficulties of studying cohorts with limited sample sizes (data-scarce scenarios). Delving into the potential of current artificial intelligence (AI) and network biology applications to address data granularity issues in the precision medicine field, this PhD thesis presents pivotal research works, based on multilayer networks, for the analysis of two rare disease scenarios with specific data granularities, effectively overcoming the classical constraints hindering rare disease and precision oncology research. The first research article presents a personalized medicine study of the molecular determinants of severity in congenital myasthenic syndromes (CMS), a group of rare disorders of the neuromuscular junction (NMJ). The analysis of severity in rare diseases, despite its importance, is typically neglected due to limited data availability. In this study, modelling of biomedical knowledge via multilayer networks allowed understanding the functional implications of individual mutations in the cohort under study, as well as their relationships with the causal mutations of the disease and the different levels of severity observed. Moreover, the study presents experimental evidence of the role of a previously unsuspected gene in NMJ activity, validating the hypothetical role predicted using the newly introduced methodologies. The second research article focuses on the applicability of multilayer networks for gene prioritization.
Enhancing concepts for the analysis of different data granularities first introduced in the previous article, the presented research provides a methodology based on the persistence of network community structures across a range of modularity resolutions, effectively providing a new framework for gene prioritization for patient stratification. In summary, this PhD thesis presents major advances in the use of multilayer network-based approaches for the application of precision medicine to data-scarce scenarios, exploring the potential of integrating extensive available biomedical knowledge with patient-specific data.

Diposit Digital de la Universitat de Barcelona

Application of a Naïve Bayes Classifier to Assign Polyadenylation Sites from 3' End Deep Sequencing Data: A Dissertation

Author: Sheppard Sarah E.
Publication venue: eScholarship@UMassChan
Publication date: 29/04/2013

Cleavage and polyadenylation of a precursor mRNA is important for transcription termination, mRNA stability, and regulation of gene expression. This process is directed by a multitude of protein factors and cis elements in the pre-mRNA sequence surrounding the cleavage and polyadenylation site. Importantly, the location of the cleavage and polyadenylation site helps define the 3’ untranslated region of a transcript, which is important for regulation by microRNAs and RNA binding proteins. Additionally, these sites have generally been poorly annotated. To identify 3’ ends, many techniques utilize an oligo-dT primer to construct deep sequencing libraries. However, this approach can lead to identification of artifactual polyadenylation sites due to internal priming in homopolymeric stretches of adenines. Previously, simple heuristic filters relying on the number of adenines in the genomic sequence downstream of a putative polyadenylation site have been used to remove these sites of internal priming. However, these simple filters may not remove all sites of internal priming and may also exclude true polyadenylation sites. Therefore, I developed a naïve Bayes classifier to identify putative sites from oligo-dT primed 3’ end deep sequencing as true or false/internally primed. Notably, this algorithm uses a combination of sequence elements to distinguish between true and false sites. Finally, the resulting algorithm is highly accurate in multiple model systems and facilitates identification of novel polyadenylation sites
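The simple heuristic filter that this abstract contrasts with the classifier can be sketched as follows. This is an illustration of the downstream adenine count idea only, not the dissertation's naïve Bayes classifier; the 10-nucleotide window, the threshold of 6 adenines, and both function names are hypothetical choices.

```python
def downstream_adenine_count(genome_sequence, site_index, window=10):
    """Count adenines in the genomic window immediately downstream of a putative site."""
    downstream = genome_sequence[site_index + 1 : site_index + 1 + window]
    return downstream.upper().count("A")

def looks_internally_primed(genome_sequence, site_index, window=10, max_adenines=6):
    """Flag a site as likely internal priming when the downstream window is A-rich.

    The window size and threshold are assumed values, not taken from the source.
    """
    return downstream_adenine_count(genome_sequence, site_index, window) > max_adenines

# Hypothetical sequences: one with a genomic A-stretch after the site, one without.
a_rich = "GCGT" + "A" * 10 + "GC"
mixed = "GCGTCGTACGTTGCA"
print(looks_internally_primed(a_rich, 3), looks_internally_primed(mixed, 3))
```

A fixed cutoff like this is exactly what can discard true sites that happen to lie next to A-rich sequence, which is the motivation the abstract gives for combining several sequence elements in a probabilistic classifier.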

eScholarship@UMMS

Genomic analyses identify hundreds of variants associated with age at menarche and support a role for puberty timing in cancer risk

Author: Albrecht Eva
Alizadeh Behrooz Z.
Altmaier Elisabeth
Amini Marzyeh
Andrulis Irene L.
Bandinelli Stefania
Barbieri Caterina M.
Beckmann Matthias W.
Benitez Javier
Bergmann Sven
Bochud Murielle
Boerwinkle Eric
Bojesen Stig E.
Bolla Manjeet K.
Boomsma Dorret I.
Boutin Thibaud
Brand Judith S.
Brauch Hiltrud
Brenner Hermann
Broer Linda
Brüning Thomas
Buring Julie E.
Campbell Archie
Campbell Harry
Catamo Eulalia
Chang-Claude Jenny
Chanock Stephen
Chasman Daniel I.
Chenevix-Trench Georgia
Ciullo Marina
Corre Tanguy
Couch Fergus J.
Cousminer Diana L.
Cox Angela
Crisponi Laura
Cucca Francesco
Czene Kamila
Davey Smith George
Day Felix R.
De Geus Eco J.C.N.
De Mutsert Renée
De Vivo Immaculata
Demerath Ellen
Dennis Joe
Devilee Peter
Dos-Santos-Silva Isabel
Dunning Alison M.
Easton Douglas F.
Edwards Digna R. Velez
Eriksson Johan G.
Esko Tõnu
Fasching Peter A.
Fernández-Rhodes Lindsay
Ferrucci Luigi
Finucane Hilary
Flesch-Janys Dieter
Franceschini Nora
Franke Lude
Gabrielson Marike
Gandin Ilaria
Gieger Christian
Giles Graham G.
Giri Ayush
Grallert Harald
Gudbjartsson Daniel F.
Gudnason Vilmundur
Guénel Pascal
Hall Per
Hallberg Emily
Hamann Ute
Harris Tamara B.
Hartman Catharina A.
Hayward Caroline
He Chunyan
Heiss Gerardo
Helgason Hannes
Hinds David
Hooning Maartje J.
Hopper John L.
Hottenga Jouke J.
Hu Frank
Hunter David J.
Ikram M. Arfan
Im Hae Kyung
Joshi Peter K.
Järvelin Marjo-Riitta
Karasik David
Karlsson Robert
Kellis Manolis
Kolcic Ivana
Kraft Peter
Kutalik Zoltan
Lachance Genevieve
Lambrechts Diether
Langenberg Claudia
Launer Lenore J.
Laven Joop S.E.
Lawlor Debbie A.
Lenarduzzi Stefania
Li Jingmei
Lind Penelope A.
Lindstrom Sara
Liu Yongmei
Loh Po-Ru
Luan Jian'An
Lunetta Kathryn L.
Magnusson Patrik K.E.
Mangino Massimo
Mannermaa Arto
Marco Brumat
Martin Nicholas G.
Mbarek Hamdi
McCarthy Mark I.
McMahon George
Medland Sarah E.
Meisinger Christa
Meitinger Thomas
Menni Cristina
Metspalu Andres
Michailidou Kyriaki
Milani Lili
Milne Roger L.
Montgomery Grant W.
Mook-Kanamori Dennis O.
Mulligan Anna M.
Murabito Joanne M.
Murray Anna
Mägi Reedik
Nalls Mike A.
Navarro Pau
Nevanlinna Heli
Nohr Ellen A.
Nolte Ilja M.
Noordam Raymond
Nutile Teresa
Nyholt Dale R.
O'Mara Tracy A.
Oldehinkel Albertine J.
Ong Ken K.
Padmanabhan Sandosh
Palotie Aarno
Paternoster Lavinia
Pedersen Nancy
Perjakova Natalia
Perry John R.B.
Peters Annette
Peto Julian
Pharoah Paul D.P.
Polasek Ozren
Pollard Katherine S.
Porcu Eleonora
Porteous David
Pouta Anneli
Price Alkes L.
Radice Paolo
Rahman Iffat
Ridker Paul M.
Ring Susan M.
Robino Antonietta
Rose Lynda M.
Rosendaal Frits R.
Rudan Igor
Rueedi Rico
Ruggiero Daniela
Ruth Katherine S.
Sala Cinzia F.
Sarkar Abhishek K.
Schmidt Marjanka K.
Schraut Katharina E.
Scott Robert A.
Segrè Ayellet V.
Shah Mitul
Smith Albert V.
Snieder Harold
Sorice Rossella
Southey Melissa C.
Sovio Ulla
Spector Tim D.
Spurdle Amanda B.
Stampfer Meir
Stefansson Kari
Steri Maristella
Stolk Lisette
Strauch Konstantin
Stöckl Doris
Sulem Patrick
Tanaka Toshiko
Teumer Alexander
Thompson Deborah J.
Thorsteinsdottir Unnur
Tikkanen Emmi
Timpson Nicholas J.
Toniolo Daniela
Traglia Michela
Truong Thérèse
Tung Joyce Y.
Tyrer Jonathan P.
Uitterlinden André G.
Ulivi Sheila
Van Dijk Ko Willems
Visser Jenny A.
Vitart Veronique
Vollenweider Peter
Völker Uwe
Völzke Henry
Wang Qin
Wareham Nicholas J.
Whalen Sean
Widen Elisabeth
Willemsen Gonneke
Wilson James F.
Winqvist Robert
Wolffenbuttel Bruce H.R.
Zhao Jing Hua
Zoledziewska Magdalena
Zygmunt Marek
Publication date: 01/01/2017

The timing of puberty is a highly polygenic childhood trait that is epidemiologically associated with various adult diseases. Using 1000 Genomes Project-imputed genotype data in up to ~370,000 women, we identify 389 independent signals (P < 5 × 10⁻⁸) for age at menarche, a milestone in female pubertal development. In Icelandic data, these signals explain ~7.4% of the population variance in age at menarche, corresponding to ~25% of the estimated heritability. We implicate ~250 genes via coding variation or associated expression, demonstrating significant enrichment in neural tissues. Rare variants near the imprinted genes MKRN3 and DLK1 were identified, exhibiting large effects when paternally inherited. Mendelian randomization analyses suggest causal inverse associations, independent of body mass index (BMI), between puberty timing and risks for breast and endometrial cancers in women and prostate cancer in men. In aggregate, our findings highlight the complexity of the genetic regulation of puberty timing and support causal links with cancer susceptibility.

Carolina Digital Repository

Identification of Deleterious and Disease Alleles in a General Population and Preterm Labor Patients

Author: Chun Sung Gook
Publication venue: Washington University Open Scholarship
Publication date: 31/07/2012

With recent advances in sequencing technology, there has been growing interest in developing new methods to predict disease-causing alleles in a personal genome by integrating functional evidence from sequence conservation, genome-wide association studies and the transcriptional regulatory network. However, even in protein-coding regions, it is not well understood how often and by what mechanism deleterious alleles disrupting strong sequence conservation can become common in population frequency and affect complex traits in humans. Moreover, in non-coding regions, even for known disease-causing genes, it is not clear how sequence conservation can be combined with functional genomic data to predict underlying disease-causing variants. To address the first question, I developed a new likelihood ratio test for sequence conservation to predict deleterious missense alleles in the human genome. By applying the new test to three personal genomes, I find that the presence of only 10% of common deleterious SNPs can be explained by false positives due to multiple hypothesis testing, violation of evolutionary model assumptions, recent gene duplication and relaxation of selective constraints on biological processes. Next, by applying the likelihood ratio test to a general human population, I find that both computationally predicted deleterious SNPs and known disease-associated alleles are enriched within genomic regions that have been influenced by positive selection in the recent past. The observed pattern agrees with the prediction that deleterious alleles can be dragged along to higher-than-expected allele frequencies by the hitchhiking effect, owing to genetic linkage with beneficial alleles. Second, I developed an integrative strategy to predict disease-causing non-coding variants in FSH receptor, a gene known to be associated with preterm birth, as a proof of principle.
I sequenced protein-coding and conserved non-coding regions in preterm and term mothers, and conducted fine-mapping and transcription factor binding site analysis to narrow down the causal non-coding variants. Here, I find that in non-coding regions the causal variants can be resolved better by accounting for the expected effects of binding site mutations on the transcription regulatory network in addition to sequence conservation. These results indicate that comparative genomics will provide a new opportunity to explore deleterious and disease-causing genetic variation at an unprecedentedly high resolution across the genome and in a population, especially if functional genomics can be integrated with comparative genomics.

Washington University St. Louis: Open Scholarship

Genome-Wide Analysis of Natural Selection on Human Cis-Elements

Author: AG Clark
BP Lewis
CD Bustamante
DA Hinds
EC Bush
ET Dermitzakis
F Jacob
G Dennis Jr
G Wang
GA Wray
GV Kryukov
Hoa Giang
JA Drake
JC Fay
JD Storey
JH McDonald
Joshua B. Plotkin
K Chen
L Everett
Matthew W. Hahn
MC King
MW Hahn
P Sethupathy
PD Keightley
Praveen Sethupathy
R Haygood
R Nielsen
RJ Britten
RS Spielman
S Levy
SA Sawyer
Sridhar Hannenhalli
TD Schneider
V Matys
Publication venue: Public Library of Science
Publication date: 01/01/2008

Background: It has been speculated that the polymorphisms in the non-coding portion of the human genome underlie much of the phenotypic variability among humans and between humans and other primates. If so, these genomic regions may be undergoing rapid evolutionary change, due in part to natural selection. However, the non-coding region is a heterogeneous mix of functional and non-functional regions. Furthermore, the functional regions are comprised of a variety of different types of elements, each under potentially different selection regimes. Findings and Conclusions: Using the HapMap and Perlegen polymorphism data that map to a stringent set of putative binding sites in human proximal promoters, we apply the Derived Allele Frequency distribution test of neutrality to provide evidence that many human-specific and primate-specific binding sites are likely evolving under positive selection. We also discuss inherent limitations of publicly available human SNP datasets that complicate the inference of selection pressures. Finally, we show that the genes whose proximal binding sites contain high frequency derived alleles are enriched for positive regulation of protein metabolism and developmental processes. Thus our genome-scale investigation provide

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Topics in Signal Processing: applications in genomics and genetics

Author: Elmas Abdulkadir
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2016

The information in genomic or genetic data is influenced by various complex processes, and appropriate mathematical modeling is required for studying the underlying processes and the data. This dissertation focuses on the formulation of mathematical models for certain problems in genomics and genetics studies and the development of algorithms for proposing efficient solutions. A Bayesian approach to transcription factor (TF) motif discovery is examined, and extensions are proposed to deal with the many interdependent parameters of TF-DNA binding. The problem is described in statistical terms and a sequential Monte Carlo sampling method is employed for the estimation of unknown parameters. In particular, a class-based resampling approach is applied for the accurate estimation of a set of intrinsic properties of the DNA binding sites. Through statistical analysis of gene expression, a motif-based computational approach is developed for the inference of novel regulatory networks in a given bacterial genome. To deal with high false-discovery rates in genome-wide TF binding predictions, discriminative learning approaches are examined in the context of sequence classification, and a novel mathematical model is introduced to the family of kernel-based Support Vector Machine classifiers. Furthermore, the problem of haplotype phasing is examined based on the genetic data obtained from cost-effective genotyping technologies. Based on the identification and augmentation of a small and relatively more informative genotype set, a sparse dictionary selection algorithm is developed to infer the haplotype pairs for the sampled population. In a relevant context, to detect redundant information in the single nucleotide polymorphism (SNP) sites, the problem of representative (tag) SNP selection is introduced.
An information-theoretic heuristic is designed for the accurate selection of tag SNPs that capture the genetic diversity in a large sample set from multiple populations. The method is based on a multi-locus mutual information measure, reflecting a biological principle in population genetics, namely linkage disequilibrium.
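To make the mutual information idea concrete, here is a minimal Python sketch for the simplest case of two loci. It is a pairwise simplification, not the dissertation's multi-locus measure or its tag SNP selection heuristic; the 0/1 allele coding and the function name are assumptions.

```python
import math
from collections import Counter

def mutual_information(alleles_a, alleles_b):
    """Empirical mutual information (in bits) between two loci observed on the same samples."""
    n = len(alleles_a)
    if n == 0 or n != len(alleles_b):
        raise ValueError("both loci need the same non-zero number of observations")
    joint = Counter(zip(alleles_a, alleles_b))
    marginal_a = Counter(alleles_a)
    marginal_b = Counter(alleles_b)
    information = 0.0
    for (a, b), count in joint.items():
        p_joint = count / n
        p_independent = (marginal_a[a] / n) * (marginal_b[b] / n)
        information += p_joint * math.log2(p_joint / p_independent)
    return information

# Hypothetical haplotype columns coded 0/1.
linked = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
unlinked = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(linked, unlinked)
```

Two loci in strong linkage disequilibrium share high mutual information, so one of them can stand in for the other as a tag; loci that vary independently share none.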

Columbia University Academic Commons

Co-regulatory expression quantitative trait loci mapping: method and application to endometrial cancer

Background: Expression quantitative trait loci (eQTL) studies have helped identify the genetic determinants of gene expression. Understanding the potential interacting mechanisms underlying such findings, however, is challenging. Methods: We describe a method to identify the trans-acting drivers of multiple gene co-expression, which reflects the action of regulatory molecules. This method, termed co-regulatory expression quantitative trait locus (creQTL) mapping, allows for evaluation of a more focused set of phenotypes within a clear biological context than conventional eQTL mapping. Results: Applying this method to a study of endometrial cancer revealed regulatory mechanisms supported by the literature: a creQTL between a locus upstream of STARD13/DLC2 and a group of seven IFNβ-induced genes. This suggests that the Rho-GTPase encoded by STARD13 regulates IFNβ-induced genes and the DNA damage response. Conclusions: Because of the importance of IFNβ in cancer, our results suggest that creQTL may provide a finer picture of gene regulation and may reveal additional molecular targets for intervention. An open source R implementation of the method is available at http://sites.google.com/site/kenkompass/.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central