Search CORE

30 research outputs found

Artificial intelligence used in genome analysis studies

Author: D'Agaro Edo
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 01/01/2018

Next Generation Sequencing (NGS), or deep sequencing technology, enables parallel reading of multiple individual DNA fragments, thereby enabling the identification of millions of base pairs in several hours. Recent research has clearly shown that machine learning technologies can efficiently analyse large sets of genomic data and help to identify novel gene functions and regulatory regions. A deep artificial neural network consists of a group of artificial neurons that mimic the properties of living neurons. These mathematical models, termed Artificial Neural Networks (ANNs), can be used to solve artificial intelligence engineering problems in several different technological fields (e.g., biology, genomics, proteomics, and metabolomics). In practical terms, neural networks are non-linear statistical structures that are organized as modelling tools and are used to simulate complex genomic relationships between inputs and outputs. To date, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been demonstrated to be the best tools for improving performance in problem-solving tasks within the genomic field.

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Udine

Directory of Open Access Journals

A decision-theoretic approach for segmental classification

Author: Holmes Christopher C.
Yau Christopher
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2013

This paper is concerned with statistical methods for the segmental classification of linear sequence data where the task is to segment and classify the data according to an underlying hidden discrete state sequence. Such analysis is commonplace in the empirical sciences including genomics, finance and speech processing. In particular, we are interested in answering the following question: given data $y$ and a statistical model $\pi(x,y)$ of the hidden states $x$, what should we report as the prediction $\hat{x}$ under the posterior distribution $\pi(x|y)$? That is, how should you make a prediction of the underlying states? We demonstrate that traditional approaches such as reporting the most probable state sequence or most probable set of marginal predictions can give undesirable classification artefacts and offer limited control over the properties of the prediction. We propose a decision theoretic approach using a novel class of Markov loss functions and report $\hat{x}$ via the principle of minimum expected loss (maximum expected utility). We demonstrate that the sequence of minimum expected loss under the Markov loss function can be enumerated exactly using dynamic programming methods and that it offers flexibility and performance improvements over existing techniques. The result is generic and applicable to any probabilistic model on a sequence, such as Hidden Markov models, change point or product partition models. Comment: Published in at http://dx.doi.org/10.1214/13-AOAS657 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org
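The two traditional reports that the abstract contrasts with its proposal can be sketched for a small discrete hidden Markov model. This is a minimal illustration and not the authors' method: the Markov loss functions they propose are not implemented here. The function names and the parameters `pi0` (initial distribution), `A` (transition matrix) and `B` (emission matrix) are assumptions introduced for the example, and all probabilities are assumed strictly positive so that the logarithms are finite.

```python
import numpy as np


def viterbi(pi0, A, B, obs):
    """Most probable joint state sequence under pi(x | y) (the MAP path)."""
    T, K = len(obs), len(pi0)
    logd = np.log(pi0) + np.log(B[:, obs[0]])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best log-probability of a path ending in i, then moving to j
        scores = logd[:, None] + np.log(A)
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


def posterior_marginals(pi0, A, B, obs):
    """Forward-backward recursion: row t holds P(x_t = k | y) for each state k."""
    T, K = len(obs), len(pi0)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi0 * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()  # rescale to avoid underflow
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)


def marginal_decode(pi0, A, B, obs):
    """Most probable state at each position, chosen independently per position."""
    return posterior_marginals(pi0, A, B, obs).argmax(axis=1).tolist()


if __name__ == "__main__":
    # Hypothetical two-state model with binary observations.
    pi0 = np.array([0.6, 0.4])
    A = np.array([[0.9, 0.1], [0.2, 0.8]])
    B = np.array([[0.7, 0.3], [0.1, 0.9]])
    obs = [0, 1, 1, 0, 1]
    print("MAP path:       ", viterbi(pi0, A, B, obs))
    print("Marginal report:", marginal_decode(pi0, A, B, obs))
```

In decision-theoretic terms, the MAP path minimises expected 0-1 loss on the whole sequence, while the marginal report minimises expected Hamming loss (the number of misclassified positions). The two can disagree, which is the kind of behaviour that motivates choosing the loss function explicitly.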

arXiv.org e-Print Archive

Crossref

Oxford University Research Archive

Exploratory analysis of genomic segmentations with Segtools

Author: Buske Orion J
Hoffman Michael M
Le Roch Karine G
Noble William Stafford
Ponts Nadia
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: As genome-wide experiments and annotations become more prevalent, researchers increasingly require tools to help interpret data at this scale. Many functional genomics experiments involve partitioning the genome into labeled segments, such that segments sharing the same label exhibit one or more biochemical or functional traits. For example, a collection of ChIP-seq experiments yields a compendium of peaks, each labeled with one or more associated DNA-binding proteins. Similarly, manually or automatically generated annotations of functional genomic elements, including cis-regulatory modules and protein-coding or RNA genes, can also be summarized as genomic segmentations. Results: We present a software toolkit called Segtools that simplifies and automates the exploration of genomic segmentations. The software operates as a series of interacting tools, each of which provides one mode of summarization. These various tools can be pipelined and summarized in a single HTML page. We describe the Segtools toolkit and demonstrate its use in interpreting a collection of human histone modification data sets and Plasmodium falciparum local chromatin structure data sets. Conclusions: Segtools provides a convenient, powerful means of interpreting a genomic segmentation.
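As a rough illustration of what a genomic segmentation is and of one simple mode of summarization, the sketch below represents a segmentation as labeled intervals and reports, per label, the number of segments and the bases covered. It does not use or reproduce the Segtools API; the tuple layout `(chrom, start, end, label)`, the half-open coordinate convention and the function name `summarize_by_label` are assumptions made for the example.

```python
from collections import defaultdict


def summarize_by_label(segments):
    """Return {label: (segment_count, total_bp)} for (chrom, start, end, label) tuples.

    Coordinates are assumed to be half-open, so a segment covers end - start bases.
    """
    counts = defaultdict(int)
    lengths = defaultdict(int)
    for chrom, start, end, label in segments:
        if end <= start:
            raise ValueError(f"empty or inverted segment on {chrom}: {start}-{end}")
        counts[label] += 1
        lengths[label] += end - start
    return {label: (counts[label], lengths[label]) for label in counts}


if __name__ == "__main__":
    # Hypothetical toy segmentation with two labels.
    toy = [
        ("chr1", 0, 100, "A"),
        ("chr1", 100, 250, "B"),
        ("chr2", 0, 50, "A"),
    ]
    for label, (n, bp) in sorted(summarize_by_label(toy).items()):
        print(label, n, bp)
```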

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

HAL Descartes

eScholarship - University of California

ProdInra

Simultaneous characterization of sense and antisense genomic processes by the double-stranded hidden Markov model

Author: Dümcke Sebastian
Gagneur Julien
Glas Julia
Poron Don
Tresch Achim
Zacher Benedikt
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2016

Hidden Markov models (HMMs) have been extensively used to dissect the genome into functionally distinct regions using data such as RNA expression or DNA binding measurements. It is a challenge to disentangle processes occurring on complementary strands of the same genomic region. We present the double-stranded HMM (dsHMM), a model for the strand-specific analysis of genomic processes. We applied dsHMM to yeast using strand-specific transcription data, nucleosome data, and protein binding data for a set of 11 factors associated with the regulation of transcription. The resulting annotation recovers the mRNA transcription cycle (initiation, elongation, termination) while correctly predicting strand-specificity and directionality of the transcription process. We find that pre-initiation complex formation is an essentially undirected process, giving rise to a large number of bidirectional promoters and to pervasive antisense transcription. Notably, 12% of all transcriptionally active positions showed simultaneous activity on both strands. Furthermore, dsHMM reveals that antisense transcription is specifically suppressed by Nrd1, a yeast termination factor.

Genome-Wide Copy Number Variation in Epilepsy: Novel Susceptibility Loci in Idiopathic Generalized and Focal Epilepsies

Author: A Escayg
A Itsara
AC Need
AJ Sharp
Alain Malafosse
Alexander G. Bassuk
Andre Franke
AT Pagnamenta
B Bakkaloglu
B Xu
BB de Vries
BW van Bon
C Shaw-Smith
Carl Baker
CG de Kovel
Christina A. Gurnett
CR Marshall
DA Koolen
DE Arking
DT Miller
EG Bochukova
EK Bijlsma
Evan E. Eichler
FD Hannes
G Kirov
G Sagoo
H Doose
H Stefansson
HC Mefford
Heather C. Mefford
Hiltrud Muhle
I Helbig
Ingo Helbig
J Christiansen
J Sebat
JA Bailey
JI Friedman
JM Friedman
Karen Buysse
L Claes
LA Weiss
LG Shaffer
LM Dibbens
M Alarcon
M Shinawi
MG Butler
Michel Guipponi
MT Bonati
N Brunetti-Pierri
N Day
NL Taske
P Szatmari
Philipp Ostertag
Pierre Genton
Pierre Thomas
R Redon
R Sultana
R Ullmann
RA Kumar
S Ben-Shachar
S Girirajan
Sarah von Spiczak
SC Greenway
SE McCarthy
SL Christian
SL Hartley
Stefan Schreiber
T Fujiwara
T Sahoo
T Walsh
Ulrich Stephani
VM Kalscheuer
WA Hauser
Wayne N. Frankel
Publication venue: Public Library of Science
Publication date: 01/01/2010

Epilepsy is one of the most common neurological disorders in humans with a prevalence of 1% and a lifetime incidence of 3%. Several genes have been identified in rare autosomal dominant and severe sporadic forms of epilepsy, but the genetic cause is unknown in the vast majority of cases. Copy number variants (CNVs) are known to play an important role in the genetic etiology of many neurodevelopmental disorders, including intellectual disability (ID), autism, and schizophrenia. Genome-wide studies of copy number variation in epilepsy have not been performed. We have applied whole-genome oligonucleotide array comparative genomic hybridization to a cohort of 517 individuals with various idiopathic, non-lesional epilepsies. We detected one or more rare genic CNVs in 8.9% of affected individuals that are not present in 2,493 controls; five individuals had two rare CNVs. We identified CNVs in genes previously implicated in other neurodevelopmental disorders, including two deletions in AUTS2 and one deletion in CNTNAP2. Therefore, our findings indicate that rare CNVs are likely to contribute to a broad range of generalized and focal epilepsies. In addition, we find that 2.9% of patients carry deletions at 15q11.2, 15q13.3, or 16p13.11, genomic hotspots previously associated with ID, autism, or schizophrenia. In summary, our findings suggest common etiological factors for seemingly diverse diseases such as ID, autism, schizophrenia, and epilepsy

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Radboud Repository

Archive ouverte UNIGE

The Effect of Algorithms on Copy Number Variant Detection

Author: A Itsara
Anita Brandstaetter
B Xu
Benjamin Ely
Chang-En Yu
CM Carvalho
D Tsuang
D Zhang
DA Peiffer
Debby W. Tsuang
EE Eichler
GH Perry
GM Cooper
H Stefansson
Illumina
JI Nurnberger Jr
JM Kidd
JM Korn
JR Lupski
K Wang
Kenneth Wang
L Winchester
LV Wain
ME Calkins
ME Maxwell
MJ Khoury
N Day
NP Carter
P Szatmari
Peter Chi
R Pique-Regi
R Redon
S Colella
SA McCarroll
Steven P. Millard
Sulgi Kim
T Walsh
TA Manolio
Wendy H. Raskind
Zoran Brkanac
Publication venue: Public Library of Science
Publication date: 30/12/2010

BACKGROUND: The detection of copy number variants (CNVs) and the results of CNV-disease association studies rely on how CNVs are defined, and because array-based technologies can only infer CNVs, CNV-calling algorithms can produce vastly different findings. Several authors have noted the large-scale variability between CNV-detection methods, as well as the substantial false positive and false negative rates associated with those methods. In this study, we use variations of four common algorithms for CNV detection (PennCNV, QuantiSNP, HMMSeg, and cnvPartition) and two definitions of overlap (any overlap and an overlap of at least 40% of the smaller CNV) to illustrate the effects of varying algorithms and definitions of overlap on CNV discovery. METHODOLOGY AND PRINCIPAL FINDINGS: We used a 56 K Illumina genotyping array enriched for CNV regions to generate hybridization intensities and allele frequencies for 48 Caucasian schizophrenia cases and 48 age-, ethnicity-, and gender-matched control subjects. No algorithm found a difference in CNV burden between the two groups. However, the total number of CNVs called ranged from 102 to 3,765 across algorithms. The mean CNV size ranged from 46 kb to 787 kb, and the average number of CNVs per subject ranged from 1 to 39. The number of novel CNVs not previously reported in normal subjects ranged from 0 to 212. CONCLUSIONS AND SIGNIFICANCE: Motivated by the availability of multiple publicly available genome-wide SNP arrays, investigators are conducting numerous analyses to identify putative additional CNVs in complex genetic disorders. However, the number of CNVs identified in array-based studies, and whether these CNVs are novel or valid, will depend on the algorithm(s) used. Thus, given the variety of methods used, there will be many false positives and false negatives. Both guidelines for the identification of CNVs inferred from high-density arrays and the establishment of a gold standard for validation of CNVs are needed
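The two definitions of overlap named in the abstract (any overlap, and an overlap of at least 40% of the smaller CNV) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study's code: CNVs are represented as `(start, end)` pairs on the same chromosome with half-open coordinates, and the function names are hypothetical.

```python
def overlap_bp(a, b):
    """Number of base pairs shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlaps_any(a, b):
    """Definition 1: the two CNVs share at least one base pair."""
    return overlap_bp(a, b) > 0


def overlaps_fraction_of_smaller(a, b, min_frac=0.4):
    """Definition 2: the shared region covers at least min_frac of the smaller CNV."""
    smaller = min(a[1] - a[0], b[1] - b[0])
    return smaller > 0 and overlap_bp(a, b) / smaller >= min_frac


if __name__ == "__main__":
    # Hypothetical calls from two algorithms for the same subject and chromosome.
    call_a = (100_000, 200_000)
    call_b = (180_000, 400_000)
    print(overlaps_any(call_a, call_b))                  # shares 20 kb
    print(overlaps_fraction_of_smaller(call_a, call_b))  # 20 kb of a 100 kb CNV
```

With these example coordinates the two calls count as the same CNV under the first definition but not under the second, which shows how the choice of definition alone can change the number of CNVs reported as concordant or novel.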

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

A varying threshold method for ChIP peak-calling using multiple sources of information

Author: Boyle
Heintzman
K.-B. Chen
M ller
Wadman
Y. Zhang
Publication venue: Oxford University Press
Publication date

Motivation: Gene regulation commonly involves interaction among DNA, proteins and biochemical conditions. Using chromatin immunoprecipitation (ChIP) technologies, protein–DNA interactions are routinely detected at the genome scale. Developing computational methods that detect weak protein-binding signals while maintaining high specificity remains challenging. An attractive approach is to incorporate biologically relevant data, such as protein co-occupancy, to improve the power of protein-binding detection. We refer to the additional data related to the target protein binding as supporting tracks.

Crossref

PubMed Central

Stochastic Variational Inference for Hidden Markov Models

Author: Foti Nicholas J.
Fox Emily B.
Laird Dillon
Xu Jason
Publication venue
Publication date: 06/11/2014

Variational inference algorithms have proven successful for Bayesian analysis in large data settings, with recent advances using stochastic variational inference (SVI). However, such methods have largely been studied in independent or exchangeable data settings. We develop an SVI algorithm to learn the parameters of hidden Markov models (HMMs) in a time-dependent data setting. The challenge in applying stochastic optimization in this setting arises from dependencies in the chain, which must be broken to consider minibatches of observations. We propose an algorithm that harnesses the memory decay of the chain to adaptively bound errors arising from edge effects. We demonstrate the effectiveness of our algorithm on synthetic experiments and a large genomics dataset where a batch algorithm is computationally infeasible. Comment: Appears in Advances in Neural Information Processing Systems (NIPS), 201

arXiv.org e-Print Archive

CiteSeerX

Discovery and characterization of chromatin states for systematic annotation of the human genome

Author: A Barski
A Siepel
AI Su
AP Boyle
B Schuettengruber
BD Strahl
BE Bernstein
C Zang
D Karolchik
DA Benson
DE Schones
DF Gudbjartsson
DS Johnson
G Hon
H O'Geen
J Ernst
Jason Ernst
K Cui
KD Pruitt
KJ Won
L Guelen
L Jia
M Guttman
Manolis Kellis
N Day
ND Heintzman
P Carninci
P Kheradpour
P Kolasinska-Zwierz
R Andersson
RE Thurman
RM Neal
S Schwartz
SE Celniker
SL Schreiber
SP Sripathy
T Kouzarides
TS Furey
W Miller
WJ Kent
X Wang
Y Zhang
Z Wang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

A plethora of epigenetic modifications have been described in the human genome and shown to play diverse roles in gene regulation, cellular differentiation and the onset of disease. Although individual modifications have been linked to the activity levels of various genetic functional elements, their combinatorial patterns are still unresolved and their potential for systematic de novo genome annotation remains untapped. Here, we use a multivariate Hidden Markov Model to reveal 'chromatin states' in human T cells, based on recurrent and spatially coherent combinations of chromatin marks. We define 51 distinct chromatin states, including promoter-associated, transcription-associated, active intergenic, large-scale repressed and repeat-associated states. Each chromatin state shows specific enrichments in functional annotations, sequence motifs and specific experimentally observed characteristics, suggesting distinct biological roles. This approach provides a complementary functional annotation of the human genome that reveals the genome-wide locations of diverse classes of epigenetic function. National Science Foundation (U.S.) (Award 0905968); National Human Genome Research Institute (U.S.) (Award U54-HG004570); National Human Genome Research Institute (U.S.) (Award RC1-HG005334

eScholarship - University of California