Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction
Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.
Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors
As the number of sequenced bacterial genomes increases, rapid and reliable tools for annotating functional elements (e.g., transcriptional regulatory elements) become increasingly desirable. Promoters are key regulatory elements that recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). Identifying promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors, including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and use them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three criteria, a prediction model might perform very well in cross-validation experiments while its performance on independent test data is drastically poor. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data were obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers in both cross-validation and independent test performance evaluation experiments.
Finally, we propose a meta-predictor method combining the two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
Funded by NPRP grant No. 4-1454-1-233 from the Qatar National Research Fund (a member of Qatar Foundation).
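The gap the abstract highlights between cross-validation scores and independent-test performance can be illustrated in a short sketch. This is not the study's code: the data below is synthetic and the classifier is an arbitrary stand-in, chosen only to show the two evaluation protocols side by side.

```python
# Illustrative sketch (synthetic data): estimating accuracy by k-fold
# cross-validation on a development set, then checking it against a
# held-out independent test set, as the abstract recommends.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(600, 80))   # toy integer-encoded sequences
y = rng.integers(0, 2, size=600)         # promoter (1) vs non-promoter (0)

X_dev, X_indep, y_dev, y_indep = train_test_split(
    X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

cv_acc = cross_val_score(clf, X_dev, y_dev, cv=5).mean()
indep_acc = clf.fit(X_dev, y_dev).score(X_indep, y_indep)
print(f"cross-validation accuracy: {cv_acc:.2f}")
print(f"independent-test accuracy: {indep_acc:.2f}")
```

With real promoter data, a large drop from `cv_acc` to `indep_acc` is the over-optimism the authors warn about.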
A knowledge engineering approach to the recognition of genomic coding regions
Funded by a research grant from Suranaree University of Technology, fiscal year B.E. 2556-255
In silico identification of NF-kappaB-regulated genes in pancreatic beta-cells
BACKGROUND: Pancreatic beta-cells are the target of an autoimmune attack in type 1 diabetes mellitus (T1DM). This is mediated in part by cytokines, such as interleukin (IL)-1β and interferon (IFN)-γ. These cytokines modify the expression of hundreds of genes, leading to beta-cell dysfunction and death by apoptosis. Several of these cytokine-induced genes are potentially regulated by the IL-1β-activated transcription factor (TF) nuclear factor (NF)-κB, and previous studies by our group have shown that cytokine-induced NF-κB activation is pro-apoptotic in beta-cells. To identify NF-κB-regulated gene networks in beta-cells we presently used a discriminant analysis-based approach to predict NF-κB responding genes on the basis of putative regulatory elements. RESULTS: The performance of linear and quadratic discriminant analysis (LDA, QDA) in identifying NF-κB-responding genes was examined on a dataset of 240 positive and negative examples of NF-κB regulation, using stratified cross-validation with an internal leave-one-out cross-validation (LOOCV) loop for automated feature selection and noise reduction. LDA performed slightly better than QDA, achieving 61% sensitivity, 91% specificity and 87% positive predictive value, and allowing the identification of 231, 251 and 580 NF-κB putative target genes in insulin-producing INS-1E cells, primary rat beta-cells and human pancreatic islets, respectively. Predicted NF-κB targets had a significant enrichment in genes regulated by cytokines (IL-1β or IL-1β + IFN-γ) and double stranded RNA (dsRNA), as compared to genes not regulated by these NF-κB-dependent stimuli. We increased the confidence of the predictions by selecting only evolutionary stable genes, i.e. genes with homologs predicted as NF-κB targets in rat, mouse, human and chimpanzee. CONCLUSION: The present in silico analysis allowed us to identify novel regulatory targets of NF-κB using a supervised classification method based on putative binding motifs. 
This provides new insights into the gene networks regulating cytokine-induced beta-cell dysfunction and death.
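The evaluation described above (LDA vs. QDA under leave-one-out cross-validation on a 240-example set) can be sketched as follows. The features here are synthetic placeholders, not the study's motif-based features; only the comparison protocol mirrors the abstract.

```python
# Hedged sketch: comparing linear and quadratic discriminant analysis with
# leave-one-out cross-validation on a 240-example synthetic dataset.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
# 240 examples as in the abstract; 10 hypothetical motif-score features
X = np.vstack([rng.normal(0.0, 1.0, (120, 10)),
               rng.normal(0.8, 1.0, (120, 10))])
y = np.array([0] * 120 + [1] * 120)   # non-target vs NF-kB target

accs = {}
for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("QDA", QuadraticDiscriminantAnalysis())]:
    # LOOCV: each example is held out once and predicted by a model
    # trained on the remaining 239
    accs[name] = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
    print(f"{name} LOOCV accuracy: {accs[name]:.2f}")
```

On real data the study added an inner LOOCV loop for feature selection; that step is omitted here for brevity.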
Use of neural networks to model molecular structure and function
This thesis is a study of some applications of neural networks - a recent computer algorithm - to modelling the structure and function of biologically important molecules. In Chapter 1, an introduction to neural networks is given. An overview of quantitative structure activity relationships (QSARs) is presented. The applications of neural networks to QSAR and to the prediction of structural and functional features of protein and nucleic acid sequences are reviewed. The neural network algorithms used are discussed in Chapter 2. In Chapter 3, a two-layer feed-forward neural network has been trained to recognise an ATP/GTP-binding local sequence motif. A comparably sophisticated statistical method was developed, which performed marginally better than the neural network. In a second study, described in Chapters 4 and 5, one of the largest data sets available for developing a quantitative structure activity relationship - the inhibition of dihydrofolate reductase by 2,4-diamino-6,6-dimethyl-5-phenyldihydrotriazine derivatives - has been used to benchmark several computational methods. A hidden-layer neural network, a decision tree and inductive logic programming have been compared with the more established methods of linear regression and nearest neighbour. The data were represented in two ways: by the traditional Hansch parameters and by a new set of descriptors designed to allow the formulation of rules relating the activity of the inhibitors to their chemical structure. The performance of neural networks has been assessed rigorously in two distinct areas of biomolecular modelling: sequence analysis and drug design. The conclusions of these studies are presented in Chapter 6.
A Survey on Concept Drift Adaptation
Concept drift primarily refers to an online supervised learning scenario in which the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning, in this paper we characterize the adaptive learning process, categorize existing strategies for handling concept drift, discuss the most representative, distinct and popular techniques and algorithms, discuss the evaluation methodology of adaptive algorithms, and present a set of illustrative applications. This introduction to concept drift adaptation presents the state-of-the-art techniques and a collection of benchmarks for researchers, industry analysts and practitioners. The survey aims to cover the different facets of concept drift in an integrated way and to reflect on the existing, scattered state of the art.
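Two ideas the survey covers, prequential (test-then-train) evaluation and window-based forgetting, can be shown in a few lines. The stream, window size, and majority-class "model" below are illustrative choices, not taken from the survey.

```python
# Minimal sketch: prequential evaluation of a sliding-window learner on a
# stream with one abrupt concept drift (the label flips halfway through).
from collections import deque

def prequential_accuracy(stream, window=50):
    """Predict each (x, y) with the window's majority class, then train on it."""
    recent = deque(maxlen=window)   # sliding window forgets old concepts
    correct = total = 0
    for x, y in stream:
        if recent:
            labels = [label for _, label in recent]
            pred = max(set(labels), key=labels.count)  # majority vote
            correct += (pred == y)
            total += 1
        recent.append((x, y))       # train on the example after testing it
    return correct / total if total else 0.0

# Abrupt drift: concept changes from class 0 to class 1 at example 200
stream = [(i, 0) for i in range(200)] + [(i, 1) for i in range(200)]
acc = prequential_accuracy(stream)
print(f"prequential accuracy: {acc:.3f}")
```

Because the window bounds how long stale examples are remembered, accuracy recovers within roughly `window / 2` examples after the drift; a learner that never forgets would keep mispredicting much longer.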
Improved Algorithms for Discovery of New Genes in Bacterial Genomes
In this dissertation, we describe a new approach for gene finding that can utilize proteomics information in addition to DNA and RNA to identify new genes in prokaryotic genomes. Proteomics processing pipelines require identification of small pieces of proteins called peptides. Peptide identification is a very error-prone process, and we have developed a new algorithm for validating peptide identifications using a distance-based outlier detection method. We demonstrate that our method identifies more peptides than other popular methods using standard mixtures of known proteins. In addition, our algorithm provides a much more accurate estimate of the false discovery rate than other methods. Once peptides have been identified and validated, we use a second algorithm, proteogenomic mapping (PGM), to map these peptides to the genome to find the genetic signals that allow us to identify potential novel protein-coding genes called expressed Protein Sequence Tags (ePSTs). We then collect and combine evidence for the ePSTs we generated and use supervised machine learning techniques to evaluate the likelihood that each ePST represents a true new protein-coding gene. Finally, we have developed new approaches to Bayesian learning that allow us to model the knowledge domain from sparse biological datasets. We have developed two new bootstrap approaches that utilize resampling to build networks from the most robust features, those that recur in many networks. These bootstrap methods yield improved prediction accuracy. We have also developed an unsupervised Bayesian network structure learning method that can be used when training data is not available or when labels may not be reliable.
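The dissertation's peptide-validation step rests on distance-based outlier detection; one common instance of that general technique is k-nearest-neighbour distance scoring, sketched below. The features and threshold are hypothetical, the dissertation's exact algorithm is not reproduced here.

```python
# Hedged sketch of distance-based outlier scoring: a point's score is its
# mean distance to its k nearest neighbours. Confident identifications
# cluster together; dubious ones sit far away and score high.
import numpy as np

def knn_outlier_scores(X, k=3):
    """Return each row's mean Euclidean distance to its k nearest neighbours."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)            # ignore self-distance
    nearest = np.sort(dists, axis=1)[:, :k]    # k smallest distances per row
    return nearest.mean(axis=1)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),    # tight cluster of "good" IDs
               [[5.0, 5.0]]])                  # one obvious outlier
scores = knn_outlier_scores(X)
print("most outlying index:", int(scores.argmax()))  # → 30
```

Ranking identifications by such a score, and thresholding it, is one way a false-discovery-rate estimate can be attached to the retained peptides.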
Learning Quantitative Sequence-Function Relationships using Massively Parallel Reporter Assays
The field of genomics has grown rapidly over the past decade due to the advent of high throughput sequencing technologies. Genomics relies on this wealth of information to draw biological inferences, but using inference to establish causality can be challenging as many genetic factors correlate with one another. Due to the declining cost of both reading and writing DNA, new techniques known as massively parallel reporter assays (MPRAs) provide the ability to test the function of a large library of tens to hundreds of thousands of designed DNA sequences simultaneously in a single experiment. Testing designed libraries allows us to explore beyond natural sequence variation to directly test thousands of sequence-function hypotheses simultaneously. In this dissertation I discuss two projects that explore sequence-function relationships in different biological systems. The first project is focused on how human genetic variation affects exon recognition, as mis-splicing is a major mechanism through which variants exert their influence. We developed a Multiplexed Functional Assay of Splicing using Sort-seq (MFASS) and assayed 27,333 variants in the Exome Aggregation Consortium within or adjacent to 2,198 human exons. We found that 3.8% (1,050) led to large splicing disruptions, many of which are extremely rare, located outside of canonical splice sites, distributed evenly across intronic and exonic regions, and difficult to predict. MFASS enables direct functional measurement of large-effect splicing defects at scale. The second project is focused on promoters and transcriptional regulation in Escherichia coli. Promoter sequence space in bacteria is vast and difficult to study genome-wide due to external factors that influence transcription. We developed a genomically-encoded MPRA to characterize the global promoter landscape and dissect active promoters for regulatory motifs.
We measure promoter activity of over 300,000 sequences spanning the entire genome and identify 3,321 active promoter regions in glucose minimal media and 3,477 in rich LB media. Furthermore, we perform a scanning mutagenesis of 2,057 E. coli promoters to identify regulatory sequences. Lastly, we implement a variety of machine learning models to classify promoters and quantitatively predict their activity. We present a series of approaches to rapidly characterize promoter sequences within the E. coli genome.
OEBench: Investigating Open Environment Challenges in Real-World Relational Data Streams
How to get insights from relational data streams in a timely manner is a hot research topic. This type of data stream can present unique challenges, such as distribution drifts, outliers, emerging classes, and changing features, which have recently been described as open environment challenges for machine learning. While existing studies have examined incremental learning for data streams, their evaluations are mostly conducted with manually partitioned datasets. Thus, a natural question is what those open environment challenges look like in real-world relational data streams and how existing incremental learning algorithms perform on real datasets. To fill this gap, we develop an Open Environment Benchmark named OEBench to evaluate open environment challenges in relational data streams. Specifically, we investigate 55 real-world relational data streams and establish that open environment scenarios are indeed widespread in real-world datasets, which presents significant challenges for stream learning algorithms. Through benchmarks with existing incremental learning algorithms, we find that increased data quantity may not consistently enhance model accuracy in open environment scenarios, where machine learning models can be significantly compromised by missing values, distribution shifts, or anomalies in real-world data streams. Current techniques are insufficient to effectively mitigate these challenges posed by open environments. More research is needed to address real-world open environment challenges. All datasets and code are open-sourced at https://github.com/sjtudyq/OEBench
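The incremental-learning-under-shift setting that OEBench benchmarks can be sketched with scikit-learn's `partial_fit` interface. The synthetic stream below is not OEBench data; it merely injects an abrupt distribution shift so the test-then-train loop has something to expose.

```python
# Illustrative sketch: incremental learning on a stream whose input
# distribution (and decision boundary) shifts midway, one of the
# open-environment conditions the benchmark studies.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(3)

def make_batch(mean, n=200):
    """Synthetic batch whose features and boundary depend on `mean`."""
    X = rng.normal(mean, 1.0, (n, 5))
    y = (X.sum(axis=1) > mean * 5).astype(int)
    return X, y

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Test-then-train over a drifting stream: evaluate each incoming batch
# before using it to update the model incrementally.
for mean in [0.0, 0.0, 2.0, 2.0]:            # distribution shifts after batch 2
    X, y = make_batch(mean)
    if hasattr(clf, "coef_"):                # skip scoring the unfitted model
        print(f"mean={mean}: accuracy {clf.score(X, y):.2f}")
    clf.partial_fit(X, y, classes=classes)
```

On the shifted batches the previously learned boundary no longer matches the data, so accuracy drops until enough post-shift batches have been absorbed, mirroring the benchmark's finding that more data alone does not guarantee better accuracy in open environments.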