Kernel methods in genomics and computational biology
Support vector machines and kernel methods are increasingly popular in
genomics and computational biology, due to their good performance in real-world
applications and strong modularity that makes them suitable to a wide range of
problems, from the classification of tumors to the automatic annotation of
proteins. Their ability to work in high dimension, to process non-vectorial
data, and the natural framework they provide to integrate heterogeneous data
are particularly relevant to various problems arising in computational biology.
In this chapter we survey some of the most prominent applications published so
far, highlighting the particular developments in kernel methods triggered by
problems in biology, and mention a few promising research directions likely to
expand in the future.
A decision-theoretic approach for segmental classification
This paper is concerned with statistical methods for the segmental
classification of linear sequence data where the task is to segment and
classify the data according to an underlying hidden discrete state sequence.
Such analysis is commonplace in the empirical sciences including genomics,
finance and speech processing. In particular, we are interested in answering
the following question: given data and a statistical model of
the hidden states, what should we report as the prediction under
the posterior distribution? That is, how should you make a
prediction of the underlying states? We demonstrate that traditional approaches
such as reporting the most probable state sequence or most probable set of
marginal predictions can give undesirable classification artefacts and offer
limited control over the properties of the prediction. We propose a decision
theoretic approach using a novel class of Markov loss functions and report
the prediction via the principle of minimum expected loss (maximum expected
utility). We demonstrate that the sequence of minimum expected loss under the
Markov loss function can be enumerated exactly using dynamic programming
methods and that it offers flexibility and performance improvements over
existing techniques. The result is generic and applicable to any probabilistic
model on a sequence, such as Hidden Markov models, change point or product
partition models. Comment: Published at http://dx.doi.org/10.1214/13-AOAS657 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
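The two traditional estimators named in the abstract above, the most probable state sequence and the sequence of most probable marginal states, can be compared on a toy model. The sketch below is not the paper's Markov loss method; it enumerates every state path of a small two-state hidden Markov model by brute force. The model parameters and the names `START`, `TRANS`, `EMIT`, `joint` and `decode` are illustrative assumptions.

```python
from itertools import product

# Toy two-state HMM; every number here is an illustrative assumption.
START = [0.5, 0.5]                 # initial state probabilities
TRANS = [[0.9, 0.1], [0.1, 0.9]]   # TRANS[previous_state][next_state]
EMIT = [[0.8, 0.2], [0.3, 0.7]]    # EMIT[state][observed_symbol]


def joint(path, obs):
    """Joint probability of a hidden-state path and an observation sequence."""
    p = START[path[0]] * EMIT[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= TRANS[path[t - 1]][path[t]] * EMIT[path[t]][obs[t]]
    return p


def decode(obs):
    """Return (most probable path, pointwise-marginal path, posterior over paths).

    Brute-force enumeration: feasible only for very short sequences.
    """
    n_states = len(START)
    paths = list(product(range(n_states), repeat=len(obs)))
    weights = [joint(p, obs) for p in paths]
    total = sum(weights)
    posterior = {p: w / total for p, w in zip(paths, weights)}

    # Estimator 1: the single most probable state sequence.
    map_path = max(posterior, key=posterior.get)

    # Estimator 2: at each position, the state with the largest marginal posterior.
    marginal_path = []
    for t in range(len(obs)):
        marg = [sum(pr for p, pr in posterior.items() if p[t] == s)
                for s in range(n_states)]
        marginal_path.append(max(range(n_states), key=marg.__getitem__))
    return map_path, tuple(marginal_path), posterior


if __name__ == "__main__":
    map_path, marginal_path, posterior = decode([0, 1, 0, 0, 1])
    print("most probable path:", map_path, posterior[map_path])
    print("marginal path:     ", marginal_path, posterior[marginal_path])
```

The pointwise-marginal path is assembled position by position, so as a whole sequence it can never have higher posterior probability than the most probable path, and the two need not coincide.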
Modeling dependent gene expression
In this paper we propose a Bayesian approach for inference about dependence
of high-throughput gene expression. Our goals are to use prior knowledge about
pathways to anchor inference about dependence among genes; to account for this
dependence while making inferences about differences in mean expression across
phenotypes; and to explore differences in the dependence itself across
phenotypes. Useful features of the proposed approach are a model-based
parsimonious representation of expression as an ordinal outcome, a novel and
flexible representation of prior information on the nature of dependencies, and
the use of a coherent probability model over both the structure and strength of
the dependencies of interest. We evaluate our approach through simulations and
in the analysis of data on expression of genes in the Complement and
Coagulation Cascade pathway in ovarian cancer. Comment: Published at http://dx.doi.org/10.1214/11-AOAS525 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
Machine Learning and Integrative Analysis of Biomedical Big Data
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Exploring Patterns of Epigenetic Information With Data Mining Techniques
Data mining, a part of the Knowledge Discovery in Databases (KDD) process, is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Analyses of epigenetic data have evolved towards genome-wide and high-throughput approaches, thus generating great amounts of data for which data mining is essential. Part of these data may contain patterns of epigenetic information which are mitotically and/or meiotically heritable, determining gene expression and cellular differentiation, as well as cellular fate. Epigenetic lesions and genetic mutations are acquired by individuals during their life and accumulate with ageing. Both defects, either together or individually, can result in loss of control over cell growth and thus in cancer development. Data mining techniques could then be used to extract these patterns. This work reviews some of the most important applications of data mining to epigenetics. Programa Iberoamericano de Ciencia y Tecnología para el Desarrollo; 209RT-0366. Galicia. Consellería de Economía e Industria; 10SIN105004PR. Instituto de Salud Carlos III; RD07/0067/000
Applications of Hidden Markov Models in Microarray Gene Expression Data
Hidden Markov models (HMMs) are well-developed statistical models for capturing hidden information from observable sequential symbols. They were first used in speech recognition in the 1970s and have been successfully applied to the analysis of biological sequences since the late 1980s, for example in finding protein secondary structure, CpG islands and families of related DNA or protein sequences [1]. In an HMM, the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. In this chapter, we describe two applications that use HMMs to predict gene functions in yeast and DNA copy number alterations in human tumor cells, based on gene expression microarray data.
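One standard way to recover hidden states from observed symbols in an HMM is the Viterbi algorithm. The sketch below is a generic log-space version for illustration only, not the chapter's implementation; the function name `viterbi`, its argument layout and the toy parameters in the usage example are assumptions, and all probabilities are assumed to be strictly positive.

```python
import math


def viterbi(obs, start, trans, emit):
    """Most probable hidden-state path for an observation sequence.

    obs   : list of observed symbol indices
    start : start[s] is the initial probability of state s
    trans : trans[r][s] is the probability of moving from state r to state s
    emit  : emit[s][o] is the probability of state s emitting symbol o
    Works in log-space; assumes every probability is strictly positive.
    """
    n = len(start)
    # Best log-score of any path ending in each state after the first symbol.
    score = [math.log(start[s]) + math.log(emit[s][obs[0]]) for s in range(n)]
    back = []  # back[t][s] is the best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + math.log(trans[r][s]))
            ptr.append(best)
            score.append(prev[best] + math.log(trans[best][s])
                         + math.log(emit[s][o]))
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=score.__getitem__)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    # Toy two-state model with illustrative numbers.
    start = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.1, 0.9]]
    emit = [[0.9, 0.1], [0.1, 0.9]]
    print(viterbi([0, 0, 1, 1], start, trans, emit))
```

Working with log-probabilities avoids numerical underflow on long sequences, which matters for genome-scale data.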