1,249 research outputs found
Analysis tools for the interplay between genome layout and regulation
Genome layout and gene regulation appear to be interdependent. Understanding this interdependence is key to exploring the dynamic nature of chromosome conformation and to engineering functional genomes. Evidence for non-random genome layout, defined as the relative positioning of either co-functional or co-regulated genes, stems from two main approaches. Firstly, the analysis of contiguous genome segments across species, has highlighted the conservation of gene arrangement (synteny) along chromosomal regions. Secondly, the study of long-range interactions along a chromosome has emphasised regularities in the positioning of microbial genes that are co-regulated, co-expressed or evolutionarily correlated. While one-dimensional pattern analysis is a mature field, it is often powerless on biological datasets which tend to be incomplete, and partly incorrect. Moreover, there is a lack of comprehensive, user-friendly tools to systematically analyse, visualise, integrate and exploit regularities along genomes.Here we present the Genome REgulatory and Architecture Tools SCAN (GREAT:SCAN) software for the systematic study of the interplay between genome layout and gene expression regulation.SCAN is a collection of related and interconnected applications currently able to perform systematic analyses of genome regularities as well as to improve transcription factor binding sites (TFBS) and gene regulatory network predictions based on gene positional information.We demonstrate the capabilities of these tools by studying on one hand the regular patterns of genome layout in the major regulons of the bacterium Escherichia coli. On the other hand, we demonstrate the capabilities to improve TFBS prediction in microbes. Finally, we highlight, by visualisation of multivariate techniques, the interplay between position and sequence information for effective transcription regulation
Rank discriminants for predicting phenotypes from RNA expression
Statistical methods for analyzing large-scale biomolecular data are
commonplace in computational biology. A notable example is phenotype prediction
from gene expression data, for instance, detecting human cancers,
differentiating subtypes and predicting clinical outcomes. Still, clinical
applications remain scarce. One reason is that the complexity of the decision
rules that emerge from standard statistical learning impedes biological
understanding, in particular, any mechanistic interpretation. Here we explore
decision rules for binary classification utilizing only the ordering of
expression among several genes; the basic building blocks are then two-gene
expression comparisons. The simplest example, just one comparison, is the TSP
classifier, which has appeared in a variety of cancer-related discovery
studies. Decision rules based on multiple comparisons can better accommodate
class heterogeneity, and thereby increase accuracy, and might provide a link
with biological mechanism. We consider a general framework ("rank-in-context")
for designing discriminant functions, including a data-driven selection of the
number and identity of the genes in the support ("context"). We then specialize
to two examples: voting among several pairs and comparing the median expression
in two groups of genes. Comprehensive experiments assess accuracy relative to
other, more complex, methods, and reinforce earlier observations that simple
classifiers are competitive.Comment: Published in at http://dx.doi.org/10.1214/14-AOAS738 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Predicting protein disorder by analyzing amino acid sequence
<p>Abstract</p> <p>Background</p> <p>Many protein regions and some entire proteins have no definite tertiary structure, presenting instead as dynamic, disorder ensembles under different physiochemical circumstances. These proteins and regions are known as Intrinsically Unstructured Proteins (IUP). IUP have been associated with a wide range of protein functions, along with roles in diseases characterized by protein misfolding and aggregation.</p> <p>Results</p> <p>Identifying IUP is important task in structural and functional genomics. We exact useful features from sequences and develop machine learning algorithms for the above task. We compare our IUP predictor with PONDRs (mainly neural-network-based predictors), disEMBL (also based on neural networks) and Globplot (based on disorder propensity).</p> <p>Conclusion</p> <p>We find that augmenting features derived from physiochemical properties of amino acids (such as hydrophobicity, complexity etc.) and using ensemble method proved beneficial. The IUP predictor is a viable alternative software tool for identifying IUP protein regions and proteins.</p
Learning pair-wise gene functional similarity by multiplex gene expression maps
Abstract Background The relationships between the gene functional similarity and gene expression profile, and between gene function annotation and gene sequence have been studied extensively. However, not much work has considered the connection between gene functions and location of a gene's expression in the mammalian tissues. On the other hand, although unsupervised learning methods have been commonly used in functional genomics, supervised learning cannot be directly applied to a set of normal genes without having a target (class) attribute. Results Here, we propose a supervised learning methodology to predict pair-wise gene functional similarity from multiplex gene expression maps that provide information about the location of gene expression. The features are extracted from expression maps and the labels denote the functional similarities of pairs of genes. We make use of wavelet features, original expression values, difference and average values of neighboring voxels and other features to perform boosting analysis. The experimental results show that with increasing similarities of gene expression maps, the functional similarities are increased too. The model predicts the functional similarities between genes to a certain degree. The weights of the features in the model indicate the features that are more significant for this prediction. Conclusions By considering pairs of genes, we propose a supervised learning methodology to predict pair-wise gene functional similarity from multiplex gene expression maps. We also explore the relationship between similarities of gene maps and gene functions. By using AdaBoost coupled with our proposed weak classifier we analyze a large-scale gene expression dataset and predict gene functional similarities. We also detect the most significant single voxels and pairs of neighboring voxels and visualize them in the expression map image of a mouse brain. This work is very important for predicting functions of unknown genes. It also has broader applicability since the methodology can be applied to analyze any large-scale dataset without a target attribute and is not restricted to gene expressions
One-Class Classification: Taxonomy of Study and Review of Techniques
One-class classification (OCC) algorithms aim to build classification models
when the negative class is either absent, poorly sampled or not well defined.
This unique situation constrains the learning of efficient classifiers by
defining class boundary just with the knowledge of positive class. The OCC
problem has been considered and applied under many research themes, such as
outlier/novelty detection and concept learning. In this paper we present a
unified view of the general problem of OCC by presenting a taxonomy of study
for OCC problems, which is based on the availability of training data,
algorithms used and the application domains applied. We further delve into each
of the categories of the proposed taxonomy and present a comprehensive
literature review of the OCC algorithms, techniques and methodologies with a
focus on their significance, limitations and applications. We conclude our
paper by discussing some open research problems in the field of OCC and present
our vision for future research.Comment: 24 pages + 11 pages of references, 8 figure
Optimal classification in sparse Gaussian graphic model
Consider a two-class classification problem where the number of features is
much larger than the sample size. The features are masked by Gaussian noise
with mean zero and covariance matrix , where the precision matrix
is unknown but is presumably sparse. The useful features,
also unknown, are sparse and each contributes weakly (i.e., rare and weak) to
the classification decision. By obtaining a reasonably good estimate of
, we formulate the setting as a linear regression model. We propose a
two-stage classification method where we first select features by the method of
Innovated Thresholding (IT), and then use the retained features and Fisher's
LDA for classification. In this approach, a crucial problem is how to set the
threshold of IT. We approach this problem by adapting the recent innovation of
Higher Criticism Thresholding (HCT). We find that when useful features are rare
and weak, the limiting behavior of HCT is essentially just as good as the
limiting behavior of ideal threshold, the threshold one would choose if the
underlying distribution of the signals is known (if only). Somewhat
surprisingly, when is sufficiently sparse, its off-diagonal
coordinates usually do not have a major influence over the classification
decision. Compared to recent work in the case where is the identity
matrix [Proc. Natl. Acad. Sci. USA 105 (2008) 14790-14795; Philos. Trans. R.
Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 (2009) 4449-4470], the current
setting is much more general, which needs a new approach and much more
sophisticated analysis. One key component of the analysis is the intimate
relationship between HCT and Fisher's separation. Another key component is the
tight large-deviation bounds for empirical processes for data with
unconventional correlation structures, where graph theory on vertex coloring
plays an important role.Comment: Published in at http://dx.doi.org/10.1214/13-AOS1163 the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
A primer on machine learning techniques for genomic applications
High throughput sequencing technologies have enabled the study of complex biological aspects at single nucleotide resolution, opening the big data era. The analysis of large volumes of heterogeneous “omic” data, however, requires novel and efficient computational algorithms based on the paradigm of Artificial Intelligence. In the present review, we introduce and describe the most common machine learning methodologies, and lately deep learning, applied to a variety of genomics tasks, trying to emphasize capabilities, strengths and limitations through a simple and intuitive language. We highlight the power of the machine learning approach in handling big data by means of a real life example, and underline how described methods could be relevant in all cases in which large amounts of multimodal genomic data are available
- …