MEMOFinder: combining _de novo_ motif prediction methods with a database of known motifs
*Background:* Methods for finding overrepresented sequence motifs are useful in several key areas of computational biology. They aim at detecting very weak signals responsible for biological processes requiring robust sequence identification like transcription-factor binding to DNA or docking sites in proteins. Currently, general performance of the model-based motif-finding methods is unsatisfactory; however, different methods are successful in different cases. This leads to the practical problem of combining results of different motif-finding tools, taking into account current knowledge collected in motif databases.
*Results:* We propose a new complete service that lets researchers submit their sequences for analysis by four different motif-finding methods, followed by clustering and comparison with a reference motif database. It is tailored for regulatory motif detection; however, it allows a substantial amount of configuration regarding sequence background, motif database and parameters for the motif-finding methods.
*Availability:* The method is available online as a webserver at: http://bioputer.mimuw.edu.pl/software/mmf/. In addition, the source code is released under a GNU General Public License.
Consensus clustering and functional interpretation of gene-expression data
Microarray analysis using clustering algorithms can suffer from a lack of inter-method consistency in assigning related gene-expression profiles to clusters. Obtaining a consensus set of clusters from a number of clustering methods should improve confidence in gene-expression analysis. Here we introduce consensus clustering, which provides such an advantage. When coupled with a statistically based gene functional analysis, our method allowed the identification of novel genes regulated by NFκB and the unfolded protein response in certain B-cell lymphomas.
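The idea of deriving a consensus from several clusterings can be illustrated with a co-association approach. This is a minimal sketch, not the authors' algorithm: it assumes that each method returns one label per gene, counts how often each pair of genes is grouped together, and links pairs that co-cluster in more than half of the methods. The three input partitions and the 0.5 threshold are hypothetical.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of partitions in which each pair of items shares a cluster."""
    n = len(partitions[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in partitions:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(partitions) for pair, c in counts.items()}


def consensus_clusters(partitions, threshold=0.5):
    """Link items whose co-association exceeds `threshold`, then return
    the connected components as consensus clusters."""
    n = len(partitions[0])
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), frac in co_association(partitions).items():
        if frac > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for item in range(n):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())


# Three hypothetical clusterings of six genes; label values are arbitrary
# and need not match between methods.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [2, 2, 2, 0, 0, 1],
]
clusters = consensus_clusters(partitions)
print(clusters)
```

Genes 0 to 2 and genes 3 to 5 end up in separate consensus clusters because no cross-group pair is grouped together by a majority of the methods.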
Cost effective, experimentally robust differential-expression analysis for human/mammalian, pathogen and dual-species transcriptomics.
As sequencing read length has increased, researchers have quickly adopted longer reads for their experiments. Here, we examine 14 pathogen or host-pathogen differential gene expression data sets to assess whether using longer reads is warranted. A variety of data sets was used to assess what genomic attributes might affect the outcome of differential gene expression analysis, including gene density, operons, gene length, number of introns/exons and intron length. No genome attribute was found to influence the data in principal components analysis, hierarchical clustering with bootstrap support, or regression analyses of pairwise comparisons that were undertaken on the same reads, looking at all combinations of paired and unpaired reads trimmed to 36, 54, 72 and 101 bp. Read pairing had the greatest effect when there was little variation in the samples from different conditions or in their replicates (e.g. little differential gene expression). But overall, 54 and 72 bp reads were typically most similar. Given differences in costs and mapping percentages, we recommend 54 bp reads for organisms with no or few introns and 72 bp reads for all others. In a third of the data sets, read pairing had absolutely no effect, despite paired reads having twice as much data. Therefore, single-end reads seem robust for differential-expression analyses, but in eukaryotes paired-end reads are likely desired to analyse splice variants and should be preferred for data sets that are acquired with the intent to be community resources that might be used in secondary data analyses.
Molecular Taxonomy of Phytopathogenic Fungi: A Case Study in Peronospora
Background: Inappropriate taxon definitions may have severe consequences in many areas. For instance, biologically
sensible species delimitation of plant pathogens is crucial for measures such as plant protection or biological control and for
comparative studies involving model organisms. However, delimiting species is challenging in the case of organisms for
which often only molecular data are available, such as prokaryotes, fungi, and many unicellular eukaryotes. Even in the case
of organisms with well-established morphological characteristics, molecular taxonomy is often necessary to emend current
taxonomic concepts and to analyze DNA sequences directly sampled from the environment. Typically, for this purpose
clustering approaches to delineate molecular operational taxonomic units have been applied using arbitrary choices
regarding the distance threshold values, and the clustering algorithms.
Methodology: Here, we report on a clustering optimization method to establish a molecular taxonomy of Peronospora
based on ITS nrDNA sequences. Peronospora is the largest genus within the downy mildews, which are obligate parasites of
higher plants, and includes various economically important pathogens. The method determines the distance function and
clustering setting that result in an optimal agreement with selected reference data. Optimization was based on both
taxonomy-based and host-based reference information, yielding the same outcome. Resampling and permutation methods
indicate that the method is robust regarding taxon sampling and errors in the reference data. Tests with newly obtained ITS
sequences demonstrate the use of the re-classified dataset in molecular identification of downy mildews.
Conclusions: A corrected taxonomy is provided for all Peronospora ITS sequences contained in public databases. Clustering
optimization appears to be broadly applicable in automated, sequence-based taxonomy. The method connects traditional
and modern taxonomic disciplines by specifically addressing the issue of how to optimally account for both traditional
species concepts and genetic divergence.
Peer reviewed
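The clustering-optimization idea can be sketched in a much reduced form. The abstract describes optimizing both the distance function and the clustering settings against reference data; the sketch below varies only a distance threshold, and it assumes single-linkage clustering and the Rand index as the agreement measure. The distance matrix, reference labels and candidate thresholds are all hypothetical.

```python
from itertools import combinations


def single_linkage(dist, threshold):
    """Label items so that any pair at distance <= `threshold` is linked."""
    n = len(dist)
    labels = list(range(n))
    for i, j in combinations(range(n), 2):
        if dist[i][j] <= threshold and labels[i] != labels[j]:
            old, new = labels[j], labels[i]
            labels = [new if lab == old else lab for lab in labels]
    return labels


def rand_index(a, b):
    """Share of item pairs on which two partitions agree (together or apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def best_threshold(dist, reference, candidates):
    """Return (threshold, score) with the highest agreement to `reference`.
    Ties are broken in favour of the smaller threshold."""
    scored = [(rand_index(single_linkage(dist, t), reference), t) for t in candidates]
    score, threshold = max(scored, key=lambda st: (st[0], -st[1]))
    return threshold, score


# Hypothetical pairwise distances between five sequences.
dist = [
    [0.00, 0.01, 0.06, 0.07, 0.12],
    [0.01, 0.00, 0.05, 0.06, 0.11],
    [0.06, 0.05, 0.00, 0.02, 0.10],
    [0.07, 0.06, 0.02, 0.00, 0.09],
    [0.12, 0.11, 0.10, 0.09, 0.00],
]
reference = ["A", "A", "B", "B", "C"]  # hypothetical reference taxonomy
candidates = [0.005, 0.03, 0.08, 0.15]

threshold, score = best_threshold(dist, reference, candidates)
print(threshold, score)
```

In this toy case the smallest threshold splits every sequence apart and the largest merges everything, so an intermediate threshold reproduces the reference grouping best.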
Sequence-based Multiscale Model (SeqMM) for High-throughput chromosome conformation capture (Hi-C) data analysis
In this paper, I introduce a Sequence-based Multiscale Model (SeqMM) for the
biomolecular data analysis. With the combination of spectral graph method, I
reveal the essential difference between the global scale models and local scale
ones in structure clustering, i.e., different optimization on Euclidean (or
spatial) distances and sequential (or genomic) distances. More specifically,
clusters from global scale models optimize Euclidean distance relations. Local
scale models, on the other hand, result in clusters that optimize the genomic
distance relations. For biomolecular data, Euclidean distances and sequential
distances are two independent variables, which can never be optimized
simultaneously in data clustering. However, sequence scale in my SeqMM can work
as a tuning parameter that balances these two variables and delivers different
clusterings based on my purposes. Further, my SeqMM is used to explore the
hierarchical structures of chromosomes. I find that at the global scale, the
Fiedler vector from my SeqMM bears a great similarity with the principal vector
from principal component analysis, and can be used to study genomic
compartments. In TAD analysis, I find that TADs evaluated from different scales
are not consistent and vary a lot. Particularly when the sequence scale is
small, the calculated TAD boundaries are dramatically different. Even for
regions with high contact frequencies, TAD regions show no obvious consistency.
However, when the scale value increases further, although TADs are still quite
different, TAD boundaries in these high contact frequency regions become more
and more consistent. Finally, I find that for a fixed local scale, my method
can deliver very robust TAD boundaries in different cluster numbers.
Comment: 22 pages, 13 figures
Discovering transcriptional modules by Bayesian data integration
Motivation: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets.
Results: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs.
Integrative Model-based clustering of microarray methylation and expression data
In many fields, researchers are interested in large and complex biological
processes. Two important examples are gene expression and DNA methylation in
genetics. One key problem is to identify aberrant patterns of these processes
and discover biologically distinct groups. In this article we develop a
model-based method for clustering such data. The basis of our method involves
the construction of a likelihood for any given partition of the subjects. We
introduce cluster specific latent indicators that, along with some standard
assumptions, impose a specific mixture distribution on each cluster. Estimation
is carried out using the EM algorithm. The methods extend naturally to multiple
data types of a similar nature, which leads to an integrated analysis over
multiple data platforms, resulting in higher discriminating power.
Comment: Published at http://dx.doi.org/10.1214/11-AOAS533 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
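For readers unfamiliar with EM-based mixture clustering, the following is a minimal sketch of the general technique. It is not the model described in the abstract, which uses cluster-specific latent indicators and a specific mixture distribution per cluster. Here a plain two-component, one-dimensional Gaussian mixture is assumed, and the data values and initialization are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    means = [min(data), max(data)]  # crude initialization (an assumption)
    sds = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component
    return weights, means, sds, resp


# Hypothetical measurements drawn from two well-separated groups.
data = [0.1, -0.2, 0.3, 0.0, -0.1, 5.1, 4.8, 5.2, 5.0, 4.9]
weights, means, sds, resp = em_two_gaussians(data)

# Hard cluster assignment: the component with the larger responsibility.
labels = [0 if r[0] > r[1] else 1 for r in resp]
print(means, labels)
```

Extending such a sketch to several data types would mean multiplying the per-platform densities inside the E-step, under an assumption of conditional independence given the cluster.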
Exploring the assortativity-clustering space of a network's degree sequence
Nowadays there is a multitude of measures designed to capture different
aspects of network structure. To be able to say if the structure of a certain
network is expected or not, one needs a reference model (null model). One
frequently used null model is the ensemble of graphs with the same set of
degrees as the original network. In this paper we argue that this ensemble can
be more than just a null model -- it also carries information about the
original network and factors that affect its evolution. By mapping out this
ensemble in the space of some low-level network structure -- in our case those
measured by the assortativity and clustering coefficients -- one can for
example study how close to the valid region of the parameter space the observed
networks are. Such analysis suggests which quantities are actively optimized
during the evolution of the network. We use four very different biological
networks to exemplify our method. Among other things, we find that high
clustering might be a force in the evolution of protein interaction networks.
We also find that all four networks are conspicuously robust to both random
errors and targeted attacks.
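The null model mentioned above, the ensemble of graphs sharing one degree sequence, is commonly sampled by repeated edge swaps. The sketch below assumes that approach and is not taken from the paper: it rewires a small hypothetical undirected graph while keeping every degree fixed, and computes the two quantities the abstract names, using the global clustering coefficient and the Pearson degree correlation across edges as the concrete definitions.

```python
import random
from itertools import combinations


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def rewire(edges, n_swaps, seed=0):
    """Degree-preserving randomization by swapping endpoints of edge pairs."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        # Proposed swap: (a,b),(c,d) -> (a,d),(c,b).
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop or touch a shared node
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # would create a multi-edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def transitivity(edges):
    """Global clustering coefficient: closed triples / connected triples."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    closed = triples = 0
    for ns in nbrs.values():
        for x, y in combinations(ns, 2):
            triples += 1
            closed += y in nbrs[x]
    return closed / triples if triples else 0.0


def assortativity(edges):
    """Pearson correlation of the degrees at the two ends of each edge."""
    deg = degrees(edges)
    xs = [deg[u] for u, v in edges] + [deg[v] for u, v in edges]
    ys = [deg[v] for u, v in edges] + [deg[u] for u, v in edges]
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) ** 2 for x in xs)
    return cov / var if var else 0.0


# Hypothetical network: two triangles joined by an edge, plus a short tail.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (3, 5), (5, 6), (6, 7)]
shuffled = rewire(edges, n_swaps=100)
print(assortativity(edges), transitivity(edges))
print(assortativity(shuffled), transitivity(shuffled))
```

Repeating the rewiring with many seeds and plotting each result as a point would give a rough picture of the region of assortativity-clustering space that is reachable for the given degree sequence.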