Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication
Horseshoe crabs are marine arthropods with a fossil record extending back
approximately 450 million years. They exhibit remarkable morphological
stability over their long evolutionary history, retaining a number of ancestral
arthropod traits, and are often cited as examples of "living fossils." As
arthropods, they belong to the Ecdysozoa, an ancient superphylum whose
sequenced genomes (including insects and nematodes) have thus far shown more
divergence from the ancestral pattern of eumetazoan genome organization than
cnidarians, deuterostomes, and lophotrochozoans. However, much of ecdysozoan
diversity remains unrepresented in comparative genomic analyses. Here we use a
new strategy of combined de novo assembly and genetic mapping to examine the
chromosome-scale genome organization of the Atlantic horseshoe crab Limulus
polyphemus. We constructed a genetic linkage map of this 2.7 Gbp genome by
sequencing the nuclear DNA of 34 wild-collected, full-sibling embryos and their
parents at a mean redundancy of 1.1x per sample. The map includes 84,307
sequence markers and 5,775 candidate conserved protein coding genes. Comparison
to other metazoan genomes shows that the L. polyphemus genome preserves
ancestral bilaterian linkage groups, and that a common ancestor of modern
horseshoe crabs underwent one or more ancient whole genome duplications (WGDs)
~300 MYA, followed by extensive chromosome fusion.
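The WGD signature described above can be illustrated with a toy computation: counting retained paralog pairs that link pairs of linkage groups, since an excess of paralogs connecting two different chromosomes is the classic trace of an ancient duplication. This is a minimal sketch of the general idea, not the paper's actual analysis; the gene names, linkage-group labels, and function name are invented for illustration.

```python
from collections import Counter

def paralogon_matrix(gene_to_lg, paralog_pairs):
    """Count paralog pairs linking each (unordered) pair of linkage
    groups (LGs). An excess of paralog pairs between two different LGs
    is the classic signature of an ancient whole genome duplication
    followed by chromosome-level divergence."""
    counts = Counter()
    for a, b in paralog_pairs:
        if a in gene_to_lg and b in gene_to_lg:
            lg_pair = tuple(sorted((gene_to_lg[a], gene_to_lg[b])))
            counts[lg_pair] += 1
    return counts

# Toy data: four genes on two linkage groups, with paralog pairs
# retained from a hypothetical duplication.
gene_to_lg = {"g1": "LG1", "g2": "LG1", "g3": "LG2", "g4": "LG2"}
paralogs = [("g1", "g3"), ("g2", "g4"), ("g1", "g2")]
m = paralogon_matrix(gene_to_lg, paralogs)
```

In a real analysis the same tabulation would be run over thousands of paralog pairs and tested against a null model of random gene placement.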
PIntron: a Fast Method for Gene Structure Prediction via Maximal Pairings of a Pattern and a Text
Current computational methods for exon-intron structure prediction from a
cluster of transcript (EST, mRNA) data do not exhibit the time and space
efficiency necessary to process large clusters of more than 20,000 ESTs and
genes longer than 1 Mb. Guaranteeing both accuracy and efficiency remains a
distant computational goal, since accuracy depends on exploiting the inherent
redundancy of information present in a large cluster. We propose a fast method
for the problem that combines two ideas: a
novel algorithm of proved small time complexity for computing spliced
alignments of a transcript against a genome, and an efficient algorithm that
exploits the inherent redundancy of information in a cluster of transcripts to
select, among all possible factorizations of EST sequences, those that allow
the inference of splice site junctions highly confirmed by the input data. The
EST alignment procedure is based on the construction of maximal embeddings,
which are sequences obtained from paths of a graph structure, called the
Embedding Graph, whose vertices are the maximal pairings of a genomic sequence
T and an EST P. The procedure runs in time linear in the sizes of P, T, and
the output.
PIntron, the software tool implementing our methodology, is able to process in
a few seconds some critical genes that are not manageable by other gene
structure prediction tools. At the same time, PIntron exhibits high accuracy
(sensitivity and specificity) when compared with ENCODE data. Detailed
experimental data, additional results and PIntron software are available at
http://www.algolab.eu/PIntron
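The notion of a maximal pairing underlying the Embedding Graph can be made concrete with a naive sketch: enumerating common substrings of an EST P and a genomic sequence T that cannot be extended to the left or to the right. This quadratic-time toy is only meant to illustrate the definition; it is not PIntron's algorithm, which achieves far better complexity.

```python
def maximal_pairings(P, T, min_len=3):
    """Enumerate maximal pairings (p, t, l): common substrings of P and
    T of length l >= min_len that can be extended neither left nor
    right. Naive O(|P|*|T|*l) sketch for illustration only."""
    out = []
    for p in range(len(P)):
        for t in range(len(T)):
            if P[p] != T[t]:
                continue
            # Skip occurrences extensible to the left (not left-maximal).
            if p > 0 and t > 0 and P[p - 1] == T[t - 1]:
                continue
            # Extend right as far as the characters match (right-maximal).
            l = 0
            while p + l < len(P) and t + l < len(T) and P[p + l] == T[t + l]:
                l += 1
            if l >= min_len:
                out.append((p, t, l))
    return out

pairs = maximal_pairings("ACGTAC", "TTACGTT", min_len=4)
```

Here the single maximal pairing reports that "ACGT" occurs at position 0 of the EST and position 2 of the genomic sequence.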
Gene Similarity-based Approaches for Determining Core-Genes of Chloroplasts
In computational biology and bioinformatics, understanding evolutionary
processes among related organisms has received much attention in recent
decades. However, accurate methodologies are still needed to study the
evolution of gene content. In a previous work, we proposed two novel
approaches based on sequence similarities and gene features. More precisely,
we proposed to use gene names, sequence similarities, or both, obtained from
either the NCBI or the DOGMA annotation tools. DOGMA has the advantage of
being an up-to-date, accurate automatic tool specifically designed for
chloroplasts, whereas NCBI provides high-quality human-curated genes
(together with some wrongly annotated ones). The key idea of the former
proposal was to take the best from these two tools. However, that first
proposal was limited by name variations and spelling errors on the NCBI side,
leading to core trees of low quality. In this paper, these flaws are fixed by
improving the comparison of NCBI and DOGMA results, and by relaxing
constraints on gene names while adding a post-validation stage on gene
sequences. The two stages of similarity measures, on names and on sequences,
are thus proposed for sequence clustering. This improves on results that can
be obtained using either NCBI or DOGMA alone. Results obtained with this
quality-control test are further investigated and compared with previously
released ones, on both computational and biological aspects, considering a
set of 99 chloroplast genomes.
Comment: 4 pages, IEEE International Conference on Bioinformatics and
Biomedicine (BIBM 2014)
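The two-stage similarity measure (on names, then on sequences) can be sketched as follows, assuming each annotation tool's output is represented as a mapping from gene name to coding sequence. The name normalization and the `difflib`-based similarity are simplifications invented for this sketch, not the paper's exact procedure.

```python
from difflib import SequenceMatcher

def core_genes(annot_a, annot_b, sim_threshold=0.9):
    """Two-stage core-gene selection (illustrative sketch):
    1) cluster by relaxed gene-name equality (case/whitespace-insensitive),
    2) post-validate each name match by sequence similarity."""
    def norm(name):
        return name.strip().lower().replace(" ", "")

    b_by_name = {norm(n): s for n, s in annot_b.items()}
    core = []
    for name, seq in annot_a.items():
        other = b_by_name.get(norm(name))
        if other is None:
            continue  # no name match between the two annotations
        # Stage 2: keep the pair only if the sequences really agree.
        if SequenceMatcher(None, seq, other).ratio() >= sim_threshold:
            core.append(name)
    return core

# Toy annotations: one true match ("rbcL" vs "rbcl") and one name match
# whose sequences disagree and is therefore rejected at post-validation.
ncbi = {"rbcL": "ATGGCTAGCT", "psbA ": "ATGCCCGGGA"}
dogma = {"rbcl": "ATGGCTAGCT", "psbA": "TTTTTTTTTT"}
genes = core_genes(ncbi, dogma)
```

The relaxed name matching absorbs spelling and case variations, while the sequence stage rejects coincidental name matches, which is the flaw the paper's post-validation stage addresses.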
Improved Core Genes Prediction for Constructing well-supported Phylogenetic Trees in large sets of Plant Species
Inferring well-supported phylogenetic trees that precisely reflect the
evolutionary process is a challenging task that depends entirely on how the
related core genes have been found. In previous computational biology
studies, many similarity-based algorithms, mainly relying on the computation
of sequence alignment matrices, have been proposed to find them. In such
approaches, a significantly high similarity score between two coding
sequences extracted from a given annotation tool is taken to mean that the
two sequences correspond to the same gene. In a previous article, we
presented a quality test approach (QTA) that improves the quality of the core
genes by combining two annotation tools (namely NCBI, a partially
human-curated database, and DOGMA, an efficient annotation algorithm for
chloroplasts). This method takes advantage of both sequence similarity and
gene features to guarantee that the core genome contains correct and
well-clustered coding sequences (\emph{i.e.}, genes). We then show in this
article how useful such well-defined core genes are for biomolecular
phylogenetic reconstructions, by investigating various subsets of core genes
at various family or genus levels, leading to subtrees with strong bootstrap
support that are finally merged into a well-supported supertree.
Comment: 12 pages, 7 figures, IWBBIO 2015 (3rd International Work-Conference
on Bioinformatics and Biomedical Engineering)
Automated Protein Subfamily Identification and Classification
Function prediction by homology is widely used to provide preliminary functional annotations for genes for which experimental evidence of function is unavailable or limited. This approach has been shown to be prone to systematic error, including percolation of annotation errors through sequence databases. Phylogenomic analysis avoids these errors in function prediction but has been difficult to automate for high-throughput application. To address this limitation, we present a computationally efficient pipeline for phylogenomic classification of proteins. This pipeline uses the SCI-PHY (Subfamily Classification in Phylogenomics) algorithm for automatic subfamily identification, followed by subfamily hidden Markov model (HMM) construction. A simple and computationally efficient scoring scheme using family and subfamily HMMs enables classification of novel sequences to protein families and subfamilies. Sequences representing entirely novel subfamilies are differentiated from those that can be classified to subfamilies in the input training set using logistic regression. Subfamily HMM parameters are estimated using an information-sharing protocol, enabling subfamilies containing even a single sequence to benefit from conservation patterns defining the family as a whole or in related subfamilies. SCI-PHY subfamilies correspond closely to functional subtypes defined by experts and to conserved clades found by phylogenetic analysis. Extensive comparisons of subfamily and family HMM performances show that subfamily HMMs dramatically improve the separation between homologous and non-homologous proteins in sequence database searches. Subfamily HMMs also provide extremely high specificity of classification and can be used to predict entirely novel subtypes. The SCI-PHY Web server at http://phylogenomics.berkeley.edu/SCI-PHY/ allows users to upload a multiple sequence alignment for subfamily identification and subfamily HMM construction. 
Biologists wishing to provide their own subfamily definitions can do so. Source code is available on the Web page. The Berkeley Phylogenomics Group PhyloFacts resource contains pre-calculated subfamily predictions and subfamily HMMs for more than 40,000 protein families and domains at http://phylogenomics.berkeley.edu/phylofacts/
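The family/subfamily scoring scheme can be caricatured with a toy classifier: score a query against each subfamily "profile" and either assign the best-scoring subfamily or flag the query as potentially novel when no score is high enough. SCI-PHY itself uses profile HMMs and logistic regression for these two steps; the consensus-matching score and cutoff below are simplified stand-ins invented for this sketch.

```python
def profile_score(seq, consensus):
    """Toy stand-in for a subfamily HMM score: the fraction of positions
    matching the subfamily consensus (real SCI-PHY scores sequences with
    subfamily hidden Markov models)."""
    n = min(len(seq), len(consensus))
    return sum(seq[i] == consensus[i] for i in range(n)) / max(len(consensus), 1)

def classify(seq, subfamily_consensus, novel_cutoff=0.5):
    """Assign seq to its best-scoring subfamily, or report it as a
    potentially novel subtype when even the best score falls below the
    cutoff (the paper makes this decision with logistic regression)."""
    scores = {name: profile_score(seq, cons)
              for name, cons in subfamily_consensus.items()}
    best = max(scores, key=scores.get)
    if scores[best] < novel_cutoff:
        return "novel", scores
    return best, scores

# Toy family with two subfamily consensus sequences.
subfams = {"SF1": "MKVLAA", "SF2": "MRTLGG"}
label, scores = classify("MKVLAG", subfams)
```

The query differs from the SF1 consensus at one of six positions, so it is confidently assigned to SF1 rather than flagged as novel.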
Finding the Core-Genes of Chloroplasts
Due to recent advances in sequencing techniques, the number of available
genomes is rising steadily, making it possible to perform large-scale genomic
comparisons between sets of closely related species. An interesting question
to answer is: which genes are common to a collection of species, or
conversely, which genes are specific to a given species compared with other
ones belonging to the same genus, family, etc.? Investigating this problem
means finding both the core and pan genomes of a collection of species,
\textit{i.e.}, the genes common to all the species versus the set of all
genes present in any species under consideration. However, obtaining
trustworthy core and pan genomes is not an easy task: it entails a large
amount of computation and requires a rigorous methodology. Surprisingly, as
far as we know, methodologies for finding core and pan genomes have not been
deeply investigated. This research work tries to fill that gap by focusing
only on chloroplast genomes, whose reasonable sizes allow a deep study. To
achieve this goal, a collection of 99 chloroplasts is considered in this
article. Two methodologies have been investigated, based respectively on
sequence similarities and on gene names taken from annotation tools. The
obtained results are finally evaluated in terms of biological relevance.
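Once genes have been clustered across genomes, the core/pan distinction itself reduces to set intersection versus set union. A minimal sketch over gene names only (ignoring the similarity-based clustering step, and using invented toy data):

```python
def core_and_pan(genomes):
    """Core genome = genes present in every genome; pan genome = union
    of all genes. Sketch over gene-name sets; the full methodology also
    uses sequence similarity to decide when two genes are 'the same'."""
    sets = [set(g) for g in genomes.values()]
    core = set.intersection(*sets)
    pan = set.union(*sets)
    return core, pan

# Toy collection of three chloroplast gene lists.
genomes = {
    "speciesA": ["rbcL", "psbA", "matK"],
    "speciesB": ["rbcL", "psbA", "ndhF"],
    "speciesC": ["rbcL", "psbA", "matK", "ndhF"],
}
core, pan = core_and_pan(genomes)
```

The hard part, as the abstract notes, is everything upstream of these two set operations: deciding reliably which annotated sequences count as the same gene.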
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.
Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
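The covering-hypersphere idea behind entropy-scaling search can be sketched with a greedy metric cover plus a coarse-to-fine query that prunes whole clusters via the triangle inequality. This toy works on 1-D points with invented function names; it illustrates the framework, not the Ammolite/MICA/esFragBag implementations.

```python
def cover(points, radius, dist):
    """Greedy metric cover: assign each point to the first representative
    within `radius`, creating new representatives as needed. The number
    of representatives approximates the dataset's metric entropy."""
    clusters = {}  # representative -> members
    for p in points:
        for rep in clusters:
            if dist(p, rep) <= radius:
                clusters[rep].append(p)
                break
        else:
            clusters[p] = [p]
    return clusters

def search(query, clusters, radius, cutoff, dist):
    """Coarse-to-fine search: by the triangle inequality, a cluster can
    only contain a hit within `cutoff` of the query if its representative
    lies within cutoff + radius, so all other clusters are skipped."""
    hits = []
    for rep, members in clusters.items():
        if dist(query, rep) <= cutoff + radius:
            hits.extend(m for m in members if dist(query, m) <= cutoff)
    return hits

d = lambda a, b: abs(a - b)
clusters = cover([1, 2, 3, 10, 11, 20], radius=2, dist=d)
hits = search(9, clusters, radius=2, cutoff=2, dist=d)
```

Only one of the three covering balls survives the coarse test for this query, so the fine (exact) comparison runs on a fraction of the data, which is where the reported speedups come from when the fractal dimension is low.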