Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication
Horseshoe crabs are marine arthropods with a fossil record extending back
approximately 450 million years. They exhibit remarkable morphological
stability over their long evolutionary history, retaining a number of ancestral
arthropod traits, and are often cited as examples of "living fossils." As
arthropods, they belong to the Ecdysozoa, an ancient super-phylum whose
sequenced genomes (including insects and nematodes) have thus far shown more
divergence from the ancestral pattern of eumetazoan genome organization than
cnidarians, deuterostomes, and lophotrochozoans. However, much of ecdysozoan
diversity remains unrepresented in comparative genomic analyses. Here we use a
new strategy of combined de novo assembly and genetic mapping to examine the
chromosome-scale genome organization of the Atlantic horseshoe crab Limulus
polyphemus. We constructed a genetic linkage map of this 2.7 Gbp genome by
sequencing the nuclear DNA of 34 wild-collected, full-sibling embryos and their
parents at a mean redundancy of 1.1x per sample. The map includes 84,307
sequence markers and 5,775 candidate conserved protein coding genes. Comparison
to other metazoan genomes shows that the L. polyphemus genome preserves
ancestral bilaterian linkage groups, and that a common ancestor of modern
horseshoe crabs underwent one or more ancient whole genome duplications (WGDs)
~300 MYA, followed by extensive chromosome fusion.
The EM Algorithm and the Rise of Computational Biology
In the past decade computational biology has grown from a cottage industry
with a handful of researchers to an attractive interdisciplinary field,
catching the attention and imagination of many quantitatively-minded
scientists. Of interest to us is the key role played by the EM algorithm during
this transformation. We survey the use of the EM algorithm in a few important
computational biology problems surrounding the "central dogma" of molecular
biology: from DNA to RNA and then to proteins. Topics of this article include
sequence motif discovery, protein sequence alignment, population genetics,
evolutionary models and mRNA expression microarray data analysis. Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Chromosomal-level assembly of the Asian Seabass genome using long sequence reads and multi-layered scaffolding
We report here the ~670 Mb genome assembly of the Asian seabass (Lates calcarifer), a tropical marine teleost. We used long-read sequencing augmented by transcriptomics, optical and genetic mapping, along with shared synteny from closely related fish species, to derive a chromosome-level assembly with a contig N50 size over 1 Mb and a scaffold N50 size over 25 Mb that spans ~90% of the genome. The population structure of the L. calcarifer species complex was analyzed by re-sequencing 61 individuals representing various regions across the species' native range. SNP analyses identified high levels of genetic diversity and confirmed earlier indications of a population stratification comprising three clades, with signs of admixture apparent in the South-East Asian population. The quality of the Asian seabass genome assembly far exceeds that of any other fish species, and will serve as a new standard for fish genomics.
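The contig and scaffold N50 figures quoted above follow a standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The sketch below illustrates that definition only; the function name `n50` and the toy lengths are assumptions for illustration and are not taken from the seabass assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and accumulate them; the N50 is
    the length at which the running sum first reaches half of the total.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running to total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical, arbitrary units).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
```

With these toy lengths the total is 300, and the two longest contigs (80 and 70) already reach 150, so the N50 is 70.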
SUFFIX TREE, MINWISE HASHING AND STREAMING ALGORITHMS FOR BIG DATA ANALYSIS IN BIOINFORMATICS
In this dissertation, we worked on several algorithmic problems in bioinformatics using mainly three approaches: (a) a streaming model, (b) suffix-tree based indexing, and (c) minwise-hashing (minhash) and locality-sensitive hashing (LSH). The streaming models are useful for large data problems where a good approximation needs to be achieved with limited space usage. We developed an approximation algorithm (Kmer-Estimate) using the streaming approach to obtain a better estimation of the frequency of k-mer counts. A k-mer, a subsequence of length k, plays an important role in many bioinformatics analyses such as genome distance estimation. We also developed new methods that use the suffix tree, a trie data structure, for alignment-free, non-pairwise algorithms for a conserved non-coding sequence (CNS) identification problem. We provided two different algorithms: STAG-CNS to identify exact-matched CNSs and DiCE to identify CNSs with mismatches. Using our algorithms, CNSs among various grass species were identified. A different approach was employed for identification of longer CNSs ( 100 bp, mostly found in animals). In our new method (MinCNE), the minhash approach was used to estimate the Jaccard similarity. Also using LSH, k-mers extracted from genomic sequences were clustered and CNSs were identified. Another new algorithm (MinIsoClust) that also uses minhash and LSH techniques was developed for an isoform clustering problem. Isoforms are generated from the same gene but by alternative splicing. As the isoform sequences share some exons but in different combinations, regular sequence clustering methods do not work well. Our algorithm generates clusters for isoform sequences based on their shared minhash signatures. Finally, we discuss de novo transcriptome assembly algorithms and how to improve the assembly accuracy using ensemble approaches. First, we did a comprehensive performance analysis on different transcriptome assemblers using simulated benchmark datasets.
Then, we developed a new ensemble approach (Minsemble) for the de novo transcriptome assembly problem that integrates isoform-clustering using minhash technique to identify potentially correct transcripts from various de novo transcriptome assemblers. Minsemble identified more correctly assembled transcripts as well as genes compared to other de novo and ensemble methods.
Adviser: Jitender S. Deogu
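The general idea of estimating the Jaccard similarity of two k-mer sets from minhash signatures, as mentioned in the abstract above, can be sketched as follows. This is a generic illustration, not the MinCNE or MinIsoClust implementation: the function names, the seeded BLAKE2b hash family, the signature length of 128, and the example sequences are all assumptions made for this sketch.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Hypothetical hash family: a 64-bit BLAKE2b digest keyed by a seed."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(kmer_set, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree estimates the Jaccard index."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard index |A & B| / |A | B|, for comparison."""
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical toy sequences that share a prefix.
a = kmers("ACGTACGTTGCAACGTAGCTAGCTAGGA", 4)
b = kmers("ACGTACGTTGCAACGTAGCTTTTTTGGA", 4)
estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
exact = exact_jaccard(a, b)
```

The estimate approaches the exact value as the number of hash functions grows; LSH schemes then typically group signatures into bands so that similar sets collide in at least one band. Note that `minhash_signature` raises `ValueError` on an empty k-mer set, which happens when a sequence is shorter than k.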
Genomic signatures of heterokaryosis in the oomycete pathogen Bremia lactucae.
Lettuce downy mildew caused by Bremia lactucae is the most important disease of lettuce globally. This oomycete is highly variable and rapidly overcomes resistance genes and fungicides. The use of multiple read types results in a high-quality, near-chromosome-scale, consensus assembly. Flow cytometry plus resequencing of 30 field isolates, 37 sexual offspring, and 19 asexual derivatives from single multinucleate sporangia demonstrates a high incidence of heterokaryosis in B. lactucae. Heterokaryosis has phenotypic consequences on fitness that may include an increased sporulation rate and qualitative differences in virulence. Therefore, selection should be considered as acting on a population of nuclei within coenocytic mycelia. This provides evolutionary flexibility to the pathogen enabling rapid adaptation to different repertoires of host resistance genes and other challenges. The advantages of asexual persistence of heterokaryons may have been one of the drivers of selection that resulted in the loss of uninucleate zoospores in multiple downy mildews
Hiking in the energy landscape in sequence space: a bumpy road to good folders
With the help of a simple 20-letter lattice model of heteropolymers, we
investigate the energy landscape in the space of designed good-folder
sequences. Low-energy sequences form clusters, interconnected via neutral
networks, in the space of sequences. Residues which play a key role in the
foldability of the chain and in the stability of the native state are highly
conserved, even among the chains belonging to different clusters. If, according
to the interaction matrix, some strong attractive interactions are almost
degenerate (i.e. they can be realized by more than one type of amino acid
contact), sequence clusters group into a few super-clusters. Sequences
belonging to different super-clusters are dissimilar, displaying very small
() similarity, and residues in key-sites are, as a rule, not
conserved. Similar behavior is observed in the analysis of real protein
sequences. Comment: 17 pages, 5 figures. Corrected typos, added auxiliary informatio