Search CORE

6,570 research outputs found

Cis-regulatory module detection using constraint programming

Author: Guns Tias
Marchal Kathleen
Nijssen Siegfried
Sun Hong
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2010
Field of study

We propose a method for finding CRMs in a set of co-regulated genes. Each CRM consists of a set of binding sites of transcription factors. We wish to find CRMs involving the same transcription factors in multiple sequences. Finding such a combination of transcription factors is inherently a combinatorial problem. We solve this problem by combining the principles of itemset mining and constraint programming. The constraints involve the putative binding sites of transcription factors, the number of sequences in which they co-occur and the proximity of the binding sites. Genomic background sequences are used to assess the significance of the modules. We experimentally validate our approach and compare it with state-of-the-art techniques

Crossref

Ghent University Academic Bibliography

Efficient Genomic Interval Queries Using Augmented Range Trees

Author: Eran Alal
Luo Yuan
Mao Chengsheng
Publication venue
Publication date: 04/06/2018
Field of study

Efficient large-scale annotation of genomic intervals is essential for personal genome interpretation in the realm of precision medicine. There are 13 possible relations between two intervals according to Allen's interval algebra. Conventional interval trees are routinely used to identify the genomic intervals satisfying a coarse relation with a query interval, but cannot support efficient query for more refined relations such as all Allen's relations. We design and implement a novel approach to address this unmet need. Through rewriting Allen's interval relations, we transform an interval query to a range query, then adapt and utilize the range trees for querying. We implement two types of range trees: a basic 2-dimensional range tree (2D-RT) and an augmented range tree with fractional cascading (RTFC) and compare them with the conventional interval tree (IT). Theoretical analysis shows that RTFC can achieve the best time complexity for interval queries regarding all Allen's relations among the three trees. We also perform comparative experiments on the efficiency of RTFC, 2D-RT and IT in querying noncoding element annotations in a large collection of personal genomes. Our experimental results show that 2D-RT is more efficient than IT for interval queries regarding most of Allen's relations, RTFC is even more efficient than 2D-RT. The results demonstrate that RTFC is an efficient data structure for querying large-scale datasets regarding Allen's relations between genomic intervals, such as those required by interpreting genome-wide variation in large populations.Comment: 4 figures, 4 table

arXiv.org e-Print Archive

Directory of Open Access Journals

SEED: efficient clustering of next-generation sequences.

Author: Bao Ergude
Girke Thomas
Jiang Tao
Kaloshian Isgouhi
Publication venue: eScholarship, University of California
Publication date: 02/08/2011
Field of study

MotivationSimilarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.ResultsHere, we introduce SEED-an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.AvailabilityThe SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/[email protected] informationSupplementary data are available at Bioinformatics online

PubMed Central

eScholarship - University of California

CoCoNUT: an efficient system for the comparison and analysis of genomes

Author: A Darling
A Kasprzyk
B Haas
B Ma
B Mau
B Morgenstern
B Raphael
C Wawra
DR Bentley
E Mardis
E Ohlebusch
E Passarge
E Sonnhammer
Enno Ohlebusch
G Bourque
G Gremme
I Ovcharenko
J Krumsiek
J Peterson
J Thompson
L Florea
M Abouelhoda
M Abouelhoda
M Abouelhoda
M Abouelhoda
M Abouelhoda
M Blanchette
M Brudno
M Clamp
M Höhl
M Kellis
M Margulies
Mohamed I Abouelhoda
P Chain
R Staden
S Altschul
S Karlin
S Kurtz
S Ranganathan
S Schwartz
S Schwartz
S Shibuya
Stefan Kurtz
T Treangen
T Vision
T Wu
The Arabidopsis Genome Initiative
W Kent
Publication venue: BioMed Central
Publication date: 01/01/2008
Field of study

Crossref

Springer - Publisher Connector

PubMed Central

An efficient genetic algorithm for structural RNA pairwise alignment and its application to non-coding RNA discovery in yeast

Author: A Harmanci
A Harmanci
A Taneda
A Uzilov
A Uzilov
A Wilm
Akito Taneda
B Knudsen
C Lu
C Notredame
C Notredame
C Selig
CC Chang
CMA Davis Jr
D Dalli
D Dalli
D Rose
D Sankoff
DE Goldberg
E Rivas
E Rivas
E Torarinsson
E Torarinsson
E Torarinsson
F Miura
G Gonsalvez
H Kiryu
H Kiryu
I Hofacker
I Hofacker
I Holmes
J Cherry
J Gorodkin
J Havgaard
J Havgaard
J Pedersen
J Schultz
J Thompson
J Thompson
K Katoh
K Missal
K Missal
L David
M Bauer
M Gerstein
M Samanta
P Carninci
R Dowell
R Klein
R Nussinov
S Needleman
S Washietl
S Washietl
S Washietl
S Will
W Gish
X Xu
Y Tabei
Publication venue: BioMed Central
Publication date: 01/01/2008
Field of study

Abstract Background Aligning RNA sequences with low sequence identity has been a challenging problem since such a computation essentially needs an algorithm with high complexities for taking structural conservation into account. Although many sophisticated algorithms for the purpose have been proposed to date, further improvement in efficiency is necessary to accelerate its large-scale applications including non-coding RNA (ncRNA) discovery. Results We developed a new genetic algorithm, Cofolga2, for simultaneously computing pairwise RNA sequence alignment and consensus folding, and benchmarked it using BRAliBase 2.1. The benchmark results showed that our new algorithm is accurate and efficient in both time and memory usage. Then, combining with the originally trained SVM, we applied the new algorithm to novel ncRNA discovery where we compared <it>S. cerevisiae </it>genome with six related genomes in a pairwise manner. By focusing our search to the relatively short regions (50 bp to 2,000 bp) sandwiched by conserved sequences, we successfully predict 714 intergenic and 1,311 sense or antisense ncRNA candidates, which were found in the pairwise alignments with stable consensus secondary structure and low sequence identity (≤ 50%). By comparing with the previous predictions, we found that > 92% of the candidates is novel candidates. The estimated rate of false positives in the predicted candidates is 51%. Twenty-five percent of the intergenic candidates has supports for expression in cell, i.e. their genomic positions overlap those of the experimentally determined transcripts in literature. By manual inspection of the results, moreover, we obtained four multiple alignments with low sequence identity which reveal consensus structures shared by three species/sequences. Conclusion The present method gives an efficient tool complementary to sequence-alignment-based ncRNA finders.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The Parallelism Motifs of Genomic Data Analysis

Author: Awan Muaaz
Azad Ariful
Brock Benjamin
Buluc Aydin
Egan Rob
Ekanayake Saliya
Ellis Marquita
Georganas Evangelos
Guidi Giulia
Hofmeyr Steven
Oliker Leonid
Selvitopi Oguz
Teodoropol Cristina
Yelick Katherine
Publication venue: 'The Royal Society'
Publication date: 20/01/2020
Field of study

Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing

arXiv.org e-Print Archive

eScholarship - University of California

Viral population estimation using pyrosequencing

Author: A Dempster
A Rambaut
AMN Tsibris
B Gaschen
Baback Gharizadeh
C Wang
Chunlin Wang
D O'Meara
DC Douek
E Domingo
E Halperin
EH Simpson
ES Lander
Glenn Tesler
GS Gottlieb
GW Tyson
H Fakhrai-Rad
I Malet
IM Rouzine
J Kececioglu
JE Hopcroft
JF Simons
K Chen
KJ Metzner
L Bacheler
L Doukhan
L Excoffier
Lior Pachter
LR Ford
M Breitbart
M Eigen
M Margulies
M Stephens
MA Nowak
MJ Gonzales
ML Collins
ML Sogin
Mostafa Ronaghi
MT Tammi
N Beerenwinkel
Nicholas Eriksson
Niko Beerenwinkel
P Jenkins
PA Pevzner
R Schmid
R Shankarappa
Robert W. Shafer
RP Dilworth
S Huse
S-Y Rhee
S-Y Rhee
Soo-Yon Rhee
VA Johnson
Yumi Mitsuya
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2008
Field of study

The diversity of virus populations within single infected hosts presents a major difficulty for the natural immune response as well as for vaccine design and antiviral drug therapy. Recently developed pyrophosphate based sequencing technologies (pyrosequencing) can be used for quantifying this diversity by ultra-deep sequencing of virus samples. We present computational methods for the analysis of such sequence data and apply these techniques to pyrosequencing data obtained from HIV populations within patients harboring drug resistant virus strains. Our main result is the estimation of the population structure of the sample from the pyrosequencing reads. This inference is based on a statistical approach to error correction, followed by a combinatorial algorithm for constructing a minimal set of haplotypes that explain the data. Using this set of explaining haplotypes, we apply a statistical model to infer the frequencies of the haplotypes in the population via an EM algorithm. We demonstrate that pyrosequencing reads allow for effective population reconstruction by extensive simulations and by comparison to 165 sequences obtained directly from clonal sequencing of four independent, diverse HIV populations. Thus, pyrosequencing can be used for cost-effective estimation of the structure of virus populations, promising new insights into viral evolutionary dynamics and disease control strategies.Comment: 23 pages, 13 figure

arXiv.org e-Print Archive

CiteSeerX

Public Library of Science (PLOS)

Repository for Publications and Research Data

Crossref

Directory of Open Access Journals

PubMed Central

Caltech Authors