Aligning Sequences by Minimum Description Length

Author: Conery John S
Publication venue: BioMed Central
Publication date: 01/01/2007

This paper presents a new information-theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from [inline formula image: 1687-4153-2007-72936-i1.gif]. A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.
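To make the code-length idea in this abstract concrete, here is a minimal sketch, not the paper's actual encoding. It assumes a four-letter DNA alphabet, uniform background frequencies, and a made-up conditional distribution (`conditional`, with the hypothetical parameter `p_same`) standing in for a real substitution matrix. It compares the bits needed to send two strings independently with the bits needed to send the second string conditioned on the first.

```python
import math

ALPHABET = "ACGT"
BACKGROUND = {c: 0.25 for c in ALPHABET}  # assumption: uniform letter frequencies


def conditional(y, x, p_same=0.7):
    """Toy P(y | x); a hypothetical stand-in for a substitution matrix."""
    p_diff = (1.0 - p_same) / (len(ALPHABET) - 1)
    return p_same if y == x else p_diff


def unaligned_bits(s, t):
    """Bits to send both strings letter by letter, independently."""
    return sum(-math.log2(BACKGROUND[c]) for c in s + t)


def aligned_bits(s, t):
    """Bits to send s, then each letter of t conditioned on its partner in s."""
    if len(s) != len(t):
        raise ValueError("this sketch only handles gap-free, equal-length pairs")
    cost = sum(-math.log2(BACKGROUND[x]) for x in s)
    cost += sum(-math.log2(conditional(y, x)) for x, y in zip(s, t))
    return cost
```

With these toy numbers, an identical pair such as `"ACGT"` and `"ACGT"` costs fewer bits when encoded as aligned than when sent independently, while a fully mismatched pair costs more. That asymmetry is what lets a shortest-encoding criterion prefer alignments of similar regions.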

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Jabba: hybrid error correction for long sequencing reads using maximal exact matches

Author: Audenaert P.
Demeester Piet
Fostier Jan
Heydari Mahdi
Miclotte Giles
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date: 01/01/2015

Third generation sequencing platforms produce longer reads with higher error rates than second generation sequencing technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is that this mapping is constructed with a seed-and-extend methodology, using maximal exact matches as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of maximal exact matches in the context of third generation reads are presented.
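As an illustration of what a maximal exact match (MEM) seed is, the following brute-force sketch finds all MEMs between a read and a linear reference string. It is not Jabba's implementation: the abstract describes mapping onto a de Bruijn graph, and practical tools use indexed data structures rather than a quadratic scan. The function name `maximal_exact_matches` and the parameter `min_len` are illustrative assumptions.

```python
def maximal_exact_matches(read, ref, min_len):
    """Return (read_pos, ref_pos, length) for every exact match of at least
    min_len characters that cannot be extended to the left or to the right."""
    mems = []
    for i in range(len(read)):
        for j in range(len(ref)):
            if read[i] != ref[j]:
                continue
            # Skip starts that could be extended leftwards (not left-maximal).
            if i > 0 and j > 0 and read[i - 1] == ref[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(read) and j + k < len(ref) and read[i + k] == ref[j + k]:
                k += 1
            if k >= min_len:
                mems.append((i, j, k))
    return mems
```

In a seed-and-extend scheme, seeds like these anchor the read, and the noisy stretches between consecutive seeds are then aligned separately.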

Ghent University Academic Bibliography

File Updates Under Random/Arbitrary Insertions And Deletions

Author: Cadambe Viveck
Jaggi Sidharth
Médard Muriel
Schwartz Moshe
Wang Qiwen
Publication date: 27/02/2015

A client/encoder edits a file, as modeled by an insertion-deletion (InDel) process. An old copy of the file is stored remotely at a data-centre/decoder, and is also available to the client. We consider the problem of throughput- and computationally-efficient communication from the client to the data-centre, to enable the server to update its copy to the newly edited file. We study two models for the source files/edit patterns: the random pre-edit sequence left-to-right random InDel (RPES-LtRRID) process, and the arbitrary pre-edit sequence arbitrary InDel (APES-AID) process. In both models, we consider the regime in which the number of insertions/deletions is a small (but constant) fraction of the original file. For both models we prove information-theoretic lower bounds on the best possible compression rates that enable file updates. Conversely, our compression algorithms use dynamic programming (DP) and entropy coding, and achieve rates that are approximately optimal. Comment: The paper is an extended version of our paper to appear at ITW 201

arXiv.org e-Print Archive

DSpace@MIT

Crossref

A New Simulated Annealing Algorithm for the Multiple Sequence Alignment Problem: The approach of Polymers in a Random Media

Author: A. Godzik
D. Gunsfield
J. Kim
M. Hernández-Guía
M. Ishikawa
M. S. Waterman
P. Pevzner
R. Durbin
R. Mulet
S. Geman
S. Rodríguez-Pérez
Publication venue: American Physical Society (APS)
Publication date: 10/01/2005

We propose a probabilistic algorithm to solve the Multiple Sequence Alignment problem. The algorithm is a Simulated Annealing (SA) scheme that exploits the representation of the multiple alignment between D sequences as a directed polymer in D dimensions. Within this representation we can easily track the evolution in the configuration space of the alignment through local moves of low computational cost. At variance with other probabilistic algorithms proposed to solve this problem, our approach allows for the creation and deletion of gaps without extra computational cost. The algorithm was tested aligning proteins from the kinases family. When D=3 the results are consistent with those obtained using a complete algorithm. For D>3, where the complete algorithm fails, we show that our algorithm still converges to reasonable alignments. Moreover, we study the space of solutions obtained and show that, depending on the number of sequences aligned, the solutions are organized in different ways, suggesting a possible source of errors for progressive algorithms. Comment: 7 pages and 11 figures

arXiv.org e-Print Archive

Crossref

De novo assembly of the cattle reference genome with single-molecule sequencing.

Author: Bickhart Derek M
Cole John B
Couldrey Christine
Dreischer Christian
Elsik Christine G
Ghurye Jay
Hagen Darren E
Hall Richard
Hammond John A
Hoffman Jinna
Koren Sergey
Li Wenli
Liu George
Low Wai Y
McDaneld Tara G
McKay Stephanie D
Medrano Juan F
Murdoch Brenda M
Nandolo Wilson
Phillippy Adam M
Rhie Arang
Rosen Benjamin D
Rowan Troy N
Schnabel Robert D
Schroeder Steven G
Schultheiss Sebastian J
Schwartz John C
Smith Timothy PL
Snelling Warren M
Thibaud-Nissen Françoise
Tseng Elizabeth
Van Tassell Curtis P
Zimin Aleksey
Publication venue: eScholarship, University of California
Publication date: 01/03/2020

Background: Major advances in selection progress for cattle have been made following the introduction of genomic tools over the past 10-12 years. These tools depend upon the Bos taurus reference genome (UMD3.1.1), which was created using now-outdated technologies and is hindered by a variety of deficiencies and inaccuracies. Results: We present the new reference genome for cattle, ARS-UCD1.2, based on the same animal as the original to facilitate transfer and interpretation of results obtained from the earlier version, but applying a combination of modern technologies in a de novo assembly to increase continuity, accuracy, and completeness. The assembly includes 2.7 Gb and is >250× more continuous than the original assembly, with contig N50 >25 Mb and L50 of 32. We also greatly expanded supporting RNA-based data for annotation that identifies 30,396 total genes (21,039 protein coding). The new reference assembly is accessible in annotated form for public use. Conclusions: We demonstrate that improved continuity of assembled sequence warrants the adoption of ARS-UCD1.2 as the new cattle reference genome and that increased assembly accuracy will benefit future research on this species.
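For readers unfamiliar with the continuity metrics quoted in this abstract, here is a minimal sketch of the conventional definitions. N50 is the length of the contig at which the running total, taken over contigs sorted from longest to shortest, first reaches half of the assembly size. L50 is the number of contigs needed to get there. The function name `n50_l50` and the toy lengths in the usage note are illustrative and not taken from the paper.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # The first contig that pushes the running total past half the
        # assembly size defines both metrics.
        if running >= half_total:
            return length, count
```

For toy contigs of lengths 80, 70, 50, 40, 30 and 20 (total 290), the two longest contigs already cover at least half of the total, so `n50_l50` returns `(70, 2)`. A higher N50 together with a lower L50 indicates a more continuous assembly.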

eScholarship - University of California

A Neural Multi-sequence Alignment TeCHnique (NeuMATCH)

Author: Dogan Pelin
Gross Markus
Li Boyang
Sigal Leonid
Publication date: 09/04/2018

The alignment of heterogeneous sequential data (video to text) is an important and challenging problem. Standard techniques for this task, including Dynamic Time Warping (DTW) and Conditional Random Fields (CRFs), suffer from inherent drawbacks. Mainly, the Markov assumption implies that, given the immediate past, future alignment decisions are independent of further history. The separation between similarity computation and alignment decision also prevents end-to-end training. In this paper, we propose an end-to-end neural architecture where alignment actions are implemented as moving data between stacks of Long Short-term Memory (LSTM) blocks. This flexible architecture supports a large variety of alignment tasks, including one-to-one, one-to-many, skipping unmatched elements, and (with extensions) non-monotonic alignment. Extensive experiments on semi-synthetic and real datasets show that our algorithm outperforms state-of-the-art baselines. Comment: Accepted at CVPR 2018 (Spotlight). arXiv file includes the paper and the supplemental material
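The Dynamic Time Warping baseline mentioned in this abstract can be sketched as follows. This is the textbook recurrence on scalar sequences with an absolute-difference cost, not NeuMATCH and not the configuration used in the paper's experiments. The function name `dtw_distance` is an illustrative choice.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # similarity is computed separately
            # Each cell depends only on its three immediate predecessors.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

The recurrence shows both drawbacks the abstract points to: every decision looks only at the immediately preceding cells, and the local cost `d` is fixed before the alignment is computed, so it cannot be trained jointly with the alignment decisions.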

arXiv.org e-Print Archive

Repository for Publications and Research Data

Crossref

Multiple structural alignment for distantly related all β structures using TOPS pattern discovery and simulated annealing

Author: Gilbert D
Westhead DR
Williams A
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2003

Topsalign is a method that will structurally align diverse protein structures, for example, structural alignment of protein superfolds. All proteins within a superfold share the same fold but often have very low sequence identity and different biological and biochemical functions. There is often significant structural diversity around the common scaffold of secondary structure elements of the fold. Topsalign uses topological descriptions of proteins. A pattern discovery algorithm identifies equivalent secondary structure elements between a set of proteins and these are used to produce an initial multiple structure alignment. Simulated annealing is used to optimize the alignment. The output of Topsalign is a multiple structure-based sequence alignment and a 3D superposition of the structures. This method has been tested on three superfolds: the β jelly roll, TIM (α/β) barrel and the OB fold. Topsalign outperforms established methods on very diverse structures. Despite the pattern discovery working only on β strand secondary structure elements, Topsalign is shown to align TIM (α/β) barrel superfamilies, which contain both α helices and β strands.

Crossref

Enlighten

Brunel University Research Archive

Unsupervised Human Action Detection by Action Matching

Author: Fernando Basura
Gould Stephen
Shirazi Sareh
Publication date: 01/01/2017

We propose a new task of unsupervised action detection by action matching. Given two long videos, the objective is to temporally detect all pairs of matching video segments. A pair of video segments is matched if they share the same human action. The task is category independent---it does not matter what action is being performed---and no supervision is used to discover such video segments. Unsupervised action detection by action matching allows us to align videos in a meaningful manner. As such, it can be used to discover new action categories or as an action proposal technique within, say, an action detection pipeline. Moreover, it is a useful pre-processing step for generating video highlights, e.g., from sports videos. We present an effective and efficient method for unsupervised action detection. We use an unsupervised temporal encoding method and exploit the temporal consistency in human actions to obtain candidate action segments. We evaluate our method on this challenging task using three activity recognition benchmarks, namely, the MPII Cooking activities dataset, the THUMOS15 action detection benchmark and a new dataset called the IKEA dataset. On the MPII Cooking dataset we detect action segments with a precision of 21.6% and recall of 11.7% over 946 long video pairs and over 5000 ground truth action segments. Similarly, on the THUMOS dataset we obtain 18.4% precision and 25.1% recall over 5094 ground truth action segment pairs. Comment: IEEE International Conference on Computer Vision and Pattern Recognition CVPR 2017 Workshop

arXiv.org e-Print Archive

Crossref

Queensland University of Technology ePrints Archive

Computing Similarity between a Pair of Trajectories

Author: Agarwal Pankaj K.
Boedihardjo Arnold P.
Mølhave Thomas
Sankararaman Swaminathan
Publication date: 06/03/2013

With recent advances in sensing and tracking technology, trajectory data is becoming increasingly pervasive and analysis of trajectory data is becoming exceedingly important. A fundamental problem in analyzing trajectory data is that of identifying common patterns between pairs or among groups of trajectories. In this paper, we consider the problem of identifying similar portions between a pair of trajectories, each observed as a sequence of points sampled from it. We present new measures of trajectory similarity --- both local and global --- between a pair of trajectories to distinguish between similar and dissimilar portions. Our model is robust under noise and outliers; it does not make any assumptions on the sampling rates on either trajectory, and it works even if they are partially observed. Additionally, the model yields a scalar similarity score which can be used to rank multiple pairs of trajectories according to similarity, e.g. in clustering applications. We also present efficient algorithms for computing the similarity under our measures; the worst-case running time is quadratic in the number of sample points. Finally, we present an extensive experimental study evaluating the effectiveness of our approach on real datasets, comparing it with earlier approaches, and illustrating many issues that arise in trajectory data. Our experiments show that our approach is highly accurate in distinguishing similar and dissimilar portions as compared to earlier methods even with sparse sampling.

arXiv.org e-Print Archive

CiteSeerX