Insertions Yielding Equivalent Double Occurrence Words

Author: Cruz Daniel A.
Ferrari Margherita Maria
Jonoska Natasa
Nabergall Lukas
Saito Masahico
Publication date: 26/09/2019

A double occurrence word (DOW) is a word in which every symbol appears exactly twice; two DOWs are equivalent if one is a symbol-to-symbol image of the other. We consider the so-called repeat pattern ($\alpha\alpha$) and the return pattern ($\alpha\alpha^R$), with gaps allowed between the $\alpha$'s. These patterns generalize square and palindromic factors of DOWs, respectively. We introduce a notion of inserting repeat/return words into DOWs and study how two distinct insertions into the same word can produce equivalent DOWs. Given a DOW $w$, we characterize the structure of $w$ which allows two distinct insertions to yield equivalent DOWs. This characterization depends on the locations of the insertions and on the length of the inserted repeat/return words and implies that when one inserted word is a repeat word and the other is a return word, then both words must be trivial (i.e., have only one symbol). The characterization also introduces a method to generate families of words recursively.
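The two definitions in this abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names are hypothetical, and it assumes that "symbol-to-symbol image" means a one-to-one relabeling of symbols, so two words are equivalent exactly when relabeling their symbols by order of first appearance gives the same sequence.

```python
from collections import Counter


def is_dow(word):
    """Return True if every symbol in `word` occurs exactly twice."""
    return all(count == 2 for count in Counter(word).values())


def canonical_form(word):
    """Relabel symbols by order of first appearance, e.g. 'baab' -> (0, 1, 1, 0)."""
    labels = {}
    # setdefault assigns the next unused integer the first time a symbol is seen.
    return tuple(labels.setdefault(symbol, len(labels)) for symbol in word)


def are_equivalent(u, v):
    """Assumed equivalence test: equal canonical forms (a bijective relabeling exists)."""
    return canonical_form(u) == canonical_form(v)


if __name__ == "__main__":
    print(is_dow("abba"), is_dow("aab"))
    print(are_equivalent("abba", "cddc"), are_equivalent("abab", "abba"))
```

Here "abab" has the shape of a repeat word and "abba" the shape of a return word, both without gaps; the sketch treats them as inequivalent because no relabeling maps one onto the other.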

arXiv.org e-Print Archive

USFSP Digital Archive

Scholar Commons - University of South Florida

RasBhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison

Author: Hahn Lars
Leimeister Chris-André
Lonardi Stefano
Morgenstern Burkhard
Ounit Rachid
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 20/07/2016

Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de.
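The filtering idea described in this abstract (binary patterns of match and don't-care positions) can be sketched in Python as follows. This is only an illustration of spaced-word extraction and match counting, not rasbhari's hill-climbing optimization or the Spaced Words implementation; the function names and the '1'/'0' pattern encoding are assumptions made for the example.

```python
from collections import Counter


def spaced_words(sequence, pattern):
    """Yield the spaced word at each window of `sequence`.

    `pattern` is a string over {'1', '0'}: '1' marks a match position,
    '0' a don't-care position (encoding assumed for this sketch).
    """
    match_positions = [i for i, flag in enumerate(pattern) if flag == "1"]
    for start in range(len(sequence) - len(pattern) + 1):
        # Keep only the characters that sit under match positions.
        yield "".join(sequence[start + i] for i in match_positions)


def count_spaced_matches(seq_a, seq_b, pattern):
    """Count position pairs (one per sequence) whose spaced words are identical."""
    words_a = Counter(spaced_words(seq_a, pattern))
    words_b = Counter(spaced_words(seq_b, pattern))
    return sum(count * words_b[word] for word, count in words_a.items())


if __name__ == "__main__":
    print(list(spaced_words("ACGTA", "101")))
    print(count_spaced_matches("ACGTA", "ATGTA", "101"))
```

With the pattern "101", the windows of "ACGTA" and "ATGTA" that differ only at the middle (don't-care) position still count as matches, which is the tolerance to mismatches that spaced patterns are meant to provide.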

arXiv.org e-Print Archive

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Evolution of antigen binding receptors

Author: Anderson Michele K.
Litman Gary W.
Rast Jonathan P.
Publication venue: 'Annual Reviews'
Publication date: 01/04/1999

This review addresses issues related to the evolution of the complex multigene families of antigen binding receptors that function in adaptive immunity. Advances in molecular genetic technology now permit the study of immunoglobulin (Ig) and T cell receptor (TCR) genes in many species that are not commonly studied yet represent critical branch points in vertebrate phylogeny. Both Ig and TCR genes have been defined in most of the major lineages of jawed vertebrates, including the cartilaginous fishes, which represent the most phylogenetically divergent jawed vertebrate group relative to the mammals. Ig genes in cartilaginous fish are encoded by multiple individual loci that each contain rearranging segmental elements and constant regions. In some loci, segmental elements are joined in the germline, i.e., they do not undergo genetic rearrangement. Other major differences in Ig gene organization and the mechanisms of somatic diversification have occurred throughout vertebrate evolution. However, relating these changes to adaptive immune function in lower vertebrates is challenging. TCR genes exhibit greater sequence diversity in individual segmental elements than is found in Ig genes but have undergone fewer changes in gene organization, isotype diversity, and mechanisms of diversification. As yet, homologous forms of antigen binding receptors have not been identified in jawless vertebrates; however, acquisition of large amounts of structural data for the antigen binding receptors that are found in a variety of jawed vertebrates has defined shared characteristics that provide unique insight into the distant origins of the rearranging gene systems and their relationships to both adaptive and innate recognition processes.

Caltech Authors

BOOL-AN: A method for comparative sequence analysis and phylogenetic reconstruction

Author: Ari Eszter
Horváth Arnold
Ittzés Péter
Jakó Éena
Podani János
Publication venue: 'Elsevier BV'
Publication date: 01/01/2009

A novel discrete mathematical approach is proposed as an additional tool for molecular systematics which does not require prior statistical assumptions concerning the evolutionary process. The method is based on algorithms generating mathematical representations directly from DNA/RNA or protein sequences, followed by the output of numerical (scalar or vector) and visual characteristics (graphs). The binary encoded sequence information is transformed into a compact analytical form, called the Iterative Canonical Form (or ICF) of Boolean functions, which can then be used as a generalized molecular descriptor. The method provides raw vector data for calculating different distance matrices, which in turn can be analyzed by neighbor-joining or UPGMA to derive a phylogenetic tree, or by principal coordinates analysis to get an ordination scattergram. The new method and the associated software for inferring phylogenetic trees are called Boolean analysis, or BOOL-AN.

Crossref

Repository of the Academy's Library

V-H gene usage differs in germline and mutated B-cell chronic lymphocytic leukemia

Author: Amlot P
Duke VM
Foroni L
Gandini D
Heelan B
Hoffbrand AV
Lin K
Mehta AB
Sherrington PD
Publication date: 01/11/2003

UCL Discovery

GenomeFingerprinter and universal genome fingerprint analysis for systematic comparative genomics

Author: Ai Hannan
Ai Yuncan
Meng Fanmei
Zhao Lei
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 09/03/2013

Comparing whole genome sequences at large scale has not been achieved with conventional methods based on pairwise base-to-base comparison; moreover, no attention has been paid to handling, in one sitting, a number of genomes that cross genetic categories (chromosome, plasmid, and phage) with greater divergence (much less or no homology) over large size ranges (from Kbp to Mbp). We created a new method, GenomeFingerprinter, to unambiguously produce three-dimensional coordinates from a sequence, followed by one three-dimensional plot and six two-dimensional trajectory projections to illustrate whole genome fingerprints. We further developed a set of concepts and tools and thereby established a new method, universal genome fingerprint analysis. We demonstrated their applications through case studies on over a hundred genome sequences. In particular, we defined the total genetic component configuration (TGCC) (i.e., chromosome, plasmid, and phage) for describing a strain as a system, the universal genome fingerprint map (UGFM) of TGCC for differentiating a strain as a universal system, and systematic comparative genomics (SCG) for comparing, in one sitting, a number of genomes that cross genetic categories in diverse strains. Using UGFM, UGFM-TGCC, and UGFM-TGCC-SCG, we compared a number of genome sequences with greater divergence (chromosome, plasmid, and phage; bacterium, archaeal bacterium, and virus) over large size ranges (6 Kbp to 5 Mbp), giving new insights into critical problems in microbial genomics in the post-genomic era. This paper provides a new method for rapidly computing, geometrically visualizing, and intuitively comparing genome sequences at the fingerprint level, and hence establishes universal genome fingerprint analysis for systematic comparative genomics.

Comment: 63 pages, 15 figures, 5 tables

arXiv.org e-Print Archive

Directory of Open Access Journals

PubMed Central

FigShare