
    Boundary values of holomorphic semigroups of unbounded operators and similarity of certain perturbations

    Abstract: We obtain sufficient conditions for a “holomorphic” semigroup of unbounded operators to possess a boundary group of bounded operators. The theorem is applied to generalize to unbounded operators results of Kantorovitz about the similarity of certain perturbations. Our theory includes a result of Fisher on the Riemann-Liouville semigroup in L^p(0, ∞), 1 < p < ∞. In this particular case we also give an alternative approach, where the boundary group is obtained as the limit of groups in the weak operator topology.

    Approximate word matches between two random sequences

    Given two sequences over a finite alphabet $\mathcal{L}$, the $D_2$ statistic is the number of $m$-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the $D_2$ statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For $k<m$, we look at the count of $m$-letter word matches with up to $k$ mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance and prove its distribution is asymptotically normal. Comment: Published at http://dx.doi.org/10.1214/07-AAP452 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org)
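The count described in this abstract can be illustrated with a minimal brute-force sketch. It assumes plain Python strings as sequences, and the function name `approx_word_matches` is hypothetical. It is not the authors' implementation and ignores the strand-symmetry assumption; it only counts pairs of $m$-letter words that differ in at most $k$ positions.

```python
def approx_word_matches(a: str, b: str, m: int, k: int = 0) -> int:
    """Count pairs (i, j) such that the m-letter words a[i:i+m] and
    b[j:j+m] differ in at most k positions (k = 0 gives exact matches)."""
    if m <= 0 or not 0 <= k < m:
        raise ValueError("need m > 0 and 0 <= k < m")
    count = 0
    for i in range(len(a) - m + 1):
        for j in range(len(b) - m + 1):
            # Hamming distance between the two m-letter words
            mismatches = sum(x != y for x, y in zip(a[i:i + m], b[j:j + m]))
            if mismatches <= k:
                count += 1
    return count


exact = approx_word_matches("ACGT", "ACGA", m=2, k=0)
approx = approx_word_matches("ACGT", "ACGA", m=2, k=1)
```

This direct double loop is quadratic in the sequence lengths; it is meant to make the definition concrete rather than to be fast.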

    Empirical distribution of k-word matches in biological sequences

    This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D_2 statistic. The advantages of the use of this statistic over alignment-based methods are firstly that it does not assume that homologous segments are contiguous, and secondly that the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequence under scrutiny. Existing applications of the D_2 statistic include the clustering of related sequences in large EST databases such as the STACK database. Such applications have typically relied on heuristics without any statistical basis. Rigorous statistical characterisations of the distribution of D_2 have subsequently been undertaken, but have focussed on the distribution's asymptotic behaviour, leaving the distribution of D_2 uncharacterised for most practical cases. The work presented here bridges these two worlds to give usable approximations of the distribution of D_2 for ranges of parameters most frequently encountered in the study of biological sequences. Comment: 23 pages, 10 figures
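One common way to reach the near-linear runtime mentioned above is to count the k-words of each sequence once and then sum the products of the counts. The sketch below assumes this counting approach, and the name `d2_statistic` is hypothetical; the abstract does not specify the implementation used in the study.

```python
from collections import Counter


def d2_statistic(a: str, b: str, k: int) -> int:
    """Number of k-word matches between a and b, counted over all
    pairs of positions: sum over words w of count_a(w) * count_b(w)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # One pass over each sequence to tabulate its k-words
    words_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    words_b = Counter(b[j:j + k] for j in range(len(b) - k + 1))
    # Words absent from b contribute zero (Counter returns 0 for them)
    return sum(n * words_b[w] for w, n in words_a.items())


shared = d2_statistic("ACGT", "ACGA", k=2)
```

For a fixed word length k, each sequence is scanned once, so the work grows linearly with the sequence lengths.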

    Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences

    BACKGROUND: The number of k-words shared between two sequences is a simple and efficient alignment-free sequence comparison method. This statistic, D(2), has been used for the clustering of EST sequences. Sequence comparison based on D(2) is extremely fast: its runtime is proportional to the size of the sequences under scrutiny, whereas alignment-based comparisons have a worst-case run time proportional to the square of the size. Recent studies have tackled the rigorous study of the statistical distribution of D(2), and asymptotic regimes have been derived. The distribution of approximate k-word matches has also been studied. RESULTS: We have computed the D(2) optimal word size for various sequence lengths, and for both perfect and approximate word matches. Kolmogorov-Smirnov tests show D(2) to have a compound Poisson distribution at the optimal word size for small sequence lengths (below 400 letters) and a normal distribution at the optimal word size for large sequence lengths (above 1600 letters). We find that the D(2) statistic outperforms BLAST in the comparison of artificially evolved sequences, and performs similarly to other methods based on exact word matches. These results obtained with randomly generated sequences are also valid for sequences derived from human genomic DNA. CONCLUSION: We have characterized the distribution of the D(2) statistic at optimal word sizes. We find that the best trade-off between computational efficiency and accuracy is obtained with exact word matches. Given that our numerical tests have not included sequence shuffling, transposition or splicing, the improvements over existing methods reported here underestimate those expected in real sequences. Because of the linear run time and of the known normal asymptotic behavior, D(2)-based methods are most appropriate for large genomic sequences.

    Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts

    Motivation: The identity of cells and tissues is to a large degree governed by transcriptional regulation. A major part is accomplished by the combinatorial binding of transcription factors at regulatory sequences, such as enhancers. Even though binding of transcription factors is sequence-specific, estimating the sequence similarity of two functionally similar enhancers is very difficult. However, a similarity measure for regulatory sequences is crucial to detect and understand functional similarities between two enhancers and will facilitate large-scale analyses like clustering, prediction and classification of genome-wide datasets

    Metric operators, generalized hermiticity and lattices of Hilbert spaces

    A quasi-Hermitian operator is an operator that is similar to its adjoint in some sense, via a metric operator, i.e., a strictly positive self-adjoint operator. Whereas those metric operators are in general assumed to be bounded, we analyze the structure generated by unbounded metric operators in a Hilbert space. It turns out that such operators generate a canonical lattice of Hilbert spaces, that is, the simplest case of a partial inner product space (PIP-space). We introduce several generalizations of the notion of similarity between operators, in particular, the notion of quasi-similarity, and we explore to what extent they preserve spectral properties. Then we apply some of the previous results to operators on a particular PIP-space, namely, a scale of Hilbert spaces generated by a metric operator. Finally, motivated by the recent developments of pseudo-Hermitian quantum mechanics, we reformulate the notion of pseudo-Hermitian operators in the preceding formalism. Comment: 51 pages; will appear as a chapter in Non-Selfadjoint Operators in Quantum Physics: Mathematical Aspects; F. Bagarello, J-P. Gazeau, F. H. Szafraniec and M. Znojil, eds., J. Wiley, 201

    Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison

    Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, ‘enhancers’), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for ‘motif-blind’ CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to ‘supervise’ the search. We propose a new statistical method, based on ‘Interpolated Markov Models’, for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers.

    Decoding the genome with an integrative analysis tool: Combinatorial CRM Decoder

    The identification of genome-wide cis-regulatory modules (CRMs) and characterization of their associated epigenetic features are fundamental steps toward the understanding of gene regulatory networks. Although integrative analysis of available genome-wide information can provide new biological insights, the lack of novel methodologies has become a major bottleneck. Here, we present a comprehensive analysis tool called combinatorial CRM decoder (CCD), which utilizes the publicly available information to identify and characterize genome-wide CRMs in a species of interest. CCD first defines a set of the epigenetic features which is significantly associated with a set of known CRMs as a code called ‘trace code’, and subsequently uses the trace code to pinpoint putative CRMs throughout the genome. Using 61 genome-wide data sets obtained from 17 independent mouse studies, CCD successfully catalogued ∼12 600 CRMs (five distinct classes) including polycomb repressive complex 2 target sites as well as imprinting control regions. Interestingly, we discovered that ∼4% of the identified CRMs belong to at least two different classes named ‘multi-functional CRM’, suggesting their functional importance for regulating spatiotemporal gene expression. From these examples, we show that CCD can be applied to any potential genome-wide datasets and therefore will shed light on unveiling genome-wide CRMs in various species.