5 research outputs found

    Reconstruction of Trees from Jumbled and Weighted Subtrees

    Get PDF
    Let T be an edge-labeled graph, where the labels are from a finite alphabet Sigma. For a subtree U of T the Parikh vector of U is a vector of length |Sigma| which specifies the multiplicity of each label in U. We ask when T can be reconstructed from the multiset of Parikh vectors of all its subtrees, or all of its paths, or all of its maximal paths. We consider the analogous problems for weighted trees. We show how several well-known reconstruction problems on labeled strings, weighted strings and point sets on a line can be included in this framework. We present reconstruction algorithms and non-reconstructibility results, and extend the polynomial method, previously applied to jumbled strings [Acharya et al., SIAM J. on Discr. Math, 2015] and weighted strings [Bansal et al., CPM 2004], to deal with general trees and special tree classes

    Reconstruction of Trees from Jumbled and Weighted Subtrees

    Get PDF
    Let T be an edge-labeled graph, where the labels are from a finite alphabet Sigma. For a subtree U of T the Parikh vector of U is a vector of length |Sigma| which specifies the multiplicity of each label in U. We ask when T can be reconstructed from the multiset of Parikh vectors of all its subtrees, or all of its paths, or all of its maximal paths. We consider the analogous problems for weighted trees. We show how several well-known reconstruction problems on labeled strings, weighted strings and point sets on a line can be included in this framework. We present reconstruction algorithms and non-reconstructibility results, and extend the polynomial method, previously applied to jumbled strings [Acharya et al., SIAM J. on Discr. Math, 2015] and weighted strings [Bansal et al., CPM 2004], to deal with general trees and special tree classes

    Finding Submasses in Weighted Strings with Fast Fourier Transform

    Get PDF
    We study the Submass Finding Problem: given a string s over a weighted alphabet, i.e., an alphabet with a weight function :→N,werefertoamassM ∈NasasubmassofsifshasasubstringwhoseweightssumuptoM.Now,forasetofinput masses {M1 , . . . , Mk }, we want to find those Mi which are submasses of s, and return one or all occurrences of substrings with mass Mi . We present efficient algorithms for both the decision and the search problem. Furthermore, our approach allows us to compute efficiently the number of different submasses of s. The main idea of our algorithms is to define appropriate polynomials such that we can determine the solution for the Submass Finding Problem from the coefficients of the product of these polynomials. We obtain very efficient running times by using Fast Fourier Transform to compute this product. Our main algorithm for the decision problem runs in time O(s log s ), where s is the total mass of string s. Employing methods for compressing sparse polynomials, this runtime can be viewed as O((s) log2 (s)), where (s) denotes the number of different submasses of s. In this case, the runtime is independent of the size of the individual masses of characters

    Strings in proteomics and transcriptomics : algorithmic and combinatorial questions in mass spectrometry and EST clustering

    Get PDF
    Lipták Z. Strings in proteomics and transcriptomics : algorithmic and combinatorial questions in mass spectrometry and EST clustering. Bielefeld (Germany): Bielefeld University; 2005.This thesis treats two problem areas in bioinformatics which can both be beneficially formalized as string problems. The first (and larger) part deals with weighted string problems as they arise from biotechnological mass spectrometry applications. In a mass spectrometry experiment - put in a simplified manner - the molecular mass of the sample molecules is determined. The aim is to identify the sample, either with or without additional information from a database, relying on the fact that the different building blocks of proteins (namely, amino acids) and of DNA molecules (nucleotides) have different molecular masses. Viewing proteins and DNA molecules as strings, this leads naturally to the definition of weighted strings as strings over a finite alphabet with an additional weight (or mass) function. We develop some results in weighted string combinatorics, and then present efficient algorithms for three weighted string problems which are motivated by mass spectrometry. First, the mass decomposition problem is the problem of finding all compomers whose mass equals a query mass. Here, a compomer is an integer vector specifying the number of occurrences of each character. A compomer abstracts from the order of characters in a string and identifies instead all strings with the same number of occurrences of each character - since these strings all have the same mass. We also present simulation results and tables detailing the number of compomers for different biomolecules, in the ranges appropriate for mass spectrometry applications. Next, we give several efficient algorithms for the submass problem: to test, for a given weighted string s and a query mass M, whether s has a substring with mass M; to find where such a substring occurs; and other variants. We present an algorithm for binary alphabets which runs in time logarithmic in the length of s. Furthermore, we present several algorithms for the problem where multiple masses are sought; these algorithms encode the submasses of s in a polynomial and rely for their efficiency on Fast Fourier Transform for polynomial multiplication. The third weighted string problem we discuss is de novo peptide sequencing, i.e., recovering the amino acid sequence of a sample peptide from tandem MS data, a specialized mass spectrometry experiment. We describe an algorithm which is an enhancement of a dynamic programming algorithm presented by Chen et al. in 2000. We describe our implementation and present some simulation results. The second part of the thesis deals with EST clustering. ESTs (expressed sequence tags) are short DNA sequences which are partial copies of mRNAs; they are produced in a high-throughput manner and submitted to public databases. In EST clustering, the aim is to produce a partition of a large set of ESTs, where each cluster corresponds to a gene; thus enabling the researcher to identify which genes were being expressed. Clearly, the quality of the clustering is highly dependent on the dissimilarity measure (distance/similarity measure) for strings and on the clustering algorithm employed. We have developed a method for evaluating different string dissimilarity measures and clustering algorithms. Finally, we present simulation results for five dissimilarity measures and one clustering algorithm
    corecore