36 research outputs found

    Tight(er) bounds for similarity measures, smoothed approximation and broadcasting

    In this thesis, we prove upper and lower bounds on the complexity of sequence similarity measures, the approximability of geometric problems on realistic inputs, and the performance of randomized broadcasting protocols. The first part approaches the question of why a number of fundamental polynomial-time problems (specifically, Dynamic Time Warping, Longest Common Subsequence (LCS), and the Levenshtein distance) have resisted decades-long attempts to obtain polynomial improvements over their simple dynamic programming solutions. We prove that any (strongly) subquadratic algorithm for these and related sequence similarity measures would refute the Strong Exponential Time Hypothesis (SETH). Focusing particularly on LCS, we determine a tight running time bound (up to lower order factors and conditional on SETH) when the running time is expressed in terms of all input parameters that have been previously exploited in the extensive literature. In the second part, we investigate the approximation performance of the popular 2-Opt heuristic for the Traveling Salesperson Problem using the smoothed analysis paradigm. For the Fréchet distance, we design an improved approximation algorithm for the natural input class of c-packed curves, matching a conditional lower bound. Finally, in the third part we prove tighter performance bounds for processes that disseminate a piece of information, either as quickly as possible (rumor spreading) or as anonymously as possible (cryptogenography).
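
As a point of reference for the "simple dynamic programming solutions" mentioned in this abstract, the following is a minimal sketch of the textbook quadratic-time LCS recurrence. It is not taken from the thesis; the function name `lcs_length` is chosen here for illustration only.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b.

    Textbook dynamic program: O(len(a) * len(b)) time, O(len(b)) space.
    """
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]  # LCS with the empty prefix of b is always 0
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the LCS of both shorter prefixes by one character.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last character of either a's or b's prefix.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

The nested loops over both inputs are the source of the quadratic running time that, according to the abstract, cannot be improved by a polynomial factor unless SETH fails.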

    Novel stochastic and entropy-based Expectation-Maximisation algorithm for transcription factor binding site motif discovery

    The discovery of transcription factor binding site (TFBS) motifs remains an important and challenging problem in computational biology. This thesis presents MITSU, a novel algorithm for TFBS motif discovery which exploits stochastic methods both as a means of overcoming optimality limitations in current algorithms and as a framework for incorporating relevant prior knowledge in order to improve results. The current state of the TFBS motif discovery field is surveyed, with a focus on probabilistic algorithms that typically take the promoter regions of coregulated genes as input. A case is made for an approach based on the stochastic Expectation-Maximisation (sEM) algorithm; its position amongst existing probabilistic algorithms for motif discovery is shown. The algorithm developed in this thesis is unique amongst existing motif discovery algorithms in that it combines the sEM algorithm with a derived data set which leads to an improved approximation to the likelihood function. This likelihood function is unconstrained with regard to the distribution of motif occurrences within the input dataset. MITSU also incorporates a novel heuristic to automatically determine TFBS motif width. This heuristic, known as MCOIN, is shown to outperform current methods for determining motif width. MITSU is implemented in Java and an executable is available for download. MITSU is evaluated quantitatively using realistic synthetic data and several collections of previously characterised prokaryotic TFBS motifs. The evaluation demonstrates that MITSU improves on a deterministic EM-based motif discovery algorithm and an alternative sEM-based algorithm, in terms of previously established metrics. The ability of the sEM algorithm to escape stable fixed points of the EM algorithm, which trap deterministic motif discovery algorithms, and the ability of MITSU to discover multiple motif occurrences within a single input sequence are also demonstrated.
MITSU is validated using previously characterised Alphaproteobacterial motifs, before being applied to motif discovery in uncharacterised Alphaproteobacterial data. A number of novel results from this analysis are presented and motivate two extensions of MITSU: a strategy for the discovery of multiple different motifs within a single dataset and a higher order Markov background model. The effects of incorporating these extensions within MITSU are evaluated quantitatively using previously characterised prokaryotic TFBS motifs and demonstrated using Alphaproteobacterial motifs. Finally, an information-theoretic measure of motif palindromicity is presented and its advantages over existing approaches for discovering palindromic motifs are discussed.

    Multiple Biological Sequence Alignment: Scoring Functions, Algorithms, and Evaluations

    Aligning multiple biological sequences such as protein sequences or DNA/RNA sequences is a fundamental task in bioinformatics and sequence analysis. These alignments may contain invaluable information that scientists need to predict the sequences' structures, determine the evolutionary relationships between them, or discover drug-like compounds that can bind to the sequences. Unfortunately, multiple sequence alignment (MSA) is NP-Complete. In addition, the lack of a reliable scoring method makes it very hard to align the sequences reliably and to evaluate the alignment outcomes. In this dissertation, we have designed a new scoring method for use in multiple sequence alignment. Our scoring method encapsulates stereo-chemical properties of sequence residues and their substitution probabilities into a tree-structure scoring scheme. This new technique provides a reliable scoring scheme with low computational complexity. In addition to the new scoring scheme, we have designed an overlapping sequence clustering algorithm to use in our three new multiple sequence alignment algorithms. One of our alignment algorithms uses a dynamic weighted guidance tree to perform multiple sequence alignment in progressive fashion. The use of a dynamic weighted tree allows errors in the early alignment stages to be corrected in the subsequent stages. The other two algorithms utilize sequence knowledge-bases and sequence consistency to produce biologically meaningful sequence alignments. To improve the speed of the multiple sequence alignment, we have developed a parallel algorithm that can be deployed on reconfigurable computer models. Analytically, our parallel algorithm is the fastest progressive multiple sequence alignment algorithm.

    Tweets on a tree: Index-based clustering of tweets

    Computer-mediated communication (CMC) is a type of communication that occurs through the use of two or more electronic devices. With the advancement of technology, CMC has become an increasingly preferred type of communication between humans. Through computer-mediated technologies, news portals, search engines and social media platforms such as Facebook, Twitter, Reddit and many others have been created. On social media platforms, a user can post and discuss their own opinion and also read and share other users' opinions. This generates a significant amount of data which, if filtered and analyzed, can give researchers important insights about public opinion and culture. Twitter is a social networking service that was founded in 2006 and became widespread throughout the world in a very short time frame. As of 2016, the service has more than 310 million monthly active users, who generate more than 500 million tweets daily. Due to the volume, velocity and variety of Twitter data, it cannot be analyzed using conventional methods. A clustering or sampling method is necessary to reduce the amount of data for analysis. To cluster documents, in a very broad sense two similarity measures can be used: lexical similarity and semantic similarity. Lexical similarity looks for syntactic similarity between documents. It is usually computationally light to compute; however, for clustering purposes it may not be very accurate, as it disregards the semantic value of words. Semantic similarity, on the other hand, uses the semantic value of words and the relations between them to calculate similarity; while it is generally more accurate than lexical similarity, it is computationally difficult to calculate. In our work we aim to create a computationally light and accurate clustering of short documents which have the characteristics of big data.
We propose a hybrid approach to clustering in which lexical and semantic similarity are combined. In our approach, we use string similarity to create clusters and semantic vector representations of words to interactively merge clusters.
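
To make the lexical half of such a hybrid approach concrete, here is a minimal sketch of threshold-based clustering using Jaccard similarity over word tokens. This is an illustration only: the abstract does not state which string similarity measure or index structure the authors use, and the names `jaccard`, `greedy_lexical_clusters`, and the default `threshold` are assumptions made for this example. The semantic merging step is not shown.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_lexical_clusters(docs, threshold=0.5):
    """Assign each document to the first cluster whose representative
    token set is at least `threshold` similar; otherwise open a new cluster.

    The representative of a cluster is the token set of its first member
    (a simplifying assumption, not the method described in the abstract).
    """
    clusters = []  # list of (representative token set, member documents)
    for doc in docs:
        tokens = set(doc.lower().split())
        for rep, members in clusters:
            if jaccard(tokens, rep) >= threshold:
                members.append(doc)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append((tokens, [doc]))
    return [members for _, members in clusters]
```

A purely lexical pass like this is cheap but would keep paraphrases with no shared words in separate clusters, which is the gap the semantic merging step described in the abstract is meant to close.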

    28th Annual Symposium on Combinatorial Pattern Matching : CPM 2017, July 4-6, 2017, Warsaw, Poland

    Peer reviewed

    Survey of Deoxyribonucleic Acid Motif Finding Algorithms

    An important task in biology is to identify binding sites in DNA for transcription factors. These binding sites are short DNA segments which are called motifs. Given a set of DNA sequences, the motif finding problem is to detect overrepresented motifs that are good candidates for being transcription factor binding sites. The current study is a survey of motif finding algorithms. The study shows that a sensible approach to detecting motifs is to search for statistically overrepresented motifs in the promoter regions of a set of co-regulated genes. The weak point of the available motif finding algorithms is that they tend to be sensitive to noise, i.e., the presence of upstream sequences in the data set that do not contain the motif. We conclude that instead of relying on a single motif finding tool, biologists should use a few complementary tools and pursue the top few predicted motifs of each. Computer Science Department