The Swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code
Tight(er) bounds for similarity measures, smoothed approximation and broadcasting
In this thesis, we prove upper and lower bounds on the complexity of sequence similarity measures, the approximability of geometric problems on realistic inputs, and the performance of randomized broadcasting protocols.
The first part approaches the question of why a number of fundamental polynomial-time problems - specifically, Dynamic Time Warping, Longest Common Subsequence (LCS), and the Levenshtein distance - have resisted decades-long attempts to obtain polynomial improvements over their simple dynamic programming solutions. We prove that any (strongly) subquadratic algorithm for these and related sequence similarity measures would refute the Strong Exponential Time Hypothesis (SETH). Focusing particularly on LCS, we determine a tight running time bound (up to lower order factors and conditional on SETH) when the running time is expressed in terms of all input parameters that have been previously exploited in the extensive literature.
In the second part, we investigate the approximation performance of the popular 2-Opt heuristic for the Traveling Salesperson Problem using the smoothed analysis paradigm. For the Fréchet distance, we design an improved approximation algorithm for the natural input class of c-packed curves, matching a conditional lower bound.
Finally, in the third part we prove tighter performance bounds for processes that disseminate a piece of information, either as quickly as possible (rumor spreading) or as anonymously as possible (cryptogenography).
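For orientation, the "simple dynamic programming solutions" mentioned in the abstract above can be sketched as follows. This is the standard textbook quadratic-time recurrence for LCS length and Levenshtein distance, not code from the thesis; the function names `lcs_length` and `levenshtein` are illustrative choices. The thesis's result is that, conditional on SETH, no strongly subquadratic algorithm can replace loops of this kind.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence, in O(len(a) * len(b)) time."""
    # prev[j] is the LCS length of the already-processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

Both functions keep only two rows of the table, so memory is linear in the length of `b` while time remains quadratic.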
Novel stochastic and entropy-based Expectation-Maximisation algorithm for transcription factor binding site motif discovery
The discovery of transcription factor binding site (TFBS) motifs remains an important and challenging problem in computational biology. This thesis presents MITSU, a novel algorithm for TFBS motif discovery which exploits stochastic methods both as a means of overcoming optimality limitations in current algorithms and as a framework for incorporating relevant prior knowledge in order to improve results.
The current state of the TFBS motif discovery field is surveyed, with a focus on probabilistic algorithms that typically take the promoter regions of coregulated genes as input. A case is made for an approach based on the stochastic Expectation-Maximisation (sEM) algorithm; its position amongst existing probabilistic algorithms for motif discovery is shown. The algorithm developed in this thesis is unique amongst existing motif discovery algorithms in that it combines the sEM algorithm with a derived data set which leads to an improved approximation to the likelihood function. This likelihood function is unconstrained with regard to the distribution of motif occurrences within the input dataset. MITSU also incorporates a novel heuristic to automatically determine TFBS motif width. This heuristic, known as MCOIN, is shown to outperform current methods for determining motif width. MITSU is implemented in Java and an executable is available for download.
MITSU is evaluated quantitatively using realistic synthetic data and several collections of previously characterised prokaryotic TFBS motifs. The evaluation demonstrates that MITSU improves on a deterministic EM-based motif discovery algorithm and an alternative sEM-based algorithm, in terms of previously established metrics. The ability of the sEM algorithm to escape stable fixed points of the EM algorithm, which trap deterministic motif discovery algorithms, and the ability of MITSU to discover multiple motif occurrences within a single input sequence are also demonstrated.
MITSU is validated using previously characterised Alphaproteobacterial motifs, before being applied to motif discovery in uncharacterised Alphaproteobacterial data. A number of novel results from this analysis are presented and motivate two extensions of MITSU: a strategy for the discovery of multiple different motifs within a single dataset and a higher order Markov background model. The effects of incorporating these extensions within MITSU are evaluated quantitatively using previously characterised prokaryotic TFBS motifs and demonstrated using Alphaproteobacterial motifs. Finally, an information-theoretic measure of motif palindromicity is presented and its advantages over existing approaches for discovering palindromic motifs are discussed.
Identifying Repeat Domains in Large Genomes
We present a graph-based method for the analysis of repeat families in a repeat library. We build a repeat domain graph that decomposes a repeat library into repeat domains, short subsequences shared by multiple repeat families, and reveals the mosaic structure of repeat families. Our method recovers documented mosaic repeat structures and suggests additional putative ones. Our method is useful for elucidating the evolutionary history of repeats and annotating de novo generated repeat libraries.
Multiple Biological Sequence Alignment: Scoring Functions, Algorithms, and Evaluations
Aligning multiple biological sequences such as protein sequences or DNA/RNA sequences is a fundamental task in bioinformatics and sequence analysis. These alignments may contain invaluable information that scientists need to predict the sequences' structures, determine the evolutionary relationships between them, or discover drug-like compounds that can bind to the sequences. Unfortunately, multiple sequence alignment (MSA) is NP-Complete. In addition, the lack of a reliable scoring method makes it very hard to align the sequences reliably and to evaluate the alignment outcomes.
In this dissertation, we have designed a new scoring method for use in multiple sequence alignment. Our scoring method encapsulates stereo-chemical properties of sequence residues and their substitution probabilities into a tree-structure scoring scheme. This new technique provides a reliable scoring scheme with low computational complexity.
In addition to the new scoring scheme, we have designed an overlapping sequence clustering algorithm for use in our three new multiple sequence alignment algorithms. One of our alignment algorithms uses a dynamic weighted guidance tree to perform multiple sequence alignment in progressive fashion. The use of a dynamic weighted tree allows errors in the early alignment stages to be corrected in the subsequent stages. The other two algorithms utilize sequence knowledge-bases and sequence consistency to produce biologically meaningful sequence alignments. To improve the speed of the multiple sequence alignment, we have developed a parallel algorithm that can be deployed on reconfigurable computer models. Analytically, our parallel algorithm is the fastest progressive multiple sequence alignment algorithm.
Tweets on a tree: Index-based clustering of tweets
Computer-mediated communication (CMC) is communication that occurs through the use of two or more electronic devices. With the advancement of technology, CMC has become an increasingly preferred form of communication between humans. Computer-mediated technologies have given rise to news portals, search engines, and social media platforms such as Facebook, Twitter, and Reddit. On social media platforms, users can post and discuss their own opinions and read and share other users' opinions. This generates a significant amount of data which, if filtered and analyzed, can give researchers important insights about public opinion and culture. Twitter is a social networking service that was founded in 2006 and became widespread throughout the world in a very short time. As of 2016, the service has more than 310 million monthly active users, who generate more than 500 million tweets daily. Due to the volume, velocity and variety of Twitter data, it cannot be analyzed using conventional methods. A clustering or sampling method is necessary to reduce the amount of data for analysis. To cluster documents, in a very broad sense two similarity measures can be used: lexical similarity and semantic similarity. Lexical similarity looks for syntactic similarity between documents. It is usually computationally light to compute; however, for clustering purposes it may not be very accurate, as it disregards the semantic value of words. Semantic similarity, on the other hand, uses the semantic value of and relations between words to calculate similarity; while it is generally more accurate than lexical similarity, it is computationally difficult to calculate. In our work we aim to create a computationally light and accurate clustering of short documents which have the characteristics of big data.
We propose a hybrid clustering approach in which lexical and semantic similarity are combined. In our approach, we use string similarity to create clusters and semantic vector representations of words to interactively merge clusters.
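As a rough illustration of the lexical half of such an approach, the sketch below groups short documents by Jaccard similarity of their word sets. It is a simple greedy scheme written for this listing, not the index-based method of the thesis; the function names and the `threshold` value of 0.5 are assumptions, and the semantic merging step is not shown.

```python
def tokens(text):
    """Lower-cased set of whitespace-separated words in a document."""
    return frozenset(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_lexical_clusters(docs, threshold=0.5):
    """Assign each document to the first cluster whose representative
    (the cluster's first document) is at least `threshold` similar;
    otherwise start a new cluster. Returns lists of document indices."""
    clusters = []
    representatives = []
    for i, doc in enumerate(docs):
        toks = tokens(doc)
        for c, rep in enumerate(representatives):
            if jaccard(toks, rep) >= threshold:
                clusters[c].append(i)
                break
        else:
            # No existing cluster was similar enough.
            representatives.append(toks)
            clusters.append([i])
    return clusters
```

Comparing each document against every representative is quadratic in the worst case, which is the kind of cost an index over the strings would be expected to reduce.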
28th Annual Symposium on Combinatorial Pattern Matching : CPM 2017, July 4-6, 2017, Warsaw, Poland
Peer reviewed
Survey of Deoxyribonucleic Acid Motif Finding Algorithms
An important task in biology is to identify binding sites in DNA for transcription factors. These binding sites are short DNA segments called motifs. Given a set of DNA sequences, the motif finding problem is to detect overrepresented motifs that are good candidates for being transcription factor binding sites. The current study is a survey of motif finding algorithms. The study shows that a sensible approach to detecting motifs is to search for statistically overrepresented motifs in the promoter regions of a set of co-regulated genes. The weak point of the available motif finding algorithms is that they tend to be sensitive to noise, i.e., the presence in the data set of upstream sequences that do not contain the motif. We conclude that instead of relying on a single motif finding tool, biologists should use a few complementary tools and pursue the top few predicted motifs of each.
Computer Science Department
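The idea of looking for overrepresented motifs across a set of sequences can be illustrated with a deliberately naive sketch that counts, for each exact k-mer, how many input sequences contain it. This is not any of the surveyed tools: it uses no background model, no statistical test, and no tolerance for mismatches, and the function names and example sequences are made up for illustration.

```python
from collections import Counter


def kmer_support(sequences, k):
    """For each k-mer, count how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A set ensures each k-mer is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(seen)
    return support


def top_candidates(sequences, k, n=3):
    """The n k-mers present in the largest number of sequences."""
    return kmer_support(sequences, k).most_common(n)
```

A sequence that lacks the motif simply contributes nothing to that k-mer's count, which shows on a small scale how noise sequences dilute the signal.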