
    Computing longest common square subsequences

    A square is a non-empty string of the form YY. The longest common square subsequence (LCSqS) problem is to compute a longest square occurring as a subsequence in two given strings A and B. We show that the problem can easily be solved in O(n^6) time, or in O(|M| n^4) time with O(n^4) space, where n is the length of the strings and M is the set of matching points between A and B. Then, we show that the problem can also be solved in O(sigma |M|^3 + n) time and O(|M|^2 + n) space, or in O(|M|^3 log^2 n log log n + n) time with O(|M|^3 + n) space, where sigma is the number of distinct characters occurring in A and B. We also study lower bounds for the LCSqS problem for two or more strings.
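
    As an illustration only (not the paper's own algorithms), one natural route to the O(n^6) bound above is: a square subsequence YY occurs in A exactly when Y is a common subsequence of the two halves of some split A = A1 A2, and likewise for B, so it suffices to compute a longest common subsequence of the four halves over all O(n^2) split pairs. A minimal Python sketch of that reduction, with function names of my own choosing:

```python
from itertools import product

def lcs4(s1, s2, s3, s4):
    """Length of the longest common subsequence of four strings (4-D DP)."""
    n1, n2, n3, n4 = len(s1), len(s2), len(s3), len(s4)
    # dp[i][j][k][l] = LCS length of s1[:i], s2[:j], s3[:k], s4[:l]
    dp = [[[[0] * (n4 + 1) for _ in range(n3 + 1)]
           for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    for i, j, k, l in product(range(1, n1 + 1), range(1, n2 + 1),
                              range(1, n3 + 1), range(1, n4 + 1)):
        if s1[i - 1] == s2[j - 1] == s3[k - 1] == s4[l - 1]:
            dp[i][j][k][l] = dp[i - 1][j - 1][k - 1][l - 1] + 1
        else:
            dp[i][j][k][l] = max(dp[i - 1][j][k][l], dp[i][j - 1][k][l],
                                 dp[i][j][k - 1][l], dp[i][j][k][l - 1])
    return dp[n1][n2][n3][n4]

def lcsqs_length(A, B):
    """LCSqS length via the O(n^6) reduction: try every split of A and B
    and take an LCS of the four resulting halves."""
    best = 0
    for i in range(1, len(A)):
        for j in range(1, len(B)):
            best = max(best, lcs4(A[:i], A[i:], B[:j], B[j:]))
    return 2 * best

if __name__ == "__main__":
    print(lcsqs_length("abcabc", "acbacb"))  # 4, e.g. the square "abab"
```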

    Eddy current defect response analysis using sum of Gaussian methods

    This dissertation is a study of methods to automatically detect and approximate eddy current differential coil defect signatures in terms of a summed collection of Gaussian functions (SoG). Datasets consisting of varying materials, defect sizes, inspection frequencies, and coil diameters were investigated. Dimensionally reduced representations of the defect responses were obtained using common existing reduction methods and novel SoG-based enhancements to them. The efficacy of the SoG-enhanced representations was studied using common interpretable machine learning (ML) classifier designs, with the SoG representations showing significant improvement on common analysis metrics.
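
    For readers unfamiliar with the SoG idea, the sketch below fits a summed collection of Gaussians to a synthetic 1-D defect-like signal with SciPy's curve_fit; the signal shape, component count, and initial guesses are invented for illustration and are not taken from the dissertation:

```python
import numpy as np
from scipy.optimize import curve_fit

def sum_of_gaussians(x, *params):
    """Evaluate a sum of Gaussians; params = (a1, mu1, s1, a2, mu2, s2, ...)."""
    y = np.zeros_like(x, dtype=float)
    for a, mu, s in zip(params[0::3], params[1::3], params[2::3]):
        y += a * np.exp(-0.5 * ((x - mu) / s) ** 2)
    return y

# Hypothetical defect signature: two overlapping lobes plus noise.
x = np.linspace(-1.0, 1.0, 400)
truth = sum_of_gaussians(x, 1.0, -0.2, 0.1, -0.8, 0.25, 0.15)
y = truth + 0.02 * np.random.default_rng(0).normal(size=x.size)

# Initial guesses (amplitude, centre, width) for each of the two components.
p0 = [0.8, -0.3, 0.2, -0.5, 0.3, 0.2]
popt, _ = curve_fit(sum_of_gaussians, x, y, p0=p0)
print(np.round(popt, 3))  # recovered SoG parameters
```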

    Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes

    About half of the human genome consists of transposable elements (TEs), sequences that have many copies of themselves distributed throughout the genome. All genomes, from bacterial to human, contain TEs. TEs affect genome function by either creating proteins directly or affecting genome regulation. They serve as molecular fossils, giving clues to the evolutionary history of the organism. TEs are often challenging to identify because they are fragmentary or heavily mutated. In this thesis, novel features for the detection and study of TEs are developed. These features are of two types. The first type consists of statistical features based on the Fourier transform that assess reading frame use. These features measure how different the reading frame use is from that of a random sequence, which reading frames the sequence is using, and the proportion of use of the active reading frames. The second type of feature, called side effect machine (SEM) features, is generated by finite state machines augmented with counters that track the number of times each state is visited. These counters then become features of the sequence. The number of possible SEM features is super-exponential in the number of states. New methods for selecting useful feature subsets, incorporating a genetic algorithm and a novel clustering method, are introduced. The features produced reveal structural characteristics of the sequences of potential interest to biologists. A detailed analysis of the genetic algorithm, its fitness functions, and its fitness landscapes is performed. The features are used, together with features used in existing exon-finding algorithms, to build classifiers that distinguish TEs from other genomic sequences in humans, fruit flies, and ciliates. The classifiers achieve high accuracy (> 85%) on a variety of TE classification problems. The classifiers are used to scan large genomes for TEs. In addition, the features are used to describe the TEs in the newly sequenced ciliate Tetrahymena thermophila, providing biologists with information useful for forming experimentally testable hypotheses about the role of these TEs and the mechanisms that govern them.
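
    The side effect machine features described above can be illustrated with a toy sketch: a finite state machine over the DNA alphabet whose normalized per-state visit counts become features of the sequence. The transition table below is arbitrary and purely illustrative, not one of the thesis's evolved machines:

```python
class SideEffectMachine:
    """Finite state machine over the DNA alphabet whose only side effect is
    counting how often each state is visited; the normalized visit counts
    become features of the input sequence."""

    def __init__(self, transitions, start=0):
        # transitions[state] maps a base ('A', 'C', 'G', 'T') to the next state
        self.transitions = transitions
        self.start = start

    def features(self, seq):
        counts = [0] * len(self.transitions)
        state = self.start
        for base in seq.upper():
            state = self.transitions[state].get(base, state)  # unknown bases stay put
            counts[state] += 1
        total = max(1, len(seq))
        return [c / total for c in counts]  # one feature per state

# A small, arbitrary 3-state machine used purely for illustration.
sem = SideEffectMachine([
    {"A": 0, "C": 1, "G": 2, "T": 1},
    {"A": 2, "C": 1, "G": 0, "T": 0},
    {"A": 1, "C": 0, "G": 2, "T": 2},
])
print(sem.features("ACGTACGTGGTTAC"))
```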

    Data Structures for Efficient String Algorithms

    This thesis deals with data structures that are mostly useful in the area of string matching and string mining. Our main result is an O(n)-time preprocessing scheme for an array of n numbers such that subsequent queries asking for the position of a minimum element in a specified interval can be answered in constant time (so-called RMQs, for Range Minimum Queries). The space for this data structure is 2n + o(n) bits, which is shown to be asymptotically optimal in a general setting. This improves all previous results on this problem. The main techniques for deriving this result rely on combinatorial properties of arrays and so-called Cartesian Trees. For compressible input arrays we show that further space can be saved, while not affecting the time bounds. For the two-dimensional variant of the RMQ problem we give a preprocessing scheme with quasi-optimal time bounds, but with an asymptotic increase in space consumption by a factor of log(n). It is well known that algorithms for answering RMQs in constant time are useful for many different algorithmic tasks (e.g., the computation of lowest common ancestors in trees); in the second part of this thesis we give several new applications of the RMQ problem. We show that our preprocessing scheme for RMQ (and a variant thereof) leads to improvements in the space and time consumption of the Enhanced Suffix Array, a collection of arrays that can be used for many tasks in pattern matching. In particular, we will see that in conjunction with the suffix array and LCP array, 2n + o(n) bits of additional space (coming from our RMQ scheme) are sufficient to find all occ occurrences of a (usually short) pattern of length m in a (usually long) text of length n in O(m*s + occ) time, where s denotes the size of the alphabet. This is certainly optimal if the size of the alphabet is constant; for non-constant alphabets we can improve this to O(m*log(s) + occ) locating time, replacing our original scheme with a data structure of size approximately 2.54n bits. Again by using RMQs, we then show how to solve frequency-related string mining tasks in optimal time. In a final chapter we propose a space- and time-optimal algorithm for computing suffix arrays on texts that are logically divided into words, if one is just interested in finding all word-aligned occurrences of a pattern. Apart from the theoretical improvements made in this thesis, most of our algorithms are also of practical value; we underline this fact by empirical tests and comparisons on real-world problem instances. In most cases our algorithms outperform previous approaches in all respects.
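
    The 2n + o(n)-bit constant-time RMQ scheme summarized above is involved; purely as a point of reference, the sketch below is the textbook sparse-table baseline that such schemes improve upon (O(n log n) words of space, O(1) query), not the thesis's data structure:

```python
class SparseTableRMQ:
    """Sparse-table range-minimum queries: query(l, r) returns the position
    of a minimum of A[l..r] (inclusive) in constant time."""

    def __init__(self, A):
        self.A = A
        n = len(A)
        # table[k][i] = index of a minimum of A[i .. i + 2^k - 1]
        self.table = [list(range(n))]
        k = 1
        while (1 << k) <= n:
            prev = self.table[k - 1]
            half = 1 << (k - 1)
            row = []
            for i in range(n - (1 << k) + 1):
                a, b = prev[i], prev[i + half]
                row.append(a if A[a] <= A[b] else b)
            self.table.append(row)
            k += 1

    def query(self, l, r):
        # Cover [l, r] with two (possibly overlapping) blocks of length 2^k.
        k = (r - l + 1).bit_length() - 1
        a = self.table[k][l]
        b = self.table[k][r - (1 << k) + 1]
        return a if self.A[a] <= self.A[b] else b

A = [3, 1, 4, 1, 5, 9, 2, 6]
rmq = SparseTableRMQ(A)
print(rmq.query(2, 7))  # 3, since A[3] = 1 is a minimum of A[2..7]
```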

    Statistics and Evolution of Functional Genomic Sequence

    In this thesis, three separate problems of genomics are addressed, utilizing methods related to the field of statistical mechanics. The goal of the project discussed in the first chapter is the elucidation of post-transcriptional gene regulation imposed by microRNAs, a recently discovered class of tiny non-coding RNAs. A probabilistic algorithm for the computational identification of genes regulated by microRNAs is introduced, which was developed based on experimental data and statistical analysis of whole-genome data. In particular, the application of this algorithm to multiple alignments of groups of related species allows for the specific and sensitive detection of genes targeted by microRNAs on a genome-wide level. Examination of clade-specific predictions and cross-clade comparison yields deeper insights into microRNA biology and first clues about the long-term evolution of microRNA regulation, which are discussed in detail. Modeling the evolutionary dynamics of microsatellites, an abundant class of repetitive sequence in eukaryotic genomes, was the objective of the second project and is discussed in chapter two. Inspired by the putative functionality of some of these elements and the difficulty of constructing correct sequence alignments that reflect the evolutionary relationships between microsatellites, a neutral model for microsatellite evolution is developed and tested in the fruit fly Drosophila melanogaster by comparing evolutionary rates predicted by the model to independent measurements of these rates from multiple alignments of three closely related Drosophila species. The model is applied separately to genomic sequence categories of different functional annotations in order to assess the varying influence of selective constraint among these categories. In the last chapter, a general population genetic model is introduced that allows for the determination of transcription factor binding site stability as a function of selection strength, mutation rate and effective population size at arbitrary values of these parameters. The analytical solution of this model yields the probability that a binding site is functional. The model is used to compute the population fraction of functional binding sites at fixed selection pressure across a variety of different taxa. The results lead to the conclusion that a decreasing effective population size, such as observed at the evolutionary transition from prokaryotes to eukaryotes, could result in a loss of binding site stability. An extension of our model is used to assess the compensatory effect of the emergence of multiple binding sites for the same transcription factor in maintaining the existing regulatory relationship.
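
    The abstract does not spell out the binding-site model of the last chapter; as a loosely related illustration of the quantities it names (selection strength, mutation rate, effective population size), the sketch below combines Kimura's fixation probability with a detailed-balance argument to estimate the fraction of time a site stays functional. The model form and all parameter values are assumptions of mine, not the thesis's model:

```python
import math

def p_fix(s, N):
    """Kimura's fixation probability of a new mutation with selection
    coefficient s in a haploid population of effective size N."""
    if abs(s) < 1e-12:
        return 1.0 / N            # neutral limit
    x = 2.0 * N * s
    if x < -700.0:                # strongly deleterious: fixation is negligible
        return 0.0
    return -math.expm1(-2.0 * s) / -math.expm1(-x)

def equilibrium_functional_fraction(s, N, u_gain, u_loss):
    """Detailed-balance estimate of the long-run fraction of time a binding
    site is functional: substitution flux gain = flux loss at stationarity."""
    rate_gain = u_gain * N * p_fix(+s, N)   # non-functional -> functional
    rate_loss = u_loss * N * p_fix(-s, N)   # functional -> non-functional
    return rate_gain / (rate_gain + rate_loss)

# Illustration: larger N*s keeps the site functional; small populations lose it.
for N in (10**3, 10**4, 10**5):
    print(N, round(equilibrium_functional_fraction(1e-4, N, 1e-9, 1e-8), 3))
```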

    Classifying and Generating Repetitive Elements in the Genome Using Deep Learning

    Repetitive elements are sequence patterns in the genome that are duplicated in large quantities. They serve important functions both in genomic preservation and in evolution, leading to the need for their fast and accurate classification. The current gold standard for repeat identification establishes correspondences between a well-annotated library of repetitive elements and a given query sequence. However, annotation quality is highly variable across species. Therefore, for genomes whose repeats are poorly annotated, de novo methods must be used. A common approach of de novo methods is to first check the sequence for protein domain conservation. The presence and order of these protein domains are used as features for an expert-crafted rule-based system or an optimized machine learning classifier. Although de novo approaches have achieved modest success, two problems remain. Firstly, they require lengthy consensus sequences, which take time to assemble and may not be representative of the true diversity of repetitive elements in the sample. Secondly, these approaches are heavily reliant on hand-picking a comprehensive set of protein domains, which may need to be constantly adjusted as new repetitive elements are discovered. In this thesis I show that deep learning models are competitive with pattern-matching-based approaches at the level of a shotgun sequencing strand for de novo classification of repeat elements. I also explore ways of embedding sequences using deep learning models. Finally, I make these tools available through a web-based interface.
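
    As a hypothetical illustration of a deep model that classifies short shotgun sequencing fragments directly (a toy architecture of my own, not the one developed in this thesis), a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(read, length=150):
    """Map a read to a fixed-length tensor of base indices (zero-padded)."""
    idx = [BASES.get(b, 0) for b in read.upper()[:length]]
    idx += [0] * (length - len(idx))
    return torch.tensor(idx, dtype=torch.long)

class RepeatCNN(nn.Module):
    """Tiny 1-D CNN over raw reads: embed bases, convolve, pool, classify."""

    def __init__(self, n_classes=2, emb=8, channels=32, k=9):
        super().__init__()
        self.emb = nn.Embedding(4, emb)
        self.conv = nn.Conv1d(emb, channels, kernel_size=k, padding=k // 2)
        self.head = nn.Linear(channels, n_classes)

    def forward(self, x):                 # x: (batch, length) of base indices
        h = self.emb(x).transpose(1, 2)   # -> (batch, emb, length)
        h = torch.relu(self.conv(h))      # -> (batch, channels, length)
        h = h.max(dim=2).values           # global max-pool over positions
        return self.head(h)               # class logits

model = RepeatCNN()
batch = torch.stack([encode("ACGTACGTTTAGGG"), encode("GGGCCCATATAT")])
print(model(batch).shape)  # torch.Size([2, 2])
```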

    Techniques To Facilitate the Understanding of Inter-process Communication Traces

    High Performance Computing (HPC) systems play an important role in today’s heavily digitized world, which is in constant demand for higher speed of calculation and performance. HPC applications are used in multiple domains such as telecommunication, health, scientific research, and more. With the emergence of multi-core and cloud computing platforms, the HPC paradigm is quickly becoming the design of choice of many service providers. HPC systems are also known to be complex to debug and analyze due to the large number of processes they involve and the way these processes communicate with each other to perform specific tasks. As a result, software engineers must spend an extensive amount of time understanding the complex interactions among a system’s processes. This is usually done through the analysis of execution traces generated from running the system at hand. Traces, however, are very difficult to work with due to their overwhelming size. The objective of this research is to present a set of techniques that facilitates the understanding of the behaviour of HPC applications through the analysis of system traces. The first technique consists of building an exchange format called MTF (MPI Trace Format) for representing and exchanging traces generated from HPC applications based on the MPI (Message Passing Interface) standard, which is a de facto standard for inter-process communication in high performance computing systems. The design of MTF is validated against well-known requirements for a standard exchange format. The second technique aims to facilitate the understanding of large traces of inter-process communication by automatically extracting communication patterns that characterize their main behaviour. Two algorithms are presented: the first permits the recognition of repeating patterns in traces of MPI applications, whereas the second determines whether a given communication pattern occurs in a trace. Both algorithms are based on the n-gram extraction technique used in natural language processing. Finally, we developed a technique to abstract MPI traces by detecting the different execution phases in a program based on concepts from information theory. Using this approach, software engineers can examine the trace as a sequence of high-level computational phases instead of a mere flow of low-level events. The techniques presented in this thesis have been tested on traces generated from real HPC programs. The results from several case studies demonstrate the usefulness and effectiveness of our techniques.
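
    The n-gram idea behind the two pattern algorithms can be sketched generically as follows; this is an illustration of n-gram extraction over a trace, not the thesis's algorithms, and the event labels are invented:

```python
from collections import Counter

def repeated_ngrams(events, n_min=2, n_max=6, min_count=2):
    """Find repeating windows of trace events (candidate communication
    patterns) by counting n-grams, as in NLP n-gram extraction."""
    found = {}
    for n in range(n_min, n_max + 1):
        grams = Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))
        found.update({g: c for g, c in grams.items() if c >= min_count})
    return found

def contains_pattern(events, pattern):
    """Check whether a given communication pattern occurs contiguously in the trace."""
    p = tuple(pattern)
    return any(tuple(events[i:i + len(p)]) == p for i in range(len(events) - len(p) + 1))

# Invented toy trace of MPI-style events.
trace = ["MPI_Send(0->1)", "MPI_Recv(1<-0)", "MPI_Bcast",
         "MPI_Send(0->1)", "MPI_Recv(1<-0)", "MPI_Bcast", "MPI_Barrier"]
print(repeated_ngrams(trace))
print(contains_pattern(trace, ["MPI_Send(0->1)", "MPI_Recv(1<-0)"]))  # True
```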