109 research outputs found

    Robust and Efficient Algorithms for Protein 3-D Structure Alignment and Genome Sequence Comparison

    Sequence analysis and structure analysis are two of the fundamental areas of bioinformatics research. This dissertation discusses, specifically, protein structure related problems including protein structure alignment and query, and genome sequence related problems including haplotype reconstruction and genome rearrangement. It first presents an algorithm for pairwise protein structure alignment that is tested with structures from the Protein Data Bank (PDB). In many cases it outperforms two other well-known algorithms, DaliLite and CE. The preliminary algorithm is a graph-theory-based approach, which uses the concept of stars to reduce the complexity of clique-finding algorithms. The algorithm is then improved by introducing double-center stars in the graph and applying a self-learning strategy. The updated algorithm is tested with a much larger set of protein structures and shown to be an improvement in accuracy, especially in cases of weak similarity. A protein structure query algorithm is designed to search for similar structures in the PDB, using the improved alignment algorithm. It is compared with SSM and shows better performance with lower maximum and average Q-score for missing proteins. An interesting problem dealing with the calculation of the diameter of a 3-D sequence of points arose and its connection to sublinear-time computation is discussed. The diameter calculation of a 3-D sequence is approximated by a series of sublinear-time deterministic, zero-error and bounded-error randomized algorithms, and we have obtained a series of separations about the power of sublinear-time computations. This dissertation also discusses two genome sequence related problems. A probabilistic model is proposed for reconstructing haplotypes from SNP matrices with incomplete and inconsistent errors. The experiments with simulated data show both high accuracy and speed, conforming to the theoretically provable efficiency and accuracy of the algorithm.
Finally, a genome rearrangement problem is studied. The concept of non-breaking similarity is introduced. Approximating the exemplar non-breaking similarity to within a factor of n^{1-ε} is proven to be NP-hard. For several practical cases, however, polynomial-time algorithms are presented.

    Gene family-free genome comparison

    Dörr D. Gene family-free genome comparison. Bielefeld: Universität Bielefeld; 2016. Computational comparative genomics offers valuable insights into the shared and individual evolutionary histories of living and extinct species and expands our understanding of cellular processes in living cells. Comparing genomes means identifying differences that originated from mutational modifications in their evolutionary past. In studying genome evolution, one differentiates between point mutations, genome rearrangements, and content modifications. Point mutations affect one or few consecutive nucleotide bases in the DNA sequence, whereas genome rearrangements operate on larger genomic regions, thereby altering the order and composition of genes in chromosomal sequences. Lastly, content modifications are a result of gene family evolution that causes gene duplications and losses. Genome rearrangement studies commonly assume that evolutionary relationships between all pairs of genes are resolved. Based on the biological concept of homology, the set of genes can be partitioned into gene families. All genes in a gene family are homologous, i.e., they evolved from the same ancestral sequence. Homology information is generally not given, hence gene families are commonly predicted computationally on the basis of sequence similarity or higher order features of their gene products. These predictions are often unreliable, leading to errors in subsequent genome rearrangement studies. In an attempt to avoid errors resulting from incorrect or incomplete gene family assignments, we develop new methods for genome rearrangement studies that do not require prior knowledge of gene family assignments of genes. Our approach, called gene family-free genome comparison, is innovative in that we account for differences between genes caused by point mutations while studying their order and composition in chromosomes.
In lieu of gene family assignments, our proposed methods rely on pairwise similarities between genes. In practice, we obtain gene similarities from the conservation of their protein sequences. Two genes that are located next to each other on a chromosome are said to be adjacent; their adjoining extremities form an adjacency. The number of conserved adjacencies, i.e., those adjacencies that are common to two genomes, gives rise to a measure for gene order-based genome similarity. If the gene content of both genomes is identical, the number of conserved adjacencies is the dual measure of the well-known breakpoint distance. We study the problem of computing the number of conserved adjacencies in a family-free setting, which relies on pairwise similarities between genes. We analyze its computational complexity and develop exact and heuristic algorithms for its solution in pairwise comparisons. We then advance to the problem of reconstructing ancestral sequences. Given three genomes, we study the problem of constructing a fourth genome, called the median, which maximizes a family-free, pairwise measure of conserved adjacencies between the median and each of the three given genomes. Our model is a family-free generalization of the well-studied mixed multichromosomal breakpoint median. We show that this problem is NP-hard and devise an exact algorithm for its solution. Gene orders become increasingly scrambled over longer evolutionary periods of time. In distant genomes, gene order analyses based on identifying pairs of conserved adjacencies might no longer be informative. Yet, relaxed constraints of gene order conservation are still able to capture weaker, but nonetheless existing, remnants of common ancestral gene order, which leads to the problem of identifying syntenic blocks in two or more genomes.
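The conserved-adjacency measure mentioned in this abstract can be illustrated with a minimal sketch. It is not the family-free method of the thesis: it assumes the simpler classical setting in which gene families are already known, each gene occurs once, genes are unsigned, and each genome is a single linear chromosome. The function names and gene labels are hypothetical.

```python
def adjacencies(genome):
    """Return the set of unordered pairs of neighbouring genes.

    Assumption: `genome` is one linear chromosome given as a list of
    unique, unsigned gene labels (a simplification of the abstract's
    extremity-based adjacencies).
    """
    return {frozenset(pair) for pair in zip(genome, genome[1:])}


def conserved_adjacencies(genome_a, genome_b):
    """Count the adjacencies that occur in both genomes."""
    return len(adjacencies(genome_a) & adjacencies(genome_b))


# Hypothetical example: genome_b differs from genome_a by swapping g3 and g4.
genome_a = ["g1", "g2", "g3", "g4", "g5"]
genome_b = ["g1", "g2", "g4", "g3", "g5"]

conserved = conserved_adjacencies(genome_a, genome_b)

# With identical gene content, the breakpoint count is the complement:
# total adjacencies of one genome minus the conserved ones.
breakpoints = (len(genome_a) - 1) - conserved
```

Under these assumptions the two quantities always sum to the number of adjacencies in one genome, which is the duality with the breakpoint distance that the abstract refers to.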
Knowing the evolutionary relationships between genes, one can assign a unique character to each gene family and represent a chromosome by a string drawn from the alphabet of gene family characters. Two intervals from two strings are called common intervals if the sets of characters within these intervals are identical. We extend this concept to indeterminate strings, which are a class of strings that have at every position a non-empty set of characters. We propose several models of common intervals in indeterminate strings and devise efficient algorithms for their corresponding discovery problems. Subsequently, we use the concept of common intervals in indeterminate strings to identify syntenic regions in a gene family-free setting. We evaluate all our proposed models and algorithms on simulated or biological datasets and assess their performance and applicability in gene family-free genome analyses.
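The definition of common intervals given above can be made concrete with a brute-force sketch. It covers only ordinary strings, not the indeterminate strings studied in the thesis, and it is not one of the efficient algorithms the abstract mentions: it simply enumerates every interval of both strings. Function names and the example strings are hypothetical.

```python
from collections import defaultdict


def interval_sets(s):
    """Map each character set to the inclusive intervals (i, j) of s that realise it."""
    table = defaultdict(list)
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            table[frozenset(seen)].append((i, j))
    return table


def common_intervals(s, t):
    """Return {character set: (intervals in s, intervals in t)} for sets found in both."""
    in_s, in_t = interval_sets(s), interval_sets(t)
    return {chars: (in_s[chars], in_t[chars]) for chars in in_s.keys() & in_t.keys()}


# Hypothetical chromosomes, one character per gene family.
result = common_intervals("abcd", "cbad")

# Keep only the intervals that span more than one gene family.
non_trivial = {chars for chars in result if len(chars) > 1}
```

Here the set {a, b, c} is realised by positions 0 to 2 in both strings, while {c, d} is contiguous only in the first string and is therefore not a common interval.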

    Massively Parallel Approach to Modeling 3D Objects in Machine Vision

    Electrical Engineering

    Statistical Methods for Image Registration and Denoising

    This dissertation describes research into image processing techniques that enhance military operational and support activities. The research extends existing work on image registration by introducing a novel method that exploits local correlations to improve the performance of projection-based image registration algorithms. The dissertation also extends the bounds on image registration performance for both projection-based and full-frame image registration algorithms and extends the Barankin bound from the one-dimensional case to the problem of two-dimensional image registration. It is demonstrated that in some instances, the Cramér-Rao lower bound is an overly optimistic predictor of image registration performance and that under some conditions, the Barankin bound is a better predictor of shift estimator performance. The research also looks at the related problem of single-frame image denoising using block-based methods. The research introduces three algorithms that operate by identifying regions of interest within a noise-corrupted image and then generating noise-free estimates of the regions as averages of similar regions in the image.
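As a rough illustration of what projection-based registration means, the sketch below collapses each image into its row and column sums and estimates the shift from one-dimensional correlations of those projections. It is a textbook baseline, not the local-correlation method the dissertation introduces, and it assumes integer, cyclic (wrap-around) shifts on small images stored as lists of lists. All names and the test image are hypothetical.

```python
def projections(image):
    """Collapse a 2-D image (list of rows) into row sums and column sums."""
    rows = [sum(row) for row in image]
    cols = [sum(col) for col in zip(*image)]
    return rows, cols


def best_circular_shift(ref, moved):
    """Return the shift k that maximises the circular correlation of two 1-D signals."""
    n = len(ref)

    def score(k):
        return sum(ref[i] * moved[(i + k) % n] for i in range(n))

    return max(range(n), key=score)


def estimate_shift(ref_img, moved_img):
    """Estimate the (row, column) shift from the projections alone."""
    ref_rows, ref_cols = projections(ref_img)
    mov_rows, mov_cols = projections(moved_img)
    return best_circular_shift(ref_rows, mov_rows), best_circular_shift(ref_cols, mov_cols)


# Hypothetical 5 x 6 reference image with two bright pixels.
height, width = 5, 6
ref_img = [[0] * width for _ in range(height)]
ref_img[1][2] = 5
ref_img[3][4] = 2

# Build a copy shifted cyclically by 2 rows and 3 columns.
moved_img = [[0] * width for _ in range(height)]
for i in range(height):
    for j in range(width):
        moved_img[(i + 2) % height][(j + 3) % width] = ref_img[i][j]

shift = estimate_shift(ref_img, moved_img)
```

Working on projections replaces one two-dimensional search with two one-dimensional ones, which is the efficiency argument usually made for this family of methods.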

    Eddy current defect response analysis using sum of Gaussian methods

    This dissertation is a study of methods to automatically detect and approximate eddy current differential coil defect signatures as a sum of Gaussian functions (SoG). Datasets covering varying material, defect size, inspection frequency, and coil diameter were investigated. Dimensionally reduced representations of the defect responses were obtained using common existing reduction methods and novel enhancements to them based on SoG representations. The efficacy of the SoG-enhanced representations was studied using common interpretable machine learning (ML) classifier designs, with the SoG representations showing significant improvement in common analysis metrics.