8 research outputs found

    Data mining in computational proteomics and genomics

    Get PDF
    This dissertation addresses data mining in bioinformatics by investigating two important problems, namely peak detection and structure matching. Peak detection is useful for biological pattern discovery while structure matching finds many applications in clustering and classification. The first part of this dissertation focuses on elastic peak detection in 2D liquid chromatographic mass spectrometry (LC-MS) data used in proteomics research. These data can be modeled as a time series, in which the X-axis represents time points and the Y-axis represents intensity values. A peak occurs in a set of 2D LC-MS data when the sum of the intensity values in a sliding time window exceeds a user-determined threshold. The elastic peak detection problem is to locate all peaks across multiple window sizes of interest in the dataset. A new method, called PeakID, is proposed in this dissertation, which solves the elastic peak detection problem in 2D LC-MS data without yielding any false negative. PeakID employs a novel data structure, called a Shifted Aggregation Tree or AggTree for short, to find the different peaks in the dataset. This method works by first constructing an AggTree in a bottom-up manner from the dataset, and then searching the AggTree for the peaks in a top-down manner. PeakID uses a state-space algorithm to find the topology and structure of an efficient AggTree. Experimental results demonstrate the superiority of the proposed method over other methods on both synthetic and real-world data. The second part of this dissertation focuses on RNA pseudoknot structure matching and alignment. RNA pseudoknot structures play important roles in many genomic processes. Previous methods for comparative pseudoknot analysis mainly focus on simultaneous folding and alignment of RNA sequences. Little work has been done to align two known RNA secondary structures with pseudoknots taking into account both sequence and structure information of the two RNAs. A new method, called RKalign, is proposed in this dissertation for aligning two known RNA secondary structures with pseudoknots. RKalign adopts the partition function methodology to calculate the posterior log-odds scores of the alignments between bases or base pairs of the two RNAs with a dynamic programming algorithm. The posterior log-odds scores are then used to calculate the expected accuracy of an alignment between the RNAs. The goal is to find an optimal alignment with the maximum expected accuracy. RKalign employs a greedy algorithm to achieve this goal. The performance of RKalign is investigated and compared with existing tools for RNA structure alignment. An extension of the proposed method to multiple alignment of pseudoknot structures is also discussed. RKalign is implemented in Java and freely accessible on the Internet. As more and more pseudoknots are revealed, collected and stored in public databases, it is anticipated that a tool like RKalign will play a significant role in data comparison, annotation, analysis, and retrieval in these databases

    Kisses, ambivalent models and more: Contributions to the analysis of RNA secondary structure.

    Get PDF
    Janssen S. Kisses, ambivalent models and more: Contributions to the analysis of RNA secondary structure. Bielefeld: Universitätsbibliothek; 2014.The full functional role of RNA in all domains of life is yet to be explored. Deep sequencing technologies generate massive data about RNA transcripts with functional potential. To decipher this information, bioinformatics methods for structural analysis are in demand. With this thesis at hand, we want to improve current secondary structure prediction in different respects. The introductory chapter explains ADP with a focus on its comfortable, but atypical style of specifying algorithms. Then, we present five contributions to the analysis of RNA secondary structures. 1. It is the nature of models to abstract and simplify reality in order to master its complexity. Chapter 3 is an in depth analysis of four popular computational models of RNA secondary structure (Programs RNAshapes and RNAalishapes). 2. The secondary structure of RNA is too dynamic to be described by a single structure and in turn, there is no single optimal secondary structure. Thus, we compute the most likely abstract shape of a given RNA sequence. Improvements of the algorithms for computing the likelihood of abstract shapes are discussed in Chapter 4, specifically with regards to computational speed (Program RapidShapes). 3. For computational complexity reasons, models of RNA structures commonly exclude crossing base-pairs, the so-called "pseudoknots", from the secondary structure. In Chapter 5, we introduce a heuristic for mastering a frequent type of pseudoknots: "kissing-hairpins" (Program pKiss). 4. In Chapter 6 we revisit the old algorithmic idea of outside-in computation for the new programming framework Bellman’s GAP. This broadens the arsenal of rapid prototyping algorithms for RNA and other sequential problems. It adds "outside" and "MEA" functionality to RNAshapes and RNAalishapes. 5. Covariance Models representing RNA families assume a single consensus secondary structure for a set of related RNAs and serve as statistical tools to search for additional members. In Chapter 7, we evaluate CM scorings that are more structurespecific than the standard sequence-to-model alignments. Furthermore, we introduce a technique to incorporate "ambivalent" consensus structures into covariance models (Program aCMs). The results of this work are available at the Bielefeld Bioinformatic Server. The RNA Studio (http://bibiserv.cebitec.uni-bielefeld.de/rna) supports ready to use web-submissions, web-services and cloud computing for the programs developed in this thesis. debian packages foster a simple way to install our software on your local machine. Developers can benefit from our algorithmic analyses or use our sources for rapid prototyping as a primer for new implementations: http://bibiserv.cebitec.uni-bielefeld.de/fold-grammars

    MapReduce Algorithms for Inferring Gene Regulatory Networks from Time-Series Microarray Data Using an Information-Theoretic Approach

    Full text link
    Gene regulation is a series of processes that control gene expression and its extent. The connections among genes and their regulatory molecules, usually transcription factors, and a descriptive model of such connections, are known as gene regulatory networks (GRNs). Elucidating GRNs is crucial to understand the inner workings of the cell and the complexity of gene interactions. To date, numerous algorithms have been developed to infer gene regulatory networks. However, as the number of identified genes increases and the complexity of their interactions is uncovered, networks and their regulatory mechanisms become cumbersome to test. Furthermore, prodding through experimental results requires an enormous amount of computation, resulting in slow data processing. Therefore, new approaches are needed to expeditiously analyze copious amounts of experimental data resulting from cellular GRNs. To meet this need, cloud computing is promising as reported in the literature. Here we propose new MapReduce algorithms for inferring gene regulatory networks on a Hadoop cluster in a cloud environment. These algorithms employ an information-theoretic approach to infer GRNs using time-series microarray data. Experimental results show that our MapReduce program is much faster than an existing tool while achieving slightly better prediction accuracy than the existing tool.Comment: 19 pages, 5 figure

    A data science approach to pattern discovery in complex structures with applications in bioinformatics

    Get PDF
    Pattern discovery aims to find interesting, non-trivial, implicit, previously unknown and potentially useful patterns in data. This dissertation presents a data science approach for discovering patterns or motifs from complex structures, particularly complex RNA structures. RNA secondary and tertiary structure motifs are very important in biological molecules, which play multiple vital roles in cells. A lot of work has been done on RNA motif annotation. However, pattern discovery in RNA structure is less studied. In the first part of this dissertation, an ab initio algorithm, named DiscoverR, is introduced for pattern discovery in RNA secondary structures. This algorithm works by representing RNA secondary structures as ordered labeled trees and performs tree pattern discovery using a quadratic time dynamic programming algorithm. The algorithm is able to identify and extract the largest common substructures from two RNA molecules of different sizes, without prior knowledge of locations and topologies of these substructures. One application of DiscoverR is to locate the RNA structural elements in genomes. Experimental results show that this tool complements the currently used approaches for mining conserved structural RNAs in the human genome. DiscoverR can also be extended to find repeated regions in an RNA secondary structure. Specifically, this extended method is used to detect structural repeats in the 3\u27-untranslated region of a protein kinase gene

    Revising the evolutionary imprint of RNA structure in mammalian genomes

    Get PDF

    Probabilistic Modelling of Morphologically Rich Languages

    Full text link
    This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex language well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes.Comment: DPhil thesis, University of Oxford, submitted and accepted 2014. http://ora.ox.ac.uk/objects/uuid:8df7324f-d3b8-47a1-8b0b-3a6feb5f45c

    Fuelling the zero-emissions road freight of the future: routing of mobile fuellers

    Get PDF
    The future of zero-emissions road freight is closely tied to the sufficient availability of new and clean fuel options such as electricity and Hydrogen. In goods distribution using Electric Commercial Vehicles (ECVs) and Hydrogen Fuel Cell Vehicles (HFCVs) a major challenge in the transition period would pertain to their limited autonomy and scarce and unevenly distributed refuelling stations. One viable solution to facilitate and speed up the adoption of ECVs/HFCVs by logistics, however, is to get the fuel to the point where it is needed (instead of diverting the route of delivery vehicles to refuelling stations) using "Mobile Fuellers (MFs)". These are mobile battery swapping/recharging vans or mobile Hydrogen fuellers that can travel to a running ECV/HFCV to provide the fuel they require to complete their delivery routes at a rendezvous time and space. In this presentation, new vehicle routing models will be presented for a third party company that provides MF services. In the proposed problem variant, the MF provider company receives routing plans of multiple customer companies and has to design routes for a fleet of capacitated MFs that have to synchronise their routes with the running vehicles to deliver the required amount of fuel on-the-fly. This presentation will discuss and compare several mathematical models based on different business models and collaborative logistics scenarios

    Applied Ecology and Environmental Research 2017

    Get PDF
    corecore