    Sequential Pattern Mining Aide to Bio-Informatics

    Practical bioinformatics is the study of the development, testing, and novel application of statistical and computational techniques for modeling and analyzing all types of scientific data, together with related areas of information technology and science. Bioinformatics conceptualizes biology in terms of molecules and applies informatics methods, derived from computer science and applied mathematics, to understand and organize the information associated with these molecules on a large scale for use in future research. It also covers emerging applications in biology, chemistry, pharmaceuticals, medicine, agriculture, and other fields of research and development. Many pharmaceutical manufacturers are interested in mining sequential patterns from their databases. Sequential pattern mining is a useful data mining technique that recognizes temporal relationships between different drugs and can help in estimating the course of treatment for patients. These studies improve the understanding of the sequential pattern mining approach, and bioinformatics plays a vital role in biomedical research through the storage of patients' case reports, which are useful in providing treatment to other patients.
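
    As a rough illustration of what mining temporal relationships between drugs can look like, the sketch below counts how many patient histories contain one drug before another and keeps the ordered pairs that meet a support threshold. It is a deliberately small stand-in for a full sequential pattern mining algorithm, not the method of the paper. The function name, the drug labels, and the threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_ordered_pairs(sequences, min_support):
    """Return ordered pairs (a, b), with a occurring before b,
    that appear in at least `min_support` sequences."""
    support = Counter()
    for seq in sequences:
        # Collect each ordered pair at most once per sequence.
        pairs_in_seq = set()
        for i, j in combinations(range(len(seq)), 2):
            if seq[i] != seq[j]:
                pairs_in_seq.add((seq[i], seq[j]))
        support.update(pairs_in_seq)
    return {pair: n for pair, n in support.items() if n >= min_support}


# Hypothetical prescription histories, one list per patient, in time order.
histories = [
    ["drug_a", "drug_b", "drug_c"],
    ["drug_a", "drug_c"],
    ["drug_b", "drug_a", "drug_c"],
]

patterns = frequent_ordered_pairs(histories, min_support=3)
print(patterns)
```

    Here only the pair (drug_a, drug_c) occurs in that order in all three histories. Longer patterns would need a proper algorithm such as a prefix-projection or Apriori-style search.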

    University Authors, 2018

    Big data analytics in computational biology and bioinformatics

    Big data analytics in computational biology and bioinformatics refers to an array of operations including biological pattern discovery, classification, prediction, inference, clustering as well as data mining in the cloud, among others. This dissertation addresses big data analytics by investigating two important operations, namely pattern discovery and network inference. The dissertation starts by focusing on biological pattern discovery at a genomic scale. Research reveals that the secondary structure in non-coding RNA (ncRNA) is more conserved during evolution than its primary nucleotide sequence. Using a covariance model approach, the stems and loops of an ncRNA secondary structure are represented as a statistical image against which an entire genome can be efficiently scanned for matching patterns. The covariance model approach is then further extended, in combination with a structural clustering algorithm and a random forests classifier, to perform genome-wide search for similarities in ncRNA tertiary structures. The dissertation then presents methods for gene network inference. Vast bodies of genomic data containing gene and protein expression patterns are now available for analysis. One challenge is to apply efficient methodologies to uncover more knowledge about the cellular functions. Very little is known concerning how genes regulate cellular activities. A gene regulatory network (GRN) can be represented by a directed graph in which each node is a gene and each edge or link is a regulatory effect that one gene has on another gene. By evaluating gene expression patterns, researchers perform in silico data analyses in systems biology, in particular GRN inference, where the “reverse engineering” is involved in predicting how a system works by looking at the system output alone. Many algorithmic and statistical approaches have been developed to computationally reverse engineer biological systems. 
    However, there are no known bioinformatics tools capable of performing perfect GRN inference. Here, extensive experiments are conducted to evaluate and compare recent bioinformatics tools for inferring GRNs from time-series gene expression data. Standard performance metrics for these tools based on both simulated and real data sets are generally low, suggesting that further efforts are needed to develop more reliable GRN inference tools. It is also observed that using multiple tools together can help identify true regulatory interactions between genes, a finding consistent with those reported in the literature. Finally, the dissertation discusses and presents a framework for parallelizing GRN inference methods using Apache Hadoop in a cloud environment.
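
    The directed-graph view of a GRN and the idea of combining several tools can be sketched as follows. The abstract does not say how predictions are combined, so the simple majority-style vote below is an assumption, and the tool names, gene names, and `min_votes` threshold are hypothetical.

```python
from collections import Counter


def consensus_edges(predictions, min_votes):
    """Keep directed edges (regulator, target) predicted by at least
    `min_votes` of the tools in `predictions`."""
    votes = Counter()
    for edges in predictions.values():
        votes.update(set(edges))  # each tool votes once per edge
    return {edge for edge, n in votes.items() if n >= min_votes}


def to_adjacency(edges):
    """Represent the network as regulator -> sorted list of target genes."""
    adjacency = {}
    for regulator, target in sorted(edges):
        adjacency.setdefault(regulator, []).append(target)
    return adjacency


# Hypothetical outputs of three inference tools on the same data set.
predictions = {
    "tool_1": [("g1", "g2"), ("g2", "g3")],
    "tool_2": [("g1", "g2"), ("g3", "g1")],
    "tool_3": [("g1", "g2"), ("g2", "g3"), ("g4", "g1")],
}

network = to_adjacency(consensus_edges(predictions, min_votes=2))
print(network)
```

    Edges reported by only one tool are dropped, which trades some recall for precision.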

    The Many Qualities of a New Directly Accessible Compression Scheme

    We present a new variable-length computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcesibility), that supports direct and fast accessibility to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions in the literature. The SFDC scheme provides a flexible and simple representation geared towards either practical efficiency or compression ratios, as required. For a text of length $n$ over an alphabet of size $\sigma$ and a fixed parameter $\lambda$, the access time of the proposed encoding is proportional to the length of the character's code-word, plus an expected $\mathcal{O}((F_{\sigma - \lambda + 3} - 3)/F_{\sigma+1})$ overhead, where $F_j$ is the $j$-th number of the Fibonacci sequence. Overall it uses $N + \mathcal{O}\big(n(\lambda - (F_{\sigma+3}-3)/F_{\sigma+1})\big) = N + \mathcal{O}(n)$ bits, where $N$ is the length of the encoded string. Experimental results show that the performance of our scheme is, in some respects, comparable with the performance of DACs and Wavelet Trees, which are among the most efficient schemes. In addition, our scheme is configured as a computation-friendly compression scheme, as it has several features that make it very effective in text processing tasks. In the string matching problem, which we take as a case study, we show experimentally that the new scheme enables results that are up to 29 times faster than standard string-matching techniques on plain texts.
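
    To get a feel for the Fibonacci ratio inside the expected-overhead bound, the sketch below evaluates $(F_{\sigma-\lambda+3} - 3)/F_{\sigma+1}$ exactly for given $\sigma$ and $\lambda$. The abstract does not state an indexing convention, so $F_1 = F_2 = 1$ is assumed here. The function names are hypothetical, and the sketch only evaluates the expression; it does not implement SFDC.

```python
from fractions import Fraction


def fib(j):
    """j-th Fibonacci number, assuming the convention F_1 = F_2 = 1."""
    a, b = 0, 1  # F_0, F_1
    for _ in range(j):
        a, b = b, a + b
    return a


def overhead_term(sigma, lam):
    """Exact value of (F_{sigma - lam + 3} - 3) / F_{sigma + 1}."""
    return Fraction(fib(sigma - lam + 3) - 3, fib(sigma + 1))


# Example values of sigma and lambda chosen only for illustration.
for lam in (2, 3, 4):
    print(lam, float(overhead_term(8, lam)))
```

    Because consecutive Fibonacci numbers grow by roughly the golden ratio, the term shrinks geometrically as $\lambda$ increases for a fixed $\sigma$.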

    Randomized and Deterministic Parameterized Algorithms and Their Applications in Bioinformatics

    Parameterized NP-hard problems are NP-hard problems that are associated with special variables called parameters. One example is the problem of finding simple paths of length k in a graph, where the integer k is the parameter. We call this problem the p-path problem. The p-path problem is the parameterized version of the well-known NP-complete longest simple path problem. There are two main reasons why we study parameterized NP-hard problems. First, many application problems are naturally associated with certain parameters, so we need to solve these parameterized NP-hard problems. Second, if parameters take only small values, we can take advantage of these parameters to design very effective algorithms. If a parameterized NP-hard problem can be solved by an algorithm with running time of the form $f(k)n^{O(1)}$, where k is the parameter, f(k) is independent of n, and n is the input size of the problem instance, we say that this parameterized NP-hard problem is fixed-parameter tractable (FPT). If a problem is FPT and the parameter takes only small values, the problem can be solved efficiently (it can be solved almost in polynomial time). In this dissertation, we first introduce several techniques that can be used to design efficient algorithms for parameterized NP-hard problems. These techniques include branch and bound, divide and conquer, color coding and dynamic programming, iterative compression, iterative expansion and kernelization. Then we present our results on how to use these techniques to solve parameterized NP-hard problems, such as the p-path problem and the pd-feedback vertex set problem. In particular, we designed the first algorithm with running time of the form $f(k)n^{O(1)}$ for the pd-feedback vertex set problem. This resolved an outstanding open problem, namely whether the pd-feedback vertex set problem is FPT.
    Finally, we show how to use parameterized algorithm techniques to solve the signaling pathway problem and the motif finding problem from bioinformatics.
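
    Color coding combined with dynamic programming, one of the techniques listed above, can be sketched for the p-path problem as follows. This is a textbook-style randomized version, not the dissertation's algorithm. It assumes that k counts the vertices on the path and that the graph is given as an adjacency dictionary; the function names and the number of trials are hypothetical.

```python
import random


def has_colorful_path(graph, coloring, k):
    """Dynamic programming over color subsets (bitmasks): is there a path
    on k vertices whose colors are pairwise distinct?"""
    # masks[v] holds the color sets of colorful paths ending at v.
    masks = {v: {1 << coloring[v]} for v in graph}
    for _ in range(k - 1):
        extended = {v: set() for v in graph}
        for u in graph:
            for mask in masks[u]:
                for v in graph[u]:
                    bit = 1 << coloring[v]
                    if not mask & bit:  # color of v not used yet
                        extended[v].add(mask | bit)
        masks = extended
    return any(masks[v] for v in graph)


def has_k_path(graph, k, trials=200, seed=0):
    """Randomized test for a simple path on k vertices.
    True is always correct; False may be wrong with small probability."""
    rng = random.Random(seed)
    for _ in range(trials):
        coloring = {v: rng.randrange(k) for v in graph}
        if has_colorful_path(graph, coloring, k):
            return True
    return False


# A path 0-1-2-3 and a star with center 0, both undirected.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star_graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(has_k_path(path_graph, 4), has_k_path(star_graph, 4))
```

    A colorful path is necessarily simple, so each trial costs time exponential only in k, and a fixed k-vertex path becomes colorful with probability k!/k^k per trial. This matches the $f(k)n^{O(1)}$ form in expectation.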

    Bioinformatics Database Systems

    We argue the significance of a fundamental shift in bioinformatics, from in-the-small to in-the-large. Adopting a large-scale perspective is a way to manage the problems endemic to the world of the small: constellations of incompatible tools for which the effort required to assemble an integrated system exceeds the perceived benefit of the integration. Where bioinformatics in-the-small is about data and tools, bioinformatics in-the-large is about metadata and dependencies. Dependencies represent the complexities of large-scale integration, including the requirements and assumptions governing the composition of tools. The popular make utility is a very effective system for defining and maintaining simple dependencies, and it offers a number of insights about the essence of bioinformatics in-the-large. Keeping an in-the-large perspective has been very useful to us in large bioinformatics projects. We give two fairly different examples, and extract lessons from them showing how it has helped. These examples both suggest the benefit of explicitly defining and managing knowledge flows and knowledge maps (which represent metadata regarding types, flows, and dependencies), and also suggest approaches for developing bioinformatics database systems. Generally, we argue that large-scale engineering principles can be successfully adapted from disciplines such as software engineering and data management, and that having an in-the-large perspective will be a key advantage in the next phase of bioinformatics development.
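
    The make-style view of dependencies can be sketched with a small dependency map and a topological sort. The targets below describe a hypothetical analysis pipeline and are not taken from the paper; the sketch uses Python's standard `graphlib` module (Python 3.9 or later) and orders targets without actually running any tools.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each target maps to the set of things it depends on,
# much like a rule "target: prerequisites" in a makefile.
dependencies = {
    "alignment.sam": {"reads.fastq", "reference.fa"},
    "variants.vcf": {"alignment.sam", "reference.fa"},
    "report.html": {"variants.vcf"},
}

# static_order() yields every node after all of its prerequisites.
build_order = list(TopologicalSorter(dependencies).static_order())
print(build_order)
```

    The relative order of independent inputs such as reads.fastq and reference.fa is not fixed, but every target appears after its prerequisites, and a cyclic dependency raises `graphlib.CycleError`.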

    Evolving from Bioinformatics in the Small to Bioinformatics

