43 research outputs found

    A big data approach for sequences indexing on the cloud via Burrows Wheeler transform

    Indexing sequence data is important in the context of Precision Medicine, where large amounts of "omics" data have to be collected and analyzed daily in order to categorize patients and identify the most effective therapies. Here we propose an algorithm for computing the Burrows-Wheeler transform that relies on Big Data technologies, i.e., Apache Spark and Hadoop. Our approach is the first that distributes the index computation and not only the input dataset, making it possible to benefit fully from the available cloud resources. Copyright © 2020 for this paper by its authors
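
For readers unfamiliar with the transform named in this abstract, the following is a minimal single-machine sketch of the Burrows-Wheeler transform built from sorted rotations. It is not the distributed Spark/Hadoop algorithm the paper proposes; the function names `bwt` and `inverse_bwt` and the `$` sentinel are assumptions made for illustration only.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations.

    The sentinel is an illustrative assumption; it must not occur in the
    input and must sort before every input character.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # Build every cyclic rotation of s and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]
```

This version sorts all rotations in memory, so it only suits short strings; it is meant to show what is being computed, not how to compute it at scale.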

    Searching for repetitions in biological networks: methods, resources and tools

    We present here a compact overview of the data, models and methods proposed for the analysis of biological networks based on the search for significant repetitions. In particular, we concentrate on three problems widely studied in the literature: 'network alignment', 'network querying' and 'network motif extraction'. We provide (i) details of the experimental techniques used to obtain the main types of interaction data, (ii) descriptions of the models and approaches introduced to solve such problems and (iii) pointers to both the available databases and software tools. The intent is to lay out a useful roadmap for identifying suitable strategies to analyse cellular data, possibly based on the joint use of different interaction data types or analysis techniques

    Extracting string motif bases for quorum higher than two

    Bases of generators of motifs consisting of strings in which some positions can be occupied by a don’t care provide a useful conceptual tool for their description and a way to reduce the time and space involved in the discovery process. In the last few years, a few algorithms have been proposed for the extraction of a basis, building in large part on combinatorial properties of strings and their autocorrelations. Currently, the most efficient techniques for binary alphabets and quorum q = 2 require time quadratic in the length of the host string. The present paper explores properties of motif bases for quorum q ≥ 2, both with binary and general alphabets, by also showing that important results holding for quorum q = 2 cannot be extended to this, more general, case. Furthermore, the extraction of motifs in which a bound is set on the maximum allowed number of don’t cares is addressed, and suitable algorithms are proposed whose computational complexity depends on the fixed bound
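
As a small illustration of the terminology in this abstract, the sketch below finds the occurrences of a motif containing don't care positions and checks them against a quorum q. It shows only the definitions, not the basis-extraction algorithms of the paper; the `.` don't care symbol and the names `occurrences` and `satisfies_quorum` are assumptions.

```python
DONT_CARE = "."  # illustrative symbol for a don't care position


def occurrences(motif: str, text: str) -> list[int]:
    """Return the start positions where motif matches text.

    A don't care position in the motif matches any character of text.
    """
    m = len(motif)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(p == DONT_CARE or p == c for p, c in zip(motif, text[i:i + m]))
    ]


def satisfies_quorum(motif: str, text: str, q: int = 2) -> bool:
    """A motif satisfies quorum q if it occurs at least q times in text."""
    return len(occurrences(motif, text)) >= q
```

For example, `occurrences("a.b", "aabacbaab")` reports the three windows `aab`, `acb` and `aab`, so the motif satisfies quorum 3 but not quorum 4.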

    Optimal extraction of motif patterns in 2D

    The combinatorial explosion of motif patterns occurring in 1D and 2D arrays leads to the consideration of special classes of motifs growing linearly with the size of the input array. Such motifs, called irredundant motifs, are able to succinctly represent all of the other motifs occurring in the same array within reasonable time and space bounds. In previous work irredundant motifs were extracted from 2D arrays in O(N^2 log^2 n log log n) and O(N^3) time, where N is the size of the 2D input array and n is its largest dimension. In this paper, we present an algorithm to extract irredundant motifs from 2D arrays that is quadratic in the size of the input. The input is defined on a binary alphabet. It is shown that the algorithm is optimal and practically faster than the previous ones. © 2009 Elsevier B.V. All rights reserved

    DNA combinatorial messages and Epigenomics: The case of chromatin organization and nucleosome occupancy in eukaryotic genomes

    Epigenomics is the study of modifications on the genetic material of a cell that do not depend on changes in the DNA sequence, since these modifications involve specific proteins around which DNA wraps. The end result is that epigenomic changes have a fundamental role in the proper working of each cell in Eukaryotic organisms. A particularly important part of Epigenomics concentrates on the study of chromatin, that is, a fiber composed of a DNA-protein complex that is highly characteristic of Eukaryotes. Understanding how chromatin is assembled and how it changes is fundamental for Biology. In more than thirty years of research in this area, Mathematics and Theoretical Computer Science have gained a prominent role, in terms of modeling and mining, regarding in particular the so-called 10 nm fiber. Starting from some very basic notions of Biology, we briefly illustrate the recent advances obtained via laboratory experiments on the organization and dynamics of chromatin. Then, we mainly concentrate our attention on the contributions given by Combinatorial and Informational Methodologies, which are at the heart of Theoretical Computer Science, to the understanding of the mechanisms determining the 10 nm fiber. We conclude by highlighting several directions of investigation that are perceived as important and where Theoretical Computer Science can provide high-impact results

    Epigenomic k-mer dictionaries: shedding light on how sequence composition influences in vivo nucleosome positioning

    Motivation: Information-theoretic and compositional analysis of biological sequences, in terms of k-mer dictionaries, has a well-established role in genomic and proteomic studies. Much less so in epigenomics, although the role of k-mers in chromatin organization and nucleosome positioning is particularly relevant. Fundamental questions concerning the informational content and compositional structure of nucleosome favouring and disfavouring sequences with respect to their basic building blocks still remain open. Results: We present the first analysis on the role of k-mers in the composition of nucleosome enriched and depleted genomic regions (NER and NDR for short) that is: (i) exhaustive and within the bounds dictated by the information-theoretic content of the sample sets we use and (ii) informative for comparative epigenomics. We analyze four different organisms and we propose a paradigmatic formalization of k-mer dictionaries, providing two different and complementary views of the k-mers involved in NER and NDR. The first extends well-known studies in this area, its comparative nature being its major merit. The second, very novel, brings to light the rich variety of k-mers involved in influencing nucleosome positioning, for which an initial classification in terms of clusters is also provided. Although such a classification offers many insights, the following deserves to be singled out: short poly(dA:dT) tracts are reported in the literature as fundamental for nucleosome depletion; however, a global quantitative look reveals that their role is much less prominent than one would expect based on previous studies
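
To make the notion of a k-mer dictionary concrete, here is a minimal sketch that counts the k-mers of a set of DNA sequences with a sliding window. It is not the formalization or the analysis pipeline of the paper; the function name `kmer_dictionary` and the choice to skip windows containing symbols other than A, C, G, T are assumptions.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")


def kmer_dictionary(sequences: list[str], k: int) -> Counter:
    """Count every k-mer occurring in the given sequences.

    Windows containing characters outside A, C, G, T (for example N)
    are skipped; this filtering rule is an illustrative assumption.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    counts: Counter = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k over the sequence.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= DNA_ALPHABET:
                counts[kmer] += 1
    return counts
```

Building one such dictionary for a set of nucleosome enriched regions and another for depleted regions would give two count tables whose k-mer frequencies can then be compared.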

    Foreword: Algorithms, Strings and Theoretical Approaches in the Big Data Era – Special Issue in Honor of the 60th Birthday of Professor Raffaele Giancarlo (Editorial)

    Raffaele Giancarlo was born in 1957 in Salerno, Italy. He received his Laurea Degree in Computer Science from the University of Salerno in 1982. His Laurea thesis on combinatorial algorithms on words was supervised by Professor Alberto Apostolico. Some years later, in 1984, he was one of the few young researchers attending the Advanced Research Workshop on Combinatorial Algorithms on Words held at Maratea (Italy). In the same year, he won a public competition for an Assistant Professor position at the University of Salerno. He also decided to pursue graduate studies in the US. Raffaele Giancarlo obtained his Ph.D. in Computer Science from Columbia University in 1990, defending one of the first Ph.D. theses on algorithms and computational biology under the supervision of Professor Zvi Galil. In 1991 he attended the first formal edition of the Combinatorial Pattern Matching (CPM) conference, which was organized at Royal Holloway and Bedford New College (University of London). He has held several permanent or visiting scientist positions at many research labs and universities in both the USA and Europe, such as AT&T Bell Labs, Bell Labs of Lucent Technologies, AT&T Shannon Laboratories, University of Salerno, University of Palermo, Max Planck Institute for Molecular Genetics, INRIA and CNRS. He is currently Full Professor of Computer Science at the University of Palermo. Professor Giancarlo is a specialist in the design and analysis of combinatorial algorithms, with particular emphasis on string algorithms, ranging from data compression to bioinformatics. His scientific production consists of more than 90 papers that have appeared in established journals, conferences and book chapters. Moreover, he is coauthor of many patents, granted by the US Patent Office, in information retrieval. He is one of the founding members of Algorithms for Molecular Biology and of the Workshop on Algorithms in Bioinformatics (WABI). He has been the scientific director of one of the first Summer Schools in Bioinformatics and Computational Biology, which received more than 200 applications. He serves on the editorial boards of Theoretical Computer Science, BMC Bioinformatics and BMC Research Notes, and he has served either as chairman or as a member of the scientific committees for many conferences, such as CPM, SPIRE, WABI, RECOMB, ICALP and COCOON. He has been an invited keynote speaker at several conferences and summer schools, including the SIAM International Conference in Applied Mathematics and the Ettore Majorana Center for the Advancement of Science. He has also been the principal investigator of several Italian Ministry of Education research projects in bioinformatics and one CNRS Grant. Professor Giancarlo has been the first President of the Computer Science Curricula at the University of Palermo and he has been a member of the Italian Computer Science Curricula Commission of the Italian Association of Computer Science Researchers (GRIN). He is currently on the board of directors of the CINI Consortium, which represents all of the academic competences in Computer Science and Engineering present in Italy. In particular, for that Consortium, he is a founding member and on the board of InfoLife, a national laboratory covering all aspects of bioinformatics in Italy. He is the research delegate for the Department of Mathematics and Informatics, and the student placement delegate for the School of Basic and Applied Sciences, at the University of Palermo. Finally, he is a member of the Scientific Council of the University of Palermo