    Towards the true tree: Bioinformatic approaches in the phylogenetics and molecular evolution of the Endopterygota

    In this thesis, I use bioinformatic approaches to address new and existing issues surrounding large-scale phylogenetic analysis. A phylogenetic analysis pipeline is developed to aid an investigation of the suitability of integrating Cytochrome Oxidase Subunit 1 (cox1) into phylogenetic supermatrices. In the first two chapters I assess the effect of varying cox1 sample size within a large, variable phylogenetic context. As well as intuitive results on increased quality with greater taxon sampling, there are clear monophyly patterns relating to local taxonomic sampling. Specifically, resampled taxa are more often monophyletic when fewer consubfamilials are represented, and these taxa tend to retain their degree of monophyly when rarefied. Sampling analyses are extended in chapter two using a mined Scarabaeoidea multilocus dataset, where taxa from given loci are used to improve existing matrices. Improvement in phylogenetic signal is best achieved by targeting cox1 to existing taxa, which suggests minimum parameters for cox1 adoption in large-scale phylogenetics. In chapter three I address recently arisen issues related to phyloinformatic analysis of sequence-delineated matrices. There is ongoing work on setting species boundaries by sequence variation alone, but incongruence results in methodological issues upon integrating multiple loci delineated in this way. In the final chapter I assess the impact of heterogeneous substitution rates on large-scale cox1 datasets. Although the number of heterogeneous sites in Coleoptera cox1 is substantial, their presence is found to be beneficial, as their removal negatively impacts the ability of the alignment to generate the 'known' topology. The homoplasy and heterogeneous characteristics of cox1 have not substantially impacted its utility; thus, the cox1 datasets have the potential to play a substantial role in the tree-of-life.

    A Study of Pseudo-Periodic and Pseudo-Bordered Words for Functions Beyond Identity and Involution

    Periodicity, primitivity and borderedness are some of the fundamental notions in combinatorics on words. Motivated by the Watson-Crick complementarity of DNA strands, wherein a word (strand) over the DNA alphabet $\{A, G, C, T\}$ and its Watson-Crick complement are informationally ``identical'', these notions have been extended to consider pseudo-periodicity and pseudo-borderedness obtained by replacing the ``identity'' function with ``pseudo-identity'' functions (an antimorphic involution in the case of Watson-Crick complementarity). For a given alphabet $\Sigma$, an antimorphic involution $\theta$ is an antimorphism, i.e., $\theta(uv)=\theta(v)\theta(u)$ for all $u, v \in \Sigma^{*}$, and an involution, i.e., $\theta(\theta(u))=u$ for all $u \in \Sigma^{*}$. In this thesis, we continue the study of pseudo-periodic and pseudo-bordered words for pseudo-identity functions including involutions. To start with, we propose a binary word operation, $\theta$-catenation, that generates $\theta$-powers (pseudo-powers) of a word for any morphic or antimorphic involution $\theta$. We investigate various properties of this operation, including closure properties of various classes of languages under it, and its connection with the previously defined notion of $\theta$-primitive words. A non-empty word $u$ is said to be $\theta$-bordered if there exists a non-empty word $v$ which is a prefix of $u$ while $\theta(v)$ is a suffix of $u$. We investigate the properties of $\theta$-bordered (pseudo-bordered) and $\theta$-unbordered (pseudo-unbordered) words for pseudo-identity functions $\theta$ with the property that $\theta$ is either a morphism or an antimorphism with $\theta^{n}=I$, for a given $n \geq 2$, or $\theta$ is a literal morphism or an antimorphism. Lastly, we initiate a new line of study by exploring the disjunctivity properties of sets of pseudo-bordered and pseudo-unbordered words and some other related languages for various pseudo-identity functions. In particular, we consider such properties for morphic involutions $\theta$ and prove that, for any $i \geq 2$, the set of all words with exactly $i$ $\theta$-borders is disjunctive (under certain conditions).
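
    The definitions in this abstract can be illustrated with a minimal Python sketch, taking $\theta$ to be the Watson-Crick antimorphic involution on the DNA alphabet. The function names are hypothetical and not taken from the thesis. The sketch also assumes that the border $v$ must be a proper prefix of $u$, whereas the abstract says only "prefix".

```python
from typing import List

# Watson-Crick complement over the DNA alphabet {A, C, G, T}.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def theta(word: str) -> str:
    """Antimorphic involution: complement every letter and reverse the word."""
    return "".join(COMPLEMENT[c] for c in reversed(word))


def theta_borders(u: str) -> List[str]:
    """Return every non-empty proper prefix v of u such that theta(v) is a suffix of u.

    Restricting v to proper prefixes is an assumption of this sketch.
    """
    return [u[:k] for k in range(1, len(u)) if u.endswith(theta(u[:k]))]


def is_theta_bordered(u: str) -> bool:
    """A word is theta-bordered if it has at least one theta-border."""
    return bool(theta_borders(u))
```

    For instance, `theta("AC")` gives `"GT"`, and `"ACGT"` is $\theta$-bordered because its prefix `"A"` maps to the suffix `"T"`, whereas `"AAC"` has no $\theta$-border under this definition.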

    Universal Indexes for Highly Repetitive Document Collections

    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower. Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk{\l}odowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
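
    The idea of run-length compressing a differential inverted list can be sketched in Python as follows. This is only an illustration of the general principle, not the encoding used in the paper: the function names and the list-of-pairs representation are assumptions. When a term occurs in many consecutive versions of a document, its posting list contains long stretches of consecutive document IDs, so the gap sequence is dominated by runs of 1s that collapse into a single pair.

```python
from typing import List, Tuple


def to_gaps(postings: List[int]) -> List[int]:
    """Differential encoding: the first ID followed by successive differences."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def run_length(gaps: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal gaps into (gap, count) pairs."""
    runs: List[Tuple[int, int]] = []
    for g in gaps:
        if runs and runs[-1][0] == g:
            runs[-1] = (g, runs[-1][1] + 1)
        else:
            runs.append((g, 1))
    return runs


def decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Rebuild the original sorted posting list from (gap, count) pairs."""
    postings, current = [], 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings
```

    With the posting list `[3, 4, 5, 6, 7, 20, 21, 22]`, the gaps are `[3, 1, 1, 1, 1, 13, 1, 1]` and the run-length form is `[(3, 1), (1, 4), (13, 1), (1, 2)]`; longer version chains shrink proportionally more.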

    DNA Hairpin Secondary Structure Design

    In this thesis, we propose a bottom-up method to design single-stranded DNA sequences that form consecutive hairpin structures. This work was inspired by the hairpin-based DNA multi-state machine proposed by Takahashi et al. in 2004. They successfully achieved this DNA multiple-hairpin structure in a laboratory experiment and proposed two possible applications. The first is to construct a random access memory (RAM) by using the DNA machines as the access addresses for the data. The second is to solve the maximum independent set problem (MISP). It is therefore of interest to investigate how to design DNA sequences that form such consecutive hairpin structures. We propose a bottom-up approach to construct consecutive hairpin structures, grounded on a so-called bond-free property and several combinatorial constraints. Software is implemented to study the behavior of our bottom-up approach. We also calculate the maximal number of sequences that correctly fold into the desired multiple-hairpin structure. This calculation provides an estimate of the size of the memory that can be constructed using Takahashi et al.'s method. Lastly, by selecting suitable parameters, we successfully construct a set of sequences that can fold into the desired multiple-hairpin structure. For example, our software is able to generate 120 sequences that can fold into a four-hairpin structure where the length of each hairpin stem is 20, the length of each hairpin loop is 7 and the length of the external segment is 20. We validate these sequences using the Vienna RNA secondary structure prediction package.

    Image Recognition Using Text and Audio Translation for the Visually Challenged

    The WHO has reported that 253 million people worldwide are visually impaired. Visually impaired individuals find it difficult to carry out their everyday lives, so it is vital to use current technologies to help them experience the world around them with as little difficulty as possible. To support visually impaired people, this project proposes a system that identifies images, translates the description of an image into text, and then produces audio. This can help the user read any text, recognize an image, and receive the result as speech. Motivated by recent work in machine translation and object recognition, a CNN-RNN based attention model is presented in this project. In the proposed framework, an image is first converted into a text description; then, using a basic text-to-speech API, the extracted caption is converted into speech, which further helps the visually impaired understand the image or visuals in front of them. The main effort therefore lies in building the captioning model, while the second stage, converting text to speech, is comparatively simple using the text-to-speech API. Once the model is built, it is deployed locally using a Flask-based application to produce an audio caption for any image fed to the model.

    DNA Sequence Classification: It’s Easier Than You Think: An open-source k-mer based machine learning tool for fast and accurate classification of a variety of genomic datasets

    Supervised classification of genomic sequences is a challenging, well-studied problem with a variety of important applications. We propose an open-source, supervised, alignment-free, highly general method for sequence classification that operates on k-mer proportions of DNA sequences. This method was implemented in a fully standalone general-purpose software package called Kameris, publicly available under a permissive open-source license. Compared to competing software, ours provides key advantages in terms of data security and privacy, transparency, and reproducibility. We perform a detailed study of its accuracy and performance on a wide variety of classification tasks, including virus subtyping, taxonomic classification, and human haplogroup assignment. We demonstrate the success of our method on whole mitochondrial, nuclear, plastid, plasmid, and viral genomes, as well as randomly sampled eukaryote genomes and transcriptomes. Further, we perform head-to-head evaluations on the tasks of HIV-1 virus subtyping and bacterial taxonomic classification with a number of competing state-of-the-art software solutions, and show that we match or exceed all other tested software in terms of accuracy and speed.
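
    As a rough illustration of what "k-mer proportions" means, the following Python sketch turns a DNA sequence into a fixed-length feature vector. It is not Kameris's implementation: the function name, the lexicographic ordering of k-mers, and the choice to ignore windows containing characters other than A, C, G, T are all assumptions made for this example.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_proportions(seq: str, k: int) -> List[float]:
    """Return the proportion of each of the 4**k possible k-mers in seq.

    K-mers are ordered lexicographically over "ACGT". Windows containing
    any other character (for example N) are ignored, which is an
    assumption of this sketch.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]
```

    Every sequence, whatever its length, maps to a vector of the same dimension (`4**k`), so such vectors could be passed to any standard supervised classifier without aligning the sequences first.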

    An Introduction to Programming for Bioscientists: A Python-based Primer

    Computing has revolutionized the biological sciences over the past several decades, such that virtually all contemporary research in the biosciences utilizes computer programs. The computational advances have come on many fronts, spurred by fundamental developments in hardware, software, and algorithms. These advances have influenced, and even engendered, a phenomenal array of bioscience fields, including molecular evolution and bioinformatics; genome-, proteome-, transcriptome- and metabolome-wide experimental studies; structural genomics; and atomistic simulations of cellular-scale molecular assemblies as large as ribosomes and intact viruses. In short, much of post-genomic biology is increasingly becoming a form of computational biology. The ability to design and write computer programs is among the most indispensable skills that a modern researcher can cultivate. Python has become a popular programming language in the biosciences, largely because (i) its straightforward semantics and clean syntax make it a readily accessible first language; (ii) it is expressive and well-suited to object-oriented programming, as well as other modern paradigms; and (iii) the many available libraries and third-party toolkits extend the functionality of the core language into virtually every biological domain (sequence and structure analyses, phylogenomics, workflow management systems, etc.). This primer offers a basic introduction to coding, via Python, and it includes concrete examples and exercises to illustrate the language's usage and capabilities; the main text culminates with a final project in structural bioinformatics. A suite of Supplemental Chapters is also provided. 
Starting with basic concepts, such as that of a 'variable', the Chapters methodically advance the reader to the point of writing a graphical user interface to compute the Hamming distance between two DNA sequences. Comment: 65 pages total, including 45 pages text, 3 figures, 4 tables, numerous exercises, and 19 pages of Supporting Information; currently in press at PLOS Computational Biology
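
    The Hamming distance mentioned above is simply the number of positions at which two equal-length sequences differ. A minimal Python sketch, without the graphical interface and not taken from the primer itself, might look like this (the function name and the decision to raise an error on unequal lengths are assumptions):

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))
```

    For example, `hamming_distance("GATTACA", "GACTATA")` returns 2, because the sequences differ at the third and sixth positions.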

    Unique reporter-based sensor platforms to monitor signalling in cells

    Introduction: In recent years much progress has been made in the development of tools for systems biology to study the levels of mRNA and protein, and their interactions within cells. However, few multiplexed methodologies are available to study cell signalling directly at the transcription factor level. Methods: Here we describe a sensitive, plasmid-based RNA reporter methodology to study transcription factor activation in mammalian cells, and apply this technology to profiling 60 transcription factors in parallel. The methodology uses two robust and easily accessible detection platforms: quantitative real-time PCR for quantitative analysis and DNA microarrays for parallel, higher throughput analysis. Findings: We test the specificity of the detection platforms with ten inducers and independently validate the transcription factor activation. Conclusions: We report a methodology for the multiplexed study of transcription factor activation in mammalian cells that is direct and not theoretically limited by the number of available reporters.