294 research outputs found

    Protein sequence classification using feature hashing

    Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
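
    The contrast between the two representations in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of MD5 as the hash function, and the bucket count are all assumptions chosen for illustration. The bag of k-grams keeps one count per distinct k-gram, while feature hashing adds each k-gram's count to one of a fixed number of buckets, so distinct k-grams may collide and have their counts aggregated.

```python
import hashlib
from collections import Counter


def kgrams(sequence, k):
    """Return all overlapping k-grams of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bucket(kgram, n_buckets):
    """Map a k-gram to a bucket index with a stable hash.

    MD5 is an illustrative assumption; any well-mixing hash would do.
    """
    digest = hashlib.md5(kgram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def bag_of_kgrams(sequence, k):
    """Baseline representation: one count per distinct k-gram."""
    return Counter(kgrams(sequence, k))


def hashed_features(sequence, k, n_buckets):
    """Feature hashing: a fixed-length count vector with n_buckets entries.

    K-grams that hash to the same bucket have their counts summed.
    """
    vector = [0] * n_buckets
    for gram in kgrams(sequence, k):
        vector[bucket(gram, n_buckets)] += 1
    return vector


if __name__ == "__main__":
    seq = "MKVLAAGIVGL"  # hypothetical protein fragment
    print(len(bag_of_kgrams(seq, 3)), "distinct 3-grams")
    print(hashed_features(seq, 3, 8))
```

    The length of the hashed vector depends only on the chosen number of buckets, not on k or on the alphabet size, which is what keeps the input space low-dimensional for large k.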

    Homology sequence analysis using GPU acceleration

    A number of problems in the fields of bioinformatics, systems biology and computational biology require abstracting physical entities to mathematical or computational models. In such studies, the computational paradigms often involve algorithms that can be solved by the Central Processing Unit (CPU). Historically, those algorithms have benefited from advances in the serial processing capabilities of individual CPU cores. However, that growth has slowed down over recent years, as scaling out CPUs has been shown to be both cost-prohibitive and insecure. To overcome this problem, parallel computing approaches that employ the Graphics Processing Unit (GPU) have gained attention as complementing or replacing traditional CPU approaches. The premise of this research is to investigate the applicability of various parallel computing platforms to several problems in the detection and analysis of homology in biological sequences. I hypothesize that by exploiting the sheer amount of computation power and sequencing data, it is possible to deduce information from raw sequences without supplying the underlying prior knowledge to come up with an answer. I have developed tools to perform analysis at scales that are traditionally unattainable with general-purpose CPU platforms. I have developed a method to accelerate sequence alignment on the GPU, and I used the method to investigate whether the Operational Taxonomic Unit (OTU) classification problem can be improved with such a sheer amount of computational power. I have developed a method to accelerate pairwise k-mer comparison on the GPU, and I used the method to further develop PolyHomology, a framework to scaffold shared sequence motifs across large numbers of genomes to illuminate the structure of the regulatory network in yeasts.
    The results suggest that such an approach to heterogeneous computing could help to answer questions in biology and is a viable path to new discoveries in the present and the future. Includes bibliographical references.

    Discovery of Unconventional Patterns for Sequence Analysis: Theory and Algorithms

    The biology community is collecting a large amount of raw data, such as the genome sequences of organisms, microarray data, and interaction data such as gene-protein interactions, protein-protein interactions, etc. This amount is rapidly increasing, and the process of understanding the data is lagging behind the process of acquiring it. An inevitable first step towards making sense of the data is to study their regularities, focusing on the non-random structures appearing surprisingly often in the input sequences: patterns. In this thesis we discuss three incarnations of the pattern discovery task, exploring three types of patterns that can model different regularities of the input dataset. While mask patterns have been designed to model short repeated biological sequences, showing a high conservation of their content at some specific positions, permutation patterns have been designed to detect repeated patterns whose parts maintain their physical adjacency but not their ordering in all the pattern occurrences. Transposons, instead, model mobile sequences in the input dataset, which can be discovered by comparing different copies of the same input string, detecting large insertions and deletions in their alignment.

    Diversity profiling and rapid detection of spoilage yeasts in a typical fruit juice bottling factory

    Thesis. Spoilage caused by yeasts is a constant and widespread problem in the beverage industry which can result in major economic losses. Fruit juices provide an environment which allows the proliferation of yeasts, leading to spoilage of the product. Some factories do not have the laboratory facilities to identify spoiler yeasts, and identification becomes a prolonged process if outsourced, which obstructs the planning of corrective actions. This study aimed to establish yeast diversity and apply a rapid method for preliminary identification of spoiler yeasts associated with a small-scale fruit juice bottling factory. The yeast population in the factory was determined by isolating yeasts from the production environment, process equipment and the spoiled products. Yeasts were identified by PCR-RFLP analysis targeting the 5.8S-ITS region and sequencing the D1/D2 domain of the 26S rRNA gene. A total of 201 yeasts belonging to ten different genera (Candida, Lodderomyces, Wickerhamomyces, Yarrowia, Zygosaccharomyces, Zygoascus, Cryptococcus, Filobasidium, Rhodotorula/Cystobasidium and Trichosporon) were isolated and identified from the production environment and processing equipment. The overall yeast distribution showed that Candida parapsilosis and Lodderomyces elongisporus were widely distributed in the factory, with Candida parapsilosis being reported as an opportunistic pathogen. Zygosaccharomyces bailii, Zygoascus hellenicus and Saccharomyces cerevisiae were isolated from the spoiled products and are known to be highly fermentative. In addition, Zygosaccharomyces bailii and Zygoascus hellenicus were found to be present inside the refrigerator where the fruit pulp is stored, which makes it a potential point of contamination. The data also provided a yeast control panel which was successfully utilized to identify unknown yeast in spoiled product from this factory using PCR-RFLP analysis.

    On the structure and evolution of protein interaction networks

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 107-114). The study of protein interactions from the networks point of view has yielded new insights into systems biology [Bar03, MA03, RSM+02, WS98]. In particular, "network motifs" become apparent as a useful and systematic tool for describing and exploring networks [BP06, MKFV06, MSOI+02, SOMMA02, SV06]. Finding motifs has involved either exact counting (e.g. [MSOI+02]) or subgraph sampling (e.g. [BP06, KIMA04a, MZW05]). In this thesis we develop an algorithm to count all instances of a particular subgraph, which can be used to query whether a given subgraph is a significant motif. This method can be used to perform exact counting of network motifs faster and with less memory than previous methods, and can also be combined with subgraph sampling to find larger motifs than ever before -- we have found motifs with up to 15 nodes and explored subgraphs up to 20 nodes. Unlike previous methods, this method can also be used to explore motif clustering and can be combined with network alignment techniques [FNS+06, KSK+03]. We also present new methods of estimating parameters for models of biological network growth, and present a new model based on these parameters and underlying binding domains. Finally, we propose an experiment to explore the effect of the whole genome duplication [KBL04] on the protein-protein interaction network of S. cerevisiae, allowing us to distinguish between cases of subfunctionalization and neofunctionalization. By Joshua A. Grochow. M.Eng.

    Automated modelling of multimeric protein complexes from heterogeneous structures

    Protein interaction networks provide an increasingly complex picture of the relationships between macromolecules in the cell. Complementing these interactions with structural data provides critical insights into interaction mechanisms. However, structural information is available only for a tiny fraction of protein interactions and complexes currently known. To address this gap, we have developed a method to predict macromolecular complex structures by systematic combination of pairwise interactions of known structure. We first identify all interactions within a network that are of known structure or sufficiently similar to known structure to permit homology modelling. We then use these structural constraints to construct models of complexes. We tackle combinatorial explosion by developing an efficient algorithm that exploits heuristics to reduce the large search space, and complement this with an automated scoring system to filter out the exponentially large number of unrealistic complexes, leaving a ranked set of the most plausible models. To test the approach, we defined a benchmark set of complexes of known structure, and show that many complexes can be re-created with good accuracy, using templates below 75% sequence identity. Certain models are much larger and more complete than what is possible with traditional modelling techniques. The approach can identify the most plausible homology models for a complex of dozens of proteins in less than a few hours. We applied the approach to whole-proteome sets of complexes from S. cerevisiae. For the complexes of known structure, we are able to identify the native complex in the majority of cases. We provide promising models for several dozen additional complexes, including multiple isoforms for each. Modelled complexes also provide functional classification, particularly for unannotated complexes from structural genomics initiatives.
    We show that the best results are achieved when the stoichiometry of the components is known and when the modelling is approached hierarchically, where core components, representing high-confidence interactions, are modelled before non-obligate interactions. We are refining this aspect of the automated modelling and making the procedure publicly available via a web service, to aid in the analysis of models. As the rate of structurally resolved interactions grows, our ability to model larger and more diverse complexes will grow exponentially.

    Selected Works in Bioinformatics

    This book consists of nine chapters covering a variety of bioinformatics subjects, ranging from database resources for protein allergens, unravelling genetic determinants of complex disorders, characterization and prediction of regulatory motifs, computational methods for identifying the best classifiers and key disease genes in large-scale transcriptomic and proteomic experiments, and functional characterization of inherently unfolded proteins/regions, to protein interaction networks and flexible protein-protein docking. The computational algorithms are in general presented in a way that is accessible to advanced undergraduate students, graduate students and researchers in molecular biology and genetics. The book should also serve as a stepping stone for mathematicians, biostatisticians, and computational scientists to cross their academic boundaries into the dynamic and ever-expanding field of bioinformatics.

    An aptamer-based sensing platform for luteinising hormone pulsatility measurement

    Normal fertility in humans involves highly orchestrated communication across the hypothalamic-pituitary-gonadal (HPG) axis. The pulsatile release of Luteinising Hormone (LH) is a critical element for downstream regulation of sex steroid hormone synthesis and the production of mature eggs. Changes in LH pulsatile pattern have been linked to hypothalamic dysfunction, resulting in multiple reproductive and growth disorders including Polycystic Ovary Syndrome (PCOS), Hypothalamic Amenorrhea (HA), and delayed/precocious puberty. Therefore, assessing the pulsatility of LH is important not only for academic investigation of infertility, but also for clinical decisions and monitoring of treatment. However, there is currently no clinically available tool for measuring human LH pulsatility. The immunoassay system is expensive and requires large volumes of patient blood, limiting its application for LH pulsatility monitoring. In this thesis, I propose a novel method using aptamer-enabled sensing technology to develop a device platform to measure LH pulsatility. I first generated a novel aptamer binding molecule against LH by a nitrocellulose membrane-based in vitro selection, then characterised its high affinity and specific binding properties by multiple biophysical/chemical methods. I then developed a sensitive electrochemical-based detection method using this aptamer. The principal mechanism is that structure switching upon binding is associated with changes in the electron transfer rate of the MB redox label. I then customised this assay to numerous device platforms under our rapid prototyping strategy, including a 96-well automated platform, a continuous sensing platform and a chip-based multiple electrode platform. The best-performing device was found to be the AELECAP (Automated ELEctroChemical Aptamer Platform), a 96-well plate based automatic micro-wire sensing platform capable of measuring a series of low-volume luteinising hormone samples within a short time.
    Clinical samples were evaluated using AELECAP. A series of clinical samples were measured, including LH pulsatility profiles of menopausal females (high LH amplitude), normal females/males (normal LH amplitude) and females with hypothalamic amenorrhea (no LH pulsatility). There were 12 patients of each type, with 50 blood samples collected at 10-minute intervals over 8 hours. Results showed that the system can distinguish LH pulsatile patterns among the cohorts, and pulsatility profiles were consistent with the results measured by clinical assays. AELECAP shows high potential as a novel approach for clinical aptamer-based sensing. AELECAP competes with current automated immunometric assay systems with lower costs, lower reagent use, and a simpler setup. There is potential for this approach to be further developed as a tool for infertility research and to assist clinicians in personalised treatment with hormonal therapy. Open Access.