83,654 research outputs found

Accelerated large-scale multiple sequence alignment

Author: A Szalkowski
A Wilm
A Wirawan
AV Bhatt
C Grasso
C Notredame
D Mikhailov
DF Feng
E Eskin
G Tan
GM Amdahl
H Carroll
H Vandierendonck
I Letunic
J Cheetham
J Ebedes
J Nickolls
JD Thompson
JD Thompson
JD Thompson
K Katoh
KB Li
M Farrar
M Feldman
M Friedman
OpenMP
Quinn O Snell
RC Edgar
S Lloyd
S Washietl
Scott Lloyd
SR Eddy
T Lassmann
T Oliver
T Ramdas
T Wang
X Deng
X Lin
Y Li
Y Liu
Y Liu
Publication venue: BioMed Central
Publication date: 01/12/2011

Background: Multiple sequence alignment (MSA) is a fundamental analysis method used in bioinformatics and many comparative genomic applications. Prior MSA acceleration attempts with reconfigurable computing have only addressed the first stage of progressive alignment and consequently exhibit performance limitations according to Amdahl's Law. This work is the first known to accelerate the third stage of progressive alignment on reconfigurable hardware. Results: We reduce subgroups of aligned sequences into discrete profiles before they are pairwise aligned on the accelerator. Using an FPGA accelerator, an overall speedup of up to 150 has been demonstrated on a large data set when compared to a 2.4 GHz Core2 processor. Conclusions: Our parallel algorithm and architecture accelerate large-scale MSA with reconfigurable computing and allow researchers to solve the larger problems that confront biologists today. Program source is available from http://dna.cs.byu.edu/msa/.
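The Amdahl's Law limitation mentioned in this abstract can be illustrated with a short sketch. It assumes a simple model in which a fraction `p` of the total runtime is accelerated by a factor `s` and the rest is unchanged. The function names and the example fractions are hypothetical and are not taken from the paper.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the runtime is accelerated s-fold."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    # The unaccelerated share (1 - p) keeps its cost; the rest shrinks by s.
    return 1.0 / ((1.0 - p) + p / s)


def amdahl_limit(p: float) -> float:
    """Upper bound on the overall speedup as s grows without bound."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    # Hypothetical runtime fractions, not measurements from the paper.
    for p in (0.5, 0.9, 0.99):
        print(p, amdahl_speedup(p, 100.0), amdahl_limit(p))
```

Under this model, accelerating only one stage of a pipeline caps the overall gain at `1 / (1 - p)`, however fast that stage becomes, which is why the remaining stages matter.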

Crossref

Directory of Open Access Journals

PubMed Central

MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems

Author: González-Domínguez Jorge
Liu Yongchao
Schmidt Bertil
Touriño Juan
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2016

This is a pre-copyedited, author-produced version of an article accepted for publication in Bioinformatics following peer review. The version of record, Jorge González-Domínguez, Yongchao Liu, Juan Touriño, Bertil Schmidt; MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems, Bioinformatics, Volume 32, Issue 24, 15 December 2016, Pages 3826–3828, is available online at: https://doi.org/10.1093/bioinformatics/btw558. [Abstract] MSAProbs is a state-of-the-art protein multiple sequence alignment tool based on hidden Markov models. It can achieve high alignment accuracy at the expense of relatively long runtimes for large-scale input datasets. In this work we present MSAProbs-MPI, a distributed-memory parallel version of the multithreaded MSAProbs tool that is able to reduce runtimes by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on a cluster with 32 nodes (each containing two Intel Haswell processors) shows reductions in execution time of over one order of magnitude for typical input datasets. Furthermore, MSAProbs-MPI using eight nodes is faster than the GPU-accelerated QuickProbs running on a Tesla K20. Another strong point is that MSAProbs-MPI can deal with large datasets for which MSAProbs and QuickProbs might fail due to time and memory constraints, respectively.

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

METHODS FOR HIGH-THROUGHPUT COMPARATIVE GENOMICS AND DISTRIBUTED SEQUENCE ANALYSIS

Author: Angiuoli Samuel Vincent
Publication date: 01/01/2011

High-throughput sequencing has accelerated applications of genomics throughout the world. The increased production and decentralization of sequencing has also created bottlenecks in computational analysis. In this dissertation, I provide novel computational methods to improve analysis throughput in three areas: whole genome multiple alignment, pan-genome annotation, and bioinformatics workflows. To aid in the study of populations, tools are needed that can quickly compare multiple genome sequences, millions of nucleotides in length. I present a new multiple alignment tool for whole genomes, named Mugsy, that implements a novel method for identifying syntenic regions. Mugsy is computationally efficient, does not require a reference genome, and is robust in identifying a rich complement of genetic variation including duplications, rearrangements, and large-scale gain and loss of sequence in mixtures of draft and completed genome data. Mugsy is evaluated on the alignment of several dozen bacterial chromosomes on a single computer and was the fastest program evaluated for the alignment of assembled human chromosome sequences from four individuals. A distributed version of the algorithm is also described and provides increased processing throughput using multiple CPUs. Numerous individual genomes are sequenced to study diversity, evolution and classify pan-genomes. Pan-genome annotations contain inconsistencies and errors that hinder comparative analysis, even within a single species. I introduce a new tool, Mugsy-Annotator, that identifies orthologs and anomalous gene structure across a pan-genome using whole genome multiple alignments. Identified anomalies include inconsistently located translation initiation sites and disrupted genes due to draft genome sequencing or pseudogenes. An evaluation of pan-genomes indicates that such anomalies are common and alternative annotations suggested by the tool can improve annotation consistency and quality. 
Finally, I describe the Cloud Virtual Resource, CloVR, a desktop application for automated sequence analysis that improves usability and accessibility of bioinformatics software and cloud computing resources. CloVR is installed on a personal computer as a virtual machine and requires minimal installation, addressing challenges in deploying bioinformatics workflows. CloVR also seamlessly accesses remote cloud computing resources for improved processing throughput. In a case study, I demonstrate the portability and scalability of CloVR and evaluate the costs and resources for microbial sequence analysis.

CiteSeerX

Digital Repository at the University of Maryland

HH-suite3 for fast remote homology detection and deep protein annotation.

Author: Haunsberger S.
Meier M.
Mirdita M.
Steinegger M.
Söding J.
Vöhringer H.
Publication venue: Springer Science and Business Media LLC
Publication date: 14/09/2019

BACKGROUND: HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins. RESULTS: We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. These accelerated the search methods HHsearch by a factor of 4 and HHblits by a factor of 2 over the previous version 2.0.16. HHblits3 is ∼10× faster than PSI-BLAST and ∼20× faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over cluster servers using OpenMP and message passing interface (MPI). The free, open-source, GPLv3-licensed software is available at https://github.com/soedinglab/hh-suite. CONCLUSION: The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.

RCSI Repository

MPG.PuRe

Accelerated Evolution of the ASPM Gene Controlling Brain Size Begins Prior to Human Brain Expansion

Author: Abeysinghe
Aota
Benson
Bernardi
Bond
Bond
Chen
Chenn
Chuzhanova
Clark
Crandall
Dewyse
do Carmo Avides
Duret
Endo
Evans
Galtier
Gould
Hughes
Jurka
Kent
Kouprina
Kreitman
McCreary
McDonald
Mochida
Morgenstern
Noskov
Polushin
Rehen
Riparbelli
Ripoll
Roberts
Rudiger
Sharp
Sharp
Tobias
Wood
Yang
Yang
Zhang
Publication venue: Public Library of Science
Publication date: 01/01/2004

Primary microcephaly (MCPH) is a neurodevelopmental disorder characterized by global reduction in cerebral cortical volume. The microcephalic brain has a volume comparable to that of early hominids, raising the possibility that some MCPH genes may have been evolutionary targets in the expansion of the cerebral cortex in mammals and especially primates. Mutations in ASPM, which encodes the human homologue of a fly protein essential for spindle function, are the most common known cause of MCPH. Here we have isolated large genomic clones containing the complete ASPM gene, including promoter regions and introns, from chimpanzee, gorilla, orangutan, and rhesus macaque by transformation-associated recombination cloning in yeast. We have sequenced these clones and show that whereas much of the sequence of ASPM is substantially conserved among primates, specific segments are subject to high Ka/Ks ratios (nonsynonymous/synonymous DNA changes) consistent with strong positive selection for evolutionary change. The ASPM gene sequence shows accelerated evolution in the African hominoid clade, and this precedes hominid brain expansion by several million years. Gorilla and human lineages show particularly accelerated evolution in the IQ domain of ASPM. Moreover, ASPM regions under positive selection in primates are also the most highly diverged regions between primates and nonprimate mammals. We report the first direct application of TAR cloning technology to the study of human evolution. Our data suggest that evolutionary selection of specific segments of the ASPM sequence strongly relates to differences in cerebral cortical size.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Genetic Sequence Matching Using D4M Big Data Approaches

Author: Dodson Stephanie
Kepner Jeremy
Ricke Darrell O.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 31/07/2014

Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M), an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and the implementation of an Apache Accumulo database to accelerate computations one-hundred fold over other methods. Comparisons of the D4M method with the current gold standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST. Comment: 6 pages; to appear in IEEE High Performance Extreme Computing (HPEC) 201

arXiv.org e-Print Archive

Crossref

Computational Strategies for Scalable Genomics Analysis.

Author: Shi Lizhen
Wang Zhong
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

The revolution in next-generation DNA sequencing technologies is leading to explosive data growth in genomics, posing a significant challenge to the computing infrastructure and software algorithms for genomics analysis. Various big data technologies have been explored to scale up/out current bioinformatics solutions to mine the big genomics data. In this review, we survey some of these exciting developments in the applications of parallel distributed computing and special hardware to genomics. We comment on the pros and cons of each strategy in the context of ease of development, robustness, scalability, and efficiency. Although this review is written for an audience from the genomics and bioinformatics fields, it may also be informative for the audience of computer science with interests in genomics applications.

eScholarship - University of California

Comparative Genomic Characterization of the Multimammate Mouse Mastomys coucha.

Author: Aaron Hardin
Andersen
Bao
Besemer
Bickhart
Blanchette
Bolger
Bonwitt
Booker
Bourque
Camacho
Capra
Chapman
Chen
Chikhi
Chu
Colangelo
Dewey
Dierckxsens
Eblaghie
Grabherr
Hayssen
Heinz
Helfrich
Holloway
Holt
Jiang
Kannan
Kiełbasa
Kim
Kimberly A Nevonen
Kolmogorov
Korf
Krueger
Lecompte
Li
Li
Lok
Lowe
Lucia Carbone
MacManes
McLean
Modlin
Nadav Ahituv
Nagy
Nilsson
Närhi
Pennacchio
Pertea
Pimentel
Pollard
Sands
Schep
Scott
Siepel
Siepel
Simão
Smit
Snell
Song
Stanke
UniProt Consortium
Van der Auwera
Veltmaat
Veltmaat
Walter L Eckalbar
Publication venue: eScholarship, University of California
Publication date: 01/12/2019

Mastomys are the most widespread African rodent and carriers of various diseases such as the plague or Lassa virus. In addition, mastomys have rapidly gained a large number of mammary glands. Here, we generated a genome, variome, and transcriptomes for Mastomys coucha. As mastomys diverged at similar times from mouse and rat, we demonstrate their utility as a comparative genomic tool for these commonly used animal models. Furthermore, we identified over 500 mastomys accelerated regions, often residing near important mammary developmental genes or within their exons leading to protein sequence changes. Functional characterization of a noncoding mastomys accelerated region, located in the HoxD locus, showed enhancer activity in mouse developing mammary glands. Combined, our results provide genomic resources for mastomys and highlight their potential both as a comparative genomic tool and for the identification of mammary gland number determining factors.

Crossref

eScholarship - University of California