Distributed Representation of Protein Sequence Based on Multi-Alignment Results

Author: Liu He
Shi Cheng
Siqi Wang
Xiaohu Shi
Publication venue: 'Mechanical Engineering Faculty in Slavonski Brod'
Publication date: 01/01/2020

Protein sequence representation is a key problem in protein studies, especially for sequence-based models. This paper proposes a distributed representation model of protein sequences that incorporates evolutionary information through multi-alignment results. First, we construct a non-redundant protein dataset and perform multi-alignment for each protein. Then a k-mer amino acid "biology corpus" is extracted from the alignment results, which are enriched in evolutionary information. Using this corpus, distributed embedding vectors for k-mer amino acids are trained with the word2vec method. We compared the amino acid pair distances derived from our 1-mer embedding vectors with those derived from BLOSUM62; their Pearson coefficient is 0.937, indicating a strong correlation. We then applied the resulting embeddings to protein secondary structure recognition and solubility prediction. In both experiments, the alignment-based amino acid distributed representation outperforms the one derived directly from protein sequences. Moreover, compared with existing state-of-the-art algorithms, our method achieves better or comparable results while using only our amino acid distributed vectors as features.
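The corpus-building step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how k-mers are cut from the alignment output, so the overlapping-window tokenization, the gap handling, and the names `kmer_sentence` and `build_corpus` below are assumptions made for illustration only.

```python
from typing import Iterable, List

GAP = "-"  # gap symbol commonly used in alignment output (assumption)


def kmer_sentence(sequence: str, k: int) -> List[str]:
    """Split one sequence into overlapping k-mers, ignoring gap characters."""
    residues = sequence.replace(GAP, "").upper()
    return [residues[i:i + k] for i in range(len(residues) - k + 1)]


def build_corpus(aligned_sequences: Iterable[str], k: int) -> List[List[str]]:
    """Turn every aligned sequence into one 'sentence' of k-mer 'words'."""
    corpus = []
    for seq in aligned_sequences:
        sentence = kmer_sentence(seq, k)
        if sentence:  # skip sequences shorter than k
            corpus.append(sentence)
    return corpus


if __name__ == "__main__":
    # Toy alignment rows, invented for this example.
    toy_alignment = ["MKT-AYI", "MKTQAYV", "MRT-AYI"]
    for sentence in build_corpus(toy_alignment, k=3):
        print(sentence)
```

A list of token lists like this is the input shape that word2vec implementations generally accept; with k = 1 each residue becomes its own word, which corresponds to the 1-mer setting compared against BLOSUM62.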

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

Identification of Protein Alignment for Elder Health Care

Author: Sneha A. Khaire, Prof. N.R. Wankhade
Publication venue: 'Auricle Technologies, Pvt., Ltd.'
Publication date: 31/12/2016

For many years the protein sequence alignment problem has attracted the attention of biologists, as it involves more than two biological sequences. The paper outlines the important aspects of big data and how medical and health informatics and translational bioinformatics will benefit personalized health care, using both structured and unstructured data covering genomics, proteomics and metabolism. The system develops an approach to biological sequence alignment that increases the efficiency of analysis by speeding up the calculation of alignments for huge real-time sequences: it develops a distributed scan approach in the Smith-Waterman (SW) algorithm to provide a fast solution, and optimizes the SW alignment algorithm using a distributed approach.
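For orientation, the Smith-Waterman recurrence that such a distributed variant would parallelize can be sketched serially as follows. This is a baseline illustration, not the paper's distributed scan method; the match, mismatch and linear gap scores are assumed values, and a protein aligner would normally use a substitution matrix instead.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows are kept in memory because the sketch returns the score alone; recovering the alignment itself would require storing the full matrix or traceback pointers.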

International Journal on Recent and Innovation Trends in Computing and Communication

A Parallel Algorithm for Large-Scale Multiple Sequence Alignment

Author: Lima Carlos R. Erig
Lopes Heitor S.
Moritz Guilherme L.
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 26/01/2012

Multiple sequence alignment is a central topic of extensive research in computational biology. Basically, two or more protein sequences are compared to evaluate their similarity and to identify conserved regions. This work reports a methodology for parallel processing of a multiple sequence alignment algorithm (ClustalW) in an environment of networked computers. A detailed description of the modules that compose the distributed system is provided, giving special attention to the way a dynamic programming algorithm is run in multilevel parallelism. Extensive experiments were done to evaluate performance and scalability of the reported method. Results suggest that the proposed method is very promising for large-scale multiple protein sequence alignment.
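One reason progressive aligners such as ClustalW parallelize well is that the all-pairs comparison stage consists of independent tasks. The sketch below shows that idea only; it is not the paper's system. The thread pool stands in for networked nodes, and the k-mer Jaccard distance (`kmer_distance`) is a hypothetical placeholder for a real pairwise alignment score.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import Dict, List, Tuple


def kmer_distance(a: str, b: str, k: int = 2) -> float:
    """Stand-in distance: 1 minus the Jaccard similarity of the k-mer sets."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    return 1.0 - len(ka & kb) / len(union) if union else 0.0


def pairwise_distances(seqs: List[str],
                       workers: int = 4) -> Dict[Tuple[int, int], float]:
    """Compute all-pairs distances, handing the independent pairs to workers."""
    pairs = list(combinations(range(len(seqs)), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = pool.map(lambda p: kmer_distance(seqs[p[0]], seqs[p[1]]), pairs)
        return dict(zip(pairs, values))
```

In CPython, threads do not speed up CPU-bound work, so a practical version would dispatch the same task list to processes or remote machines; the task decomposition is the point of the example.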

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

ClustalXeed: a GUI-based grid computation version for high performance and terabyte size multiple sequence alignment

Author: A Boukerche
B Rost
D Mikhailov
DG Higgins
Hyun Joo
J Garnier
J Kleinjung
JD Thompson
K-B Li
M Schmollinger
MA Larkin
N Essoussi
O Trelles
R Chenna
RD Page
T Hagerup
Taeho Kim
V Chaudhary
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: There is an increasing demand to assemble and align large-scale biological sequence data sets. The commonly used multiple sequence alignment programs are still limited in their ability to handle very large numbers of sequences because they lack a scalable high-performance computing (HPC) environment with greatly extended data storage capacity. Results: We designed ClustalXeed, a software system for multiple sequence alignment with incremental improvements over previous versions of the ClustalX and ClustalW-MPI software. The primary advantage of ClustalXeed over other multiple sequence alignment software is its ability to align a large family of protein or nucleic acid sequences. To solve the conventional memory-dependency problem, ClustalXeed uses both physical random access memory (RAM) and a distributed file-allocation system for distance matrix construction and pair-align computation. The computational efficiency of the disk-storage system was markedly improved by implementing an efficient load-balancing algorithm, called the "idle node-seeking task algorithm" (INSTA). The new editing option and the graphical user interface (GUI) provide ready access to a parallel-computing environment for users who seek fast and easy alignment of large DNA and protein sequence sets. Conclusions: ClustalXeed can now process a large volume of biological sequence data sets that were not tractable with any other parallel or single MSA program. The main developments include: 1) the ability to tackle larger sequence alignment problems than was possible with previous systems, through markedly improved storage-handling capabilities; 2) an efficient task load-balancing algorithm, INSTA, which improves overall processing times for multiple sequence alignment with input sequences of non-uniform length; and 3) support for both single PC and distributed cluster systems.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices

Author: Azad Ariful
Buluc Aydin
Ekanayake Saliya
Guidi Giulia
Pavlopoulos Georgios
Selvitopi Oguz
Publication date: 30/09/2020

Identifying similar protein sequences is a core step in many computational biology pipelines such as detection of homologous protein sequences, generation of similarity protein graphs for downstream analysis, functional annotation and gene location. Performance and scalability of protein similarity searches have proven to be a bottleneck in many bioinformatics pipelines due to increases in cheap and abundant sequencing data. This work presents a new distributed-memory software, PASTIS. PASTIS relies on sparse matrix computations for efficient identification of possibly similar proteins. We use distributed sparse matrices for scalability and show that the sparse matrix infrastructure is a great fit for protein similarity searches when coupled with a fully-distributed dictionary of sequences that allows remote sequence requests to be fulfilled. Our algorithm incorporates the unique bias in amino acid sequence substitution in searches without altering the basic sparse matrix model, and in turn, achieves ideal scaling up to millions of protein sequences. Comment: To appear in International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'20).
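The sparse-matrix view of candidate detection can be sketched on a single machine: if A is a binary sequences-by-k-mers matrix, the off-diagonal nonzeros of A·Aᵀ are the sequence pairs that share at least one k-mer. The code below is an illustrative reduction under that assumption, not PASTIS itself; it stores A column-wise as a dictionary, and it omits the substitution-aware k-mer matching and the distributed-memory layer mentioned in the abstract. The name `shared_kmer_counts` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def shared_kmer_counts(seqs: List[str], k: int = 3) -> Dict[Tuple[int, int], int]:
    """Count distinct k-mers shared by each sequence pair.

    Equivalent to the off-diagonal nonzeros of A @ A.T, where A is a binary
    sequences-by-k-mers matrix stored column-wise as an inverted index.
    """
    # Column-wise sparse storage: k-mer -> set of sequence ids containing it.
    columns = defaultdict(set)
    for seq_id, seq in enumerate(seqs):
        for i in range(len(seq) - k + 1):
            columns[seq[i:i + k]].add(seq_id)

    # Each column adds 1 to every pair of rows that are nonzero in it.
    overlap = defaultdict(int)
    for seq_ids in columns.values():
        for pair in combinations(sorted(seq_ids), 2):
            overlap[pair] += 1
    return dict(overlap)
```

Pairs absent from the result share no k-mer and never need to be aligned, which is where the savings over all-against-all alignment come from.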

arXiv.org e-Print Archive

eScholarship - University of California

GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains

Author: Abascal
Abhiman
Addou
Alexeyenko
Andreeva
Attwood
Berman
Brenner
Brown
Brown
Bru
Chen
Christine Orengo
Cuff
David A. Lee
Dessailly
Devos
Edgar
Eisen
Engelhardt
Enright
Eramian
Finn
Friedberg
Godzik
Haft
Jensen
John
Kaplan
Katoh
Kersey
Krishnamurthy
Lee
Letunic
Li
Loewenstein
Mulder
O’Brien
Pegg
Petryszak
Pieper
Reeves
Rentzsch
Robert Rentzsch
Rost
Sadreyev
Sali
Sigrist
Thomas
Tian
Wicker
Wilson
Wu
Yeats
Publication venue: Oxford University Press
Publication date: 01/01/2009

GeMMA (Genome Modelling and Model Annotation) is a new approach to automatic functional subfamily classification within families and superfamilies of protein sequences. A major advantage of GeMMA is its ability to subclassify very large and diverse superfamilies with tens of thousands of members, without the need for an initial multiple sequence alignment. Its performance is shown to be comparable to the established high-performance method SCI-PHY. GeMMA follows an agglomerative clustering protocol that uses existing software for sensitive and accurate multiple sequence alignment and profile–profile comparison. The produced subfamilies are shown to be equivalent in quality whether whole protein sequences are used or just the sequences of component predicted structural domains. A faster, heuristic version of GeMMA that also uses distributed computing is shown to maintain the performance levels of the original implementation. The use of GeMMA to increase the functional annotation coverage of functionally diverse Pfam families is demonstrated. It is further shown how GeMMA clusters can help to predict the impact of experimentally determining a protein domain structure on comparative protein modelling coverage, in the context of structural genomics.
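The general shape of an agglomerative clustering protocol can be sketched as below. This is a generic illustration rather than GeMMA's procedure: GeMMA compares cluster profiles with external alignment software, whereas the sketch assumes a caller-supplied pairwise `similarity` function, average linkage, and a stopping `threshold`, all of which are hypothetical choices.

```python
from itertools import combinations
from typing import Callable, List, Set


def agglomerate(
    n_items: int,
    similarity: Callable[[int, int], float],
    threshold: float,
) -> List[Set[int]]:
    """Merge the most similar pair of clusters until none reaches threshold."""
    clusters = [{i} for i in range(n_items)]

    def linkage(c1: Set[int], c2: Set[int]) -> float:
        # Average linkage: mean similarity over all cross-cluster item pairs.
        return sum(similarity(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the highest linkage score.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters
```

Each iteration rescans all cluster pairs, which is simple but quadratic per merge; a heuristic or distributed variant would avoid recomputing unchanged comparisons.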

CiteSeerX

Crossref

PubMed Central

Predicting Flavonoid UGT Regioselectivity

Author: Jackson Rhydon
Knisley Debra
McIntosh Cecilia
Pfeiffer Phillip
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2011

Machine learning was applied to a challenging and biologically significant protein classification problem: the prediction of flavonoid UGT acceptor regioselectivity from primary sequence. Novel indices characterizing graphical models of residues were proposed and found to be widely distributed among existing amino acid indices and to cluster residues appropriately. UGT subsequences biochemically linked to regioselectivity were modeled as sets of index sequences. Several learning techniques incorporating these UGT models were compared with classifications based on standard sequence alignment scores. These techniques included an application of time series distance functions to protein classification. Time series distances defined on the index sequences were used in nearest neighbor and support vector machine classifiers. Additionally, Bayesian neural network classifiers were applied to the index sequences. The experiments identified improvements over the nearest neighbor and support vector machine classifications relying on standard alignment similarity scores, as well as strong correlations between specific subsequences and regioselectivities.
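How a time series distance can drive a nearest neighbor classifier over index sequences is sketched below. The abstract does not name the distance functions used, so dynamic time warping (DTW) is an assumed example, and the numeric sequences and class labels in the usage note are invented placeholders.

```python
from typing import List, Sequence, Tuple


def dtw_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Dynamic time warping distance with absolute-difference local cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)  # boundary row: only the origin is reachable
    for xi in x:
        curr = [inf] * (len(y) + 1)
        for j, yj in enumerate(y, start=1):
            cost = abs(xi - yj)
            # Extend the cheapest of the three admissible warping steps.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(y)]


def nearest_neighbor_label(
    query: Sequence[float],
    training: List[Tuple[Sequence[float], str]],
) -> str:
    """Return the label of the training series closest to query under DTW."""
    return min(training, key=lambda item: dtw_distance(query, item[0]))[1]
```

For example, `nearest_neighbor_label([1, 2, 3], [([1, 2, 2, 3], "class_a"), ([5, 5, 5], "class_b")])` picks `"class_a"`, because warping lets the repeated 2 align at no cost.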

Crossref

Directory of Open Access Journals

PubMed Central

East Tennessee State University

High-throughput Protein Sequence Alignment on Multi-core Systems

Author: Ali Syed Asad
Hasan Laiq
Yahya Muhammad
Publication venue: 'Penerbit UTHM'
Publication date: 03/08/2020

Rapid evolution in sequencing technologies is generating data on an enormous scale. A central challenge in analyzing data at such a scale is the alignment of DNA/protein sequences, whereby reads are compared to reference sequences. To find similar sequences, alignment algorithms align a query sequence against a database. Alignment algorithms can be used to classify the source of a sequence, to discover similarities among organisms, or to deduce an ancestral relationship. A wide range of alignment algorithms has been developed in recent years. In this paper, an accurate method of accelerating such algorithms using GPUs is investigated. The Swiss-Prot database was processed using a GPU implementation of the Smith-Waterman sequence alignment algorithm. The first step in the process generates the alignment scores but not the actual alignments. Available alignment tools such as ssearch2 are then used to align the output file generated during the first step. The GPU-accelerated implementation is then compared with other techniques in terms of performance/throughput improvement. The Swiss-Prot database was aligned using various alignment tools. An NVIDIA Tesla K40 GPU was used to generate the results for this research. The implementation achieves 44.3 giga cell updates per second (GCUPS), which is 22.9 times better than its implementation on a GTX 275. Performance improves because the workload of equal-length sequences is distributed equally among all the threads on the GPU's multiprocessors.
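The GCUPS figure quoted above follows from a simple count of dynamic programming cells. The sketch below spells out that arithmetic; the function name `gcups` and its arguments are illustrative, and the baseline value is derived here from the two numbers in the abstract rather than reported by the authors.

```python
def gcups(query_length: int, database_residues: int, seconds: float) -> float:
    """Giga cell updates per second for one query scanned against a database."""
    cells = query_length * database_residues  # one DP cell per residue pair
    return cells / seconds / 1e9


# Baseline throughput implied by the two figures quoted in the abstract.
reported_gcups = 44.3
reported_speedup = 22.9
implied_baseline_gcups = reported_gcups / reported_speedup  # roughly 1.93
```

As a sanity check on units, a 1,000-residue query scanned against 2,000,000 database residues in one second corresponds to 2.0 GCUPS.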

Journals of Universiti Tun Hussein Onn Malaysia (UTHM)

International Journal of Integrated Engineering

Identification of accelerated evolution in the metalloproteinase domain of snake venom metalloproteinase sequences (SVMPs) through comparative analysis

Author: Foyasal Khaja
Islam Mahmudul
Islam Zohorul
Roly Zahida Yesmin
Ruhullah Mirza
Sharia Alsan
Tanvir Rafsan Zani
Publication venue: 'African Journals Online (AJOL)'
Publication date: 29/03/2016

Computational protein sequence analysis is one of the most important tools for understanding the evolution of closely related protein sequences, including snake venom metalloproteinase sequences (SVMPs), and gives valuable information regarding genetic variation. The fundamental objective of the present study is to screen the evolution distributed across the metalloproteinase domain regions of protein sequences among different SVMPs in snake species, which are involved in a range of pathological disorders such as arthritis, atherosclerosis, liver fibrosis, cardiovascular disease, cancer, and liver and neurodegenerative disorders. In fact, SVMPs are responsible for hemorrhage and may also interfere with the hemostatic system. A comparative characterization of the metalloproteinase sequences was carried out to analyze their multiple sequence alignment, phylogenic tree, homology, and physicochemical, secondary structural and functional properties. DNAMAN software was used for multiple sequence alignment, phylogenic tree and homology, and Expasy's ProtParam server was used for amino acid composition and for physicochemical and functional characterization of these SVMP sequences. The secondary structure of these SVMPs was studied with a computational program. Based on the observed patterns of occurrence of atypical features, we hypothesize that the amino acids of the metalloproteinase domain region (66.63% identity) are highly changeable, whereas the signal peptide region (93.98% identity) is the least changeable and the remaining three domains, namely the propeptide region (87.36% identity), the disintegrin domain region (78.63% identity) and the cysteine-rich domain region (75.70% identity), are moderately changeable. SVMPs might be undergoing accelerated evolution, which is a key player in causing diseases.
From the data, it can be suggested that the over-changed metalloproteinase domain regions in snake venom metalloproteinases might be responsible for the functional variation of the expressed proteins, which in turn may lead to different disorders in humans after snake bite. The results of this study would be an effective tool for the study of mutation, drug resistance mechanisms and the development of new drugs for different diseases.
Key words: SVMPs, evolution, multiple sequence alignment, phylogenic tree, secondary structure, homology
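A per-region identity percentage of the kind quoted above can be computed from a multiple sequence alignment as sketched below. The abstract does not state how DNAMAN defines identity, so the rule used here (a column counts only if every sequence has the same non-gap residue) and the name `column_identity` are assumptions.

```python
from typing import List


def column_identity(alignment: List[str], start: int, end: int) -> float:
    """Percentage of columns in [start, end) where all sequences share a residue."""
    if not alignment or end <= start:
        raise ValueError("need a non-empty alignment and a non-empty column range")
    identical = 0
    for col in range(start, end):
        residues = {seq[col] for seq in alignment}
        # A column is identical only if one residue appears and it is not a gap.
        if len(residues) == 1 and "-" not in residues:
            identical += 1
    return 100.0 * identical / (end - start)
```

Calling the function once per domain, with that domain's column range, would give one percentage per region to compare, in the spirit of the signal peptide versus metalloproteinase domain comparison.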

AJOL - African Journals Online

TarO: a target optimisation system for structural biology

Author: Barton G J
Cameron S
Carter L G
Dawson A
Hunter W N
Martin D M
McMahon S A
Naismith Jim
Overton I M
van Niekerk C A
White Malcolm F
Publication date: 01/01/2008

This work was funded by the UK Biotechnology and Biological Sciences Research Council (BBSRC) Structural Proteomics of Rational Targets (SPoRT) initiative (Grant BBS/B/14434). Funding to pay the Open Access publication charges for this article was provided by BBSRC. TarO (http://www.compbio.dundee.ac.uk/taro) offers a single point of reference for key bioinformatics analyses relevant to selecting proteins or domains for study by structural biology techniques. The protein sequence is analysed by 17 algorithms and compared to 8 databases. TarO gathers putative homologues, including orthologues, and then obtains predictions of properties for these sequences including crystallisation propensity, protein disorder and post-translational modifications. Analyses are run on a high-performance computing cluster, the results integrated, stored in a database and accessed through a web-based user interface. Output is in tabulated format and in the form of an annotated multiple sequence alignment (MSA) that may be edited interactively in the program Jalview. TarO also simplifies the gathering of additional annotations via the Distributed Annotation System, both from the MSA in Jalview and through links to Dasty2. Routes to other information gateways are included, for example to relevant pages from UniProt, COG and the Conserved Domains Database. Open access to TarO is available from a guest account with private accounts for academic use available on request. Future development of TarO will include further analysis steps and integration with the Protein Information Management System (PIMS), a sister project in the BBSRC Structural Proteomics of Rational Targets initiative.
Publisher PDF. Peer reviewed.

Queen's University Belfast Research Portal

CiteSeerX

Abertay Research Portal

PubMed Central

Edinburgh Research Explorer

University of Dundee Online Publications

University of St. Andrews - Pure

St Andrews Research Repository