142,012 research outputs found

    Distributed Representation of Protein Sequence Based on Multi-Alignment Results

    Get PDF
    Protein sequence representation is a key problem for protein studies, especially for those sequence-based models. In this paper, a distributed representation model of protein sequence is proposed, which involves evolutionary information by introducing multi-alignment results. Firstly, we construct a non-redundancy protein dataset and perform multi-alignment for each protein. Then k-mer amino acids "biology corpus" was abstracted from the alignment results which are "evolutionary information" enriched. Using the "biology corpus", k-mer amino acids distributed embedding vectors could be trained according to word2vec method. We compared the amino acid pair distance derived from our produced 1-mer amino acids distributed embedding vectors with that derived from BLOSUM62; it was found that their Pearson coefficient is 0.937, showing they have strong correlation. Then we applied the obtained amino acids distributed embedding representation to protein secondary structure recognition and solubility prediction. For both of the experiments, our proposed alignment results based amino acid distributed representation outperforms that derived directly from protein sequences. Moreover, compared to those existing up-to-date algorithms, our method could get better or comparative results, on condition of only using the feature of our produced amino acid distributed vectors

    Identification of Protein Alignment for Elder Health Care

    Get PDF
    Over many years protein sequence alignment problem has grabbed attention of biologists as it implicates, more than two biological sequences. It states all the important aspects of big data and how medical and health informatics, translational bioinformatics will benefit personalized health care both structured and unstructured, covering genomics, proteomics, metabolism. The system develop approach for biological sequence alignment to increase efficiency of analysis operation that speed up the calculation of alignment for huge real time sequences, to develop distributed scan approach in Smith-waterman algorithm for presenting fast solution and optimize the Smith Waterman(SW) alignment algorithm using Distributed approach

    A Parallel Algorithm for Large-Scale Multiple Sequence Alignment

    Get PDF
    Multiple sequence alignment is a central topic of extensive research in computational biology. Basically, two or more protein sequences are compared to evaluate their similarity and to identify conserved regions. This work reports a methodology for parallel processing of a multiple sequence alignment algorithm (ClustalW) in an environment of networked computers. A detailed description of the modules that compose the distributed system is provided, giving special attention to the way a dynamic programming algorithm is run in multilevel parallelism. Extensive experiments were done to evaluate performance and scalability of the reported method. Results suggest that the proposed method is very promising for large-scale multiple protein sequence alignment

    ClustalXeed: a GUI-based grid computation version for high performance and terabyte size multiple sequence alignment

    Get PDF
    Abstract Background There is an increasing demand to assemble and align large-scale biological sequence data sets. The commonly used multiple sequence alignment programs are still limited in their ability to handle very large amounts of sequences because the system lacks a scalable high-performance computing (HPC) environment with a greatly extended data storage capacity. Results We designed ClustalXeed, a software system for multiple sequence alignment with incremental improvements over previous versions of the ClustalX and ClustalW-MPI software. The primary advantage of ClustalXeed over other multiple sequence alignment software is its ability to align a large family of protein or nucleic acid sequences. To solve the conventional memory-dependency problem, ClustalXeed uses both physical random access memory (RAM) and a distributed file-allocation system for distance matrix construction and pair-align computation. The computation efficiency of disk-storage system was markedly improved by implementing an efficient load-balancing algorithm, called "idle node-seeking task algorithm" (INSTA). The new editing option and the graphical user interface (GUI) provide ready access to a parallel-computing environment for users who seek fast and easy alignment of large DNA and protein sequence sets. Conclusions ClustalXeed can now compute a large volume of biological sequence data sets, which were not tractable in any other parallel or single MSA program. The main developments include: 1) the ability to tackle larger sequence alignment problems than possible with previous systems through markedly improved storage-handling capabilities. 2) Implementing an efficient task load-balancing algorithm, INSTA, which improves overall processing times for multiple sequence alignment with input sequences of non-uniform length. 3) Support for both single PC and distributed cluster systems.</p

    Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices

    Full text link
    Identifying similar protein sequences is a core step in many computational biology pipelines such as detection of homologous protein sequences, generation of similarity protein graphs for downstream analysis, functional annotation and gene location. Performance and scalability of protein similarity searches have proven to be a bottleneck in many bioinformatics pipelines due to increases in cheap and abundant sequencing data. This work presents a new distributed-memory software, PASTIS. PASTIS relies on sparse matrix computations for efficient identification of possibly similar proteins. We use distributed sparse matrices for scalability and show that the sparse matrix infrastructure is a great fit for protein similarity searches when coupled with a fully-distributed dictionary of sequences that allows remote sequence requests to be fulfilled. Our algorithm incorporates the unique bias in amino acid sequence substitution in searches without altering the basic sparse matrix model, and in turn, achieves ideal scaling up to millions of protein sequences.Comment: To appear in International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'20

    GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains

    Get PDF
    GeMMA (Genome Modelling and Model Annotation) is a new approach to automatic functional subfamily classification within families and superfamilies of protein sequences. A major advantage of GeMMA is its ability to subclassify very large and diverse superfamilies with tens of thousands of members, without the need for an initial multiple sequence alignment. Its performance is shown to be comparable to the established high-performance method SCI-PHY. GeMMA follows an agglomerative clustering protocol that uses existing software for sensitive and accurate multiple sequence alignment and profile–profile comparison. The produced subfamilies are shown to be equivalent in quality whether whole protein sequences are used or just the sequences of component predicted structural domains. A faster, heuristic version of GeMMA that also uses distributed computing is shown to maintain the performance levels of the original implementation. The use of GeMMA to increase the functional annotation coverage of functionally diverse Pfam families is demonstrated. It is further shown how GeMMA clusters can help to predict the impact of experimentally determining a protein domain structure on comparative protein modelling coverage, in the context of structural genomics

    Predicting Flavonoid UGT Regioselectivity

    Get PDF
    Machine learning was applied to a challenging and biologically significant protein classification problem: the prediction of avonoid UGT acceptor regioselectivity from primary sequence. Novel indices characterizing graphical models of residues were proposed and found to be widely distributed among existing amino acid indices and to cluster residues appropriately. UGT subsequences biochemically linked to regioselectivity were modeled as sets of index sequences. Several learning techniques incorporating these UGT models were compared with classifications based on standard sequence alignment scores. These techniques included an application of time series distance functions to protein classification. Time series distances defined on the index sequences were used in nearest neighbor and support vector machine classifiers. Additionally, Bayesian neural network classifiers were applied to the index sequences. The experiments identified improvements over the nearest neighbor and support vector machine classifications relying on standard alignment similarity scores, as well as strong correlations between specific subsequences and regioselectivities

    High-throughput Protein Sequence Alignment on Multi-core Systems

    Get PDF
    Rapid evolution in sequencing technologies results in generating data on an enormous scale. A focal and main challenge in analyzing data at such a large scale is the alignment of the DNA/Protein sequences, whereby reads are compared to the reference sequences. To find similar sequences, alignment algorithms are used to align a query sequence with the database. Alignment algorithms can be utilized to classify the source of a sequence, to discover similarities among the organisms, or to deduce a progenitor connection. A wide range of algorithms for alignment has been developed in recent years.In this paper, an accurate method of accelerating such algorithms using GPUs has been investigated. A Swiss-Prot database has been processed using GPU implemented Smith-Waterman Sequence Alignment Algorithm. The first step in the process generates the alignment scores but not the actual alignment. Various available alignment tools like ssearch2 are then utilized to align the output file generated during the first step.The performance of GPU-accelerated implementation as compared to other techniques is then evaluated for performance /throughput improvement. Swiss-Prot database was aligned using various alignment tools. NVIDIA TESLA K40 GPU is being utilized for generating the results for this research. This implementation achieves the performance of 44.3 Giga cell updates per second (GCUPS), which is 22.9 times better than its implementation on GTX 275. Performance is improved as the workload of sequences of equal length is equally distributed among all the threads on Multiprocessors of GPU

    Identification of accelerated evolution in the metalloproteinase domain of snake venom metalloproteinase sequences (SVMPs) through comparative analysis

    Get PDF
    Computational protein sequence analysis is one of the most important tools used for understanding the evolution of closely related proteins sequences including snake venom metalloproteinase sequences (SVMPs) which give valuable information regarding genetic variations. The fundamental objective of the present study is to screen the evolution distributed in metalloproteinase domain regions of protein sequences among different SVMPs in snake species which are involved in a range of pathological disorders such as arthritis, atherosclerosis, liver fibrosis, cardiovascular, cancer, liver and neurodegenerative disorders. In fact, SVMPS are responsible for hemorrhage and may also interfere with the hemostatic system. A comparative characterization of the metalloproteinase sequences has been carried out to analyze their multiple sequence alignment, phylogenic tree, homology, physicochemical, secondary structural and functional properties. DNAMAN software was used for multiple sequence alignment, phylogenic tree and homology and Expasy’s Prot-param server was used for amino acid composition, physico-chemical and functional characterization of these SVMPs sequences. Studies of secondary structure of these SVMPs were carried out by computational program. Based on the observed patterns of occurrence of atypical features, we hypothesize that amino acids of metalloproteinase domain region (66.63% identity) of protein sequences are highly changeable; whereas, signal peptide region (93.98% identity) is the lowest changeable protein sequence and the remaining other three domains such as propeptide region (87.36% identity), desintegrin domain region (78.63% identity) and cysteine-rich domain region (75.70% identity) show moderate changeable protein sequence. SVMPs might be an accelerated evolution, which is a key player in causing diseases. From the data, it can be suggested that over -changed metalloproteinase domain regions in snake venom metalloproteinase might be responsible for the generation of functional variation of proteins expressed, which in turn may lead to different disorders in humans after snake bite. The results of this study would be an effective tool for the study of mutation, drugs resistance mechanisms and development of new drugs for different diseases.Key words: SVMPs, evolution, multiple sequence alignment, phylogenic tree, secondary structure, homology

    TarO : a target optimisation system for structural biology

    Get PDF
    This work was funded by the UK Biotechnology and Biological Sciences Research Council (BBSRC) Structural Proteomics of Rational Targets (SPoRT) initiative, (Grant BBS/B/14434). Funding to pay the Open Access publication charges for this article was provided by BBSRC.TarO (http://www.compbio.dundee.ac.uk/taro) offers a single point of reference for key bioinformatics analyses relevant to selecting proteins or domains for study by structural biology techniques. The protein sequence is analysed by 17 algorithms and compared to 8 databases. TarO gathers putative homologues, including orthologues, and then obtains predictions of properties for these sequences including crystallisation propensity, protein disorder and post-translational modifications. Analyses are run on a high-performance computing cluster, the results integrated, stored in a database and accessed through a web-based user interface. Output is in tabulated format and in the form of an annotated multiple sequence alignment (MSA) that may be edited interactively in the program Jalview. TarO also simplifies the gathering of additional annotations via the Distributed Annotation System, both from the MSA in Jalview and through links to Dasty2. Routes to other information gateways are included, for example to relevant pages from UniProt, COG and the Conserved Domains Database. Open access to TarO is available from a guest account with private accounts for academic use available on request. Future development of TarO will include further analysis steps and integration with the Protein Information Management System (PIMS), a sister project in the BBSRC Structural Proteomics of Rational Targets initiative.Publisher PDFPeer reviewe
    corecore