
    Analyzing large-scale DNA Sequences on Multi-core Architectures

    Rapid analysis of DNA sequences is important for preventing the evolution of viruses and bacteria at an early phase, for early diagnosis of genetic predispositions to certain diseases (cancer, cardiovascular disease), and in DNA forensics. However, real-world DNA sequences may comprise several gigabytes, and DNA analysis demands adequate computational resources to complete within a reasonable time. In this paper we present a scalable approach to parallel DNA analysis that is based on finite automata and is suitable for analyzing very large DNA segments. We evaluate our approach on real-world DNA segments of mouse (2.7 GB), cat (2.4 GB), dog (2.4 GB), chicken (1 GB), human (3.2 GB) and turkey (0.2 GB). Experimental results on a dual-socket shared-memory system with 24 physical cores show speed-ups of up to 17.6x. Our approach is up to 3x faster than a pattern-based parallel approach that uses the RE2 library. Comment: The 18th IEEE International Conference on Computational Science and Engineering (CSE 2015), Porto, Portugal, 20-23 October 2015.
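    The abstract does not include the authors' implementation; as an illustrative sketch only, a finite-automaton (KMP-style DFA) motif count can be split across chunks whose scans rewind by the pattern length minus one, so a match crossing a chunk border is still counted exactly once. The chunking scheme and the thread pool below are assumptions, not the paper's code:

```python
from concurrent.futures import ThreadPoolExecutor

def build_dfa(pattern, alphabet="ACGT"):
    """KMP-style DFA: state j means the last j characters read match pattern[:j]."""
    m = len(pattern)
    dfa = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    dfa[0][pattern[0]] = 1
    x = 0  # simulation state for pattern[1:j] (the longest proper border)
    for j in range(1, m):
        for c in alphabet:
            dfa[j][c] = dfa[x][c]      # on mismatch, fall back to the border state
        dfa[j][pattern[j]] = j + 1     # on match, advance
        x = dfa[x][pattern[j]]
    dfa[m] = dict(dfa[x])              # after a full match, continue from the border
    return dfa

def count_in_chunk(dfa, text, start, end):
    """Count matches that END inside [start, end); rewind so border matches are seen."""
    m = len(dfa) - 1
    state, hits = 0, 0
    for i in range(max(0, start - (m - 1)), end):
        state = dfa[state][text[i]]
        if state == m and i >= start:
            hits += 1
    return hits

def parallel_count(pattern, text, workers=4):
    """Overlapping-match count, with chunks dispatched to a pool of workers."""
    dfa = build_dfa(pattern)
    step = max(1, len(text) // workers)
    spans = [(s, min(s + step, len(text))) for s in range(0, len(text), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda sp: count_in_chunk(dfa, text, *sp), spans))
```

A thread pool stands in here for the paper's 24-core shared-memory parallelism; the count agrees with a naive overlapping scan regardless of where chunk borders fall.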

    A Method of Protein Model Classification and Retrieval Using Bag-of-Visual-Features

    In this paper we propose a novel visual method for protein model classification and retrieval. Unlike conventional methods, the key idea of the proposed method is to extract image features of proteins and measure the visual similarity between proteins. First, multiview images are captured from the vertices and planes of an octahedron surrounding the protein. Second, local features are extracted from each view image by the SURF algorithm and are vector-quantized into visual words using a visual codebook. Finally, the Kullback-Leibler divergence (KLD) is employed to calculate the similarity distance between two feature vectors. Experimental results show that the proposed method performs encouragingly on protein retrieval and categorization in comparison with other methods.
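    The abstract names KLD but not its exact variant; a minimal sketch of a symmetrized, smoothed Kullback-Leibler distance between two bag-of-visual-words count vectors (the smoothing constant is an assumption) could look like:

```python
import math

def normalize(counts, eps=1e-9):
    """Turn raw visual-word counts into a smoothed probability histogram."""
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]

def kld_distance(counts_a, counts_b):
    """Symmetrized KL divergence between two visual-word histograms."""
    p, q = normalize(counts_a), normalize(counts_b)
    kl = lambda u, v: sum(ui * math.log(ui / vi) for ui, vi in zip(u, v))
    return kl(p, q) + kl(q, p)
```

Symmetrizing makes the distance independent of which protein is the query; the smoothing avoids division by zero for visual words absent from one histogram.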

    Array CGH data modeling and smoothing in Stationary Wavelet Packet Transform domain

    Background: Array-based comparative genomic hybridization (array CGH) is a highly efficient technique that allows the simultaneous measurement of genomic DNA copy number at hundreds or thousands of loci and the reliable detection of local one-copy-level variations. Characterization of these DNA copy number changes is important both for the basic understanding of cancer and for its diagnosis. To develop effective methods for identifying aberration regions in array CGH data, much recent research has focused on both smoothing-based and segmentation-based data processing. In this paper, we propose a stationary wavelet packet transform based approach to smooth array CGH data. Our purpose is to remove CGH noise across the whole frequency range while preserving the true signal by using a bivariate model. Results: On both synthetic and real CGH data, the Stationary Wavelet Packet Transform (SWPT) is the best wavelet transform for analyzing the CGH signal across the whole frequency range. We also introduce a new bivariate shrinkage model that captures the relationship between noisy CGH coefficients at two scales of the SWPT. Before smoothing, symmetric extension is applied as a preprocessing step to preserve information at the borders. Conclusions: We have designed the SWPT and SWPT-Bi methods, which use the stationary wavelet packet transform with hard thresholding and with the new bivariate shrinkage estimator, respectively, to smooth array CGH data. We demonstrate the effectiveness of our approach through theoretical and experimental exploration of a set of array CGH data, including both synthetic and real data. The comparison results show that our method outperforms previous approaches.
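    Two of the ingredients the abstract mentions, symmetric border extension and hard thresholding, can be sketched in a few lines; the SWPT itself and the bivariate shrinkage estimator are omitted, and the threshold choice below (the standard universal threshold) is an assumption rather than the paper's rule:

```python
import math
import statistics

def symmetric_extend(signal, pad):
    """Mirror both borders so the transform does not see an artificial jump."""
    return signal[:pad][::-1] + signal + signal[-pad:][::-1]

def universal_threshold(detail_coeffs):
    """sigma * sqrt(2 ln n), with sigma estimated from the median absolute value."""
    sigma = statistics.median(abs(c) for c in detail_coeffs) / 0.6745
    return sigma * math.sqrt(2 * math.log(len(detail_coeffs)))

def hard_threshold(coeffs, t):
    """Keep coefficients whose magnitude exceeds t, zero the rest."""
    return [c if abs(c) > t else 0.0 for c in coeffs]
```

Hard thresholding keeps surviving coefficients unchanged (unlike soft thresholding, which also shrinks them), which is why sharp one-copy-level steps in CGH profiles survive the smoothing.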

    A survey of models for inference of gene regulatory networks

    In this article, I present the biological background of microarray, ChIP-chip and ChIP-Seq technologies and the application of computational methods to the reverse engineering of gene regulatory networks (GRNs). The most commonly used GRN models, based on Boolean networks, Bayesian networks, relevance networks, and differential and difference equations, are described. A novel model for integrating prior biological knowledge into GRN inference is also presented. The advantages and disadvantages of the described models are compared, and the GRN validation criteria are outlined. Current trends and further directions for GRN inference using prior knowledge are given at the end of the paper.
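    For concreteness, the simplest of the surveyed formalisms, a synchronous Boolean network, can be simulated directly; the three rules below are illustrative, not taken from any real GRN:

```python
def step(state, rules):
    """Synchronous update: every gene applies its Boolean rule to the same state."""
    return tuple(rule(state) for rule in rules)

def find_attractor(state, rules):
    """Iterate until a state repeats; the repeating cycle is the attractor."""
    seen = {}
    trajectory = []
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, rules)
    return trajectory[seen[state]:]   # the cyclic part of the trajectory

# Illustrative 3-gene network: g0 = NOT g2, g1 = g0 AND g2, g2 = g0 OR g1
rules = [lambda s: 1 - s[2], lambda s: s[0] & s[2], lambda s: s[0] | s[1]]
```

Because the state space is finite and updates are deterministic, every trajectory must eventually enter a cycle; attractors are often interpreted as stable cellular phenotypes.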

    Alternative Splicing: Associating Frequency with Isoforms

    In the simplest model of protein production, a gene gives rise to a single protein: DNA is transcribed to form pre-mRNA, which is converted to mRNA by splicing out introns. The result is a chain of exons that is translated to form a protein. Alternative splicing of exons may result in the formation of multiple proteins from the same gene sequence; however, not all of these proteins may be functional. Thus, we ask whether we can predict and rank (in order of frequency of occurrence and functional importance) the set of possible proteins for a gene. Herein we describe a tool that predicts the relative frequencies of the isoforms that can be produced from a given gene.
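    The abstract does not describe the tool's internals; one naive way to frame the ranking problem is to score every ordered exon subset by per-exon inclusion frequencies. The frequencies and the independence-style scoring rule below are illustrative assumptions, not the authors' method:

```python
from itertools import combinations

def rank_isoforms(exons, inclusion):
    """Score each candidate isoform (an ordered exon subset) by the product of
    inclusion probabilities for kept exons and exclusion for skipped ones."""
    def score(iso):
        kept = set(iso)
        s = 1.0
        for e in exons:
            s *= inclusion[e] if e in kept else (1.0 - inclusion[e])
        return s
    candidates = [c for r in range(1, len(exons) + 1)
                  for c in combinations(exons, r)]
    return sorted(candidates, key=score, reverse=True)
```

Since `combinations` preserves input order, every candidate keeps the exons in genomic order, matching how splicing chains exons.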

    Determining Domain Similarity and Domain-Protein Similarity using Functional Similarity Measurements of Gene Ontology Terms

    Protein domains typically correspond to major functional sites of a protein. Determining similarity between domains can therefore aid in the comparison of protein functions and can provide a basis for grouping domains by function. One strategy for comparing domain similarity, and domain-protein similarity, is to use similarity measurements of annotation terms from the Gene Ontology (GO). In this paper, five methods are analyzed in terms of their usefulness for comparing domains, and for comparing domains to proteins, based on GO terms.
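    The five measures are not named in the abstract; as a generic skeleton, GO-based comparison methods typically combine a pairwise term similarity with a set-level aggregation such as the best-match average. The exact-match pairwise similarity below is a placeholder for a real GO semantic similarity (e.g. one based on information content):

```python
def best_match_average(terms_a, terms_b, sim):
    """Average, over both sets, of each term's best similarity in the other set."""
    best_a = [max(sim(t, u) for u in terms_b) for t in terms_a]
    best_b = [max(sim(u, t) for t in terms_a) for u in terms_b]
    return (sum(best_a) / len(best_a) + sum(best_b) / len(best_b)) / 2

# Placeholder pairwise similarity: 1 for identical GO IDs, else 0.
exact = lambda t, u: 1.0 if t == u else 0.0
```

Averaging in both directions keeps the measure symmetric and prevents a large annotation set from dominating a small one.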

    Phylogenetic reconstruction from transpositions

    Background: With the advent of high-throughput sequencing and the consequent reduction in its cost, many organisms have been completely sequenced and most of their genes identified. It has thus become possible to represent whole genomes as ordered lists of gene identifiers and to study the rearrangement of these entities through computational means. As a result, genome rearrangement data has attracted increasing attention from both biologists and computer scientists as a new type of data for phylogenetic analysis. The main genome rearrangement events are inversions, transpositions and transversions. To date, GRAPPA and MGR are the most accurate methods for rearrangement phylogeny, and both assume inversion is the only event. However, due to the complexity of computing transposition distance, it is very difficult to analyze datasets in which transpositions are dominant. Results: We extend GRAPPA to handle transpositions. The new method, named GRAPPA-TP, has two major extensions: a heuristic method to estimate transposition distance, and a new transposition median solver for three genomes. Although GRAPPA-TP uses a greedy approach to compute the transposition distance, it is very accurate when genomes are relatively close. GRAPPA-TP is available at http://phylo.cse.sc.edu/. Conclusion: Our extensive testing on simulated datasets shows that GRAPPA-TP is very accurate in terms of ancestral genome inference and phylogenetic reconstruction. The simulation results also suggest that model match is critical in genome rearrangement analysis: it is not accurate to simulate transpositions with other events, including inversions.
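    GRAPPA-TP's distance heuristic is not spelled out in the abstract; a standard, easily computed bound is that one transposition removes at most three breakpoints, so the breakpoint count divided by three (rounded up) lower-bounds the transposition distance:

```python
import math

def breakpoints(perm):
    """Breakpoints of a permutation of 1..n relative to the identity, framed by
    0 and n+1; an adjacency means consecutive increasing values."""
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for i in range(len(ext) - 1) if ext[i + 1] - ext[i] != 1)

def transposition_lower_bound(perm):
    """Each transposition removes at most 3 breakpoints, hence the bound."""
    return math.ceil(breakpoints(perm) / 3)
```

For example, `[3, 4, 1, 2]` has three breakpoints and is indeed one block transposition away from the identity, so the bound of 1 is tight there.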

    Selecting subsets of newly extracted features from PCA and PLS in microarray data analysis

    Background: Dimension reduction is a critical issue in the analysis of microarray data, because the high dimensionality of gene expression microarray data hurts the generalization performance of classifiers. Dimension reduction methods are of two types: feature selection and feature extraction. Principal component analysis (PCA) and partial least squares (PLS) are two frequently used feature extraction methods; in previous work, the top several components of PCA or PLS were selected for modeling in descending order of eigenvalue. In this paper, we show that not all of the top components are useful; instead, features should be selected from all the components by feature selection methods. Results: We demonstrate a framework for selecting feature subsets from all the newly extracted components, leading to reduced classification error rates on gene expression microarray data. We consider both an unsupervised method (PCA) and a supervised method (PLS) for extracting new components, genetic algorithms for feature selection, and support vector machines and k-nearest neighbor for classification. Experimental results illustrate that the proposed framework is effective for selecting feature subsets and reducing classification error rates. Conclusion: The top components extracted by PCA or PLS are not the only important ones; feature selection should therefore be performed to select subsets of the new features, improving the generalization performance of classifiers.
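    The genetic algorithm is described only at a high level; a minimal GA over feature-subset bitmasks, with a toy fitness standing in for the paper's cross-validated SVM/k-NN error, might look like this (population size, rates, and the toy fitness are all assumptions):

```python
import random

def ga_select(n_feats, fitness, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Evolve boolean feature masks; elitism keeps the top half each round."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_feats)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_feats)            # one-point crossover
            child = [bit ^ (rng.random() < p_mut)      # bit-flip mutation
                     for bit in a[:cut] + b[cut:]]
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

# Toy fitness: reward three 'informative' components, lightly penalize subset size.
INFORMATIVE = {0, 3, 7}
def toy_fitness(mask):
    hits = sum(1 for i, b in enumerate(mask) if b and i in INFORMATIVE)
    return hits - 0.01 * sum(mask)
```

Because the elite half is copied unmutated, the best fitness in the population never decreases; in the paper's setting the fitness would be the held-out classification accuracy of the masked feature set.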

    Analyzing adjuvant radiotherapy suggests a non-monotonic radio-sensitivity over tumor volumes

    Background: Adjuvant radiotherapy (RT) after surgical removal of tumors has proved beneficial for long-term tumor control and treatment planning. For many years it has been held that the radio-sensitivity of tumors under radiotherapy decreases with tumor size, and RT models based on Poisson statistics have been used extensively to validate clinical data. Results: We found that the Poisson statistics used in RT were actually derived from bacterial cells, despite many validations against clinical data. Cancerous cells, however, exhibit abnormal cellular communication and use chemical messengers to signal both surrounding normal and cancerous cells to develop new blood vessels and, in general, to invade, metastasize, and overcome intercellular spatial confinement. We therefore investigated the cell-killing effects of adjuvant RT and found that radio-sensitivity is not, in fact, a monotonic function of volume, as was previously believed. We present a detailed analysis and explanation to justify this statement, and, based on the equivalent uniform dose (EUD), we present an equivalent radio-sensitivity model. Conclusion: We conclude that radio-sensitivity is a complex function of tumor volume, since tumor response to radiotherapy also depends on cellular communication.
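    The Poisson RT model the abstract critiques has a compact standard form: if a dose D leaves N·exp(-αD) surviving clonogenic cells on average, tumor control is the probability that none survive. A sketch with illustrative parameter values (the specific N and α below are assumptions for demonstration only):

```python
import math

def poisson_tcp(n_clonogens, alpha, dose):
    """Classic Poisson tumor control probability: P(zero surviving clonogens)
    when the expected number of survivors is n_clonogens * exp(-alpha * dose)."""
    expected_survivors = n_clonogens * math.exp(-alpha * dose)
    return math.exp(-expected_survivors)
```

With, say, 1e9 clonogens and α = 0.3 per Gy, control probability rises steeply with dose; the paper's point is that this bacterial-cell-derived model ignores intercellular signaling, so the true volume dependence is more complex than such formulas suggest.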