1,028 research outputs found
Estimation of Similarity between DNA Sequences and Its Graphical Representation
Bioinformatics, now a well-known field of study, originated in the
context of biological sequence analysis. Recently, graphical representation
has been adopted in research on DNA sequences. Research on biological sequences
is mainly based on their function and structure. Bioinformatics finds a wide
range of applications, specifically in the domain of molecular biology, which
focuses on the analysis of molecules such as DNA, RNA, and proteins. In this
review, we mainly deal with similarity analysis between sequences and the
graphical representation of DNA sequences.
Comment: 8 pages, 13 figures, 4 tables
Graphical Representation of Biological Sequences
Sequence comparison is one of the most fundamental tasks in bioinformatics. For biological sequence comparison, alignment is the most effective method when the sequences are not too long. However, because the time complexity of alignment is quadratic in the sequence length, aligning long sequences requires a large amount of computation time. Alignment-free sequence comparison methods are therefore needed to compare sequences such as whole genomes in practical time. In this chapter, we review the graphical representation of biological sequences, which is one of the major alignment-free sequence comparison methods. We also describe the notable effects of weighting in the graphical representation, first introduced by the author and co-workers.
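To make the quadratic cost mentioned above concrete, here is a minimal sketch of a Needleman-Wunsch style global alignment score. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions and are not taken from the chapter; the point is only that the nested loops visit every pair of positions, so the running time grows with the product of the two sequence lengths.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Scoring values are illustrative assumptions. Only two rows of the
    dynamic-programming table are kept, so memory is O(len(b)) while
    time remains O(len(a) * len(b)).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

Doubling both sequence lengths roughly quadruples the number of inner-loop iterations, which is why whole-genome comparisons tend to rely on alignment-free methods instead.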
Empirical Relationship between Intra-Purine and Intra-Pyrimidine Differences in Conserved Gene Sequences
DNA sequences seen in the normal character-based representation appear to be a formidable mixture of the four nucleotides without any apparent order. Nucleotide frequencies and distributions in the sequences have been studied extensively since Chargaff, almost a century ago, gave the simple rule that equates the total number of purines to the total number of pyrimidines in a duplex DNA sequence. While it is difficult to trace any relationship between the bases from studies of the character representation of a DNA sequence, graphical representations may provide a clue. These novel representations of DNA sequences have been useful in providing an overview of the base distribution and composition of the sequences and in providing insights into many hidden structures. We report here our observation, based on a graphical representation, that the intra-purine and intra-pyrimidine differences in sequences of conserved genes generally follow a quadratic distribution relationship, and show that this may have arisen from mutations in the sequences over evolutionary time scales. From this hitherto undescribed relationship for the gene sequences considered in this report, we hypothesize that such relationships may be characteristic of these sequences and therefore could become a barrier to large-scale sequence alterations that override such characteristics, perhaps through some monitoring process inbuilt in the DNA sequences. Such a relationship also raises the possibility of intron sequences playing an important role in maintaining the characteristics and could be indicative of possible intron-late phenomena.
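The intra-purine and intra-pyrimidine differences discussed above can be illustrated with a simple 2D walk. This is a minimal sketch, assuming that the purine difference is tracked as (count of G minus count of A) on the x-axis and the pyrimidine difference as (count of C minus count of T) on the y-axis. The axis assignment and the function name are assumptions for illustration and may differ from the representation the authors used.

```python
def purine_pyrimidine_walk(seq):
    """Return the cumulative (G - A, C - T) coordinates along a DNA sequence.

    The axis convention is an assumption: G and A move along x,
    C and T move along y. Characters other than A, C, G, T are skipped.
    """
    x = y = 0
    points = [(0, 0)]  # the walk starts at the origin
    for base in seq.upper():
        if base == "G":
            x += 1
        elif base == "A":
            x -= 1
        elif base == "C":
            y += 1
        elif base == "T":
            y -= 1
        else:
            continue  # ambiguous or unknown symbol, ignored in this sketch
        points.append((x, y))
    return points


if __name__ == "__main__":
    walk = purine_pyrimidine_walk("GGACTT")
    print(walk[-1])  # final (intra-purine, intra-pyrimidine) differences
```

The last point of the walk gives the net intra-purine and intra-pyrimidine differences for the whole sequence, and plotting all points gives the kind of curve that graphical representations use for visual comparison.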
An efficient and accurate framework for large-scale sequences of DNA barcodes
Integrated master's dissertation in Informatics Engineering
DNA barcodes are short sequences of pre-defined gene regions that contain a sufficient
amount of intra- and inter-species genetic information. High-throughput sequencing techniques are currently used to identify large sequences of DNA barcodes in a species' genome in a relatively short time.
Domain experts require adequate self-contained tools to accurately and efficiently process
DNA barcode data in a reasonable time, taking advantage of current parallel and heterogeneous computing systems. They also expect to use these tools on different computing platforms, from laptops to high-performance servers, without requiring a broad knowledge in software engineering to develop efficient computational applications.
The main goal of this project was to develop a framework and associated user-friendly tools
for domain experts to efficiently support DNA barcoding studies, providing an abstraction
of the performance issues.
4SpecID is the key outcome of this work: an application software that integrates a
semi-automated auditing and annotation tool for reference libraries, to ensure the quality
standards of the compiled data, aiming to enable a grounded decision when identifying
species from DNA barcodes. Its graphical interface helps the end user specify the operations
and also simplifies data filtering and remote file handling.
The C++ ported version (from MATLAB) was fully tested and is more robust than
the original version. Architecture features common to laptop and compute servers were
exploited, namely parallel programming techniques and memory models.
The presented validation and performance results show significant improvements in
execution times, not only in the sequential version, but also when using the available parallel
capabilities of the underlying computing platforms.
Ab initio RNA folding
RNA molecules are essential cellular machines performing a wide variety of
functions for which a specific three-dimensional structure is required. Over
the last several years, experimental determination of RNA structures through
X-ray crystallography and NMR seems to have reached a plateau in the number of
structures resolved each year, but as more and more RNA sequences are being
discovered, the need for structure prediction tools to complement experimental data
is strong. Theoretical approaches to RNA folding have been developed since the
late nineties when the first algorithms for secondary structure prediction
appeared. Over the last 10 years a number of prediction methods for 3D
structures have been developed, first based on bioinformatics and data-mining,
and more recently based on a coarse-grained physical representation of the
systems. In this review we are going to present the challenges of RNA structure
prediction and the main ideas behind bioinformatic approaches and physics-based
approaches. We will focus on the description of the more recent physics-based
phenomenological models and on how they are built to include the specificity of
the interactions of RNA bases, whose role is critical in folding. Through
examples from different models, we will point out the strengths of
physics-based approaches, which are able not only to predict equilibrium
structures, but also to investigate dynamical and thermodynamical behavior, and
the open challenges to include more key interactions ruling RNA folding.
Comment: 28 pages, 18 figures
DV-Curve Representation of Protein Sequences and Its Application
Based on the detailed hydrophobic-hydrophilic (HP) model of amino acids, we propose a dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent each letter of a protein sequence. This graphical representation not only avoids degeneracy, but also visualizes well no matter how long the sequences are, and reflects the length of the protein sequence. We then transform the 2D graphical representation into a numerical characterization that facilitates quantitative comparison of protein sequences. The utility of this approach is illustrated by two examples: one is a similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures; the other is a phylogenetic analysis of coronaviruses based on their spike proteins.
Optimization of Algorithms for Triplex Detection
Triplex-forming DNA sequences have been implicated as important players in several key processes, such as transcriptional regulation, DNA recombination and mutagenesis, which emphasizes their importance for biology, biotechnology and medicine. This bachelor's thesis optimizes a recently published dynamic programming algorithm for the identification of potential intramolecular triplex-forming sequences on three levels of design: user interface, memory usage and computation time. On the level of the user interface, the algorithm was extended with existing visualization functions and rewritten as an R/Bioconductor package. Optimization of memory usage and processor cache use, in combination with a reduction of the computation based on analysis of the current computation state, led to a more than threefold speedup over the original implementation.
Analysis of Similarity/Dissimilarity of DNA Sequences Based on Chaos Game Representation
The Chaos Game is an algorithm that produces pictures of fractal structures. Considering that the four bases A, G, C, and T of DNA sequences can be divided into three classes according to their chemical structure, we propose different kinds of CGR-walk sequences. Based on the CGR coordinates of random sequences, we introduce some invariants for DNA primary sequences. As an application, we examine the similarity/dissimilarity among the first exons of the β-globin gene of different species. The results indicate that our method is efficient and can extract more biological information.
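For readers unfamiliar with the Chaos Game Representation (CGR), the following is a minimal sketch of the standard construction, not of the CGR-walk variants proposed in the abstract. It assumes the commonly used corner assignment A=(0,0), C=(0,1), G=(1,1), T=(1,0) and a starting point at the centre of the unit square; each base moves the current point halfway toward its corner.

```python
# Commonly used corner assignment for CGR (an assumption of this sketch).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the list of CGR coordinates for a DNA sequence.

    Starts at the centre of the unit square and, for each base, moves
    halfway toward that base's corner. Unknown symbols are skipped.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # ignore ambiguous symbols in this sketch
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("AGCT"):
        print(point)
```

Because each step halves the distance to a corner, the position of a point encodes the recent history of the sequence, which is what makes coordinate-based invariants usable for similarity analysis.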
Human Promoter Prediction Using DNA Numerical Representation
With the emergence of genomic signal processing, numerical representation techniques for the DNA alphabet set {A, G, C, T} play a key role in applying digital signal processing and machine learning techniques to the processing and analysis of DNA sequences. The choice of the numerical representation of a DNA sequence affects how well its biological properties are reflected in the numerical domain for the detection and identification of the characteristics of special regions of interest within the DNA sequence. This dissertation presents a comprehensive study of various DNA numerical and graphical representation methods and their applications in processing and analyzing long DNA sequences. Discussions of the relative merits and demerits of the various methods, experimental results and possible future developments are also included. Another focus of the research is promoter prediction in human (Homo sapiens) DNA sequences with a neural-network-based multi-classifier system using DNA numerical representation methods. In spite of the recent development of several computational methods for human promoter prediction, there is a need for performance improvement. In particular, the high false positive rate of the feature-based approaches decreases the prediction reliability and leads to erroneous results in gene annotation. To improve the prediction accuracy and reliability, DigiPromPred, a numerical-representation-based promoter prediction system, is proposed to characterize DNA alphabets in different regions of a DNA sequence. The DigiPromPred system is found to predict promoters with a sensitivity of 90.8% while reducing the false prediction rate for non-promoter sequences, with a specificity of 90.4%. The comparative study with state-of-the-art promoter prediction systems for human chromosome 22 shows that our proposed system maintains a good balance between prediction accuracy and reliability.
To reduce the system architecture and computational complexity compared to the existing system, a simple feed-forward neural network classifier known as SDigiPromPred is proposed. The SDigiPromPred system is found to predict promoters with sensitivities of 87%, 87%, and 99% while reducing the false prediction rate for non-promoter sequences, with specificities of 92%, 94%, and 99%, for human, Drosophila, and Arabidopsis sequences respectively, and with reconfigurable capability compared to the existing system.
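As an illustration of what a numerical representation of the alphabet {A, G, C, T} can look like, here is a minimal sketch of the binary indicator (Voss) encoding, one widely used scheme. The abstract does not say which representations DigiPromPred relies on, so this choice and the function name are assumptions made for illustration only.

```python
def voss_encode(seq, alphabet="ACGT"):
    """Map a DNA sequence to four binary indicator sequences.

    For each base in the alphabet, the indicator sequence holds 1 at
    positions where that base occurs and 0 elsewhere.
    """
    seq = seq.upper()
    return {base: [1 if symbol == base else 0 for symbol in seq]
            for base in alphabet}


if __name__ == "__main__":
    encoded = voss_encode("ACGA")
    for base, indicator in encoded.items():
        print(base, indicator)
```

Each indicator sequence can then be fed to signal processing tools or used as a neural network input, which is the general idea behind applying digital signal processing to DNA.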