18 research outputs found

    tacg – a grep for DNA

    BACKGROUND: Pattern matching is the core of bioinformatics; it is used in database searching, restriction enzyme mapping, and finding open reading frames. It is done repeatedly over increasingly long sequences, so code must be efficient and insensitive to sequence length. Patterns of interest include simple motifs with IUPAC degeneracies, regular expressions, patterns allowing mismatches, and probability matrices. RESULTS: I describe a small application that searches for each of the above pattern types individually and also allows these atomic motifs to be assembled into logical rules for more sophisticated analysis. CONCLUSION: tacg is small, portable, faster and more capable than most alternatives, relatively easy to modify, and freely available in source code
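As an illustration of the simplest pattern type named in this abstract, a motif with IUPAC degeneracies can be expanded into an ordinary regular expression. The sketch below is not tacg's implementation; the function names `iupac_to_regex` and `find_motif` are hypothetical, and only the standard IUPAC nucleotide ambiguity codes are assumed.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate a degenerate motif such as 'GTYRAC' into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(motif, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # The lookahead (?=...) consumes no characters, so overlapping hits are reported.
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    print(find_motif("GAATTC", "AAGAATTCGGGAATTC"))  # exact motif
    print(find_motif("GTYRAC", "GTCGACGTTAAC"))      # degenerate motif
```

This searches one strand only; a fuller search would also scan the reverse complement.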

    PatMatch: a program for finding patterns in peptide and nucleotide sequences

    Here, we present PatMatch, an efficient, web-based pattern-matching program that enables searches for short nucleotide or peptide sequences such as cis-elements in nucleotide sequences or small domains and motifs in protein sequences. The program can be used to find matches to a user-specified sequence pattern that can be described using ambiguous sequence codes and a powerful and flexible pattern syntax based on regular expressions. A recent upgrade has improved performance and now supports both mismatches and wildcards in a single pattern. This enhancement has been achieved by replacing the previous searching algorithm, scan_for_matches [D'Souza et al. (1997), Trends in Genetics, 13, 497–498], with nondeterministic-reverse grep (NR-grep), a general pattern matching tool that allows for approximate string matching [Navarro (2001), Software Practice and Experience, 31, 1265–1312]. We have tailored NR-grep to be used for DNA and protein searches with PatMatch. The stand-alone version of the software can be adapted for use with any sequence dataset and is available for download from The Arabidopsis Information Resource (TAIR). The PatMatch server is available on the web for searching Arabidopsis thaliana sequences

    Genomic sequence derived simple sequence repeat markers: case study with Medicago spp

    Simple sequence repeats (SSR) or micro-satellites are becoming standard DNA markers for plant genome analysis and are being used as markers in marker assisted breeding. De novo generation of micro-satellite markers through laboratory-based screening of SSR-enriched genomic libraries is highly time consuming and expensive. An alternative is to screen the public databases of related model species where abundant sequence data is already available. All the genomic sequences of Medicago from the public domain database were searched and analysed for di-, tri-, and tetranucleotide repeats. Of the total of about 156,000 sequences which were searched, 7325 sequences were found to contain repeat motifs and may yield SSR markers with product sizes of around 200 bp. Of these, the most abundantly found repeats were the tri-nucleotide (5210) group. Except for a very small proportion (436), these link to the gene annotation database at TIGR (http://www.tigr.org)

    Genomic sequence derived simple sequence repeats markers: A case study with Medicago spp.

    Simple sequence repeats (SSR) or micro-satellites are becoming standard DNA markers for plant genome analysis and are being used as markers in marker assisted breeding. De novo generation of micro-satellite markers through laboratory-based screening of SSR-enriched genomic libraries is highly time consuming and expensive. An alternative is to screen the public databases of related model species where abundant sequence data is already available. All the genomic sequences of Medicago from the public domain database were searched and analysed for di-, tri-, and tetranucleotide repeats. Of the total of about 156,000 sequences which were searched, 7325 sequences were found to contain repeat motifs and may yield SSR markers with product sizes of around 200 bp. Of these, the most abundantly found repeats were the tri-nucleotide (5210) group. Except for a very small proportion (436), these link to the gene annotation database at TIGR (http://www.tigr.org). To facilitate further exploration of this resource, a dynamic database with options to search and link to other resources is available at (http://www.icrisat.org/text/research/grep/homepage/genomics/medssrs1.asp) and on CDs from [email protected]

    Genomic sequence derived simple sequence repeats markers. A case study with medicago spp.

    Simple sequence repeats (SSR) or microsatellites are becoming standard DNA markers for plant genome analysis and are being used as markers in marker assisted breeding. De novo generation of microsatellite markers through laboratory-based screening of SSR-enriched genomic libraries is highly time consuming and expensive. An alternative is to screen the public databases of related model species where abundant sequence data is already available. All the genomic sequences of Medicago from the public domain database were searched and analysed for di-, tri- and tetranucleotide repeats. Of the total 156 000 sequences which were searched, 7325 sequences contained repeat motifs and may yield SSR markers with product sizes of around 200 bp. Of these, the most abundantly found repeats were the trinucleotide (5210) group. Except for a very small proportion (436), these link to the gene annotation database at TIGR (http://www.tigr.org). To facilitate further exploration of this resource, a dynamic database with options to search and link to other resources is available at (http://www.icrisat.org/text/research/grep/homepage/genomics/medssrs1.asp) and on CDs from [email protected]
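The kind of database screen described in the Medicago abstracts, a search for di-, tri- and tetranucleotide repeats, can be sketched with a back-referencing regular expression. This is not the authors' pipeline: the function name `find_ssrs` and the `min_repeats=5` threshold are assumptions made for illustration, since the abstracts do not state the repeat criteria used.

```python
import re


def find_ssrs(sequence, min_repeats=5):
    """Return (start, unit, repeat_count) tuples for di-, tri- and
    tetranucleotide tandem repeats with at least `min_repeats` copies.

    `min_repeats` is an illustrative threshold, not one taken from the study.
    """
    sequence = sequence.upper()
    hits = []
    for unit_len in (2, 3, 4):
        # One captured unit followed by at least (min_repeats - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for m in pattern.finditer(sequence):
            unit = m.group(1)
            # Skip units that are repeats of a shorter unit, e.g. "AAA" or "ATAT".
            if any(
                unit == unit[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            hits.append((m.start(), unit, len(m.group(0)) // unit_len))
    return hits


if __name__ == "__main__":
    demo = "GGC" + "AT" * 6 + "CCT" + "CAG" * 5 + "TT"
    for start, unit, count in find_ssrs(demo):
        print(start, unit, count)
```

Results are grouped by unit length and then by position. Designing primers around each hit, as needed to obtain products of a given size, is a separate step not shown here.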

    Comparative genomics of Drosophila and human core promoters

    BACKGROUND: The core promoter region plays a critical role in the regulation of eukaryotic gene expression. We have determined the non-random distribution of DNA sequences relative to the transcriptional start site in Drosophila melanogaster promoters to identify sequences that may be biologically significant. We compare these results with those obtained for human promoters. RESULTS: We determined the distribution of all 65,536 octamer (8-mer) DNA sequences in 10,914 Drosophila promoters and two sets of human promoters aligned relative to the transcriptional start site. In Drosophila, 298 8-mers have highly significant (p ≤ 1 × 10⁻¹⁶) non-random distributions peaking within 100 base-pairs of the transcriptional start site. These sequences were grouped into 15 DNA motifs. Ten motifs, termed directional motifs, occur only on the positive strand while the remaining five motifs, termed non-directional motifs, occur on both strands. The only directional motifs to localize in human promoters are TATA, INR, and DPE. The directional motifs were further subdivided into those precisely positioned relative to the transcriptional start site and those that are positioned more loosely relative to the transcriptional start site. Similar numbers of non-directional motifs were identified in both species and most are different. The genes associated with all 15 DNA motifs, when they occur in the peak, are enriched in specific Gene Ontology categories and show a distinct mRNA expression pattern, suggesting that there is a core promoter code in Drosophila. CONCLUSION: Drosophila and human promoters use different DNA sequences to regulate gene expression, supporting the idea that evolution occurs by the modulation of gene regulation
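The figure of 65,536 octamers in this abstract is simply 4^8, the number of possible 8-mers over the alphabet A, C, G, T. A minimal sketch of the counting step, tallying where each 8-mer occurs across promoters aligned to the transcriptional start site, might look like the following. The function name `kmer_position_counts` is hypothetical, the input is assumed to be equal-length sequences already aligned to the start site, and the significance test used in the study is not reproduced.

```python
from collections import defaultdict
from itertools import product

K = 8
# Enumerate every possible octamer: 4**8 = 65,536 sequences.
ALL_KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_position_counts(promoters, k=K):
    """Map each k-mer to {offset: count} across aligned promoter sequences.

    Assumes all sequences have the same length and are aligned so that a
    given index corresponds to the same position relative to the start site.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in promoters:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Ignore windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer][i] += 1
    return counts


if __name__ == "__main__":
    print(len(ALL_KMERS))
    toy = ["ACGTACGTAC", "TTACGTACGT"]
    print(dict(kmer_position_counts(toy)["ACGTACGT"]))
```

A non-random positional distribution would then be assessed by comparing each k-mer's per-offset counts against a background expectation, which is outside the scope of this sketch.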

    Exact Pattern Matching with Feed-Forward Bloom Filters


    Logol: Modelling evolving sequence families through a dedicated constrained string language

    The report reviews the key milestones that have been reached so far in applying formal languages to the analysis of genomic sequences. Then it introduces a new modelling language, Logol, that aims at expressing more easily complex structures on genomic sequences. It is based on a development of String Variable Grammars, a formal framework proposed by D. Searls

    Aplicación bioinformática para predicción de genes regulados por microARNs en plantas (Bioinformatics application for predicting genes regulated by microRNAs in plants)

    MicroRNAs (miRNAs) are non-coding RNAs that regulate gene expression in animals and plants and are involved in a wide variety of biological processes, such as development, differentiation, and metabolism. These small RNAs of approximately 21 nucleotides recognize partially complementary sequences in target mRNAs, causing their cleavage or translational arrest. MicroRNAs have quickly moved to the forefront of the scientific community's interest as a new level of control of gene expression in eukaryotes. Recent studies have shown that microRNAs are closely involved in several important diseases. Some are linked to different types of cancer, and others are related to heart diseases in which the expression levels of specific microRNAs change in the human heart when those diseases are present. Current estimates suggest that between 20% and 40% of human genes are regulated by microRNAs. This work proposes to study microRNAs in plants, their biogenesis, and the genes they regulate in an automated way, through a multidisciplinary approach. To this end, we will present a bioinformatics strategy for identifying microRNA target genes, as well as a web tool for analysing and selecting the best candidate target genes. Given that many of these small RNAs are widely distributed in plants, the tool to be developed will be based mainly on the evolutionary conservation of the microRNA-target gene pair interaction across different species. In addition, because the microRNAs used in this project are conserved in species of agronomic interest, the potential applications of the proposed work could be immediate. Affiliation: Chorostecki, Uciel. Thesis student, Departamento de Ciencias de la Computación, Facultad de Ciencias Exactas, Ingeniería y Agrimensura, Universidad Nacional de Rosario; Argentina

    Utilization of near-isogenic lines to identify genes underlying iron-efficiency QTL

    Nutrient deficiencies are a significant abiotic stress of soybean. Iron deficiency chlorosis is a major concern in the upper midwestern region of the United States due to the prevalence of calcareous soils. Soybean's susceptibility to iron stress results in yield losses into the hundreds of millions each year. Understanding the molecular differences between resistant and susceptible cultivars will significantly affect future yield and revenue. Through the use of near-isogenic lines (NILs), molecular markers, and gene expression we have identified the donor parent introgressions through both classical SSR mapping and a novel method of SNP clustering, which can be performed using data either generated through chip-based SNP genotyping platforms or identified de novo through re-sequencing techniques. By aligning the newly constructed introgression map with the previously identified Fe efficiency QTL we identified a region on chromosome 3 where the two were positionally coincident. To further narrow this region of interest, the NIL was backcrossed an additional generation to the recurrent parent in order to identify recombinations within the chromosome 3 introgression. These lines were identified as Sub-NILs. Recombinants were identified in regular intervals throughout the introgression and phenotyped. Donor parent alleles identified within a 250 kb region represented the minimum interval differentiating the efficient and inefficient Sub-NILs. A second NIL sharing the same donor parent was screened for introgressions. The only introgressions in which the two NILs shared alleles from the donor parent were localized to the same region on chromosome 3, further supporting the importance of these alleles. Eighteen genes were annotated within the region and were screened for gene expression differences in soybean roots 24 hours following the removal of iron from the growth medium. Two of the genes were differentially expressed between sufficient and insufficient iron conditions. Interestingly, these genes are homologs of two transcription factors in Arabidopsis thaliana known to function in the iron response pathway. Sanger sequencing of these two genes identified a significant mutation that deletes 4 amino acids in the susceptible lines. We hypothesize that this deletion disrupts the FIT / bHLH heterodimer that has been shown to induce known iron acquisition genes