17 research outputs found

    AphanoDB: a genomic resource for Aphanomyces pathogens.

    BACKGROUND: The Oomycete genus Aphanomyces comprises devastating plant and animal pathogens. However, little is known about the molecular mechanisms underlying pathogenicity of Aphanomyces species. In this study, we report on the development of a public database called AphanoDB, which is dedicated to Aphanomyces genomic data. As a first step, a large collection of Expressed Sequence Tags was obtained from the legume pathogen A. euteiches, which was then processed and collected into AphanoDB. DESCRIPTION: Two cDNA libraries of A. euteiches were created: one from mycelium growing on synthetic medium and one from mycelium grown in contact with root tissues of the model legume Medicago truncatula. From these libraries, 18,684 expressed sequence tags were obtained and assembled into 7,977 unigenes, which were compared to public databases for annotation. Queries on AphanoDB allow users to retrieve information for each unigene, including similarity to known protein sequences, protein domains and Gene Ontology classification. Statistical analysis of EST frequency from the two different growth conditions was also added to the database. CONCLUSION: AphanoDB is a public database with a user-friendly web interface. The sequence report pages are the main web interface and provide all annotation details for each unigene. These interactive sequence report pages are easily available through text, BLAST, Gene Ontology and expression profile search utilities. AphanoDB is available from URL: http://www.polebio.scsv.ups-tlse.fr/aphano/

    A Bayesian Nonparametric Method for Prediction in EST Analysis

    In this work we propose a Bayesian nonparametric approach for tackling statistical problems related to EST surveys. In particular, we provide estimates for: a) the coverage, defined as the proportion of unique genes in the library represented in the given sample of reads; b) the number of new unique genes to be observed in a future sample; c) the discovery rate of new genes as a function of the future sample size. The Bayesian nonparametric model we adopt conveys, in a statistically rigorous way, the available information into prediction. Our proposal has appealing properties over frequentist nonparametric methods, which become unstable when prediction is required for large future samples. EST libraries studied in Susko and Roger (2004), with frequentist methods, are analyzed in detail.

    An EST resource for tilapia based on 17 normalized libraries and assembly of 116,899 sequence tags

    BACKGROUND: Large collections of expressed sequence tags (ESTs) are a fundamental resource for analysis of gene expression and annotation of genome sequences. We generated 116,899 ESTs from 17 normalized and two non-normalized cDNA libraries representing 16 tissues from tilapia, a cichlid fish widely used in aquaculture and biological research. RESULTS: The ESTs were assembled into 20,190 contigs and 36,028 singletons for a total of 56,218 unique sequences and a total assembled length of 35,168,415 bp. Over the whole project, a unique sequence was discovered for every 2.079 sequence reads. 17,722 (31.5%) of these unique sequences had significant BLAST hits (e-value < $10^{-10}$) to the UniProt database. CONCLUSION: Normalization of the cDNA pools with double-stranded nuclease allowed us to efficiently sequence a large collection of ESTs. These sequences are an important resource for studies of gene expression, comparative mapping and annotation of the forthcoming tilapia genome sequence.
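
The summary figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below only recomputes the ratios from the numbers quoted above; the variable names are illustrative and not taken from the paper.

```python
# Figures quoted in the tilapia EST abstract.
total_ests = 116_899
contigs = 20_190
singletons = 36_028
blast_hits = 17_722

# Unique sequences are contigs plus singletons.
unique_sequences = contigs + singletons

# Average number of reads needed per unique sequence discovered.
reads_per_unique = total_ests / unique_sequences

# Share of unique sequences with a significant BLAST hit, in percent.
hit_percent = 100 * blast_hits / unique_sequences

print(unique_sequences)            # 56218
print(round(reads_per_unique, 3))  # 2.079
print(round(hit_percent, 1))       # 31.5
```

The recomputed values agree with the 56,218 unique sequences, 2.079 reads per unique sequence, and 31.5% reported in the abstract.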

    Large deviation principles for the Ewens-Pitman sampling model

    Let $M_{l,n}$ be the number of blocks with frequency $l$ in the exchangeable random partition induced by a sample of size $n$ from the Ewens-Pitman sampling model. We show that, as $n$ tends to infinity, $n^{-1}M_{l,n}$ satisfies a large deviation principle and we characterize the corresponding rate function. A conditional counterpart of this large deviation principle is also presented. Specifically, given an initial sample of size $n$ from the Ewens-Pitman sampling model, we consider an additional sample of size $m$. For any fixed $n$ and as $m$ tends to infinity, we establish a large deviation principle for the conditional number of blocks with frequency $l$ in the enlarged sample, given the initial sample. Interestingly, the conditional and unconditional large deviation principles coincide, namely there is no long-lasting impact of the given initial sample. Potential applications of our results are discussed in the context of Bayesian nonparametric inference for discovery probabilities.

    Gene capture prediction and overlap estimation in EST sequencing from one or multiple libraries

    BACKGROUND: In expressed sequence tag (EST) sequencing, we are often interested in how many genes we can capture in an EST sample of a targeted size. This information provides insight into sequencing efficiency in experimental design, as well as clues to the diversity of expressed genes in the tissue from which the library was constructed. RESULTS: We propose a compound Poisson process model that can accurately predict the gene capture in a future EST sample based on an initial EST sample. It also allows estimation of the number of expressed genes in one cDNA library or co-expressed in two cDNA libraries. The superior performance of the new prediction method over an existing approach is established by a simulation study. Our analysis of four Arabidopsis thaliana EST sets suggests that the number of expressed genes present in four different cDNA libraries of Arabidopsis thaliana varies from 9155 (root) to 12005 (silique). An observed fraction of co-expressed genes in two different EST sets as low as 25% can correspond to an actual overlap fraction greater than 65%. CONCLUSION: The proposed method provides a convenient tool for gene capture prediction and cDNA library property diagnosis in EST sequencing.

    Rediscovery of Good-Turing estimators via Bayesian nonparametrics

    The problem of estimating discovery probabilities originated in the context of statistical ecology, and in recent years it has become popular due to its frequent appearance in challenging applications arising in genetics, bioinformatics, linguistics, design of experiments, machine learning, etc. A full range of statistical approaches, parametric and nonparametric as well as frequentist and Bayesian, has been proposed for estimating discovery probabilities. In this paper we investigate the relationships between the celebrated Good-Turing approach, which is a frequentist nonparametric approach developed in the 1940s, and a Bayesian nonparametric approach recently introduced in the literature. Specifically, under the assumption of a two parameter Poisson-Dirichlet prior, we show that Bayesian nonparametric estimators of discovery probabilities are asymptotically equivalent, for a large sample size, to suitably smoothed Good-Turing estimators. As a by-product of this result, we introduce and investigate a methodology for deriving exact and asymptotic credible intervals to be associated with the Bayesian nonparametric estimators of discovery probabilities. The proposed methodology is illustrated through a comprehensive simulation study and the analysis of Expressed Sequence Tags data generated by sequencing a benchmark complementary DNA library.
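
For readers unfamiliar with the Good-Turing approach mentioned in this abstract, here is a minimal sketch of the classical, unsmoothed estimator: the probability that the next draw is a species seen exactly l times is estimated by (l + 1) * m_{l+1} / n, where m_j is the number of species seen exactly j times and n is the sample size. The function name and the toy counts are hypothetical, and the smoothing discussed in the paper is not reproduced here.

```python
from collections import Counter


def good_turing_discovery(species_counts):
    """Build the unsmoothed Good-Turing estimator from per-species counts.

    Returns a function estimate(l) giving the estimated probability that
    the next draw belongs to a species observed exactly l times so far
    (l = 0 corresponds to discovering a new species).
    """
    n = sum(species_counts)
    # m[j] = number of species observed exactly j times in the sample
    m = Counter(species_counts)

    def estimate(l):
        return (l + 1) * m.get(l + 1, 0) / n

    return estimate


# Toy sample (made up): six species with these observed read counts.
counts = [1, 1, 1, 2, 2, 5]
estimate = good_turing_discovery(counts)

print(estimate(0))  # estimated probability of a new species: 0.25
```

In this toy sample three of the twelve reads are singletons, so the estimated probability of discovering a new species on the next draw is 3/12. The estimates over all l sum to one by construction, but an individual estimate is zero whenever no species has frequency l + 1, which is one reason smoothed variants are used in practice.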

    Bayesian nonparametric inference for discovery probabilities: credible intervals and large sample asymptotics

    Given a sample of size $n$ from a population of individuals belonging to different species with unknown proportions, a popular problem of practical interest consists in making inference on the probability $D_{n}(l)$ that the $(n+1)$-th draw coincides with a species with frequency $l$ in the sample, for any $l=0,1,\ldots,n$. This paper contributes to the methodology of Bayesian nonparametric inference for $D_{n}(l)$. Specifically, under the general framework of Gibbs-type priors we show how to derive credible intervals for a Bayesian nonparametric estimation of $D_{n}(l)$, and we investigate the large $n$ asymptotic behaviour of such an estimator. Of particular interest are special cases of our results obtained under the specification of the two parameter Poisson-Dirichlet prior and the normalized generalized Gamma prior, which are two of the most commonly used Gibbs-type priors. With respect to these two prior specifications, the proposed results are illustrated through a simulation study and a benchmark Expressed Sequence Tags dataset. To the best of our knowledge, this illustration provides the first comparative study between the two parameter Poisson-Dirichlet prior and the normalized generalized Gamma prior in the context of Bayesian nonparametric inference for $D_{n}(l)$.
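
As a point of orientation, the posterior-mean estimators of the discovery probabilities take a simple closed form under the two parameter Poisson-Dirichlet prior. The notation below is an assumption introduced for illustration and does not appear in the abstract: $\sigma \in [0,1)$ and $\theta > -\sigma$ are the prior parameters, $k_n$ is the number of distinct species in the sample, and $m_{l,n}$ is the number of species with frequency $l$.

```latex
% Standard predictive probabilities of the two parameter Poisson-Dirichlet
% (Pitman-Yor) model; sigma, theta, k_n and m_{l,n} are illustrative notation.
\[
  \hat{D}_{n}(0) = \frac{\theta + \sigma k_n}{\theta + n},
  \qquad
  \hat{D}_{n}(l) = \frac{(l - \sigma)\, m_{l,n}}{\theta + n},
  \quad l = 1, \ldots, n .
\]
% The estimates sum to one because
% \sum_{l} m_{l,n} = k_n and \sum_{l} l\, m_{l,n} = n.
```

The credible intervals and the normalized generalized Gamma case studied in the paper are not covered by this sketch.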

    Identification and expression profile of amino acid biosynthesis genes of the psychrophilic yeast, Glaciozyma antarctica

    The mechanisms of amino acid uptake and production in psychrophilic microorganisms that survive and proliferate in extremely cold environments are still not fully understood. The objective of this study was to identify the genes involved in amino acid generation in the psychrophilic yeast Glaciozyma antarctica and to determine the expression of these genes in the presence and deficiency of amino acids in the growth medium. Gene identification was carried out by generating expressed sequence tags (ESTs) from two cDNA libraries constructed from cells cultured in complex growth medium and in minimal growth medium without amino acids. A total of 3552 cDNA clones from each library were randomly selected for sequencing, yielding 1492 unique transcripts (complex medium) and 1928 unique transcripts (minimal medium). Similarity analysis identified genes encoding proteins involved in the uptake of free amino acids and in amino acid biosynthesis, as well as genes involved in amino acid recycling, based on the pathways used by the model yeast Saccharomyces cerevisiae. Gene expression analysis using RT-qPCR showed that the expression of genes encoding proteins involved in the uptake of free amino acids, namely permeases, was high in complex medium, whereas the expression of most genes encoding proteins involved in amino acid recycling and biosynthesis was high in minimal medium. In conclusion, the genes involved in amino acid generation and uptake in psychrophilic microorganisms are conserved as in mesophilic microorganisms, and the expression of these genes is induced by the presence or absence of free amino acids in the environment.