AphanoDB: a genomic resource for Aphanomyces pathogens.
BACKGROUND: The Oomycete genus Aphanomyces comprises devastating plant and animal pathogens. However, little is known about the molecular mechanisms underlying pathogenicity of Aphanomyces species. In this study, we report on the development of a public database called AphanoDB which is dedicated to Aphanomyces genomic data. As a first step, a large collection of Expressed Sequence Tags was obtained from the legume pathogen A. euteiches, which was then processed and collected into AphanoDB. DESCRIPTION: Two cDNA libraries of A. euteiches were created: one from mycelium grown on synthetic medium and one from mycelium grown in contact with root tissues of the model legume Medicago truncatula. From these libraries, 18,684 expressed sequence tags were obtained and assembled into 7,977 unigenes, which were compared to public databases for annotation. Queries on AphanoDB allow users to retrieve information for each unigene, including similarity to known protein sequences, protein domains and Gene Ontology classification. Statistical analysis of EST frequency from the two different growth conditions was also added to the database. CONCLUSION: AphanoDB is a public database with a user-friendly web interface. The sequence report pages are the main web interface and provide all annotation details for each unigene. These interactive sequence report pages are easily reached through text, BLAST, Gene Ontology and expression profile search utilities. AphanoDB is available from URL: http://www.polebio.scsv.ups-tlse.fr/aphano/
A Bayesian Nonparametric Method for Prediction in EST Analysis
In this work we propose a Bayesian nonparametric approach for tackling statistical problems related to EST surveys. In particular, we provide estimates for: a) the coverage, defined as the proportion of unique genes in the library represented in the given sample of reads; b) the number of new unique genes to be observed in a future sample; c) the discovery rate of new genes as a function of the future sample size. The Bayesian nonparametric model we adopt conveys, in a statistically rigorous way, the available information into prediction. Our proposal has appealing properties over frequentist nonparametric methods, which become unstable when prediction is required for large future samples. EST libraries studied in Susko and Roger (2004), with frequentist methods, are analyzed in detail.
An EST resource for tilapia based on 17 normalized libraries and assembly of 116,899 sequence tags
BACKGROUND: Large collections of expressed sequence tags (ESTs) are a fundamental resource for analysis of gene expression and annotation of genome sequences. We generated 116,899 ESTs from 17 normalized and two non-normalized cDNA libraries representing 16 tissues from tilapia, a cichlid fish widely used in aquaculture and biological research. RESULTS: The ESTs were assembled into 20,190 contigs and 36,028 singletons for a total of 56,218 unique sequences and a total assembled length of 35,168,415 bp. Over the whole project, a unique sequence was discovered for every 2.079 sequence reads. 17,722 (31.5%) of these unique sequences had significant BLAST hits (e-value < $10^{-10}$) to the UniProt database. CONCLUSION: Normalization of the cDNA pools with double-stranded nuclease allowed us to efficiently sequence a large collection of ESTs. These sequences are an important resource for studies of gene expression, comparative mapping and annotation of the forthcoming tilapia genome sequence.
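The summary figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below only recomputes quantities from the numbers quoted above; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the tilapia EST abstract above.
total_ests = 116_899
contigs = 20_190
singletons = 36_028
blast_hits = 17_722

# Unique sequences are contigs plus singletons.
unique_sequences = contigs + singletons

# Average number of reads needed to discover one unique sequence.
reads_per_unique = total_ests / unique_sequences

# Share of unique sequences with a significant UniProt BLAST hit.
hit_fraction = blast_hits / unique_sequences

print(unique_sequences)              # 56218
print(round(reads_per_unique, 3))    # 2.079
print(round(100 * hit_fraction, 1))  # 31.5
```

The recomputed values agree with the 56,218 unique sequences, the 2.079 reads per unique sequence, and the 31.5% hit rate reported in the abstract.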
Large deviation principles for the Ewens-Pitman sampling model
Let $M_{l,n}$ be the number of blocks with frequency $l$ in the exchangeable
random partition induced by a sample of size $n$ from the Ewens-Pitman sampling
model. We show that, as $n$ tends to infinity, $n^{-1}M_{l,n}$ satisfies a
large deviation principle and we characterize the corresponding rate function.
A conditional counterpart of this large deviation principle is also presented.
Specifically, given an initial sample of size $n$ from the Ewens-Pitman
sampling model, we consider an additional sample of size $m$. For any fixed $n$
and as $m$ tends to infinity, we establish a large deviation principle for the
conditional number of blocks with frequency $l$ in the enlarged sample, given
the initial sample. Interestingly, the conditional and unconditional large
deviation principles coincide, namely there is no long lasting impact of the
given initial sample. Potential applications of our results are discussed in
the context of Bayesian nonparametric inference for discovery probabilities.
Gene capture prediction and overlap estimation in EST sequencing from one or multiple libraries
BACKGROUND: In expressed sequence tag (EST) sequencing, we are often interested in how many genes we can capture in an EST sample of a targeted size. This information provides insights into sequencing efficiency in experimental design, as well as clues to the diversity of expressed genes in the tissue from which the library was constructed. RESULTS: We propose a compound Poisson process model that can accurately predict the gene capture in a future EST sample based on an initial EST sample. It also allows estimation of the number of expressed genes in one cDNA library or co-expressed in two cDNA libraries. The superior performance of the new prediction method over an existing approach is established by a simulation study. Our analysis of four Arabidopsis thaliana EST sets suggests that the number of expressed genes present in four different cDNA libraries of Arabidopsis thaliana varies from 9155 (root) to 12005 (silique). An observed fraction of co-expressed genes in two different EST sets as low as 25% can correspond to an actual overlap fraction greater than 65%. CONCLUSION: The proposed method provides a convenient tool for gene capture prediction and cDNA library property diagnosis in EST sequencing.
Rediscovery of Good-Turing estimators via Bayesian nonparametrics
The problem of estimating discovery probabilities originated in the context
of statistical ecology, and in recent years it has become popular due to its
frequent appearance in challenging applications arising in genetics,
bioinformatics, linguistics, designs of experiments, machine learning, etc. A
full range of statistical approaches, parametric and nonparametric as well as
frequentist and Bayesian, has been proposed for estimating discovery
probabilities. In this paper we investigate the relationships between the
celebrated Good-Turing approach, which is a frequentist nonparametric approach
developed in the 1940s, and a Bayesian nonparametric approach recently
introduced in the literature. Specifically, under the assumption of a two
parameter Poisson-Dirichlet prior, we show that Bayesian nonparametric
estimators of discovery probabilities are asymptotically equivalent, for a
large sample size, to suitably smoothed Good-Turing estimators. As a by-product
of this result, we introduce and investigate a methodology for deriving exact
and asymptotic credible intervals to be associated with the Bayesian
nonparametric estimators of discovery probabilities. The proposed methodology
is illustrated through a comprehensive simulation study and the analysis of
Expressed Sequence Tags data generated by sequencing a benchmark complementary
DNA library.
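For orientation, the classical (unsmoothed) Good-Turing estimator mentioned in this abstract can be sketched in a few lines. With $n$ draws and $m_{l}$ species observed exactly $l$ times, it estimates the probability that the next draw is a species of frequency $l$ as $(l+1)\,m_{l+1}/n$. The function name and the toy sample below are hypothetical, and the smoothing discussed in the paper is not implemented here.

```python
from collections import Counter

def good_turing_discovery(sample, l):
    """Unsmoothed Good-Turing estimate of the probability that the next
    draw is a species seen exactly l times in `sample` (l = 0: new species)."""
    n = len(sample)
    species_counts = Counter(sample)                 # species -> frequency
    freq_of_freq = Counter(species_counts.values())  # frequency -> number of species
    return (l + 1) * freq_of_freq.get(l + 1, 0) / n

# Hypothetical toy sample of gene labels, invented for illustration only.
toy_sample = ["g1", "g1", "g1", "g2", "g2", "g3", "g4", "g5"]

print(good_turing_discovery(toy_sample, 0))  # 3 singletons / 8 reads = 0.375
print(good_turing_discovery(toy_sample, 1))  # 2 * 1 doubleton / 8 reads = 0.25
```

Note that the estimate for frequency $l$ depends on the count of species with frequency $l+1$, which is why it becomes erratic when those counts are small or zero and why smoothed variants are used in practice.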
Bayesian nonparametric inference for discovery probabilities: credible intervals and large sample asymptotics
Given a sample of size $n$ from a population of individuals belonging to
different species with unknown proportions, a popular problem of practical
interest consists in making inference on the probability $D_{n}(l)$ that the
$(n+1)$-th draw coincides with a species with frequency $l$ in the sample, for
any $l=0,1,\ldots,n$. This paper contributes to the methodology of Bayesian
nonparametric inference for $D_{n}(l)$. Specifically, under the general
framework of Gibbs-type priors we show how to derive credible intervals for a
Bayesian nonparametric estimation of $D_{n}(l)$, and we investigate the large
$n$ asymptotic behaviour of such an estimator. Of particular interest are
special cases of our results obtained under the specification of the two
parameter Poisson-Dirichlet prior and the normalized generalized Gamma prior,
which are two of the most commonly used Gibbs-type priors. With respect to
these two prior specifications, the proposed results are illustrated through a
simulation study and a benchmark Expressed Sequence Tags dataset. To the best
of our knowledge, this illustration provides the first comparative study between
the two parameter Poisson-Dirichlet prior and the normalized generalized Gamma
prior in the context of Bayesian nonparametric inference for $D_{n}(l)$.
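As a rough illustration of the kind of estimator this abstract concerns, the sketch below computes the standard posterior-mean estimates of discovery probabilities under a two parameter Poisson-Dirichlet prior. It assumes a sample of size $n$ with $k$ distinct species, of which $m_{l}$ have frequency $l$, and prior parameters $0 \le \sigma < 1$ and $\theta > -\sigma$. The estimate for a new species is $(\theta + \sigma k)/(\theta + n)$, and for frequency $l \ge 1$ it is $(l - \sigma)\,m_{l}/(\theta + n)$. The function name, parameter values, and toy sample are hypothetical; credible intervals and the normalized generalized Gamma case from the paper are not covered.

```python
from collections import Counter

def pd_discovery(sample, l, sigma, theta):
    """Posterior-mean estimate, under a two parameter Poisson-Dirichlet
    prior, of the probability that the next draw is a species seen exactly
    l times in `sample` (l = 0: a species not yet observed)."""
    if not (0 <= sigma < 1 and theta > -sigma):
        raise ValueError("require 0 <= sigma < 1 and theta > -sigma")
    n = len(sample)
    species_counts = Counter(sample)   # species -> frequency
    k = len(species_counts)            # number of distinct species
    if l == 0:
        return (theta + sigma * k) / (theta + n)
    m_l = sum(1 for c in species_counts.values() if c == l)
    return (l - sigma) * m_l / (theta + n)

# Hypothetical toy sample and prior parameters, chosen for illustration only.
toy_sample = ["g1", "g1", "g1", "g2", "g2", "g3", "g4", "g5"]
sigma, theta = 0.5, 1.0

estimates = [pd_discovery(toy_sample, l, sigma, theta)
             for l in range(len(toy_sample) + 1)]
print(estimates[0])    # estimated probability of discovering a new species
print(sum(estimates))  # the estimates over l = 0, ..., n add up to 1 (up to rounding)
```

Unlike the unsmoothed Good-Turing estimator, the estimate for frequency $l$ here uses the count $m_{l}$ itself rather than $m_{l+1}$, so it does not vanish merely because no species has frequency $l+1$.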
Identification and expression profiling of amino acid biosynthesis genes in the psychrophilic yeast Glaciozyma antarctica
The mechanisms of amino acid uptake and production in psychrophilic microorganisms that survive and proliferate in extremely cold environments are still not fully understood. The objective of this study was to identify the genes involved in amino acid generation in the psychrophilic yeast Glaciozyma antarctica and to determine the expression of these genes in the presence and absence of amino acids in the growth medium. Gene identification was carried out by generating expressed sequence tags (ESTs) from two cDNA libraries constructed from cells cultured in complex growth medium and in minimal growth medium without amino acids. A total of 3552 cDNA clones from each library were randomly selected for sequencing, yielding 1492 unique transcripts (complex medium) and 1928 unique transcripts (minimal medium). Similarity analysis identified genes encoding proteins involved in the uptake of free amino acids and in amino acid biosynthesis, as well as genes involved in amino acid recycling, based on the pathways used by the model yeast Saccharomyces cerevisiae. Gene expression analysis using RT-qPCR showed that expression of genes encoding proteins involved in the uptake of free amino acids, namely permeases, was high in complex medium, whereas expression of most genes encoding proteins involved in amino acid recycling and biosynthesis was high in minimal medium. In conclusion, the genes involved in amino acid generation and uptake in psychrophilic microorganisms are conserved as in mesophilic microorganisms, and the expression of these genes is induced by the presence or absence of free amino acids in the environment.