89 research outputs found
PRISE2: software for designing sequence-selective PCR primers and probes.
Background: PRISE2 is a new software tool for designing sequence-selective PCR primers and probes. To achieve a high level of selectivity, PRISE2 allows the user to specify a collection of target sequences that the primers are supposed to amplify, as well as non-target sequences that should not be amplified. The program emphasizes primer selectivity on the 3' end, which is crucial for selective amplification of conserved sequences such as rRNA genes. In PRISE2, users can specify desired properties of primers, including length, GC content, and others. They can interactively manipulate the list of candidate primers to choose primer pairs that are best suited for their needs. A similar process is used to add probes to selected primer pairs. More advanced features include, for example, the capability to define a custom mismatch penalty function. PRISE2 is equipped with a graphical, user-friendly interface, and it runs on Windows, Macintosh or Linux machines.
Results: PRISE2 has been tested on two very similar strains of the fungus Dactylella oviparasitica, and it was able to create highly selective primers and probes for each of them, demonstrating the ability to create useful sequence-selective assays.
Conclusions: PRISE2 is a user-friendly, interactive software package that can be used to design high-quality selective primers for PCR experiments. In addition to choosing primers, users have an option to add a probe to any selected primer pair, enabling design of TaqMan and other primer-probe based assays. PRISE2 can also be used to design probes for FISH and other hybridization-based assays.
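As a rough illustration of the primer properties mentioned above, the following sketch computes GC content and a mismatch penalty that weights the 3' end more heavily. It is not PRISE2's actual penalty function: the function names, the five-base window and the weight of 3.0 are assumptions chosen only for the example.

```python
def gc_content(primer: str) -> float:
    """Return the fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def three_prime_mismatch_penalty(primer: str, site: str,
                                 window: int = 5,
                                 end_weight: float = 3.0) -> float:
    """Sum mismatch penalties between a primer and an equally long binding site.

    Mismatches in the last `window` bases (the 3' end of the primer) cost
    `end_weight`; all other mismatches cost 1.0. Both values are hypothetical.
    """
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    n = len(primer)
    penalty = 0.0
    for i, (p, s) in enumerate(zip(primer.upper(), site.upper())):
        if p != s:
            penalty += end_weight if i >= n - window else 1.0
    return penalty
```

Under this scheme, a non-target sequence with a mismatch near the 3' end of the primer receives a higher penalty than one with a mismatch near the 5' end, which is the kind of primer-to-non-target difference a selective design favours.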
Improving probe set selection for microbial community analysis by leveraging taxonomic information of training sequences
Background: Population levels of microbial phylotypes can be examined using a hybridization-based method that utilizes a small set of computationally-designed DNA probes targeted to a gene common to all. Our previous algorithm attempts to select a set of probes such that each training sequence manifests a unique theoretical hybridization pattern (a binary fingerprint) to a probe set. It does so without taking into account similarity between training gene sequences or their putative taxonomic classifications, however. We present an improved algorithm for probe set selection that utilizes the available taxonomic information of training gene sequences and attempts to choose probes such that the resultant binary fingerprints cluster into real taxonomic groups.
Results: Gene sequences manifesting identical fingerprints with probes chosen by the new algorithm are more likely to be from the same taxonomic group than those with probes chosen by the previous algorithm. In cases where they are from different taxonomic groups, underlying DNA sequences of identical fingerprints are more similar to each other in probe sets made with the new versus the previous algorithm. Complete removal of large taxonomic groups from training data does not greatly decrease the ability of probe sets to distinguish those groups.
Conclusions: Probe sets made from the new algorithm create fingerprints that more reliably cluster into biologically meaningful groups. The method can readily distinguish microbial phylotypes that were excluded from the training sequences, suggesting novel microbes can also be detected.
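The notion of a binary fingerprint can be sketched as follows. This is a simplified illustration and not the authors' algorithm: hybridization is modelled here as an exact substring match, and the function names are hypothetical.

```python
def fingerprint(sequence: str, probes: list[str]) -> tuple[int, ...]:
    """Return one bit per probe: 1 if the probe occurs in the sequence, else 0.

    Exact substring matching stands in for theoretical hybridization,
    which is a simplifying assumption.
    """
    return tuple(int(probe in sequence) for probe in probes)


def group_by_fingerprint(sequences: dict[str, str],
                         probes: list[str]) -> dict[tuple[int, ...], list[str]]:
    """Group sequence names that share an identical fingerprint."""
    groups: dict[tuple[int, ...], list[str]] = {}
    for name, seq in sequences.items():
        groups.setdefault(fingerprint(seq, probes), []).append(name)
    return groups
```

Sequences that land in the same group are indistinguishable with that probe set, so a taxonomy-aware selection would prefer probe sets whose groups contain members of a single taxonomic group.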
Development of a simple artificial intelligence method to accurately subtype breast cancers based on gene expression barcodes
Magister Scientiae - MSc
INTRODUCTION:
Breast cancer is a highly heterogeneous disease. The complexity of achieving an accurate diagnosis and an effective treatment regimen lies within this heterogeneity. Subtypes of the disease are not simply molecular, i.e. hormone receptor over-expression or absence, but the tumour itself is heterogeneous in terms of tissue of origin, metastases, and histopathological variability. Accurate tumour classification vastly improves treatment decisions, patient outcomes and 5-year survival rates. Gene expression studies using transcriptomic technologies such as microarrays and next-generation sequencing (e.g. RNA-Sequencing) have improved oncology researchers' and clinicians' understanding of the complex molecular portraits of malignant breast tumours. Mechanisms governing cancers, which include tumorigenesis, gene fusions, gene over-expression and suppression, and cellular process and pathway involvement, have been elucidated through comprehensive analyses of the cancer transcriptome. Over the past 20 years, gene expression signatures discovered with both microarray and RNA-Seq have reached clinical and commercial application through the development of tests such as Mammaprint®, OncotypeDX®, and FoundationOne® CDx, all of which focus on chemotherapy sensitivity, prediction of cancer recurrence, and tumour mutational level.
The Gene Expression Barcode (GExB) algorithm was developed to allow for easy interpretation and integration of microarray data through data normalization with frozen RMA (fRMA) preprocessing and conversion of relative gene expression to a sequence of 1's and 0's. Unfortunately, the algorithm has not yet been developed for RNA-Seq data. However, implementation of the GExB with feature-selection would contribute to a machine-learning based robust breast cancer and subtype classifier.
METHODOLOGY:
For microarray data, we applied the GExB algorithm to generate barcodes for normal breast and breast tumour samples. A two-class classifier for malignancy was developed through feature-selection on barcoded samples by selecting for genes with 85% stable absence or presence within a tissue type, and differentially stable between tissues. A multi-class feature-selection method was employed to identify genes with variable expression in one subtype, but 80% stable absence or presence in all other subtypes, i.e. 80% in n-1 subtypes.
For RNA-Seq data, a barcoding method needed to be developed which could mimic the GExB algorithm for microarray data. A z-score-to-barcode method was implemented, and differential gene expression analysis was used to select the top 100 genes as informative features for classification purposes.
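A z-score-to-barcode conversion of the kind described above might look like the sketch below. The abstract does not give the details, so the per-gene standardization across samples, the threshold of 0 and the function name are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore_barcode(expression: dict[str, list[float]],
                   threshold: float = 0.0) -> dict[str, list[int]]:
    """Convert per-gene expression values to 0/1 barcodes.

    Each gene's values are standardized across samples; a sample gets 1
    when its z-score exceeds `threshold` (a hypothetical cut-off), else 0.
    Genes with zero variance are barcoded as all 0.
    """
    barcodes: dict[str, list[int]] = {}
    for gene, values in expression.items():
        mu = mean(values)
        sigma = pstdev(values)
        if sigma == 0:
            barcodes[gene] = [0] * len(values)
            continue
        barcodes[gene] = [int((v - mu) / sigma > threshold) for v in values]
    return barcodes
```

The resulting 0/1 vectors have the same form as the microarray barcodes, which is what allows a single classifier to consume both data types.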
The accuracy and discriminatory capability of both the microarray-based and the RNA-Seq-based gene signatures were assessed through unsupervised and supervised machine-learning algorithms, i.e., K-means and hierarchical clustering, as well as binary and multi-class Support Vector Machine (SVM) implementations.
RESULTS:
The GExB-FS method for microarray data yielded an 85-probe and 346-probe informative set for two-class and multi-class classifiers, respectively. The two-class classifier predicted samples as either normal or malignant with 100% accuracy and the multi-class classifier predicted molecular subtype with 96.5% accuracy with SVM.
Combining RNA-Seq DE analysis for feature-selection with the z-score-to-barcode method resulted in a two-class classifier for malignancy, and a multi-class classifier for normal-from-healthy, normal-adjacent-tumour (from cancer patients), and breast tumour samples with 100% accuracy. Most notably, a normal-adjacent-tumour gene expression signature emerged, which differentiated it from normal breast tissues in healthy individuals.
CONCLUSION: A potentially novel method for microarray and RNA-Seq data transformation, feature selection and classifier development was established. The universal application of the microarray signatures and the validity of the z-score-to-barcode method were proven with 95% accurate classification of RNA-Seq barcoded samples with a microarray-discovered gene expression signature. The results from this comprehensive study into the discovery of robust gene expression signatures hold immense potential for further R&D towards implementation at the clinical endpoint, and translation to simpler and cost-effective laboratory methods such as qtPCR-based tests.
Vulnerability and robustness in the essential gene complement of two bacterial species, profiled with CRISPRi
Bacterial essential genes contribute to the most fundamental processes of cellular life. The study of their functions in vivo has long been intractable to systematic genetic approaches, which are fundamental to understanding pathway level connections that govern cellular life and are a requirement for dissecting the complex cellular processes to which essential genes contribute. In Chapter 1 of this work I review recent advances in mapping gene-phenotype relationships in bacteria using the CRISPR-based technology, CRISPR interference (CRISPRi) for titratable gene knockdowns, focusing on their applications to the studies of essential genes, the exploration of chemical-genetic interactions, and the prospects for disentangling complex phenotypes in diverse bacterial species. In Chapter 2 I describe my analysis of the essential gene functions in the model Gram-negative bacterium Escherichia coli and the model Gram-positive Bacillus subtilis using datasets from paired chemical-genetic screens. In this work I identify both shared and Gram-negative specific mechanisms of collateral sensitization to antibiotic action. In Chapter 3 I investigate a fundamental property of essential genes, which is the relationship between their expression level and the cellular growth rate. Here, further developing CRISPRi tools in bacteria to predictably titrate knockdown efficacy, I interpret the knockdown-fitness relationships of each essential gene in E. coli and B. subtilis, discovering broad conservation of constraints setting and maintaining expression levels across these diverged species
Probabilistic Modeling for Whole Metagenome Profiling
To address the shortcomings in existing Markov model implementations in handling large amounts of metagenomic data with comparable or better accuracy in classification, we developed a new algorithm based on pseudo-count supplemented standard Markov model (SMM), which leverages the power of higher order models to more robustly classify reads at different taxonomic levels. Assessment on simulated metagenomic datasets demonstrated that overall SMM was more accurate in classifying reads to their respective taxa at all ranks compared to the interpolated methods. Higher order SMMs (9th order or greater) also outperformed BLAST alignments in assigning taxonomic labels to metagenomic reads at different taxonomic ranks (genus and higher) on tests that masked the read originating species (genome models) in the database. Similar results were obtained by masking at other taxonomic ranks in order to simulate the plausible scenarios of non-representation of the source of a read at different taxonomic levels in the genome database. The performance gap became more pronounced with higher taxonomic levels. To eliminate contamination in datasets and to further improve our alignment-free approach, we developed a new framework based on a genome segmentation and clustering algorithm. This framework allowed removal of adapter sequences and contaminant DNA, as well as generation of clusters of similar segments, which were then used to sample representative read fragments to constitute training datasets. The parameters of a logistic regression model were learnt from these training datasets using a Bayesian optimization procedure. This allowed us to establish thresholds for classifying metagenomic reads by SMM. This led to the development of a Python-based frontend that combines our SMM algorithm with the logistic regression optimization, named POSMM (Python Optimized Standard Markov Model). POSMM provides a much-needed alternative to metagenome profiling programs.
Our algorithm, which builds the genome models on the fly and thus obviates the need to build a database, complements alignment-based classification and can thus be used in concert with alignment-based classifiers to raise the bar in metagenome profiling.
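The general idea of a pseudo-count supplemented k-th order Markov model for read classification can be sketched as below. This is not the POSMM implementation: the function names, the pseudo-count of 1.0 and the restriction to the alphabet A, C, G, T are assumptions, and the score-thresholding step via logistic regression is omitted.

```python
import math

ALPHABET = "ACGT"  # assumed alphabet; ambiguous bases are not handled


def train_markov(genome: str, order: int) -> dict[str, dict[str, int]]:
    """Count how often each base follows each context of length `order`."""
    counts: dict[str, dict[str, int]] = {}
    for i in range(len(genome) - order):
        context = genome[i:i + order]
        nxt = genome[i + order]
        ctx_counts = counts.setdefault(context, {})
        ctx_counts[nxt] = ctx_counts.get(nxt, 0) + 1
    return counts


def log_likelihood(read: str, counts: dict[str, dict[str, int]],
                   order: int, pseudocount: float = 1.0) -> float:
    """Score a read under a model, adding a pseudo-count to every transition
    so that unseen contexts or bases never give zero probability."""
    total = 0.0
    for i in range(len(read) - order):
        ctx_counts = counts.get(read[i:i + order], {})
        numerator = ctx_counts.get(read[i + order], 0) + pseudocount
        denominator = sum(ctx_counts.values()) + pseudocount * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total


def classify(read: str, models: dict[str, dict[str, dict[str, int]]],
             order: int) -> str:
    """Return the name of the model that gives the read the highest score."""
    return max(models, key=lambda name: log_likelihood(read, models[name], order))
```

Because each model is just a table of counts built directly from a genome string, models can be constructed when needed rather than stored in a prebuilt database.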
Natural variation in Drosophila melanogaster
This work is dedicated to studying natural variation in D. melanogaster at the DNA sequence and gene expression level. In addition I present a new version of the DNA polymorphism analysis program VariScan, which includes significant improvements.
In CHAPTER 1 I describe a genome scan of single nucleotide polymorphism in two natural D. melanogaster populations (from Africa and Europe) on the third chromosome. Together with polymorphism data previously published for the X chromosome of the same populations, this allows a comparative study of the polymorphism patterns of the X chromosome and an autosome. The frequency spectrum of mutations and the patterns of linkage disequilibrium are investigated. The observed patterns indicate that there is a significant difference in the behavior of the two chromosomes, as has already been suggested by previous studies. To uncover the reasons for this a coalescent based maximum likelihood method is applied that incorporates the effects of demographic history and unequal sex ratios. For the African population the differential behavior of the chromosomes can be explained by its demographic history and an excess of females. In Europe, a population bottleneck and an excess of males alone cannot explain the patterns we observe. The additional action of positive selection in this population is proposed as a possible explanation.
In CHAPTER 2 I investigate the variation in gene expression of the two aforementioned populations. Whole-genome microarrays are used to study levels of expression for 88% of all known genes in eight adult males from both populations. The observed levels of expression variation are equal in Africa and Europe, despite the fact that DNA sequence variation is much higher in Africa. This is evidence for the action of stabilizing selection governing levels of expression polymorphism. Supporting this view, genes that are involved in many different functions, and are therefore under strong selective constraint, show less variation than do genes with only a few functions. The experimental design allows the search for genes which differ in their expression patterns between Europe and Africa and might therefore have undergone adaptive evolution. Detected candidates include genes putatively involved in insecticide resistance and food choice. Surprisingly, many genes over-expressed in Africa are involved in the formation and function of the flying apparatus.
In CHAPTER 3 I present version 2 of the program VariScan. This program was designed to analyse patterns of DNA sequence polymorphism on a chromosomal scale. The functionality of the core analysis tool, the wavelet decomposition, is described. In addition, multiple improvements to the previous version are presented. The program now supports the “pairwise deletion” option. This is essential for analysing data at the chromosome scale, since such data often contains incomplete information. It is now possible to add outgroup information, which allows the calculation of additional statistics. Furthermore, the separate analysis of different predefined chromosomal regions is added as an option. To increase the user friendliness, a graphical user interface is now included as part of the software package. Finally, VariScan is applied to published and computer-generated data and the ability of the wavelet-based analysis to uncover chromosomal regions with interesting DNA polymorphism patterns is demonstrated
Mapping and Functional Analysis of cis-Regulatory Elements in Mouse Photoreceptors
Photoreceptors are light-sensitive neurons that mediate vision, and they are the most commonly affected cell type in genetic forms of blindness. In mice, there are two basic types of photoreceptors, rods and cones, which mediate vision in dim and bright environments, respectively. The transcription factors (TFs) that control rod and cone development have been studied in detail, but the cis-regulatory elements (CREs) through which these TFs act are less well understood. To comprehensively identify photoreceptor CREs in mice and to understand their relationship with gene expression, we performed open chromatin (ATAC-seq) and transcriptome (RNA-seq) profiling of FACS-purified rods and cones. We find that rods have significantly fewer regions of open chromatin than cones (as well as >60 additional cell types and tissues), and we demonstrate that this uniquely closed chromatin architecture depends on the rod master regulator Nrl. Finally, we find that regions of rod- and cone-specific open chromatin are enriched for distinct sets of TF binding sites, providing insight into the cis-regulatory grammar of these cell types.
We also sought to understand how the regulatory activity of rod and cone open chromatin regions is encoded in DNA sequence. Cone-rod homeobox (CRX) is a paired-like homeodomain TF and master regulator of both rod and cone development, and CRX binding sites are by far the most enriched TF binding sites in photoreceptor CREs. The in vitro DNA binding preferences of CRX have been extensively characterized, but how well in vitro models of TF binding site affinity predict in vivo regulatory activity is not known. In addition, paired-class homeodomain TFs bind DNA as both monomers and dimers, but whether monomeric and dimeric CRX binding sites have distinct regulatory activities is not known. To address these questions, we used a massively parallel reporter assay to quantify the activity of thousands of native and mutant CRX binding sites in explanted mouse retinas. These data reveal that dimeric CRX binding sites encode stronger enhancers than monomeric CRX binding sites. Moreover, the activity of half-sites within dimeric CRX binding sites is cooperative and spacing-dependent. In addition, saturating mutagenesis of 195 CRX binding sites reveals that, while TF binding site affinity and activity are moderately correlated across mutations within individual CREs, they are poorly correlated across mutations from distinct CREs. Accordingly, we show that accounting for baseline CRE activity improves the prediction of the effects of mutations in regulatory DNA from sequence-based models. Taken together, these data demonstrate that the activity of CRX binding sites depends on multiple layers of sequence context, providing insight into photoreceptor gene regulation and illustrating functional principles of homeodomain TF binding sites.
Bayesian clustering of curves and the search of the partition space
This thesis is concerned with the study of a Bayesian clustering algorithm, proposed by Heard et al. (2006), used successfully for microarray experiments over time. It focuses not only on the development of new ways of setting hyperparameters so that inferences both reflect the scientific needs and contribute to the inferential stability of the search, but also on the design of new fast algorithms for the search over the partition space. First we use the explicit forms of the associated Bayes factors to demonstrate that such methods can be unstable under common settings of the associated hyperparameters. We then prove that the regions of instability can be removed by setting the hyperparameters in an unconventional way. Moreover, we demonstrate that MAP (maximum a posteriori) search is satisfied when a utility function is defined according to the scientific interest of the clusters. We then focus on the search over the partition space. In model-based clustering a comprehensive search for the highest scoring partition is usually impossible, due to the huge number of partitions of even a moderately sized dataset. We propose two methods for the partition search. One method encodes the clustering as a weighted MAX-SAT problem, while the other views clusterings as elements of the lattice of partitions. Finally, this thesis includes the full analysis of two microarray experiments for identifying circadian genes
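To give a sense of why an exhaustive search over partitions is usually impossible, the number of partitions of a set of n items is the Bell number B(n). The short sketch below computes it with the Bell triangle; the function name is chosen for this example and the code is not taken from the thesis.

```python
def bell_number(n: int) -> int:
    """Return the number of ways to partition a set of n items (Bell number).

    Builds the Bell triangle row by row: each row starts with the last
    entry of the previous row, and every further entry is the sum of its
    left neighbour and the entry above that neighbour.
    """
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]
```

For example, `bell_number(10)` is 115975, and the count grows faster than exponentially with n, so datasets with even a few dozen curves cannot be searched exhaustively.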