207 research outputs found

    Utilization of two sample t-test statistics from redundant probe sets to evaluate different probe set algorithms in GeneChip studies

    BACKGROUND: The choice of probe set algorithms for expression summary in a GeneChip study has a great impact on subsequent gene expression data analysis. Spiked-in cRNAs with known concentration are often used to assess the relative performance of probe set algorithms. Given the fact that the spiked-in cRNAs do not represent endogenously expressed genes in experiments, it becomes increasingly important to have methods to study whether a particular probe set algorithm is more appropriate for a specific dataset, without using such external reference data. RESULTS: We propose the use of the probe set redundancy feature for evaluating the performance of probe set algorithms, and have presented three approaches for analyzing data variance and result bias using two sample t-test statistics from redundant probe sets. These approaches are as follows: 1) analyzing redundant probe set variance based on t-statistic rank order, 2) computing correlation of t-statistics between redundant probe sets, and 3) analyzing the co-occurrence of replicate redundant probe sets representing differentially expressed genes. We applied these approaches to expression summary data generated from three datasets utilizing individual probe set algorithms of MAS5.0, dChip, or RMA. We also utilized combinations of options from the three probe set algorithms. We found that results from the three approaches were similar within each individual expression summary dataset, and were also in good agreement with previously reported findings by others. We also demonstrate the validity of our findings by independent experimental methods. CONCLUSION: All three proposed approaches allowed us to assess the performance of probe set algorithms using the probe set redundancy feature. 
The analyses of redundant probe set variance based on t-statistic rank order and correlation of t-statistics between redundant probe sets provide useful tools for data variance analysis, and the co-occurrence of replicate redundant probe sets representing differentially expressed genes allows estimation of result bias. The results also suggest that individual probe set algorithms have dataset-specific performance
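
As a rough illustration of the second approach above (correlating t-statistics between redundant probe sets), the sketch below computes a two-sample t-statistic for each probe set and then the Pearson correlation across pairs of probe sets that target the same gene. It is not the authors' code: the Welch form of the t-statistic, the function names, and the toy expression values are all assumptions made for illustration.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch two-sample t-statistic (assumed variant; the abstract does not say which form was used)."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)


def redundant_t_correlation(pairs):
    """pairs: one entry per gene, each ((ctrl, trt) for probe set 1, (ctrl, trt) for probe set 2)."""
    t_first = [welch_t(ps1[0], ps1[1]) for ps1, _ in pairs]
    t_second = [welch_t(ps2[0], ps2[1]) for _, ps2 in pairs]
    return pearson(t_first, t_second)


# Hypothetical log-scale expression summaries for three genes, two probe sets each.
toy_pairs = [
    (([5.0, 5.2, 5.1], [7.0, 7.1, 6.9]), ([4.8, 5.0, 4.9], [6.5, 6.8, 6.6])),
    (([6.0, 6.1, 5.9], [6.0, 6.2, 5.9]), ([6.2, 6.0, 6.1], [6.1, 6.0, 6.3])),
    (([8.0, 8.2, 8.1], [6.9, 7.0, 7.1]), ([7.9, 8.1, 8.0], [7.0, 7.2, 6.8])),
]
r = redundant_t_correlation(toy_pairs)
print(f"correlation of t-statistics between redundant probe sets: {r:.3f}")
```

Under this reading, a summary algorithm that yields a higher correlation on the same dataset would be treating redundant probe sets more consistently; the comparison across MAS5.0, dChip, and RMA would repeat the calculation once per expression summary.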

    Involvement of genes and non-coding RNAs in cancer: profiling using microarrays

    MicroRNAs (miRNAs) are small noncoding RNAs (ncRNAs, RNAs that do not code for proteins) that regulate the expression of target genes. MiRNAs can act as tumor suppressor genes or oncogenes in human cancers. Moreover, a large fraction of genomic ultraconserved regions (UCRs) encode a particular set of ncRNAs whose expression is altered in human cancers. Bioinformatics studies are emerging as important tools to identify associations between miRNAs/ncRNAs and CAGRs (Cancer Associated Genomic Regions). ncRNA profiling with highly parallel devices such as expression microarrays, together with public resources (mapping, expression, and functional databases) and prediction algorithms, has allowed the identification of specific signatures associated with diagnosis, prognosis and response to treatment of human tumors

    Enabling Data-Guided Evaluation of Bioinformatics Workflow Quality

    Bioinformatics can be divided into two phases: the first is the conversion of raw data into processed data, and the second is the use of processed data to obtain scientific results. It is important to consider the first “workflow” phase carefully, as there are many paths on the way to a final processed dataset. Some workflow paths may be different enough to influence the second phase, thereby leading to ambiguity in the scientific literature. Workflow evaluation in bioinformatics enables investigators to plan carefully how to process their data. A system that uses real data to determine the quality of a workflow can be based on the inherent biological relationships in the data itself. To our knowledge, a general software framework that performs real data-driven evaluation of bioinformatics workflows does not exist. The Evaluation and Utility of workFLOW (EUFLOW) decision-theoretic framework, developed and tested on gene expression data, enables users of bioinformatics workflows to evaluate alternative workflow paths using inherent biological relationships. EUFLOW is implemented as an R package to enable users to evaluate workflow data. The framework also permits user-guided utility and loss functions, so that the type of analysis can be taken into account in the workflow path decision. This framework was originally developed to address the quality of identifier mapping services between UNIPROT accessions and Affymetrix probesets to facilitate integrated analysis [1]. An extension to this framework evaluates Affymetrix probeset filtering methods on real data from endometrial cancer and TCGA ovarian serous carcinoma samples [2]. Further evaluation of RNASeq workflow paths demonstrates generalizability of the EUFLOW framework. 
Three separate evaluations are performed including: 1) identifier filtering of features with biological attributes, 2) threshold selection parameter choice for low gene count features, and 3) commonly utilized RNASeq data workflow paths on The Cancer Genome Atlas data. The EUFLOW decision-theoretic framework developed and tested in my dissertation enables users of bioinformatics workflows to evaluate alternative workflow paths guided by inherent biological relationships and user utility

    Strong approximations of level exceedences related to multiple hypothesis testing

    Particularly in genomics, but also in other fields, it has become commonplace to undertake highly multiple Student's t-tests based on relatively small sample sizes. The literature on this topic is continually expanding, but the main approaches used to control the family-wise error rate and false discovery rate are still based on the assumption that the tests are independent. The independence condition is known to be false at the level of the joint distributions of the test statistics, but that does not necessarily mean, for the small significance levels involved in highly multiple hypothesis testing, that the assumption leads to major errors. In this paper, we give conditions under which the assumption of independence is valid. Specifically, we derive a strong approximation that closely links the level exceedences of a dependent "studentized process" to those of a process of independent random variables. Via this connection, it can be seen that in high-dimensional, low sample-size cases, provided the sample size diverges faster than the logarithm of the number of tests, the assumption of independent t-tests is often justified. Comment: Published in Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm) at http://dx.doi.org/10.3150/09-BEJ220
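
For readers unfamiliar with the false discovery rate control mentioned above, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure, a standard method whose usual guarantee relies on the kind of independence assumption the abstract discusses. It is offered only as background: the abstract does not name this procedure, and the function name and example p-values are assumptions for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Step-up rule: find the largest rank k with p_(k) <= alpha * k / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k_max = rank
    # Reject the k_max smallest p-values.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Hypothetical p-values from ten t-tests.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example_p))
```

In a genomics setting the p-values would come from one t-test per gene, so the number of tests is typically in the thousands while the per-group sample size stays small.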

    Improving data extraction methods for large molecular biology datasets.

    In the past, an experiment involving a pairwise comparison normally involved one or a few dependent variables. Now, 1000s of dependent variables can be measured simultaneously in a single experiment, be it detecting genes via a microarray experiment, sequencing genomes, or detecting microbial species based on DNA fragments using molecular techniques. How we analyze such large collections of data will be a major scientific focus over the next decade. Statistical methods that were once acceptable for comparing a few conditions are being revised to handle 1000s of experiments. Molecular biology techniques that explored one gene or species have evolved and are now capable of generating complex datasets requiring new strategies and ways of thinking in order to discover biologically meaningful results. The central theme of this dissertation is to develop strategies that deal with a number of issues that are present in these large-scale datasets. In chapter 1, I describe a microarray analytical method that can be applied to low-replicate experiments. In chapters 2-4, the focus is how best to analyze data from ARISA (a PCR-based molecular method for rapidly generating a fingerprint of microbial diversity). Chapter 2 focuses on qualifying ARISA data so that the data best represent their biological source prior to further analysis. Chapter 3 focuses on how best to compare ARISA profiles to one another. Chapter 4 focuses on developing a software tool that implements the data processing and clustering strategies from chapters 2 and 3. The findings described herein provide the scientific community with improved analytical strategies in both the microarray and ARISA research areas

    Gene expression patterns of encapsulated microbial cells

    To design hybrid cellular/synthetic devices such as sensors and vaccines, an understanding of how the metabolic state of living cells changes upon physical confinement within three-dimensional matrices is vital. We analyze the gene expression patterns of stationary phase Saccharomyces cerevisiae (S. cerevisiae) cells encapsulated within three distinct nanostructured silica matrices and relate those patterns to known naturally occurring metabolic states. It was found that the cells for all three encapsulation methods enter quiescent states characteristic of response to stress, albeit to different degrees and with differences in detail. By the measure of enrichment of stress-related Gene Ontology categories, we find that the AqS+g encapsulation is more amenable to the cells than CDA and SD encapsulation. We hypothesize that this differential response in the AqS+g encapsulation is related to four properties of the encapsulating gel: 1) oxygen permeability, 2) relative softness of the material, 3) development of a protective sheath around individual cells, and 4) the presence of glycerol in the gel, which has been previously noted to serve as a protectant for encapsulated cells and can serve as the sole carbon source for S. cerevisiae under aerobic conditions. This work represents a combination of experiment and analysis aimed at the design and development of 3D encapsulation procedures to induce, and perhaps control, well-defined physiological behaviors. We also report on the temporal pattern of yeast gene expression during encapsulation in silica matrices via a cell-directed assembly process, and upon release. Three broad classes of patterns are seen. A major shift in expression patterns is seen upon encapsulation, relative to the beginning stationary state, similar to previously reported stress response. Significant continuing shifts are seen by sampling at different intervals during a one-week encapsulation. 
Upon release from encapsulation and reincubation in growth medium, the cells are in a state significantly different from the state prior to encapsulation and similar to the state during encapsulation. Implications are drawn for the use of encapsulated microorganisms as sensors and effectors, and for the persister state of such organisms. Ordinarily Gene Ontology (GO) enrichment analysis is subject to an arbitrary threshold for defining significance of enriched classes. In this paper, we consider replacing an arbitrary threshold with F-measure optimization to define the p-value that divides “significant enrichment” from “non-significant”. It is found that evaluation of false negatives (essential for computing recall and thus F-measure) requires a heuristic (but reasonable) assumption. We apply F-measure optimization to two sets of genes from different organisms and use Benjamini-Hochberg and random resampling to evaluate the number of false positives. It is found that the uncorrected p-value that produces optimum F-measure varies widely from one data set to another. It is also found that all three methods of FDR calculation diverge from each other within a range of uncorrected p-values that provide F-measure optimum p-values. This study includes in Appendix II a pipeline for using resampling and F-measure optimization to create lists of enriched GO classes that provide for variable weights of precision and recall
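
The idea of choosing a p-value cutoff by F-measure optimization can be sketched as follows. This is a simplified illustration, not the pipeline from the study's Appendix II: it assumes each GO class comes with a known true/false label, whereas the abstract notes that false negatives must in practice be estimated under a heuristic assumption. The function names, the `beta` weighting parameter, and the toy data are hypothetical.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily, beta < 1 weights precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(pvalues, is_true, beta=1.0):
    """Scan each observed p-value as a cutoff and return (best F-measure, cutoff)."""
    total_true = sum(is_true)
    best = (0.0, None)
    for cut in sorted(set(pvalues)):
        # Classes with p <= cut are called "significantly enriched".
        tp = sum(1 for p, t in zip(pvalues, is_true) if p <= cut and t)
        fp = sum(1 for p, t in zip(pvalues, is_true) if p <= cut and not t)
        score = f_measure(tp, fp, total_true - tp, beta)
        if score > best[0]:
            best = (score, cut)
    return best


# Hypothetical uncorrected enrichment p-values with assumed ground-truth labels.
toy_p = [0.001, 0.01, 0.02, 0.2, 0.5]
toy_truth = [True, True, False, True, False]
score, cutoff = best_threshold(toy_p, toy_truth)
print(f"best F-measure {score:.3f} at p <= {cutoff}")
```

Changing `beta` is one way to express the variable weighting of precision and recall that the abstract mentions; the optimal cutoff can move substantially as the weighting or the data set changes.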

    Effects of selection for low residual feed intake and feed restriction on gene expression profiles and thyroid axis in pigs

    The overall purpose of this thesis project was to identify genetic mechanisms associated with differences in and regulation of feed intake and feed efficiency in pigs. The long-term goal of this research is to use this knowledge to improve feed efficiency in pigs, since feed is the largest variable cost in pork production, through genetic selection or management. The central hypothesis is that we can discover and characterize genetic pathways that control economically important traits related to feed intake and feed efficiency through transcriptional profiling of specific and relevant tissues during the growth period. Profiling studies were based on two complementary animal-level contrasts and their interactions, including quantitative differences in a specific measure of feed efficiency, residual feed intake (RFI), and genetic differences for a candidate gene with known impact on feed intake and energy partitioning. In addition to the two profiling studies, we examined the effect of RFI selection on the thyroid axis, as the thyroid hormones are a critical component for growth and development of animals. Using transcriptional profiling tools, we identified a series of genes, pathways, and transcription factors that may underlie feed efficiency and feed intake in pigs. We also specifically studied the thyroid axis and determined that the increased concentration of peripherally produced triiodothyronine may contribute to the decreased feed intake and increased efficiency observed in the low RFI pigs

    Expression microarray technology as a tool in cancer research

    Within a decade, DNA microarray technology has been rapidly adopted by biomedical researchers and has emerged as a very prominent research tool. In this study, microarray technology, together with supporting methods, was utilized in studies of human cancer. The study focused on two types of cancer: a hereditary syndrome called Hereditary Leiomyomatosis and Renal Cell Cancer (HLRCC), and colorectal cancer (CRC). HLRCC is a disease caused by mutations in the Krebs cycle gene fumarase, where some of the patients develop an aggressive and early-onset renal cancer or uterine leiomyosarcoma. CRC is one of the leading causes of death in the Western world. In the first study, yeast models with fumarase mutations were subjected to microarray profiling and functional experiments to reveal changes caused by two different fumarase mutations and to find potential candidate genes for the renal cancer observed in some of the HLRCC patients. No significant differences in fumarase gene or protein expression or in enzyme activities were observed. This indicated that modifying genes, rather than genotype-phenotype effects, play a role in the formation of the malignant tumors. In the second study, Dukes' C stage colorectal tumors with good and bad prognosis were studied using microarray profiling, and a molecular signature separating these two groups with differing prognoses was identified. The study showed that gene expression profiling of surgical samples can predict recurrence in Dukes' C patients. In the third study, serrated colorectal carcinomas, which differ morphologically from conventional colorectal carcinomas, were distinguished from conventional ones using expression microarrays. The separation by unsupervised clustering indicated that serrated tumors differ biologically from conventional ones. Statistical analyses were used to identify key genes with differential expression between these two tumor types and the results were further validated by immunohistochemical analyses. 
A key gene, EPHB2, revealed by the expression data analysis of serrated CRC, was further characterized in the last two studies to find out more about the relevance of this gene to colorectal tumorigenesis. Germline mutations in EPHB2 were found in a few CRC patients, but did not appear to be a major contributor to CRC susceptibility. Aberrant promoter hypermethylation and frameshift mutations in a repetitive tract of the gene were, however, found to be frequent mechanisms of EPHB2 inactivation in CRC. In general, it was observed that the use of combined research methods greatly enhances the power of microarray studies and enables focusing of the analyses. Although the technology is presently used primarily in basic research, clinical applications are foreseeable and slowly emerging. Over the past decade, DNA microarray technology has been rapidly adopted as part of biomedical research and has emerged as a promising method in cancer research. This work examined the use of microarray technology in two cancer types, the hereditary HLRCC syndrome and colorectal cancer (CRC). HLRCC is caused by alterations in the Krebs cycle fumarase gene. Some patients develop uterine leiomyosarcoma or an aggressive, early-onset renal cancer, which is suspected to result from the effect of an as yet unknown gene. CRC is one of the most common cancer types and leading causes of death in Western countries. In the first part of the work, the effect of two different fumarase mutations was studied in a yeast model using microarray analyses and functional experiments. The different mutations were found to have similar effects, pointing to the possible influence of other genes in the development of renal cancer. In the second sub-publication, Dukes' C colorectal tumors with good and poor prognosis were compared using microarray technology. 
These two groups could be separated from each other on the basis of their molecular genetic differences, which showed that gene expression profiling of surgical samples is feasible in these cancers and can predict recurrence of this cancer type in patients. In the third sub-publication, microarray technology was used to distinguish the serrated form of CRC from conventional CRC on the basis of its molecular genetic profile. Genetic profiling divided the two groups into distinct subtypes, pointing to a different biological background for these serrated tumors, which differ structurally from conventional CRC. On the basis of statistical analyses, the genes most differentially expressed in serrated CRC were selected for follow-up immunohistochemical studies, which confirmed the microarray findings. The final sub-publications studied the silencing mechanisms of the EPHB2 gene, which had emerged from the microarray analyses, and its significance in susceptibility to CRC. Germline alterations of the gene were found in a few of the CRC samples studied, but their contribution to CRC susceptibility was judged to be minor. By contrast, gene silencing caused by promoter hypermethylation and by alterations in a repeat sequence in the gene's coding region was found to be common. In general, microarray analyses were found to benefit from other research methods applied in parallel, which made it possible to narrow the targets of study. Although the technology is currently applied mainly in basic research, promising clinical applications are in sight.

    An integrated approach to enhancing functional annotation of sequences for data analysis of a transcriptome

    Given the ever-increasing quantity of sequence data, functional annotation of new gene sequences remains a significant challenge for bioinformatics. This is a particular problem for transcriptomics studies in crop plants, where large genomes and evolutionarily distant model organisms mean that identifying the function of a given gene used on a microarray is often a non-trivial task. Information pertinent to gene annotations is spread across technically and semantically heterogeneous biological databases. Combining and exploiting these data in a consistent way has the potential to improve our ability to assign functions to new or uncharacterised genes. Methods: The Ondex data integration framework was further developed to integrate databases pertinent to plant gene annotation, and provide data inference tools. The CoPSA annotation pipeline was created to provide automated annotation of novel plant genes using this knowledgebase. CoPSA was used to derive annotations for Affymetrix GeneChips available for plant species. A conjoint approach was used to align GeneChip sequences to orthologous proteins, and identify protein domain regions. These proteins and domains were used together with multiple lines of evidence to predict functional annotations for sequences on the GeneChip. Quality was assessed with reference to other annotation pipelines. These improved gene annotations were used in the analysis of a time-series transcriptomics study of the differential responses of durum wheat varieties to water stress. Results and Conclusions: The integration of plant databases using Ondex showed that it was possible to increase the overall quantity and quality of information available, and thereby improve the resulting annotation. Direct data aggregation benefits were observed, as well as new information derived from inference across databases. The CoPSA pipeline was shown to improve coverage of the wheat microarray compared to the NetAffx and BLAST2GO pipelines. 
Leveraging these annotations during the analysis of data from a transcriptomics study of durum wheat water stress responses yielded new biological insights into water stress and highlighted potential candidate genes that could be used by breeders to improve drought response