12 research outputs found

    Selection of informative clusters from hierarchical cluster tree with gene classes

    BACKGROUND: Hierarchical clustering is a common method in the analysis of gene expression data. The analysis usually involves selecting clusters by cutting the tree at a suitable level and/or analysing a sorted gene list obtained from the tree. Cutting the hierarchical tree requires choosing a suitable level and loses the information on the other levels. Sorted gene lists depend on the method used to sort the joined clusters. The author proposes that the clusters should instead be selected using gene classifications. RESULTS: This article presents a simple method for searching a cluster tree for the clusters with the strongest enrichment of gene classes. The clusters found are presented in the estimated order of importance. The method is demonstrated with a yeast gene expression data set and two database classifications. The obtained clusters showed a very strong enrichment of functional classes. They reproduce gene groups similar to those observed in the original analysis of the data set and also reveal many gene groups that were not reported there. Visualization of the results on top of a cluster tree shows that the method finds informative clusters from several levels of the tree and indicates that the clusters found could not have been obtained by simply cutting the tree. The results were also used to compare cluster trees from different clustering methods. CONCLUSION: The presented method should facilitate the exploratory analysis of large data sets when associated categorical data are available.
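
The cluster search described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: it assumes a hypergeometric upper-tail test as the enrichment score, a binary tree written as nested tuples, and exactly one class per gene. The function names (`hypergeom_tail`, `rank_clusters`) and the toy genes and classes are hypothetical.

```python
from math import comb, log10


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K belong to the class."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def leaves(node):
    """Genes under a tree node (a gene name or a (left, right) tuple)."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def subtrees(node):
    """Yield every internal node of the tree, i.e. every candidate cluster."""
    if isinstance(node, tuple):
        yield node
        yield from subtrees(node[0])
        yield from subtrees(node[1])


def rank_clusters(tree, gene_class, min_size=2):
    """Score each cluster by its best class enrichment; strongest first."""
    all_genes = leaves(tree)
    N = len(all_genes)
    class_sizes = {}
    for gene in all_genes:
        class_sizes[gene_class[gene]] = class_sizes.get(gene_class[gene], 0) + 1

    results = []
    for node in subtrees(tree):
        members = leaves(node)
        n = len(members)
        if n < min_size or n == N:  # skip tiny clusters and the root
            continue
        p_value, best_class = min(
            (hypergeom_tail(sum(gene_class[g] == c for g in members), n, K, N), c)
            for c, K in class_sizes.items()
        )
        results.append((-log10(p_value), best_class, sorted(members)))
    return sorted(results, reverse=True)


# Toy data (hypothetical): eight genes, three classes.
tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
gene_class = {"a": "ribosome", "b": "ribosome", "c": "ribosome", "d": "ribosome",
              "e": "cycle", "f": "cycle", "g": "other", "h": "other"}
ranked = rank_clusters(tree, gene_class)
for score, cls, members in ranked:
    print(f"{score:.2f}  {cls:9s} {members}")
```

Because every internal node is scored, clusters from different depths of the tree compete in one ranked list, which is the point the abstract makes against cutting the tree at a single level.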

    Novel comparison of evaluation metrics for gene ontology classifiers reveals drastic performance differences

    Author summary: In the biosciences, predictive methods are becoming increasingly necessary as novel sequences are generated at an ever-increasing rate. The volume of sequence data necessitates Automated Function Prediction (AFP) as manual curation is often impossible. Unfortunately, selecting the best AFP method is complicated by researchers using different evaluation metrics. Furthermore, many commonly used metrics can give misleading results. We argue that the use of poor metrics in AFP evaluation is a result of the lack of methods to benchmark the metrics themselves. We propose an approach called Artificial Dilution Series (ADS). ADS uses existing data sets to generate multiple artificial AFP results, where each result has a controlled error rate. We use ADS to understand whether different metrics can distinguish between results with known quantities of error. Our results highlight dramatic differences in performance between evaluation metrics.
    Automated protein annotation using the Gene Ontology (GO) plays an important role in the biosciences. Evaluation has always been considered central to developing novel annotation methods, but little attention has been paid to the evaluation metrics themselves. Evaluation metrics define how well an annotation method performs and allow methods to be ranked against one another. Unfortunately, most of these metrics were adopted from the machine learning literature without establishing whether they were appropriate for GO annotations. We propose a novel approach for comparing GO evaluation metrics called Artificial Dilution Series (ADS). Our approach uses existing annotation data to generate a series of annotation sets with different levels of correctness (referred to as their signal level). We calculate the evaluation metric being tested for each annotation set in the series, allowing us to identify whether it can separate different signal levels. 
Finally, we contrast these results with several false positive annotation sets, which are designed to expose systematic weaknesses in GO assessment. We compared 37 evaluation metrics for GO annotation using ADS and identified drastic differences between metrics. We show that some metrics struggle to differentiate between different signal levels, while others give erroneously high scores to the false positive data sets. Based on our findings, we provide guidelines on which evaluation metrics perform well with the Gene Ontology and propose improvements to several well-known evaluation metrics. In general, we argue that evaluation metrics should be tested for their performance and we provide software for this purpose (). ADS is applicable to other areas of science where the evaluation of prediction results is non-trivial.
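
The dilution idea in the abstract above can be illustrated with a small sketch. It is a simplified assumption, not the published ADS procedure: here a fixed fraction of each protein's true terms is kept and the rest are swapped for randomly drawn wrong terms, and mean Jaccard similarity stands in for the metric under test. The names `dilute` and `mean_jaccard` and the term labels `T1` to `T12` are hypothetical.

```python
import random


def dilute(truth, vocabulary, signal, rng):
    """Keep a `signal` fraction of each protein's true terms; replace the rest with wrong ones."""
    diluted = {}
    for protein, terms in truth.items():
        ordered = sorted(terms)
        n_keep = round(signal * len(ordered))
        kept = set(rng.sample(ordered, n_keep))
        # Noise is drawn only from terms that are NOT true for this protein,
        # so the error rate of the artificial result is controlled exactly.
        wrong_pool = sorted(set(vocabulary) - terms)
        noise = set(rng.sample(wrong_pool, len(ordered) - n_keep))
        diluted[protein] = kept | noise
    return diluted


def mean_jaccard(truth, predicted):
    """Example metric under test: mean Jaccard similarity over proteins."""
    scores = [len(truth[p] & predicted[p]) / len(truth[p] | predicted[p]) for p in truth]
    return sum(scores) / len(scores)


# Toy annotation data (hypothetical).
truth = {"p1": {"T1", "T2", "T3", "T4"}, "p2": {"T5", "T6"}}
vocabulary = [f"T{i}" for i in range(1, 13)]
rng = random.Random(0)

signal_levels = [0.0, 0.5, 1.0]
scores = [mean_jaccard(truth, dilute(truth, vocabulary, s, rng)) for s in signal_levels]

# A usable metric should score higher signal levels strictly higher.
separates = all(a < b for a, b in zip(scores, scores[1:]))
print(dict(zip(signal_levels, scores)), separates)
```

Any other metric with the same `(truth, predicted)` signature could be passed through the same series to check whether it separates the signal levels.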

    Mlh1 deficiency in normal mouse colon mucosa associates with chromosomally unstable colon cancer

    The colorectal cancer (CRC) genome is unstable, and different types of instability, such as chromosomal instability (CIN) and microsatellite instability (MSI), are thought to reflect distinct cancer-initiating mechanisms. Although 85% of sporadic CRCs reveal CIN, 15% reveal mismatch repair (MMR) malfunction and MSI, the hallmarks of Lynch syndrome with inherited heterozygous germline mutations in MMR genes. Our study was designed to comprehensively follow genome-wide expression changes and their implications during colon tumorigenesis. We conducted a long-term feeding experiment in the mouse to address expression changes arising in histologically normal colonic mucosa as putative cancer-preceding events, and the effect of inherited predisposition (Mlh1(+/-)) and Western-style diet (WD) on them. During the 21-month experiment, carcinomas developed mainly in WD-fed mice and were evenly distributed between genotypes. Unexpectedly, the heterozygote (B6.129-Mlh1tm1Rak) mice did not show MSI in their CRCs. Instead, both wildtype and heterozygote CRC mice showed a distinct mRNA expression profile and a shortage of several chromosomal segregation gene-specific transcripts (Mlh1, Bub1, Mis18a, Tpx2, Rad9a, Pms2, Cenpe, Ncapd3, Odf2 and Dclre1b) in their colon mucosa, as well as increased mitotic activity and abundant unbalanced/atypical mitoses in tumours. Our genome-wide expression profiling experiment demonstrates that cancer-preceding changes are already seen in histologically normal colon mucosa and that decreased expression of Mlh1 and other chromosomal segregation genes may form a field defect in the mucosa, which triggers MMR-proficient, chromosomally unstable CRC.

    Selection of informative clusters from hierarchical cluster tree with gene classes-6

    Copyright information: Taken from "Selection of informative clusters from hierarchical cluster tree with gene classes". BMC Bioinformatics 2004;5():32-32. Published online 25 Mar 2004. PMCID:PMC407846. Copyright © 2004 Toronen; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.
    axis) and complete linkage (Y-axis) cluster trees. The figure also includes the x = y line to ease the comparison. Notice that although many clusters show quite similar results, the clusters with bigger negative log-p-values show bigger log-p-values in the average linkage method. Red circles show the comparison of results obtained with SGD classes and blue circles show the same comparison obtained with MIPS classes. B. Comparison of negative log-p-values for correlating clusters between Ward's method (X-axis) and complete linkage (Y-axis) cluster trees. Note the similarity to the scatter in part A. C. Comparison of negative log-p-values for correlating clusters between average linkage (X-axis) and Ward's method (Y-axis) cluster trees. Here the scatter differs from the other scatters (A and B), with clusters showing more scatter along the x = y line.

    Selection of informative clusters from hierarchical cluster tree with gene classes-5

    pair in each of the clusters from one randomization for average results with MIPS classes are shown. The peak at zero results from clusters that are too small, as they are automatically given the value 0 (see results). B. 99th percentile log-p-values from different randomizations for all methods (red = complete, green = Ward, blue = average) and for both the SGD (marked with '*', three higher profiles in the plot) and MIPS (marked with 'o', three lower profiles in the plot) classifications. Note that the correct value for the 99th percentile would be 2. C. Previous 99th percentile values after Bonferroni correction. Different results are marked as before.

    Selection of informative clusters from hierarchical cluster tree with gene classes-2

    are presented for comparison with fig. . Clusters are linked to table 4 [] by the ordinal numbers in the data file. Three coloured cluster groups are the same as in figure and they are shown in detail in table (SGD/GO classes). Notice that although the clusters are placed differently than in fig. , similar functional classes are often associated with the same cluster tree branches. The tree is analyzed in more detail in the text. The figure is also shown as a postscript file for separate zooming and printing in figure 9 [see ].

    Selection of informative clusters from hierarchical cluster tree with gene classes-1

    are linked to table by the ordinal numbers in table . Three coloured cluster groups are: blue, protein synthesis; red, energy and carbohydrate metabolism; green, cell cycle, differentiation and growth, nucleus, chromosome structure and mRNA processing. The rest of the clusters are black. Coloured groups are shown in detail in table (MIPS classes). The figure is also shown as a postscript file for separate zooming and printing in figure 8 []. The tree is analyzed in more detail in the text.

    Analysis of gene expression data using self-organizing maps

    DNA microarray technologies, together with rapidly increasing genomic sequence information, are leading to an explosion in available gene expression data. Currently there is a great need for efficient methods to analyze and visualize these massive data sets. A self-organizing map (SOM) is an unsupervised neural network learning algorithm which has been successfully used for the analysis and organization of large data files. Here we have applied the SOM algorithm to published yeast gene expression data and show that SOM is an excellent tool for the analysis and visualization of gene expression profiles.