Clust_100_GE_datasets
<p>100 microarray and RNA-seq gene expression datasets from five model species (human, mouse, fruit fly, arabidopsis plants, and baker's yeast). These datasets represent the benchmark set that was used to test our <em>clust</em> clustering method and to compare it with seven widely used clustering methods (Cross-Clustering, <em>k</em>-means, self-organising maps, MCL, hierarchical clustering, CLICK, and WGCNA). This data resource includes raw data files, pre-processed data files, clustering results, clustering results evaluation, and scripts.</p>
<p>The files are split into eight zipped parts, 100Datasets_0.zip to 100Datasets_7.zip. The contents of all eight zipped files should be extracted to a single folder (e.g. 100Datasets).</p>
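<p>The extraction step can be sketched in Python as follows. This is a minimal illustration, not part of the data resource: the function names <code>part_names</code> and <code>extract_parts</code> are hypothetical, and the sketch assumes that all eight parts sit in one local directory.</p>

```python
import zipfile
from pathlib import Path


def part_names(n_parts=8, stem="100Datasets"):
    """Return the expected part names, 100Datasets_0.zip to 100Datasets_7.zip."""
    return [f"{stem}_{i}.zip" for i in range(n_parts)]


def extract_parts(src_dir, dest_dir, n_parts=8, stem="100Datasets"):
    """Extract every zipped part found in src_dir into the single folder dest_dir."""
    src, dest = Path(src_dir), Path(dest_dir)
    names = part_names(n_parts, stem)

    # Fail early if any part is missing, so the target folder is never half-filled.
    missing = [name for name in names if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing zipped parts: {missing}")

    dest.mkdir(parents=True, exist_ok=True)
    for name in names:
        with zipfile.ZipFile(src / name) as zf:
            zf.extractall(dest)
    return dest
```

<p>For example, <code>extract_parts("downloads", "100Datasets")</code> would place the contents of all parts under 100Datasets/, assuming the parts were downloaded to a folder named downloads.</p>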
<p>Below is a thorough description of the files and folders in this data resource.</p>
<p><strong>Scripts</strong></p>
<p>The scripts used to apply each one of the clustering methods to each one of the 100 datasets and to evaluate their results are all included in the folder (scripts/).</p>
<p><strong>Datasets and clustering results (folders starting with D)</strong></p>
<p>The datasets are labelled as D001 to D100. Each dataset has two folders: D###/ and D###_Res/, where ### is the number of the dataset. The first folder only includes the raw dataset while the second folder includes the results of applying the clustering methods to that dataset. The files ending with _B.tsv include clustering results in the form of a partition matrix. The files ending with _E include metrics evaluating the clustering results. The files ending with _go and _go_E respectively include the enriched GO terms in the clustering results and evaluation metrics of these GO terms. The files ending with _REACTOME and _REACTOME_E are similar to the GO term files but for the REACTOME pathway enrichment analysis. Each of these D###_Res/ folders includes a sub-folder "ParamSweepClust" which includes the results of applying <em>clust</em> multiple times to the same dataset while sweeping some parameters.</p>
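<p>The file-name endings described above can be mapped to content types with a short Python sketch. It is illustrative only: the function name <code>classify_result_file</code> and the example file names are hypothetical, and the sketch assumes that the ending is matched on the file name without its extension.</p>

```python
from pathlib import Path

# Longer endings come first so that "_go_E" is not mistaken for "_E".
SUFFIX_KINDS = [
    ("_REACTOME_E", "REACTOME enrichment evaluation"),
    ("_go_E", "GO term evaluation"),
    ("_REACTOME", "REACTOME pathway enrichment"),
    ("_go", "GO term enrichment"),
    ("_B", "partition matrix"),
    ("_E", "clustering evaluation"),
]


def classify_result_file(filename):
    """Return the content type implied by a results file name, or None."""
    stem = Path(filename).stem  # drops a final extension such as ".tsv", if any
    for suffix, kind in SUFFIX_KINDS:
        if stem.endswith(suffix):
            return kind
    return None
```

<p>Under these assumptions, a file named <code>method_B.tsv</code> would be classified as a partition matrix, while <code>X_merged.tsv</code> would not match any ending.</p>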
<p><strong>Large datasets analysis results</strong></p>
<p>The folder LargeDatasets/ includes data and results for what we refer to as "large" datasets. These are 19 datasets that have more than 50 samples including replicates and have therefore not been included in the set of 100 datasets. However, they fit all of the other dataset selection criteria. We have compared <em>clust</em> with the other clustering methods over these datasets to demonstrate that <em>clust</em> still outperforms the other methods over larger datasets. This folder includes folders LD001/ to LD019/ and LD001_Res/ to LD019_Res/. These have a similar format and contents to the D###/ and D###_Res/ folders described above.</p>
<p><strong>Simultaneous analysis of multiple datasets (folders starting with MD)</strong></p>
<p>As our <em>clust</em> method is designed to be able to extract clusters from multiple datasets simultaneously, we also tested it over multiple datasets. All folders starting with MD_ are related to "multiple datasets (MD)" results. Each MD experiment simultaneously analyses <em>d</em> randomly selected datasets either out of a set of 10 arabidopsis datasets or out of a set of 10 yeast datasets. For each one of the two species, all <em>d</em> values from 2 to 10 were tested, and at each one of these <em>d</em> values, 10 different runs were conducted, where at each run a subset of <em>d</em> datasets is selected randomly.</p>
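<p>The layout of the MD experiments can be sketched as follows. This is an illustration of the design only, not the original selection code: the function name <code>plan_md_runs</code>, the integer dataset indices, and the random seed are all assumptions, and the subsets it draws are not the ones used in this data resource.</p>

```python
import random

# Species codes as used in the MD folder names (A = arabidopsis, Y = yeast).
SPECIES = {"A": "arabidopsis", "Y": "yeast"}


def plan_md_runs(n_datasets=10, d_values=range(2, 11), n_runs=10, seed=0):
    """List (species code, d, run number, subset) for every MD run."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    pool = list(range(1, n_datasets + 1))  # hypothetical dataset indices 1..10
    plan = []
    for code in SPECIES:
        for d in d_values:
            for run in range(1, n_runs + 1):
                # Draw d distinct datasets out of the 10 available ones.
                subset = sorted(rng.sample(pool, d))
                plan.append((code, d, run, subset))
    return plan
```

<p>With two species, nine <em>d</em> values, and 10 runs per value, this layout gives 2 × 9 × 10 = 180 runs in total.</p>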
<p>The folders MD_10A and MD_10Y include the full sets of 10 arabidopsis or 10 yeast datasets, respectively. Each folder with the format MD_10#_d#_Res## includes the results of applying the eight clustering methods at one of the 10 random runs of one of the selected <em>d</em> values. For example, the "MD_10A_d4_Res03/" folder includes the clustering results of the 3<sup>rd</sup> random selection of 4 arabidopsis datasets (the letter A in the folder's name refers to arabidopsis).</p>
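<p>The folder naming convention above can be decoded with a small Python sketch. The function name <code>parse_md_folder</code> and the returned dictionary keys are hypothetical. The pattern assumes that Y denotes yeast, by analogy with A for arabidopsis.</p>

```python
import re

# MD_10<species letter>_d<number of datasets>_Res<run number>
MD_RES = re.compile(r"^MD_10([AY])_d(\d+)_Res(\d+)$")
SPECIES = {"A": "arabidopsis", "Y": "yeast"}


def parse_md_folder(name):
    """Decode an MD results folder name, or return None if it does not match."""
    match = MD_RES.match(name.rstrip("/"))
    if match is None:
        return None
    code, d, run = match.groups()
    return {"species": SPECIES[code], "d": int(d), "run": int(run)}


print(parse_md_folder("MD_10A_d4_Res03/"))
# {'species': 'arabidopsis', 'd': 4, 'run': 3}
```

<p>Folders that hold the full dataset collections, such as MD_10A, do not match the pattern and yield <code>None</code>.</p>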
<p>Our <em>clust</em> method is applied directly over multiple datasets where each dataset is in a separate data file. Each "MD_10#_d#_Res##" folder includes these individual files in a sub-folder named "Processed_Data/". However, the other clustering methods only accept a single input data file. Therefore, the datasets are merged first before being submitted to these methods. Each "MD_10#_d#_Res##" folder includes a file "X_merged.tsv" for the merged data.</p>
<p><strong>Evaluation metrics (folders starting with Metrics)</strong></p>
<p>Each clustering results folder (D###_Res or MD_10#_d#_Res##) includes some clustering evaluation files ending with _E. This information is combined into tables for all datasets, and these tables appear in the folders starting with "Metrics_".</p>
<p><strong>Other files and folders</strong></p>
<p>The GO folder includes the reference GO term annotations for arabidopsis and yeast. Similarly, the REACTOME folder includes the reference REACTOME pathway annotations for arabidopsis and yeast. The Datasets file includes a tab-delimited table describing the 100 datasets. The SearchCriterion file describes the objective methodology used to search the NCBI database to select these 100 datasets. The Specials file includes some special considerations for a couple of datasets that differ slightly from what is described in the SearchCriterion file. The Norm### files and the files in the Reps/ folder describe normalisation codes and replicate structures for the datasets and were fed to the <em>clust</em> method as inputs. The Plots/ folder includes plots of the gene expression profiles of the individual genes in the clusters generated by each one of the eight methods over each one of the 100 datasets. Only up to 14 clusters per method are plotted.</p>
Additional file 8: of In vitro downregulated hypoxia transcriptome is associated with poor prognosis in breast cancer
This Figure shows the survival curves of all of the sub-clusters of C1 and C2 as well as the 51-gene and the 20-gene signatures. Figure 7 in the main manuscript includes the most significant part of this Supplementary Figure. (PDF 91 kb)
Additional file 9: of In vitro downregulated hypoxia transcriptome is associated with poor prognosis in breast cancer
These heat maps or Tables show the correlation amongst the different clusters, sub-clusters, and signatures based on each one of the four considered clinical datasets. (XLSX 19 kb)
Additional file 7: of In vitro downregulated hypoxia transcriptome is associated with poor prognosis in breast cancer
These Tables show the transcription factor analysis of the various clusters, sub-clusters, and hypoxia signatures. (XLSX 544 kb)
Additional file 6: of In vitro downregulated hypoxia transcriptome is associated with poor prognosis in breast cancer
This Figure shows histograms of the expression values of the genes in C1 and C2 based on the TCGA dataset. (PDF 23 kb)
Additional file 10: of In vitro downregulated hypoxia transcriptome is associated with poor prognosis in breast cancer
This text describes further comparisons between the UNCLES method and the iCluster method. (PDF 260 kb)
Correctly assigned genes.
<p>The number of correctly assigned genes on the y-axis is plotted versus the 16 binarization configurations on the x-axis for three representative synthetic datasets out of 60. It should be noted that the binarization configurations are not entirely ordered according to their tightness.</p>
Fuzzy membership values for the CLN3 gene.
<p>Fuzzy membership values for the CLN3 gene.</p>
Comparison between Bi-CoPaM and many existing clustering methods.
<p>Comparison between Bi-CoPaM and many existing clustering methods.</p>
False-positives index (<i>FPI</i>).
<p>False-positives index (<i>FPI</i>) is plotted in log scale versus a subset of binarization configurations for three representative synthetic datasets out of 60. It should be noted that the binarization configurations are not entirely ordered according to their tightness.</p>