159 research outputs found

    Gene selection for classification of microarray data based on the Bayes error

    Background: With DNA microarray data, selecting a compact subset of discriminative genes from thousands of genes is a critical step for accurate classification of phenotypes for, e.g., disease diagnosis. Several widely used gene selection methods select top-ranked genes according to their individual discriminative power in classifying samples into distinct categories, without considering correlations among genes. A limitation of these methods is that they may produce gene sets with some redundancy and yield an unnecessarily large number of candidate genes for classification analyses. Some recent studies show that incorporating gene-to-gene correlations into gene selection can remove redundant genes and improve classification accuracy. Results: In this study, we propose a new method, Based Bayes error Filter (BBF), to select relevant genes and remove redundant genes in classification analyses of microarray data. The effectiveness and accuracy of this method are demonstrated through analyses of five publicly available microarray datasets. The results show that our gene selection method achieves better accuracies than previous studies while effectively selecting relevant genes, removing redundant genes and obtaining small, efficient gene sets for sample classification. Conclusion: The proposed method can effectively identify a compact set of genes with high classification accuracy. This study also indicates that application of the Bayes error is a feasible and effective way of removing redundant genes in gene selection.
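
    To make the role of the Bayes error as a gene-ranking criterion concrete, here is a minimal sketch. It does not reproduce BBF, whose estimation procedure the abstract does not describe. It assumes, purely for illustration, that a single gene's expression in two classes is Gaussian with a shared standard deviation and equal class priors, in which case the Bayes error has the closed form Φ(−|μ_a − μ_b| / (2σ)). The function name and the example values are hypothetical.

```python
import math


def gaussian_bayes_error(mu_a, mu_b, sigma):
    """Bayes error for one gene under an ASSUMED model: two Gaussian
    classes with means mu_a and mu_b, shared standard deviation sigma,
    and equal priors. Not the estimator used by BBF."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = abs(mu_a - mu_b) / (2.0 * sigma)
    # Phi(-z) expressed with the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical genes: (class-A mean, class-B mean, shared std. dev.).
genes = {
    "gene_1": (0.0, 0.0, 1.0),  # identical classes, error 0.5
    "gene_2": (0.0, 1.0, 1.0),
    "gene_3": (0.0, 2.0, 1.0),
}

# A lower Bayes error means the gene separates the classes better.
ranking = sorted(genes, key=lambda g: gaussian_bayes_error(*genes[g]))
```

    Ranking by this quantity only captures the relevance of individual genes; removing redundant genes, as the abstract describes, would need a criterion evaluated on pairs or sets of genes.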

    Endovascular Stent Treatment for Symptomatic Benign Iliofemoral Venous Occlusive Disease: Long-Term Results 1987–2009

    Venous stenting has been shown to effectively treat iliofemoral venous obstruction with good short- and mid-term results. The aim of this study was to investigate long-term clinical outcome and stent patency. Twenty patients were treated with venous stenting for benign disease at our institution between 1987 and 2005. Fifteen of 20 patients (15 female, mean age at time of stent implantation 38 years [range 18–66]) returned for a clinical visit, a plain X-ray of the stent, and a Duplex ultrasound. Four patients were lost to follow-up, and one patient died 277 months after stent placement, although a good clinical result was documented 267 months after stent placement. Mean follow-up after stent placement was 167.8 months (13.9 years) (range 71 months [6 years] to 267 months [22 years]). No patient needed an additional venous intervention after stent implantation. No significant difference was seen between the circumference of the thigh on the stented side (mean 55.1 cm [range 47.0–70.0]) and that of the contralateral thigh (mean 54.9 cm [range 47.0–70.0]) (p = 0.684). There was a nonsignificant trend toward higher flow velocities within the stent (mean 30.8 cm/s [range 10.0–48.0]) than in the corresponding vein segment on the contralateral side (mean 25.2 cm/s [range 12.0–47.0]) (p = 0.065). Stent integrity was confirmed in 14 of 15 cases. Only one stent showed a fracture, as documented on X-ray, without any impairment of flow. Venous stenting using Wallstents showed excellent long-term clinical outcome and primary patency rate.

    A comparison of four clustering methods for brain expression microarray data

    Background: DNA microarrays, which determine the expression levels of tens of thousands of genes from a sample, are an important research tool. However, the volume of data they produce can be an obstacle to interpretation of the results. Clustering the genes on the basis of similarity of their expression profiles can simplify the data, and potentially provides an important source of biological inference, but these methods have not been tested systematically on datasets from complex human tissues. In this paper, four clustering methods, CRC, k-means, ISA and memISA, are applied to three brain expression datasets. The results are compared on speed, gene coverage and GO enrichment. The effects of combining the clusters produced by each method are also assessed. Results: k-means outperforms the other methods, with 100% gene coverage and GO enrichments only slightly exceeded by memISA and ISA. Those two methods produce greater GO enrichments on the datasets used, but at the cost of much lower gene coverage, fewer clusters produced, and lower speed. The clusters they find are largely different from those produced by k-means. Combining clusters produced by k-means and memISA or ISA leads to increased GO enrichment and number of clusters produced (compared to k-means alone), without negatively impacting gene coverage. memISA can also find potentially disease-related clusters. In two independent dorsolateral prefrontal cortex datasets, it finds three overlapping clusters that are either enriched for genes associated with schizophrenia, genes differentially expressed in schizophrenia, or both. Two of these clusters are enriched for genes of the MAP kinase pathway, suggesting a possible role for this pathway in the aetiology of schizophrenia. Conclusion: Considered alone, k-means clustering is the most effective of the four methods on typical microarray brain expression datasets. However, memISA and ISA can add extra high-quality clusters to the set produced by k-means, so combining these three methods is the method of choice.
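
    As an illustration of clustering genes by the similarity of their expression profiles, the following is a minimal k-means sketch (Lloyd's algorithm) in plain Python. It is not the implementation evaluated in the paper. The deterministic farthest-first initialisation, the function names and the toy profiles are all assumptions made for the example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(profiles, k, n_iter=100):
    """Cluster gene expression profiles (one list of values per gene).

    Initialisation is farthest-first (an assumption for this sketch):
    start from the first profile, then repeatedly add the profile that
    is farthest from its nearest already chosen centroid.
    """
    centroids = [list(profiles[0])]
    while len(centroids) < k:
        far = max(profiles, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = [None] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each gene joins its nearest centroid.
        new_labels = []
        for p in profiles:
            dists = [sq_dist(p, c) for c in centroids]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy data: six genes measured in three samples.
toy_profiles = [
    [0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.0, 0.1],
    [5.0, 5.1, 4.9], [5.1, 5.0, 5.0], [4.9, 5.0, 5.1],
]
toy_labels, toy_centroids = kmeans(toy_profiles, k=2)
```

    Because every gene receives a label, k-means places all genes in a cluster by construction, which is consistent with the 100% gene coverage reported above (assuming coverage means the fraction of genes assigned to at least one cluster).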

    ANMM4CBR: a case-based reasoning method for gene expression data classification

    Background: Accurate classification of microarray data is critical for successful clinical diagnosis and treatment. The "curse of dimensionality" problem and noise in the data, however, undermine the performance of many algorithms. Method: In order to obtain a robust classifier, a novel Additive Nonparametric Margin Maximum for Case-Based Reasoning (ANMM4CBR) method is proposed in this article. ANMM4CBR employs a case-based reasoning (CBR) method for classification. CBR is a suitable paradigm for microarray analysis, where the rules that define the domain knowledge are difficult to obtain because usually only a small number of training samples are available. Moreover, in order to select the most informative genes, we propose to perform feature selection via additively optimizing a nonparametric margin maximum criterion, which is defined based on gene pre-selection and sample clustering. Our feature selection method is very robust to noise in the data. Results: The effectiveness of our method is demonstrated on both simulated and real data sets. We show that the ANMM4CBR method performs better than some state-of-the-art methods such as support vector machine (SVM) and k-nearest neighbor (kNN), especially when the data contain a high level of noise. Availability: The source code is attached as an additional file of this paper.

    Genome-wide common and rare variant analysis provides novel insights into clozapine-associated neutropenia

    The antipsychotic clozapine is uniquely effective in the management of schizophrenia; however, its use is limited by its potential to induce agranulocytosis. The causes of this, and of its precursor neutropenia, are largely unknown, although genetic factors have an important role. We sought risk alleles for clozapine-associated neutropenia in a sample of 66 cases and 5583 clozapine-treated controls, through a genome-wide association study (GWAS), imputed human leukocyte antigen (HLA) alleles, exome array and copy-number variation (CNV) analyses. We then combined associated variants in a meta-analysis with data from the Clozapine-Induced Agranulocytosis Consortium (up to 163 cases and 7970 controls). In the largest combined sample to date, we identified a novel association with rs149104283 (odds ratio (OR)=4.32, P=1.79 × 10^−8), intronic to transcripts of SLCO1B3 and SLCO1B7, members of a family of hepatic transporter genes previously implicated in adverse drug reactions including simvastatin-induced myopathy and docetaxel-induced neutropenia. Exome array analysis identified gene-wide associations of uncommon non-synonymous variants within UBAP2 and STARD9. We additionally provide independent replication of a previously identified variant in HLA-DQB1 (OR=15.6, P=0.015, positive predictive value=35.1%). These results implicate biological pathways through which clozapine may act to cause this serious adverse effect.

    Multiclass classification of microarray data samples with a reduced number of genes

    Background: Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in bioinformatics research. The problem gets harder as the number of classes increases. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates of the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead either to computationally demanding explorations of a search space with thousands of dimensions or to classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained. Results: A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples. Conclusions: Comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples.

    A voting approach to identify a small number of highly predictive genes using multiple classifiers

    Background: Microarray gene expression profiling has provided extensive datasets that can describe characteristics of cancer patients. An important challenge for this type of data is the discovery of gene sets which can be used as the basis for developing a clinical predictor for cancer. It is desirable that such gene sets be compact, give accurate predictions across many classifiers, be biologically relevant and have good biological process coverage. Results: By using a new type of multiple classifier voting approach, we have identified gene sets that can predict breast cancer prognosis accurately for a range of classification algorithms. Unlike a wrapper approach, our method is not specialised towards a single classification technique. Experimental analysis demonstrates higher prediction accuracies for our sets of genes compared to previous work in the area. Moreover, our sets of genes are generally more compact than those previously proposed. From a biological viewpoint, most of the genes in our sets are known from the literature to be strongly related to cancer. Conclusion: We show that it is possible to obtain superior classification accuracy with our approach and obtain a compact gene set that is also biologically relevant and has good coverage of different biological processes.

    MALDI Profiling of Human Lung Cancer Subtypes

    Proteomics is expected to play a key role in cancer biomarker discovery. Although it has become feasible to rapidly analyze proteins from crude cell extracts using mass spectrometry, complex sample composition hampers this type of measurement. Therefore, for effective proteome analysis, it becomes critical to enrich samples for the analytes of interest. Although one-third of the proteins in eukaryotic cells are thought to be phosphorylated at some point in their life cycle, only a low percentage of intracellular proteins is phosphorylated at a given time. In this work, we have applied chromatographic phosphopeptide enrichment techniques to reduce the complexity of human clinical samples. A novel method for high-throughput peptide profiling of human tumor samples, using Parallel IMAC and MALDI-TOF MS, is described. We have applied this methodology to analyze human normal and cancer lung samples in the search for new biomarkers. Using a highly reproducible spectral processing algorithm to produce peptide mass profiles with minimal variability across the samples, linear discriminant-based and decision tree–based classification models were generated. These models can distinguish normal from tumor samples, as well as differentiate the various non–small cell lung cancer histological subtypes. A novel, optimized sample preparation method and a careful data acquisition strategy are described for high-throughput peptide profiling of small amounts of human normal lung and lung cancer samples. We show that the appropriate combination of peptide expression values is able to discriminate normal lung from non–small cell lung cancer samples and among different histological subtypes. Our study emphasizes the great potential of proteomics in the molecular characterization of cancer.

    Very Important Pool (VIP) genes – an application for microarray-based molecular signatures

    Background: Advances in DNA microarray technology portend that molecular signatures derived from microarrays will eventually be used in clinical environments and personalized medicine. Derivation of biomarkers is a large step beyond hypothesis generation and imposes considerably more stringency for accuracy in identifying informative gene subsets to differentiate phenotypes. The inherent nature of microarray data, with few samples and replicates compared to the large number of genes, requires identifying informative genes prior to classifier construction. However, improving the ability to identify differentiating genes remains a challenge in bioinformatics. Results: A new hybrid gene selection approach was investigated and tested with nine publicly available microarray datasets. The new method identifies a Very Important Pool (VIP) of genes from the broad patterns of gene expression data. The method uses a bagging sampling principle, where the re-sampled arrays are used to identify the most informative genes. Frequency of selection is used in a repetitive process to identify the VIP genes. The putative informative genes are selected using two methods, t-statistic and discriminatory analysis. In the t-statistic, the informative genes are identified based on p-values. In the discriminatory analysis, disjoint Principal Component Analyses (PCAs) are conducted for each class of samples, and genes with high discrimination power (DP) are identified. The VIP gene selection approach was compared with the p-value ranking approach. The genes identified by the VIP method but not by the p-value ranking approach are also related to the disease investigated. More importantly, these genes are part of the pathways derived from the common genes shared by both the VIP and p-value ranking methods. Moreover, the binary classifiers built from these genes are statistically equivalent to those built from the top 50 p-value ranked genes in distinguishing different types of samples. Conclusion: The VIP gene selection approach could identify additional subsets of informative genes that would not always be selected by the p-value ranking method. These genes are likely to be additional true positives since they are a part of pathways identified by the p-value ranking method and expected to be related to the relevant biology. Therefore, these additional genes derived from the VIP method potentially provide valuable biological insights.
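
    The bagging and frequency-of-selection idea described in this abstract can be sketched roughly as follows. This is a simplified illustration, not the published VIP procedure: it covers only the t-statistic branch, ranks genes by the absolute value of Welch's t-statistic instead of computing p-values, and omits the disjoint-PCA discriminatory analysis. The function names, the `top_k` and `n_rounds` parameters and the toy arrays are assumptions.

```python
import math
import random
from collections import Counter


def welch_t(a, b):
    """Welch's t-statistic for two lists of expression values."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    denom = math.sqrt(var_a / len(a) + var_b / len(b))
    return 0.0 if denom == 0 else (mean_a - mean_b) / denom


def selection_frequency(class_a, class_b, top_k, n_rounds, seed=0):
    """Fraction of bootstrap rounds in which each gene ranks in the top_k.

    class_a and class_b are lists of arrays (samples); each array is a
    list with one expression value per gene. Each class needs at least
    two arrays.
    """
    rng = random.Random(seed)
    n_genes = len(class_a[0])
    counts = Counter()
    for _ in range(n_rounds):
        # Resample the arrays of each class with replacement (bagging).
        boot_a = [rng.choice(class_a) for _ in class_a]
        boot_b = [rng.choice(class_b) for _ in class_b]
        scores = []
        for g in range(n_genes):
            t = welch_t([s[g] for s in boot_a], [s[g] for s in boot_b])
            scores.append((abs(t), g))
        scores.sort(reverse=True)
        for _, g in scores[:top_k]:
            counts[g] += 1
    return {g: counts[g] / n_rounds for g in range(n_genes)}


# Hypothetical toy arrays: gene 0 differs between classes, genes 1-2 do not.
toy_a = [[5.0, 1.0, 2.0], [5.2, 1.1, 1.9], [4.9, 0.9, 2.1],
         [5.1, 1.0, 2.0], [5.0, 1.1, 2.0]]
toy_b = [[1.0, 1.0, 2.1], [1.2, 0.9, 2.0], [0.9, 1.1, 1.9],
         [1.1, 1.0, 2.0], [1.0, 0.9, 2.0]]
toy_freq = selection_frequency(toy_a, toy_b, top_k=1, n_rounds=50)
```

    Genes whose selection frequency exceeds a chosen threshold would form the candidate pool. The abstract does not state the threshold or the number of resampling rounds, so both are left to the caller here.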

    Absence of a specific radiation signature in post-Chernobyl thyroid cancers

    Thyroid cancers have been the main medical consequence of the Chernobyl accident. On the basis of their pathological features and of the fact that a large proportion of them demonstrate RET-PTC translocations, these cancers are considered similar to classical sporadic papillary carcinomas, although molecular alterations differ between the two tumours. We analysed gene expression in post-Chernobyl cancers and sporadic papillary carcinomas and compared them with autonomous adenomas used as controls. Unsupervised clustering of these data did not distinguish between the cancers, but separated both cancers from adenomas. No gene signature separating sporadic from post-Chernobyl PTC (chPTC) could be found using supervised and unsupervised classification methods, although such a signature is demonstrated for cancers and adenomas. Furthermore, we demonstrate that pooled RNA from sporadic and chPTC are as strongly correlated as two independent sporadic PTC pools, one from Europe and one from the US, involving patients not exposed to Chernobyl radiation. This result relies on both cDNA and Affymetrix microarrays; thus, platform-specific artifacts are controlled for. Our findings suggest the absence of a radiation fingerprint in the chPTC and support the concept that post-Chernobyl cancer data, for which the cancer-causing event and its date are known, are a unique source of information to study naturally occurring papillary carcinomas.