197,840 research outputs found
SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fuzzy Cluster based Nearest Neighbor Classifier
In many diseases classification an accurate gene analysis is needed, for which selection of most informative genes is very important and it require a technique of decision in complex context of ambiguity. The traditional methods include for selecting most significant gene includes some of the statistical analysis namely 2-Sample-T-test (2STT), Entropy, Signal to Noise Ratio (SNR). This paper evaluates gene selection and classification on the basis of accurate gene selection using structured complex decision technique (SCDT) and classifies it using fuzzy cluster based nearest neighborclassifier (FC-NNC). The effectiveness of the proposed SCDT and FC-NNC is evaluated for leave one out cross validation metric(LOOCV) along with sensitivity, specificity, precision and F1-score with four different classifiers namely 1) Radial Basis Function (RBF), 2) Multi-layer perception(MLP), 3) Feed Forward(FF) and 4) Support vector machine(SVM) for three different datasets of DLBCL, Leukemia and Prostate tumor. The proposed SCDT &FC-NNC exhibits superior result for being considered more accurate decision mechanism
Mining SOM expression portraits: Feature selection and integrating concepts of molecular function
Background: 
Self organizing maps (SOM) enable the straightforward portraying of high-dimensional data of large sample collections in terms of sample-specific images. The analysis of their texture provides so-called spot-clusters of co-expressed genes which require subsequent significance filtering and functional interpretation. We address feature selection in terms of the gene ranking problem and the interpretation of the obtained spot-related lists using concepts of molecular function.

Results: 
Different expression scores based either on simple fold change-measures or on regularized Students t-statistics are applied to spot-related gene lists and compared with special emphasis on the error characteristics of microarray expression data. The spot-clusters are analyzed using different methods of gene set enrichment analysis with the focus on overexpression and/or overrepresentation of predefined sets of genes. Metagene-related overrepresentation of selected gene sets was mapped into the SOM images to assign gene function to different regions. Alternatively we estimated set-related overexpression profiles over all samples studied using a gene set enrichment score. It was also applied to the spot-clusters to generate lists of enriched gene sets. We used the tissue body index data set, a collection of expression data of human tissues, as an illustrative example. We found that tissue related spots typically contain enriched populations of gene sets well corresponding to molecular processes in the respective tissues. In addition, we display special sets of housekeeping and of consistently weak and highly expressed genes using SOM data filtering. 

Conclusions:
The presented methods allow the comprehensive downstream analysis of SOM-transformed expression data in terms of cluster-related gene lists and enriched gene sets for functional interpretation. SOM clustering implies the ability to define either new gene sets using selected SOM spots or to verify and/or to amend existing ones
Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes
BACKGROUND: In the clinical context, samples assayed by microarray are often classified by cell line or tumour type and it is of interest to discover a set of genes that can be used as class predictors. The leukemia dataset of Golub et al. [1] and the NCI60 dataset of Ross et al. [2] present multiclass classification problems where three tumour types and nine cell lines respectively must be identified. We apply an evolutionary algorithm to identify the near-optimal set of predictive genes that classify the data. We also examine the initial gene selection step whereby the most informative genes are selected from the genes assayed. RESULTS: In the absence of feature selection, classification accuracy on the training data is typically good, but not replicated on the testing data. Gene selection using the RankGene software [3] is shown to significantly improve performance on the testing data. Further, we show that the choice of feature selection criteria can have a significant effect on accuracy. The evolutionary algorithm is shown to perform stably across the space of possible parameter settings – indicating the robustness of the approach. We assess performance using a low variance estimation technique, and present an analysis of the genes most often selected as predictors. CONCLUSION: The computational methods we have developed perform robustly and accurately, and yield results in accord with clinical knowledge: A Z-score analysis of the genes most frequently selected identifies genes known to discriminate AML and Pre-T ALL leukemia. This study also confirms that significantly different sets of genes are found to be most discriminatory as the sample classes are refined
Recommended from our members
The robust selection of predictive genes via a simple classifier
Identifying genes that direct the mechanism of a disease from expression data is extremely useful in understanding how that mechanism works.
This in turn may lead to better diagnoses and potentially can lead to a cure for that disease. This task becomes extremely challenging when the
data are characterised by only a small number of samples and a high number of dimensions, as it is often the case with gene expression data.
Motivated by this challenge, we present a general framework that focuses on simplicity and data perturbation. These are the keys for the robust
identification of the most predictive features in such data. Within this framework, we propose a simple selective na¨ıve Bayes classifier discovered using a global search technique, and combine it with data perturbation to
increase its robustness to small sample sizes.
An extensive validation of the method was carried out using two applied datasets from the field of microarrays and a simulated dataset, all
confounded by small sample sizes and high dimensionality. The method has been shown capable of identifying genes previously confirmed or associated with prostate cancer and viral infections
Stable Feature Selection for Biomarker Discovery
Feature selection techniques have been used as the workhorse in biomarker
discovery applications for a long time. Surprisingly, the stability of feature
selection with respect to sampling variations has long been under-considered.
It is only until recently that this issue has received more and more attention.
In this article, we review existing stable feature selection methods for
biomarker discovery using a generic hierarchal framework. We have two
objectives: (1) providing an overview on this new yet fast growing topic for a
convenient reference; (2) categorizing existing methods under an expandable
framework for future research and development
Strong Purifying Selection at Synonymous Sites in D. melanogaster
Synonymous sites are generally assumed to be subject to weak selective
constraint. For this reason, they are often neglected as a possible source of
important functional variation. We use site frequency spectra from deep
population sequencing data to show that, contrary to this expectation, 22% of
four-fold synonymous (4D) sites in D. melanogaster evolve under very strong
selective constraint while few, if any, appear to be under weak constraint.
Linking polymorphism with divergence data, we further find that the fraction of
synonymous sites exposed to strong purifying selection is higher for those
positions that show slower evolution on the Drosophila phylogeny. The function
underlying the inferred strong constraint appears to be separate from splicing
enhancers, nucleosome positioning, and the translational optimization
generating canonical codon bias. The fraction of synonymous sites under strong
constraint within a gene correlates well with gene expression, particularly in
the mid-late embryo, pupae, and adult developmental stages. Genes enriched in
strongly constrained synonymous sites tend to be particularly functionally
important and are often involved in key developmental pathways. Given that the
observed widespread constraint acting on synonymous sites is likely not limited
to Drosophila, the role of synonymous sites in genetic disease and adaptation
should be reevaluated
Gene ranking and biomarker discovery under correlation
Biomarker discovery and gene ranking is a standard task in genomic high
throughput analysis. Typically, the ordering of markers is based on a
stabilized variant of the t-score, such as the moderated t or the SAM
statistic. However, these procedures ignore gene-gene correlations, which may
have a profound impact on the gene orderings and on the power of the subsequent
tests.
We propose a simple procedure that adjusts gene-wise t-statistics to take
account of correlations among genes. The resulting correlation-adjusted
t-scores ("cat" scores) are derived from a predictive perspective, i.e. as a
score for variable selection to discriminate group membership in two-class
linear discriminant analysis. In the absence of correlation the cat score
reduces to the standard t-score. Moreover, using the cat score it is
straightforward to evaluate groups of features (i.e. gene sets). For
computation of the cat score from small sample data we propose a shrinkage
procedure. In a comparative study comprising six different synthetic and
empirical correlation structures we show that the cat score improves estimation
of gene orderings and leads to higher power for fixed true discovery rate, and
vice versa. Finally, we also illustrate the cat score by analyzing metabolomic
data.
The shrinkage cat score is implemented in the R package "st" available from
URL http://cran.r-project.org/web/packages/st/Comment: 18 pages, 5 figures, 1 tabl
Differential gene expression graphs: A data structure for classification in DNA microarrays
This paper proposes an innovative data structure to be used as a backbone in designing microarray phenotype sample classifiers. The data structure is based on graphs and it is built from a differential analysis of the expression levels of healthy and diseased tissue samples in a microarray dataset. The proposed data structure is built in such a way that, by construction, it shows a number of properties that are perfectly suited to address several problems like feature extraction, clustering, and classificatio
Friendship and Natural Selection
More than any other species, humans form social ties to individuals who are
neither kin nor mates, and these ties tend to be with similar people. Here, we
show that this similarity extends to genotypes. Across the whole genome,
friends' genotypes at the SNP level tend to be positively correlated
(homophilic); however, certain genotypes are negatively correlated
(heterophilic). A focused gene set analysis suggests that some of the overall
correlation can be explained by specific systems; for example, an olfactory
gene set is homophilic and an immune system gene set is heterophilic. Finally,
homophilic genotypes exhibit significantly higher measures of positive
selection, suggesting that, on average, they may yield a synergistic fitness
advantage that has been helping to drive recent human evolution
- …