Feature subset selection problem on microarray data
Recent advances in technology gave birth to tools such as microarray chips. Microarray chips enable scientists to measure the amount of protein production from the genes in a cell, known as gene expression data. The classification of cell samples by means of their gene expression data is an active research area. The data used for the analysis are massive, so the number of features, i.e., the genes, must be reduced to a reasonable level because of the computational cost of experiments and the risk of being misled by irrelevant genes. Analyses based on the classification of cell samples therefore usually include a feature subset selection phase. This thesis aims to develop a tool that can be used during the feature subset selection phase of such analyses. Three novel algorithms are proposed for the gene selection problem, based on basic association rule mining. The first algorithm starts with fuzzy partitioning of the gene expression data and discovers highly confident IF-THEN rules that enable the classification of sample tissues. The second algorithm searches the possible IF-THEN rules using a heuristic pruning approach based on the beam search algorithm. Finally, the third algorithm focuses on the hierarchical information carried by gene expressions by constructing decision trees based on different performance measures. We obtained satisfactory results on the Leukemia dataset. In addition, on the colon cancer dataset, the algorithm based on the construction of decision trees showed good performance.
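As a rough illustration of what a "highly confident IF-THEN rule" means in the abstract above, the sketch below computes the support and confidence of a rule of the form "IF the expression of a gene exceeds a threshold THEN the sample belongs to a class". The function name, gene names, threshold, and toy values are all invented for illustration, and crisp thresholding stands in for the fuzzy partitioning the thesis describes; this is not the thesis's algorithm.

```python
from typing import Dict, List, Tuple


def rule_stats(
    samples: List[Dict[str, float]],
    labels: List[str],
    gene: str,
    threshold: float,
    target: str,
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule
    IF sample[gene] > threshold THEN label == target."""
    n = len(samples)
    # Labels of the samples that satisfy the IF part of the rule.
    covered = [lab for s, lab in zip(samples, labels) if s[gene] > threshold]
    # Samples that satisfy both the IF part and the THEN part.
    hits = sum(1 for lab in covered if lab == target)
    support = hits / n if n else 0.0
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence


# Hypothetical toy data: four samples, two genes, two classes.
samples = [
    {"geneA": 2.5, "geneB": 0.3},
    {"geneA": 2.1, "geneB": 1.9},
    {"geneA": 0.4, "geneB": 1.7},
    {"geneA": 0.2, "geneB": 0.1},
]
labels = ["ALL", "ALL", "AML", "AML"]

support_a, confidence_a = rule_stats(samples, labels, "geneA", 1.0, "ALL")
support_b, confidence_b = rule_stats(samples, labels, "geneB", 1.0, "ALL")
```

On this toy data the geneA rule covers only ALL samples, so a selection step that ranks genes by rule confidence would prefer geneA over geneB.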
Integrative Model-based clustering of microarray methylation and expression data
In many fields, researchers are interested in large and complex biological
processes. Two important examples are gene expression and DNA methylation in
genetics. One key problem is to identify aberrant patterns of these processes
and discover biologically distinct groups. In this article we develop a
model-based method for clustering such data. The basis of our method involves
the construction of a likelihood for any given partition of the subjects. We
introduce cluster specific latent indicators that, along with some standard
assumptions, impose a specific mixture distribution on each cluster. Estimation
is carried out using the EM algorithm. The methods extend naturally to multiple
data types of a similar nature, which leads to an integrated analysis over
multiple data platforms, resulting in higher discriminating power.
Comment: Published at http://dx.doi.org/10.1214/11-AOAS533 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
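The clustering abstract above says estimation is carried out with the EM algorithm. For readers unfamiliar with it, here is a minimal textbook sketch of EM for a two-component one-dimensional Gaussian mixture. It is not the authors' model (their likelihood uses cluster-specific latent indicators over partitions of subjects); the function names, initialisation, and toy data are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(
    data: List[float], iters: int = 50
) -> Tuple[List[float], List[float], List[float]]:
    """Fit a two-component 1-D Gaussian mixture; return (weights, means, variances)."""
    means = [min(data), max(data)]  # crude initialisation from the data range
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the parameters from the weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # guard against a collapsing component
    return weights, means, variances


# Hypothetical toy data with two well-separated groups.
data = [0.1, -0.2, 0.0, 0.3, 5.1, 4.8, 5.0, 5.2]
weights, means, variances = em_two_gaussians(data)
```

Each point's cluster can then be read off as the component with the larger posterior probability.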
A cDNA Microarray Gene Expression Data Classifier for Clinical Diagnostics Based on Graph Theory
Despite great advances in discovering cancer molecular profiles, the proper application of microarray technology to routine clinical diagnostics is still a challenge. Current practices in the classification of microarray data show two main limitations: the reliability of the training data sets used to build the classifiers, and the performance of the classifiers, especially when the sample to be classified does not belong to any of the available classes. In this case, state-of-the-art algorithms usually produce a high rate of false positives that, in real diagnostic applications, are unacceptable. To address this problem, this paper presents a new cDNA microarray data classification algorithm that is based on graph theory and is able to overcome most of the limitations of known classification methodologies. The classifier works by analyzing gene expression data organized in an innovative data structure based on graphs, where vertices correspond to genes and edges to gene expression relationships. To demonstrate the novelty of the proposed approach, the authors present an experimental performance comparison between the proposed classifier and several state-of-the-art classification algorithms.
The robust selection of predictive genes via a simple classifier
Identifying genes that direct the mechanism of a disease from expression data is extremely useful in understanding how that mechanism works.
This in turn may lead to better diagnoses and potentially can lead to a cure for that disease. This task becomes extremely challenging when the
data are characterised by only a small number of samples and a high number of dimensions, as is often the case with gene expression data.
Motivated by this challenge, we present a general framework that focuses on simplicity and data perturbation. These are the keys for the robust
identification of the most predictive features in such data. Within this framework, we propose a simple selective naïve Bayes classifier discovered using a global search technique, and combine it with data perturbation to
increase its robustness to small sample sizes.
An extensive validation of the method was carried out using two applied datasets from the field of microarrays and a simulated dataset, all
confounded by small sample sizes and high dimensionality. The method has been shown to be capable of identifying genes previously confirmed or associated with prostate cancer and viral infections.
Multi-test Decision Tree and its Application to Microarray Data Classification
Objective:
A desirable property of tools used to investigate biological data is that they produce
easy-to-understand models and predictive decisions.
Decision trees are particularly promising in this regard due to their comprehensible nature, which resembles the hierarchical process of human decision making. However, existing algorithms for learning decision trees have a tendency to underfit gene expression data. The main aim of this work is to improve the performance and stability of decision trees with only a small increase in their complexity.
Methods:
We propose a multi-test decision tree (MTDT); our main contribution is the application of several univariate tests in each non-terminal node of the decision tree. We also search for alternative, lower-ranked features in order to obtain more stable and reliable predictions.
Results:
Experimental validation was performed on several real-life gene expression datasets. Comparison results with eight classifiers show that MTDT has a statistically significantly higher accuracy than popular decision tree classifiers, and it was highly competitive with ensemble learning algorithms. The proposed solution managed to outperform its baseline algorithm on datasets by an average percent. A study performed on one of the datasets showed that the discovered genes used in the MTDT classification model
are supported by biological evidence in the literature.
Conclusion:
This paper introduces a new type of decision tree which is more suitable for solving biological problems.
MTDTs are relatively easy to analyze and much more powerful in modeling high-dimensional microarray data than their popular counterparts.
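To make the idea of "several univariate tests in each non-terminal node" concrete, the sketch below routes a sample at a single node by letting each threshold test vote. The abstract does not say how the tests are combined, so the majority-vote rule here is an assumption of this sketch, as are the function name, gene names, and thresholds.

```python
from typing import Dict, List, Tuple

# A univariate test: (feature name, threshold). It passes when value <= threshold.
Test = Tuple[str, float]


def multi_test_branch(sample: Dict[str, float], tests: List[Test]) -> str:
    """Route a sample at one node: 'left' if a strict majority of the
    univariate tests pass, otherwise 'right' (assumed voting rule)."""
    votes_left = sum(1 for feature, threshold in tests if sample[feature] <= threshold)
    return "left" if votes_left * 2 > len(tests) else "right"


# Hypothetical node with three tests on three genes.
node_tests: List[Test] = [("geneA", 1.0), ("geneB", 0.5), ("geneC", 2.0)]

clean = {"geneA": 0.4, "geneB": 0.2, "geneC": 1.1}  # all three tests pass
noisy = {"geneA": 0.4, "geneB": 0.9, "geneC": 1.1}  # geneB disagrees
other = {"geneA": 1.8, "geneB": 0.9, "geneC": 1.1}  # only geneC passes

branch_clean = multi_test_branch(clean, node_tests)
branch_noisy = multi_test_branch(noisy, node_tests)
branch_other = multi_test_branch(other, node_tests)
```

Under this rule a single noisy gene does not change the routing, which is one way extra tests per node could stabilise predictions.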
Multi-TGDR: a regularization method for multi-class classification in microarray experiments
Background
With microarray technology becoming mature and popular, the selection and use
of a small number of relevant genes for accurate classification of samples is a
hot topic in the circles of biostatistics and bioinformatics. However, most of
the developed algorithms lack the ability to handle multiple classes, which
is arguably a common application. Here, we propose an extension to an existing
regularization algorithm called Threshold Gradient Descent Regularization
(TGDR) to specifically tackle multi-class classification of microarray data.
When there are several microarray experiments addressing the same/similar
objectives, one option is to use the meta-analysis version of TGDR (Meta-TGDR),
which considers the classification task as a combination of classifiers with the
same structure/model while allowing the parameters to vary across studies.
However, the original Meta-TGDR extension did not offer a solution for
prediction on independent samples. Here, we propose an explicit method to
estimate the overall coefficients of the biomarkers selected by Meta-TGDR. This
extension permits broader applicability and allows a comparison between the
predictive performance of Meta-TGDR and TGDR using an independent testing set.
Results
Using real-world applications, we demonstrated that the proposed multi-TGDR
framework works well and that the number of selected genes is less than the sum of
all individualized binary TGDRs. Additionally, Meta-TGDR and TGDR on the
batch-effect adjusted pooled data provided approximately the same results. By
adding a bagging procedure in each application, stability and good predictive
performance are ensured.
Conclusions
Compared with Meta-TGDR, TGDR is less computationally intensive and does not
require samples of all classes in each study. On the adjusted data, it has
approximately the same predictive performance as Meta-TGDR. Thus, it is highly
recommended.
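For orientation, the sketch below shows the basic threshold gradient descent update on a binary logistic model: at each step only the coefficients whose gradient magnitude is within a factor tau of the largest one are moved, which is what yields sparse gene selection. This is a simplified binary illustration, not the multi-class or meta-analysis extensions proposed in the abstract; the function name, step size, iteration count, and toy data are assumptions.

```python
import math
from typing import List


def tgdr_logistic(
    X: List[List[float]],
    y: List[int],
    tau: float = 0.9,
    step: float = 0.1,
    iters: int = 100,
) -> List[float]:
    """Thresholded gradient ascent on the logistic log-likelihood (no intercept)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        # Gradient of the log-likelihood with respect to each coefficient.
        grad = [0.0] * p
        for row, label in zip(X, y):
            z = sum(b * v for b, v in zip(beta, row))
            prob = 1.0 / (1.0 + math.exp(-z))
            for j in range(p):
                grad[j] += (label - prob) * row[j]
        largest = max(abs(g) for g in grad)
        if largest == 0.0:
            break
        # Threshold step: update only coefficients with a large enough gradient.
        for j in range(p):
            if abs(grad[j]) >= tau * largest:
                beta[j] += step * grad[j]
    return beta


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[1.0, 0.1], [0.9, -0.2], [1.1, 0.0], [-1.0, 0.2], [-0.8, -0.1], [-1.2, 0.1]]
y = [1, 1, 1, 0, 0, 0]

beta_sparse = tgdr_logistic(X, y, tau=0.9)  # thresholded updates
beta_dense = tgdr_logistic(X, y, tau=0.0)   # tau = 0 reduces to plain gradient ascent
```

With the high threshold, the noise feature's coefficient is never updated and stays at zero, while plain gradient ascent assigns it a nonzero weight.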