SCDT: FC-NNC-structured Complex Decision Technique for Gene Analysis Using Fuzzy Cluster based Nearest Neighbor Classifier
Accurate gene analysis is needed for the classification of many diseases. Selecting the most informative genes is therefore very important, and it requires a decision technique that can cope with a complex, ambiguous context. Traditional methods for selecting the most significant genes include statistical analyses such as the 2-Sample-T-test (2STT), Entropy, and Signal to Noise Ratio (SNR). This paper evaluates gene selection and classification, selecting genes with a structured complex decision technique (SCDT) and classifying them with a fuzzy cluster based nearest neighbor classifier (FC-NNC). The effectiveness of the proposed SCDT and FC-NNC is evaluated using leave-one-out cross-validation (LOOCV) along with sensitivity, specificity, precision and F1-score, with four different classifiers, namely 1) Radial Basis Function (RBF), 2) Multi-layer perceptron (MLP), 3) Feed Forward (FF) and 4) Support vector machine (SVM), on three different datasets: DLBCL, Leukemia and Prostate tumor. The proposed SCDT and FC-NNC exhibit superior results, supporting their use as a more accurate decision mechanism.
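As a rough illustration of two ingredients named in the abstract above, the sketch below ranks genes by a signal-to-noise ratio and computes sensitivity, specificity, precision and F1-score from binary predictions. It is not the paper's SCDT or FC-NNC method. The SNR form used here, (mean1 - mean0) / (sd1 + sd0), is the commonly used definition and is an assumption, as are the function names `snr_scores`, `rank_genes` and `binary_metrics` and the toy data.

```python
from statistics import mean, stdev


def snr_scores(expr, labels):
    """Per-gene signal-to-noise ratio: (mean1 - mean0) / (sd1 + sd0).

    expr   : list of samples, each a list of expression values (one per gene)
    labels : list of 0/1 class labels, one per sample
    Assumes each class has at least two samples (needed for stdev).
    """
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        pos = [row[g] for row, y in zip(expr, labels) if y == 1]
        neg = [row[g] for row, y in zip(expr, labels) if y == 0]
        denom = stdev(pos) + stdev(neg)
        # Guard against constant genes, where both deviations are zero.
        scores.append((mean(pos) - mean(neg)) / denom if denom > 0 else 0.0)
    return scores


def rank_genes(scores):
    """Gene indices ordered from most to least discriminative (by |SNR|)."""
    return sorted(range(len(scores)), key=lambda g: abs(scores[g]), reverse=True)


def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, precision and F1 from 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy data (hypothetical): 4 samples x 2 genes, two samples per class.
    expr = [[5.0, 1.0], [6.0, 1.2], [1.0, 1.1], [2.0, 0.9]]
    labels = [1, 1, 0, 0]
    scores = snr_scores(expr, labels)
    print(rank_genes(scores), scores)
    print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

In a leave-one-out setting, the predictions passed to `binary_metrics` would be collected by holding out each sample in turn, and the gene ranking would be recomputed on the remaining samples inside each fold to avoid selection bias.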
ManiNetCluster: a novel manifold learning approach to reveal the functional links between gene networks.
BACKGROUND: The coordination of genomic functions is a critical and complex process across biological systems such as phenotypes or states (e.g., time, disease, organism, environmental perturbation). Understanding how the complexity of genomic function relates to these states remains a challenge. To address this, we have developed a novel computational method, ManiNetCluster, which simultaneously aligns and clusters gene networks (e.g., co-expression) to systematically reveal the links of genomic function between different conditions. Specifically, ManiNetCluster employs manifold learning to uncover and match local and non-linear structures among networks, and identifies cross-network functional links. RESULTS: We demonstrated that ManiNetCluster better aligns the orthologous genes from their developmental expression profiles across model organisms than state-of-the-art methods (p-value < 2.2×10^-16). This indicates the potential non-linear interactions of evolutionarily conserved genes across species in development. Furthermore, we applied ManiNetCluster to time series transcriptome data measured in the green alga Chlamydomonas reinhardtii to discover the genomic functions linking various metabolic processes between the light and dark periods of a diurnally cycling culture. We identified a number of genes putatively regulating processes across each lighting regime. CONCLUSIONS: ManiNetCluster provides a novel computational tool to uncover the genes linking various functions from different networks, providing new insight on how gene functions coordinate across different conditions. ManiNetCluster is publicly available as an R package at https://github.com/daifengwanglab/ManiNetCluster
Gene Expression Analysis Methods on Microarray Data: A Review
In recent years, a new type of experiment has been changing the way that biologists and other specialists analyze many problems. These are called high-throughput experiments, and the main difference from those performed some years ago lies in the quantity of data obtained from them. Thanks to the technology known generically as microarrays, it is now possible to study in a single experiment the behavior of all the genes of an organism under different conditions. The data generated by these experiments may consist of thousands to millions of variables, and they pose many challenges to the scientists who have to analyze them. Many of these challenges are statistical in nature and will be the focus of this review. Many types of microarrays have been developed to answer different biological questions, and some of them will be explained later. For the sake of simplicity, we start with the best-known ones: expression microarrays
Algorithmic and Statistical Perspectives on Large-Scale Data Analysis
In recent years, ideas from statistics and scientific computing have begun to
interact in increasingly sophisticated and fruitful ways with ideas from
computer science and the theory of algorithms to aid in the development of
improved worst-case algorithms that are useful for large-scale scientific and
Internet data analysis problems. In this chapter, I will describe two recent
examples---one having to do with selecting good columns or features from a (DNA
Single Nucleotide Polymorphism) data matrix, and the other having to do with
selecting good clusters or communities from a data graph (representing a social
or information network)---that drew on ideas from both areas and that may serve
as a model for exploiting complementary algorithmic and statistical
perspectives in order to solve applied large-scale data analysis problems.
Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors,
"Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201
Cell Type Classification Via Deep Learning On Single-Cell Gene Expression Data
Single-cell sequencing is a recently advanced, revolutionary technology that enables researchers to obtain genomic, transcriptomic, or multi-omics information through gene expression analysis. Compared to traditional sequencing methods, it offers the advantage of analyzing highly heterogeneous cell type information, and it is gaining popularity in the biomedical area. Moreover, this analysis can help with early diagnosis and drug development for tumor cells and cancer cell types. In the workflow of gene expression data profiling, identification of the cell types is an important task, but it faces many challenges such as the curse of dimensionality, sparsity, batch effect, and overfitting. These challenges can be mitigated by a feature selection technique that reduces the feature dimensions and keeps the more relevant features. In this research work, a recurrent neural network-based feature selection model is proposed to extract relevant features from high-dimensional, low-sample-size data. Moreover, a deep learning-based gene embedding model is also proposed to reduce the sparsity of single-cell data for cell type identification. The proposed frameworks have been implemented with different architectures of recurrent neural networks and demonstrated on real-world micro-array datasets and single-cell RNA-seq data, where the proposed models were observed to perform better than other feature selection models. A semi-supervised model is also implemented using the same gene embedding workflow, since labeling data is very cumbersome and time consuming and requires manual effort and expertise in the field. Therefore, different ratios of labeled data are used in the experiment to validate the concept. Experimental results show that the proposed semi-supervised approach, via the gene embedding concept, achieves very encouraging performance even though only a limited amount of labeled data is used.
In addition, a graph attention based autoencoder model has also been studied to learn the latent features by incorporating prior knowledge with gene expression data for cell type classification.
Index Terms — Single-Cell Gene Expression Data, Gene Embedding, Semi-Supervised model, Incorporate Prior Knowledge, Gene-gene Interaction Network, Deep Learning, Graph Auto Encode
Multiple Instance Learning: A Survey of Problem Characteristics and Applications
Multiple instance learning (MIL) is a form of weakly supervised learning
where training instances are arranged in sets, called bags, and a label is
provided for the entire bag. This formulation is gaining interest because it
naturally fits various problems and allows weakly labeled data to be leveraged.
Consequently, it has been used in diverse application fields such as computer
vision and document classification. However, learning from bags raises
important challenges that are unique to MIL. This paper provides a
comprehensive survey of the characteristics which define and differentiate the
types of MIL problems. Until now, these problem characteristics have not been
formally identified and described. As a result, the variations in performance
of MIL algorithms from one data set to another are difficult to explain. In
this paper, MIL problem characteristics are grouped into four broad categories:
the composition of the bags, the types of data distribution, the ambiguity of
instance labels, and the task to be performed. Methods specialized to address
each category are reviewed. Then, the extent to which these characteristics
manifest themselves in key MIL application areas is described. Finally,
experiments are conducted to compare the performance of 16 state-of-the-art MIL
methods on selected problem characteristics. This paper provides insight on how
the problem characteristics affect MIL algorithms, recommendations for future
benchmarking and promising avenues for research
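The bag formulation described in the abstract above can be sketched as a small data structure. The example assumes the standard MIL assumption (a bag is positive if at least one of its instances is positive), which the abstract does not state; the names `Bag`, `predict_bag` and `toy_score`, and the 0.5 threshold, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Bag:
    """A set of instances that shares a single bag-level label."""
    instances: List[Sequence[float]]
    label: int  # 1 = positive bag, 0 = negative bag; instance labels are unknown


def predict_bag(
    bag: Bag,
    score: Callable[[Sequence[float]], float],
    threshold: float = 0.5,
) -> int:
    """Bag prediction under the standard MIL assumption.

    The bag is called positive when its highest-scoring instance
    reaches the threshold, i.e. when at least one instance looks positive.
    """
    best = max(score(x) for x in bag.instances)
    return int(best >= threshold)


def toy_score(x: Sequence[float]) -> float:
    """Hypothetical instance scorer: uses the first feature as the score."""
    return x[0]


if __name__ == "__main__":
    bags = [
        Bag(instances=[[0.1, 0.3], [0.9, 0.2], [0.2, 0.8]], label=1),
        Bag(instances=[[0.1, 0.4], [0.3, 0.6]], label=0),
    ]
    for b in bags:
        print(b.label, predict_bag(b, toy_score))
```

Other bag-level assumptions, for example requiring a minimum fraction of positive instances, would replace the `max` aggregation with a different pooling rule.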
Lung Cancer Genomic Signatures
Background: Lung cancer (LC) is the dominant cause of death by cancer in the world, being responsible for more than a million deaths annually. It is a highly lethal, common tumor that is frequently diagnosed in advanced stages for which effective alternative therapeutics do not exist. In view of this, there is an urgent need to improve the diagnostic, prognostic, and therapeutic classification systems, currently based on clinicopathological criteria that do not adequately translate the enormous biologic complexity of this disease. Methods: The advent of the human genome sequencing project and the concurrent development of many genomic-based technologies have allowed scientists to explore the possibility of using expression profiles to identify homogeneous tumor subtypes, new prognostic factors of human cancer, response to a particular treatment, etc., and thereby select the best possible therapies while decreasing the risk of toxicities for the patients. Therefore, it is becoming increasingly important to identify the complete catalog of genes that are altered in cancer and to discriminate tumors accurately on the basis of their genetic background. Results and Discussion: In this article, we present some of the work that has applied high-throughput technologies to LC research. In addition, we give an overview of recent results in the field of LC genomics and their effect on patient care, and discuss challenges and potential future developments in this area
Doctor of Philosophy
Dissertation. Statistical learning theory has garnered attention during the last decade because it provides the theoretical and mathematical framework for solving pattern recognition problems, such as dimensionality reduction, clustering, and shape analysis. In statis
- …