261 research outputs found

    Supervised Relevance-Redundancy assessments for feature selection in omics-based classification scenarios

    Background and objective: Many classification tasks in translational bioinformatics and genomics are characterized by the high dimensionality of potential features and an unbalanced sample distribution among classes. This can affect classifier robustness and increase the risk of overfitting, the curse of dimensionality, and generalization leaks; furthermore, and most importantly, it can prevent obtaining the adequate patient stratification required for precision medicine in facing complex diseases such as cancer. Setting up a feature selection strategy able to extract only properly predictive features, by removing irrelevant, redundant, and noisy ones, is crucial to achieving valuable results on the desired task. Methods: We propose a new feature selection approach, called ReRa, based on supervised Relevance-Redundancy assessments. ReRa consists of a customized step of relevance-based filtering, to identify a reduced subset of meaningful features, followed by a supervised similarity-based procedure to minimize redundancy. This latter step innovatively uses a combination of global and class-specific similarity assessments to remove redundant features while preserving those differentiated across classes, even when these classes are strongly unbalanced. Results: We compared ReRa with several existing feature selection methods to obtain feature spaces on which to perform breast cancer patient subtyping using several classifiers; we considered two use cases, based on gene or transcript isoform expression. In the vast majority of the assessed scenarios, when using ReRa-selected feature spaces, performance increased significantly compared to simple feature filtering, LASSO regularization, or even mRMR, another Relevance-Redundancy method. The two use cases represent an insightful example of translational application, taking advantage of ReRa's capabilities to investigate and enhance a clinically relevant patient stratification task, which could easily be applied to other cancer types and diseases as well. Conclusions: The ReRa approach has the potential to improve the performance of machine learning models used in unbalanced classification scenarios. Compared to another Relevance-Redundancy approach such as mRMR, ReRa does not require tuning the number of preserved features, ensures efficiency and scalability over huge initial dimensionalities, and allows re-evaluation of all previously selected features at each iteration of the redundancy assessment, ultimately preserving only the most relevant and class-differentiated features.
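
    The abstract above describes the method only at a high level; as a minimal, generic sketch of the relevance-redundancy idea (not the published ReRa code), the snippet below assumes a numeric feature matrix X and class labels y, ranks features by ANOVA F-score, and then greedily drops candidates that are highly correlated with features already kept.

```python
# Minimal relevance-redundancy filter (generic sketch, not the published ReRa code).
# Assumes X: (n_samples, n_features) numeric array and y: class labels.
import numpy as np
from sklearn.feature_selection import f_classif

def relevance_redundancy_select(X, y, n_relevant=200, corr_threshold=0.9):
    # Relevance step: rank features by ANOVA F-score against the class labels.
    f_scores, _ = f_classif(X, y)
    ranked = np.argsort(np.nan_to_num(f_scores))[::-1][:n_relevant]

    # Redundancy step: keep a feature only if it is not highly correlated
    # with any feature already kept (most relevant features are tried first).
    kept = []
    for idx in ranked:
        if all(abs(np.corrcoef(X[:, idx], X[:, k])[0, 1]) <= corr_threshold
               for k in kept):
            kept.append(idx)
    return np.array(kept)
```

    Note that ReRa's redundancy step additionally combines global and class-specific similarity so that class-differentiated features survive; the sketch applies only a single global correlation criterion.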

    DNA Sequence Classification: It’s Easier Than You Think: An open-source k-mer based machine learning tool for fast and accurate classification of a variety of genomic datasets

    Supervised classification of genomic sequences is a challenging, well-studied problem with a variety of important applications. We propose an open-source, supervised, alignment-free, highly general method for sequence classification that operates on k-mer proportions of DNA sequences. This method was implemented in a fully standalone general-purpose software package called Kameris, publicly available under a permissive open-source license. Compared to competing software, ours provides key advantages in terms of data security and privacy, transparency, and reproducibility. We perform a detailed study of its accuracy and performance on a wide variety of classification tasks, including virus subtyping, taxonomic classification, and human haplogroup assignment. We demonstrate the success of our method on whole mitochondrial, nuclear, plastid, plasmid, and viral genomes, as well as randomly sampled eukaryote genomes and transcriptomes. Further, we perform head-to-head evaluations on the tasks of HIV-1 virus subtyping and bacterial taxonomic classification with a number of competing state-of-the-art software solutions, and show that we match or exceed all other tested software in terms of accuracy and speed.
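
    Purely to illustrate the k-mer-proportion representation such alignment-free tools operate on (this is not the Kameris implementation, and the toy sequences and model below are hypothetical choices), a minimal sketch:

```python
# Toy illustration of k-mer proportion features for supervised sequence
# classification (generic sketch, not the Kameris implementation).
from itertools import product
import numpy as np
from sklearn.svm import LinearSVC

def kmer_proportions(seq, k=3):
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:            # skip k-mers containing N or other symbols
            counts[index[kmer]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

# Hypothetical toy data: two "classes" of short sequences.
seqs = ["ACGTACGTACGT", "ACGTACGAACGT", "GGGCCCGGGCCC", "GGCCCGGGCCGG"]
labels = [0, 0, 1, 1]
X = np.vstack([kmer_proportions(s) for s in seqs])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))
```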

    An embedded two-layer feature selection approach for microarray data analysis

    Feature selection is an important technique in dealing with application problems with a large number of variables and limited training samples, such as image processing, combinatorial chemistry, and microarray analysis. Commonly employed feature selection strategies can be divided into filter and wrapper methods. In this study, we propose an embedded two-layer feature selection approach that combines the advantages of filter and wrapper algorithms while avoiding their drawbacks. The hybrid algorithm, called GAEF (Genetic Algorithm with embedded filter), divides the feature selection process into two stages. In the first stage, a Genetic Algorithm (GA) is employed to pre-select features, while in the second stage a filter selector is used to further identify a small feature subset for accurate sample classification. Three benchmark microarray datasets are used to evaluate the proposed algorithm. The experimental results suggest that this embedded two-layer feature selection strategy is able to improve the stability of the selection results as well as the sample classification accuracy.
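
    A minimal sketch of the two-stage idea, GA pre-selection followed by a univariate filter, is shown below; it is not the published GAEF code, and it assumes a feature matrix X and labels y with at least three samples per class, using a k-NN classifier as a stand-in fitness function.

```python
# Two-stage sketch: GA pre-selection, then a univariate filter
# (generic illustration, not the published GAEF code).
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # Cross-validated accuracy of a simple classifier on the selected features.
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(KNeighborsClassifier(3), X[:, mask], y, cv=3).mean()

def ga_preselect(X, y, pop_size=20, generations=10, p_feature=0.1):
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < p_feature              # random feature masks
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]  # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)
            child = np.concatenate([a[:cut], b[cut:]])        # one-point crossover
            children.append(child ^ (rng.random(n) < 0.01))   # bit-flip mutation
        pop = np.vstack([parents, children])
    best = pop[np.argmax([fitness(ind, X, y) for ind in pop])]
    return np.flatnonzero(best)

def filter_refine(X, y, candidate_idx, n_final=10):
    # Second stage: univariate filter over the GA-preselected candidates.
    f_scores, _ = f_classif(X[:, candidate_idx], y)
    return candidate_idx[np.argsort(np.nan_to_num(f_scores))[::-1][:n_final]]
```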

    Consensus clustering applied to multi-omics disease subtyping

    Background: Facing the diversity of omics data and the difficulty of selecting one result over all those produced by several methods, consensus strategies have the potential to reconcile multiple inputs and to produce robust results. Results: Here, we introduce ClustOmics, a generic consensus clustering tool that we use in the context of cancer subtyping. ClustOmics relies on a non-relational graph database, which allows for the simultaneous integration of both multiple omics data and results from various clustering methods. This new tool reconciles input clusterings, regardless of their origin, number, size, or shape. ClustOmics implements an intuitive and flexible strategy based upon the idea of evidence accumulation clustering: it computes co-occurrences of pairs of samples in input clusters and uses this score as a similarity measure to reorganize data into consensus clusters. Conclusion: We applied ClustOmics to multi-omics disease subtyping on real TCGA cancer data from ten different cancer types. We showed that ClustOmics is robust to heterogeneous qualities of input partitions, smoothing and reconciling preliminary predictions into high-quality consensus clusters, from both a computational and a biological point of view. The comparison to a state-of-the-art consensus-based integration tool, COCA, further corroborated this statement. However, the main interest of ClustOmics is not to compete with other tools, but rather to benefit from their various predictions when no gold-standard metric is available to assess their significance. Availability: The ClustOmics source code, released under the MIT license, and the results obtained on TCGA cancer data are available on GitHub: https://github.com/galadrielbriere/ClustOmics
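
    The evidence accumulation step can be illustrated with a minimal, generic sketch (not the ClustOmics implementation, which additionally relies on a graph database and richer integration rules): pairwise co-occurrence frequencies across the input partitions serve as a similarity, and consensus clusters are cut from an average-linkage tree over the corresponding distances.

```python
# Evidence accumulation consensus clustering (generic sketch, not ClustOmics).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def consensus_clusters(partitions, n_clusters):
    """partitions: list of 1-D cluster label arrays over the same samples."""
    labels = [np.asarray(p) for p in partitions]
    co = sum((lab[:, None] == lab[None, :]).astype(float) for lab in labels)
    co /= len(labels)                     # co-occurrence frequency in [0, 1]
    dist = 1.0 - co                       # turn the similarity into a distance
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Hypothetical input: three partitions of six samples from different methods.
parts = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
print(consensus_clusters(parts, n_clusters=2))
```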

    A Robust Unified Graph Model Based on Molecular Data Binning for Subtype Discovery in High-dimensional Spaces

    Machine learning (ML) is a subfield of artificial intelligence (AI) that has already revolutionised the world around us. It is a widely employed process for discovering patterns and groups within datasets, with a wide range of applications including disease subtyping, which aims to discover intrinsic subtypes of disease in large-scale unlabelled data. Whilst the groups discovered in multi-view high-dimensional data by ML algorithms are promising, the capacity of these algorithms to identify pertinent and meaningful groups is limited by the presence of data variability and outliers. Since outlier values represent potential but unlikely outcomes, they are statistically and philosophically fascinating. Therefore, the primary aim of this thesis was to propose a robust approach that discovers meaningful groups while accounting for data variability and outliers. To achieve this aim, a novel robust approach (ROMDEX) was developed that utilised the proposed intermediate graph models (IMGs) for robust computation of proximity between observations in the data. Finally, a robust multi-view graph-based clustering approach was developed based on ROMDEX that improved the discovery of meaningful groups hidden behind the noise in the data. The proposed approach was validated on real-world and synthetic data for disease subtyping, and its stability was assessed by evaluating its performance across different levels of noise in the clustering data. The results were evaluated through Kaplan-Meier survival time analysis for disease subtyping. The concordance index (CI) and normalised mutual information (NMI) were used to evaluate the predictive ability of the proposed clustering model, and the accuracy, Kappa statistic, and Rand index were computed to evaluate clustering stability against various levels of Gaussian noise. The proposed approach outperformed the existing state-of-the-art approaches MRGC, PINS, SNF, Consensus Clustering, and iCluster+ on these datasets. The findings for all datasets were outstanding, demonstrating the predictive ability of the proposed unsupervised graph-based clustering approach.
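
    As a small, generic illustration of external validation with some of the metrics named above (the labels are hypothetical, and the concordance index, which requires survival times, is omitted):

```python
# External validation of a clustering against reference labels (generic sketch
# with hypothetical labels).
from sklearn.metrics import (adjusted_rand_score, cohen_kappa_score,
                             normalized_mutual_info_score)

reference = [0, 0, 0, 1, 1, 2, 2, 2]     # e.g. known subtypes
predicted = [0, 0, 1, 1, 1, 2, 2, 0]     # e.g. clusters found by a method

print("NMI  :", normalized_mutual_info_score(reference, predicted))
print("ARI  :", adjusted_rand_score(reference, predicted))
# Kappa assumes the cluster labels have been matched to the reference labels.
print("Kappa:", cohen_kappa_score(reference, predicted))
```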

    Examining applying high performance genetic data feature selection and classification algorithms for colon cancer diagnosis

    Background and Objectives: This paper examines the accuracy and efficiency (time complexity) of high-performance genetic data feature selection and classification algorithms for colon cancer diagnosis. The need for this research derives from the urgent and increasing need for accurate and efficient algorithms. Colon cancer is a leading cause of death worldwide; hence it is vitally important for cancer tissues to be expertly identified and classified in a rapid and timely manner, both to ensure fast detection of the disease and to expedite the drug discovery process. Methods: In this research, a three-phase approach was proposed and implemented: Phases One and Two examined the feature selection and classification algorithms separately, and Phase Three examined the performance of their combination. Results: It was found from Phase One that the Particle Swarm Optimization (PSO) algorithm performed best on the colon dataset as a feature selection method (29 genes selected), and from Phase Two that the Support Vector Machine (SVM) algorithm outperformed the other classifiers, with an accuracy of almost 86%. It was also found from Phase Three that the combined use of PSO and SVM surpassed other algorithms in accuracy and performance, and was faster in terms of time analysis (94%). Conclusions: It is concluded that applying feature selection algorithms prior to classification algorithms results in better accuracy than when the latter are applied alone. This conclusion is important and significant to industry and society.
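
    A minimal sketch of a PSO-plus-SVM wrapper of the kind evaluated here is given below; it is a generic binary PSO, not the exact pipeline or parameter settings from the paper, and it assumes a feature matrix X and labels y with at least three samples per class.

```python
# Binary-PSO feature selection wrapped around an SVM (generic sketch).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def svm_fitness(mask, X, y):
    cols = mask.astype(bool)
    if not cols.any():
        return 0.0
    return cross_val_score(SVC(kernel="linear"), X[:, cols], y, cv=3).mean()

def binary_pso(X, y, n_particles=15, iters=20, w=0.7, c1=1.5, c2=1.5):
    n = X.shape[1]
    pos = (rng.random((n_particles, n)) < 0.1).astype(float)   # feature masks
    vel = rng.normal(0.0, 0.1, (n_particles, n))
    pbest = pos.copy()
    pbest_fit = np.array([svm_fitness(p, X, y) for p in pos])
    gbest = pbest[np.argmax(pbest_fit)]
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, n))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        # Sigmoid transfer function turns velocities into bit probabilities.
        pos = (rng.random((n_particles, n)) < 1.0 / (1.0 + np.exp(-vel))).astype(float)
        fit = np.array([svm_fitness(p, X, y) for p in pos])
        better = fit > pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fit[better]
        gbest = pbest[np.argmax(pbest_fit)]
    return np.flatnonzero(gbest)
```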

    Computational Approaches to Assessing Clinical Relevance Of Preclinical Cancer Models

    Preclinical cancer models, such as tumour-derived cell lines and animal models, are essential in cancer research. Consistently used as a platform to investigate mechanisms of action, they can identify potential biomarkers prior to clinical trials, where similar exploration is more complicated and expensive. However, whilst cell lines are the most widely used preclinical model, their applicability in certain settings is questioned because of the difficulty of aligning the appropriate cell lines with a clinically relevant disease segment. I developed a methodology for systematic cancer cell line scoring based on patient sample subtypes and analysis of the causative elements of subtype differentiation in cancer. Machine learning classifiers that I tailored to the multi-omics nature of cancer were highly accurate in predicting the subtype of new patient samples. Applying those models to cancer cell lines resulted in a clinically based cancer cell line relevance score. The majority of cell line scores were in line with the literature, but there were several misclassified cell lines. Exploring the causative elements of the underlying biology, I confirmed the oncogenic nature of the features driving the classification. Additionally, through differential expression analysis, the nature of some of the misclassified breast cancer cell lines was elucidated: they were poorly representative of their receptor-positive type despite expressing the HER2 receptor. One of those cell lines, JIMT-1, has been shown to be resistant to HER2-targeted treatment, making my model's misclassification more clinically relevant than the cell line's receptor status itself. Through several distance metrics I expanded on the binary nature of the classifying methods and identified more and less suitable cell lines not just by their score, but also by how close they are to the patient samples. The core aspects of my methodology have been implemented as an online tool, a Shiny application, in order to allow others to leverage my methods and findings.
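
    A minimal sketch of the general scoring idea, with entirely hypothetical data and a random forest standing in for the tailored multi-omics classifiers: a subtype classifier trained on patient profiles scores each cell line by its prediction confidence, and a simple distance to the assigned subtype's patients gives the complementary view mentioned above.

```python
# Scoring hypothetical cell lines against patient-derived subtypes (generic
# sketch; data and the random-forest model are illustrative stand-ins).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
patients = rng.normal(size=(100, 50))        # hypothetical patient profiles
subtypes = rng.integers(0, 3, size=100)      # hypothetical known subtypes
cell_lines = rng.normal(size=(10, 50))       # hypothetical cell line profiles

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(patients, subtypes)

# Relevance score: the classifier's confidence in the predicted subtype.
proba = clf.predict_proba(cell_lines)
assigned = clf.classes_[proba.argmax(axis=1)]
score = proba.max(axis=1)

# Complementary distance view: mean distance from each cell line to the
# patients of its assigned subtype (smaller means closer to the patient group).
for i, (sub, s) in enumerate(zip(assigned, score)):
    d = pairwise_distances(cell_lines[i:i + 1], patients[subtypes == sub]).mean()
    print(f"cell line {i}: subtype {sub}, score {s:.2f}, mean distance {d:.2f}")
```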

    Towards Personalized Medicine Using Systems Biology And Machine Learning

    The rate of acquiring biological data has greatly surpassed our ability to interpret it. At the same time, we have started to understand that the evolution of many diseases, such as cancer, is the result of the interplay between the disease itself and the immune system of the host. It is now well accepted that cancer is not a single disease, but a “complex collection of distinct genetic diseases united by common hallmarks”. Understanding the differences between such disease subtypes is key not only to providing adequate treatments for known subtypes but also to identifying new ones. These unforeseen disease subtypes are one of the main reasons high-profile clinical trials fail. To identify such cases, we proposed a classification technique, based on Support Vector Machines, that is able to automatically identify samples that are dissimilar from the classes used for training. We assessed the performance of this approach both with artificial data and with data from the UCI machine learning repository. Moreover, we showed in a leukemia experiment that our method is able to identify 65% of the MLL patients when trained only on AML vs. ALL. In addition, to augment our ability to understand the disease mechanism in each subgroup, we proposed a systems biology approach able to consider all measured gene expression changes, thus eliminating the possibility that small but important gene changes (e.g. in transcription factors) are omitted from the analysis. We showed that this approach provides consistent results that do not depend on the choice of an arbitrary threshold for differential regulation. We also showed in a multiple sclerosis study that this approach obtains consistent results across multiple experiments performed by different groups on different technologies, which could not be achieved using differential expression alone. The cut-off-free impact analysis was released as part of the ROntoTools Bioconductor package.
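
    The idea of flagging samples that resemble none of the training classes can be sketched with per-class one-class SVMs, used here only as a generic SVM-based stand-in (not the authors' actual technique); all data below are hypothetical.

```python
# Flagging samples that resemble none of the training classes, using one
# one-class SVM per class (generic SVM-based stand-in; hypothetical data).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Hypothetical training data for two known classes (e.g. AML vs. ALL).
train = {0: rng.normal(0, 1, (40, 20)), 1: rng.normal(4, 1, (40, 20))}
# Hypothetical new samples: five from class 0, five from an unseen third group.
new_samples = np.vstack([rng.normal(0, 1, (5, 20)), rng.normal(-6, 1, (5, 20))])

models = {c: OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(Xc)
          for c, Xc in train.items()}

for i, x in enumerate(new_samples):
    hits = [c for c, m in models.items() if m.predict(x.reshape(1, -1))[0] == 1]
    label = hits[0] if hits else "unknown"   # no class accepts the sample
    print(f"sample {i}: {label}")
```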

    Unsupervised Algorithms for Microarray Sample Stratification

    The amount of data made available by microarrays gives researchers the opportunity to delve into the complexity of biological systems. However, the noisy and extremely high-dimensional nature of this kind of data poses significant challenges. Microarrays allow for the parallel measurement of thousands of molecular objects spanning different layers of interactions. In order to discover hidden patterns, a wide variety of analytical techniques have been proposed. Here, we describe the basic methodologies for approaching the analysis of microarray datasets, focusing on the task of (sub)group discovery.
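
    As a minimal, generic example of the kind of unsupervised sample stratification the chapter surveys (hypothetical toy data, one of many possible pipelines):

```python
# Unsupervised stratification of samples from a toy expression matrix
# (hypothetical data; one of many possible pipelines).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy "microarray": 60 samples x 2000 probes, with two shifted sample groups.
expr = np.vstack([rng.normal(0, 1, (30, 2000)), rng.normal(0.5, 1, (30, 2000))])

# Reduce the noisy, high-dimensional space before clustering the samples.
reduced = PCA(n_components=10, random_state=0).fit_transform(expr)
strata = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(np.bincount(strata))   # sizes of the discovered sample groups
```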