Search CORE

181 research outputs found

Machine Learning and Integrative Analysis of Biomedical Big Data.

Author: Choi Howard
Chung Neo Christopher
Mirza Bilal
Ping Peipei
Wang Jie
Wang Wei
Publication venue: eScholarship, University of California
Publication date: 01/01/2019
Field of study

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

Multidisciplinary Digital Publishing Institute

Ezid

Directory of Open Access Journals

eScholarship - University of California

A Machine Learning Framework for Identifying Molecular Biomarkers from Transcriptomic Cancer Data

Author: Mamun Md Abdullah Al
Publication venue: FIU Digital Commons
Publication date: 23/03/2022
Field of study

Cancer is a complex molecular process due to abnormal changes in the genome, such as mutation and copy number variation, and epigenetic aberrations such as dysregulations of long non-coding RNA (lncRNA). These abnormal changes are reflected in transcriptome by turning oncogenes on and tumor suppressor genes off, which are considered cancer biomarkers. However, transcriptomic data is high dimensional, and finding the best subset of genes (features) related to causing cancer is computationally challenging and expensive. Thus, developing a feature selection framework to discover molecular biomarkers for cancer is critical. Traditional approaches for biomarker discovery calculate the fold change for each gene, comparing expression profiles between tumor and healthy samples, thus failing to capture the combined effect of the whole gene set. Also, these approaches do not always investigate cancer-type prediction capabilities using discovered biomarkers. In this work, we proposed a machine learning-based framework to address all of the above challenges in discovering lncRNA biomarkers. First, we developed a machine learning pipeline that takes lncRNA expression profiles of cancer samples as input and outputs a small set of key lncRNAs that can accurately predict multiple cancer types. A significant innovation of our work is its ability to identify biomarkers without using healthy samples. However, this initial framework cannot identify cancer-specific lncRNAs. Second, we extended our framework to identify cancer type and subtype-specific lncRNAs. Third, we proposed to use a state-of-the-art deep learning algorithm concrete autoencoder (CAE) in an unsupervised setting, which efficiently identifies a subset of the most informative features. However, CAE does not identify reproducible features in different runs due to its stochastic nature. Thus, we proposed a multi-run CAE (mrCAE) to identify a stable set of features to address this issue. Our deep learning-based pipeline significantly extended the previous state-of-the-art feature selection techniques. Finally, we showed that discovered biomarkers are biologically relevant using literature review and prognostically significant using survival analyses. The discovered novel biomarkers could be used as a screening tool for different cancer diagnoses and as therapeutic targets

DigitalCommons@Florida International University

Benchmark study of feature selection strategies for multi-omics data

Author: Du Shangming
Hornung Roman
Li Yingxia
Mansmann Ulrich
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2022
Field of study

BACKGROUND: In the last few years, multi-omics data, that is, datasets containing different types of high-dimensional molecular variables for the same samples, have become increasingly available. To date, several comparison studies focused on feature selection methods for omics data, but to our knowledge, none compared these methods for the special case of multi-omics data. Given that these data have specific structures that differentiate them from single-omics data, it is unclear whether different feature selection strategies may be optimal for such data. In this paper, using 15 cancer multi-omics datasets we compared four filter methods, two embedded methods, and two wrapper methods with respect to their performance in the prediction of a binary outcome in several situations that may affect the prediction results. As classifiers, we used support vector machines and random forests. The methods were compared using repeated fivefold cross-validation. The accuracy, the AUC, and the Brier score served as performance metrics. RESULTS: The results suggested that, first, the chosen number of selected features affects the predictive performance for many feature selection methods but not all. Second, whether the features were selected by data type or from all data types concurrently did not considerably affect the predictive performance, but for some methods, concurrent selection took more time. Third, regardless of which performance measure was considered, the feature selection methods mRMR, the permutation importance of random forests, and the Lasso tended to outperform the other considered methods. Here, mRMR and the permutation importance of random forests already delivered strong predictive performance when considering only a few selected features. Finally, the wrapper methods were computationally much more expensive than the filter and embedded methods. CONCLUSIONS: We recommend the permutation importance of random forests and the filter method mRMR for feature selection using multi-omics data, where, however, mRMR is considerably more computationally costly. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04962-x

Open Access LMU

PubMed Central

A Novel Chaos Quasi-Oppositional based Flamingo Search Algorithm with Simulated Annealing for Feature Selection

Author: Devarakonda Nagaraju
Durgam Revathi
Publication venue: Auricle Global Society of Education and Research
Publication date: 18/08/2023
Field of study

In present situations feature selection is one of the most vital tasks in machine learning. Diminishing the feature set helps to increase the accuracy of the classifier. Due to large number of information’s present in the dataset it is a tremendous process to select the necessary features from the dataset. So, to solve this problem a novel Chaos Quasi-Oppositional based Flamingo Search Algorithm with Simulated Annealing algorithm (CQOFSA-SA) is proposed for feature selection and to select the optimal feature set from the datasets and thus it shrinks the dimension of the dataset. The FSA approach is used to choose the optimal feature subset from the dataset. In each iteration, the optimal solution of FSA is enriched by Simulated Annealing (SA). TheChaos Quasi-Oppositional based learning (CQOBL) included in the initialization of FSA improves the convergence rate and increases the searching capability of the FSA approach in choosing the optimal feature set. From the experimental outcomes, it is proved that the proposed CQOFSA-SA outperforms other feature selection approaches in terms of accuracy, optimal reduced feature set, fast convergence and fitness value

International Journal on Recent and Innovation Trends in Computing and Communication

Recommended from our members

Evolutionary computation-based feature selection for finding a stable set of features in high-dimensional data

Author: Salesi Mousaabadi S
Publication venue
Publication date: 01/09/2019
Field of study

Evolutionary Computation (EC) algorithms have proved to work well for feature selection because they are powerful search techniques and can produce multiple good solutions. However, they suﬀer from some limitations for real world applications. Firstly, ECs require high computation time as they evaluate many solutions at each iteration. Secondly, a classiﬁer is usually used as their ﬁtness function which causes the selected subset to perform well only on the utilised classiﬁer (e.g. classiﬁer-bias). Lastly, ECs, as stochastic search methods, return a diﬀerent ﬁnal subset in diﬀerent runs which poses a problem for ﬁnding a stable set of features (e.g. stability issue). To address computation time and classiﬁer-bias limitations, this thesis proposes a new two-stage selection approach called ﬁlter/ﬁlter in which two ﬁlter feature selection algorithms are combined. In the ﬁrst stage, a ranking algorithm forms a reduced dataset by selecting the most informative features from the original dataset. In the second stage, the reduced dataset is fed to a novel EC algorithm to select ﬁnal feature subset. This new EC algorithm is a Tabu search hybridised with an Asexual Genetic Algorithm called TAGA. TAGA beneﬁts from new search components and solution representation which can eﬀectively reduce computation time. To select a classiﬁer-unbiased ﬁnal subset, a statistical criterion is used as the ﬁtness function which evaluates the subset independent of any classiﬁer. Experiments show that the proposed ﬁlter/ﬁlter requires an acceptable computation time and selects more classiﬁer-unbiased features compared to the state-of-the-arts. To ﬁnd a stable set of features, a novel Generalisation Power Index (GPI) is proposed to analyse the generalisation power of ﬁnal subsets of an EC in several runs. Generalisation power refers to performance capability of a subset over wide range of classiﬁers. Computation results conﬁrm that GPI is able to ﬁnd a stable set of features which achieves near optimal accuracy when used to train various classiﬁers. To ex amine the suitability of the proposed methods for real-world applications, the ﬁlter/ﬁlter approach and GPI are integrated to select a stable set of features for METABRIC breast cancer subtype classiﬁcation problem. Experimental results show that this integration not only can address the limitations of ECs for a real-world biomedical feature selection problem but it performs better than alternatives methods

Nottingham Trent Institutional Repository (IRep)

A Systematic Evaluation of Feature Selection and Classification Algorithms Using Simulated and Real miRNA Sequencing Data

Author: Fang Shao
Feng Chen
Li Guo
Sheng Yang
Yang Zhao
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2015
Field of study

Sequencing is widely used to discover associations between microRNAs (miRNAs) and diseases. However, the negative binomial distribution (NB) and high dimensionality of data obtained using sequencing can lead to low-power results and low reproducibility. Several statistical learning algorithms have been proposed to address sequencing data, and although evaluation of these methods is essential, such studies are relatively rare. The performance of seven feature selection (FS) algorithms, including baySeq, DESeq, edgeR, the rank sum test, lasso, particle swarm optimistic decision tree, and random forest (RF), was compared by simulation under different conditions based on the difference of the mean, the dispersion parameter of the NB, and the signal to noise ratio. Real data were used to evaluate the performance of RF, logistic regression, and support vector machine. Based on the simulation and real data, we discuss the behaviour of the FS and classification algorithms. The Apriori algorithm identified frequent item sets (mir-133a, mir-133b, mir-183, mir-937, and mir-96) from among the deregulated miRNAs of six datasets from The Cancer Genomics Atlas. Taking these findings altogether and considering computational memory requirements, we propose a strategy that combines edgeR and DESeq for large sample sizes

Crossref

Directory of Open Access Journals

An integrated approach for identifying wrongly labeled samples when performing classification in microarray data

Author: Chang CQ
Hung YS
LEUNG YY
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2012
Field of study

published_or_final_versio

Directory of Open Access Journals

PubMed Central

HKU Scholars Hub

Specialized Named Entity Recognition For Breast Cancer Subtyping

Author: Hawblitzel Griffith Scheyer
Publication venue: DigitalCommons@CalPoly
Publication date: 01/06/2022
Field of study

The amount of data and analysis being published and archived in the biomedical research community is more than can feasibly be sifted through manually, which limits the information an individual or small group can synthesize and integrate into their own research. This presents an opportunity for using automated methods, including Natural Language Processing (NLP), to extract important information from text on various topics. Named Entity Recognition (NER), is one way to automate knowledge extraction of raw text. NER is defined as the task of identifying named entities from text using labels such as people, dates, locations, diseases, and proteins. There are several NLP tools that are designed for entity recognition, but rely on large established corpus for training data. Biomedical research has the potential to guide diagnostic and therapeutic decisions, yet the overwhelming density of publications acts as a barrier to getting these results into a clinical setting. An exceptional example of this is the field of breast cancer biology where over 2 million people are diagnosed worldwide every year and billions of dollars are spent on research. Breast cancer biology literature and research relies on a highly specific domain with unique language and vocabulary, and therefore requires specialized NLP tools which can generate biologically meaningful results. This thesis presents a novel annotation tool, that is optimized for quickly creating training data for spaCy pipelines as well as exploring the viability of said data for analyzing papers with automated processing. Custom pipelines trained on these annotations are shown to be able to recognize custom entities at levels comparable to large corpus based recognition

DigitalCommons@CalPoly

Systems Analytics and Integration of Big Omics Data

Author: Hardiman Gary
Publication venue: 'MDPI AG'
Publication date: 01/01/2020
Field of study

A “genotype"" is essentially an organism's full hereditary information which is obtained from its parents. A ""phenotype"" is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This “Big Data” is so large and complex that traditional data processing applications are not up to the task. Challenges arise in collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome

Directory of Open Access Books (DOAB)